[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell resolved SPARK-7896.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.4.0

> IndexOutOfBoundsException in ChainedBuffer
> ------------------------------------------
>
>                 Key: SPARK-7896
>                 URL: https://issues.apache.org/jira/browse/SPARK-7896
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Arun Ahuja
>            Assignee: Sandy Ryza
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> I've run into this on two tasks that use the same dataset.
> The dataset is a collection of strings where the most common string appears
> ~200M times and the next few appear ~50M times each.
> for this rdd: RDD[String], I can do rdd.map( x => (x, 1)).reduceByKey( _ + _)
> to get the counts (how I got the number above), but I hit the error on
> rdd.groupByKey().
> Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do
> rdd2.leftOuterJoin(rdd) without hitting this error
> {code}
> 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512
>         at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>         at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
>         at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
>         at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
>         at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
>         at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
>         at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
>         at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}