Hello all - I am attempting to run MLlib's ALS algorithm on a large test dataset of approximately 200 million records; a simplified sketch of the job setup follows.
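
For context, here is a minimal sketch of how the job is wired up. The registrator contents, input path, parsing step, and ALS hyperparameters are placeholder stand-ins for our actual code, and the ID-range check is the kind of thing I ran to rule out bad indices:

    // Simplified sketch (not our production code): Kryo registrator plus the
    // ALS training call. Paths, parsing, and hyperparameters are placeholders.
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
      }
    }

    object ALSJob {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("ALS test")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "MyRegistrator")
        val sc = new SparkContext(conf)

        // Parse "user,product,rating" lines; path and format are placeholders.
        val ratings = sc.textFile("hdfs:///path/to/ratings")
          .map { line =>
            val Array(u, p, r) = line.split(',')
            Rating(u.toInt, p.toInt, r.toDouble)
          }
          .repartition(300)

        // Rule out negative IDs or IDs at/above Int.MaxValue before training.
        val minId = ratings.map(r => math.min(r.user, r.product)).min()
        val maxId = ratings.map(r => math.max(r.user, r.product)).max()
        require(minId >= 0 && maxId < Int.MaxValue,
          s"IDs out of range: [$minId, $maxId]")

        val model = ALS.train(ratings, 20, 10, 0.01)
      }
    }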
I have resolved a few issues I've had with garbage collection, Kryo serialization, and memory usage, but I have not been able to get past the exception below:

    java.lang.ArrayIndexOutOfBoundsException: 6106
        org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
        scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
        org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
        org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
        org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
        org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
        org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
        org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
        scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

I do not have any negative indices or indices that exceed Int.MaxValue. I have partitioned the input data into 300 partitions, and my Spark config is below:

    .set("spark.executor.memory", "14g")
    .set("spark.storage.memoryFraction", "0.8")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "MyRegistrator")
    .set("spark.core.connection.ack.wait.timeout", "600")
    .set("spark.akka.frameSize", "50")
    .set("spark.yarn.executor.memoryOverhead", "1024")

Does anyone have any suggestions as to why I'm seeing the above error, or how to work around it? It may be possible to upgrade to the latest version of Spark, but the mechanism for doing so in our environment isn't obvious yet.

-Ilya Ganelin