Hi Pat,

I've done some further digging, and it looks like the problem occurs when the input files are split into parts. The input to the item-similarity job is the output of a Spark job, and it ends up in about 2000 parts (on the Hadoop file system). I have reproduced the error locally using a small subset of the rows.
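For reference, a split like the failing one can be reproduced with something along these lines (the file names and the stand-in data are placeholders, and this assumes GNU coreutils split with its `-n l/N` line-chunk mode):

```shell
# Generate stand-in user-id,item-id tuple data (placeholder for the real input)
seq 1 1000 | awk '{print $1 "," $1 * 7}' > all_rows.csv

# Split into 100 parts without breaking any line in the middle
# (GNU split's -n l/N mode); the "parts" directory can then be
# passed to the job as --input
mkdir -p parts
split -n l/100 -d all_rows.csv parts/part-
```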
This is a snippet of the file I am using:

...
5138353282348067470,1891081885
4417954190713934181,1828065687
133682221673920382,1454844406
133682221673920382,1129053737
133682221673920382,548627241
133682221673920382,1048452021
8547417492653230933,1121310481
7693904559640861382,1333374361
7204049418352603234,606209305
139299176617553863,467181330
...

When I run item-similarity against a single input file containing all the rows, the job succeeds without error. When I break the input file into 100 parts and use the directory containing them as input, I get the 'Index outside allowable range' exception. Here are the input files I used, tarred and gzipped:

https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz

There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.

This is the exception that I see on the 100-part run:

15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
        at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
        at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
        at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
        at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
        at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
        at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
        at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
        at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
        at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I tried splitting the file into 10, 20 and 50 parts, and in those cases the job completed.

Also, should the resulting similarity matrix be the same whether the input is split up or not? I passed in the same random seed for the Spark job, but the matrices were different.

Thanks,

Michael

On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> The input must be tuples (if not using a filter), so the CLI you have expects
> user and item ids that are
>
> user-id1,item-id1
> user-id500,item-id3000
> …
>
> The ids must be tokenized because it doesn’t use a full csv parser, only
> lines of delimited text.
>
> If this doesn’t help, can you supply a snippet of the input?
>
>
> On Apr 2, 2015, at 10:39 AM, Michael Kelly <mich...@onespot.com> wrote:
>
> Hi all,
>
> I'm running the spark-itemsimilarity job from the cli on an AWS EMR
> cluster, and I'm running into an exception.
>
> The input file format is
> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3...
>
> There is only one row per user, and a total of 97,000 rows.
> I also tried input with one row per UserId/ItemId pair, which had
> about 250,000 rows, but I also saw a similar exception; this time the
> out-of-bounds index was around 110,000.
>
> The input is stored in hdfs, and this is the command I used to start the job:
>
> mahout spark-itemsimilarity --input userItems --output output --master yarn-client
>
> Any idea what the problem might be?
>
> Thanks,
>
> Michael
>
> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0 (TID 7631, ip-XX.XX.ec2.internal):
> org.apache.mahout.math.IndexException: Index 22050 is outside allowable range of [0,21997)
> org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
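P.S. In case it is useful for comparing the two data sets: the row and unique-id counts I quoted above can be checked on any user-id,item-id tuple file with standard tools. A sketch (the input file here is a tiny placeholder, not the real data):

```shell
# Placeholder tuple file: three users, two items, five rows
printf 'u1,i1\nu1,i2\nu2,i1\nu3,i2\nu3,i1\n' > input.csv

wc -l < input.csv                        # total rows       -> 5
cut -d, -f1 input.csv | sort -u | wc -l  # unique user ids  -> 3
cut -d, -f2 input.csv | sort -u | wc -l  # unique item ids  -> 2
```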