Hi Pat,

I've done some further digging and it looks like the problem occurs
when the input files are split into parts. The input to the
item-similarity job is the output of a Spark job and it ends up in
about 2000 parts (on the Hadoop file system). I have reproduced the
error locally using a small subset of the rows.

This is a snippet of the file I am using -

...

5138353282348067470,1891081885
4417954190713934181,1828065687
133682221673920382,1454844406
133682221673920382,1129053737
133682221673920382,548627241
133682221673920382,1048452021
8547417492653230933,1121310481
7693904559640861382,1333374361
7204049418352603234,606209305
139299176617553863,467181330
...


When I run item-similarity against a single input file that contains
all the rows, the job succeeds without error.

When I break the input file up into 100 parts and use the directory
containing them as input, I get the 'Index outside allowable range'
exception.
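
(In case it helps to reproduce, a 100-way split of the single file can
be produced with something like the commands below. This is just a
sketch using GNU split's round-robin mode; "user_items.csv" and
"parts/" are placeholder names, not the exact files I used.)

# user_items.csv and parts/ are placeholder names
mkdir parts
split -n r/100 -d -a 3 user_items.csv parts/part-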

Here are the input files that I used, tarred and gzipped -

https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz
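
(The failing case can be re-run locally with something along these
lines, using the same CLI options as in my earlier mail. The input and
output paths are placeholders for wherever the tarball is extracted,
and local[4] is just an example Spark master.)

# placeholder paths; local[4] is only an example master setting
mahout spark-itemsimilarity --input failing_split_into_100_parts \
  --output /tmp/itemsim-output --master local[4]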

There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.
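
(For reference, the counts above come from quick checks against the
combined single file; "user_items.csv" is again a placeholder name.)

wc -l user_items.csv                          # total rows
cut -d, -f1 user_items.csv | sort -u | wc -l  # unique userIds
cut -d, -f2 user_items.csv | sort -u | wc -l  # unique itemIds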

This is the exception that I see on the 100-part run -

15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I tried splitting the file into 10, 20, and 50 parts, and in each case
the job completed successfully.

Also, should the resulting similarity matrix be the same whether the
input is split up or not? I passed the same random seed to the Spark
job, but the matrices were different.

Thanks,

Michael



On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> The input must be tuples (if not using a filter) so the CLI you have expects 
> user and item ids that are
>
> user-id1,item-id1
> user-id500,item-id3000
> …
>
> The ids must be tokenized because it doesn’t use a full csv parser, only 
> lines of delimited text.
>
> If this doesn’t help can you supply a snippet of the input
>
>
> On Apr 2, 2015, at 10:39 AM, Michael Kelly <mich...@onespot.com> wrote:
>
> Hi all,
>
> I'm running the spark-itemsimilarity job from the cli on an AWS emr
> cluster, and I'm running into an exception.
>
> The input file format is
> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3......
>
> There is only one row per user, and a total of 97,000 rows.
>
> I also tried input with one row per UserId/ItemId pair, which had
> about 250,000 rows, but I also saw a similar exception, this time the
> out of bounds index was around 110,000.
>
> The input is stored in hdfs and this is the command I used to start the job -
>
> mahout spark-itemsimilarity --input userItems --output output --master
> yarn-client
>
> Any idea what the problem might be?
>
> Thanks,
>
> Michael
>
>
>
> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0 (TID 7631, ip-XX.XX.ec2.internal):
> org.apache.mahout.math.IndexException: Index 22050 is outside allowable range of [0,21997)
>
>        org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
>        org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
>        org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
>        org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
>        scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>        scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>        scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
>        scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
>        scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>        scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>        scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
>        scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
>        scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
>        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>        org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
>        org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>        org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
>        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>        org.apache.spark.scheduler.Task.run(Task.scala:54)
>        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        java.lang.Thread.run(Thread.java:745)
>
