Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-14 Thread Pat Ferrel
BTW if you are getting a failure that you don’t understand, can you share it? On Apr 14, 2015, at 11:55 AM, Pat Ferrel wrote: > > On Apr 14, 2015, at 11:44 AM, Michael Kelly wrote: > > Hi Pat, > > Had some work to do on the input side for the item-similarity, so just > getting around to ru

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-14 Thread Pat Ferrel
> > On Apr 14, 2015, at 11:44 AM, Michael Kelly wrote: > > Hi Pat, > > Had some work to do on the input side for the item-similarity, so just > getting around to running it on a cluster now. > I'm building from source from the 0.10 release against version 2.4.0 hadoop. > The version of spark

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-14 Thread Michael Kelly
Hi Pat, Had some work to do on the input side for the item-similarity, so just getting around to running it on a cluster now. I'm building from source from the 0.10 release against Hadoop 2.4.0. The version of Spark we are using is 1.1; we use the EMR install-spark script to install Spark.

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-06 Thread Pat Ferrel
Michael, There is a fix in the latest source on Github. If you’d like to try it: mvn clean install -DskipTests # there is a failing test at present, so skip them. Add your version of Hadoop if needed; consult here: http://mahout.apache.org/developers/buildingmahout.html From my side, spark-

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-05 Thread Pat Ferrel
Michael, The problem is in partitioning the data; if you start with one file the partitions are created fine. With a bunch of small files, the optimizer trips up by not catching a range of size=0. This will be fixed in 0.10.1, but for now (0.10.0) you can: 1) concat the files into one, 2) I can r
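The concatenation workaround (option 1) can be done with standard tools before running the job. A minimal local sketch, with illustrative file names and contents:

```shell
# Merge many small part files into a single input file before running
# spark-itemsimilarity (workaround for the size=0 range bug in 0.10.0).
mkdir -p demo && cd demo
printf 'u1,item1\nu1,item2\n' > part-00000
printf 'u2,item1\n' > part-00001
cat part-* > merged-input.csv    # one file, so partitioning is well-formed
wc -l < merged-input.csv         # 3 lines total
cd ..
```

On HDFS the same idea is `hadoop fs -getmerge <src-dir> <local-file>` followed by a re-upload of the merged file.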

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Dmitriy Lyubimov
no, i don't think there's a workaround. it needs a fix; however, in the public version there are many more fixes needed, so I think this part will be refactored completely in 0.10.1 On Fri, Apr 3, 2015 at 12:38 PM, Pat Ferrel wrote: > OK, it was. Is there a workaround I can try? > > > On Apr 3, 2015,

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
nrow and ncol that is On Apr 3, 2015, at 12:45 PM, Pat Ferrel wrote: Could I leave them unspecified, would that work around the problem? On Apr 3, 2015, at 12:38 PM, Dmitriy Lyubimov wrote: Ah. yes i believe it is a bug in non-slim A'A similar to one I fixed for AB' some time ago. It makes er

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
Could I leave them unspecified, would that work around the problem? On Apr 3, 2015, at 12:38 PM, Dmitriy Lyubimov wrote: Ah. yes i believe it is a bug in non-slim A'A similar to one I fixed for AB' some time ago. It makes an error in computing parallelism and split ranges of the final product. On

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
OK, it was. Is there a workaround I can try? On Apr 3, 2015, at 12:22 PM, Dmitriy Lyubimov wrote: Although... i am not aware of one in A'A could be faulty vector length in a matrix if matrix was created by drmWrap with explicit specification of ncol On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Ly

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Dmitriy Lyubimov
Ah. yes i believe it is a bug in non-slim A'A similar to one I fixed for AB' some time ago. It makes an error in computing parallelism and split ranges of the final product. On Fri, Apr 3, 2015 at 12:22 PM, Dmitriy Lyubimov wrote: > Although... i am not aware of one in A'A > > could be faulty vecto

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Dmitriy Lyubimov
Although... i am not aware of one in A'A could be faulty vector length in a matrix if matrix was created by drmWrap with explicit specification of ncol On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Lyubimov wrote: > it's a bug. There's a number of similar ones in operator A'B. > > On Fri, Apr 3, 20

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Dmitriy Lyubimov
it's a bug. There's a number of similar ones in operator A'B. On Fri, Apr 3, 2015 at 6:23 AM, Michael Kelly wrote: > Hi Pat, > > I've done some further digging and it looks like the problem is > occurring when the input files are split up into parts. The input > to the item-similarity matrix

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
good, looking at it now On Apr 3, 2015, at 11:36 AM, Michael Kelly wrote: Yes, I updated recently, when running on a cluster I checked out the latest master of mahout, and locally I've probably updated in the last week or so. On Fri, Apr 3, 2015 at 7:04 PM, Pat Ferrel wrote: > OK, got it to r

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Michael Kelly
Yes, I updated recently, when running on a cluster I checked out the latest master of mahout, and locally I've probably updated in the last week or so. On Fri, Apr 3, 2015 at 7:04 PM, Pat Ferrel wrote: > OK, got it to reproduce. This is not what I expected. Its too many columns in > a vector-hmm

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
OK, got it to reproduce. This is not what I expected. It's too many columns in a vector, hmm. Found the other user’s issue, which was null input, not a bug. BTW when did you update Mahout? Just put in the ability to point to dirs, so I assume recently? On Apr 3, 2015, at 9:08 AM, Pat Ferrel wrote

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
Yeah, that’s exactly what the other user is doing. This should be a common architecture in the future. I’m already looking at the other so will add this too. Thanks a bunch for the data. On Apr 3, 2015, at 8:58 AM, Michael Kelly wrote: Yes, we are using a spark streaming job to create the inp

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Michael Kelly
Yes, we are using a spark streaming job to create the input, and I wasn't repartitioning it, so there were a lot of parts. I'm testing it out now with repartitioning to see if that works. This is just a single interaction type. Thanks again, Michael On Fri, Apr 3, 2015 at 4:52 PM, Pat Ferrel wr

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Pat Ferrel
This sounds like a bug. Thanks for the sample input and narrowing it down. I’ll look at it today. I got a similar question from another user with a lot of part files. A Spark streaming job creates the part files. Is that what you are doing? Is this a single interaction type? On Apr 3, 2015,

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-03 Thread Michael Kelly
Hi Pat, I've done some further digging and it looks like the problem is occurring when the input files are split up into parts. The input to the item-similarity matrix is the output from a Spark job and it ends up in about 2000 parts (on the hadoop file system). I have reproduced the error loca
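To reproduce the many-part-files condition locally (as in the ~2000-part case above), a single input can be split into one-line part files; file names and contents here are illustrative:

```shell
# Split one input file into many small part files to mimic the output
# layout of a Spark job with many partitions.
mkdir -p parts && printf 'u1,i1\nu2,i2\nu3,i3\n' > all.csv
split -l 1 all.csv parts/part-   # one line per part file
ls parts | wc -l                 # 3 part files
```

Pointing the job at the `parts` directory should then trigger the same size=0 range condition that a multi-part HDFS directory does.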

Re: spark-itemsimilarity IndexException - outside allowable range

2015-04-02 Thread Pat Ferrel
The input must be tuples (if not using a filter) so the CLI you have expects user and item ids that are user-id1,item-id1 user-id500,item-id3000 … The ids must be tokenized because it doesn’t use a full csv parser, only lines of delimited text. If this doesn’t help can you supply a snippet of
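As described above, the default (no filter) input is one comma-delimited user-id,item-id pair per line. A tiny sample file, plus a hedged invocation sketch (the output path and master URL are illustrative; check `mahout spark-itemsimilarity --help` on your version for exact flags):

```shell
# One interaction tuple per line: user-id,item-id
cat > interactions.csv <<'EOF'
user-id1,item-id1
user-id500,item-id3000
user-id1,item-id3000
EOF
wc -l < interactions.csv   # 3 interaction lines

# Illustrative invocation (not run here):
# mahout spark-itemsimilarity -i interactions.csv -o similarity-out --master local
```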

spark-itemsimilarity IndexException - outside allowable range

2015-04-02 Thread Michael Kelly
Hi all, I'm running the spark-itemsimilarity job from the cli on an AWS emr cluster, and I'm running into an exception. The input file format is UserId ItemId1 ItemId2 ItemId3 … There is only one row per user, and a total of 97,000 rows. I also tried input with one row per UserId/ItemId pair,
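The one-row-per-user layout above is not what the job expects without a filter; one way to expand it into user,item pairs is a small awk pass (the delimiter is assumed to be a tab here; adjust `-F` to the actual separator):

```shell
# Expand "UserId<TAB>ItemId1<TAB>ItemId2..." rows into one
# "UserId,ItemId" pair per line, as spark-itemsimilarity expects.
printf 'u1\ti1\ti2\ti3\nu2\ti1\n' > wide.tsv
awk -F'\t' '{ for (i = 2; i <= NF; i++) print $1 "," $i }' wide.tsv > pairs.csv
cat pairs.csv
# u1,i1
# u1,i2
# u1,i3
# u2,i1
```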