Processing 50 million files for LDA

2013-06-04 Thread nishant rathore
Hi, we are running LDA on 50 million files. Each file is not more than 5 MB. Each file represents the content of one user, and the files keep being updated as we receive new information about the user. Currently we store all these files on EC2, and when we need to run LDA, we transfer those files to S3
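
The usual way around the small-files problem is to pack the per-user documents into a handful of SequenceFiles before vectorizing (which is essentially what Mahout's seqdirectory does). A minimal sketch, assuming Hadoop is on the classpath; the input directory and output path below are placeholders, not anything from the original mail:

    // Pack many small per-user text files into one SequenceFile so the LDA
    // pipeline reads a few large files instead of millions of small objects.
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackUserFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/user-docs.seq");              // hypothetical output
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          for (File f : new File("/data/users").listFiles()) {  // hypothetical input dir
            String content =
                new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
            writer.append(new Text(f.getName()), new Text(content)); // key = user id, value = doc
          }
        } finally {
          writer.close();
        }
      }
    }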

Re: Processing 50 million files for LDA

2013-06-04 Thread Ted Dunning
Nishant, it is hard to advise on detailed trade-offs for your case, but I am pretty sure there are other options than S3, which is, as you say, very slow in terms of latency when transferring lots of small objects. One alternative, for instance, would be to use a long-lived MapR cluster to

Generating vectors from a single txt file using Java:KMeans clustering

2013-06-04 Thread Nirmal Kumar
Hi, I have Twitter data in a single txt file, like this: @VancityBeerGuy - RT @BCBerrie: well @VancityBeerGuy you know what they say about guys with #smallenfreuden right? Hahaha Created At:Mon Jun 03 07:18:46 IST 2013 @IanSylves - RT @PTorgo91: @otterN9NE you're the best thing to happen to the
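
Since the k-means job only needs a sequence file of VectorWritables, one option is to vectorize the tweets directly. A rough sketch, assuming Mahout's math classes and Hadoop are on the classpath; the file names, the whitespace tokenizer, and the in-memory dictionary are simplifications for illustration, not how seq2sparse does it:

    // Turn each line (tweet) of a single text file into a term-frequency vector
    // and write it as <Text, VectorWritable>, the format the k-means job reads.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class TweetsToVectors {
      public static void main(String[] args) throws Exception {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("tweets-vectors/part-r-00000");          // hypothetical
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        BufferedReader reader = new BufferedReader(new FileReader("tweets.txt")); // hypothetical
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
          Vector tf = new RandomAccessSparseVector(Integer.MAX_VALUE);
          for (String token : line.toLowerCase().split("\\s+")) {    // naive tokenizer
            if (!dictionary.containsKey(token)) {
              dictionary.put(token, dictionary.size());
            }
            int index = dictionary.get(token);
            tf.set(index, tf.get(index) + 1);                        // term frequency
          }
          NamedVector named = new NamedVector(tf, "tweet-" + lineNo++);
          writer.append(new Text(named.getName()), new VectorWritable(named));
        }
        reader.close();
        writer.close();
      }
    }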

Re: bottom up clustering

2013-06-04 Thread Dan Filimon
Hi Rajesh, Streaming k-means clusters Vectors (that are in <*, VectorWritable> sequence files) and outputs <IntWritable, CentroidWritable> sequence files. A Centroid is the same as a Vector with the addition of an index and a weight. You can getVector() a Centroid to get its Vector. On Mon, Jun
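
For example, reading that output back and unwrapping the centroids might look roughly like this; the output path is made up, and the CentroidWritable accessor name is an assumption worth checking against the version in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
    import org.apache.mahout.math.Centroid;
    import org.apache.mahout.math.Vector;

    public class ReadCentroids {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("streaming-kmeans-out/part-r-00000");   // hypothetical
        SequenceFile.Reader reader =
            new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        IntWritable key = new IntWritable();
        CentroidWritable value = new CentroidWritable();
        while (reader.next(key, value)) {
          Centroid centroid = value.getCentroid();     // assumed accessor name
          Vector v = centroid.getVector();             // the underlying Vector
          System.out.println(key.get() + " weight=" + centroid.getWeight()
              + " norm=" + v.norm(2));
        }
        reader.close();
      }
    }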

Controlling output locations

2013-06-04 Thread Pat Ferrel
Am I losing my mind, or did the --outputPath option get removed from MatrixMultiplicationJob recently? It looks like the output now goes into a 'productWith-xxx' directory, so I'll have to search for the most recent dir of that name? And why isn't there a --outputPath option for transpose? I have to search for the
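
One workaround, when driving this from Java rather than the CLI, is the DistributedRowMatrix API, where the result's location can at least be read back programmatically. A sketch under the assumption of the 0.7-era API, with made-up paths and dimensions; note that times() launches MatrixMultiplicationJob underneath, so its exact A'B-versus-AB semantics should be checked against the version in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.DistributedRowMatrix;

    public class MultiplyAndLocate {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical DRM inputs: 1000 rows each, 100 and 50 columns.
        DistributedRowMatrix a = new DistributedRowMatrix(
            new Path("drm/A"), new Path("drm/tmp"), 1000, 100);
        DistributedRowMatrix b = new DistributedRowMatrix(
            new Path("drm/B"), new Path("drm/tmp"), 1000, 50);
        a.setConf(conf);
        b.setConf(conf);
        DistributedRowMatrix product = a.times(b);   // runs the multiplication job
        // Instead of hunting for a productWith-xxx dir, ask the result where it lives.
        System.out.println("product rows written to: " + product.getRowPath());
        DistributedRowMatrix at = a.transpose();
        System.out.println("transpose rows written to: " + at.getRowPath());
      }
    }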

Re: Blending initial recommendations for cross recommendation

2013-06-04 Thread Pat Ferrel
Err, I think there is a mistake below. You want to do [B'A]H_v, where H_v = A', the user's history of views as column vectors. At least that is what my code was doing. On another subject the idea of truncating the user history vectors came up in another thread. In some research we did using
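
Spelled out with the notation used in this thread (assuming rows of A and B are users, A = view history, B = purchase history), the correction reads:

    \[
      R_v \;=\; [\,B^{\top} A\,]\, H_v, \qquad H_v = A^{\top},
    \]

so each column of H_v is one user's view history, while by the same pattern the single-action purchase recommender is R_p = [B'B] H_p with purchase histories as the query columns.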

Re: Blending initial recommendations for cross recommendation

2013-06-04 Thread Dominik Hübner
I am using [B'A]H_v in my code as well. Furthermore, the basics of my implementation (a more general-purpose reimplementation of the item-based recommender) are done, and I will soon move on to evaluating the approach on my dataset. On the one hand I'm quite lucky, since each of the

Re: FP Growth

2013-06-04 Thread Grant Ingersoll
On Jun 2, 2013, at 10:42 AM, Sebastian Schelter s...@apache.org wrote: I don't think unmaintained code should stay in our codebase. +1 This will only create frustration amongst our users, as they will not get questions answered and bugs fixed. It would also be an obstacle for a 1.0

Re: Blending initial recommendations for cross recommendation

2013-06-04 Thread Pat Ferrel
You don't want to decay the training values, only the query values. The training values indicate user taste similarity and that decays very slowly if at all. The truncation I was talking about is in the query vectors. And even with that I'd measure its effect. If you do this with [B'B]H_p
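
As an illustration of decaying only the query side (the method name, the half-life constant, and the age lookup are all hypothetical), a sketch against the Mahout 0.7-era Vector API:

    import java.util.Iterator;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class QueryDecay {

      /** Returns a copy of the user's history (query) vector with each interaction
          down-weighted by its age; the training matrix [B'A] is left untouched. */
      static Vector decayQueryVector(Vector history, Map<Integer, Double> itemAgeDays,
                                     double halfLifeDays) {
        Vector decayed = new RandomAccessSparseVector(history.size());
        Iterator<Vector.Element> it = history.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          Double age = itemAgeDays.get(e.index());
          double decay = (age == null) ? 1.0 : Math.pow(0.5, age / halfLifeDays);
          decayed.set(e.index(), e.get() * decay);   // exponential half-life decay
        }
        return decayed;
      }
    }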

Dictionary file format in Lucene-Mahout integration

2013-06-04 Thread James Forth
Hello, I’m wondering if anyone can help with a question about the dictionary format in the lucene.vector → cvb integration. I’ve previously used the pathway from text files: seqdirectory → seq2sparse → rowid → cvb, and it works fine. The dictionary created by seq2sparse is in sequence file format, and

LeaseExpiredException in ABt job of SSVD

2013-06-04 Thread Ahmed Elgohary
Hi, I am trying to run SSVD on Amazon EMR, but I am getting a LeaseExpiredException during the execution of the ABt job. I posted about my problem on the AWS forum (here: http://forums.aws.amazon.com/thread.jspa?threadID=126294&tstart=0), as I first thought it could be a problem with EMR. Now, the

Re: LeaseExpiredException in ABt job of SSVD

2013-06-04 Thread Suneel Marthi
1. It would be helpful if you could post the actual stack trace for this exception. 2. Could you post the command you are using to execute ssvd? Are you working off of trunk? If not, which Mahout version? 3. Are you specifying a tempPath when running ssvd? SSVD is a series of jobs. It could be that one of the

Re: Dictionary file format in Lucene-Mahout integration

2013-06-04 Thread Suneel Marthi
Never used lucene.vector myself, just thinking aloud here. Assuming that dict.out is in text format, you could use 'seqdirectory' to convert the dictionary to sequence file format. This can then be fed into cvb.
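
If seqdirectory turns out not to produce the <Text, IntWritable> (term → integer id) layout that seq2sparse's dictionary uses, a small conversion by hand is also possible. A sketch under the assumption that dict.out is plain text with one term per line in index order — worth verifying against the actual file — with a made-up output name:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TextDictToSeq {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Write the <term, index> pairs in the layout seq2sparse's dictionary uses.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("dictionary.file-0"), Text.class, IntWritable.class);
        BufferedReader in = new BufferedReader(new FileReader("dict.out"));
        String term;
        int index = 0;
        try {
          while ((term = in.readLine()) != null) {
            writer.append(new Text(term.trim()), new IntWritable(index++));
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }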

Re: LeaseExpiredException in ABt job of SSVD

2013-06-04 Thread Ahmed Elgohary
Thanks for your reply. I am using Mahout 0.7. I am calling the SSVDSolver.run() method using the code I listed in my previous email (please let me know if something is not clear). The run method of SSVDSolver does not ask for a tempPath; I set a tempPath only for the DistributedRowMatrix I am
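
For reference, invoking the solver with an explicit output root (so that every sub-job, including ABt, writes to a location you control — e.g. HDFS on the EMR cluster rather than S3) might look roughly like this. The constructor signature, the parameter values, and the paths below are assumptions to check against the 0.7 sources, not a verified recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

    public class RunSsvd {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path[] input = { new Path("hdfs:///user/hadoop/A") };     // hypothetical DRM input
        Path output = new Path("hdfs:///user/hadoop/ssvd-out");   // hypothetical output root
        int ablockRows = 30000;   // rows per A block
        int k = 100;              // requested rank
        int p = 15;               // oversampling
        int reduceTasks = 10;
        SSVDSolver solver =
            new SSVDSolver(conf, input, output, ablockRows, k, p, reduceTasks);
        solver.setQ(1);              // power iterations
        solver.setOverwrite(true);
        solver.run();                // Q, Bt, ABt, U, V all land under the output root
      }
    }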