Hi,
we are running LDA on 50 million files. Each file is no more than 5 MB
and represents the content of a user. The files keep getting updated as we
receive new information about the user.
Currently we store all these files on EC2, and when we need to run LDA, we
transfer those files to S3
Nishant,
It is hard to advise on detailed trade-offs for your case, but I am pretty
sure that there are other options besides S3, which is, as you say, very slow
in terms of latency when transferring lots of small objects.
One alternative, for instance, would be to use a long-lived MapR cluster to
Hi,
I am having twitter data in a single txt file as:
@VancityBeerGuy - RT @BCBerrie: well @VancityBeerGuy you know what they say
about guys with #smallenfreuden right? Hahaha Created At:Mon Jun 03 07:18:46
IST 2013
@IanSylves - RT @PTorgo91: @otterN9NE you're the best thing to happen to the
Hi Rajesh,
Streaming k-means clusters Vectors (that are in *, VectorWritable
sequence files) and outputs IntWritable, CentroidWritable sequence files.
A Centroid is the same as a Vector with the addition of an index and a
weight. You can getVector() a Centroid to get its Vector.
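In case the Centroid/Vector relationship is unclear, here is a minimal conceptual sketch in Python (not Mahout's Java API; the class and method names just mirror the description above):

```python
# Conceptual sketch only: a Centroid is a Vector plus an index and a
# weight, as described above. This is NOT Mahout's Java implementation.

class Centroid:
    def __init__(self, index, vector, weight=1.0):
        self.index = index      # cluster index
        self.vector = vector    # the underlying vector (a plain list here)
        self.weight = weight    # total weight of points absorbed

    def get_vector(self):
        # Analogous to getVector() on a Mahout Centroid: drops the
        # index/weight and returns just the coordinates.
        return self.vector

c = Centroid(index=3, vector=[0.5, 1.5, -2.0], weight=42.0)
print(c.get_vector())  # -> [0.5, 1.5, -2.0]
```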
On Mon, Jun
Am I losing my mind, or did the --outputPath option get removed from the
MatrixMultiplicationJob recently? It looks like the output now lands in
'productWith-xxx', so I'll have to search for the most recent dir of that
name? And why isn't there an --outputPath option to transpose? I have to
search for the
Err, I think there is a mistake below.
You want to do [B'A]H_v, where H_v = A', the user's history of views as column
vectors. At least that is what my code was doing.
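For concreteness, the [B'A]H_v product can be sketched with NumPy on toy data (an illustration of the algebra only, not anyone's actual implementation; the matrices and their shapes are assumptions):

```python
import numpy as np

# Toy sketch of cross-action recommendations r = [B'A] H_v, where
# A (users x viewed-items) holds views, B (users x purchased-items)
# holds purchases, and H_v = A' stores each user's view history as a
# column vector. All data here is illustrative.

A = np.array([[1, 0, 1],    # user 0 viewed items 0 and 2
              [0, 1, 1]])   # user 1 viewed items 1 and 2
B = np.array([[1, 0],       # user 0 bought item 0
              [0, 1]])      # user 1 bought item 1

H_v = A.T                   # view histories as column vectors
R = B.T @ A @ H_v           # (purchase-items x users) score matrix

print(R[:, 0])              # purchase scores for user 0 -> [2 1]
```

Here R[:, 0] scores purchase-item 0 highest for user 0, since that user's views co-occur most with purchases of item 0.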
On another subject the idea of truncating the user history vectors came up in
another thread. In some research we did using
I am using [B'A]H_v in my code as well.
Furthermore, the basics of my implementation (a more general-purpose
reimplementation of the item-based recommender) are done, and I will soon
move on to evaluating the approach with my dataset. On the one hand I'm quite
lucky since each of the
On Jun 2, 2013, at 10:42 AM, Sebastian Schelter s...@apache.org wrote:
I don't think unmaintained code should stay in our codebase.
+1
This will
only create frustration amongst our users, as they will not get
questions answered and bugs fixed. It would also be an obstacle for a
1.0
You don't want to decay the training values, only the query values. The
training values indicate user taste similarity and that decays very slowly if
at all. The truncation I was talking about is in the query vectors. And even
with that I'd measure its effect.
If you do this with [B'B]H_p
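The point above (truncate only the query vector, keep the full training data) can be sketched like this; the cutoff, the matrices, and the function name are illustrative assumptions:

```python
import numpy as np

def truncated_query(history_items, num_items, max_recent):
    """Build a query vector from only the user's most recent interactions.

    history_items: item ids in time order (oldest first). The training
    data stays untouched; only this query vector is truncated.
    """
    q = np.zeros(num_items)
    q[history_items[-max_recent:]] = 1.0
    return q

# Cooccurrence matrix B'B built from the FULL (undecayed) training data.
B = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
cooc = B.T @ B

# Query with only the user's two most recent items (items 2 and 1 here).
h_p = truncated_query([0, 2, 1], num_items=3, max_recent=2)
scores = cooc @ h_p
```

As the thread suggests, whether truncation actually helps should be measured rather than assumed.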
Hello,
I'm wondering if anyone can help with a question about the dictionary format
in the lucene.vector-to-cvb integration. I've previously used the pathway from
text files: seqdirectory -> seq2sparse -> rowid -> cvb, and it works fine. The
dictionary created by seq2sparse is in sequence file format, and
Hi,
I am trying to run ssvd on Amazon EMR, but I am getting a
LeaseExpiredException during the execution of the ABt job. I posted about
my problem to the AWS forum
(here: http://forums.aws.amazon.com/thread.jspa?threadID=126294&tstart=0),
as I first thought it could be a problem with EMR. Now, the
1. It would be helpful if you could post the actual stack trace for this
exception.
2. Could you post the command you are using to execute ssvd? Are you working
off of trunk? If not, which Mahout version?
3. Are you specifying a tempPath when running ssvd? SSVD is a series of jobs.
It could be that one of the
Never used lucene.vector myself, just thinking out loud here. Assuming that
dict.out is in text format,
you could use 'seqdirectory' to convert the dictionary to sequence file
format. This can then be fed into cvb.
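To illustrate what such a conversion has to preserve, here is a small sketch that parses a text dictionary into a term-to-id map. The 'term<TAB>id' line format is an assumption about dict.out, and this does not produce the Hadoop SequenceFile (Text keys, IntWritable values) that cvb actually needs:

```python
# Illustrative only: parse a text dictionary of "term<TAB>id" lines into
# a Python dict. The line format is an assumption about dict.out; the
# real conversion must emit a Hadoop SequenceFile, which this does not.
import io

def parse_text_dictionary(fileobj):
    dictionary = {}
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        term, term_id = line.rsplit("\t", 1)
        dictionary[term] = int(term_id)
    return dictionary

sample = io.StringIO("hello\t0\nworld\t1\n")
print(parse_text_dictionary(sample))  # -> {'hello': 0, 'world': 1}
```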
From: James Forth jjamesfo...@yahoo.com
To:
Thanks for your reply.
I am using Mahout 0.7. I am calling the SSVDSolver.run() method using the
code I listed in my previous email (please let me know if something is not
clear). The run method of SSVDSolver does not ask for a tempPath. I set a
tempPath only for the DistributedRowMatrix I am