great. at least i know what's wrong :) will check whether Cloudera supports Mahout 0.8.
meanwhile we'll drop LDA and retry our first approach (k-means).

thanks everyone!

________________________________
From: Suneel Marthi <suneel_mar...@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentrop...@yahoo.co.uk>
Sent: Wednesday, 31 July 2013 17:07
Subject: Re: Latent Dirichlet Allocatio (cvb)

CVB was added to cluster-reuters.sh in 0.8; you wouldn't see it in 0.7. I suggest that you work off of 0.8.

________________________________
From: Marco <zentrop...@yahoo.co.uk>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_mar...@yahoo.com>
Sent: Wednesday, July 31, 2013 11:05 AM
Subject: Re: Latent Dirichlet Allocatio (cvb)

already looked there. no cvb example or vectordump :(

________________________________
From: Suneel Marthi <suneel_mar...@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentrop...@yahoo.co.uk>
Sent: Wednesday, 31 July 2013 16:55
Subject: Re: Latent Dirichlet Allocatio (cvb)

@Marco, look at examples/bin/cluster-reuters.sh for reference on how to run cvb (or any other clustering algorithm in Mahout) and also on how to invoke vectordump with the option flags.

________________________________
From: Jake Mannix <jake.man...@gmail.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentrop...@yahoo.co.uk>
Sent: Wednesday, July 31, 2013 10:51 AM
Subject: Re: Latent Dirichlet Allocatio (cvb)

On Wed, Jul 31, 2013 at 7:44 AM, Marco <zentrop...@yahoo.co.uk> wrote:

> ok. i'll re run it without that nt (which i supposed was NOT optional).

Well, it's not optional if you don't supply a dictionary (which is optional) - one of the two is necessary, or else the system doesn't know how big to make the model.

> meanwhile i've re-run it on a smaller dataset and though it ran
> successfully (and faster!) when i run vectordump i always get a heap space
> issue even though we've updated MAHOUT_HEAPSIZE to 10000m

When you use vectordump, what flags are you giving it?
There may be a bug here. Also, what version of Mahout are you using?

> ________________________________
> From: Jake Mannix <jake.man...@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentrop...@yahoo.co.uk>
> Cc: Suneel Marthi <suneel_mar...@yahoo.com>
> Sent: Wednesday, 31 July 2013 16:34
> Subject: Re: Latent Dirichlet Allocatio (cvb)
>
> If you're supplying a dictionary file (as you are), I'd suggest not
> specifying the "-nt 90000" option - you're apparently specifying a numTerms
> less than the actual number of terms in some of your vectors. If you
> supply the -dict option, it'll infer the number of terms from reading the
> dictionary, and you don't need to specify it.
>
> On Wed, Jul 31, 2013 at 7:02 AM, Marco <zentrop...@yahoo.co.uk> wrote:
>
> > oops! that did the trick.
> >
> > nonetheless i think the fact that you have to do "rowid" and generate the
> > matrix should be added to the wiki.
> >
> > after waiting for more than an hour i got an error on
> > Writing final document/topic inference from lda/matrix/matrix to
> > jojoba/do-output
> >
> > the error is: org.apache.mahout.math.IndexException: Index 90011 is
> > outside allowable range of [0,90000)
> >
> > Here is how I launched it:
> > mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0
> > -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt
> > --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> >
> > weird thing is also that the job described as "Writing final topic/term
> > distributions from jojoba/mt/model-2 to jojoba/to-output" ran successfully
> > but if i now do a vectordump i always get a Java heap space error
> >
> > ________________________________
> > From: Suneel Marthi <suneel_mar...@yahoo.com>
> > To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentrop...@yahoo.co.uk>
> > Sent: Wednesday, 31 July 2013 11:01
> > Subject: Re: Latent Dirichlet Allocatio (cvb)
> >
> > RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> > (IntWritable, Text).
> >
> > So you should be seeing 2 files generated - jojoba/matrix/matrix and
> > jojoba/matrix/docIndex.
> >
> > Seems like you have been feeding docIndex as input to cvb, which would
> > cause this exception; it's the matrix that needs to be fed as input to cvb.
> >
> > So the input to cvb needs to be "jojoba/matrix/matrix".
> >
> > Give that a try and let us know.
> >
> > ________________________________
> > From: Marco <zentrop...@yahoo.co.uk>
> > To: "user@mahout.apache.org" <user@mahout.apache.org>
> > Sent: Wednesday, July 31, 2013 4:20 AM
> > Subject: Latent Dirichlet Allocatio (cvb)
> >
> > Hi, I'm new here so forgive my little experience with Mahout.
> >
> > We're trying to use Mahout (on our hadoop cluster) for calculating topics
> > on almost 14000 documents.
> >
> > I've been following this wiki page (http://goo.gl/DcPVjB) but still
> > getting errors.
> >
> > Here's what I'm doing:
> >
> > 1) creating sequence file from text files (mahout seqdirectory -i
> > jojoba/text-files -o jojoba/seqfiles)
> > 2) creating vectors from sequence files (mahout seq2sparse -i
> > jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> > 3) launching CVB like this:
> > mahout cvb -i jojoba/vectors/tf-vectors/ -dict
> > jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output
> > -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37
> > -block 1
> >
> > and I get Exception in thread "main" java.lang.InterruptedException:
> > Failed to complete iteration 1 stage 1
> >
> > I later learned here (
> > http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/) that
> > I should actually feed cvb a matrix and not the vectors (shouldn't it be
> > clearly stated in the wiki?).
> >
> > So then I run:
> > mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
> >
> > 3bis) I rerun CVB giving jojoba/matrix as input and I now get
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > org.apache.mahout.math.VectorWritable
> >
> > What am I missing?
> >
> > Thanks a lot for your help
> >
> > --
> > -jake
>
--
-jake
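
Pulling the fixes from this thread together, the command sequence could look like the sketch below. It assumes the same jojoba/ paths used in the messages above and uses only flags that appear in them. The DRY_RUN and BASE variables and the run helper are hypothetical additions for this sketch, so by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the pipeline as corrected in this thread.
# All mahout flags are copied from the commands quoted in the messages above.
# DRY_RUN, BASE and run() are hypothetical helpers, not part of Mahout.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
BASE="${BASE:-jojoba}"    # working directory used throughout the thread

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# 1) text files -> sequence files
run mahout seqdirectory -i "$BASE/text-files" -o "$BASE/seqfiles"

# 2) sequence files -> term-frequency vectors plus dictionary
run mahout seq2sparse -i "$BASE/seqfiles" -o "$BASE/vectors" -wt tf -nv

# 3) vectors -> matrix (IntWritable, VectorWritable) plus docIndex (IntWritable, Text)
run mahout rowid -i "$BASE/vectors/tf-vectors/" -o "$BASE/matrix"

# 4) cvb reads matrix/matrix (not docIndex, not tf-vectors).
#    -nt is left out: with -dict the number of terms is inferred from the dictionary.
run mahout cvb -i "$BASE/matrix/matrix" -dict "$BASE/vectors/dictionary.file-0" \
  -o "$BASE/to-output" -dt "$BASE/do-output" -k 190 -mt "$BASE/mt" \
  --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
```

Setting DRY_RUN=0 runs the commands for real. vectordump is left out on purpose, because its flags are never spelled out in the thread and the heap space error was still unresolved on 0.7.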