Since you're supplying a dictionary file, I'd suggest dropping the "-nt 90000" option: the numTerms you're passing is apparently smaller than the actual number of terms in some of your vectors, which is why index 90011 falls outside [0,90000). If you supply the -dict option, it'll infer the number of terms from reading the dictionary, so you don't need to specify it.
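Roughly, that means your own command from the message below with only "-nt 90000" removed. This is a sketch that I haven't run against your data; the CVB_ARGS variable and the "command -v" guard are just scaffolding I added so the snippet prints the command on a machine where mahout isn't on the PATH.

```shell
# Same cvb invocation as in the quoted message, minus "-nt 90000".
# With -dict given, the number of terms is read from the dictionary.
# CVB_ARGS is a made-up variable name used only for this sketch.
CVB_ARGS="-i jojoba/matrix/matrix \
  -dict jojoba/vectors/dictionary.file-0 \
  -o jojoba/to-output -dt jojoba/do-output \
  -k 190 -mt jojoba/mt \
  --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1"

if command -v mahout >/dev/null 2>&1; then
  # Run the job for real when the mahout launcher is available.
  mahout cvb $CVB_ARGS
else
  # Otherwise just show the command that would be run.
  echo mahout cvb $CVB_ARGS
fi
```

Everything else (paths, -k, alpha/eta, seed) is unchanged from what you ran.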
On Wed, Jul 31, 2013 at 7:02 AM, Marco <zentrop...@yahoo.co.uk> wrote:

> oops! that did the trick.
>
> nonetheless i think the fact that you have to do "rowid" and generate the
> matrix should be added to the wiki.
>
> after waiting for more than an hour i got an error on
> Writing final document/topic inference from lda/matrix/matrix to
> jojoba/do-output
>
> the error is: org.apache.mahout.math.IndexException: Index 90011 is
> outside allowable range of [0,90000)
>
> Here is how I launched it:
> mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0
> -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt
> --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
>
> weird thing is also that the job described as "Writing final topic/term
> distributions from jojoba/mt/model-2 to jojoba/to-output" ran successfully
> but if i now do a vectordump i always get a Java heap space error
>
> ________________________________
> From: Suneel Marthi <suneel_mar...@yahoo.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <
> zentrop...@yahoo.co.uk>
> Sent: Wednesday, July 31, 2013 11:01
> Subject: Re: Latent Dirichlet Allocatio (cvb)
>
> RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> (IntWritable, Text).
>
> So you should be seeing 2 files generated - jojoba/matrix/matrix and
> jojoba/matrix/docIndex.
>
> Seems like you have been feeding docIndex as input to cvb, which would
> cause this exception; it's the matrix that needs to be fed as input to cvb.
>
> So the input to cvb needs to be "jojoba/matrix/matrix".
>
> Give that a try and let us know.
>
> ________________________________
> From: Marco <zentrop...@yahoo.co.uk>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Sent: Wednesday, July 31, 2013 4:20 AM
> Subject: Latent Dirichlet Allocatio (cvb)
>
> Hi, I'm new here so forgive my little experience with Mahout.
>
> We're trying to use Mahout (on our hadoop cluster) for calculating topics
> on almost 14000 documents.
>
> I've been following this wiki page (http://goo.gl/DcPVjB) but still
> getting errors.
>
> Here's what I'm doing:
>
> 1) creating sequence file from text files (mahout seqdirectory -i
> jojoba/text-files -o jojoba/seqfiles)
> 2) creating vectors FROM sequence files (mahout seq2sparse -i
> jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> 3) launching CVB like this:
> mahout cvb -i jojoba/vectors/tf-vectors/ -dict
> jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output
> -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37
> -block 1
>
> and I get Exception in thread "main" java.lang.InterruptedException:
> Failed to complete iteration 1 stage 1
>
> I later learned here (
> http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/) that
> I should actually feed cvb a matrix and not the vectors (shouldn't it be
> clearly stated in the wiki?).
> So then I run:
> mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
>
> 3bis) I rerun CVB giving jojoba/matrix as input and I now get
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.math.VectorWritable
>
> What am I missing?
>
> Thanks a lot for your help
>

--

  -jake