I think it ran fine this time, but I would now like to use *LDAPrintTopics.java* to print the words for each of the k clusters.

I have a few issues with it:

- The --dict option only accepts a dictionary file that is present on the local file system, not on HDFS. That's odd, but ok.
- I not only have to move the dictionary.file-0 file from HDFS to the local filesystem, I also have to convert it to text (./hadoop fs -text /user/SOME_USERNAME/projects/lda/mahout_vectors/dictionary.file-0 > dictionary.txt). This is fine as well.
- If I then run the command, it fails with: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2. I checked the code and it expects a tab-separated file with three columns, not two (hence the out-of-bounds exception at index 2, which is the third column). The file I currently have is in the format WORD (tab) WORD_ID.
- I changed line 154 of LDAPrintTopics.java to int index = Integer.parseInt(parts[1]); and also commented out lines 148 and 149. A sketch of what the parsing amounts to now is below.
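Roughly, the change reads the dictionary as a two-column id map instead of the three-column format the code expects. A standalone sketch, not the exact LDAPrintTopics code (the surrounding option handling and topic printing are omitted):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionarySketch {
      // Reads a WORD<tab>WORD_ID dictionary into an id -> word map.
      public static Map<Integer, String> readDictionary(String path) throws IOException {
        Map<Integer, String> dict = new HashMap<Integer, String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts.length < 2) {
              continue; // skip malformed or header lines
            }
            // Two columns: parts[0] is the word, parts[1] is the word id.
            // (The unmodified code reads the id from parts[2], hence the exception.)
            int index = Integer.parseInt(parts[1]);
            dict.put(index, parts[0]);
          }
        } finally {
          in.close();
        }
        return dict;
      }
    }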
Still not done, but I will continue in the morning.

Thanks,
Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp


On Fri, Feb 12, 2010 at 5:21 PM, Robin Anil <[email protected]> wrote:

> try increasing the minSupport. You are already removing words that occur in
> more than 50% of the documents, so I see no trouble there.
>
> Robin

On Sat, Feb 13, 2010 at 3:49 AM, Ovidiu Dan <[email protected]> wrote:

> Since my texts are not exactly correct English, that might be the case.
> I'll try it in a sec, thanks.
>
> Ovi

On Fri, Feb 12, 2010 at 5:16 PM, Robin Anil <[email protected]> wrote:

> There are 21K unigrams in Reuters (from the df count job counters). I am
> running LDA fine with 100K as numWords.
>
> Robin

On Sat, Feb 13, 2010 at 3:42 AM, David Hall <[email protected]> wrote:

> > On Fri, Feb 12, 2010 at 2:00 PM, Ovidiu Dan <[email protected]> wrote:
> >
> > Hi, thanks for the fix. I guess it's one step closer to a workable
> > solution. I am now getting this error:
> >
> > 10/02/12 16:59:25 INFO mapred.JobClient: Task Id :
> > attempt_201001192218_0234_m_000004_1, Status : FAILED
> > java.lang.ArrayIndexOutOfBoundsException: 104017
> >     at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:75)
> >     at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:40)
> >     at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:204)
> >     at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:117)
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:37)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > Ovi
>
> You probably have more words than you allotted at the beginning.
>
> I know it's less than ideal, but at the moment you need to specify an upper
> bound on the number of words. That's the numWords parameter.
>
> Try upping it by a factor of two or so.
>
> -- David
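For concreteness: the LDADriver run further down used --numWords 100000 and the failing index above is 104017, so a retry along David's suggestion would look something like the following, with the same paths and the bound doubled; the exact value just has to comfortably exceed the vocabulary size.

    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 200000 --numReducers 33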
On Fri, Feb 12, 2010 at 4:29 PM, Ovidiu Dan <[email protected]> wrote:

> Will do in a bit, thanks!
>
> Ovi

On Fri, Feb 12, 2010 at 4:27 PM, Robin Anil <[email protected]> wrote:

> Ovidiu, we just committed the fix. Just recreate the vectors using the -seq
> option added to it.
>
> Remember to svn up and recompile.
>
> Robin

On Sat, Feb 13, 2010 at 2:39 AM, Jake Mannix <[email protected]> wrote:

> Robin and I are trying out a fix. I already ran into this in hooking
> the Vectorizer up to my SVD code.
>
> -jake

On Fri, Feb 12, 2010 at 1:08 PM, Ovidiu Dan <[email protected]> wrote:

> Well, I tried it. The previous problem is fixed, but now I have a new and
> shiny one :(
>
> Meet line 95 of LDAInference.java:
> DenseMatrix phi = new DenseMatrix(state.numTopics, docLength);
>
> docLength is calculated above on line 88:
> int docLength = wordCounts.size();
>
> My problem is that docLength is always 2147483647. Since DenseMatrix
> allocates an array based (also) on this value, I get multiple "Requested
> array size exceeds VM limit" messages (an array with 2147483647 columns
> would be quite large).
>
> I added a trivial toString function that displays vector.size()
> in org/apache/mahout/math/VectorWritable.java, recompiled the project, then ran:
>
> ./hadoop fs -text /user/MY_USERNAME/projects/lda/mahout_vectors/vectors/part-00000
>
> All lines were DOCUMENT_ID (tab) 2147483647. So all vectors report
> size 2147483647.
>
> I checked the output for vector.zSum() as well, and that one looks fine.
>
> I can confirm that my input SequenceFile is correct; it has the following format:
> - key: Text with the unique id of the document
> - value: Text with the contents of the document
>
> Ovi
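As an aside, a SequenceFile in exactly that shape can be written with the plain Hadoop API. A minimal sketch, assuming Text keys and Text values as described above; the class name, output path and sample records are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteDocsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("twitter_sequence_files/docs.seq"); // illustrative path
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          // key: unique id of the document, value: contents of the document
          writer.append(new Text("doc-1"), new Text("contents of the first document"));
          writer.append(new Text("doc-2"), new Text("contents of the second document"));
        } finally {
          writer.close();
        }
      }
    }

Appending many small documents into a single file like this avoids creating millions of tiny files.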
On Fri, Feb 12, 2010 at 3:30 PM, Ovidiu Dan <[email protected]> wrote:

> Thanks, I also patched it myself, but now I have some other problems. I'll
> run it and let you know how it goes.
>
> Ovi

On Fri, Feb 12, 2010 at 3:25 PM, Robin Anil <[email protected]> wrote:

> I fixed the bug here: https://issues.apache.org/jira/browse/MAHOUT-289
>
> Try running now.
>
> Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:

> I checked the code.
>
> Line 36 of LDAMapper.java references Vector:
>
> public class LDAMapper extends
>     Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable> {
>
> Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> do a conversion to VectorWritable?
>
> Ovi
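For what it's worth, the usual pattern with the new Hadoop API is to parameterize the mapper on the Writable wrapper and call get() on it inside map(). A minimal sketch, not the actual LDAMapper code; the class name, output types and body are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class VectorMapperSketch
        extends Mapper<WritableComparable<?>, VectorWritable, WritableComparable<?>, DoubleWritable> {
      @Override
      protected void map(WritableComparable<?> key, VectorWritable value, Context context)
          throws IOException, InterruptedException {
        Vector document = value.get(); // unwrap the Writable before doing any math
        // Illustrative only: emit something computed from the vector.
        context.write(key, new DoubleWritable(document.zSum()));
      }
    }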
On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:

> Ok, I did a clean checkout & install and ran everything again, then pointed
> LDA to mahout_vectors/vectors.
>
> Now I get this error:
>
> 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> attempt_201001192218_0216_m_000005_0, Status : FAILED
> java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> cannot be cast to org.apache.mahout.math.Vector*
>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Ovi

On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:

> Hi Ovidiu,
>
> If you choose tf, the vectors are generated in outputfolder/vectors, and if
> you choose tfidf the vectors are generated in outputfolder/tfidf/vectors. I
> am in the process of changing the code to move the output of the map/reduce
> to a fixed destination, and that exception would have caused the folders not
> to move. That's the reason for the last error.
>
> For the first error, I am not sure what is happening.
> Could you do a clean compile of mahout:
>
> mvn clean install -DskipTests=true
>
> and make sure you svn up the trunk before doing that.
>
> Then point your LDA to mahout_vectors/vectors.
>
> Robin

On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:

> Hi again,
>
> Is there any workaround for my problem(s)? Or is there any other way that
> would allow me to transform many, many small messages (they're Tweets) into
> Mahout vectors, and then run LDA on them, without getting these errors?
> Converting them to txt files would be a bit of a pain because I would get
> millions of very small files. And a Lucene index would be a bit overkill, I
> think.
>
> Thanks,
> Ovi

On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:

> Was meant for the dev list. I am looking into the first error.
>
> -bcc mahout-user
>
> ---------- Forwarded message ----------
> From: Robin Anil <[email protected]>
> Date: Fri, Feb 12, 2010 at 2:20 PM
> Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> To: [email protected]
>
> Hi,
>
> This confusion arises from the fact that we use intermediate folders as
> subfolders under the output folder. How about we standardize on all the jobs
> taking an input, an intermediate and an output folder?
If > > not > > > > this > > > > >>> > then > > > > >>> > > >> for > > > > >>> > > >> > >> the > > > > >>> > > >> > >> > > next > > > > >>> > > >> > >> > > release? > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > Robin > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan < > > > > >>> > [email protected] > > > > >>> > > > > > > > >>> > > >> > >> wrote: > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > Hello Mahout developers / users, > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > I am trying to convert a properly formatted > > > > SequenceFile > > > > >>> to > > > > >>> > > >> Mahout > > > > >>> > > >> > >> > > vectors > > > > >>> > > >> > >> > > > to run LDA on them. As reference I am using > these > > > two > > > > >>> > > >> documents: > > > > >>> > > >> > >> > > > > > > > >>> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html > > > > >>> > > >> > >> > > > > > > > >>> > > >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > I got the Mahout code from SVN on February 11th > > > 2010. > > > > >>> Below > > > > >>> > I > > > > >>> > > >> am > > > > >>> > > >> > >> > listing > > > > >>> > > >> > >> > > > the > > > > >>> > > >> > >> > > > steps I have took and the problems I have > > > > encountered: > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > export > > > > HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/ > > > > >>> > > >> > >> > > > export > > > > >>> > > >> > >> > > > > > >>> > > MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/ > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > $HADOOP_HOME/bin/hadoop jar > > > > >>> > > >> > >> > > > > > > > >>> > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job > > > > >>> > > >> > >> > > > > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles > > > > -i > > > > >>> > > >> > >> > > > > > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ > > > > >>> -o > > > > >>> > > >> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ > > -wt > > > tf > > > > >>> > -chunk > > > > >>> > > >> 300 > > > > >>> > > >> > -a > > > > >>> > > >> > >> > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > >>> > > >> --minSupport > > > > >>> > > >> > 2 > > > > >>> > > >> > >> > > --minDF > > > > >>> > > >> > >> > > > 1 --maxDFPercent 50 --norm 2 > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > *Problem #1: *Got this error at the end, but I > > > think > > > > >>> > > everything > > > > >>> > > >> > >> > finished > > > > >>> > > >> > >> > > > more or less correctly: > > > > >>> > > >> > >> > > > Exception in thread "main" > > > > java.lang.NoSuchMethodError: > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > >>> > > >> > >> > > > > >>> > > >> > > > > > >>> > > >> > > > > >>> > > > > > > >>> > > > > > >>> > > > > > > > > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V > > > > >>> > > >> > >> > > > at > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > >>> > > >> > >> > > > > >>> > > >> > > > > > >>> > > >> > > > > >>> > > > > > > >>> > > > > > >>> > > > > > > > > > > 
>     at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
>     at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
>
> *Problem #2:* Exception in thread "main" java.io.FileNotFoundException: File
> does not exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Tried to fix:*
>
> ../../hadoop fs -mv /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Ran again:*
>
> $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> *Problem #3:*
>
> Exception in thread "main" java.io.FileNotFoundException: File does not exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
>
> [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> Found 3 items
> -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
>
> Also, as a *bonus problem*, if the input folder
> /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than one
> file (for example if I run only the maps without a final reducer), this
> whole chain doesn't work.
> Thanks,
> Ovi
