OK, I did a clean checkout & install and ran everything again, then pointed LDA to mahout_vectors/vectors.
Now I get this error:

10/02/12 14:28:35 INFO mapred.JobClient: Task Id : attempt_201001192218_0216_m_000005_0, Status : FAILED
java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable cannot be cast to org.apache.mahout.math.Vector*
        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp
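For context on the cast failure: the vectors written by the vectorizer are stored as VectorWritable values in a SequenceFile (the exception itself names VectorWritable), so a mapper that consumes them has to declare VectorWritable as its input value type and unwrap the Vector with get() rather than casting. A minimal sketch of that pattern, assuming <Text, VectorWritable> input; the class and variable names are illustrative and this is not the actual LDAMapper source:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative mapper over vectorizer output: keys are document ids,
// values are VectorWritable wrappers around the actual term vectors.
public class VectorConsumingMapper
    extends Mapper<Text, VectorWritable, Text, VectorWritable> {

  @Override
  protected void map(Text docId, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    Vector vector = value.get();   // unwrap; casting VectorWritable to Vector fails
    // ... per-document work on 'vector' would go here ...
    context.write(docId, new VectorWritable(vector));
  }
}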
On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:

> Hi Ovidiu,
> If you choose tf, the vectors are generated in outputfolder/vectors, and if you choose tfidf, the vectors are generated in outputfolder/tfidf/vectors. I am in the process of changing the code to move the output of the map/reduce to a fixed destination, and that exception would have caused the folders not to move. That's the reason for the last error.
>
> For the first error, I am not sure what is happening. Could you do a clean compile of Mahout:
>
> mvn clean install -DskipTests=true
>
> and make sure you svn up the trunk before doing that.
>
> Then point your LDA to mahout_vectors/vectors
>
> Robin
>
>
> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:
>
> > Hi again,
> >
> > Is there any workaround for my problem(s)? Or is there any other way that would allow me to transform many many small messages (they're Tweets) into Mahout vectors, and then run LDA on them, without getting these errors? Converting them to txt files would be a bit of a pain because I would get millions of very small files. And a Lucene index would be a bit overkill, I think.
> >
> > Thanks,
> > Ovi
> >
> > ---
> > Ovidiu Dan - http://www.ovidiudan.com/
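On the "millions of very small files" point: as far as I can tell, the input to SparseVectorsFromSequenceFiles is simply a SequenceFile of <Text, Text> pairs (document id, document text), so the tweets can be packed into one SequenceFile without ever writing each tweet out as a separate text file. A rough sketch under that assumption; the output path and the loadTweets() source are placeholders to be replaced with the real thing:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small documents (e.g. tweets) into a single SequenceFile of
// <Text id, Text body> pairs, the kind of input the vectorizer job reads.
public class TweetsToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder output path; any file under the folder passed to -i works.
    Path output = new Path("twitter_sequence_files/chunk-0");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
    try {
      for (String[] tweet : loadTweets()) {   // tweet[0] = id, tweet[1] = text
        writer.append(new Text(tweet[0]), new Text(tweet[1]));
      }
    } finally {
      writer.close();
    }
  }

  // Placeholder: replace with whatever actually iterates over the tweets.
  private static List<String[]> loadTweets() {
    return Collections.<String[]>emptyList();
  }
}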
> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:
> >
> > > Was meant for the dev list. I am looking into the first error.
> > >
> > > -bcc mahout-user
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Robin Anil <[email protected]>
> > > Date: Fri, Feb 12, 2010 at 2:20 PM
> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> > > To: [email protected]
> > >
> > > Hi,
> > >
> > > This confusion arises from the fact that we use intermediate folders as subfolders under the output folder. How about we standardize on all the jobs taking input, intermediate, and output folders? If not this, then for the next release?
> > >
> > > Robin
> > >
> > >
> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:
> > >
> > > > Hello Mahout developers / users,
> > > >
> > > > I am trying to convert a properly formatted SequenceFile to Mahout vectors to run LDA on them. As reference I am using these two documents:
> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> > > >
> > > > I got the Mahout code from SVN on February 11th, 2010. Below I am listing the steps I took and the problems I encountered:
> > > >
> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> > > >
> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2
> > > >
> > > > *Problem #1:* Got this error at the end, but I think everything finished more or less correctly:
> > > >
> > > > Exception in thread "main" java.lang.NoSuchMethodError: org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> > > >         at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> > > >         at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> > > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > >
> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> > > >
> > > > *Problem #2:* Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> > > >
> > > > *Tried to fix:*
> > > >
> > > > ../../hadoop fs -mv /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> > > >
> > > > *Ran again:*
> > > >
> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> > > >
> > > > *Problem #3:*
> > > >
> > > > Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> > > >
> > > > [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> > > > Found 3 items
> > > > -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> > > > -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> > > > -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> > > >
> > > > Also, as a *bonus problem*, if the input folder /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than one file (for example if I run only the maps without a final reducer), this whole chain doesn't work.
> > > >
> > > > Thanks,
> > > > Ovi
> > > >
> > > > ---
> > > > Ovidiu Dan - http://www.ovidiudan.com/
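Since Problems #2 and #3 both come down to LDADriver being pointed at a folder that only holds intermediate output (partial-vectors-0, tokenized-documents) rather than the finished vectors, it can help to open a candidate part file and check what it actually contains before launching LDA. A small read-only sketch, assuming the file holds <Text, VectorWritable> pairs; the path below is hypothetical and should be pointed at whichever file you want to inspect:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// Dumps the first few entries of a vector SequenceFile so you can confirm
// the folder really contains <Text, VectorWritable> pairs before running LDA.
public class InspectVectors {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path; adjust to the file you want to check.
    Path path = new Path("projects/lda/mahout_vectors/vectors/part-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      VectorWritable value = new VectorWritable();
      int shown = 0;
      while (reader.next(key, value) && shown++ < 5) {
        System.out.println(key + " -> "
            + value.get().getNumNondefaultElements() + " non-zero entries");
      }
    } finally {
      reader.close();
    }
  }
}

If the read fails because the values are not VectorWritable (for example, tokenized-documents holds tokenized text rather than vectors), that folder is not the one to feed to LDADriver.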
