Hello Mahout developers / users,

I am trying to convert a properly formatted SequenceFile to Mahout vectors to run LDA on them. As reference I am using these two documents:

http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I got the Mahout code from SVN on February 11th, 2010. Below I list the steps I took and the problems I encountered:

export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2

*Problem #1:* I got this error at the end, but I think everything finished more or less correctly:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
    at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
    at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Next, I ran LDA on the output:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33

*Problem #2:*

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Tried to fix:*

../../hadoop fs -mv /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Ran again:*

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33

*Problem #3:*

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data

[had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
Found 3 items
-rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
-rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
-rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002

Also, as a *bonus problem*: if the input folder /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than one file (for example, if I run only the maps without a final reducer), this whole chain doesn't work.

Thanks,
Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/
