Hi Ovidiu,
If you choose tf, the vectors are generated in
outputfolder/vectors; if you choose tfidf, they are generated in
outputfolder/tfidf/vectors. I am in the process of changing the code to move
the output of the map/reduce to a fixed destination, and that exception would
have prevented the folders from being moved.
That's the reason for the last error.
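In other words, the vector location depends only on the weighting you pick. A minimal sketch of that rule, just to illustrate (plain Python; the function name is mine, not Mahout code — the folder names are the ones described above):

```python
def vector_output_path(output_folder, weighting):
    """Return where the generated vectors land for a given weighting.

    tf    -> <output_folder>/vectors
    tfidf -> <output_folder>/tfidf/vectors
    """
    if weighting == "tf":
        return output_folder + "/vectors"
    if weighting == "tfidf":
        return output_folder + "/tfidf/vectors"
    raise ValueError("weighting must be 'tf' or 'tfidf'")
```

So with -wt tf and -o .../mahout_vectors, the vectors end up under
.../mahout_vectors/vectors.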
For the first error, I am not sure what is happening. Could you do a clean
build of Mahout (mvn clean install -DskipTests=true), and make sure you svn up
the trunk before doing that?
Then point your LDA job to mahout_vectors/vectors.
Robin
On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:
> Hi again,
>
> Is there any workaround for my problem(s)? Or is there any other way that
> would allow me to transform many, many small messages (they're Tweets) into
> Mahout vectors, and then run LDA on them, without getting these errors?
> Converting them to txt files would be a bit of a pain because I would get
> millions of very small files. And a Lucene index would be a bit overkill I
> think.
>
> Thanks,
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
>
> On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:
>
> > This was meant for the dev list. I am looking into the first error.
> >
> > -bcc mahout-user
> >
> >
> > ---------- Forwarded message ----------
> > From: Robin Anil <[email protected]>
> > Date: Fri, Feb 12, 2010 at 2:20 PM
> > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> > To: [email protected]
> >
> >
> > Hi,
> >
> > This confusion arises from the fact that we use intermediate folders
> > as subfolders under the output folder. How about we standardize on all
> > the jobs taking input, intermediate, and output folders? If not this,
> > then for the next release?
> >
> > Robin
> >
> >
> >
> >
> > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:
> >
> > > Hello Mahout developers / users,
> > >
> > > I am trying to convert a properly formatted SequenceFile to Mahout
> > > vectors to run LDA on them. As a reference, I am using these two
> > > documents:
> > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> > >
> > > I got the Mahout code from SVN on February 11th, 2010. Below I am
> > > listing the steps I took and the problems I encountered:
> > >
> > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> > >
> > > $HADOOP_HOME/bin/hadoop jar
> > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> > > org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
> > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
> > > org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2
> > > --minDF 1 --maxDFPercent 50 --norm 2
> > >
> > > *Problem #1:* Got this error at the end, but I think everything
> > > finished more or less correctly:
> > >
> > > Exception in thread "main" java.lang.NoSuchMethodError:
> > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> > >         at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> > >         at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:597)
> > >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > >
> > > $HADOOP_HOME/bin/hadoop jar
> > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> > > org.apache.mahout.clustering.lda.LDADriver -i
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> > > --numReducers 33
> > >
> > > *Problem #2:* Exception in thread "main" java.io.FileNotFoundException:
> > > File does not exist:
> > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> > >
> > > *Tried to fix:*
> > >
> > > ../../hadoop fs -mv
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> > >
> > > *Ran again:*
> > >
> > > $HADOOP_HOME/bin/hadoop jar
> > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> > > org.apache.mahout.clustering.lda.LDADriver -i
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> > > --numReducers 33
> > >
> > > *Problem #3:*
> > >
> > > Exception in thread "main" java.io.FileNotFoundException: File does not
> > > exist:
> > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> > >
> > > [had...@some_server retweets]$ ../../hadoop fs -ls
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> > > Found 3 items
> > > -rw-r--r-- 3 hadoop supergroup 129721338 2010-02-11 23:54
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> > > -rw-r--r-- 3 hadoop supergroup 128256085 2010-02-11 23:54
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> > > -rw-r--r-- 3 hadoop supergroup 24160265 2010-02-11 23:54
> > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> > >
> > > Also, as a *bonus problem*: if the input
> > > folder /user/MY_USERNAME/projects/lda/twitter_sequence_files contains
> > > more than one file (for example, if I run only the maps without a final
> > > reducer), this whole chain doesn't work.
> > >
> > > Thanks,
> > > Ovi
> > >
> > > ---
> > > Ovidiu Dan - http://www.ovidiudan.com/
> > >
> > > Please do not print this e-mail unless it is mandatory
> > >
> > > My public key can be downloaded from subkeys.pgp.net, or
> > > http://www.ovidiudan.com/public.pgp
> > >
> >
>