This was meant for the dev list. I am looking into the first error.

-bcc mahout-user


---------- Forwarded message ----------
From: Robin Anil <robin.a...@gmail.com>
Date: Fri, Feb 12, 2010 at 2:20 PM
Subject: Re: Problem converting SequenceFile to vectors, then running LDA
To: mahout-u...@lucene.apache.org


Hi,

      This confusion arises from the fact that we use intermediate folders
as subfolders under the output folder. How about we standardize on all the
jobs taking input, intermediate and output folders? If not now, then for the
next release?
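
As a rough sketch of that layout, the helper below derives three sibling folders from one base path, so a later job never finds intermediate data inside the folder it reads as input. The function names and the input/intermediate/output folder names are hypothetical illustrations, not existing Mahout options.

```shell
#!/bin/sh
# Hypothetical layout helper: none of these names are Mahout options.
# Each function takes a base folder and prints one sibling folder under it.
# "${1%/}" strips a single trailing slash so both "/a/b" and "/a/b/" work.

job_input_dir()        { printf '%s/input\n'        "${1%/}"; }
job_intermediate_dir() { printf '%s/intermediate\n' "${1%/}"; }
job_output_dir()       { printf '%s/output\n'       "${1%/}"; }

# Example base path taken from the thread; only strings are built here,
# nothing is created on HDFS.
BASE=/user/MY_USERNAME/projects/lda/
echo "input:        $(job_input_dir "$BASE")"
echo "intermediate: $(job_intermediate_dir "$BASE")"
echo "output:       $(job_output_dir "$BASE")"
```

This only builds path strings; how each job would accept the three folders is left open.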

Robin




On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <zma...@gmail.com> wrote:

> Hello Mahout developers / users,
>
> I am trying to convert a properly formatted SequenceFile to Mahout vectors
> to run LDA on them. As reference I am using these two documents:
> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>
> I got the Mahout code from SVN on February 11th 2010. Below I am listing
> the steps I took and the problems I encountered:
>
> export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
>
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
> /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
> /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
> org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2 --minDF
> 1 --maxDFPercent 50 --norm 2
>
> *Problem #1:* Got this error at the end, but I think everything finished
> more or less correctly:
> Exception in thread "main" java.lang.NoSuchMethodError: org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.clustering.lda.LDADriver -i
> /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> --numReducers 33
>
> *Problem #2:* Exception in thread "main" java.io.FileNotFoundException:
> File does not exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Tried to fix:*
>
> ../../hadoop fs -mv
> /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
> /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Ran again:*
>
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.clustering.lda.LDADriver -i
> /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> --numReducers 33
>
> *Problem #3:*
>
> Exception in thread "main" java.io.FileNotFoundException: File does not
> exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
>
> [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> Found 3 items
> -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
>
> Also, as a *bonus problem*, if the input folder
> /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than
> one file (for example, if I run only the maps without a final reducer),
> this whole chain doesn't work.
>
> Thanks,
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
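
Problems #2 and #3 above both show the driver looking for a `data` file inside a subfolder of the path passed with -i, which fits the point about intermediate folders living under the output folder. A minimal sketch for spotting such subfolders before running LDADriver follows. `list_subdirs` is a hypothetical helper, and it assumes the `hadoop fs -ls` line format shown in the listing above, with paths that contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hadoop or Mahout.
# Reads `hadoop fs -ls` output on stdin and prints the path of every
# directory entry. Directory lines start with "d" in the permissions
# column; the path is taken as the last whitespace-separated field,
# so paths containing spaces are not handled.
list_subdirs() {
  awk '$1 ~ /^d/ { print $NF }'
}
```

It could be used as `$HADOOP_HOME/bin/hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/ | list_subdirs`. Any folder it prints is worth checking before passing the parent folder to a job that expects only vector files.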
