I've fixed the bug here: https://issues.apache.org/jira/browse/MAHOUT-289


Try running now.
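For anyone hitting the same ClassCastException quoted below: the mapper's input value type has to be the Writable wrapper (VectorWritable), and the wrapped vector is obtained via get(), not by casting. Here is a minimal, self-contained sketch of that wrapper pattern; the Vec/VecWritable classes are simplified stand-ins, not the real Mahout types:

```java
// Simplified stand-in for org.apache.mahout.math.Vector (illustration only).
class Vec {
    final double[] values;
    Vec(double[] values) { this.values = values; }
}

// Simplified stand-in for VectorWritable: a Writable-style wrapper with get().
class VecWritable {
    private final Vec vec;
    VecWritable(Vec vec) { this.vec = vec; }
    Vec get() { return vec; }
}

public class WrapperDemo {
    public static void main(String[] args) {
        // The framework hands the mapper the wrapper, not the vector itself.
        Object mapInputValue = new VecWritable(new Vec(new double[] {1.0, 2.0}));

        // Wrong: casting the wrapper to the wrapped type fails at runtime,
        // which is the shape of the error in the stack trace below.
        try {
            Vec broken = (Vec) mapInputValue;
            System.out.println("unexpected: cast succeeded");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as expected");
        }

        // Right: treat the value as the wrapper type and unwrap via get().
        Vec vec = ((VecWritable) mapInputValue).get();
        System.out.println("vector length = " + vec.values.length);
    }
}
```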

Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:

> I checked the code.
>
> Line 36 of LDAMapper.java references Vector:
>
> public class LDAMapper extends
>    Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable>
> {
>
> Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> do a conversion to VectorWritable?
>
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
>
> On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:
>
> >
> > Ok I did a clean checkout & install and ran everything again, then
> > pointed LDA to mahout_vectors/vectors.
> >
> > Now I get this error:
> >
> > 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> > attempt_201001192218_0216_m_000005_0, Status : FAILED
> > java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> > cannot be cast to org.apache.mahout.math.Vector*
> >  at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >  at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > Ovi
> >
> >
> >
> > On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:
> >
> >> Hi Ovidiu,
> >>            If you choose tf, the vectors are generated in
> >> outputfolder/vectors, and if you choose tfidf, the vectors are generated
> >> in outputfolder/tfidf/vectors. I am in the process of changing the code
> >> to move the output of the map/reduce to a fixed destination, and that
> >> exception would have caused the folders not to move. That's the reason
> >> for the last error.
> >>
> >> For the first error, I am not sure what is happening. Could you do a
> >> clean compile of Mahout:
> >>
> >> mvn clean install -DskipTests=true
> >>
> >> and make sure you svn up the trunk before doing that.
> >>
> >> Then point your LDA to mahout_vectors/vectors
> >>
> >> Robin
> >>
> >>
> >> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:
> >>
> >> > Hi again,
> >> >
> >> > Is there any workaround for my problem(s)? Or is there any other way
> >> > that would allow me to transform many, many small messages (they're
> >> > Tweets) into Mahout vectors, and then run LDA on them, without getting
> >> > these errors? Converting them to txt files would be a bit of a pain
> >> > because I would get millions of very small files. And a Lucene index
> >> > would be a bit overkill, I think.
> >> >
> >> > Thanks,
> >> > Ovi
> >> >
> >> > ---
> >> > Ovidiu Dan - http://www.ovidiudan.com/
> >> >
> >> > Please do not print this e-mail unless it is mandatory
> >> >
> >> > My public key can be downloaded from subkeys.pgp.net, or
> >> > http://www.ovidiudan.com/public.pgp
> >> >
> >> >
> >> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:
> >> >
> >> > > Was meant for the dev list. I am looking into the first error
> >> > >
> >> > > -bcc mahout-user
> >> > >
> >> > >
> >> > > ---------- Forwarded message ----------
> >> > > From: Robin Anil <[email protected]>
> >> > > Date: Fri, Feb 12, 2010 at 2:20 PM
> >> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> >> > > To: [email protected]
> >> > >
> >> > >
> >> > > Hi,
> >> > >
> >> > >      This confusion arises from the fact that we use intermediate
> >> > > folders as subfolders under the output folder. How about we
> >> > > standardize on all the jobs taking input, intermediate, and output
> >> > > folders? If not for this release, then for the next?
> >> > >
> >> > > Robin
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:
> >> > >
> >> > > > Hello Mahout developers / users,
> >> > > >
> >> > > > I am trying to convert a properly formatted SequenceFile to Mahout
> >> > > > vectors to run LDA on them. As reference I am using these two
> >> > > > documents:
> >> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> >> > > >
> >> > > > I got the Mahout code from SVN on February 11th, 2010. Below I am
> >> > > > listing the steps I took and the problems I encountered:
> >> > > >
> >> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> >> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
> >> > > > org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2
> >> > > > --minDF 1 --maxDFPercent 50 --norm 2
> >> > > >
> >> > > > *Problem #1:* Got this error at the end, but I think everything
> >> > > > finished more or less correctly:
> >> > > > Exception in thread "main" java.lang.NoSuchMethodError:
> >> > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> >> > > >   at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> >> > > >   at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> >> > > >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > > >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> > > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> > > >   at java.lang.reflect.Method.invoke(Method.java:597)
> >> > > >   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #2:* Exception in thread "main"
> >> > > > java.io.FileNotFoundException: File does not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Tried to fix:*
> >> > > >
> >> > > > ../../hadoop fs -mv
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Ran again:*
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #3:*
> >> > > >
> >> > > > Exception in thread "main" java.io.FileNotFoundException: File does
> >> > > > not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> >> > > >
> >> > > > [had...@some_server retweets]$ ../../hadoop fs -ls
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> >> > > > Found 3 items
> >> > > > -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> >> > > > -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> >> > > > -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> >> > > >
> >> > > > Also, as a *bonus problem*: if the input folder
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more
> >> > > > than one file (for example, if I run only the maps without a final
> >> > > > reducer), this whole chain doesn't work.
> >> > > >
> >> > > > Thanks,
> >> > > > Ovi
> >> > > >
> >> > > > ---
> >> > > > Ovidiu Dan - http://www.ovidiudan.com/
> >> > > >
> >> > > > Please do not print this e-mail unless it is mandatory
> >> > > >
> >> > > > My public key can be downloaded from subkeys.pgp.net, or
> >> > > > http://www.ovidiudan.com/public.pgp
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
