Ovidiu seems to have found a blocker. I don't know how this is possible. When the Writable interface was removed from Vector, I don't see how it was still being accepted by the Mapper interface.
Aargh. In Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> the type parameters are not declared as extending Writable; that's why it was compiling. No wonder the tests didn't pick it up. We will have to sweep the code to check what else broke after we moved Writable out of Vector.

Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <zma...@gmail.com> wrote:

> I checked the code.
>
> Line 36 of LDAMapper.java references Vector:
>
> public class LDAMapper extends
> Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable>
> {
>
> Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> do a conversion to VectorWritable?
>
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
> On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <zma...@gmail.com> wrote:
>
> > Ok I did a clean checkout & install and ran everything again, then
> > pointed LDA to mahout_vectors/vectors.
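Robin's point can be illustrated in plain Java. The classes below are simplified stand-ins for illustration only, not the real Hadoop/Mahout types: because Mapper's type parameters carry no `extends Writable` bound, a value type that is not Writable compiles cleanly, and the mismatch only surfaces at runtime as the ClassCastException Ovidiu reports.

```java
// Simplified stand-ins -- NOT the real Hadoop/Mahout classes.
interface Writable {}

class Vector {}                                // no longer implements Writable

class VectorWritable implements Writable {     // serialization wrapper around a Vector
    private final Vector vector = new Vector();
    Vector get() { return vector; }
}

// Mirrors the shape of org.apache.hadoop.mapreduce.Mapper: VALUEIN is
// unbounded, so nothing forces it to be a Writable at compile time.
class Mapper<KEYIN, VALUEIN> {
    void map(KEYIN key, VALUEIN value) {}
}

public class UnboundedGenericsDemo {
    public static void main(String[] args) {
        // Compiles and runs even though Vector is not Writable:
        Mapper<String, Vector> mapper = new Mapper<>();
        mapper.map("doc1", new Vector());
        System.out.println("compiled with a non-Writable VALUEIN");

        // But the framework actually deserializes a VectorWritable from the
        // sequence file, so an unchecked cast to Vector fails at runtime:
        Object deserialized = new VectorWritable();
        try {
            Vector v = (Vector) deserialized;
            System.out.println("cast succeeded: " + v);
        } catch (ClassCastException e) {
            System.out.println("caught ClassCastException");
        }
    }
}
```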
> > Now I get this error:
> >
> > 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> > attempt_201001192218_0216_m_000005_0, Status : FAILED
> > java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> > cannot be cast to org.apache.mahout.math.Vector*
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > Ovi
> >
> > On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <robin.a...@gmail.com> wrote:
> >
> >> Hi Ovidiu,
> >> If you choose tf, the vectors are generated in
> >> outputfolder/vectors, and if you choose tfidf the vectors are generated
> >> in outputfolder/tfidf/vectors. I am in the process of changing the code
> >> to move the output of the map/reduce to a fixed destination, and that
> >> exception would have caused the folders not to move. That's the reason
> >> for the last error.
> >>
> >> For the first error, I am not sure what is happening. Could you do a
> >> clean compile of Mahout:
> >>
> >> mvn clean install -DskipTests=true
> >>
> >> and make sure you svn up the trunk before doing that.
> >>
> >> Then point your LDA to mahout_vectors/vectors.
> >>
> >> Robin
> >>
> >> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <zma...@gmail.com> wrote:
> >>
> >> > Hi again,
> >> >
> >> > Is there any workaround for my problem(s)? Or is there any other way
> >> > that would allow me to transform many, many small messages (they're
> >> > Tweets) into Mahout vectors, and then run LDA on them, without getting
> >> > these errors?
> >> > Converting them to txt files would be a bit of a pain because I would
> >> > get millions of very small files. And a Lucene index would be a bit of
> >> > an overkill, I think.
> >> >
> >> > Thanks,
> >> > Ovi
> >> >
> >> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <robin.a...@gmail.com> wrote:
> >> >
> >> > > Was meant for the dev list. I am looking into the first error.
> >> > >
> >> > > -bcc mahout-user
> >> > >
> >> > > ---------- Forwarded message ----------
> >> > > From: Robin Anil <robin.a...@gmail.com>
> >> > > Date: Fri, Feb 12, 2010 at 2:20 PM
> >> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> >> > > To: mahout-u...@lucene.apache.org
> >> > >
> >> > > Hi,
> >> > >
> >> > > This confusion arises from the fact that we use intermediate folders
> >> > > as subfolders under the output folder. How about we standardize on
> >> > > all the jobs taking input, intermediate, and output folders? If not
> >> > > now, then for the next release?
> >> > >
> >> > > Robin
> >> > >
> >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <zma...@gmail.com> wrote:
> >> > >
> >> > > > Hello Mahout developers / users,
> >> > > >
> >> > > > I am trying to convert a properly formatted SequenceFile to Mahout
> >> > > > vectors to run LDA on them. As reference I am using these two
> >> > > > documents:
> >> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> >> > > >
> >> > > > I got the Mahout code from SVN on February 11th, 2010.
> >> > > > Below I am listing the steps I took and the problems I
> >> > > > encountered:
> >> > > >
> >> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> >> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
> >> > > > org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2
> >> > > > --minDF 1 --maxDFPercent 50 --norm 2
> >> > > >
> >> > > > *Problem #1:* Got this error at the end, but I think everything
> >> > > > finished more or less correctly:
> >> > > >
> >> > > > Exception in thread "main" java.lang.NoSuchMethodError:
> >> > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> >> > > >     at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> >> > > >     at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> >> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> > > >     at java.lang.reflect.Method.invoke(Method.java:597)
> >> > > >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #2:* Exception in thread "main" java.io.FileNotFoundException:
> >> > > > File does not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Tried to fix:*
> >> > > >
> >> > > > ../../hadoop fs -mv
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Ran again:*
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #3:*
> >> > > >
> >> > > > Exception in thread "main" java.io.FileNotFoundException: File does
> >> > > > not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> >> > > >
> >> > > > [had...@some_server retweets]$ ../../hadoop fs -ls
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> >> > > > Found 3 items
> >> > > > -rw-r--r-- 3 hadoop supergroup 129721338 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> >> > > > -rw-r--r-- 3 hadoop supergroup 128256085 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> >> > > > -rw-r--r-- 3 hadoop supergroup 24160265 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> >> > > >
> >> > > > Also, as a *bonus problem*: if the input folder
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more
> >> > > > than one file (for example, if I run only the maps without a final
> >> > > > reducer), this whole chain doesn't work.
> >> > > >
> >> > > > Thanks,
> >> > > > Ovi
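For reference, the fix implied by the ClassCastException at LDAMapper.java:36 earlier in the thread is to declare the mapper's value type as VectorWritable and unwrap the Vector with get(), rather than receiving Vector directly. A minimal sketch of that shape, using simplified stand-in classes rather than the real Mahout/Hadoop types:

```java
// Simplified stand-ins -- NOT the real Mahout/Hadoop classes.
class Vector {
    double norm() { return 0.0; }   // placeholder operation on the vector
}

class VectorWritable {              // serialization wrapper around a Vector
    private final Vector vector = new Vector();
    Vector get() { return vector; } // unwrap the underlying Vector
}

public class LdaMapperFixSketch {
    // Buggy shape: map(Object key, Vector value) -- the framework actually
    // delivers a VectorWritable, so the implicit cast throws at runtime.
    // Fixed shape: declare the wrapper type and unwrap it explicitly.
    static void map(Object key, VectorWritable value) {
        Vector v = value.get();     // no cast needed, no ClassCastException
        System.out.println("mapped vector with norm " + v.norm());
    }

    public static void main(String[] args) {
        map("doc1", new VectorWritable());
    }
}
```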