Ovidiu seems to have found a blocker. I don't know how this is possible. When the Writable interface was removed from Vector, I don't see how it was still being accepted by the Mapper interface.
Aargh. In Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> the type parameters are not declared as extending Writable; that's why it was compiling. No wonder the tests didn't pick it up. We will have to sweep the code to check what else broke after we moved Writable out of Vector.

Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <zma...@gmail.com> wrote:

> I checked the code.
>
> Line 36 of LDAMapper.java references Vector:
>
> public class LDAMapper extends
> Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable>
> {
>
> Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> do a conversion to VectorWritable?
>
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
> On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <zma...@gmail.com> wrote:
>
> > Ok I did a clean checkout & install and ran everything again, then
> > pointed LDA to mahout_vectors/vectors.
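Robin's point can be illustrated in plain Java. The classes below are simplified stand-ins for illustration only, not the real Hadoop/Mahout types: because Mapper's type parameters carry no `extends Writable` bound, a value type that is not Writable compiles cleanly, and the mismatch only surfaces at runtime as the ClassCastException Ovidiu reports.

```java
// Simplified stand-ins -- NOT the real Hadoop/Mahout classes.
interface Writable {}

class Vector {}                                // no longer implements Writable

class VectorWritable implements Writable {     // serialization wrapper around a Vector
    private final Vector vector = new Vector();
    Vector get() { return vector; }
}

// Mirrors the shape of org.apache.hadoop.mapreduce.Mapper: VALUEIN is
// unbounded, so nothing forces it to be a Writable at compile time.
class Mapper<KEYIN, VALUEIN> {
    void map(KEYIN key, VALUEIN value) {}
}

public class UnboundedGenericsDemo {
    public static void main(String[] args) {
        // Compiles and runs even though Vector is not Writable:
        Mapper<String, Vector> mapper = new Mapper<>();
        mapper.map("doc1", new Vector());
        System.out.println("compiled with a non-Writable VALUEIN");

        // But the framework actually deserializes a VectorWritable from the
        // sequence file, so an unchecked cast to Vector fails at runtime:
        Object deserialized = new VectorWritable();
        try {
            Vector v = (Vector) deserialized;
            System.out.println("cast succeeded: " + v);
        } catch (ClassCastException e) {
            System.out.println("caught ClassCastException");
        }
    }
}
```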
> > Now I get this error:
> >
> > 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> > attempt_201001192218_0216_m_000005_0, Status : FAILED
> > java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> > cannot be cast to org.apache.mahout.math.Vector*
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > Ovi
> >
> > On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <robin.a...@gmail.com> wrote:
> >
> >> Hi Ovidiu,
> >> If you choose tf, the vectors are generated in
> >> outputfolder/vectors, and if you choose tfidf the vectors are generated
> >> in outputfolder/tfidf/vectors. I am in the process of changing the code
> >> to move the output of the map/reduce to a fixed destination, and that
> >> exception would have caused the folders not to move. That's the reason
> >> for the last error.
> >>
> >> For the first error, I am not sure what is happening. Could you do a
> >> clean compile of Mahout:
> >>
> >> mvn clean install -DskipTests=true
> >>
> >> and make sure you svn up the trunk before doing that.
> >>
> >> Then point your LDA to mahout_vectors/vectors.
> >>
> >> Robin
> >>
> >> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <zma...@gmail.com> wrote:
> >>
> >> > Hi again,
> >> >
> >> > Is there any workaround for my problem(s)? Or is there any other way
> >> > that would allow me to transform many, many small messages (they're
> >> > Tweets) into Mahout vectors, and then run LDA on them, without getting
> >> > these errors?
> >> > Converting them to txt files would be a bit of a pain because I would
> >> > get millions of very small files. And a Lucene index would be a bit of
> >> > an overkill, I think.
> >> >
> >> > Thanks,
> >> > Ovi
> >> >
> >> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <robin.a...@gmail.com> wrote:
> >> >
> >> > > Was meant for the dev list. I am looking into the first error.
> >> > >
> >> > > -bcc mahout-user
> >> > >
> >> > > ---------- Forwarded message ----------
> >> > > From: Robin Anil <robin.a...@gmail.com>
> >> > > Date: Fri, Feb 12, 2010 at 2:20 PM
> >> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> >> > > To: mahout-u...@lucene.apache.org
> >> > >
> >> > > Hi,
> >> > >
> >> > > This confusion arises from the fact that we use intermediate folders
> >> > > as subfolders under the output folder. How about we standardize on
> >> > > all the jobs taking input, intermediate, and output folders? If not
> >> > > now, then for the next release?
> >> > >
> >> > > Robin
> >> > >
> >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <zma...@gmail.com> wrote:
> >> > >
> >> > > > Hello Mahout developers / users,
> >> > > >
> >> > > > I am trying to convert a properly formatted SequenceFile to Mahout
> >> > > > vectors to run LDA on them. As reference I am using these two
> >> > > > documents:
> >> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> >> > > >
> >> > > > I got the Mahout code from SVN on February 11th, 2010.
> >> > > > Below I am listing the steps I took and the problems I
> >> > > > encountered:
> >> > > >
> >> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> >> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
> >> > > > org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2
> >> > > > --minDF 1 --maxDFPercent 50 --norm 2
> >> > > >
> >> > > > *Problem #1:* Got this error at the end, but I think everything
> >> > > > finished more or less correctly:
> >> > > >
> >> > > > Exception in thread "main" java.lang.NoSuchMethodError:
> >> > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> >> > > >     at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> >> > > >     at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> >> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> > > >     at java.lang.reflect.Method.invoke(Method.java:597)
> >> > > >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #2:* Exception in thread "main" java.io.FileNotFoundException:
> >> > > > File does not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Tried to fix:*
> >> > > >
> >> > > > ../../hadoop fs -mv
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > > >
> >> > > > *Ran again:*
> >> > > >
> >> > > > $HADOOP_HOME/bin/hadoop jar
> >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
> >> > > > org.apache.mahout.clustering.lda.LDADriver -i
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
> >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
> >> > > > --numReducers 33
> >> > > >
> >> > > > *Problem #3:*
> >> > > >
> >> > > > Exception in thread "main" java.io.FileNotFoundException: File does
> >> > > > not exist:
> >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> >> > > >
> >> > > > [had...@some_server retweets]$ ../../hadoop fs -ls
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> >> > > > Found 3 items
> >> > > > -rw-r--r-- 3 hadoop supergroup 129721338 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> >> > > > -rw-r--r-- 3 hadoop supergroup 128256085 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> >> > > > -rw-r--r-- 3 hadoop supergroup 24160265 2010-02-11 23:54
> >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> >> > > >
> >> > > > Also, as a *bonus problem*: if the input folder
> >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more
> >> > > > than one file (for example, if I run only the maps without a final
> >> > > > reducer), this whole chain doesn't work.
> >> > > >
> >> > > > Thanks,
> >> > > > Ovi
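For reference, the fix implied by the ClassCastException at LDAMapper.java:36 earlier in the thread is to declare the mapper's value type as VectorWritable and unwrap the Vector with get(), rather than receiving Vector directly. A minimal sketch of that shape, using simplified stand-in classes rather than the real Mahout/Hadoop types:

```java
// Simplified stand-ins -- NOT the real Mahout/Hadoop classes.
class Vector {
    double norm() { return 0.0; }   // placeholder operation on the vector
}

class VectorWritable {              // serialization wrapper around a Vector
    private final Vector vector = new Vector();
    Vector get() { return vector; } // unwrap the underlying Vector
}

public class LdaMapperFixSketch {
    // Buggy shape: map(Object key, Vector value) -- the framework actually
    // delivers a VectorWritable, so the implicit cast throws at runtime.
    // Fixed shape: declare the wrapper type and unwrap it explicitly.
    static void map(Object key, VectorWritable value) {
        Vector v = value.get();     // no cast needed, no ClassCastException
        System.out.println("mapped vector with norm " + v.norm());
    }

    public static void main(String[] args) {
        map("doc1", new VectorWritable());
    }
}
```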