Robin and I are trying out a fix. I already ran into this in hooking the Vectorizer up to my SVD code.
-jake

On Fri, Feb 12, 2010 at 1:08 PM, Ovidiu Dan <[email protected]> wrote:

> Well, I tried it. The previous problem is fixed, but now I have a new and
> shiny one :(
>
> Meet line 95 of LDAInference.java:
> DenseMatrix phi = new DenseMatrix(state.numTopics, docLength);
>
> docLength is calculated above on line 88:
> int docLength = wordCounts.size();
>
> My problem is that docLength is always 2147483647. Since DenseMatrix
> allocates an array based (also) on this value, I get multiple "Requested
> array size exceeds VM limit" messages (an array with 2147483647 columns
> would be quite large).
>
> I added a trivial toString function that displays vector.size()
> in org/apache/mahout/math/VectorWritable.java, recompiled the project, then
> ran:
>
> ./hadoop fs -text /user/MY_USERNAME/projects/lda/mahout_vectors/vectors/part-00000
>
> All lines were DOCUMENT_ID (tab) 2147483647. So all vectors report
> size 2147483647.
>
> I checked the output for vector.zSum() as well; that one looks fine.
>
> I can confirm that my input SequenceFile is correct. It has the following
> format:
> - key: Text with unique id of document
> - value: Text with the contents of the document
>
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp
>
>
> On Fri, Feb 12, 2010 at 3:30 PM, Ovidiu Dan <[email protected]> wrote:
>
> > Thanks, I also patched it myself but now I have some other problems. I'll
> > run it and let you know how it goes.
> >
> > Ovi
> >
> >
> > On Fri, Feb 12, 2010 at 3:25 PM, Robin Anil <[email protected]> wrote:
> >
> >> I fixed the bug here https://issues.apache.org/jira/browse/MAHOUT-289
> >>
> >>
> >> Try running now.
> >>
> >> Robin
> >>
> >> On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:
> >>
> >> > I checked the code.
> >> >
> >> > Line 36 of LDAMapper.java references Vector:
> >> >
> >> > public class LDAMapper extends
> >> > Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable>
> >> > {
> >> >
> >> > Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> >> > do a conversion to VectorWritable?
> >> >
> >> > Ovi
> >> >
> >> >
> >> > On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:
> >> >
> >> > > Ok I did a clean checkout & install and ran everything again, then pointed
> >> > > LDA to mahout_vectors/vectors.
> >> > >
> >> > > Now I get this error:
> >> > >
> >> > > 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> >> > > attempt_201001192218_0216_m_000005_0, Status : FAILED
> >> > > java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> >> > > cannot be cast to org.apache.mahout.math.Vector*
> >> > > at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
> >> > > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >> > > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> >> > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >> > > at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >> > >
> >> > > Ovi
> >> > >
> >> > >
> >> > > On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:
> >> > >
> >> > >> Hi Ovidiu,
> >> > >> If you choose tf, the vectors are generated in
> >> > >> outputfolder/vectors and if you choose tfidf the vectors are generated in
> >> > >> outputfolder/tfidf/vectors. I am in the process of changing the code to move
> >> > >> the output of the map/reduce to a fixed destination and that exception would
> >> > >> have caused the folders not to move.
> >> > >> That's the reason for the last error.
> >> > >>
> >> > >> For the first error, I am not sure what is happening.
> >> > >> Could you do a clean compile of mahout
> >> > >>
> >> > >> mvn clean install -DskipTests=true and make sure you svn up the trunk before
> >> > >> doing that
> >> > >>
> >> > >> Then point your LDA to mahout_vectors/vectors
> >> > >>
> >> > >> Robin
> >> > >>
> >> > >>
> >> > >> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:
> >> > >>
> >> > >> > Hi again,
> >> > >> >
> >> > >> > Is there any workaround for my problem(s)? Or is there any other way that
> >> > >> > would allow me to transform many many small messages (they're Tweets) into
> >> > >> > Mahout vectors, and then run LDA on them, without getting these errors?
> >> > >> > Converting them to txt files would be a bit of a pain because I would get
> >> > >> > millions of very small files. And a Lucene index would be a bit overkill I
> >> > >> > think.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Ovi
> >> > >> >
> >> > >> >
> >> > >> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:
> >> > >> >
> >> > >> > > Was meant for the dev list.
> >> > >> > > I am looking into the first error
> >> > >> > >
> >> > >> > > -bcc mahout-user
> >> > >> > >
> >> > >> > >
> >> > >> > > ---------- Forwarded message ----------
> >> > >> > > From: Robin Anil <[email protected]>
> >> > >> > > Date: Fri, Feb 12, 2010 at 2:20 PM
> >> > >> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> >> > >> > > To: [email protected]
> >> > >> > >
> >> > >> > >
> >> > >> > > Hi,
> >> > >> > >
> >> > >> > > This confusion arises from the fact that we use intermediate folders
> >> > >> > > as subfolders under the output folder. How about we standardize on all the jobs
> >> > >> > > taking input, intermediate and output folders? If not for this release, then for
> >> > >> > > the next one?
> >> > >> > >
> >> > >> > > Robin
> >> > >> > >
> >> > >> > >
> >> > >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:
> >> > >> > >
> >> > >> > > > Hello Mahout developers / users,
> >> > >> > > >
> >> > >> > > > I am trying to convert a properly formatted SequenceFile to Mahout vectors
> >> > >> > > > to run LDA on them. As reference I am using these two documents:
> >> > >> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> >> > >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> >> > >> > > >
> >> > >> > > > I got the Mahout code from SVN on February 11th 2010.
> >> > >> > > > Below I am listing the steps I took and the problems I encountered:
> >> > >> > > >
> >> > >> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> >> > >> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> >> > >> > > >
> >> > >> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2
> >> > >> > > >
> >> > >> > > > *Problem #1:* Got this error at the end, but I think everything finished
> >> > >> > > > more or less correctly:
> >> > >> > > > Exception in thread "main" java.lang.NoSuchMethodError:
> >> > >> > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> >> > >> > > > at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> >> > >> > > > at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> >> > >> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > >> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> > >> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> > >> > > > at java.lang.reflect.Method.invoke(Method.java:597)
> >> > >> > > > at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >> > >> > > >
> >> > >> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> >> > >> > > >
> >> > >> > > > *Problem #2:* Exception in thread "main" java.io.FileNotFoundException: File
> >> > >> > > > does not exist:
> >> > >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > >> > > >
> >> > >> > > > *Tried to fix:*
> >> > >> > > >
> >> > >> > > > ../../hadoop fs -mv /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >> > >> > > >
> >> > >> > > > *Ran again:*
> >> > >> > > >
> >> > >> > > > $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> >> > >> > > >
> >> > >> > > > *Problem #3:*
> >> > >> > > >
> >> > >> > > > Exception in thread "main" java.io.FileNotFoundException: File does not
> >> > >> > > > exist:
> >> > >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> >> > >> > > >
> >> > >> > > > [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> >> > >> > > > Found 3 items
> >> > >> > > > -rw-r--r-- 3 hadoop supergroup 129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> >> > >> > > > -rw-r--r-- 3 hadoop supergroup 128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> >> > >> > > > -rw-r--r-- 3 hadoop supergroup 24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> >> > >> > > >
> >> > >> > > > Also, as a *bonus problem*, if the input
> >> > >> > > > folder /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more
> >> > >> > > > than one file (for example if I run only the maps without a final reducer),
> >> > >> > > > this whole chain doesn't work.
> >> > >> > > >
> >> > >> > > > Thanks,
> >> > >> > > > Ovi
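
A note on the docLength symptom reported in this thread: 2147483647 is Integer.MAX_VALUE, and zSum() reportedly still looks fine. One plausible reading (the thread does not confirm it) is that size() returns the vector's declared dimension rather than the number of entries actually stored. The self-contained Java sketch below only illustrates that distinction. TinySparseVector and numNonZero() are hypothetical names made up for this example; they are not Mahout's API, and this is not necessarily the fix being tested.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseSizeSketch {

    /** Hypothetical stand-in for a sparse vector; NOT Mahout's class. */
    static final class TinySparseVector {
        private final int cardinality;                       // declared dimension
        private final Map<Integer, Double> entries = new HashMap<>();

        TinySparseVector(int cardinality) {
            this.cardinality = cardinality;
        }

        /** Stores a value; zeros are not kept, so the map holds only non-zero entries. */
        void set(int index, double value) {
            if (value == 0.0) {
                entries.remove(index);
            } else {
                entries.put(index, value);
            }
        }

        /** Declared dimension, independent of how many entries are stored. */
        int size() {
            return cardinality;
        }

        /** Number of entries actually stored (hypothetical helper). */
        int numNonZero() {
            return entries.size();
        }

        /** Sum of all stored values. */
        double zSum() {
            double sum = 0.0;
            for (double v : entries.values()) {
                sum += v;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        // Assumption: the vectorizer declares an "unbounded" dimension.
        TinySparseVector wordCounts = new TinySparseVector(Integer.MAX_VALUE);
        wordCounts.set(17, 2.0);
        wordCounts.set(90210, 1.0);
        wordCounts.set(1234567, 3.0);

        System.out.println("size()       = " + wordCounts.size());
        System.out.println("numNonZero() = " + wordCounts.numNonZero());
        System.out.println("zSum()       = " + wordCounts.zSum());

        // Sizing a dense per-document matrix by size() would ask for
        // numTopics * Integer.MAX_VALUE cells; sizing it by the number of
        // stored entries keeps the allocation small.
        int numTopics = 20;
        double[][] phi = new double[numTopics][wordCounts.numNonZero()];
        System.out.println("phi is " + phi.length + " x " + phi[0].length);
    }
}
```

In this sketch size() reports 2147483647 while numNonZero() reports 3, which would match both observations: a huge reported size and a sensible zSum().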
