On Fri, Feb 12, 2010 at 2:00 PM, Ovidiu Dan <[email protected]> wrote:
> Hi, thanks for the fix. I guess it's one step closer to a workable
> solution. I am now getting this error:
>
> 10/02/12 16:59:25 INFO mapred.JobClient: Task Id :
> attempt_201001192218_0234_m_000004_1, Status : FAILED
> java.lang.ArrayIndexOutOfBoundsException: 104017
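[The exception above is an out-of-bounds read into LDA's topic-word matrix: the corpus contains a word id (104017) at least as large as the vocabulary bound the job was given. A minimal pure-Java sketch of the failure mode — the class and field names here are hypothetical stand-ins, not Mahout's actual code:]

```java
// Hypothetical stand-in for an LDA topic-word matrix; not Mahout code.
// The matrix is allocated with numWords columns, so any word id >= numWords
// read during inference throws ArrayIndexOutOfBoundsException.
class TopicWordMatrixSketch {
    final double[][] logProbWordGivenTopic; // numTopics x numWords

    TopicWordMatrixSketch(int numTopics, int numWords) {
        logProbWordGivenTopic = new double[numTopics][numWords];
    }

    double logProb(int topic, int word) {
        // Blows up exactly like the trace above when word >= numWords.
        return logProbWordGivenTopic[topic][word];
    }

    boolean wordInBounds(int word) {
        return word >= 0 && word < logProbWordGivenTopic[0].length;
    }
}
```

[With --numWords 100000 as used later in this thread, wordInBounds(104017) is false, which matches the offending index; raising numWords past the true vocabulary size makes it fit.]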
You probably have more words than you allotted at the beginning. I know
it's less than ideal, but at the moment you need to specify an upper
bound on the number of words: that's the numWords parameter. Try upping
it by a factor of two or so.

  -- David

>         at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:75)
>         at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:40)
>         at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:204)
>         at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:117)
>         at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
>         at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:37)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Ovi
>
> ---
> Ovidiu Dan - http://www.ovidiudan.com/
>
> Please do not print this e-mail unless it is mandatory
>
> My public key can be downloaded from subkeys.pgp.net, or
> http://www.ovidiudan.com/public.pgp

On Fri, Feb 12, 2010 at 4:29 PM, Ovidiu Dan <[email protected]> wrote:

Will do in a bit, thanks!

On Fri, Feb 12, 2010 at 4:27 PM, Robin Anil <[email protected]> wrote:

Ovidiu, we just committed the fix. Just recreate the vectors using the
-seq option added to it.

Remember to svn up and recompile.

Robin

On Sat, Feb 13, 2010 at 2:39 AM, Jake Mannix <[email protected]> wrote:

Robin and I are trying out a fix. I already ran into this when hooking
the Vectorizer up to my SVD code.

-jake

On Fri, Feb 12, 2010 at 1:08 PM, Ovidiu Dan <[email protected]> wrote:

Well, I tried it. The previous problem is fixed, but now I have a new
and shiny one :(

Meet line 95 of LDAInference.java:

    DenseMatrix phi = new DenseMatrix(state.numTopics, docLength);

docLength is calculated above, on line 88:

    int docLength = wordCounts.size();

My problem is that docLength is always 2147483647. Since DenseMatrix
allocates an array based (also) on this value, I get multiple "Requested
array size exceeds VM limit" messages (an array with 2147483647 columns
would be quite large).

I added a trivial toString function that displays vector.size() in
org/apache/mahout/math/VectorWritable.java, recompiled the project, then
ran:

    ./hadoop fs -text /user/MY_USERNAME/projects/lda/mahout_vectors/vectors/part-00000

All lines were DOCUMENT_ID (tab) 2147483647, so all vectors report size
2147483647.

I checked the output of vector.zSum() as well; that one looks fine.

I can confirm that my input SequenceFile is correct; it has the
following format:
- key: Text with the unique id of the document
- value: Text with the contents of the document

Ovi

On Fri, Feb 12, 2010 at 3:30 PM, Ovidiu Dan <[email protected]> wrote:

Thanks, I also patched it myself, but now I have some other problems.
I'll run it and let you know how it goes.
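[The docLength symptom above comes down to the difference between a sparse vector's declared cardinality and its number of stored entries. This toy sketch — not Mahout's Vector API; all names here are hypothetical — shows why calling size() on a vector of cardinality Integer.MAX_VALUE is the wrong quantity for a per-document word count:]

```java
import java.util.HashMap;
import java.util.Map;

// Toy sparse vector illustrating the distinction at play (not Mahout code):
// size() reports the declared dimensionality -- here Integer.MAX_VALUE --
// while the count of stored (non-zero) entries is what a per-document
// docLength actually needs.
class SparseVectorSketch {
    final int cardinality;                        // declared dimensionality
    final Map<Integer, Double> entries = new HashMap<>();

    SparseVectorSketch(int cardinality) { this.cardinality = cardinality; }

    void set(int index, double value) { entries.put(index, value); }

    int size() { return cardinality; }            // always 2147483647 here
    int getNumNondefaultElements() { return entries.size(); } // distinct words
}
```

[A document with two distinct words reports size() of Integer.MAX_VALUE but only two stored entries; sizing a dense matrix by the former is what triggers the "Requested array size exceeds VM limit" messages.]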
Ovi

On Fri, Feb 12, 2010 at 3:25 PM, Robin Anil <[email protected]> wrote:

I fixed the bug here: https://issues.apache.org/jira/browse/MAHOUT-289

Try running now.

Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:

I checked the code.

Line 36 of LDAMapper.java references Vector:

    public class LDAMapper extends
        Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable> {

Aren't all elements in Mapper<...> supposed to be Writable? Do you need
to do a conversion to VectorWritable?

Ovi

On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:

OK, I did a clean checkout & install and ran everything again, then
pointed LDA to mahout_vectors/vectors.

Now I get this error:

    10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
    attempt_201001192218_0216_m_000005_0, Status : FAILED
    java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
    cannot be cast to org.apache.mahout.math.Vector*
        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Ovi

On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:

Hi Ovidiu,

If you choose tf, the vectors are generated in outputfolder/vectors, and
if you choose tfidf the vectors are generated in
outputfolder/tfidf/vectors. I am in the process of changing the code to
move the output of the map/reduce to a fixed destination, and that
exception would have caused the folders not to move. That's the reason
for the last error.

For the first error, I am not sure what is happening.
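[For the ClassCastException above, the usual shape of the fix is to declare the mapper's value type as the Writable wrapper and unwrap it inside map(). A pure-Java sketch of the wrapper pattern — simplified stand-ins, not Hadoop's or Mahout's actual classes:]

```java
// Simplified stand-ins, not Hadoop/Mahout classes: the framework hands the
// mapper the serializable wrapper, so the mapper's value type must be the
// wrapper, and the wrapped vector is retrieved with get() instead of a cast.
class VectorLike {
    final double[] values;
    VectorLike(double... values) { this.values = values; }
}

class VectorWritableLike {
    private final VectorLike vector;
    VectorWritableLike(VectorLike vector) { this.vector = vector; }
    VectorLike get() { return vector; }
}

class LdaMapperSketch {
    // Declared over the wrapper type, mirroring a signature like
    // Mapper<WritableComparable<?>, VectorWritable, ...> rather than Vector.
    double map(VectorWritableLike value) {
        VectorLike wordCounts = value.get(); // unwrap; no ClassCastException
        double sum = 0.0;
        for (double v : wordCounts.values) sum += v;
        return sum;
    }
}
```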
Could you do a clean compile of Mahout:

    mvn clean install -DskipTests=true

and make sure you svn up the trunk before doing that. Then point your
LDA to mahout_vectors/vectors.

Robin

On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:

Hi again,

Is there any workaround for my problem(s)? Or is there any other way
that would allow me to transform many, many small messages (they're
Tweets) into Mahout vectors, and then run LDA on them, without getting
these errors? Converting them to txt files would be a bit of a pain
because I would get millions of very small files. And a Lucene index
would be a bit overkill, I think.

Thanks,
Ovi

On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:

Was meant for the dev list. I am looking into the first error.

-bcc mahout-user

---------- Forwarded message ----------
From: Robin Anil <[email protected]>
Date: Fri, Feb 12, 2010 at 2:20 PM
Subject: Re: Problem converting SequenceFile to vectors, then running LDA
To: [email protected]

Hi,

This confusion arises from the fact that we use intermediate folders as
subfolders under the output folder. How about we standardize on all the
jobs taking input, intermediate, and output folders? If not this, then
for the next release?

Robin

On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:

Hello Mahout developers / users,

I am trying to convert a properly formatted SequenceFile to Mahout
vectors to run LDA on them. As reference I am using these two documents:

http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I got the Mahout code from SVN on February 11th, 2010.
Below I am listing the steps I took and the problems I encountered:

    export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
    export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/

    $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
        org.apache.mahout.text.SparseVectorsFromSequenceFiles \
        -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ \
        -o /user/MY_USERNAME/projects/lda/mahout_vectors/ \
        -wt tf -chunk 300 \
        -a org.apache.lucene.analysis.standard.StandardAnalyzer \
        --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2

*Problem #1:* Got this error at the end, but I think everything finished
more or less correctly:

    Exception in thread "main" java.lang.NoSuchMethodError:
    org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
        at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
        at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
        org.apache.mahout.clustering.lda.LDADriver \
        -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
        -o /user/MY_USERNAME/projects/lda/lda_out/ \
        -k 20 --numWords 100000 --numReducers 33

*Problem #2:*

    Exception in thread "main" java.io.FileNotFoundException: File does not exist:
    hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Tried to fix:*

    ../../hadoop fs -mv \
        /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 \
        /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Ran again:*

    $HADOOP_HOME/bin/hadoop jar \
        $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
        org.apache.mahout.clustering.lda.LDADriver \
        -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
        -o /user/MY_USERNAME/projects/lda/lda_out/ \
        -k 20 --numWords 100000 --numReducers 33

*Problem #3:*

    Exception in thread "main" java.io.FileNotFoundException: File does not exist:
    hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data

    [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
    Found 3 items
    -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
    -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
    -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002

Also, as a *bonus problem*: if the input folder
/user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than
one file (for example, if I run only the maps without a final reducer),
this whole chain doesn't work.

Thanks,
Ovi
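[On the workaround question earlier in the thread — millions of tiny tweet files versus one container — the packing step can be sketched in plain Java, writing every (document id, text) pair as a record of a single file. Against a real cluster the same loop would instead target Hadoop's SequenceFile.Writer with Text keys and values, which is the format the vectorizer reads; everything below is a local, hypothetical sketch, not Mahout code.]

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Local stand-in for packing many small documents into one keyed container:
// one tab-separated record per (id, text) pair, mirroring the SequenceFile
// layout the thread describes (Text id key, Text contents value).
class TweetPacker {
    static Path pack(Map<String, String> docs, Path out) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            for (Map.Entry<String, String> e : docs.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        return out;
    }
}
```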
