Well, I tried it. The previous problem is fixed, but now I have a new and
shiny one :(

Meet line 95 of LDAInference.java:
DenseMatrix phi = new DenseMatrix(state.numTopics, docLength);

docLength is calculated above on line 88:
int docLength = wordCounts.size();

My problem is that docLength is always 2147483647 (Integer.MAX_VALUE). Since
DenseMatrix allocates an array whose size also depends on this value, I get
multiple "Requested array size exceeds VM limit" messages (an array with
2147483647 columns would be quite large).
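
To put the scale in perspective, here is a tiny stand-alone snippet (just an
illustration with a plain array; I'm assuming DenseMatrix has to make an
equivalent per-row allocation):

public class AllocRepro {
  public static void main(String[] args) {
    int numTopics = 20;                    // -k 20 from my LDA run
    int docLength = Integer.MAX_VALUE;     // 2147483647, the value I keep seeing
    // A single row already needs 2147483647 * 8 bytes (about 17 GB), so the
    // JVM bails out with "Requested array size exceeds VM limit".
    double[][] phi = new double[numTopics][docLength];
    System.out.println(phi.length);
  }
}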

I added a trivial toString() method that prints vector.size()
to org/apache/mahout/math/VectorWritable.java, recompiled the project, and then
ran:

./hadoop fs -text
/user/MY_USERNAME/projects/lda/mahout_vectors/vectors/part-00000

All lines were DOCUMENT_ID (tab) 2147483647, so every vector reports a size of
2147483647.
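
For reference, the debugging method I added is roughly this (a from-memory
sketch; I'm assuming the wrapped vector is reachable through
VectorWritable.get(), the exact accessor may differ in trunk):

// inside org.apache.mahout.math.VectorWritable
@Override
public String toString() {
  Vector v = get();                       // the deserialized vector
  return v == null ? "null" : String.valueOf(v.size());
}

As far as I can tell, fs -text just calls toString() on each key/value it
reads, which is how the sizes above get printed.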

I checked the output for vector.zSum() as well; that one looks fine.

I can confirm that my input SequenceFile is correct; it has the following format:
- key: Text with unique id of document
- value: Text with the contents of the document
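
For completeness, the file is written along these lines (a simplified,
hypothetical writer; my real ingestion code differs, but the key/value types
are the same):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteTweetSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // hypothetical path; the real data lives under twitter_sequence_files/
    Path out = new Path("/user/MY_USERNAME/projects/lda/twitter_sequence_files/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // key = unique document id, value = raw text of the tweet
      writer.append(new Text("doc-0000001"), new Text("contents of the tweet ..."));
    } finally {
      writer.close();
    }
  }
}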

Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp


On Fri, Feb 12, 2010 at 3:30 PM, Ovidiu Dan <[email protected]> wrote:

>
> Thanks, I also patched it myself but now I have some other problems. I'll
> run it and let you know how it goes.
>
> Ovi
>
>
> On Fri, Feb 12, 2010 at 3:25 PM, Robin Anil <[email protected]> wrote:
>
>> I fixed the bug here https://issues.apache.org/jira/browse/MAHOUT-289
>>
>>
>> Try running now.
>>
>> Robin
>>
>> On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:
>>
>> > I checked the code.
>> >
>> > Line 36 of LDAMapper.java references Vector:
>> >
>> > public class LDAMapper extends
>> >    Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable> {
>> >
>> > Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
>> > do a conversion to VectorWritable?
>> >
>> > Ovi
>> >
>> >
>> > On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:
>> >
>> > >
>> > > Ok I did a clean checkout & install and ran everything again, then
>> > pointed
>> > > LDA to mahout_vectors/vectors.
>> > >
>> > > Now I get this error:
>> > >
>> > > 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
>> > > attempt_201001192218_0216_m_000005_0, Status : FAILED
>> > > java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
>> > > cannot be cast to org.apache.mahout.math.Vector*
>> > >  at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
>> > > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> > >  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>> > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>> > >  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> > >
>> > > Ovi
>> > >
>> > >
>> > > On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]>
>> > wrote:
>> > >
>> > >> Hi Ovidiu,
>> > >>            If you choose tf, the vectors are generated in
>> > >> outputfolder/vectors, and if you choose tfidf the vectors are generated
>> > >> in outputfolder/tfidf/vectors. I am in the process of changing the code
>> > >> to move the output of the map/reduce to a fixed destination, and that
>> > >> exception would have caused the folders not to move. That's the reason
>> > >> for the last error.
>> > >>
>> > >> For the first error, I am not sure what is happening. Could you do a
>> > >> clean compile of mahout:
>> > >>
>> > >> mvn clean install -DskipTests=true
>> > >>
>> > >> and make sure you svn up the trunk before doing that.
>> > >>
>> > >> Then point your LDA to mahout_vectors/vectors
>> > >>
>> > >> Robin
>> > >>
>> > >>
>> > >> On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]>
>> wrote:
>> > >>
>> > >> > Hi again,
>> > >> >
>> > >> > Is there any workaround for my problem(s)? Or is there any other way
>> > >> > that would allow me to transform many, many small messages (they're
>> > >> > Tweets) into Mahout vectors, and then run LDA on them, without getting
>> > >> > these errors? Converting them to txt files would be a bit of a pain
>> > >> > because I would get millions of very small files. And a Lucene index
>> > >> > would be a bit overkill, I think.
>> > >> >
>> > >> > Thanks,
>> > >> > Ovi
>> > >> >
>> > >> >
>> > >> > On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]>
>> > >> wrote:
>> > >> >
>> > >> > > Was meant for the dev list. I am looking into the first error
>> > >> > >
>> > >> > > -bcc mahout-user
>> > >> > >
>> > >> > >
>> > >> > > ---------- Forwarded message ----------
>> > >> > > From: Robin Anil <[email protected]>
>> > >> > > Date: Fri, Feb 12, 2010 at 2:20 PM
>> > >> > > Subject: Re: Problem converting SequenceFile to vectors, then running LDA
>> > >> > > To: [email protected]
>> > >> > >
>> > >> > >
>> > >> > > Hi,
>> > >> > >
>> > >> > >      This confusion arises from the fact that we use intermediate
>> > >> > > folders as subfolders under the output folder. How about we
>> > >> > > standardize on all the jobs taking input, intermediate, and output
>> > >> > > folders? If not this, then for the next release?
>> > >> > >
>> > >> > > Robin
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]>
>> > >> wrote:
>> > >> > >
>> > >> > > > Hello Mahout developers / users,
>> > >> > > >
>> > >> > > > I am trying to convert a properly formatted SequenceFile to Mahout
>> > >> > > > vectors to run LDA on them. As reference I am using these two
>> > >> > > > documents:
>> > >> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>> > >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>> > >> > > >
>> > >> > > > I got the Mahout code from SVN on February 11th, 2010. Below I am
>> > >> > > > listing the steps I took and the problems I encountered:
>> > >> > > >
>> > >> > > > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
>> > >> > > > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
>> > >> > > >
>> > >> > > > $HADOOP_HOME/bin/hadoop jar
>> > >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
>> > >> > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles -i
>> > >> > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ -o
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -wt tf -chunk 300 -a
>> > >> > > > org.apache.lucene.analysis.standard.StandardAnalyzer --minSupport 2
>> > >> > > > --minDF 1 --maxDFPercent 50 --norm 2
>> > >> > > >
>> > >> > > > *Problem #1: *Got this error at the end, but I think everything
>> > >> > > > finished more or less correctly:
>> > >> > > >
>> > >> > > > Exception in thread "main" java.lang.NoSuchMethodError:
>> > >> > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
>> > >> > > > at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
>> > >> > > > at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
>> > >> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > >> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > >> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > >> > > > at java.lang.reflect.Method.invoke(Method.java:597)
>> > >> > > > at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>> > >> > > >
>> > >> > > > $HADOOP_HOME/bin/hadoop jar
>> > >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
>> > >> > > > org.apache.mahout.clustering.lda.LDADriver -i
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
>> > >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
>> > >> > > > --numReducers 33
>> > >> > > >
>> > >> > > > *Problem #2: *Exception in thread "main" java.io.FileNotFoundException:
>> > >> > > > File does not exist:
>> > >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>> > >> > > >
>> > >> > > > *Tried to fix:*
>> > >> > > >
>> > >> > > > ../../hadoop fs -mv
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>> > >> > > >
>> > >> > > > *Ran again:*
>> > >> > > >
>> > >> > > > $HADOOP_HOME/bin/hadoop jar
>> > >> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job
>> > >> > > > org.apache.mahout.clustering.lda.LDADriver -i
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ -o
>> > >> > > > /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000
>> > >> > > > --numReducers 33
>> > >> > > >
>> > >> > > > *Problem #3:*
>> > >> > > >
>> > >> > > > Exception in thread "main" java.io.FileNotFoundException: File does
>> > >> > > > not exist:
>> > >> > > > hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
>> > >> > > >
>> > >> > > > [had...@some_server retweets]$ ../../hadoop fs -ls
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
>> > >> > > > Found 3 items
>> > >> > > > -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
>> > >> > > > -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
>> > >> > > > -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54
>> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
>> > >> > > >
>> > >> > > > Also, as a *bonus problem*, if the input
>> > >> > > > folder /user/MY_USERNAME/projects/lda/twitter_sequence_files contains
>> > >> > > > more than one file (for example if I run only the maps without a
>> > >> > > > final reducer), this whole chain doesn't work.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Ovi
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>
