I think it ran fine this time, but I would now like to use *LDAPrintTopics.java* to print the words for each of the k clusters.

I have a few issues with it:

- The --dict option only accepts a dictionary file that is present on the local file system, not on HDFS. That's odd, but ok.
- I not only have to move the dictionary.file-0 file from HDFS to the local filesystem, I also have to convert it to text (./hadoop fs -text /user/SOME_USERNAME/projects/lda/mahout_vectors/dictionary.file-0 > dictionary.txt). This is fine as well.
- If I then run the command, it fails with: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2. I checked the code and it expects a tab-separated file with three columns, not two (hence the out-of-bounds exception at index 2, which is the third column). The file I currently have is in the format WORD (tab) WORD_ID.
- I changed line 154 of LDAPrintTopics.java to int index = Integer.parseInt(parts[1]); and also commented out lines 148 and 149. A sketch of what the parsing amounts to now is below.
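Roughly, the change reads the dictionary as a two-column id map instead of the three-column format the code expects. A standalone sketch, not the exact LDAPrintTopics code (the surrounding option handling and topic printing are omitted):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionarySketch {
      // Reads a WORD<tab>WORD_ID dictionary into an id -> word map.
      public static Map<Integer, String> readDictionary(String path) throws IOException {
        Map<Integer, String> dict = new HashMap<Integer, String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts.length < 2) {
              continue; // skip malformed or header lines
            }
            // Two columns: parts[0] is the word, parts[1] is the word id.
            // (The unmodified code reads the id from parts[2], hence the exception.)
            int index = Integer.parseInt(parts[1]);
            dict.put(index, parts[0]);
          }
        } finally {
          in.close();
        }
        return dict;
      }
    }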
Still not done, but I will continue in the morning.

Thanks,
Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp


On Fri, Feb 12, 2010 at 5:21 PM, Robin Anil <[email protected]> wrote:

> try increasing the minSupport. You are already removing words that occur in
> more than 50% of the documents, so I see no trouble there.
>
> Robin

On Sat, Feb 13, 2010 at 3:49 AM, Ovidiu Dan <[email protected]> wrote:

> Since my texts are not exactly correct English, that might be the case.
> I'll try it in a sec, thanks.
>
> Ovi

On Fri, Feb 12, 2010 at 5:16 PM, Robin Anil <[email protected]> wrote:

> There are 21K unigrams in Reuters (from the df count job counters). I am
> running LDA fine with 100K as numWords.
>
> Robin

On Sat, Feb 13, 2010 at 3:42 AM, David Hall <[email protected]> wrote:

> > On Fri, Feb 12, 2010 at 2:00 PM, Ovidiu Dan <[email protected]> wrote:
> >
> > Hi, thanks for the fix. I guess it's one step closer to a workable
> > solution. I am now getting this error:
> >
> > 10/02/12 16:59:25 INFO mapred.JobClient: Task Id :
> > attempt_201001192218_0234_m_000004_1, Status : FAILED
> > java.lang.ArrayIndexOutOfBoundsException: 104017
> >     at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:75)
> >     at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:40)
> >     at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:204)
> >     at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:117)
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
> >     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:37)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > Ovi
>
> You probably have more words than you allotted at the beginning.
>
> I know it's less than ideal, but at the moment you need to specify an upper
> bound on the number of words. That's the numWords parameter.
>
> Try upping it by a factor of two or so.
>
> -- David
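For concreteness: the LDADriver run further down used --numWords 100000 and the failing index above is 104017, so a retry along David's suggestion would look something like the following, with the same paths and the bound doubled; the exact value just has to comfortably exceed the vocabulary size.

    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 200000 --numReducers 33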
On Fri, Feb 12, 2010 at 4:29 PM, Ovidiu Dan <[email protected]> wrote:

> Will do in a bit, thanks!
>
> Ovi

On Fri, Feb 12, 2010 at 4:27 PM, Robin Anil <[email protected]> wrote:

> Ovidiu, we just committed the fix. Just recreate the vectors using the -seq
> option added to it.
>
> Remember to svn up and recompile.
>
> Robin

On Sat, Feb 13, 2010 at 2:39 AM, Jake Mannix <[email protected]> wrote:

> Robin and I are trying out a fix. I already ran into this in hooking
> the Vectorizer up to my SVD code.
>
> -jake

On Fri, Feb 12, 2010 at 1:08 PM, Ovidiu Dan <[email protected]> wrote:

> Well, I tried it. The previous problem is fixed, but now I have a new and
> shiny one :(
>
> Meet line 95 of LDAInference.java:
> DenseMatrix phi = new DenseMatrix(state.numTopics, docLength);
>
> docLength is calculated above on line 88:
> int docLength = wordCounts.size();
>
> My problem is that docLength is always 2147483647. Since DenseMatrix
> allocates an array based (also) on this value, I get multiple "Requested
> array size exceeds VM limit" messages (an array with 2147483647 columns
> would be quite large).
>
> I added a trivial toString function that displays vector.size()
> in org/apache/mahout/math/VectorWritable.java, recompiled the project, then ran:
>
> ./hadoop fs -text /user/MY_USERNAME/projects/lda/mahout_vectors/vectors/part-00000
>
> All lines were DOCUMENT_ID (tab) 2147483647. So all vectors report
> size 2147483647.
>
> I checked the output for vector.zSum() as well, and that one looks fine.
>
> I can confirm that my input SequenceFile is correct; it has the following format:
> - key: Text with the unique id of the document
> - value: Text with the contents of the document
>
> Ovi
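As an aside, a SequenceFile in exactly that shape can be written with the plain Hadoop API. A minimal sketch, assuming Text keys and Text values as described above; the class name, output path and sample records are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteDocsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("twitter_sequence_files/docs.seq"); // illustrative path
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          // key: unique id of the document, value: contents of the document
          writer.append(new Text("doc-1"), new Text("contents of the first document"));
          writer.append(new Text("doc-2"), new Text("contents of the second document"));
        } finally {
          writer.close();
        }
      }
    }

Appending many small documents into a single file like this avoids creating millions of tiny files.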
On Fri, Feb 12, 2010 at 3:30 PM, Ovidiu Dan <[email protected]> wrote:

> Thanks, I also patched it myself, but now I have some other problems. I'll
> run it and let you know how it goes.
>
> Ovi

On Fri, Feb 12, 2010 at 3:25 PM, Robin Anil <[email protected]> wrote:

> I fixed the bug here: https://issues.apache.org/jira/browse/MAHOUT-289
>
> Try running now.
>
> Robin

On Sat, Feb 13, 2010 at 1:09 AM, Ovidiu Dan <[email protected]> wrote:

> I checked the code.
>
> Line 36 of LDAMapper.java references Vector:
>
> public class LDAMapper extends
>     Mapper<WritableComparable<?>, *Vector*, IntPairWritable, DoubleWritable> {
>
> Aren't all elements in Mapper<...> supposed to be Writable? Do you need to
> do a conversion to VectorWritable?
>
> Ovi
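For what it's worth, the usual pattern with the new Hadoop API is to parameterize the mapper on the Writable wrapper and call get() on it inside map(). A minimal sketch, not the actual LDAMapper code; the class name, output types and body are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class VectorMapperSketch
        extends Mapper<WritableComparable<?>, VectorWritable, WritableComparable<?>, DoubleWritable> {
      @Override
      protected void map(WritableComparable<?> key, VectorWritable value, Context context)
          throws IOException, InterruptedException {
        Vector document = value.get(); // unwrap the Writable before doing any math
        // Illustrative only: emit something computed from the vector.
        context.write(key, new DoubleWritable(document.zSum()));
      }
    }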
On Fri, Feb 12, 2010 at 2:32 PM, Ovidiu Dan <[email protected]> wrote:

> Ok, I did a clean checkout & install and ran everything again, then pointed
> LDA to mahout_vectors/vectors.
>
> Now I get this error:
>
> 10/02/12 14:28:35 INFO mapred.JobClient: Task Id :
> attempt_201001192218_0216_m_000005_0, Status : FAILED
> java.lang.ClassCastException: *org.apache.mahout.math.VectorWritable
> cannot be cast to org.apache.mahout.math.Vector*
>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Ovi

On Fri, Feb 12, 2010 at 1:50 PM, Robin Anil <[email protected]> wrote:

> Hi Ovidiu,
>
> If you choose tf, the vectors are generated in outputfolder/vectors, and if
> you choose tfidf the vectors are generated in outputfolder/tfidf/vectors. I
> am in the process of changing the code to move the output of the map/reduce
> to a fixed destination, and that exception would have caused the folders not
> to move. That's the reason for the last error.
>
> For the first error, I am not sure what is happening.
> Could you do a clean compile of mahout:
>
> mvn clean install -DskipTests=true
>
> and make sure you svn up the trunk before doing that.
>
> Then point your LDA to mahout_vectors/vectors.
>
> Robin

On Sat, Feb 13, 2010 at 12:14 AM, Ovidiu Dan <[email protected]> wrote:

> Hi again,
>
> Is there any workaround for my problem(s)? Or is there any other way that
> would allow me to transform many, many small messages (they're Tweets) into
> Mahout vectors, and then run LDA on them, without getting these errors?
> Converting them to txt files would be a bit of a pain because I would get
> millions of very small files. And a Lucene index would be a bit overkill, I
> think.
>
> Thanks,
> Ovi

On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:

> Was meant for the dev list. I am looking into the first error.
>
> -bcc mahout-user
>
> ---------- Forwarded message ----------
> From: Robin Anil <[email protected]>
> Date: Fri, Feb 12, 2010 at 2:20 PM
> Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> To: [email protected]
>
> Hi,
>
> This confusion arises from the fact that we use intermediate folders as
> subfolders under the output folder. How about we standardize on all the jobs
> taking an input, an intermediate and an output folder?
If > > not > > > > this > > > > >>> > then > > > > >>> > > >> for > > > > >>> > > >> > >> the > > > > >>> > > >> > >> > > next > > > > >>> > > >> > >> > > release? > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > Robin > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan < > > > > >>> > [email protected] > > > > >>> > > > > > > > >>> > > >> > >> wrote: > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > Hello Mahout developers / users, > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > I am trying to convert a properly formatted > > > > SequenceFile > > > > >>> to > > > > >>> > > >> Mahout > > > > >>> > > >> > >> > > vectors > > > > >>> > > >> > >> > > > to run LDA on them. As reference I am using > these > > > two > > > > >>> > > >> documents: > > > > >>> > > >> > >> > > > > > > > >>> > > > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html > > > > >>> > > >> > >> > > > > > > > >>> > > >> > > > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > I got the Mahout code from SVN on February 11th > > > 2010. > > > > >>> Below > > > > >>> > I > > > > >>> > > >> am > > > > >>> > > >> > >> > listing > > > > >>> > > >> > >> > > > the > > > > >>> > > >> > >> > > > steps I have took and the problems I have > > > > encountered: > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > export > > > > HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/ > > > > >>> > > >> > >> > > > export > > > > >>> > > >> > >> > > > > > >>> > > MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/ > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > $HADOOP_HOME/bin/hadoop jar > > > > >>> > > >> > >> > > > > > > > >>> > $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job > > > > >>> > > >> > >> > > > > > > org.apache.mahout.text.SparseVectorsFromSequenceFiles > > > > -i > > > > >>> > > >> > >> > > > > > > > /user/MY_USERNAME/projects/lda/twitter_sequence_files/ > > > > >>> -o > > > > >>> > > >> > >> > > > /user/MY_USERNAME/projects/lda/mahout_vectors/ > > -wt > > > tf > > > > >>> > -chunk > > > > >>> > > >> 300 > > > > >>> > > >> > -a > > > > >>> > > >> > >> > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > >>> > > >> --minSupport > > > > >>> > > >> > 2 > > > > >>> > > >> > >> > > --minDF > > > > >>> > > >> > >> > > > 1 --maxDFPercent 50 --norm 2 > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > *Problem #1: *Got this error at the end, but I > > > think > > > > >>> > > everything > > > > >>> > > >> > >> > finished > > > > >>> > > >> > >> > > > more or less correctly: > > > > >>> > > >> > >> > > > Exception in thread "main" > > > > java.lang.NoSuchMethodError: > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > >>> > > >> > >> > > > > >>> > > >> > > > > > >>> > > >> > > > > >>> > > > > > > >>> > > > > > >>> > > > > > > > > > > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V > > > > >>> > > >> > >> > > > at > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > > >>> > > >> > >> > > > > > > >>> > > >> > >> > > > > > >>> > > >> > >> > > > > >>> > > >> > > > > > >>> > > >> > > > > >>> > > > > > > >>> > > > > > >>> > > > > > > > > > > 
>     at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
>     at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
>
> *Problem #2:* Exception in thread "main" java.io.FileNotFoundException: File
> does not exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Tried to fix:*
>
> ../../hadoop fs -mv /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
>
> *Ran again:*
>
> $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i /user/MY_USERNAME/projects/lda/mahout_vectors/ -o /user/MY_USERNAME/projects/lda/lda_out/ -k 20 --numWords 100000 --numReducers 33
> *Problem #3:*
>
> Exception in thread "main" java.io.FileNotFoundException: File does not exist:
> hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
>
> [had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> Found 3 items
> -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
>
> Also, as a *bonus problem*, if the input folder
> /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than one
> file (for example if I run only the maps without a final reducer), this
> whole chain doesn't work.
> Thanks,
> Ovi
