Re: Extracting document/topic inference with the new lda cvb algorithm

Andy Schlaikjer Wed, 04 Jul 2012 10:40:21 -0700

I haven't looked into the vector dumper code in detail, but I remember
having successfully run some version of it without an input dictionary.
Perhaps you've stumbled into a legitimate bug with the utility? For the
time being you might also try the sequence file dumper util which is
somewhat more generic but may suit your purpose here.
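
For illustration only, here is a minimal sketch of the behaviour one would expect when no dictionary is supplied: each vector index is printed as-is instead of being looked up. This is not Mahout's VectorHelper code. The class name WeightedEntryLabels and the methods label and format are hypothetical, and a plain Map stands in for a Mahout vector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Mahout code: formats index:weight pairs and
// falls back to the raw index when no dictionary is available.
class WeightedEntryLabels {

    // Returns the dictionary string for an index, or the index itself when
    // the dictionary is null, the index is out of range, or the slot is empty.
    static String label(String[] dictionary, int index) {
        if (dictionary == null
                || index < 0
                || index >= dictionary.length
                || dictionary[index] == null) {
            return Integer.toString(index);
        }
        return dictionary[index];
    }

    // Renders each entry as "label:weight", preserving the map's iteration order.
    static List<String> format(Map<Integer, Double> entries, String[] dictionary) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            out.add(label(dictionary, e.getKey()) + ":" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up doc-topic weights for a single document.
        Map<Integer, Double> docTopics = new LinkedHashMap<>();
        docTopics.put(0, 0.75);
        docTopics.put(1, 0.25);

        // No dictionary: raw zero-based topic ids are printed.
        System.out.println(format(docTopics, null));

        // With a term dictionary, the same ids are (misleadingly) shown as terms.
        System.out.println(format(docTopics, new String[] {"apple", "banana"}));
    }
}
```

With a null dictionary the sketch prints the zero-based ids; with a term dictionary the same ids come out as term strings, which is the confusion described in the quoted messages below.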


Andy


On Wed, Jul 4, 2012 at 9:42 AM, Caroline Meyer <caromeye...@gmail.com> wrote:

> Hi Andy
>
> If I only use the -s and -o options I get this null pointer exception:
>
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:118)
>     at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:115)
>     at com.google.common.collect.Iterators$8.next(Iterators.java:765)
>     at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
>     at java.util.ArrayList.<init>(ArrayList.java:131)
>     at com.google.common.collect.Lists.newArrayList(Lists.java:119)
>     at org.apache.mahout.utils.vectors.VectorHelper.toWeightedTerms(VectorHelper.java:114)
>     at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:124)
>     at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:241)
>
> From the code, it looks like it is trying to use a dictionary that was
> not specified. Is there another option I am missing?
>
> Cheers,
> Caroline
>
>
> On Wed, Jul 4, 2012 at 6:10 PM, Andy Schlaikjer <andrew.schlaik...@gmail.com> wrote:
>
> > Hi Caroline,
> >
> > Jake Mannix and I wrote the LDA CVB implementation. Apologies for the
> > light documentation.
> >
> > When you invoked Mahout, did you supply the "--doc_topic_output <path>"
> > parameter? If this is present, after training a model the driver app will
> > apply the model to the input term-vectors, storing inference results in
> > the specified path. If the parameter isn't specified, this final
> > inference run is skipped:
> >
> > https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
> >
> > https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331
> >
> > So, assuming you did generate inference output, I should note that both
> > the model and inference output have the *same* format: Both the
> > topic-term matrix and doc-topic inference output are stored as
> > SequenceFile<IntWritable, VectorWritable> data. If you point the
> > vectordump util at either data set and supply a dictionary, it'll
> > happily map term ids or topic ids into term strings using that
> > dictionary... Quite confusing. Just make sure that when you run
> > vectordump against the doc-topic data you don't supply the dictionary.
> > This way, you'll see the raw topic ids (zero-based indices) in output,
> > instead of whatever terms those indices might correspond to in your
> > dictionary.
> >
> > Best,
> > Andy
> > @sagemintblue
> >
> >
> > On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer <caromeye...@gmail.com> wrote:
> >
> > > Hey Guys,
> > >
> > > I have been able to successfully execute the new lda algorithm as well
> > > as extract the topic/term inference with vectordump. What I was not
> > > able to do was get the document/topic inference. When I run the same
> > > vectordump command I get the same kinds of vectors (term:probability)
> > > as before. Should the vectors not be (topic:probability)?
> > >
> > > The command I run is:
> > >
> > > vectordump -s temp/lda-cvb-doc/part-m-00000 -d
> > > temp/vectors/dictionary.file-* -dt sequencefile -o
> > > temp/lda-cvb-topics.txt
> > >
> > > I have not been able to find any documentation except what's in the
> > > code. Thanks for the help.
> > >
> > > Cheers,
> > > Caroline
> > >
> >
>
