Re: Extracting document/topic inference with the new lda cvb algorithm

Caroline Meyer Wed, 04 Jul 2012 09:43:09 -0700

Hi Andy

If I only use the -s and -o options I get this null pointer exception:


Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:118)
    at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:115)
    at com.google.common.collect.Iterators$8.next(Iterators.java:765)
    at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
    at java.util.ArrayList.<init>(ArrayList.java:131)
    at com.google.common.collect.Lists.newArrayList(Lists.java:119)
    at org.apache.mahout.utils.vectors.VectorHelper.toWeightedTerms(VectorHelper.java:114)
    at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:124)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:241)

From the code, it looks like vectordump is trying to use a dictionary even
though none is specified. Is there another option I am missing?

Cheers,
Caroline


On Wed, Jul 4, 2012 at 6:10 PM, Andy Schlaikjer <andrew.schlaik...@gmail.com> wrote:

> Hi Caroline,
>
> Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light
> documentation.
>
> When you invoked Mahout, did you supply the "--doc_topic_output <path>"
> parameter? If this is present, after training a model the driver app will
> apply the model to the input term-vectors, storing inference results in the
> specified path. If the parameter isn't specified, this final inference run
> is skipped:
>
>
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
>
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331
>
> So, assuming you did generate inference output, I should note that both the
> model and inference output have the *same* format: Both the topic-term
> matrix and doc-topic inference output are stored as
> SequenceFile<IntWritable, VectorWritable> data. If you point the vectordump
> util at either data set and supply a dictionary, it'll happily map term ids
> or topic ids into term strings using that dictionary... Quite confusing.
> Just make sure that when you run vectordump against the doc-topic data
> you don't supply the dictionary. This way, you'll see the raw topic ids
> (zero-based indices) in output, instead of whatever terms those indices
> might correspond to in your dictionary.
>
> Best,
> Andy
> @sagemintblue
>
>
> On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer <caromeye...@gmail.com> wrote:
>
> > Hey Guys,
> >
> > I have been able to successfully execute the new lda algorithm as well as
> > extract the topic/term inference with vectordump. What I was not able to
> > do was get the document/topic inference. When I run the same vectordump
> > command I get the same kinds of vectors (term:probability) as before.
> > Should the vectors not be (topic:probability)?
> >
> > The command I run is:
> >
> > vectordump -s temp/lda-cvb-doc/part-m-00000 -d temp/vectors/dictionary.file-* -dt sequencefile -o temp/lda-cvb-topics.txt
> >
> > I have not been able to find any documentation except what's in the code.
> > Thanks for the help.
> >
> > Cheers,
> > Caroline
> >
>
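
The dictionary behaviour Andy describes above can be sketched in a few lines
of Python. This is an illustration only, not Mahout's code: render_vector,
doc_topics, and the three-word dictionary are hypothetical names and values
chosen for the example. Because the topic-term and doc-topic outputs share one
vector format, a dictionary lookup relabels topic ids as terms, while falling
back to the raw index gives topic:probability pairs.

```python
from typing import Optional, Sequence, Tuple


def render_vector(entries: Sequence[Tuple[int, float]],
                  dictionary: Optional[Sequence[str]] = None) -> str:
    """Render (index, weight) pairs as 'label:weight' strings.

    Hypothetical helper: with a dictionary, each index is looked up as if it
    were a term id; without one, the raw index is printed instead.
    """
    parts = []
    for index, weight in entries:
        if dictionary is not None and 0 <= index < len(dictionary):
            label = dictionary[index]   # index treated as a term id
        else:
            label = str(index)          # raw zero-based index (a topic id here)
        parts.append(f"{label}:{weight}")
    return "{" + ",".join(parts) + "}"


# Made-up doc-topic vector: topic id -> probability for one document.
doc_topics = [(0, 0.7), (1, 0.2), (2, 0.1)]
# Made-up term dictionary: term id -> term string.
dictionary = ["apple", "banana", "cherry"]

print(render_vector(doc_topics, dictionary))  # topic ids mislabelled as terms
print(render_vector(doc_topics))              # raw topic ids
```

With the dictionary, the topic ids 0, 1, 2 come out as "apple", "banana",
"cherry", which matches the (term:probability) output Caroline saw. The
explicit None check is the kind of guard whose absence would produce a
NullPointerException when no dictionary is supplied.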
