What do you get out, and what exactly is your commandline invocation?

On Mon, Jun 24, 2013 at 6:58 AM, Mark Wicks <mawi...@gmail.com> wrote:

> As a slight correction to my earlier post on running cvb from the
> trunk, the NaN values were my mistake. However, I still haven't had
> any success getting it to write document/topic inferences.
>
> On Sat, Jun 22, 2013 at 7:21 AM, Mark Wicks <mawi...@gmail.com> wrote:
> > I tried with cvb from trunk and ran into several problems:
> >
> > 1) The topic/term distributions were all NaN.
> > 2) The initial perplexity was NaN.
> > 3) It never wrote the document/topic inferences.
> > 4) It exited with an exception stating that the topic/term
> > distribution output directory already exists, after successfully
> > creating it and writing to it. It did not exist before running cvb.
> >
> > On Thu, Jun 20, 2013 at 10:18 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >> There was a bug in Mahout 0.7 regarding the doc/topic outputs,
> >> can you try your little test on trunk, and see if you get a more
> >> sensible / interpretable result?
> >>
> >> On Thu, Jun 20, 2013 at 10:17 AM, Mark Wicks <mawi...@gmail.com> wrote:
> >>
> >>> I apologize for posting this again. I sent it during the weekend and
> >>> didn't get any response (which seems unusual for this list :)).
> >>> I am hoping that someone with some LDA/cvb experience who can help
> >>> might have missed it over the weekend.
> >>> Can someone tell me (1) if the document-topic distribution below makes
> >>> sense for the term frequencies shown and (2) how I should interpret
> >>> it.
> >>>
> >>> Mark Wicks
> >>>
> >>> On Sat, Jun 15, 2013 at 9:22 AM, Mark Wicks <mawi...@gmail.com> wrote:
> >>> > I am having trouble interpreting the "doc-topic" distribution
> >>> > produced by the cvb implementation of LDA in Mahout 0.7.
> >>> > Here's the term-frequency matrix for a simple test case (shown
> >>> > here as the output of mahout seqdumper):
> >>> >
> >>> > Key: /d01: Value: /d01:{0:30.0,1:10.0}
> >>> > Key: /d02: Value: /d02:{0:60.0,1:20.0}
> >>> > Key: /d03: Value: /d03:{0:30.0,1:10.0}
> >>> > Key: /d04: Value: /d04:{0:60.0,1:20.0}
> >>> > Key: /x01: Value: /x01:{2:30.0,3:10.0}
> >>> > Key: /x02: Value: /x02:{2:60.0,3:20.0}
> >>> > Key: /x03: Value: /x03:{2:30.0,3:10.0}
> >>> > Count: 7
> >>> >
> >>> > The intent here was that the d01 through d04 documents would
> >>> > consist almost entirely of one topic represented almost entirely
> >>> > by terms 0 and 1 with a topic-term distribution of
> >>> > [0.75, 0.25, epsilon, epsilon] and that the x01 through x03
> >>> > documents would consist almost entirely of a second topic
> >>> > represented almost entirely by terms 2 and 3 with a topic-term
> >>> > distribution of [epsilon, epsilon, 0.75, 0.25]. Since the "d"
> >>> > documents do not contain terms 2 or 3 and the "x" documents do
> >>> > not contain terms 0 or 1, I expected to see document-topic
> >>> > distributions that were approximately equal to
> >>> >
> >>> > d01: 1 0
> >>> > d02: 1 0
> >>> > d03: 1 0
> >>> > d04: 1 0
> >>> > x01: 0 1
> >>> > x02: 0 1
> >>> > x03: 0 1
> >>> >
> >>> > I ran the following command (where the simplelda/sparse/matrix
> >>> > directory contained the previous term-frequency matrix). The
> >>> > algorithm ran to completion (meaning that it converged before
> >>> > the maximum number of iterations was exceeded).
> >>> >
> >>> > mahout cvb \
> >>> >   -i simplelda/sparse/matrix \
> >>> >   -dict simplelda/sparse/dictionary.file-0 \
> >>> >   -ow -o simplelda/cvb-topics \
> >>> >   -dt simplelda/cvb-classifications \
> >>> >   -tf 0.25 \
> >>> >   -block 4 \
> >>> >   -x 20 \
> >>> >   -cd 1e-10 \
> >>> >   -k 2 \
> >>> >   --tempDir simplelda/temp-k2 \
> >>> >   -seed 6956
> >>> >
> >>> > The topic-term frequencies written to simplelda/cvb-topics were
> >>> > accurate and as expected:
> >>> >
> >>> > {0:0.7499999999895863,1:0.2499999999548601,2:2.7776873636508568E-11,3:2.777682733874987E-11}
> >>> > {0:9.375466996550278E-11,1:9.375456577819702E-11,2:0.7499999998802006,3:0.24999999993229008}
> >>> >
> >>> > However, the document-topic distribution output written to
> >>> > simplelda/cvb-classifications was not at all what I expected:
> >>> >
> >>> > Key: 0: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 1: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 2: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 3: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 4: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Key: 5: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Key: 6: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Count: 7
> >>> >
> >>> > These are called "doc-topic distributions" in the help output,
> >>> > so I interpreted this to mean that the estimator concluded the
> >>> > "d" document terms were most likely all drawn from the second
> >>> > topic. But the "d" documents contain no terms from the second
> >>> > topic! Likewise, the "x" documents contain no terms from the
> >>> > first topic, so why is there a relatively large value (0.4335)
> >>> > in the first column? If this document-topic distribution
> >>> > produced by cvb is correct, what does it represent?
> >>
> >> --
> >>
> >> -jake
>

--
-jake
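
For reference, the expectation in the original message can be checked by hand without Mahout. The sketch below is not the cvb implementation: it holds the two posted topic-term distributions fixed and runs a plain EM update on each document's topic mixture, with no Dirichlet prior or smoothing. The function names `infer_doc_topics` and `infer_all` and the iteration count are made up for illustration. Under those assumptions the "d" documents come out as essentially all topic 0 and the "x" documents as essentially all topic 1, which is the shape the original poster expected.

```python
# Topic-term distributions copied from the cvb-topics output quoted above.
TOPIC_TERM = [
    [0.7499999999895863, 0.2499999999548601,
     2.7776873636508568e-11, 2.777682733874987e-11],
    [9.375466996550278e-11, 9.375456577819702e-11,
     0.7499999998802006, 0.24999999993229008],
]

# Term counts per document, copied from the seqdumper output quoted above.
DOCS = {
    "/d01": {0: 30.0, 1: 10.0},
    "/d02": {0: 60.0, 1: 20.0},
    "/d03": {0: 30.0, 1: 10.0},
    "/d04": {0: 60.0, 1: 20.0},
    "/x01": {2: 30.0, 3: 10.0},
    "/x02": {2: 60.0, 3: 20.0},
    "/x03": {2: 30.0, 3: 10.0},
}


def infer_doc_topics(counts, topic_term, iterations=50):
    """Estimate one document's topic mixture with the topics held fixed.

    Plain EM without priors (an assumption of this sketch, not what
    Mahout's cvb does internally).
    """
    num_topics = len(topic_term)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform mixture
    total = sum(counts.values())
    for _ in range(iterations):
        new_theta = [0.0] * num_topics
        for term, n in counts.items():
            # E-step: how much of this term's mass each topic explains.
            weights = [theta[t] * topic_term[t][term] for t in range(num_topics)]
            norm = sum(weights)
            for t in range(num_topics):
                new_theta[t] += n * weights[t] / norm
        # M-step: renormalise so the mixture sums to 1.
        theta = [v / total for v in new_theta]
    return theta


def infer_all():
    """Run the estimate for every document in DOCS."""
    return {name: infer_doc_topics(counts, TOPIC_TERM)
            for name, counts in DOCS.items()}


if __name__ == "__main__":
    for name, theta in sorted(infer_all().items()):
        print(name, ["%.6f" % v for v in theta])
```

Comparing these rows with the contents of simplelda/cvb-classifications is one way to tell whether the doc-topic output is an inference problem or an output bug of the kind mentioned for Mahout 0.7.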