I am having trouble interpreting the "doc-topic" distribution produced
by the cvb implementation of LDA in Mahout 0.7. Here's the
term-frequency matrix for a simple test case (shown here as the output
of mahout seqdumper):

Key: /d01: Value: /d01:{0:30.0,1:10.0}
Key: /d02: Value: /d02:{0:60.0,1:20.0}
Key: /d03: Value: /d03:{0:30.0,1:10.0}
Key: /d04: Value: /d04:{0:60.0,1:20.0}
Key: /x01: Value: /x01:{2:30.0,3:10.0}
Key: /x02: Value: /x02:{2:60.0,3:20.0}
Key: /x03: Value: /x03:{2:30.0,3:10.0}
Count: 7

The intent was that documents d01 through d04 would consist almost
entirely of one topic, represented almost entirely by terms 0 and 1,
with a topic-term distribution of [0.75, 0.25, epsilon, epsilon], and
that documents x01 through x03 would consist almost entirely of a
second topic, represented almost entirely by terms 2 and 3, with a
topic-term distribution of [epsilon, epsilon, 0.75, 0.25]. Since the
"d" documents do not contain terms 2 or 3 and the "x" documents do not
contain terms 0 or 1, I expected to see document-topic distributions
approximately equal to:

d01: 1 0
d02: 1 0
d03: 1 0
d04: 1 0
x01: 0 1
x02: 0 1
x03: 0 1
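
As a quick check on that expectation, here is a minimal Python sketch
that normalizes each row of the term-frequency matrix above into term
proportions. It is not part of Mahout; the names tf, NUM_TERMS and
normalize are my own, and the vocabulary size of 4 is assumed from the
term indices 0 through 3 that appear in the matrix.

```python
# Term-frequency vectors copied from the seqdumper output above.
tf = {
    "/d01": {0: 30.0, 1: 10.0},
    "/d02": {0: 60.0, 1: 20.0},
    "/d03": {0: 30.0, 1: 10.0},
    "/d04": {0: 60.0, 1: 20.0},
    "/x01": {2: 30.0, 3: 10.0},
    "/x02": {2: 60.0, 3: 20.0},
    "/x03": {2: 30.0, 3: 10.0},
}

# Assumption: the vocabulary has exactly four terms (indices 0..3).
NUM_TERMS = 4


def normalize(counts, num_terms=NUM_TERMS):
    """Turn a sparse count vector into a dense vector of proportions."""
    total = sum(counts.values())
    return [counts.get(term, 0.0) / total for term in range(num_terms)]


proportions = {doc: normalize(counts) for doc, counts in tf.items()}

for doc, row in proportions.items():
    print(doc, row)
```

Every "d" row has mass only on terms 0 and 1, and every "x" row only on
terms 2 and 3, in a 3:1 ratio, which is why I expected the two groups
to separate cleanly into two topics.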

I ran the following command (where the simplelda/sparse/matrix directory
contained the previous term frequency matrix). The algorithm ran to completion
(meaning that it converged before the maximum number of iterations was
exceeded).

mahout  cvb \
   -i simplelda/sparse/matrix \
   -dict simplelda/sparse/dictionary.file-0 \
   -ow -o simplelda/cvb-topics \
   -dt simplelda/cvb-classifications \
   -tf 0.25 \
   -block 4 \
   -x 20 \
   -cd 1e-10 \
   -k 2 \
   --tempDir simplelda/temp-k2 \
   -seed 6956

The topic-term frequencies written to simplelda/cvb-topics were accurate and as
expected:

{0:0.7499999999895863,1:0.2499999999548601,2:2.7776873636508568E-11,3:2.777682733874987E-11}
{0:9.375466996550278E-11,1:9.375456577819702E-11,2:0.7499999998802006,3:0.24999999993229008}
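
To make my expectation concrete, here is a rough Python sketch of what
these topic-term distributions would imply for a document if each token
were simply assigned to topics in proportion to the topic-term
probability of its term (a uniform prior over topics). This is only my
own sanity check, not Mahout's cvb update, and the names phi, docs and
doc_topic are hypothetical.

```python
# Topic-term distributions copied from the cvb-topics output above.
phi = [
    {0: 0.7499999999895863, 1: 0.2499999999548601,
     2: 2.7776873636508568e-11, 3: 2.777682733874987e-11},
    {0: 9.375466996550278e-11, 1: 9.375456577819702e-11,
     2: 0.7499999998802006, 3: 0.24999999993229008},
]

# One representative document from each group of the input matrix.
docs = {
    "/d01": {0: 30.0, 1: 10.0},
    "/x01": {2: 30.0, 3: 10.0},
}


def doc_topic(counts, phi):
    """Average per-token topic responsibilities under a uniform topic prior.

    Assumption: this is a single naive responsibility pass, not the
    iterative inference that Mahout's cvb performs.
    """
    weights = [0.0] * len(phi)
    total = sum(counts.values())
    for term, count in counts.items():
        # Normalizer: probability of this term summed over all topics.
        norm = sum(topic[term] for topic in phi)
        for k, topic in enumerate(phi):
            weights[k] += count * topic[term] / norm
    return [w / total for w in weights]


estimates = {doc: doc_topic(counts, phi) for doc, counts in docs.items()}

for doc, row in estimates.items():
    print(doc, row)
```

Under that naive calculation the "d" document lands almost entirely on
the first topic and the "x" document almost entirely on the second.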

However, the document-topic distribution output written to
simplelda/cvb-classifications was not at all what I expected:

Key: 0: Value: {0:0.05705773500297721,1:0.9429422649970228}
Key: 1: Value: {0:0.05705773500297721,1:0.9429422649970228}
Key: 2: Value: {0:0.05705773500297721,1:0.9429422649970228}
Key: 3: Value: {0:0.05705773500297721,1:0.9429422649970228}
Key: 4: Value: {0:0.4335650246424872,1:0.5664349753575127}
Key: 5: Value: {0:0.4335650246424872,1:0.5664349753575127}
Key: 6: Value: {0:0.4335650246424872,1:0.5664349753575127}
Count: 7

These are called "doc-topic distributions" in the help output, so I
interpreted this to mean that the estimator concluded the "d" document
terms were most likely all drawn from the second topic. But the "d"
documents contain no terms from the second topic! Likewise, the "x"
documents contain no terms from the first topic, so why is there a
relatively large value (0.4335) in the first column? If this
document-topic distribution produced by cvb is correct, what does it
represent?
