I am having trouble interpreting the "doc-topic" distribution produced by the cvb implementation of LDA in Mahout 0.7. Here's the term-frequency matrix for a simple test case (shown here as the output of mahout seqdumper):

    Key: /d01: Value: /d01:{0:30.0,1:10.0}
    Key: /d02: Value: /d02:{0:60.0,1:20.0}
    Key: /d03: Value: /d03:{0:30.0,1:10.0}
    Key: /d04: Value: /d04:{0:60.0,1:20.0}
    Key: /x01: Value: /x01:{2:30.0,3:10.0}
    Key: /x02: Value: /x02:{2:60.0,3:20.0}
    Key: /x03: Value: /x03:{2:30.0,3:10.0}
    Count: 7

The intent here was that the d01 through d04 documents would consist almost entirely of one topic, represented almost entirely by terms 0 and 1 with a topic-term distribution of [0.75, 0.25, epsilon, epsilon], and that the x01 through x03 documents would consist almost entirely of a second topic, represented almost entirely by terms 2 and 3 with a topic-term distribution of [epsilon, epsilon, 0.75, 0.25]. Since the "d" documents do not contain terms 2 or 3 and the "x" documents do not contain terms 0 or 1, I expected to see document-topic distributions approximately equal to:

    d01: 1 0
    d02: 1 0
    d03: 1 0
    d04: 1 0
    x01: 0 1
    x02: 0 1
    x03: 0 1

I ran the following command (where the simplelda/sparse/matrix directory contained the term-frequency matrix above). The algorithm ran to completion, meaning that it converged before the maximum number of iterations was exceeded:

    mahout cvb \
      -i simplelda/sparse/matrix \
      -dict simplelda/sparse/dictionary.file-0 \
      -ow -o simplelda/cvb-topics \
      -dt simplelda/cvb-classifications \
      -tf 0.25 \
      -block 4 \
      -x 20 \
      -cd 1e-10 \
      -k 2 \
      --tempDir simplelda/temp-k2 \
      -seed 6956

The topic-term frequencies written to simplelda/cvb-topics were accurate and as expected:

    {0:0.7499999999895863,1:0.2499999999548601,2:2.7776873636508568E-11,3:2.777682733874987E-11}
    {0:9.375466996550278E-11,1:9.375456577819702E-11,2:0.7499999998802006,3:0.24999999993229008}

However, the document-topic distribution output written to simplelda/cvb-classifications was not at all what I expected:

    Key: 0: Value: {0:0.05705773500297721,1:0.9429422649970228}
    Key: 1: Value: {0:0.05705773500297721,1:0.9429422649970228}
    Key: 2: Value: {0:0.05705773500297721,1:0.9429422649970228}
    Key: 3: Value: {0:0.05705773500297721,1:0.9429422649970228}
    Key: 4: Value: {0:0.4335650246424872,1:0.5664349753575127}
    Key: 5: Value: {0:0.4335650246424872,1:0.5664349753575127}
    Key: 6: Value: {0:0.4335650246424872,1:0.5664349753575127}
    Count: 7

These are called "doc-topic distributions" in the help output, so I interpreted this to mean that the estimator concluded the terms in the "d" documents (keys 0 through 3) were most likely all drawn from the second topic. But the "d" documents contain no terms from the second topic! Likewise, the "x" documents (keys 4 through 6) contain no terms from the first topic, so why is there a relatively large value (0.4335) in the first column?

If this document-topic distribution produced by cvb is correct, what does it represent?
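
To make the expectation above concrete, here is a minimal Python sketch of the calculation behind it. It is not Mahout's cvb algorithm: it holds the topic-term distributions fixed at (rounded versions of) the values reported in simplelda/cvb-topics and estimates each document's topic mixture by plain maximum-likelihood EM, with no Dirichlet smoothing. The function name `infer_doc_topics`, the constant `EPS`, and the iteration count are all hypothetical choices made for illustration.

```python
# Assumption: rounded stand-ins for the topic-term distributions in cvb-topics.
EPS = 1e-10
topics = [
    [0.75, 0.25, EPS, EPS],  # topic 0: mass on terms 0 and 1
    [EPS, EPS, 0.75, 0.25],  # topic 1: mass on terms 2 and 3
]

# Term counts copied from the term-frequency matrix in the question.
docs = {
    "/d01": {0: 30.0, 1: 10.0},
    "/d02": {0: 60.0, 1: 20.0},
    "/d03": {0: 30.0, 1: 10.0},
    "/d04": {0: 60.0, 1: 20.0},
    "/x01": {2: 30.0, 3: 10.0},
    "/x02": {2: 60.0, 3: 20.0},
    "/x03": {2: 30.0, 3: 10.0},
}


def infer_doc_topics(counts, topics, iterations=50):
    """Estimate one document's topic mixture by EM, with topics held fixed.

    This is an unsmoothed maximum-likelihood estimate, used only as a
    sanity check; it is not how Mahout computes its doc-topic output.
    """
    k = len(topics)
    theta = [1.0 / k] * k  # start from a uniform mixture
    total = sum(counts.values())
    for _ in range(iterations):
        expected = [0.0] * k
        for term, count in counts.items():
            # E-step: share of this term's occurrences attributed to each topic.
            weights = [theta[t] * topics[t][term] for t in range(k)]
            norm = sum(weights)
            for t in range(k):
                expected[t] += count * weights[t] / norm
        # M-step: renormalise expected counts into a mixture.
        theta = [e / total for e in expected]
    return theta


if __name__ == "__main__":
    for name, counts in docs.items():
        theta = infer_doc_topics(counts, topics)
        print(name, [round(p, 4) for p in theta])
```

Under these assumptions the "d" documents come out with essentially all weight on topic 0 and the "x" documents with essentially all weight on topic 1, which is the pattern I expected to find in simplelda/cvb-classifications.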