I am trying to run the new LDA implementation, cvb (Mahout 0.6), against the
Reuters data. I added another entry to the cluster-reuters.sh example script
as follows:
******************************************************************************
elif [ "x$clustertype" == "xcvb" ]; then
  $MAHOUT seq2sparse \
    -i ${WORK_DIR}/reuters-out-seqdir/ \
    -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
    -wt tf -seq -nr 3 --namedVector \
  && \
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
  && \
  $MAHOUT ldatopics \
    -i ${WORK_DIR}/reuters-cvb/state-2 \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -dt sequencefile
******************************************************************************
I successfully ran the previous LDA algorithm against Reuters, but I am most
interested in this new implementation of LDA because I want the new feature
that generates document-to-cluster mappings (the -dt parameter).
When I run the above code in Hadoop pseudo-distributed mode, as well as on a
small cluster, I receive the same error from the "mahout cvb" command. All of
the pre-clustering steps, including sequence file and sparse vector generation,
work fine, but when the cvb clustering is attempted the mappers fail with the
following error in the Hadoop map task log:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
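The exception suggests that the mapper received a Text key where it expected an
IntWritable. For anyone reproducing this, here is a minimal sketch of one way to
check which key and value classes the tf-vectors file actually carries. The
helper name seqfile_classes is hypothetical (it is not part of Mahout or
Hadoop), and the part file name in the usage comment is an assumption. It
relies only on the fact that a SequenceFile header stores the key and value
class names near the start of the file.

```shell
# Hypothetical helper (not part of Mahout or Hadoop): print the key and value
# class names recorded in a SequenceFile header read from stdin.
# A SequenceFile starts with "SEQ", a version byte, then the key class name
# and the value class name, so both names sit in the first few hundred bytes.
# Assumption: both class names start with "org.apache." and the bytes that
# follow each name are not identifier characters.
seqfile_classes() {
  head -c 512 \
    | LC_ALL=C grep -ao 'org\.apache\.[A-Za-z0-9_.$]*' \
    | head -n 2
}

# Usage (the part file name is an assumption; list the directory to find it):
#   hadoop fs -cat "${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors/part-r-00000" \
#     | seqfile_classes
```

The first line printed is the key class and the second is the value class. If
the first line reads org.apache.hadoop.io.Text, that would be consistent with
the cast failure above.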
Any help with resolving the problem would be appreciated.
Dan