Problem running new LDA algorithm (cvb) against the Reuters data

DAN HELM Fri, 04 May 2012 20:54:59 -0700
I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6), against 
the Reuters data. I added another entry to the cluster-reuters.sh example 
script as follows:
******************************************************************************
elif [ "x$clustertype" == "xcvb" ]; then
  $MAHOUT seq2sparse \
    -i ${WORK_DIR}/reuters-out-seqdir/ \
    -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
    -wt tf -seq -nr 3 --namedVector \
  && \
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
  && \
  $MAHOUT ldatopics \
    -i ${WORK_DIR}/reuters-cvb/state-2 \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -dt sequencefile
******************************************************************************
I successfully ran the previous LDA algorithm against Reuters but I am most 
interested in this new implementation of LDA because I want the new feature 
that generates document-to-cluster mappings (e.g., parameter -dt).
 
When I run the above code in Hadoop pseudo-distributed mode, as well as on a 
small cluster, I get the same error from the "mahout cvb" command.  All the 
pre-clustering steps, including sequence file and sparse vector generation, work 
fine, but when the cvb clustering is attempted the mappers fail with the 
following error in the Hadoop map task log:
 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
 at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
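
Since the exception is about the key type reaching the mapper, one thing that can be checked is which key and value classes are recorded in the header of the tf-vectors sequence file. Below is a minimal sketch of such a check, not a verified diagnosis. It assumes the usual SequenceFile layout, where the key and value class names appear as plain text near the start of the file. The helper name seqfile_classes and the part file name part-r-00000 are illustrative assumptions, not taken from the script above.

```shell
# Print the first two fully qualified class names found near the start of a
# SequenceFile read from stdin (normally the key class, then the value class).
# Assumption: the class names are stored as readable text in the file header.
seqfile_classes() {
  head -c 512 \
    | LC_ALL=C tr -c '[:alnum:]._$' '\n' \
    | grep -E '^[a-z]+(\.[A-Za-z0-9_$]+)+$' \
    | head -n 2
}

# Hypothetical usage (the part file name is a guess and may differ):
# hadoop fs -cat ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors/part-r-00000 | seqfile_classes
```

If the first name printed is org.apache.hadoop.io.Text, that would be consistent with the cast failure above, since the trace shows the mapper casting its input to org.apache.hadoop.io.IntWritable.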
 
Any help with resolving the problem would be appreciated.
 
Dan