I'm about to head to bed right now (long day, flight to and from SF in one
day, need sleep), but the short answer is
that the new LDA requires SequenceFile<IntWritable, VectorWritable> as
input (the same disk format
as DistributedRowMatrix), which you can get out of SequenceFile<Text,
VectorWritable> by running the
RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h" for more details) before
running CVB.
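Untested sketch of how that could slot into your cluster-reuters.sh snippet
(the reuters-out-matrix-cvb directory name is just something I made up for
illustration, and I'm going from memory that RowIdJob writes its
IntWritable-keyed vectors to a "matrix" path next to a "docIndex", so
double-check against "mahout rowid -h" on 0.6):

  # Assign integer row ids to the Text-keyed tf-vectors; this writes the
  # IntWritable-keyed matrix plus a docIndex mapping ids back to doc names
  $MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-out-matrix-cvb

  # Point cvb at the matrix part of the rowid output instead of tf-vectors
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-matrix-cvb/matrix \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb

The docIndex that rowid leaves next to the matrix should give you the
mapping from the new integer row ids back to your original document names,
which you'll want when reading the -dt output.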
Let us know if that doesn't help!

On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:

> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
> against the Reuters data. I just added another
> entry to the cluster-reuters.sh example script as follows:
>
> ******************************************************************************
> elif [ "x$clustertype" == "xcvb" ]; then
>   $MAHOUT seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>   && \
>   $MAHOUT ldatopics \
>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -dt sequencefile
> ******************************************************************************
>
> I successfully ran the previous LDA algorithm against Reuters but I am
> most interested in this new implementation of LDA because I want the new
> feature that generates document-to-cluster mappings (e.g., parameter -dt).
>
> When I run the above code via Hadoop pseudo-distributed mode as well as on
> a small cluster I receive the same error from the "mahout cvb" command.
> All the pre-clustering logic including sequence file and sparse vector
> generation works fine but when the cvb clustering is attempted the mappers
> fail with the following error in the Hadoop map task log:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Any help with resolving the problem would be appreciated.
>
> Dan

--
  -jake
