The title of the doc, which is the key of the SequenceFile, needs to be a number
2012/5/5 Jake Mannix <[email protected]>:
> I'm about to head to bed right now (long day, flight to and from SF in one
> day, need sleep), but the short answer is that the new LDA requires
> SequenceFile<IntWritable, VectorWritable> as input (the same disk format
> as DistributedRowMatrix), which you can get from SequenceFile<Text,
> VectorWritable> by running the RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h"
> for more details) before running CVB.
>
> Let us know if that doesn't help!
>
> On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:
>
>> I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6),
>> against the Reuters data. I just added another entry to the
>> cluster-reuters.sh example script as follows:
>>
>> ******************************************************************************
>> elif [ "x$clustertype" == "xcvb" ]; then
>>   $MAHOUT seq2sparse \
>>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>>     -wt tf -seq -nr 3 --namedVector \
>>   && \
>>   $MAHOUT cvb \
>>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>>   && \
>>   $MAHOUT ldatopics \
>>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>     -dt sequencefile
>> ******************************************************************************
>>
>> I successfully ran the previous LDA algorithm against Reuters, but I am
>> most interested in this new implementation because I want the new feature
>> that generates document-to-topic mappings (the -dt parameter).
>>
>> When I run the above in Hadoop pseudo-distributed mode, as well as on a
>> small cluster, I receive the same error from the "mahout cvb" command. All
>> of the pre-clustering logic, including sequence file and sparse vector
>> generation, works fine, but when the cvb clustering is attempted the
>> mappers fail with the following error in the Hadoop map task log:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.hadoop.io.IntWritable
>>         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> Any help with resolving the problem would be appreciated.
>>
>> Dan
>
> --
> -jake
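For reference, here is a minimal sketch of the cvb branch above with the rowid
step Jake describes inserted between seq2sparse and cvb. It assumes rowid
writes its IntWritable-keyed vectors to a matrix subdirectory under its output
path (check "$MAHOUT_HOME/bin/mahout rowid -h" for the exact layout in your
version); the ${WORK_DIR}/reuters-out-matrix path is illustrative, not taken
from the thread.

******************************************************************************
# Convert the Text-keyed tf-vectors into the SequenceFile<IntWritable,
# VectorWritable> form that cvb expects, then run cvb on the result.
# Assumption: rowid writes <output>/matrix (IntWritable -> VectorWritable)
# and <output>/docIndex (IntWritable -> Text); paths here are illustrative.
$MAHOUT rowid \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
  -o ${WORK_DIR}/reuters-out-matrix \
&& \
$MAHOUT cvb \
  -i ${WORK_DIR}/reuters-out-matrix/matrix \
  -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
  -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
  -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
******************************************************************************

If rowid does emit a docIndex file alongside the matrix, it maps the generated
integer row ids back to the original Text keys (the document names), which is
useful for joining the -dt doc-topic output back to documents.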
