You can modify the code in CachingCVB0Mapper.map, CachingCVB0PerplexityMapper.map, and CVB0DocInferenceMapper.map to read SequenceFile<WritableComparable<?>, VectorWritable> instead, and then convert the key type to Integer.
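A rough sketch of the key-handling part of that change (untested; the inference body and fields of the real CachingCVB0Mapper are elided, and anything beyond the Hadoop/Mahout types shown is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    // Sketch only: shows the relaxed key type and the key conversion;
    // the actual CVB0 inference done by Mahout's CachingCVB0Mapper is omitted.
    public class CachingCVB0Mapper
        extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

      @Override
      protected void map(WritableComparable<?> key, VectorWritable doc, Context context)
          throws IOException, InterruptedException {
        // Accept an IntWritable key (what RowIdJob produces) or any other key
        // type, e.g. Text, whose string form is a number such as "42".
        // Non-numeric keys will fail here, hence the requirement below that
        // the doc title used as the key be a number.
        final int docId;
        if (key instanceof IntWritable) {
          docId = ((IntWritable) key).get();
        } else {
          docId = Integer.parseInt(key.toString().trim());
        }
        // ... run inference on doc.get(), using docId as the document/row index ...
      }
    }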
2012/5/5 chenghao liu <[email protected]>:
> the title of the doc, which is the key of the sequence file, needs to be a number
>
> 2012/5/5 Jake Mannix <[email protected]>:
>> I'm about to head to bed right now (long day, flight to and from SF in
>> one day, need sleep), but the short answer is that the new LDA requires
>> SequenceFile<IntWritable, VectorWritable> as input (the same disk format
>> as DistributedRowMatrix), which you can get out of SequenceFile<Text,
>> VectorWritable> by running the RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h"
>> for more details) before running CVB.
>>
>> Let us know if that doesn't help!
>>
>> On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:
>>
>>> I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6),
>>> against the Reuters data. I just added another entry to the
>>> cluster-reuters.sh example script as follows:
>>>
>>> ******************************************************************************
>>> elif [ "x$clustertype" == "xcvb" ]; then
>>>   $MAHOUT seq2sparse \
>>>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>>>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>>>     -wt tf -seq -nr 3 --namedVector \
>>>   && \
>>>   $MAHOUT cvb \
>>>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>>>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>>>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>>>   && \
>>>   $MAHOUT ldatopics \
>>>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>>>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>>     -dt sequencefile
>>> ******************************************************************************
>>>
>>> I successfully ran the previous LDA algorithm against Reuters, but I am
>>> most interested in this new implementation of LDA because I want the new
>>> feature that generates document-to-cluster mappings (e.g., the -dt
>>> parameter).
>>>
>>> When I run the above code in Hadoop pseudo-distributed mode, as well as
>>> on a small cluster, I receive the same error from the "mahout cvb"
>>> command. All the pre-clustering logic, including sequence file and
>>> sparse vector generation, works fine, but when the cvb clustering is
>>> attempted the mappers fail with the following error in the Hadoop map
>>> task log:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
>>> to org.apache.hadoop.io.IntWritable
>>>   at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> Any help with resolving the problem would be appreciated.
>>>
>>> Dan
>>
>> --
>> -jake
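For completeness, the RowIdJob route Jake describes would slot into the script above roughly like this (an untested sketch; tf-vectors-rowid is a made-up output path, and this assumes rowid writes its IntWritable-keyed vectors to a file named "matrix" under the -o directory, alongside a "docIndex" file mapping the new integer keys back to the original Text keys):

    $MAHOUT rowid \
      -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
      -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors-rowid \
    && \
    $MAHOUT cvb \
      -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors-rowid/matrix \
      -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
      -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
      -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb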
