The title of the doc, which is the key of the SequenceFile, needs to be a number
2012/5/5 Jake Mannix <[email protected]>:
> I'm about to head to bed right now (long day, flight to and from SF in one
> day, need sleep), but the short answer is that the new LDA requires
> SequenceFile<IntWritable, VectorWritable> as input (the same disk format
> as DistributedRowMatrix), which you can get from SequenceFile<Text,
> VectorWritable> by running the RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h"
> for more details) before running CVB.
>
> Let us know if that doesn't help!
>
> On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:
>
>> I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6),
>> against the Reuters data. I just added another entry to the
>> cluster-reuters.sh example script as follows:
>>
>> ******************************************************************************
>> elif [ "x$clustertype" == "xcvb" ]; then
>>   $MAHOUT seq2sparse \
>>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>>     -wt tf -seq -nr 3 --namedVector \
>>   && \
>>   $MAHOUT cvb \
>>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>>   && \
>>   $MAHOUT ldatopics \
>>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>     -dt sequencefile
>> ******************************************************************************
>>
>> I successfully ran the previous LDA algorithm against Reuters, but I am
>> most interested in this new implementation because I want the new feature
>> that generates document-to-topic mappings (the -dt parameter).
>>
>> When I run the above in Hadoop pseudo-distributed mode, as well as on a
>> small cluster, I receive the same error from the "mahout cvb" command. All
>> of the pre-clustering logic, including sequence file and sparse vector
>> generation, works fine, but when the cvb clustering is attempted the
>> mappers fail with the following error in the Hadoop map task log:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.hadoop.io.IntWritable
>>         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> Any help with resolving the problem would be appreciated.
>>
>> Dan
>
> --
> -jake
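For reference, here is a minimal sketch of the cvb branch above with the rowid
step Jake describes inserted between seq2sparse and cvb. It assumes rowid
writes its IntWritable-keyed vectors to a matrix subdirectory under its output
path (check "$MAHOUT_HOME/bin/mahout rowid -h" for the exact layout in your
version); the ${WORK_DIR}/reuters-out-matrix path is illustrative, not taken
from the thread.

******************************************************************************
# Convert the Text-keyed tf-vectors into the SequenceFile<IntWritable,
# VectorWritable> form that cvb expects, then run cvb on the result.
# Assumption: rowid writes <output>/matrix (IntWritable -> VectorWritable)
# and <output>/docIndex (IntWritable -> Text); paths here are illustrative.
$MAHOUT rowid \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
  -o ${WORK_DIR}/reuters-out-matrix \
&& \
$MAHOUT cvb \
  -i ${WORK_DIR}/reuters-out-matrix/matrix \
  -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
  -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
  -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
******************************************************************************

If rowid does emit a docIndex file alongside the matrix, it maps the generated
integer row ids back to the original Text keys (the document names), which is
useful for joining the -dt doc-topic output back to documents.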
