I'm about to head to bed right now (long day, flight to and from SF in one
day, need sleep), but the short answer is
that the new LDA requires SequenceFile<IntWritable, VectorWritable> as
input (the same disk format
as DistributedRowMatrix), which you can get out of SequenceFile<Text,
VectorWritable> by running the
RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h" for more details) before
running CVB.
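Untested sketch of how that could slot into your cluster-reuters.sh snippet
(the reuters-out-matrix-cvb directory name is just something I made up for
illustration, and I'm going from memory that RowIdJob writes its
IntWritable-keyed vectors to a "matrix" path next to a "docIndex", so
double-check against "mahout rowid -h" on 0.6):

  # Assign integer row ids to the Text-keyed tf-vectors; this writes the
  # IntWritable-keyed matrix plus a docIndex mapping ids back to doc names
  $MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-out-matrix-cvb

  # Point cvb at the matrix part of the rowid output instead of tf-vectors
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-matrix-cvb/matrix \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb

The docIndex that rowid leaves next to the matrix should give you the
mapping from the new integer row ids back to your original document names,
which you'll want when reading the -dt output.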
Let us know if that doesn't help!

On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:

> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
> against the Reuters data. I just added another
> entry to the cluster-reuters.sh example script as follows:
>
> ******************************************************************************
> elif [ "x$clustertype" == "xcvb" ]; then
>   $MAHOUT seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>   && \
>   $MAHOUT ldatopics \
>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -dt sequencefile
> ******************************************************************************
>
> I successfully ran the previous LDA algorithm against Reuters but I am
> most interested in this new implementation of LDA because I want the new
> feature that generates document-to-cluster mappings (e.g., parameter -dt).
>
> When I run the above code via Hadoop pseudo-distributed mode as well as on
> a small cluster I receive the same error from the "mahout cvb" command.
> All the pre-clustering logic including sequence file and sparse vector
> generation works fine but when the cvb clustering is attempted the mappers
> fail with the following error in the Hadoop map task log:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Any help with resolving the problem would be appreciated.
>
> Dan

--
  -jake
