I ran cvb on AWS (Mahout 0.7 and Amazon's Hadoop 1.0.3).

I'm running it with:
hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
cvb \
-i /lda/matrix-converted/matrix \
-o s3n://.../lda/results \
-dict /lda/dictionary.file-0 \
-dt s3n://.../lda/doc-topics \
-k 10 -x 10

The dictionary has around 1,000,000 terms.
The input has around 600,000 documents (it's a 70MB file) with 10-100 
terms each. 
I created the matrix file with a block size of 1MB. Each iteration of CVB 
uses 70 mappers, and each mapper takes close to an hour to run.
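
As a rough sanity check on where the 70 mappers come from, here is a small sketch of the arithmetic. It assumes Hadoop creates about one input split (and so one mapper) per block, which may not hold for every input format or split setting. The function names are made up for illustration and are not part of Mahout or Hadoop.

```python
MB = 1024 * 1024


def estimate_splits(file_size_bytes, block_size_bytes):
    """Estimate the number of input splits, assuming one split per block.

    Uses integer ceiling division so a partial last block still counts.
    """
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def docs_per_mapper(num_docs, num_mappers):
    """Approximate number of documents each mapper handles (rounded down)."""
    return num_docs // num_mappers


# Figures from this post: a ~70MB matrix file written with a 1MB block size,
# holding around 600,000 documents.
splits = estimate_splits(70 * MB, 1 * MB)
print(splits)                            # 70
print(docs_per_mapper(600_000, splits))  # 8571
```

Under that assumption each mapper sees only about 8,500 short documents, which is why an hour per mapper looks slow to me.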

Is this expected performance under these conditions? Are there any parameters I 
can tune?

David
