I am running cvb on AWS (Mahout 0.7 and Amazon's Hadoop 1.0.3) with the following command:

    hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
      cvb \
      -i /lda/matrix-converted/matrix \
      -o s3n://.../lda/results \
      -dict /lda/dictionary.file-0 \
      -dt s3n://.../lda/doc-topics \
      -k 10 -x 10

The dictionary has around 1,000,000 terms. The input matrix has around 600,000 documents (it's a 70MB file), each with 10-100 terms. I created the matrix file with a block size of 1MB. Each iteration of CVB uses 70 mappers, and each mapper takes close to an hour to run.

Is this expected performance under these conditions? Are there any parameters I can tune?

David
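
P.S. For reference, here is a rough back-of-the-envelope sketch of where the 70 mappers come from and how much work each one gets. It uses only the figures quoted in this post plus two assumptions that I have not verified against the Mahout source: that Hadoop creates one input split (and so one mapper) per block, and that the topic-term model is held as a dense matrix of 8-byte doubles. The variable names are mine, for illustration only.

```python
import math

# Figures quoted in this post.
input_size_mb = 70        # size of the matrix file
block_size_mb = 1         # block size used when writing the matrix file
num_docs = 600_000        # approximate document count
num_terms = 1_000_000     # approximate dictionary size
num_topics = 10           # the -k 10 argument

# Assumption: one input split (and so one mapper) per block.
num_mappers = math.ceil(input_size_mb / block_size_mb)
docs_per_mapper = num_docs / num_mappers

# Assumption: the topic-term model is a dense matrix of 8-byte doubles.
model_entries = num_topics * num_terms
model_size_mb = model_entries * 8 / 1_000_000

print(f"mappers: {num_mappers}")
print(f"docs per mapper: {docs_per_mapper:.0f}")
print(f"dense model size: {model_size_mb:.0f} MB")
```

If those assumptions hold, each mapper handles only about 8,600 documents, while the model it works against has 10 million entries, so the per-mapper input looks small compared with the model.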