> > On Cloudera, 8 nodes of c1.medium, on 6 GB compressed (26 GB uncompressed) Wikipedia:
> 32 mappers, 10 reducers on 8 nodes (not sure why it is limited to 10), compared to 16 mappers and 8 reducers on 4 nodes:
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
> Tokenizing step: 9 min
> Word count: 8:30 min
> 1-pass TF vectorization: 20 min
> DF counting: 8 min
> 1-pass TF-IDF vectorization: 9 min
> Total: 57 min
>
> That's linear scaling :)
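
For context, a run like the one quoted above is typically launched through Hadoop with the Mahout job jar. The sketch below keeps the class name and every flag verbatim from the report; the jar filename and the HDFS input/output paths are placeholder assumptions that depend on the Mahout build installed on the cluster, and the input directory is expected to already contain the Wikipedia dump converted to SequenceFiles.

```sh
# Minimal sketch of launching the quoted run; the jar name and HDFS paths are
# placeholders, while the class name and flags are copied from the report.
#   -a      analyzer class used to tokenize the Wikipedia text
#   -chunk  dictionary chunk size in MB
#   -wt     term weighting scheme (tfidf)
#   -md     minimum document frequency for a term to be kept
#   -x      maximum document-frequency percentage
#   -ml     minimum log-likelihood ratio (only relevant for n-grams > 1)
#   -ng     maximum n-gram size (1 = unigrams)
#   -s      minimum term support
#   -w      kept exactly as given in the original command
hadoop jar mahout-examples-job.jar \
  org.apache.mahout.text.SparseVectorsFromSequenceFiles \
  -i wikipedia/ -o wikipedia-unigram/ \
  -a org.apache.mahout.analysis.WikipediaAnalyzer \
  -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
```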