> On a Cloudera cluster of 8 c1.medium nodes, on 6 GB of compressed Wikipedia data (26 GB uncompressed)

> 32 mappers, 10 reducers (8 nodes; not sure why it is capped at 10), compared to 16 mappers and 8 reducers (4 nodes)
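The 10-reducer figure may just be the job-level default coming from the cluster configuration rather than anything Mahout picks. A minimal sketch of where those numbers typically come from on an MRv1 cluster of that vintage (plain Hadoop API, not the Mahout driver; the property value and the override below are assumptions, not what this cluster was actually running):

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Cluster-side cap: concurrent reduce slots per TaskTracker, set in
        // mapred-site.xml on each node (typically 2, so 8 nodes => 16 slots).
        int slotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        // Job-side request: how many reduce tasks the job asks for; Hadoop reads
        // "mapred.reduce.tasks" unless the job sets it explicitly.
        conf.setNumReduceTasks(16); // hypothetical override of the 10 seen above

        System.out.println(slotsPerNode + " reduce slots/node, "
                + conf.getNumReduceTasks() + " reducers requested");
    }
}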

org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
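The same invocation, annotated flag by flag as I read the seq2sparse options. The programmatic call is only a sketch, assuming org.apache.mahout.text.SparseVectorsFromSequenceFiles exposes the same main() the bin/mahout launcher calls:

import org.apache.mahout.text.SparseVectorsFromSequenceFiles;

public class RunSeq2Sparse {
    public static void main(String[] unused) throws Exception {
        String[] args = {
            "-i", "wikipedia/",                                    // input SequenceFile directory
            "-o", "wikipedia-unigram/",                            // output directory
            "-a", "org.apache.mahout.analysis.WikipediaAnalyzer",  // Lucene analyzer used for tokenizing
            "-chunk", "512",                                       // dictionary chunk size, in MB
            "-wt", "tfidf",                                        // weighting scheme
            "-md", "3",                                            // minimum document frequency for a term
            "-x", "99",                                            // maximum document frequency, as a percentage
            "-ml", "1",                                            // minimum log-likelihood ratio for n-grams
            "-ng", "1",                                            // maximum n-gram size (unigrams only)
            "-w",                                                  // kept verbatim from the original command
            "-s", "3"                                              // minimum support (term frequency) for the dictionary
        };
        // Same entry point the command line above runs.
        SparseVectorsFromSequenceFiles.main(args);
    }
}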

> Dictionary size: 78 MB

> tokenizing step: 9 min
> word count: 8:30 min
> 1-pass TF vectorization: 20 min
> df counting: 8 min
> 1-pass TF-IDF vectorization: 9 min
>
> total: 57 min
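For context on the last two passes: the df-counting pass collects per-term document frequencies, and the final pass rescales each term frequency by an inverse-document-frequency factor. A minimal sketch of Lucene-DefaultSimilarity-style TF-IDF weighting, which is my understanding of what the tfidf option applies (illustration only, with made-up numbers, not Mahout's actual class):

public class TfIdfSketch {
    // Lucene DefaultSimilarity-style weight: sqrt(tf) * (1 + ln(numDocs / (df + 1)))
    static double tfIdf(int tf, int df, int numDocs) {
        double tfPart = Math.sqrt(tf);
        double idfPart = 1.0 + Math.log((double) numDocs / (df + 1));
        return tfPart * idfPart;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term appearing 4 times in a document,
        // and in 1000 out of 4000000 articles.
        System.out.println(tfIdf(4, 1000, 4000000));
    }
}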

That's linear scaling :)
