What's the final size of the vectorized output? -jake
On Feb 28, 2010 6:47 PM, "Robin Anil" <robin.a...@gmail.com> wrote:

Finally some good news: tried it with a Cloudera 4-node c1.medium cluster on 6 GB compressed (26 GB uncompressed) Wikipedia:

org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3

Dictionary size: 78 MB
tokenizing step: 21 min
word count: 16 min
1-pass tf vectorization: 41 min
df counting: 16 min
1-pass tfidf vectorization: 23 min
total: 1 hr 57 min

I will try with 8 nodes and see if it scales linearly. BTW, all of these steps are IO bound, so they might speed up more with map output compression and job output compression, and more so with LZO instead of Gzip. The identity mapper stage seems to be the slowest in all of these partial vectorization steps, and I find it redundant. Maybe we can make it a map-only job, increase the number of passes, and remove the extra IO needed for shuffling and sorting.
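For reference, here is a minimal sketch of the kind of tuning being suggested above, using the Hadoop 0.20-era configuration property names. It is not taken from the actual Mahout vectorizer jobs: the helper class and method are hypothetical, and the LzoCodec class assumes the separately installed hadoop-lzo package (stock Hadoop only ships Gzip/Deflate codecs).

  import org.apache.hadoop.mapred.JobConf;

  public class VectorizerJobTuningSketch {
    // Hypothetical helper: applies the compression and map-only tweaks
    // discussed above to a job configuration before it is submitted.
    public static JobConf tune(JobConf conf) {
      // Compress intermediate map output to cut shuffle/sort IO.
      conf.setBoolean("mapred.compress.map.output", true);
      conf.set("mapred.map.output.compression.codec",
               "com.hadoop.compression.lzo.LzoCodec"); // assumes hadoop-lzo is installed
      // Compress the job's final SequenceFile output as well.
      conf.setBoolean("mapred.output.compress", true);
      conf.set("mapred.output.compression.type", "BLOCK");
      conf.set("mapred.output.compression.codec",
               "com.hadoop.compression.lzo.LzoCodec");
      // For a stage that only needs an identity mapper, dropping the
      // reduce phase skips the shuffle and sort entirely.
      conf.setNumReduceTasks(0);
      return conf;
    }
  }

Whether the map-only change is safe depends on whether the stage actually relies on the sorted, grouped output of the shuffle; if it does, only the compression settings apply.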