Correction:

On Mon, Mar 1, 2010 at 8:16 AM, Robin Anil <robin.a...@gmail.com> wrote:
> Finally some good news: tried with a Cloudera 4-node c1.medium cluster on
> 6 GB compressed (26 GB uncompressed Wikipedia).
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
> wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
> 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> tokenizing step               21 min
> word count                    16 min
> 1-pass tf vectorization       41 min
> df counting                   16 min
> 1-pass tfidf vectorization    23 min
>
> total                  1 hour 57 min
>
> I will try with 8 nodes and see if it scales linearly.
>
> BTW, all of these steps are IO bound, so they might speed up more with
> mapper output and job output compression, and even more so with LZO
> instead of Gzip.
>
> The identity mapper stage seems to be the slowest in all these partial
> vectorization steps, and I find it redundant. Maybe we can make it a
> map-only job, increase the number of passes, and remove the extra IO
> needed for shuffling and sorting.
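Regarding the compression and map-only suggestions in the quoted mail, here is a minimal, hypothetical sketch (not actual Mahout code) of what that could look like on a Hadoop 0.20-era job. It assumes the LzoCodec class from the separate hadoop-lzo / hadoop-gpl-compression package is on the classpath; GzipCodec ships with Hadoop itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical sketch: enable map-output and job-output compression, and
// run a pass as a map-only job so the shuffle/sort is skipped entirely.
public class VectorizerJobSketch {

  public static Job configure(Configuration conf) throws Exception {
    // Compress the intermediate data the mappers write for the shuffle.
    conf.setBoolean("mapred.compress.map.output", true);
    // LzoCodec is an assumption here (external hadoop-lzo package).
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");

    // Compress the final SequenceFile output as well.
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type",
             SequenceFile.CompressionType.BLOCK.toString());
    conf.set("mapred.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");

    Job job = new Job(conf, "partial vectorization (sketch)");
    // With zero reducers the job is map-only: mapper output goes straight
    // to HDFS and the shuffle/sort IO disappears.
    job.setNumReduceTasks(0);
    return job;
  }
}

If the driver runs through ToolRunner, the same compression properties could also be passed as -D options on the command line (e.g. -Dmapred.compress.map.output=true) without touching any code.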