Hi Robin,

Did you try to rerun this on EMR?
Sent from my phone

On Feb 28, 2010, at 9:38 PM, "Robin Anil" <robin.a...@gmail.com> wrote:

> On a Cloudera 8-node c1.medium cluster, on 6 GB compressed (26 GB uncompressed
> Wikipedia), with 32 mappers and 10 reducers (8 nodes, dunno why it is limited
> to 10), as compared to 16 mappers and 8 reducers (4 nodes):
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
> wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
> 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> Tokenizing step: 9 min
> Word count: 8:30 min
> 1-pass tf vectorization: 20 min
> df counting: 8 min
> 1-pass tfidf vectorization: 9 min
>
> Total: 57 min
>
> That's linear scaling :)
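If it helps anyone reproduce this, the same class and flags from Robin's mail can be launched with a plain hadoop jar call; the job jar name below is only a guess (use whatever your Mahout build produces), and it assumes the Wikipedia dump is already on HDFS/S3 in SequenceFile form:

  # Same arguments as in the quoted run; only the jar name/path is assumed.
  hadoop jar mahout-examples-0.3.job \
      org.apache.mahout.text.SparseVectorsFromSequenceFiles \
      -i wikipedia/ -o wikipedia-unigram/ \
      -a org.apache.mahout.analysis.WikipediaAnalyzer \
      -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3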