What's the final size of the vectorized output?

  -jake

On Feb 28, 2010 6:47 PM, "Robin Anil" <robin.a...@gmail.com> wrote:

Finally some good news: tried this on a 4-node Cloudera cluster of c1.medium
instances, on 6 GB of compressed input (26 GB of uncompressed Wikipedia).

org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
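
In case anyone wants to drive the same run from Java instead of the shell,
roughly something like this should work (a minimal sketch, assuming
SparseVectorsFromSequenceFiles exposes the same options through its main()
entry point; flags copied verbatim from the command above):

  import org.apache.mahout.text.SparseVectorsFromSequenceFiles;

  public class VectorizeWikipedia {
    public static void main(String[] args) throws Exception {
      // Same flags as the command-line run above.
      SparseVectorsFromSequenceFiles.main(new String[] {
          "-i", "wikipedia/", "-o", "wikipedia-unigram/",
          "-a", "org.apache.mahout.analysis.WikipediaAnalyzer",
          "-chunk", "512", "-wt", "tfidf",
          "-md", "3", "-x", "99", "-ml", "1", "-ng", "1",
          "-w", "-s", "3"
      });
    }
  }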

Dictionary size: 78 MB

tokenizing step            21 min
word count                 16 min
1-pass tf vectorization    41 min
df counting                16 min
1-pass tfidf vectorization 23 min

total: 1 hr 57 min
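
For context, the df-counting pass is needed because each tfidf weight depends
on the corpus-wide document frequency, so it has to finish before the second
vectorization pass. A minimal sketch of the standard weighting (not
necessarily the exact formula Mahout applies):

  public final class TfIdf {
    // Standard tf-idf: term frequency scaled by inverse document frequency.
    // numDocs = documents in the corpus, df = documents containing the term.
    public static double weight(int tf, int df, int numDocs) {
      double idf = Math.log((double) numDocs / (1.0 + df)); // +1 guards against df = 0
      return tf * idf;
    }
  }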

I will try with 8 nodes and see if it scales linearly.

BTW, all of these jobs are IO-bound, so they might speed up further with
map-output and job-output compression, and even more so with LZO instead of Gzip.
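
If someone wants to try that, the job settings would be roughly these (a
sketch, assuming the hadoop-lzo codec is installed on the cluster; property
names are the old mapred.* ones):

  import org.apache.hadoop.mapred.JobConf;

  public final class LzoCompression {
    public static JobConf enable(JobConf conf) {
      // Compress intermediate map output to cut shuffle IO.
      conf.setBoolean("mapred.compress.map.output", true);
      conf.set("mapred.map.output.compression.codec",
               "com.hadoop.compression.lzo.LzoCodec"); // assumes hadoop-lzo is present
      // Compress the final job output as well.
      conf.setBoolean("mapred.output.compress", true);
      conf.set("mapred.output.compression.codec",
               "com.hadoop.compression.lzo.LzoCodec");
      return conf;
    }
  }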

The identity-mapper stage seems to be the slowest part of all these partial
vectorization steps, and I find it redundant. Maybe we can make it a map-only
job, increase the number of passes, and remove the extra IO needed for
shuffling and sorting.
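
Making a stage map-only is just a matter of setting zero reduce tasks, roughly
(a generic Hadoop sketch, not a patch against the Mahout code):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public final class MapOnlyJob {
    public static Job create(Configuration conf) throws Exception {
      Job job = new Job(conf, "partial-vectorization");
      // Zero reducers => map-only: no shuffle or sort, mappers write output directly.
      job.setNumReduceTasks(0);
      return job;
    }
  }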
