correction

On Mon, Mar 1, 2010 at 8:16 AM, Robin Anil <robin.a...@gmail.com> wrote:

> Finally some good news: I tried with a 4-node Cloudera cluster of c1.medium
> instances on 6 GB compressed (26 GB uncompressed) Wikipedia.
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
> wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
> 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> tokenizing step             21 min
> word count                  16 min
> 1-pass tf vectorization     41 min
> df counting                 16 min
> 1-pass tfidf vectorization  23 min
>
> total 1 hour 57 min
>
> I will try with 8 nodes and see if it scales linearly
>
> BTW, all these steps are IO bound, so they might speed up further with map
> output compression and job output compression, and more so with LZO instead
> of Gzip (a configuration sketch follows below the quoted message).
>
> The identity-mapper stage seems to be the slowest in all these partial
> vectorization steps; I find it redundant. Maybe we can make it a map-only
> job, increase the number of passes, and remove the extra IO needed for
> shuffling and sorting (a map-only sketch also follows below).
>
>
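
Below is a minimal sketch of the compression settings meant above, assuming a
Hadoop 0.20-style job configuration with the old mapred.* property names and
the hadoop-lzo codec (com.hadoop.compression.lzo.LzoCodec) installed on the
cluster; the property names and codec class are assumptions, not what the run
above actually used:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Compress the intermediate map output -- this is what the shuffle ships around.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");   // assumes hadoop-lzo is installed
    // Compress the final SequenceFile output as well; BLOCK compression suits
    // the large vector files these jobs produce.
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.set("mapred.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");

If the driver runs through Hadoop's ToolRunner, the same properties can also be
passed on the command line as -D options instead of being set in code.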

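On the map-only idea: with zero reduce tasks the map output is written straight
to HDFS, so the shuffle and sort disappear entirely. A minimal sketch using the
org.apache.hadoop.mapreduce API; PartialVectorMapper and the input/output paths
are hypothetical placeholders for the per-pass vectorization work, not existing
Mahout classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.mahout.math.VectorWritable;

    Job job = new Job(new Configuration(), "partial vectorization (map-only)");
    job.setMapperClass(PartialVectorMapper.class);  // hypothetical per-chunk mapper
    job.setNumReduceTasks(0);                       // map-only: no shuffle, no sort
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);              // document id
    job.setOutputValueClass(VectorWritable.class);  // partial vector
    FileInputFormat.addInputPath(job, new Path("wikipedia-unigram/partial"));
    FileOutputFormat.setOutputPath(job, new Path("wikipedia-unigram/vectors"));
    job.waitForCompletion(true);

The trade-off is that every extra pass re-reads its input from HDFS, so this
only helps where the shuffled data is larger than the data re-read per pass.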