12 GB uncompressed. I am uploading it to S3 at the moment.

regex :)

s3://mahout-wikipedia/unigram-tfidf-vectors/part-0000[0-9]
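
The part-0000[0-9] suffix above is a Hadoop path glob rather than a full regex. A minimal sketch of resolving it programmatically, assuming the stock Hadoop FileSystem API and that S3 credentials are already configured (the class name is made up):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListVectorParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials assumed to be set via fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey.
    Path pattern = new Path("s3://mahout-wikipedia/unigram-tfidf-vectors/part-0000[0-9]");
    FileSystem fs = FileSystem.get(URI.create("s3://mahout-wikipedia/"), conf);
    FileStatus[] parts = fs.globStatus(pattern);  // [0-9] is expanded by the glob matcher
    if (parts != null) {
      for (FileStatus part : parts) {
        System.out.println(part.getPath() + "\t" + part.getLen() + " bytes");
      }
    }
  }
}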

On Mon, Mar 1, 2010 at 8:56 AM, Jake Mannix <jake.man...@gmail.com> wrote:

> What's the final size of the vectoized output?
>
>  -jake
>
> On Feb 28, 2010 6:47 PM, "Robin Anil" <robin.a...@gmail.com> wrote:
>
> Finally some good news: I tried this with a Cloudera 4-node c1.medium cluster
> on 6 GB compressed (26 GB uncompressed) Wikipedia.
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
> wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
> 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> tokenizing step             21 min
> word count                  16 min
> 1-pass tf vectorization     41 min
> df counting                 16 min
> 1-pass tfidf vectorization  23 min
>
> total                       1 hr 57 min
>
> I will try with 8 nodes and see whether it scales linearly.
>
> BTW, all of these steps are IO bound, so they might speed up further with
> map-output and job-output compression, and more so with LZO instead of Gzip.
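>
> A rough sketch of what turning that on could look like with the old-style
> JobConf API (GzipCodec ships with Hadoop; the LZO codec mentioned in the
> comment is the third-party hadoop-lzo one and is an assumption):
>
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.compress.GzipCodec;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
>
> public class CompressionSettings {
>   /** Enable map-output and job-output compression on a job configuration. */
>   public static void enableCompression(JobConf conf) {
>     // Compress the intermediate map output to cut shuffle IO.
>     conf.setCompressMapOutput(true);
>     conf.setMapOutputCompressorClass(GzipCodec.class);
>     // Swap in com.hadoop.compression.lzo.LzoCodec (third-party hadoop-lzo)
>     // here if it is installed on the cluster.
>
>     // Block-compress the final SequenceFile output as well.
>     SequenceFileOutputFormat.setCompressOutput(conf, true);
>     SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
>     SequenceFileOutputFormat.setOutputCompressionType(conf,
>         SequenceFile.CompressionType.BLOCK);
>   }
> }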
>
> The identity-mapper stage seems to be the slowest part of all these partial
> vectorization steps, and I find it redundant. Maybe we can make it a map-only
> job, increase the number of passes, and remove the extra IO needed for
> shuffling and sorting.
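>
> For what it's worth, dropping the reduce phase is just setNumReduceTasks(0);
> a minimal sketch of such a map-only pass with the old-style API (the job and
> path names are hypothetical, and IdentityMapper stands in for the real pass):
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.SequenceFileInputFormat;
> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
> import org.apache.hadoop.mapred.lib.IdentityMapper;
>
> public class MapOnlyPass {
>   public static void main(String[] args) throws Exception {
>     JobConf conf = new JobConf(MapOnlyPass.class);
>     conf.setJobName("partial-vectorization-pass");
>     conf.setInputFormat(SequenceFileInputFormat.class);
>     conf.setOutputFormat(SequenceFileOutputFormat.class);
>     conf.setMapperClass(IdentityMapper.class);  // stand-in for the real pass
>     conf.setOutputKeyClass(Text.class);
>     conf.setOutputValueClass(Text.class);
>     conf.setNumReduceTasks(0);                  // map-only: no shuffle, no sort
>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>     JobClient.runJob(conf);
>   }
> }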
>
