12 GB uncompressed. I am uploading to S3 at the moment; the part files match the regex :)
s3://mahout-wikipedia/unigram-tfidf-vectors/part-0000[0-9]

On Mon, Mar 1, 2010 at 8:56 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> What's the final size of the vectorized output?
>
> -jake
>
> On Feb 28, 2010 6:47 PM, "Robin Anil" <robin.a...@gmail.com> wrote:
>
> Finally some good news: tried with a Cloudera 4-node c1.medium cluster on 6 GB
> compressed (26 GB uncompressed Wikipedia)
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/ -o
> wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer -chunk
> 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> tokenizing step            21 min
> word count                 16 min
> 1-pass tf vectorization    41 min
> df counting                16 min
> 1-pass tfidf vectorization 23 min
>
> total: 1 hr 57 min
>
> I will try with 8 nodes and see if it scales linearly.
>
> BTW, all of these are IO bound, so they might speed up more with map-output
> and job-output compression, and more so with LZO instead of Gzip.
>
> The identity-mapper stage seems to be the slowest of all these partial
> vectorization steps, and I find it redundant. Maybe we can make it a map-only
> job and increase the number of passes, removing the extra IO needed for
> shuffling and sorting.
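The compression idea in the quoted mail could be sketched as a Hadoop (0.20-era) configuration fragment. This is a hedged sketch, not taken from the thread: the property names are the old-style Hadoop 0.20 keys, and `com.hadoop.compression.lzo.LzoCodec` assumes the separate hadoop-lzo library is installed on every node (LZO is not bundled with Hadoop for licensing reasons).

```xml
<!-- mapred-site.xml sketch: compress intermediate (map) output and final
     job output. Assumes hadoop-lzo is deployed cluster-wide; otherwise
     fall back to org.apache.hadoop.io.compress.GzipCodec. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The same settings can be passed per-job with `-D` flags instead of editing mapred-site.xml. As for the identity-mapper point: setting a job's reduce count to zero (`conf.setNumReduceTasks(0)`) makes it map-only and skips the shuffle/sort phase entirely, which is the IO saving the mail is after.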