Hi Robin,

Did you try to rerun this on EMR?
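
If you haven't, the same class should be runnable there as a plain custom
JAR step; a rough sketch, where the job jar name is whatever your Mahout
build produces and the -i/-o paths are placeholders you'd point at S3 or
HDFS:

  # Launch the sparse-vectorization driver with the same flags as in the
  # run quoted below (jar name and paths are placeholders).
  hadoop jar mahout-examples-0.3.job \
    org.apache.mahout.text.SparseVectorsFromSequenceFiles \
    -i wikipedia/ -o wikipedia-unigram/ \
    -a org.apache.mahout.analysis.WikipediaAnalyzer \
    -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3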

Sent from my phone

On Feb 28, 2010, at 9:38 PM, "Robin Anil" <robin.a...@gmail.com> wrote:

> On a Cloudera 8-node c1.medium cluster, on 6 GB compressed (26 GB
> uncompressed) Wikipedia:
>
> 32 mappers, 10 reducers (8 nodes, dunno why it is limited to 10), as
> compared to 16 mappers and 8 reducers (4 nodes).
>
> org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/
> -o wikipedia-unigram/ -a org.apache.mahout.analysis.WikipediaAnalyzer
> -chunk 512 -wt tfidf -md 3 -x 99 -ml 1 -ng 1 -w -s 3
>
> Dictionary size: 78 MB
>
> tokenizing step            9 min
> word count                 8:30 min
> 1-pass tf vectorization    20 min
> df counting                8 min
> 1-pass tfidf vectorization 9 min
>
> total                      57 min
>
> That's linear scaling :)
