Hi everyone, I'm interested in clustering the Wikipedia articles dump (~45 GB of XML) with Mahout's kmeans or fkmeans. Can anyone tell me what Hadoop cluster architecture is required for a job of this size? I have tried running the clustering on a cluster of 20 quad-core nodes with 32 GB of RAM each, but unfortunately without success. Is this hardware enough? How should I configure the memory for the map and reduce tasks in Hadoop? Am I doing something wrong?
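To make clear what I mean by memory configuration, below is the kind of per-node YARN/MapReduce setup I am referring to (assuming Hadoop 2.x; the property names are from the Hadoop docs, and the values are only placeholders for a 32 GB node, not necessarily what I have set):

<!-- yarn-site.xml: total memory YARN may allocate to containers on one node (placeholder) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>

<!-- mapred-site.xml: per-container memory and the matching JVM heap (placeholders) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6g</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6g</value>
</property>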
This is what my kmeans command looks like:

bin/mahout kmeans \
  -i /out-vectors/tfidf-vectors \
  -o /kmeans/clusters \
  -c /kmeans/initial \
  -xm mapreduce \
  --maxIter 8 \
  --numClusters 9000 \
  --clustering \
  --overwrite \
  -Dmapreduce.map.java.opts=-Xmx6g \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  -Dmapred.child.java.opts=-Xmx6g \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapred.reduce.tasks=160
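I am also not sure whether the ordering of the -D properties matters for the Mahout driver. The Hadoop documentation's convention is that generic options (-D and friends) come before the job-specific options, so a variant I have been wondering about would look like this (same values, only the Hadoop 2 property names, and -D options first; this is just a guess on my part, not something I know is required):

bin/mahout kmeans \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx6g \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  -Dmapreduce.job.reduces=160 \
  -i /out-vectors/tfidf-vectors \
  -o /kmeans/clusters \
  -c /kmeans/initial \
  -xm mapreduce \
  --maxIter 8 \
  --numClusters 9000 \
  --clustering \
  --overwrite

Would this make any difference, or is the problem elsewhere?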