Hi everyone, I'm interested in clustering the Wikipedia articles dump (~45 GB of XML) with Mahout's kmeans or fkmeans. Can anyone tell me what Hadoop cluster architecture is required for a job of this size? I have tried running the clustering on a cluster of 20 quad-core nodes with 32 GB of RAM each, but unfortunately without success. Is this hardware enough? How should I configure the memory for the map and reduce tasks in Hadoop? Am I doing something wrong?
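To make clear what I mean by memory configuration, below is the kind of per-node YARN/MapReduce setup I am referring to (assuming Hadoop 2.x; the property names are from the Hadoop docs, and the values are only placeholders for a 32 GB node, not necessarily what I have set):

<!-- yarn-site.xml: total memory YARN may allocate to containers on one node (placeholder) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>

<!-- mapred-site.xml: per-container memory and the matching JVM heap (placeholders) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6g</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6g</value>
</property>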
This is what my kmeans command looks like:

bin/mahout kmeans \
  -i /out-vectors/tfidf-vectors \
  -o /kmeans/clusters \
  -c /kmeans/initial \
  -xm mapreduce \
  --maxIter 8 \
  --numClusters 9000 \
  --clustering \
  --overwrite \
  -Dmapreduce.map.java.opts=-Xmx6g \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  -Dmapred.child.java.opts=-Xmx6g \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapred.reduce.tasks=160
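I am also not sure whether the ordering of the -D properties matters for the Mahout driver. The Hadoop documentation's convention is that generic options (-D and friends) come before the job-specific options, so a variant I have been wondering about would look like this (same values, only the Hadoop 2 property names, and -D options first; this is just a guess on my part, not something I know is required):

bin/mahout kmeans \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx6g \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  -Dmapreduce.job.reduces=160 \
  -i /out-vectors/tfidf-vectors \
  -o /kmeans/clusters \
  -c /kmeans/initial \
  -xm mapreduce \
  --maxIter 8 \
  --numClusters 9000 \
  --clustering \
  --overwrite

Would this make any difference, or is the problem elsewhere?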