I don't know where the timeout is happening, but each mapper and each reducer writes all its clusters out at the end of its run. With a large number of clusters, and with the non-sparse center and radius vectors that tend to accumulate, this could take a while...
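If the final cluster write is what's timing out, one thing worth trying (an assumption on my part, not something I've verified against your setup) is raising the HDFS client socket timeouts in hdfs-site.xml. The 69000 ms in your stack trace is the client read timeout, which is derived from dfs.socket.timeout; the matching write-side setting is dfs.datanode.socket.write.timeout. Something like:

```xml
<!-- hdfs-site.xml: illustrative values only; the stock defaults are
     60000 ms for the read timeout and 480000 ms for the write timeout -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>180000</value>
</property>
```

You'd need to restart the relevant daemons (or pass these via -D on the job) for the change to take effect, and of course this only papers over the symptom if the underlying cause is the sheer volume of dense center/radius vectors being written.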

On 3/8/13 9:46 AM, Colum Foley wrote:
Hi All,

When I run KMeans clustering on a cluster, I notice that when I have
"large" values for k (i.e., approx. >1000) I get lots of Hadoop write
errors:

  INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch : java.nio.channels.SocketChannel

This continues indefinitely, and lots of part-0xxxxx files of around
30 KB each are produced.

If I reduce the value of k, it runs fine. Furthermore, if I run it in
local mode with high values of k, it also runs fine.

The command I am using is as follows:

mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
--clusters tmp -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
1.0 -x 20 -cl -k 10000

I am running mahout 0.7.

Are there some performance parameters I need to tune for mahout when
dealing with large volumes of data?

Thanks,
Colum
