I don't know where the timeout is happening, but each mapper and each
reducer writes all its clusters out at the end of its run. With a large
number of clusters, and with the non-sparse center and radius vectors
that tend to accumulate, this could take a while...
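To get a feel for the scale, here's a rough back-of-envelope sketch in Python. The feature dimension is an assumption (it isn't stated in the thread); plug in your own:

```python
# Rough estimate of per-task output when cluster centers are dense.
# k comes from the command line below; d is a hypothetical dimension.
k = 10_000            # number of clusters (-k 10000)
d = 100_000           # ASSUMED feature dimension -- substitute yours
bytes_per_double = 8  # one double per vector component

total = k * d * bytes_per_double  # bytes for k dense center vectors
print(f"~{total / 2**30:.1f} GiB of center data per map/reduce task")
```

With those (assumed) numbers each task would be pushing several GiB through HDFS at the end of its run, which would make socket timeouts on block writes unsurprising.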
On 3/8/13 9:46 AM, Colum Foley wrote:
Hi All,
When I run KMeans clustering on a cluster, I notice that when I have
"large" values for k (i.e., approximately >1000) I get loads of Hadoop
write errors:
INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch : java.nio.channels.SocketChannel
This continues indefinitely, and lots of part-0xxxxx files are
produced, each around 30 KB in size.
If I reduce the value for k it runs fine. Furthermore, if I run it in
local mode with high values of k it runs fine.
The command I am using is as follows:
mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
--clusters tmp -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
1.0 -x 20 -cl -k 10000
I am running Mahout 0.7.
Are there some performance parameters I need to tune in Mahout when
dealing with large volumes of data?
Thanks,
Colum