Hello everyone,
I was digging through the K-means implementation on Hadoop and I'm a bit
confused by one thing, so I wanted to check.
To calculate the distance from a point to all centroids, the centroids
need to be accessible from every mapper.
So it seemed logical to me to put the centroids (a SequenceFile) into
the DistributedCache.
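Something like the sketch below is what I had in mind, using the old
org.apache.hadoop.filecache.DistributedCache API. To be clear, this is
just my own rough sketch, not code from the project: the path, the
class name, the Writable types, and the map() behavior described in the
comments are all placeholders I made up.

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

public class CachedCentroidsMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  // Driver side: before launching each iteration, register that
  // iteration's centroids SequenceFile with the cache (path is made up).
  public static void cacheCentroids(Configuration conf) {
    DistributedCache.addCacheFile(
        URI.create("hdfs:///kmeans/clusters-3/part-r-00000"), conf);
  }

  private final List<Writable> centroids = new ArrayList<Writable>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // getLocalCacheFiles() points at a copy the framework has already
    // pulled onto this node's local disk, so this read never touches HDFS.
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
    try {
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        centroids.add(WritableUtils.clone(value, conf)); // one copy per centroid
      }
    } finally {
      reader.close();
    }
  }
  // map() would then compute the distance from each input point to every
  // entry in 'centroids' and emit the id of the nearest one.
}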
But it seems that it isn't implemented that way; instead, the
SequenceFile is read like an ordinary file on HDFS. My understanding is
that the centroids file is then distributed like any other HDFS file,
so every mapper has to read it by contacting the DataNodes that hold
its blocks.
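In other words, I read the current approach as roughly the following
(again just my sketch of it, inside the same kind of mapper as above;
the "kmeans.centroids.path" configuration key is something I invented,
not the actual one):

@Override
protected void setup(Context context) throws IOException {
  Configuration conf = context.getConfiguration();
  // The driver only passes the HDFS path of the centroids file through
  // the job configuration; every mapper then opens that path itself.
  Path centroidsPath = new Path(conf.get("kmeans.centroids.path"));
  FileSystem fs = centroidsPath.getFileSystem(conf);
  // This read is served by whichever DataNodes hold the file's blocks,
  // which may well be on remote nodes.
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, centroidsPath, conf);
  // ...then the same read loop as in the sketch above...
}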
Please correct me if I'm wrong or have misread the code, but if not,
why is it done that way? Wouldn't it make more sense to use the
DistributedCache, since every mapper needs the centroids file? I guess
one drawback is that you would have to repopulate the cache in each
iteration (because the centroids change), but that still seems faster
than having every mapper read the file from HDFS.
If I'm not right, can anyone please explain how it is actually implemented?
Thanks