Hello everyone,

I was digging through the K-means implementation on Hadoop and I'm a bit confused about one thing, so I wanted to check.

To calculate the distance from a point to all centroids, the centroids need to be accessible from every mapper. So it seemed logical to me to put the centroids (a SequenceFile) into the Distributed Cache.
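
Something like this is what I had in mind; the paths, class names, and key/value types below are just my own placeholders, not the actual implementation:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansCacheSketch {

    // Driver side: register the current centroids file before submitting
    // the job, so the framework copies it once to the local disk of every
    // task node. The path is a made-up example.
    public static Job buildIterationJob(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/kmeans/centroids/part-r-00000"), conf);
        return new Job(conf, "kmeans-iteration");
    }

    // Mapper side: pick up the local copy once, in setup(), instead of
    // touching HDFS from every task.
    public static class PointMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // cached[0] is now a plain local file on this node;
            // parse the centroids from it here.
        }
    }
}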

But it doesn't seem to be implemented that way; instead, the SequenceFile is read as an ordinary file on HDFS. My understanding is that the centroids file is then distributed like any other HDFS file, so every mapper has to read it by contacting the datanodes that hold its blocks.
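
As far as I can tell from the code, the pattern is closer to this (again my own sketch, with placeholder key/value types, not the real classes):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class KMeansHdfsReadSketch {

    // Each mapper opens the centroids SequenceFile directly on HDFS, so the
    // blocks are fetched over the network from whichever datanodes hold them.
    // Text/Text here is a placeholder; the real file stores cluster objects.
    public static List<String> readCentroids(Configuration conf, Path centroidsFile)
            throws IOException {
        List<String> centroids = new ArrayList<String>();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, centroidsFile, conf);
        try {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                centroids.add(value.toString());
            }
        } finally {
            reader.close();
        }
        return centroids;
    }
}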

Please correct me if I'm wrong or have misread the code, but if not, why is it done this way? Wouldn't it make more sense to use the Distributed Cache, since every mapper needs the centroids file? I guess one drawback would be that you need to re-copy the file to the Distributed Cache in each iteration (because the centroids change), but that still seems faster than having every mapper read it over HDFS.
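
The driver loop I'd imagine would look roughly like this; the directory layout and iteration count are made up:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class KMeansDriverSketch {

    // Hypothetical driver loop: every iteration produces a new centroids
    // file, so the cache entry would have to be re-registered each time.
    public static void main(String[] args) throws Exception {
        int maxIterations = 10;
        for (int i = 0; i < maxIterations; i++) {
            Configuration conf = new Configuration();
            URI centroids = new URI("/kmeans/clusters-" + i + "/part-r-00000");
            DistributedCache.addCacheFile(centroids, conf);
            // ... configure and run the iteration job here; it writes the
            // updated centroids to /kmeans/clusters-(i+1)/ for the next pass.
        }
    }
}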

If I'm not right, can anyone please explain how it is really implemented?

Thanks

