Hello everyone,
I was digging through the K-means implementation on Hadoop and I'm a bit
confused by one thing, so I wanted to check.
To calculate the distance from a point to all centroids, the centroids
need to be accessible from every mapper.
So it seemed logical to me to put the centroids (a SequenceFile) into
the DistributedCache.
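Something like the sketch below is what I had in mind, using the old
org.apache.hadoop.filecache.DistributedCache API. To be clear, this is
just my own rough sketch, not code from the project: the path, the
class name, the Writable types, and the map() behavior described in the
comments are all placeholders I made up.

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

public class CachedCentroidsMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  // Driver side: before launching each iteration, register that
  // iteration's centroids SequenceFile with the cache (path is made up).
  public static void cacheCentroids(Configuration conf) {
    DistributedCache.addCacheFile(
        URI.create("hdfs:///kmeans/clusters-3/part-r-00000"), conf);
  }

  private final List<Writable> centroids = new ArrayList<Writable>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // getLocalCacheFiles() points at a copy the framework has already
    // pulled onto this node's local disk, so this read never touches HDFS.
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
    try {
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        centroids.add(WritableUtils.clone(value, conf)); // one copy per centroid
      }
    } finally {
      reader.close();
    }
  }
  // map() would then compute the distance from each input point to every
  // entry in 'centroids' and emit the id of the nearest one.
}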
But it seems that it isn't implemented that way; instead, the
SequenceFile is read like an ordinary file on HDFS. My understanding is
that the centroids file is then distributed like any other HDFS file,
so every mapper has to read it by contacting the DataNodes that hold
its blocks.
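In other words, I read the current approach as roughly the following
(again just my sketch of it, inside the same kind of mapper as above;
the "kmeans.centroids.path" configuration key is something I invented,
not the actual one):

@Override
protected void setup(Context context) throws IOException {
  Configuration conf = context.getConfiguration();
  // The driver only passes the HDFS path of the centroids file through
  // the job configuration; every mapper then opens that path itself.
  Path centroidsPath = new Path(conf.get("kmeans.centroids.path"));
  FileSystem fs = centroidsPath.getFileSystem(conf);
  // This read is served by whichever DataNodes hold the file's blocks,
  // which may well be on remote nodes.
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, centroidsPath, conf);
  // ...then the same read loop as in the sketch above...
}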
Please correct me if I'm wrong or have misread the code, but if not,
why is it done that way? Wouldn't it make more sense to use the
DistributedCache, since every mapper needs the centroids file? I guess
one drawback is that you would have to repopulate the cache in each
iteration (because the centroids change), but that still seems faster
than having every mapper read the file from HDFS.
If I'm not right, can anyone please explain how it is actually implemented?
Thanks