Hi, I'm now reading the source code of
"org.apache.mahout.clustering.kmeans.RandomSeedGenerator".
There may be a problem in function "buildRandom" which aims to select
the random k centroid vectors from streaming records.
I'm wondering whether this algorithm is correct and I think the right
algorithm is as follows:
To select the k elements from streaming resource, we put the first k
elements (i=1,2,...,k) into the buffer.
When the /i/th (i > k) element comes, we want to keep it with the
probability of k/i.
If the /i/th element is selected, then we can randomly delete an element
from the buffer and add the /i/th element into the buffer.
So I think the code with red color is doubtable. If I'm wrong, please
tell me. Thx!
for (Pair<Writable,VectorWritable> record
: new
SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
Writable key = record.getFirst();
VectorWritable value = record.getSecond();
Cluster newCluster = new Cluster(value.get(),
nextClusterId++, measure);
newCluster.observe(value.get(), 1);
Text newText = new Text(key.toString());
int currentSize = chosenTexts.size();
if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
} else if (random.nextInt(currentSize + 1) == 0) { // with
chance 1/(currentSize+1) pick new element
int indexToRemove = random.nextInt(currentSize); // evict
one chosen randomly
chosenTexts.remove(indexToRemove);
chosenClusters.remove(indexToRemove);
chosenTexts.add(newText);
chosenClusters.add(newCluster);
}
}