Lijie,
You are correct. This code is in error. The mailing list lost your
coloring, but your point is still there.
I think that the code should be this instead. Ironically, the comment in
the original code describes what the code does accurately.
int itemsSeenSoFar = 0;
for (Pair<Writable,VectorWritable> record
: new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
itemsSeenSoFar++;
Writable key = record.getFirst();
VectorWritable value = record.getSecond();
Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
newCluster.observe(value.get(), 1);
Text newText = new Text(key.toString());
int currentSize = chosenTexts.size();
if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
} else {
int itemToReplace = random.nextInt(itemsSeenSoFar);
chosenTexts.set(itemToReplace, newText);
chosenClusters.set(itemToReplace, newCluster);
}
}
On Thu, Dec 15, 2011 at 9:10 PM, Lijie Xu <[email protected]> wrote:
> Hi, I'm now reading the source code of "org.apache.mahout.clustering.**
> kmeans.RandomSeedGenerator".
> There may be a problem in function "buildRandom" which aims to select the
> random k centroid vectors from streaming records.
>
> I'm wondering whether this algorithm is correct and I think the right
> algorithm is as follows:
>
> To select the k elements from streaming resource, we put the first k
> elements (i=1,2,...,k) into the buffer.
> When the /i/th (i > k) element comes, we want to keep it with the
> probability of k/i.
> If the /i/th element is selected, then we can randomly delete an element
> from the buffer and add the /i/th element into the buffer.
>
> So I think the code with red color is doubtable. If I'm wrong, please tell
> me. Thx!
>
> for (Pair<Writable,VectorWritable> record
> : new SequenceFileIterable<Writable,**
> VectorWritable>(fileStatus.**getPath(), true, conf)) {
> Writable key = record.getFirst();
> VectorWritable value = record.getSecond();
> Cluster newCluster = new Cluster(value.get(), nextClusterId++,
> measure);
> newCluster.observe(value.get()**, 1);
> Text newText = new Text(key.toString());
> int currentSize = chosenTexts.size();
> if (currentSize < k) {
> chosenTexts.add(newText);
> chosenClusters.add(newCluster)**;
> } else if (random.nextInt(currentSize + 1) == 0) { // with chance
> 1/(currentSize+1) pick new element
> int indexToRemove = random.nextInt(currentSize); // evict one
> chosen randomly
> chosenTexts.remove(**indexToRemove);
> chosenClusters.remove(**indexToRemove);
> chosenTexts.add(newText);
> chosenClusters.add(newCluster)**;
> }
> }
>
>
>