Hi, I'm now reading the source code of "org.apache.mahout.clustering.kmeans.RandomSeedGenerator". There may be a problem in function "buildRandom" which aims to select the random k centroid vectors from streaming records.

I'm wondering whether this algorithm is correct and I think the right algorithm is as follows:

To select the k elements from streaming resource, we put the first k elements (i=1,2,...,k) into the buffer. When the /i/th (i > k) element comes, we want to keep it with the probability of k/i. If the /i/th element is selected, then we can randomly delete an element from the buffer and add the /i/th element into the buffer.

So I think the code with red color is doubtable. If I'm wrong, please tell me. Thx!

for (Pair<Writable,VectorWritable> record
: new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(), true, conf)) {
          Writable key = record.getFirst();
          VectorWritable value = record.getSecond();
Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
          newCluster.observe(value.get(), 1);
          Text newText = new Text(key.toString());
          int currentSize = chosenTexts.size();
          if (currentSize < k) {
            chosenTexts.add(newText);
            chosenClusters.add(newCluster);
} else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
            chosenTexts.remove(indexToRemove);
            chosenClusters.remove(indexToRemove);
            chosenTexts.add(newText);
            chosenClusters.add(newCluster);
          }
        }


Reply via email to