Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

Lijie Xu Thu, 15 Dec 2011 21:10:53 -0800

Hi, I'm now reading the source code of "org.apache.mahout.clustering.kmeans.RandomSeedGenerator". There may be a problem in the function "buildRandom", which aims to select k random centroid vectors from streaming records.

I'm wondering whether this algorithm is correct. I think the right algorithm is as follows:

To select k elements from a streaming source, we put the first k elements (i=1,2,...,k) into the buffer. When the /i/th (i > k) element arrives, we keep it with probability k/i. If the /i/th element is selected, we randomly delete one element from the buffer and add the /i/th element to the buffer.
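The steps above can be sketched in Java roughly as follows. This is only a minimal illustration of the scheme I described, not Mahout code: the class name ReservoirSampleSketch and the method sample are made up for this example, and it assumes k >= 0. Overwriting a random slot is used in place of "delete one, then add", which selects the evicted element in the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Mahout.
final class ReservoirSampleSketch {

  // Returns k items drawn from the stream, or every item if the stream has fewer than k.
  static <T> List<T> sample(Iterable<T> stream, int k, Random random) {
    List<T> buffer = new ArrayList<>(k);
    int seen = 0; // number of elements read so far (the "i" in the description)
    for (T item : stream) {
      seen++;
      if (buffer.size() < k) {
        // The first k elements always go into the buffer.
        buffer.add(item);
      } else if (random.nextInt(seen) < k) {
        // nextInt(seen) is uniform on 0..seen-1, so this branch runs with probability k/seen.
        // Evict one buffered element chosen uniformly at random and store the new one.
        buffer.set(random.nextInt(k), item);
      }
    }
    return buffer;
  }
}
```

For example, calling ReservoirSampleSketch.sample(records, 5, new Random()) on any Iterable of records returns at most 5 of them; the point of the k/i test is that every record should end up in the buffer with the same probability.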

So I think the code marked in red is questionable. If I'm wrong, please tell me. Thanks!


        for (Pair<Writable,VectorWritable> record
             : new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(), true, conf)) {
          Writable key = record.getFirst();
          VectorWritable value = record.getSecond();
          Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
          newCluster.observe(value.get(), 1);
          Text newText = new Text(key.toString());
          int currentSize = chosenTexts.size();
          if (currentSize < k) {
            chosenTexts.add(newText);
            chosenClusters.add(newCluster);
          } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
            int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
            chosenTexts.remove(indexToRemove);
            chosenClusters.remove(indexToRemove);
            chosenTexts.add(newText);
            chosenClusters.add(newCluster);
          }
        }
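
To make the concern concrete: as far as I can tell from the loop above, once the buffer holds k entries every replacement removes one and adds one, so currentSize stays at k and each later record is accepted with probability 1/(k+1), regardless of how many records have been read. The small simulation below mirrors only that selection logic on plain integers. The class QuotedLoopSimulation and its methods are hypothetical names for this sketch, and it assumes k >= 1. With n = 20 and k = 5, a uniform selection would include the last record with probability k/n = 0.25, while this logic should include it with probability 1/6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical simulation for illustration only; not part of Mahout.
final class QuotedLoopSimulation {

  // Mirrors the selection logic of the quoted loop, using the integers 0..n-1 as records.
  // Assumes k >= 1.
  static List<Integer> select(int n, int k, Random random) {
    List<Integer> chosen = new ArrayList<>();
    for (int record = 0; record < n; record++) {
      int currentSize = chosen.size();
      if (currentSize < k) {
        chosen.add(record);
      } else if (random.nextInt(currentSize + 1) == 0) {
        // currentSize equals k here, so this branch runs with probability 1/(k+1).
        int indexToRemove = random.nextInt(currentSize);
        chosen.remove(indexToRemove); // List.remove(int index)
        chosen.add(record);
      }
    }
    return chosen;
  }

  // Fraction of trials in which the last record (n - 1) ends up selected.
  static double lastRecordFrequency(int n, int k, int trials, long seed) {
    Random random = new Random(seed);
    int hits = 0;
    for (int t = 0; t < trials; t++) {
      if (select(n, k, random).contains(n - 1)) {
        hits++;
      }
    }
    return (double) hits / trials;
  }
}
```

Calling QuotedLoopSimulation.lastRecordFrequency(20, 5, 20000, 1L) gives an empirical estimate that can be compared against both 0.25 and 1/6.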
