I hadn't looked at this before, but yeah, you have a good point: it does slightly favor earlier elements. There's no reason this shouldn't be properly random, even if it doesn't make much difference in practice.
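To be concrete about what "properly random" could look like, here is a rough sketch using reservoir sampling (Algorithm R), which gives every element the same chance of being picked in a single pass. This is not the current RandomSeedGenerator code; the class name ReservoirSampler and the method sample are made up for illustration, and I'm using a plain java.util.Random as a stand-in for whatever generator we end up standardizing on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Mahout.
final class ReservoirSampler {

  private ReservoirSampler() {
  }

  /**
   * Picks up to k elements uniformly at random from items in one pass.
   * If items holds fewer than k elements, all of them are returned.
   */
  static <T> List<T> sample(Iterable<T> items, int k, Random random) {
    if (k < 0) {
      throw new IllegalArgumentException("k must be non-negative");
    }
    List<T> chosen = new ArrayList<T>(k);
    int seen = 0;
    for (T item : items) {
      seen++;
      if (chosen.size() < k) {
        // Fill the reservoir with the first k elements.
        chosen.add(item);
      } else {
        // Keep the new element with probability k / seen by drawing
        // a slot uniformly from [0, seen) and replacing only if it
        // falls inside the reservoir.
        int slot = random.nextInt(seen);
        if (slot < k) {
          chosen.set(slot, item);
        }
      }
    }
    return chosen;
  }
}
```

Calling ReservoirSampler.sample(points, k, new Random(42L)) twice over the same input returns the same selection, which is the kind of determinism that would be handy in tests.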
mahout-dev, mind if I...

1. Make the algorithm truly select k random elements?
2. Standardize on the RandomUtils class for creating random number generators? It means we can control the implementation and, better, make it deterministic at will for testing.
3. Fix up some style stuff? For instance, nothing should be both 'transient' and 'static', "== true" is redundant, etc.

On Wed, Jul 1, 2009 at 7:07 PM, Adil Aijaz<[email protected]> wrote:
> I was looking at the RandomSeedGenerator and, correct me if I am wrong, but
> it is not really random; rather it does a bunch of bernoulli trials where
> the points that are in the beginning of your data are always going to have a
> higher chance of being selected than those near the end.
>
> Maybe that's not a problem since given sufficient iterations kmeans should
> converge toward a solution. But, I thought I'd point it out in case there is
> an issue here.
>
> Adil
>
