Ted, I've been meaning to ask you about this. Currently, we have a parameter called clusterLogFactor [1] that we multiply by the number of points seen so far.
This is (I guess) meant to mirror the k*log(n) value the paper recommends for the number of clusters, in which case clusterLogFactor should really be k (the number of clusters).

My point is that we get a numClusters parameter anyway, and I currently set it to k*log(N), where N is the total number of points at the beginning. So instead of having two confusing parameters, estimatedNumClusters and clusterLogFactor, I propose we keep just one, numClusters, with the same semantics as in BallKMeans. It's about time these were properly documented. Additionally, I'd remove the max at line 232. How about it?

[1] https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
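
To make the proposal concrete, here is a rough sketch of how a caller might derive the single numClusters value as k*log(N). The class and method names (NumClustersSketch, sketchSize) are hypothetical and are not part of StreamingKMeans or BallKMeans. The use of the natural log and rounding up are assumptions too, since neither is pinned down above.

```java
// Illustrative only: hypothetical helper, not the StreamingKMeans API.
final class NumClustersSketch {
    private NumClustersSketch() {}

    /**
     * Returns k * log(N) rounded up, where k is the number of clusters wanted
     * and N is the total number of points known at the beginning.
     * Assumptions: natural logarithm, round up to the next integer.
     */
    static int sketchSize(int k, long totalPoints) {
        if (k < 1 || totalPoints < 2) {
            throw new IllegalArgumentException("need k >= 1 and totalPoints >= 2");
        }
        return (int) Math.ceil(k * Math.log(totalPoints));
    }

    public static void main(String[] args) {
        int k = 20;           // number of clusters wanted
        long n = 1_000_000L;  // total number of points at the beginning
        // This single value would be passed as numClusters, replacing the
        // estimatedNumClusters / clusterLogFactor pair.
        int numClusters = sketchSize(k, n);
        System.out.println("numClusters = " + numClusters);
    }
}
```

With this, the caller does the k*log(N) computation once up front and the clusterer only ever sees one number.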
