Well, the point of clusterLogFactor was original to be c in the expression
c * k * log N.  The idea was that this would normally be in the range from
1 to 3 and would be a fudge factor to increase k log N.  This is useful
where we know k, but not N.

On the other hand, in some cases we can just pick the maximum number of
sketch clusters and we just want to set it so.  This is very useful in
testing.  This number is almost always larger than the value of c * k * log
N so the max is pretty decent as a way of choosing between the two.

I agree that this should be normalized and documented, but I think we
probably will want two behaviors until we really understand how to pick an
adaptive limit.  I am happy if you have some way to expressive this well.

On Wed, Jan 2, 2013 at 3:38 PM, Dan Filimon <[email protected]>wrote:

> Ted, I've been meaning to ask you about this.
> Currently, we have a parameter called clusterLogFactor [1] that we
> multiply by the number of points seen so far.
>
> This is (I guess) meant to behave like the k*log(n) recommended value
> for the number of clusters in the paper. So, clusterLogFactor should
> actually be k (the number of clusters).
>
> What I'm saying here is...
>
> We get a numClusters parameter anyway. Currently I set this to
> k*log(N) (where N is the total number of points at the beginning).
>
> I propose that instead of having two confusing parameters:
> estimatedNumClusters and clusterLogFactor, to just have one,
> numClusters that has the same semantics as in BallKMeans.
> It's about time these were properly documented.
>
> Additionally, I'd remove the max at line 232.
>
> How about it?
>
> [1]
> https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
>

Reply via email to