What Dan says here is correct.  The lack of dependence on k in the current
code is definitely a problem.

The work-around is to set maxClusters to the value the log factor should
have grown to by the end of the stream.  That sucks, so we should fix the
heuristic sizing along the lines Dan suggests.  There should still be a way
to force the size you want, but it should not be the primary API.
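A minimal sketch of the sizing Dan proposes, with the cluster budget growing as k * clusterLogFactor * log(n) rather than staying fixed.  The class name, method name, and the guard for small n are all illustrative assumptions, not Mahout's actual API:

```java
// Illustrative sketch only -- not the real streaming k-means code.
public class ClusterGrowth {
  // Grow the allowed number of clusters with the data size, per the
  // k log n bound from the paper.  clusterLogFactor is the fudge
  // factor Dan mentions (currently 10, possibly too high).
  static long estimatedNumClusters(long k, double clusterLogFactor,
                                   long numProcessedDatapoints) {
    if (numProcessedDatapoints < 2) {
      // log(n) would be <= 0 for the first points; fall back to k.
      return k;
    }
    return (long) Math.ceil(
        k * clusterLogFactor * Math.log(numProcessedDatapoints));
  }
}
```

With k = 10 and clusterLogFactor = 10, this gives a budget of roughly 690 clusters after 1000 points, growing slowly thereafter, instead of a constant cap.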

On Thu, Dec 13, 2012 at 3:10 PM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> From what I can tell though, this doesn't increase fast enough to get
> a good sketch, since we're only getting points in sequentially, so that
> log will be small initially.
> I'm also unsure why we use that particular estimate. The paper just
> uses k log n, but we don't have the real k (although the initial
> numClusters might be that). I want to try:
> estimatedNumClusters = estimatedNumClusters * clusterLogFactor *
> log(numProcessedDatapoints)
>
> This in my mind is closer to what the paper says, although that
> clusterLogFactor (which is 10) might be a bit too high.
> Long story short, it's kind of ambiguous what numClusters actually
> means when you call streaming k-means initially, and that needs some
> clarification.
>
