Roughly.

But it also gives you a small-ish surrogate for your data that would let
you use all kinds of different clustering methods since the surrogate fits
in memory.
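
To make the surrogate idea concrete, here is a minimal sketch. It assumes a
simplified, fixed-threshold single pass rather than the real streaming k-means
(which adapts its threshold as it goes); the names build_surrogate,
weighted_kmeans and threshold are made up for illustration and are not Mahout
APIs. The pass folds each point into a nearby weighted centroid, and any
in-memory method (here a plain weighted Lloyd iteration) can then run on the
much smaller set of (centroid, weight) pairs.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_surrogate(points, threshold):
    """One pass over the data (hypothetical helper, fixed threshold).

    Each point is merged into the nearest sketch centroid within
    `threshold`; otherwise it starts a new centroid.
    Returns a list of (centroid, weight) pairs.
    """
    sketch = []  # entries are [centroid_as_list, weight]
    for p in points:
        best, best_d = None, threshold
        for entry in sketch:
            d = dist(entry[0], p)
            if d <= best_d:
                best, best_d = entry, d
        if best is None:
            sketch.append([list(p), 1])
        else:
            c, w = best
            # Running mean keeps the centroid equal to the mean of its points.
            best[0] = [(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)]
            best[1] = w + 1
    return [(tuple(c), w) for c, w in sketch]


def weighted_kmeans(surrogate, k, iters=20, seed=0):
    """Weighted Lloyd iterations on the in-memory surrogate."""
    rng = random.Random(seed)
    centers = [c for c, _ in rng.sample(surrogate, k)]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for c, w in surrogate:
            j = min(range(k), key=lambda i: dist(centers[i], c))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * c[d]
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] else centers[j]
            for j in range(k)
        ]
    return centers
```

Calling build_surrogate(points, threshold) once and then weighted_kmeans on its
result with different values of k is cheap, since only the surrogate has to be
rescanned.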

On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> This is why canopy has been frustrating: by varying t I had hoped to
> generate these levels of specificity, then replace hierarchical
> clustering with a similarity measure. In other words L1 has 1000 docs per
> cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest to L1
> clusters (that's all the user wants to see in my case) and reference the 10
> L2 clusters nearest by centroid similarity using rowsimilarity to
> calculate. I'm hoping that this is a useful way to browse the information
> space.
>
> Naively speaking your streaming k seems to have elements of this built in.
>
