Roughly. But it also gives you a small-ish surrogate for your data that would let you use all kinds of different clustering methods since the surrogate fits in memory.
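
To make the surrogate idea concrete, here is a rough sketch in Python. It is not Mahout's streaming k-means, just a simplified single-pass stand-in that assumes a fixed distance threshold. The names `streaming_sketch`, `dist`, and `threshold` are made up for illustration. It reduces the data to a short list of weighted centroids, and any in-memory clustering method can then be run on that list.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def streaming_sketch(points, threshold):
    """One pass over the data; returns a list of (centroid, weight) pairs.

    Simplifying assumption: a point farther than `threshold` from every
    existing centroid starts a new centroid; otherwise it is folded into
    the nearest one as a running weighted mean.
    """
    centroids = []  # each entry is [coordinates, weight]
    for p in points:
        best, best_d = None, float("inf")
        for c in centroids:
            d = dist(p, c[0])
            if d < best_d:
                best, best_d = c, d
        if best is None or best_d > threshold:
            centroids.append([list(p), 1])
        else:
            w = best[1]
            best[0] = [(w * x + y) / (w + 1) for x, y in zip(best[0], p)]
            best[1] = w + 1
    return [(tuple(c), w) for c, w in centroids]

# Toy data: two tight groups, far apart.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
sketch = streaming_sketch(points, threshold=1.0)
```

The weights record how many original points each centroid stands for, so a weighted clustering of `sketch` approximates a clustering of the full data.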

On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> This is why canopy has been frustrating because by varying t I would have
> hoped to generate these levels of specificity, then replace hierarchical
> clustering with a similarity measure. In other words L1 has 1000 docs per
> cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest to L1
> clusters (that's all the user wants to see in my case) and reference the 10
> L2 clusters nearest by centroid similarity using rowsimilarity to
> calculate. I'm hoping that this is a useful way to browse the information
> space.
>
> Naively speaking your streaming k seems to have elements of this built in.