Have you looked at the streaming k-means work? The basic idea is that you generate a small sketch of the data, which you can then cluster in memory. Because the sketch is much smaller than the original data, you can afford sophisticated centroid generation algorithms that need a lot of processing.
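
To make the sketch idea concrete, here is a rough illustration in plain Python. It is a simplified toy under my own assumptions, not Mahout's implementation: the names build_sketch and weighted_kmeans, the distance cutoff, and the growth factor are all made up for the example. One pass folds the points into at most max_centroids weighted centroids, and an ordinary weighted k-means then runs on that small sketch in memory.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _insert(sketch, p, w, cutoff, rng):
    """Add point p with weight w: start a new centroid or merge into the nearest."""
    if not sketch:
        sketch.append([p, w])
        return
    nearest = min(sketch, key=lambda e: dist(e[0], p))
    d = dist(nearest[0], p)
    # Far-away points are likely to become new centroids (assumed rule).
    if rng.random() < min(1.0, d / cutoff):
        sketch.append([p, w])
    else:
        total = nearest[1] + w
        nearest[0] = [(c * nearest[1] + x * w) / total
                      for c, x in zip(nearest[0], p)]
        nearest[1] = total


def build_sketch(points, max_centroids, cutoff=0.1, growth=1.5, rng=None):
    """Single pass over points; returns a list of (centroid, weight) pairs.

    Assumes max_centroids >= 1. The cutoff and growth values are illustrative.
    """
    rng = rng or random.Random(0)
    sketch = []
    for p in points:
        _insert(sketch, list(p), 1.0, cutoff, rng)
        # Too many centroids: loosen the cutoff and re-collapse the sketch.
        while len(sketch) > max_centroids:
            cutoff *= growth
            old, sketch = sketch, []
            for c, w in old:
                _insert(sketch, c, w, cutoff, rng)
    return [(tuple(c), w) for c, w in sketch]


def weighted_kmeans(sketch, k, iterations=20, rng=None):
    """Plain Lloyd iterations on the weighted sketch; needs k <= len(sketch)."""
    rng = rng or random.Random(0)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in range(k)]
        weights = [0.0] * k
        for c, w in sketch:
            i = min(range(k), key=lambda j: dist(centers[j], c))
            weights[i] += w
            for axis, x in enumerate(c):
                sums[i][axis] += w * x
        for i in range(k):
            if weights[i] > 0:  # leave an empty cluster's center where it was
                centers[i] = [s / weights[i] for s in sums[i]]
    return centers
```

The memory needed is bounded by max_centroids rather than by the number of documents, which is the point of the approach. The in-memory step here is basic Lloyd's algorithm only to keep the example short; anything more expensive could be swapped in, since it only ever sees the sketch.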

On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu <chjaso...@gmail.com> wrote:

> Hi all, I'm trying to clustering text documents via top-down approach. I
> have experienced both random seed and canopy generation, and have seen
> their pros and cons. I realize that canopy is great for not known exact
> cluster numbers; nevertheless, the memory need for canopy is great. I was
> hoping to find something similar to canopy generation and was wondering if
> there is any other recommendation?
>