Re: KMeans|| opinions

Ted Dunning Fri, 04 Apr 2014 14:26:22 -0700

On Fri, Apr 4, 2014 at 5:44 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Pipe-pipe is an initialization algorithm which (roughly) parallelizes
> > something like the ++ algorithm and requires logarithmically many
> parallel
> > passes over the data as opposed to the k passes required for ++.
> >
>
> i'll re-read it again. The way i understood the paper (and the code),
> pipe-pipe  actually uniformly subsamples data in each pass to bring in a
> limited number of points in each pass which is O(k) not O(n). This is not
> the same as going over each point computing distance-based probability
> metrics for them, if that's what you imply by going for ++ comparison.
>

I don't understand your point here.  My original point was that || makes
logarithmic number (log N or log k, I don't know which) of passes over the
data (direct quote from the paper) while ++ makes O(k) passes over the
data.  Such passes are often I/O bound.

Each ++ pass is made with the goal of selecting a new initial point.  This
is a sampling procedure based on all the previous initial points.  This
sounds very much like || except for the number of passes.

So my point is O(log mumble) versus O(k).  You say that pipe-pipe samples
data.  You say that is different than ++ sampling data.

Is it the uniform versus non-uniform sampling that you are talking about?
 That hardly seems important compared to the difference in number of passes.

DP-means would have to do subsampling as well.
>

Re: KMeans|| opinions

Reply via email to