On Fri, Apr 4, 2014 at 5:44 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > > Pipe-pipe is an initialization algorithm which (roughly) parallelizes > > something like the ++ algorithm and requires logarithmically many > parallel > > passes over the data as opposed to the k passes required for ++. > > > > i'll re-read it again. The way i understood the paper (and the code), > pipe-pipe actually uniformly subsamples data in each pass to bring in a > limited number of points in each pass which is O(k) not O(n). This is not > the same as going over each point computing distance-based probability > metrics for them, if that's what you imply by going for ++ comparison. > I don't understand your point here. My original point was that || makes logarithmic number (log N or log k, I don't know which) of passes over the data (direct quote from the paper) while ++ makes O(k) passes over the data. Such passes are often I/O bound. Each ++ pass is made with the goal of selecting a new initial point. This is a sampling procedure based on all the previous initial points. This sounds very much like || except for the number of passes. So my point is O(log mumble) versus O(k). You say that pipe-pipe samples data. You say that is different than ++ sampling data. Is it the uniform versus non-uniform sampling that you are talking about? That hardly seems important compared to the difference in number of passes. DP-means would have to do subsampling as well. >