On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
> I suspect the MLlib code would suffer from non-deterministic parallelism
> behavior in its Lloyd iteration (as well as a memory-overflow risk) in
> certain corner cases, such as when there are a lot of data points but
> very few clusters sought. Spark (for the right reasons) doesn't believe
> in sort-and-spill, which means centroid recomputation may suffer from
> decreased or asymmetric parallelism after the groupByKey call, especially
> if cluster attribution counts end up heavily skewed for whatever reason.
>
> E.g., if you have 1 bln points and two centroids, it's my understanding
> that post-attribution centroid reduction will create only two tasks, each
> processing no less than 500 mln attributed points, even if cluster
> capacity quite adequately offers 500 tasks for this job.

This should be susceptible to combiner-like pre-aggregation. That should
give essentially perfect parallelism.
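
For illustration, here is a minimal sketch of what that could look like on
the RDD API (the names assigned and recomputeCentroids are hypothetical,
not MLlib's actual code). reduceByKey performs a map-side combine in every
partition before shuffling, so each of the ~500 input partitions reduces
its own points locally and only k tiny partial sums per partition cross
the wire:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Hypothetical sketch, not MLlib's implementation: recompute centroids
    // with a combiner-style reduction instead of groupByKey. `assigned`
    // holds each point keyed by the id of its nearest centroid.
    def recomputeCentroids(
        assigned: RDD[(Int, Array[Double])]): Map[Int, Array[Double]] = {
      assigned
        .mapValues(p => (p, 1L))                    // carry a count with each point
        .reduceByKey { case ((s1, n1), (s2, n2)) => // map-side combine per partition;
          (s1.zip(s2).map { case (a, b) => a + b }, // the shuffle then moves only
           n1 + n2)                                 // per-cluster partial sums
        }
        .mapValues { case (sum, n) => sum.map(_ / n) } // centroid = mean of its points
        .collect()                                     // k results, tiny; safe to collect
        .toMap
    }

In the billion-point, two-centroid example above, the two reduce tasks
then merge only ~500 small (sum, count) pairs instead of 500 mln raw
points apiece, which is where the essentially perfect parallelism comes
from.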
