On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I suspect the mllib code would suffer from non-deterministic parallelism
> behavior in its Lloyd iteration (as well as a memory overflow risk) in
> certain corner cases, such as when there are a lot of data points but
> very few clusters sought. Spark (for the right reasons) doesn't believe in
> sort-n-spill, which means centroid recomputation may suffer from
> decreased or asymmetric parallelism after the groupByKey call, especially if
> cluster attribution counts end up heavily skewed for whatever reason.
>
> E.g., if you have 1 bln points and two centroids, it's my understanding that
> the post-attribution centroid reduction will create only two tasks, each
> processing on the order of 500 mln attributed points, even if the cluster
> quite adequately has capacity for 500 concurrent tasks for this job.
>

This should be amenable to combiner-like pre-aggregation: each partition can
reduce its points to per-centroid partial sums and counts before the shuffle,
which would give essentially perfect parallelism.
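
For concreteness, here's a minimal sketch of what that pre-aggregation could
look like with Spark's RDD API. The function and variable names are mine, not
MLlib's, and the nearest-centroid assignment is a plain squared-Euclidean
scan for illustration:

    import org.apache.spark.rdd.RDD

    def recomputeCentroids(
        points: RDD[Array[Double]],
        centroids: Array[Array[Double]]): Array[Array[Double]] = {

      // Assign a point to its nearest centroid by squared Euclidean distance.
      def nearest(p: Array[Double]): Int =
        centroids.indices.minBy { i =>
          var d = 0.0
          var j = 0
          while (j < p.length) { val t = p(j) - centroids(i)(j); d += t * t; j += 1 }
          d
        }

      points
        .map(p => (nearest(p), (p, 1L)))
        // reduceByKey does map-side combining: each partition is first
        // collapsed to at most k partial (sum, count) pairs, so the shuffle
        // moves ~k records per partition instead of every attributed point.
        .reduceByKey { case ((s1, n1), (s2, n2)) =>
          val s = new Array[Double](s1.length)
          var j = 0
          while (j < s.length) { s(j) = s1(j) + s2(j); j += 1 }
          (s, n1 + n2)
        }
        .collect()
        .sortBy(_._1)
        .map { case (_, (sum, n)) => sum.map(_ / n) }
    }

With reduceByKey instead of groupByKey, the shuffle carries only the small
per-partition partials, so the reduction never funnels 500 mln points into
two tasks regardless of how skewed the cluster attribution counts are.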
