I'll probably try to answer these on the list; Slack is only on my phone
right now.

(1) par, generally, does not do a shuffle. Internally, its implementation
largely relies on the shuffle-less coalesce() Spark API.

What this means is that it will do a great job reducing or increasing
parallelism by 5x or more without doing a shuffle, while observing
approximate uniformity of splits (a rough plain-Spark sketch follows point
(2) below).

(2) As a corollary, it will not eliminate the problem of empty partitions.
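
To illustrate (1) and (2), here is a minimal plain-Spark sketch of the
shuffle-less coalesce() behavior that par builds on. Everything below (the
local context, sizes, and partition counts) is made up for illustration and
is not part of par itself.

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("coalesce-sketch"))

  // 100 input splits, with whatever row distribution the source happened to produce.
  val rdd = sc.parallelize(1 to 1000, numSlices = 100)

  // Default coalesce() is shuffle-less: it merges existing partitions through a
  // narrow dependency, grouping parent partitions rather than moving rows, so any
  // skew in the input splits carries over into the merged splits.
  val fewer = rdd.coalesce(20)

  // A full rebalance of rows, by contrast, requires shuffle = true.
  val rebalanced = rdd.coalesce(200, shuffle = true)

  println(s"fewer = ${fewer.getNumPartitions}, rebalanced = ${rebalanced.getNumPartitions}")
  sc.stop()
}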

(3) The optimizer will not create problems in the RDDs if the initial RDD
passed to drmWrap() did not have problems.

(4) The optimizer does not validate RDDs (in drmWrap() or elsewhere) for
correctness, for cost reasons.

However, we may want to create a routine that validates the internal RDD
structure (as a map-only or all-reduce op), which tools like data importers
could use before passing data to the algebra; a rough sketch of what that
could look like follows.
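
Purely as a sketch of that idea (no such routine exists yet; the function
name, row representation, and checks below are assumptions), a structural
check could verify per partition that every row value is non-null and has
the expected length, then combine the per-partition verdicts:

import org.apache.spark.rdd.RDD

// Hypothetical check over a row-wise RDD. Array[Double] stands in for the Mahout
// Vector row type so the sketch stays self-contained.
def validateRows[K](rows: RDD[(K, Array[Double])], ncol: Int): Boolean =
  rows.mapPartitions { it =>
    // Map side: each partition reports whether all of its rows are well-formed.
    Iterator.single(it.forall { case (_, row) => row != null && row.length == ncol })
  }.fold(true)(_ && _)  // all-reduce of the per-partition verdicts

In the real bindings the rows are keyed Mahout vectors, and an importer
could extend the same pattern to check key invariants (e.g. that Int keys
cover 0..nrow-1) before handing the RDD to drmWrap().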

-d
