PS: Problems with the RDDs we are trying to feed to drmWrap() are mostly of 2 kinds:
(1) skewed/incorrect partitioning, e.g. due to prefiltering or degenerate splitting (at least 1 row vector is required in every partition); (2) invalid data dimensionality (although some operators may be forgiving of that, in general, all vectors in a row-partitioned format must have the same cardinality). Either problem is the responsibility of the data loader (e.g. drmDfsRead()), not the algebra's, so there is no need to try to hack it into the optimizer's guts. (Illustrative sketches of both points follow the quoted message below.)

On Mon, Apr 6, 2015 at 3:56 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I probably will try to answer these on the list. Slack is only on my phone
> now.
>
> (1) par, generally, does not do a shuffle. Internally, its implementation
> largely relies on the shuffle-less coalesce() Spark API.
>
> What this means is that it will do a great job of reducing or increasing
> parallelism 5x or more without doing a shuffle, while observing approximate
> uniformity of splits.
>
> (2) As a corollary, it will not eliminate the problem of empty partitions.
>
> (3) The optimizer will not create problems in RDDs if the initial RDD
> passed to drmWrap() did not have problems.
>
> (4) The optimizer does not validate RDDs (in drmWrap() or elsewhere) for
> correctness, for expense reasons.
>
> However, we may want to create a routine that validates internal RDD
> structure (as a map-only or all-reduce op) which could be used by tools
> like data importers before passing data to the algebra.
>
> -d
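For point (1) in the quoted message, a minimal plain-Spark sketch in Scala of the difference between the shuffle-less and the shuffling forms of coalesce(); the toy RDD is hypothetical, and this only illustrates the machinery par() is said above to rely on, not par() itself:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]"))

// A toy RDD with more partitions than we want.
val rdd = sc.parallelize(1 to 1000, numSlices = 40)

// Shuffle-less coalesce: merges existing splits locally. Cheap, and keeps
// splits approximately uniform, but it cannot split or rebalance individual
// partitions, so pre-existing empty partitions may survive (point (2)).
val fewer = rdd.coalesce(8)

// The shuffling variant redistributes rows fully; costlier, but it is the
// form that can actually repair skew or emptiness.
val rebalanced = rdd.coalesce(8, shuffle = true)

println(s"before=${rdd.getNumPartitions}, after=${fewer.getNumPartitions}")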
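And a minimal sketch of the validation routine suggested at the end of the quote, assuming the Spark-side DRM row format of (key, org.apache.mahout.math.Vector) pairs; validateDrmRdd is a hypothetical name, not existing Mahout API:

import org.apache.mahout.math.Vector
import org.apache.spark.rdd.RDD

// Hypothetical loader-side check for the two failure modes above; meant to
// run once in a data importer, before the RDD is handed to drmWrap().
def validateDrmRdd[K](rdd: RDD[(K, Vector)]): Unit = {

  // (1) Every partition must contain at least one row vector. This pass is
  // map-only: no shuffle is introduced.
  val emptyParts = rdd
    .mapPartitionsWithIndex((i, it) =>
      if (it.hasNext) Iterator.empty else Iterator.single(i))
    .collect()
  require(emptyParts.isEmpty,
    s"empty partitions found: ${emptyParts.mkString(", ")}")

  // (2) All row vectors must have the same cardinality. distinct() is the
  // all-reduce step; a mismatch shows up as more than one observed size.
  val cards = rdd.map { case (_, v) => v.size() }.distinct().collect()
  require(cards.length <= 1,
    s"inconsistent row cardinalities: ${cards.mkString(", ")}")
}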
