On Mon, Apr 6, 2015 at 5:23 PM, Pat Ferrel <[email protected]> wrote:
> Skewed or incorrect partitioning, given that I don’t know what that means,
> may or may not happen. If it needs to be checked or adjusted it certainly
> could be done. It would probably be best to do it in some variant or method
> of CheckpointedDrm, right?
>
> So it seems like we have two things to look at post 0.10.0
> 1) do we need an rdd validator or smart repartitioner, and what exactly
> should it do?

I don't believe we need a smart repartitioner, pragmatically. coalesce, maybe. No, it is not a method on CheckpointedDrm -- that is abstract algebra and has nothing to do with RDDs. Just a method somewhere in the sparkbindings package, right next to drmWrap(). Maybe drmWrap could also accept a flag to do it automatically, false by default.

> 2) what should we do to optimize reading many small files vs. fewer large
> files? It seems like both cases are ignored by Spark, whether reading
> sequence files or text files.

I think this is what par() is for. coalesce will do nicely. Strictly speaking, it would be a pretty weird loader that encountered significant partitioning skew. I think in Spark, coalesce (doShuffle=true) should largely address these.
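To illustrate why a shuffle-based coalesce addresses both the skew and the many-small-files cases, here is a toy Python sketch (not Spark or Mahout code; partitions are modeled as plain lists, and `coalesce_with_shuffle` is a hypothetical name). A shuffle reassigns every element to a target partition by hash, so the output sizes are even regardless of how skewed the input layout was:

```python
def coalesce_with_shuffle(partitions, num_partitions):
    """Rebalance skewed partitions by hashing each element to a new
    partition -- analogous in spirit to Spark's coalesce(n, shuffle = true).
    Toy model only: real Spark moves data between executors."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for elem in part:
            out[hash(elem) % num_partitions].append(elem)
    return out

# A skewed layout: one huge partition plus several tiny ones, as might
# result from loading one big file alongside many small files.
skewed = [list(range(10_000)), [10_000], [10_001], [10_002]]
balanced = coalesce_with_shuffle(skewed, 4)
sizes = [len(p) for p in balanced]
```

Without the shuffle (plain coalesce), Spark only merges existing partitions locally, which cannot split an oversized partition; the shuffle variant pays the cost of moving data but produces uniform partitions.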
