On Mon, Apr 6, 2015 at 5:23 PM, Pat Ferrel <[email protected]> wrote:
> Skewed or incorrect partitioning, given that I don’t know what that means,
> may or may not happen. If it needs to be checked or adjusted it certainly
> could be done. It would probably be best to do it in some variant or method
> of CheckpointedDrm, right?
>
> So it seems like we have two things to look at post 0.10.0
> 1) do we need an rdd validator or smart repartitioner, and what exactly
> should it do?

I don't believe we need a smart repartitioner, pragmatically. coalesce, maybe. No, it is not a method on CheckpointedDrm -- that is abstract algebra and has nothing to do with RDDs. Just a method somewhere in the sparkbindings package, right next to drmWrap(). Maybe drmWrap could also accept a flag to do it automatically, false by default.

> 2) what should we do to optimize reading many small files vs. fewer large
> files? It seems like both cases are ignored by Spark, whether reading
> sequence files or text files.

I think this is what par() is for. coalesce will do nicely. Strictly speaking, it would be a pretty weird loader that encountered significant partitioning skew. I think in Spark, coalesce (doShuffle=true) should largely address these.
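To illustrate why a shuffle-based coalesce addresses both the skew and the many-small-files cases, here is a toy Python sketch (not Spark or Mahout code; partitions are modeled as plain lists, and `coalesce_with_shuffle` is a hypothetical name). A shuffle reassigns every element to a target partition by hash, so the output sizes are even regardless of how skewed the input layout was:

```python
def coalesce_with_shuffle(partitions, num_partitions):
    """Rebalance skewed partitions by hashing each element to a new
    partition -- analogous in spirit to Spark's coalesce(n, shuffle = true).
    Toy model only: real Spark moves data between executors."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for elem in part:
            out[hash(elem) % num_partitions].append(elem)
    return out

# A skewed layout: one huge partition plus several tiny ones, as might
# result from loading one big file alongside many small files.
skewed = [list(range(10_000)), [10_000], [10_001], [10_002]]
balanced = coalesce_with_shuffle(skewed, 4)
sizes = [len(p) for p in balanced]
```

Without the shuffle (plain coalesce), Spark only merges existing partitions locally, which cannot split an oversized partition; the shuffle variant pays the cost of moving data but produces uniform partitions.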
