Hi,

you can control this kind of issue in the coming v2.0.
See https://www.mail-archive.com/user@spark.apache.org/msg51603.html
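
For reference, a minimal sketch of the new file-split knobs (the config
keys are the ones discussed in the linked thread; the path and byte
values here are illustrative only):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

// Cap each input partition at ~64 MB of file data (the default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
// Count each additional file as ~4 MB of "open cost" when packing
// small files together into one partition.
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)

// Hypothetical path; with the settings above, file-based sources are
// split into more, smaller partitions at read time.
val df = spark.read.parquet("hdfs:///path/to/table")
println(df.rdd.getNumPartitions)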

// maropu


On Sat, Jun 4, 2016 at 10:23 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Hi Saif!
>
>
>
> When you say this happens with spark-csv, are the files gzipped by any
> chance? GZip is non-splittable, so if you’re seeing skew simply from
> loading the data, it could be that you have some extremely large gzip
> files. In a single-stage job, the tasks reading those files will lag
> behind the ones reading the smaller gzips. As you already said, the
> option there would be to repartition at the expense of shuffling. If
> you’re seeing this with parquet files, what do the individual part-*
> files look like (size, compression type, etc.)?
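>
> For example (a rough sketch against the Spark 1.x API, assuming a
> sqlContext in scope as in the shell; the path and partition count are
> made up), the gzip case and the shuffle trade-off look like this:
>
> // A single large .csv.gz is non-splittable, so one task reads all of it.
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("header", "true")
>   .load("hdfs:///data/events.csv.gz")
>
> // Spreading the rows afterwards costs a shuffle, but evens out the
> // work for every downstream stage.
> val balanced = df.repartition(200)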
>
>
>
> Thanks,
>
> Silvio
>
>
>
> *From: *"saif.a.ell...@wellsfargo.com" <saif.a.ell...@wellsfargo.com>
> *Date: *Friday, June 3, 2016 at 8:31 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Strategies for properly load-balanced partitioning
>
>
>
> Hello everyone!
>
>
>
> I was noticing that, when reading parquet files or in fact any kind of
> DataFrame source data (spark-csv, etc.), the default partitioning is not
> evenly balanced.
>
> Action tasks usually run very fast on some partitions and very slow on
> others; frequently they are fast on all but the last partition, which
> looks like it reads more than 50% of the input data.
>
>
>
> I notice that most tasks load only a portion of the data, say 1024 MB
> chunks, while some tasks load 20+ GB of data.
>
>
>
> Applying repartition strategies solves this issue properly and increases
> overall performance considerably, but for very large DataFrames,
> repartitioning is a costly process.
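>
> As an illustration (a hedged sketch; the path and numbers are invented),
> the skew can be measured first, so the shuffle is only paid for when it
> is confirmed:
>
> val df = sqlContext.read.parquet("hdfs:///warehouse/big_table")
>
> // Count rows per partition to see the skew without shuffling anything.
> val rowsPerPartition = df.rdd
>   .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
>   .collect()
> rowsPerPartition.sortBy(-_._2).take(10).foreach(println)
>
> // Only then pay for the full shuffle.
> val even = df.repartition(512)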
>
>
>
> In short, what strategies or configurations are available to help read
> from disk or HDFS with proper executor data distribution?
>
>
>
> If this needs to be more specific, I am strictly focused on PARQUET files
> from HDFS. I know there are some MIN
>
>
>
> Really appreciate it,
>
> Saif
>
>
>



-- 
---
Takeshi Yamamuro
