Hi, you can control this kind of issue in the upcoming v2.0. See https://www.mail-archive.com/user@spark.apache.org/msg51603.html
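
For context, here is a rough sketch of the idea, assuming the settings in question are the Spark 2.0 file-scan options `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB). It is a simplified pure-Python model of the split-size rule, not Spark's actual code: the function names and the 20 GB / 8-core input are made up for illustration, and the step where Spark packs splits into partitions is left out.

```python
# Simplified model of how a Spark 2.0 file scan picks its target split size.
# Assumption: this mirrors the rule min(maxPartitionBytes, max(openCost, bytesPerCore));
# the helper names below are hypothetical and are not part of any Spark API.

MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Return the target split size in bytes for a set of input files."""
    # Every file is padded with a fixed "open cost" so many tiny files count too.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def split_file(size, split_bytes, splittable=True):
    """Cut one file into chunks; a non-splittable file (e.g. gzip) stays whole."""
    if not splittable:
        return [size]
    return [min(split_bytes, size - offset) for offset in range(0, size, split_bytes)]


# Hypothetical input: a single 20 GB file read with 8 cores.
sizes = [20 * 1024 * MB]
target = max_split_bytes(sizes, default_parallelism=8)

splittable_chunks = split_file(sizes[0], target, splittable=True)
gzip_chunks = split_file(sizes[0], target, splittable=False)

print("target split size (MB):", target // MB)
print("splittable file ->", len(splittable_chunks), "chunks")
print("gzip file       ->", len(gzip_chunks), "chunk")
```

In this model the splittable file is cut into evenly sized chunks capped by `max_partition_bytes`, while the gzip file stays a single 20 GB chunk, which is the skew discussed in the quoted thread below. Lowering `spark.sql.files.maxPartitionBytes` should give more, smaller read tasks for splittable formats; it would not help with large gzip files.
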
// maropu

On Sat, Jun 4, 2016 at 10:23 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Hi Saif!
>
> When you say this happens with spark-csv, are the files gzipped by any
> chance? GZip is non-splittable, so if you’re seeing skew simply from loading
> data, it could be that you have some extremely large gzip files. In a
> single-stage job, the tasks reading those files will lag behind the ones
> reading the smaller gzips. As you already said, the option there would be to
> repartition at the expense of shuffling. If you’re seeing this with Parquet
> files, what do the individual part-* files look like (size, compression
> type, etc.)?
>
> Thanks,
> Silvio
>
> *From: *"saif.a.ell...@wellsfargo.com" <saif.a.ell...@wellsfargo.com>
> *Date: *Friday, June 3, 2016 at 8:31 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Strategies for propery load-balanced partitioning
>
> Hello everyone!
>
> I was noticing that, when reading Parquet files or actually any kind of
> source data frame data (spark-csv, etc.), the default partitioning is not fair.
> Action tasks usually run very fast on some partitions and very slow on
> some others, and frequently even fast on all but the last partition (which
> looks like it reads +50% of the data input size).
>
> I notice that each task is loading some portion of the data, say 1024MB
> chunks, and some task is loading 20+GB of data.
>
> Applying repartition strategies solves this issue properly and general
> performance is increased considerably, but for very large dataframes,
> repartitioning is a costly process.
>
> In short, what are the available strategies or configurations that help
> reading from disk or HDFS with proper executor-data-distribution?
>
> If this needs to be more specific, I am strictly focused on PARQUET files
> from HDFS. I know there are some MIN
>
> Really appreciate,
> Saif

--
---
Takeshi Yamamuro