Re: Performance regression for partitioned parquet data

Michael Allman Tue, 13 Jun 2017 10:05:48 -0700

Hi Bertrand,

I encourage you to create a ticket for this and submit a PR if you have time. 
Please add me as a listener, and I'll try to contribute/review.


Michael

> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy <bertrand.bo...@teralytics.ch> 
> wrote:
> 
> Hi,
> 
> Since moving from Spark 2.0 to 2.1, we have been seeing a performance 
> regression when reading a large, partitioned Parquet dataset:
> 
> We observe hundreds of very short jobs executing before the job that reads 
> the data starts. I looked into this and traced it to 
> PartitioningAwareFileIndex: while recursively listing the directories, if a 
> directory contains more than 
> "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32) paths, 
> its children are listed using a Spark job. Because the tree is walked 
> serially, this can result in many small Spark jobs executed one after the 
> other, and their overhead dominates. Performance can be improved by tuning 
> "spark.sql.sources.parallelPartitionDiscovery.threshold", but that is not a 
> satisfactory solution.
> 
> I think the current behaviour could be improved by walking the directory 
> tree in breadth-first order and launching a single Spark job to list files 
> in parallel only if the number of paths to be listed at some level exceeds 
> "spark.sql.sources.parallelPartitionDiscovery.threshold".
> 
> Does this approach make sense? The most closely related ticket I have found 
> is "Regression in file listing performance" 
> ( https://issues.apache.org/jira/browse/SPARK-18679 ).
> 
> Unless there is a reason for the current behaviour, I will create a ticket 
> for this soon. I might have some time in the coming days to work on it.
> 
> Regards,
> Bertrand
> 
> -- 
> Bertrand Bossy | TERALYTICS
> 
> software engineer
> 
> 
>
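
The breadth-first listing proposed in the quoted message can be sketched
roughly as follows. This is only an illustration in plain Python, not Spark's
actual PartitioningAwareFileIndex code: the in-memory dictionary standing in
for the file system, the thread pool standing in for a Spark job, and the
names list_dir, list_leaf_files_bfs and demo_tree are all assumptions made up
for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical in-memory "file system": directory path -> child names.
# Names ending in ".parquet" are treated as files, everything else as a directory.
Tree = Dict[str, List[str]]


def list_dir(tree: Tree, path: str) -> Tuple[List[str], List[str]]:
    """Return (subdirectories, files) found directly under `path`."""
    dirs: List[str] = []
    files: List[str] = []
    for name in tree.get(path, []):
        child = f"{path}/{name}"
        if name.endswith(".parquet"):
            files.append(child)
        else:
            dirs.append(child)
    return dirs, files


def list_leaf_files_bfs(tree: Tree, root: str, threshold: int = 32) -> Tuple[List[str], int]:
    """Walk the tree one level at a time.

    A level with more than `threshold` paths is listed in one parallel batch
    (a thread pool here, standing in for a single Spark job); smaller levels
    are listed serially. Returns the sorted files and the number of parallel
    batches that were launched.
    """
    level = [root]
    files: List[str] = []
    parallel_batches = 0
    while level:
        if len(level) > threshold:
            parallel_batches += 1
            with ThreadPoolExecutor(max_workers=8) as pool:
                results = list(pool.map(lambda p: list_dir(tree, p), level))
        else:
            results = [list_dir(tree, p) for p in level]
        next_level: List[str] = []
        for dirs, found in results:
            next_level.extend(dirs)
            files.extend(found)
        level = next_level
    return sorted(files), parallel_batches


def demo_tree(outer: int, inner: int) -> Tree:
    """Build a two-level partitioned layout: data/a=<i>/b=<j>/part-0.parquet."""
    tree: Tree = {"data": [f"a={i}" for i in range(outer)]}
    for i in range(outer):
        tree[f"data/a={i}"] = [f"b={j}" for j in range(inner)]
        for j in range(inner):
            tree[f"data/a={i}/b={j}"] = ["part-0.parquet"]
    return tree
```

With demo_tree(3, 40), each a=<i> directory has 40 children, so a walk that
decides per directory would cross the default threshold of 32 three times. The
level-wise walk sees all 120 b=<j> directories as one level and lists them in
a single parallel batch.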
