Hi Bertrand, I encourage you to create a ticket for this and submit a PR if you have time. Please add me as a listener, and I'll try to contribute/review.
Michael > On Jun 6, 2017, at 5:18 AM, Bertrand Bossy <bertrand.bo...@teralytics.ch> > wrote: > > Hi, > > since moving to spark 2.1 from 2.0, we experience a performance regression > when reading a large, partitioned parquet dataset: > > We observe many (hundreds) very short jobs executing before the job that > reads the data is starting. I looked into this issue and pinned it down to > PartitioningAwareFileIndex: While recursively listing the directories, if a > directory contains more than > "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32) paths, > the children are listed using a spark job. Because the tree is listed > serially, this can result in a lot of small spark jobs executed one after the > other and the overhead dominates. Performance can be improved by tuning > "spark.sql.sources.parallelPartitionDiscovery.threshold". However, this is > not a satisfactory solution. > > I think that the current behaviour could be improved by walking the directory > tree in breadth first search order and only launching one spark job to list > files in parallel if the number of paths to be listed at some level exceeds > spark.sql.sources.parallelPartitionDiscovery.threshold . > > Does this approach make sense? I have found "Regression in file listing > performance" ( https://issues.apache.org/jira/browse/SPARK-18679 > <https://issues.apache.org/jira/browse/SPARK-18679> ) as the most closely > related ticket. > > Unless there is a reason for the current behaviour, I will create a ticket on > this soon. I might have some time in the coming days to work on this. > > Regards, > Bertrand > > -- > Bertrand Bossy | TERALYTICS > > software engineer > > Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland > www.teralytics.net <http://www.teralytics.net/> > Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich > Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de > Vries > This e-mail message contains confidential information which is for the sole > attention and use of the intended recipient. Please notify us at once if you > think that it may not be intended for you and delete it immediately. > >