I might have a similar problem. In the spark-shell:

    val data = spark.read.parquet("...")

After hitting enter, it takes more than 30 seconds for the "read" to complete and return the command line. I am running Spark 2.1.1, but I have also tested it on 2.0.2 and encountered the same issue.

Thanks,
Mike

On Tue, Jun 13, 2017 at 10:05 AM, Michael Allman <mich...@videoamp.com> wrote:

> Hi Bertrand,
>
> I encourage you to create a ticket for this and submit a PR if you have
> time. Please add me as a listener, and I'll try to contribute/review.
>
> Michael
>
> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy <bertrand.bo...@teralytics.ch>
> wrote:
>
> Hi,
>
> since moving to spark 2.1 from 2.0, we experience a performance regression
> when reading a large, partitioned parquet dataset:
>
> We observe many (hundreds) very short jobs executing before the job that
> reads the data is starting. I looked into this issue and pinned it down to
> PartitioningAwareFileIndex: While recursively listing the directories, if a
> directory contains more than
> "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32)
> paths, the children are listed using a spark job. Because the tree is
> listed serially, this can result in a lot of small spark jobs executed one
> after the other and the overhead dominates. Performance can be improved by
> tuning "spark.sql.sources.parallelPartitionDiscovery.threshold". However,
> this is not a satisfactory solution.
>
> I think that the current behaviour could be improved by walking the
> directory tree in breadth first search order and only launching one spark
> job to list files in parallel if the number of paths to be listed at some
> level exceeds spark.sql.sources.parallelPartitionDiscovery.threshold .
>
> Does this approach make sense? I have found "Regression in file listing
> performance" ( https://issues.apache.org/jira/browse/SPARK-18679 ) as the
> most closely related ticket.
>
> Unless there is a reason for the current behaviour, I will create a ticket
> on this soon.
> I might have some time in the coming days to work on this.
>
> Regards,
> Bertrand
>
> --
> Bertrand Bossy | TERALYTICS
> software engineer
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> www.teralytics.net
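
To check whether the listing threshold described in the quoted message is what slows down the read, something like the following could be pasted into the spark-shell. This is only a rough sketch: `timed` is a hypothetical helper (not part of Spark), the value 1024 is an arbitrary choice, and the path is left elided as in the original command. Compare the printed time against the same read in a fresh shell without the `spark.conf.set` line, since a second read in the same session may benefit from cached file listings.

```scala
// Paste into spark-shell; `spark` is the SparkSession the shell provides.

// Hypothetical helper (not a Spark API): runs a block and prints the
// elapsed wall-clock time in seconds.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// Raise the threshold (default: 32) so that, per the description above,
// directories with up to 1024 paths are not listed through a Spark job.
// 1024 is an arbitrary value chosen for illustration.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1024")

// Time the read that returns slowly; the path is elided as in the original.
val data = timed("read with threshold = 1024") {
  spark.read.parquet("...")
}
```

If the time drops noticeably, that would point at the many small listing jobs rather than at reading the Parquet footers. As noted in the quoted message, tuning this setting is a workaround, not a fix.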