Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818

> I have a few questions.
>
> Is it a regression from 1.6? Looks like not?

I don't know about 1.6. I know it's a regression from 1.5.

> Is it a correctness issue or a performance issue? Seems it is a performance issue?

It is a performance issue.

> If it is a performance issue. What is the impact? For a hive parquet/orc table, after we convert them to Spark's native code path, there is no partitioning discovery. So, I guess the performance is mainly coming from querying metastore? If so, what will be the perf difference after spark.sql.hive.metastorePartitionPruning (only querying needed partition info from Hive metastore) is enabled?

The problem this PR addresses occurs in the analysis phase of query planning. The property `spark.sql.hive.metastorePartitionPruning` only comes into play in `HiveTableScanExec`, which is part of physical planning. (And I don't believe it's used to read Parquet tables.) Therefore, that property has no bearing on this problem.

Regarding the impact, I'll quote from the last paragraph of the PR description:

> Building a large HadoopFsRelation requires stat-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.

----

> My feeling is that if it is a perf issue and it is not a regression from 1.6, merging to master should be good enough.

For some (like us), I'd say this extends beyond a performance issue into a usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to build a query plan.