[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

mallman Tue, 05 Jul 2016 12:57:55 -0700

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    > I have a few questions.
    > 
    > Is it a regression from 1.6? Looks like not?
    
    I don't know about 1.6. I know it's a regression from 1.5.
    
    > Is it a correctness issue or a performance issue? Seems it is a
    > performance issue?
    
    It is a performance issue.
    
    > If it is a performance issue, what is the impact? For a Hive Parquet/ORC
    > table, after we convert it to Spark's native code path, there is no
    > partition discovery. So I guess the cost mainly comes from querying the
    > metastore? If so, what will the perf difference be after
    > spark.sql.hive.metastorePartitionPruning (only querying needed partition
    > info from the Hive metastore) is enabled?
    
    The problem this PR addresses occurs in the analysis phase of query
    planning. The property `spark.sql.hive.metastorePartitionPruning` only comes
    into play in `HiveTableScanExec`, which is part of physical planning. (And I
    don't believe it's used to read Parquet tables.) Therefore, that property has
    no bearing on this problem.
    
    Regarding the impact, I'll quote from the last paragraph of the PR
    description:
    
    > Building a large HadoopFsRelation requires stat-ing all of its data
    > files. In our environment, where we have tables with tens of thousands of
    > partitions, the difference between using a cached relation versus a new
    > one is a matter of seconds versus minutes. Caching partitioned table
    > metadata vastly improves the usability of Spark SQL for these cases.
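
    To illustrate the cost described in the quoted paragraph, here is a minimal
    sketch in plain Python, not Spark code. `RelationCache`, `lookup`, and
    `make_partitioned_table` are hypothetical names invented for this example,
    and a local directory tree stands in for a partitioned table. It only shows
    the general idea: building a relation stats every data file, while a cache
    hit stats none.

```python
import os
import tempfile


class RelationCache:
    """Hypothetical stand-in for a cache of table relations (not Spark's API)."""

    def __init__(self):
        self._cache = {}
        self.stat_calls = 0  # counts how many files were stat-ed in total

    def _build_relation(self, root):
        # Building a relation walks every partition directory and stats
        # every data file, so the cost grows with the number of files.
        files = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                files[path] = os.stat(path).st_size
                self.stat_calls += 1
        return files

    def lookup(self, table_name, root):
        # A cache hit returns the existing relation without touching the
        # file system; only a miss pays the full listing cost.
        if table_name not in self._cache:
            self._cache[table_name] = self._build_relation(root)
        return self._cache[table_name]


def make_partitioned_table(partitions, files_per_partition):
    """Create a toy partitioned layout in a temp directory; return its root."""
    root = tempfile.mkdtemp()
    for p in range(partitions):
        part_dir = os.path.join(root, f"part={p}")
        os.makedirs(part_dir)
        for f in range(files_per_partition):
            with open(os.path.join(part_dir, f"data-{f}.parquet"), "w") as fh:
                fh.write("x")
    return root
```

    With this sketch, the first `lookup` for a table stats every file under its
    root, and a second `lookup` for the same table name stats none. The sketch
    ignores cache invalidation entirely, which a real implementation would
    have to handle.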
    
    ----
    
    > My feeling is that if it is a perf issue and it is not a regression from
    > 1.6, merging to master should be good enough.
    
    For some (like us), I'd say this extends beyond a performance issue into a
    usability issue. We can't use Spark 2.0 as-is if it takes us several minutes
    to build a query plan.




