[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

ericl Tue, 11 Oct 2016 11:48:07 -0700

Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/14690
  
    > For one thing, a ListingFileCatalog performs a file tree traversal right 
off the bat. However, the external catalog returns the locations of partitions 
as part of the listPartitionsByFilter call. I believe that should suffice for 
the purpose of building a query plan for metastore-backed tables and executing 
it.
    
    You'd have to re-implement a large portion of the parallel traversal logic 
here, right? I think we should keep this PR minimal and leave that for future 
work. I am also thinking of adding a per-directory file listing cache as a 
follow-up to avoid performance regressions, which would likely involve 
refactoring this path anyway.
    
    > I would be wary of amending our data sources to support case-insensitive 
field resolution. For one thing, strictly speaking it can lead to ambiguity in 
schema resolution. In the (potential but unlikely) event that a 
(case-sensitive) data source schema has two distinct fields x1 and x2 such that 
x1.toLowerCase == x2.toLowerCase we're going to get undefined behavior.
    
    > For another, for case-sensitive data sources this adds code complexity in 
their implementation.
    
    I do agree this might be an issue with other data sources. For Parquet, 
though, I talked with @liancheng and we don't think there are any issues with 
supporting case-insensitive field resolution. Given that, I think we can also 
leave this for future work when we add data source table support. It might also 
be that we need to add back something like 
https://github.com/apache/spark/pull/14750
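
    To make the ambiguity concern quoted above concrete, here is a minimal 
sketch in plain Python. It is not Spark or Parquet code; the function names 
`find_case_insensitive_collisions` and `resolve_field` and the sample field 
names are hypothetical, chosen only for illustration. It shows how two distinct 
fields that lowercase to the same name could be detected up front and rejected 
at lookup time instead of resolving to an arbitrary one.

```python
from collections import defaultdict


def find_case_insensitive_collisions(field_names):
    """Return groups of field names that collide once lowercased.

    Hypothetical helper: maps each lowercased name to the original
    names that share it, keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


def resolve_field(requested, field_names):
    """Resolve `requested` against `field_names`, ignoring case.

    Returns the matching original name, or None when nothing matches.
    Raises ValueError when more than one field matches, rather than
    silently picking one of them.
    """
    wanted = requested.lower()
    matches = [name for name in field_names if name.lower() == wanted]
    if not matches:
        return None
    if len(matches) > 1:
        raise ValueError(
            "ambiguous field %r: matches %s" % (requested, matches)
        )
    return matches[0]
```

    With a schema such as `["id", "userName", "username"]`, looking up `ID` 
resolves to `id`, while looking up `USERNAME` raises an error because both 
`userName` and `username` match.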
    
    > Finally, this would require us to read the schema files. That's something 
I'm trying to avoid in this patch.
    
    Not sure what you mean here, but the Parquet change should be execution 
time only. I'll submit a PR here for that.



