GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/13686

    [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
    partitioned metastore relation when searching the internal table cache

    ## What changes were proposed in this pull request?
    
    The `getCached` method of `HiveMetastoreCatalog` computes
    `pathsInMetastore` from the metastore relation's catalog table. This only
    returns the table base path, which is not correct for partitioned tables.
    As a result, cached lookups on partitioned tables always miss, and these
    relations are always recomputed.
    
    Rather than get `pathsInMetastore` from
    
        metastoreRelation.catalogTable.storage.locationUri.toSeq
    
    I modified the `getCached` method to take a `pathsInMetastore` argument.
    Calls to this method pass in the paths computed from calls to the Hive
    metastore. This is how `getCached` was implemented in Spark 1.5:
    
        https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444
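
    To illustrate the cache miss described above, here is a minimal,
    self-contained sketch. It does not use Spark's real classes:
    `CachedRelation`, `GetCachedSketch`, `getCachedBefore`, `getCachedAfter`,
    and the warehouse paths are all hypothetical stand-ins. It also assumes
    the cache check compares the cached relation's paths with
    `pathsInMetastore` as sets, which may not match the actual code exactly.

```scala
// Hypothetical stand-in for a cached relation; not a Spark class.
final case class CachedRelation(paths: Set[String])

object GetCachedSketch {
  // Assumed layout: one partitioned table with two partition directories.
  val basePath = "/warehouse/events"
  val partitionPaths: Seq[String] = Seq(
    s"$basePath/ds=2016-06-14",
    s"$basePath/ds=2016-06-15")

  // The cached relation for the partitioned table was built from its
  // partition paths, not from the table base path.
  val cache: Map[String, CachedRelation] =
    Map("events" -> CachedRelation(partitionPaths.toSet))

  // Before: the expected paths are derived from the table base location
  // only, so the comparison fails for any partitioned table.
  def getCachedBefore(table: String): Option[CachedRelation] = {
    val pathsInMetastore = Seq(basePath)
    cache.get(table).filter(_.paths == pathsInMetastore.toSet)
  }

  // After: the caller supplies the paths it obtained from the metastore.
  def getCachedAfter(
      table: String,
      pathsInMetastore: Seq[String]): Option[CachedRelation] =
    cache.get(table).filter(_.paths == pathsInMetastore.toSet)

  def main(args: Array[String]): Unit = {
    println(getCachedBefore("events").isDefined)                // prints false
    println(getCachedAfter("events", partitionPaths).isDefined) // prints true
  }
}
```

    In this sketch the "before" lookup can never hit for a partitioned
    table, because a set of partition directories never equals the singleton
    set holding the base path. Passing the paths in as an argument lets each
    caller supply whichever path list matches the relation it built.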
    
    ## How was this patch tested?
    
    I tested by (temporarily) adding logging to the `getCached` method and
    running `spark.table("...")` on a partitioned table in a spark-shell
    before and after this patch. Before this patch, the value of `useCached`
    in `getCached` was `false`. After the patch it was `true`. I also
    validated that caching still works for unpartitioned tables.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13686.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13686
    
----
commit 60bfe10e350d245632e940aa758cec4f0d2c4006
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-06-15T16:52:17Z

    [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
    partitioned metastore relation when searching the internal table cache
    
    The `getCached` method of `HiveMetastoreCatalog` computes
    `pathsInMetastore` from the metastore relation's catalog table. This
    only returns the table base path, which is not correct for partitioned
    tables. As a result, cached lookups on partitioned tables always miss,
    and these relations are always recomputed.

----


