GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/13686
[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate

## What changes were proposed in this pull request?

The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cached lookups on partitioned tables always miss, and these relations are always recomputed.

Rather than getting `pathsInMetastore` from `metastoreRelation.catalogTable.storage.locationUri.toSeq`, I modified the `getCached` method to take a `pathsInMetastore` argument. Callers of this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5: https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

## How was this patch tested?

I tested by (temporarily) adding logging to the `getCached` method and running `spark.table("...")` on a partitioned table in a spark-shell before and after this patch. Before this patch, the value of `useCached` in `getCached` was `false`. After the patch it was `true`. I also validated that caching still works for unpartitioned tables.
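To make the cache miss described above concrete, here is a minimal, self-contained sketch. It is not the Spark code: `CachedRelation`, the plain `Map` cache, the warehouse paths, and the set-equality check are all hypothetical simplifications, assumed only for illustration. It models a cached relation that remembers the paths it was built from, and a lookup that reuses the entry only when those paths match the `pathsInMetastore` argument.

```scala
// Hypothetical, simplified model of the cache check; NOT the actual Spark implementation.
object CachedLookupSketch {
  // A cached relation remembers the file paths it was built from (assumed shape).
  final case class CachedRelation(tableName: String, paths: Set[String])

  // Reuse the cached entry only if its paths equal the paths the caller expects.
  // Taking pathsInMetastore as a parameter mirrors the change described in the PR.
  def getCached(
      cache: Map[String, CachedRelation],
      tableName: String,
      pathsInMetastore: Seq[String]): Option[CachedRelation] =
    cache.get(tableName).filter(_.paths == pathsInMetastore.toSet)

  // Made-up locations for a partitioned table and an unpartitioned table.
  val basePath = "/warehouse/events"
  val partitionPaths = Seq("/warehouse/events/day=1", "/warehouse/events/day=2")
  val usersPath = "/warehouse/users"

  val cache: Map[String, CachedRelation] = Map(
    "events" -> CachedRelation("events", partitionPaths.toSet),
    "users" -> CachedRelation("users", Set(usersPath))
  )

  // Pre-patch behavior: validating against the table base path only, so the lookup misses.
  val beforePatch = getCached(cache, "events", Seq(basePath))

  // Post-patch behavior: the caller passes the partition paths, so the lookup hits.
  val afterPatch = getCached(cache, "events", partitionPaths)

  // Unpartitioned table: the base path is the only path, so the lookup hits either way.
  val unpartitioned = getCached(cache, "users", Seq(usersPath))
}
```

In this model, `beforePatch` is empty while `afterPatch` and `unpartitioned` are defined, which corresponds to the `useCached` values reported in the testing notes for the real method.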
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13686.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13686

----
commit 60bfe10e350d245632e940aa758cec4f0d2c4006
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-06-15T16:52:17Z

    [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

    The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cached lookups on partitioned tables always miss, and these relations are always recomputed.

----