[ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-16980.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 2.1.0

> Load only catalog table partition metadata required to answer a query
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16980
>                 URL: https://issues.apache.org/jira/browse/SPARK-16980
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michael Allman
>            Assignee: Michael Allman
>             Fix For: 2.1.0
>
>
> Currently, when a user reads from a partitioned Hive table whose metadata
> are not cached (and for which Hive table conversion is enabled and
> supported), all partition metadata are fetched from the metastore:
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> However, if the user's query includes partition pruning predicates, then we
> only need the subset of those metadata which satisfy the predicates.
> This issue tracks work to modify the current query planning scheme so that
> unnecessary partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends
> {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally
> applicable. It requires some new abstractions and refactoring of
> {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater
> burden on other implementations of {{ExternalCatalog}}. Currently the only
> other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my
> prototype throws an {{UnsupportedOperationException}} on that
> implementation.
> The second prototype is simpler and only touches code in the {{hive}}
> project. Basically, conversion of a partitioned {{MetastoreRelation}} to a
> {{HadoopFsRelation}} is deferred to physical planning.
> During physical planning, the partition pruning filters in the query plan
> are used to identify the required partition metadata, and a
> {{HadoopFsRelation}} is built from those. The new query plan is then
> re-injected into the physical planner and proceeds as normal for a
> {{HadoopFsRelation}}.
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the
> approach I took in my first POC. (See
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
> Based on that, I'm going to open a PR with that patch as a starting point
> for an architectural/design review. It will not be a complete patch ready
> for integration into Spark master. Rather, I would like to get early
> feedback on the implementation details so I can shape the PR before
> committing a large amount of time to a finished product. I will open
> another PR for the second approach for comparison if requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
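The deferred-conversion idea described above can be sketched roughly as follows. This is an illustrative, self-contained Scala sketch under stated assumptions, not Spark's actual code: the names `Partition`, `MetastoreRelationSketch`, `FsRelationSketch`, and `plan` are hypothetical stand-ins for `MetastoreRelation`, `HadoopFsRelation`, and the physical-planning step that applies the plan's pruning filters before any partition metadata are fetched.

```scala
// A partition is identified by its partition-column values and the
// filesystem location holding its data.
case class Partition(spec: Map[String, String], location: String)

// Stand-in for a partitioned MetastoreRelation: instead of eagerly
// listing every partition, it exposes a fetch function that accepts a
// predicate over the partition spec, modeling a metastore call that
// returns only the matching partitions.
case class MetastoreRelationSketch(
    table: String,
    fetchPartitions: (Map[String, String] => Boolean) => Seq[Partition])

// Stand-in for HadoopFsRelation: built from concrete partition paths.
case class FsRelationSketch(paths: Seq[String])

object DeferredConversion {
  // At physical-planning time, apply the plan's pruning predicate and
  // build the file-based relation from only the surviving partitions,
  // so unneeded partition metadata are never loaded.
  def plan(
      rel: MetastoreRelationSketch,
      pruningPredicate: Map[String, String] => Boolean): FsRelationSketch =
    FsRelationSketch(rel.fetchPartitions(pruningPredicate).map(_.location))

  def main(args: Array[String]): Unit = {
    val all = Seq(
      Partition(Map("ds" -> "2016-08-01"), "/t/ds=2016-08-01"),
      Partition(Map("ds" -> "2016-08-02"), "/t/ds=2016-08-02"),
      Partition(Map("ds" -> "2016-08-03"), "/t/ds=2016-08-03"))
    val rel = MetastoreRelationSketch("t", pred => all.filter(p => pred(p.spec)))
    // A query with WHERE ds = '2016-08-02' prunes down to one partition.
    println(plan(rel, _.get("ds").contains("2016-08-02")).paths)
  }
}
```

In this sketch, the pruning predicate reaches the partition-listing step before any `FsRelationSketch` exists, which is the essential difference from eagerly materializing all partition metadata at analysis time.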