Michael Allman created SPARK-16980:
--------------------------------------

             Summary: Load only catalog table partition metadata required to answer a query
                 Key: SPARK-16980
                 URL: https://issues.apache.org/jira/browse/SPARK-16980
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Michael Allman


Currently, when a user reads from a partitioned Hive table whose metadata are 
not cached (and for which Hive table conversion is enabled and supported), all 
of that table's partition metadata are fetched from the metastore:

https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260

However, if the user's query includes partition pruning predicates, then only 
the subset of partition metadata satisfying those predicates is needed.
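For concreteness, a hypothetical illustration (the table and partition names 
here are mine, purely for the example, and not taken from any real workload):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Hive table `web_logs`, partitioned by a string column `ds`.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Answering this query requires the metadata of exactly one partition
// (ds = '2016-08-01'), yet with an uncached table every partition's
// metadata is fetched from the metastore before pruning takes place.
spark.sql("SELECT count(*) FROM web_logs WHERE ds = '2016-08-01'").show()
```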

This issue tracks work to modify the current query planning scheme so that 
unnecessary partition metadata are not loaded.

I've prototyped two possible approaches. The first extends 
`org.apache.spark.sql.catalyst.catalog.ExternalCatalog` and as such is more 
generally applicable. It requires some new abstractions and refactoring of 
`HadoopFsRelation` and `FileCatalog`, among other components, and it places a 
greater burden on other implementations of `ExternalCatalog`. Currently the 
only other implementation is `InMemoryCatalog`, for which my prototype simply 
throws an `UnsupportedOperationException`.
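To give a rough sense of the catalog-level approach, here is a minimal sketch 
of the kind of pruning-aware listing method it implies. The method name and 
signature are illustrative only, not the prototype's actual API:

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, ExternalCatalog}
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative sketch: a listing method that accepts partition pruning
// predicates. A Hive-backed implementation would push the predicates down
// to the metastore; `InMemoryCatalog`, per the prototype, would throw
// `UnsupportedOperationException`.
trait PartitionPruningCatalog extends ExternalCatalog {
  def listPartitionsByFilter(
      db: String,
      table: String,
      predicates: Seq[Expression]): Seq[CatalogTablePartition]
}
```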

The second prototype is simpler and touches code only in the `hive` project. 
Basically, conversion of a partitioned `MetastoreRelation` to a 
`HadoopFsRelation` is deferred to physical planning. During physical planning, 
the partition pruning filters in the query plan are used to identify the 
required partition metadata, and a `HadoopFsRelation` is built from just that 
subset. The rewritten query plan is then fed back into the physical planner, 
and planning proceeds as normal for a `HadoopFsRelation`.
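A rough sketch of how that deferral might hook into the planner follows. 
`DeferredHiveTableConversion` and `convertToLogicalRelation` are hypothetical 
names standing in for the prototype's actual code, and this would have to 
live in the `hive` project, where `MetastoreRelation` is visible:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.MetastoreRelation

// Illustrative sketch: convert a partitioned MetastoreRelation only at
// physical planning time, when the pruning filters are known.
object DeferredHiveTableConversion extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(_, filters, relation: MetastoreRelation)
        if relation.partitionKeys.nonEmpty =>
      // Use the filters over partition columns to fetch only the required
      // partition metadata, build a HadoopFsRelation over those partitions'
      // files, and hand the rewritten plan back to the planner.
      planLater(convertToLogicalRelation(relation, filters)) :: Nil
    case _ => Nil
  }

  // Hypothetical placeholder for the prototype's conversion logic.
  private def convertToLogicalRelation(
      relation: MetastoreRelation,
      filters: Seq[Expression]): LogicalPlan = ???
}
```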

On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
approach I took in my first POC. (See 
http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
Based on that, I'm going to open a PR with that patch as a starting point for 
an architectural/design review. It will not be a complete patch ready for 
integration into Spark master. Rather, I would like to get early feedback on 
the implementation details so I can shape the PR before committing a large 
amount of time to a finished product. I will open another PR for the second 
approach for comparison if requested.


