[
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125116#comment-14125116
]
Lefty Leverenz commented on HIVE-7223:
--------------------------------------
Does this need to be documented in the wiki?
(Also, Fix Version/s should be 0.14.0.)
> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>
> Key: HIVE-7223
> URL: https://issues.apache.org/jira/browse/HIVE-7223
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog, Metastore
> Affects Versions: 0.12.0, 0.13.0
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch,
> HIVE-7223.4.patch, HIVE-7223.5.patch
>
>
> Currently, the functions in the HiveMetaStore API that handle multiple
> partitions do so using List<Partition>. E.g.
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short
> max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String
> tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its
> own copy of a StorageDescriptor, partition-values, etc. Tables with tens of
> thousands of partitions take so long to have their partitions listed that the
> client times out with default hive.metastore.client.socket.timeout. There is
> the additional expense of serializing and deserializing metadata for large
> sets of partitions, w.r.t time and heap-space. Reducing the thrift traffic
> should help in this regard.
> In a date-partitioned table, all sub-partitions for a particular date are
> *likely* (but not expected) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible to
> represent the partition-list (for a date) in a more condensed form: a list of
> LighterPartition instances, all sharing a common StorageDescriptor whose
> location points to the root directory.
> We can go one better for the {{add_partitions()}} case: When adding all
> partitions for a given date, the “normal” case affords us the ability to
> specify the top-level date-directory, where sub-partitions can be inferred
> from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since
> partition-functions explicitly specify {{List<Partition>}} arguments. I
> wonder if a {{PartitionSpec}} interface might help:
> {code}
> public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ...
> ;
> public int add_partitions( PartitionSpec new_parts ) throws … ;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
> public List<Partition> getPartitions();
> public List<String> getPartNames();
> public Iterator<Partition> getPartitionIter();
> public Iterator<String> getPartNameIter();
> }
> {code}
> For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement
> {{PartitionSpec}}, store a top-level directory, and return Partition
> instances from sub-directory names, while storing a single StorageDescriptor
> for all of them.
> Similarly, list_partitions() could return a List<PartitionSpec>, where each
> PartitionSpec corresponds to a set or partitions that can share a
> StorageDescriptor.
> By exposing iterator semantics, neither the client nor the metastore need
> instantiate all partitions at once. That should help with memory requirements.
> In case no smart grouping is possible, we could just fall back on a
> {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse
> than status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A
> tighter representation allows us to communicate metadata for a larger number
> of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the
> PartitionSpec as a Thrift Union of supported implementations. (We could
> convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec
> sub-class.)
> Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)