[ https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mithun Radhakrishnan updated HIVE-7223: --------------------------------------- Attachment: HIVE-7223.2.patch Updated patch, with Thrift definitions updated, etc. > Support generic PartitionSpecs in Metastore partition-functions > --------------------------------------------------------------- > > Key: HIVE-7223 > URL: https://issues.apache.org/jira/browse/HIVE-7223 > Project: Hive > Issue Type: Improvement > Components: HCatalog, Metastore > Affects Versions: 0.12.0, 0.13.0 > Reporter: Mithun Radhakrishnan > Assignee: Mithun Radhakrishnan > Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch > > > Currently, the functions in the HiveMetaStore API that handle multiple > partitions do so using List<Partition>. E.g. > {code} > public List<Partition> listPartitions(String db_name, String tbl_name, short > max_parts); > public List<Partition> listPartitionsByFilter(String db_name, String > tbl_name, String filter, short max_parts); > public int add_partitions(List<Partition> new_parts); > {code} > Partition objects are fairly heavyweight, since each Partition carries its > own copy of a StorageDescriptor, partition-values, etc. Tables with tens of > thousands of partitions take so long to have their partitions listed that the > client times out with default hive.metastore.client.socket.timeout. There is > the additional expense of serializing and deserializing metadata for large > sets of partitions, w.r.t time and heap-space. Reducing the thrift traffic > should help in this regard. > In a date-partitioned table, all sub-partitions for a particular date are > *likely* (but not expected) to have: > # The same base directory (e.g. {{/feeds/search/20140601/}}) > # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}}) > # The same SerDe/StorageHandler/IOFormat classes > # Sorting/Bucketing/SkewInfo settings > In this “most likely” scenario (henceforth termed “normal”), it’s possible to > represent the partition-list (for a date) in a more condensed form: a list of > LighterPartition instances, all sharing a common StorageDescriptor whose > location points to the root directory. > We can go one better for the {{add_partitions()}} case: When adding all > partitions for a given date, the “normal” case affords us the ability to > specify the top-level date-directory, where sub-partitions can be inferred > from the HDFS directory-path. > These extensions are hard to introduce at the metastore-level, since > partition-functions explicitly specify {{List<Partition>}} arguments. I > wonder if a {{PartitionSpec}} interface might help: > {code} > public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... > ; > public int add_partitions( PartitionSpec new_parts ) throws … ; > {code} > where the PartitionSpec looks like: > {code} > public interface PartitionSpec { > public List<Partition> getPartitions(); > public List<String> getPartNames(); > public Iterator<Partition> getPartitionIter(); > public Iterator<String> getPartNameIter(); > } > {code} > For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement > {{PartitionSpec}}, store a top-level directory, and return Partition > instances from sub-directory names, while storing a single StorageDescriptor > for all of them. > Similarly, list_partitions() could return a List<PartitionSpec>, where each > PartitionSpec corresponds to a set or partitions that can share a > StorageDescriptor. > By exposing iterator semantics, neither the client nor the metastore need > instantiate all partitions at once. That should help with memory requirements. > In case no smart grouping is possible, we could just fall back on a > {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse > than status quo. > PartitionSpec abstracts away how a set of partitions may be represented. A > tighter representation allows us to communicate metadata for a larger number > of Partitions, with less Thrift traffic. > Given that Thrift doesn’t support polymorphism, we’d have to implement the > PartitionSpec as a Thrift Union of supported implementations. (We could > convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec > sub-class.) > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)