[ 
https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553396#comment-16553396
 ] 

Vihang Karajgaonkar commented on HIVE-19715:
--------------------------------------------

Attached the first version of the design proposal for the new API.

TLDR
The API reuses the existing {{PartitionSpec}} objects and makes some of the 
fields in {{PartitionSpec}} optional. It also supports the following:
1. Projection list: a list of strings, each a dot-separated field name. For 
example, clients that are interested only in partition locations can request 
{{sd.location}}, and the result will include only the locations instead of the 
full partition objects.
2. FilterSpec: provides different ways to filter the partitions of a given 
table. It currently supports {{BY_NAMES}}, {{BY_VALUES}} or {{BY_EXPR}}, 
although it's not clear whether there is value in providing {{BY_VALUES}} 
filters.
3. Pagination: the API response contains a pagination token which clients can 
use to send subsequent requests and retrieve partitions in configurable 
batches. The pagination token itself is a {{byte[]}} which the client doesn't 
need to interpret. Internally, the server can encode some values in the token, 
such as the last {{PART_ID}} sent previously, the table modification stamp, etc.
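
To make the projection and pagination ideas concrete, here is a rough sketch in Python. It is only an illustration, not the Thrift definitions from the design doc: the dict layout, the {{part_id}} key, the function names and the JSON encoding of the token are all assumptions made for the example.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def project(partition: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Return a copy of `partition` holding only the dot-separated paths in `fields`."""
    result: Dict[str, Any] = {}
    for path in fields:
        keys = path.split(".")
        value: Any = partition
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # path not present on this partition: skip it
            value = value[key]
        else:
            # Rebuild the nested structure for this path in the result.
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
    return result


def fetch_page(
    partitions: List[Dict[str, Any]],
    fields: List[str],
    page_size: int,
    token: Optional[bytes] = None,
) -> Tuple[List[Dict[str, Any]], Optional[bytes]]:
    """Return one batch of projected partitions plus an opaque continuation token.

    The token is plain bytes to the caller. In this sketch (an assumption) it
    is JSON carrying the last part_id that was returned.
    """
    last_id = json.loads(token)["last_part_id"] if token else -1
    remaining = sorted(
        (p for p in partitions if p["part_id"] > last_id),
        key=lambda p: p["part_id"],
    )
    batch = remaining[:page_size]
    next_token = None
    if len(remaining) > page_size:
        next_token = json.dumps({"last_part_id": batch[-1]["part_id"]}).encode()
    return [project(p, fields) for p in batch], next_token
```

A client would call {{fetch_page}} repeatedly, passing back whatever token it received, until the returned token is empty. A real server would presumably also put the table modification stamp in the token so that it can detect a table that changed between requests.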

Any thoughts or suggestions?

cc: [~alangates] [~thejas] [~tlipcon] [~akolb]

> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
>                 Key: HIVE-19715
>                 URL: https://issues.apache.org/jira/browse/HIVE-19715
>             Project: Hive
>          Issue Type: New Feature
>          Components: Standalone Metastore
>            Reporter: Todd Lipcon
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>         Attachments: HIVE-19715-design-doc.pdf
>
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching 
> partition-related information. There is somewhat of a combinatorial explosion 
> going on, where each API has variants with and without "auth" info, by pspecs 
> vs names, by filters, by exprs, etc. Having all of these separate APIs long 
> term is a maintenance burden and also more confusing for consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in 
> fetching only the information needed for a particular use case. For example, 
> in some use cases it may be beneficial to only fetch the partition locations 
> without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching 
> partition info. The request and response would be encapsulated in structs. 
> Some desirable properties:
> - the request should be able to specify which pieces of information are 
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do 
> either whitelisting or blacklisting (eg to exclude large incremental column 
> stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the 
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access 
> through one of several different methods (eg "all", list<name>, expr, 
> part_vals, etc) 
> - the struct should be easily evolvable so that new pieces of info can be 
> added
> - the response should be designed in such a way as to avoid transferring 
> redundant information for common cases (eg simple "dictionary coding" of 
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large 
> partition counts
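
Two of the properties in the quoted description, include/exclude lists for partition parameters and "dictionary coding" of repeated strings, can be sketched as below. This is a hedged illustration in Python, not part of the proposal: the function names, the exact-match key semantics and the response layout are assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def filter_params(
    params: Dict[str, str],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Keep only keys in `include` (if given), then drop keys in `exclude`."""
    include_set = set(include) if include is not None else None
    exclude_set = set(exclude) if exclude is not None else set()
    return {
        key: value
        for key, value in params.items()
        if (include_set is None or key in include_set) and key not in exclude_set
    }


def dictionary_encode(
    param_maps: List[Dict[str, str]],
) -> Tuple[List[str], List[Dict[int, str]]]:
    """Replace repeated parameter names with integer codes.

    Returns the shared dictionary (code -> name, by list position) and the
    per-partition maps keyed by code, so each name is sent only once.
    """
    dictionary: List[str] = []
    codes: Dict[str, int] = {}
    encoded: List[Dict[int, str]] = []
    for params in param_maps:
        row: Dict[int, str] = {}
        for name, value in params.items():
            if name not in codes:
                codes[name] = len(dictionary)
                dictionary.append(name)
            row[codes[name]] = value
        encoded.append(row)
    return dictionary, encoded


def dictionary_decode(
    dictionary: List[str], encoded: List[Dict[int, str]]
) -> List[Dict[str, str]]:
    """Inverse of dictionary_encode: map the integer codes back to names."""
    return [{dictionary[code]: value for code, value in row.items()} for row in encoded]
```

With thousands of partitions that share the same handful of parameter names, the encoded form carries each name once in the dictionary instead of once per partition.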



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
