[
https://issues.apache.org/jira/browse/HIVE-12285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Elliot West updated HIVE-12285:
---
Description:
With the introduction of a concurrency model (HIVE-1293), Hive uses locks to
coordinate access and updates to both table data and metadata. Within the Hive
CLI such lock management is seamless. However, Hive provides additional APIs
that permit interaction with data repositories, namely the HCatalog APIs.
Currently, operations implemented by these APIs do not participate in Hive's
locking scheme. Furthermore, the APIs do not expose the locking mechanisms
(unlike the Metastore Thrift API, which does), so users cannot interact with
locks explicitly either. As a result, users of the APIs have no choice but to
manipulate these data repositories outside the control of Hive's lock
management, which can lead to data inconsistencies both for external
processes using the API and for queries executing within Hive.
h3. Scope of work
This ticket is concerned with the sections of the HCatalog API that perform
DDL-type operations using the metastore, not with those whose purpose is to
read/write table data. A separate issue already exists for adding locking to
HCat readers and writers (HIVE-6207).
h3. Proposed work
The following work items would serve as a minimum deliverable that would
allow API users to work effectively with locks:
* Comprehensively document on the wiki the locks required for various Hive
operations. At a minimum this should cover all operations exposed by
{{HCatClient}}. The [Locking design
document|https://cwiki.apache.org/confluence/display/Hive/Locking] can be used
as a starting point or perhaps updated.
* Implement methods and types in the {{HCatClient}} API that allow users to
manipulate Hive locks. For the most part I'd expect these to delegate to the
metastore API implementations:
** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
** {{org.apache.hadoop.hive.metastore.api.LockType}}
** {{org.apache.hadoop.hive.metastore.api.LockState}}
** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
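To illustrate the explicit lock lifecycle these methods imply (request a lock, poll until it is granted, do the work, always release), here is a minimal sketch. It is not the {{HCatClient}} or {{IMetaStoreClient}} API: {{LockApi}}, {{FakeLockApi}} and {{runWithLock}} are hypothetical stand-ins with simplified signatures, and the in-memory fake only exists so the example runs without a metastore.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: LockApi is a hypothetical stand-in, not the real HCatClient or IMetaStoreClient.
class ExplicitLockSketch {

    /** Hypothetical, simplified subset of the lock operations the ticket proposes to expose. */
    interface LockApi {
        long lock(String database, String table);   // request a lock, returns its id
        boolean checkLock(long lockId);             // true once the lock has been granted
        void unlock(long lockId);                   // release the lock
    }

    /** In-memory fake: a lock is reported as granted on the second checkLock call. */
    static class FakeLockApi implements LockApi {
        private final Map<Long, Integer> checks = new HashMap<>();
        private long nextId = 1;

        public long lock(String database, String table) {
            long id = nextId++;
            checks.put(id, 0);
            return id;
        }

        public boolean checkLock(long lockId) {
            if (!checks.containsKey(lockId)) {
                throw new IllegalStateException("unknown lock " + lockId);
            }
            return checks.merge(lockId, 1, Integer::sum) >= 2;
        }

        public void unlock(long lockId) {
            checks.remove(lockId);
        }

        int openLocks() {
            return checks.size();
        }
    }

    /** Request a lock, poll until granted, run the operation, and always release. */
    static boolean runWithLock(LockApi api, String db, String table, int maxPolls, Runnable op) {
        long lockId = api.lock(db, table);
        try {
            // A real client would sleep or back off between polls.
            for (int i = 0; i < maxPolls; i++) {
                if (api.checkLock(lockId)) {
                    op.run();
                    return true;
                }
            }
            return false; // gave up waiting; the caller decides whether to retry
        } finally {
            api.unlock(lockId);
        }
    }

    public static void main(String[] args) {
        FakeLockApi api = new FakeLockApi();
        boolean ran = runWithLock(api, "default", "events", 5,
                () -> System.out.println("altering table under lock"));
        System.out.println("ran=" + ran + ", openLocks=" + api.openLocks());
    }
}
```

Releasing in a {{finally}} block is the important part: it is exactly the kind of detail that explicit lock management leaves to the API user.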
h3. Additional proposals
Explicit lock management should be fairly simple to add to {{HCatClient}};
however, it puts the onus on the API user to correctly understand and implement
code that uses locks in an appropriate manner. Failure to do so may have
undesirable consequences. With a simpler user model, the operations exposed on
the API would automatically acquire and release the locks that they need. This
might work well for small numbers of operations, but perhaps not for large
sequences of invocations. (Do we need to worry about this, though, given that
the API methods usually accept batches?) Additionally, tasks such as heartbeat
management could be handled implicitly for long-running sets of
operations. With these concerns in mind, it may also be beneficial to deliver
some of the following:
* A means to automatically acquire/release appropriate locks for {{HCatClient}}
operations.
* A component that maintains a lock heartbeat from the client.
* A strategy for switching between manual/automatic lock management, analogous
to SQL's {{autocommit}} for transactions.
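One way to picture the manual/automatic switch is a client wrapper with an autocommit-style flag. The sketch below is an assumption about how this could look, not a proposed design: {{LockApi}}, {{RecordingLockApi}}, {{LockingClient}} and {{setAutoLock}} are all hypothetical names, and the "DDL operation" is just a {{Runnable}}.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: none of these types exist in HCatalog; they illustrate the idea.
class AutoLockSketch {

    /** Hypothetical lock operations; not the real HCatClient API. */
    interface LockApi {
        long lock(String table);
        void unlock(long lockId);
    }

    /** Records calls so the behaviour of each mode can be observed. */
    static class RecordingLockApi implements LockApi {
        final List<String> log = new ArrayList<>();
        private long nextId = 1;

        public long lock(String table) {
            log.add("lock " + table);
            return nextId++;
        }

        public void unlock(long lockId) {
            log.add("unlock " + lockId);
        }
    }

    /** Hypothetical client wrapper with an autocommit-style switch. */
    static class LockingClient {
        private final LockApi api;
        private boolean autoLock = true; // on by default, as autocommit usually is

        LockingClient(LockApi api) {
            this.api = api;
        }

        void setAutoLock(boolean autoLock) {
            this.autoLock = autoLock;
        }

        /** Stand-in for a DDL operation such as dropping a partition. */
        void execute(String table, Runnable ddl) {
            if (!autoLock) {
                ddl.run(); // manual mode: the caller is responsible for locking
                return;
            }
            long lockId = api.lock(table);
            try {
                ddl.run();
            } finally {
                api.unlock(lockId); // released even if the operation fails
            }
        }
    }

    public static void main(String[] args) {
        RecordingLockApi api = new RecordingLockApi();
        LockingClient client = new LockingClient(api);
        client.execute("events", () -> api.log.add("ddl A")); // automatic mode
        client.setAutoLock(false);
        client.execute("events", () -> api.log.add("ddl B")); // manual mode
        System.out.println(api.log);
    }
}
```

In automatic mode each operation is bracketed by its own lock and unlock; in manual mode the wrapper stays out of the way so a caller can hold one lock across a long sequence of invocations.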
An API for lock and heartbeat management already exists in the HCatalog
Mutation API (see: {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}).
It will likely make sense to refactor this code and/or the code that uses it.
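As a rough illustration of a client-side heartbeat component, the following sketch schedules a periodic heartbeat for a single lock and stops when closed. It does not reproduce the existing Mutation API code: {{HeartbeatTarget}} and {{LockHeartbeat}} are hypothetical names, and the callback stands in for whatever metastore heartbeat call a real implementation would delegate to.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a generic scheduled heartbeat, not the HCatalog Mutation API implementation.
class HeartbeatSketch {

    /** Hypothetical callback; a real one might delegate to a metastore heartbeat call. */
    interface HeartbeatTarget {
        void heartbeat(long lockId);
    }

    /** Sends a heartbeat for one lock at a fixed interval until closed. */
    static class LockHeartbeat implements AutoCloseable {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "lock-heartbeat");
                    t.setDaemon(true); // do not keep the JVM alive for heartbeats
                    return t;
                });

        LockHeartbeat(HeartbeatTarget target, long lockId, long intervalMillis) {
            // Note: if the task throws, later runs are suppressed; a real
            // component would catch and report heartbeat failures.
            scheduler.scheduleAtFixedRate(() -> target.heartbeat(lockId),
                    intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        }

        @Override
        public void close() {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger beats = new AtomicInteger();
        try (LockHeartbeat hb = new LockHeartbeat(id -> beats.incrementAndGet(), 42L, 20)) {
            Thread.sleep(200); // stand-in for a long-running set of operations
        }
        System.out.println("heartbeats sent: " + beats.get());
    }
}
```

Tying the heartbeat to try-with-resources means it stops whenever the guarded work ends, whether normally or with an exception.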