Garima Dosi created HIVE-15352:
----------------------------------
Summary: MVCC (Multi Versioned Concurrency Control) in Hive
Key: HIVE-15352
URL: https://issues.apache.org/jira/browse/HIVE-15352
Project: Hive
Issue Type: New Feature
Reporter: Garima Dosi
Use Case
While building solutions for various applications, we have found that certain
datasets need multi-version concurrency support. The requirement for
multi-versioned concurrency arises mainly for two reasons:
• Simultaneous querying and loading of tables or datasets, which requires
maintaining versions for reading and writing (locking is not the right option
here)
• Maintaining a history of table/dataset loads up to some extent
Both of these requirements are seen in data management systems (warehouses, etc.).
What happens without MVCC in Hive?
In cases where MVCC was needed, a design similar to the one at
https://dzone.com/articles/zookeeper-a-real-world-example-of-how-to-use-it was
followed to make it work. ZooKeeper was used to maintain versions and provide
MVCC support. However, this design has a limitation: a normal user who wants
to query a Hive table will not know which version is current. The additional
layer needed to match versions in ZooKeeper with the dataset to be queried adds
overhead for normal users; hence the request to make this feature available in
Hive.
Hive Design for Support of MVCC
The Hive design for MVCC support could be as described below (it would broadly
follow the article mentioned in the previous section):
1. First, the user should be able to declare that a table is an MVCC table,
with a DDL something like this:
create table <table_name> ( <column_specs>) MULTI_VERSIONED ON [sequence, time]
Internally, this DDL can be translated to a table partitioned either on a
sequence number (auto-generated by Hive) or on a timestamp. The metastore would
keep this information.
2. DML statements that insert or load data into the table would remain the same
for an end user. Internally, however, Hive would detect that the table is
multi-versioned and write the new data to a new partition representing a new
version of the dataset. The Hive Metastore would then be updated with the
current version.
3. DML statements that query the table would also remain the same for a user.
Internally, however, Hive would run queries against the latest version, which
is always stored in the metastore.
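The three steps above can be illustrated with a minimal in-memory sketch. It is not Hive code: the class VersionedTable, its methods, and the use of a plain dictionary in place of partitions and the metastore are all assumptions made for illustration, and it only models sequence-based versioning.

```python
import threading


class VersionedTable:
    """Toy stand-in for a multi-versioned table plus its metastore entry (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._partitions = {}         # version number -> rows in that partition
        self._next_version = 1        # sequence-based versioning
        self.current_version = None   # the pointer a metastore would hold

    def load(self, rows):
        """Write rows to a fresh partition, then publish it as the current version."""
        with self._lock:
            version = self._next_version
            self._next_version += 1
        # The new partition is built without holding the lock, so readers
        # keep using the previously published version in the meantime.
        data = list(rows)
        with self._lock:
            self._partitions[version] = data
            # Never move the pointer backwards if loads finish out of order.
            if self.current_version is None or version > self.current_version:
                self.current_version = version
        return version

    def query(self):
        """Return (version, rows) for the version that is current when the query starts."""
        with self._lock:
            version = self.current_version
            if version is None:
                return None, []
            return version, list(self._partitions[version])


table = VersionedTable("sales")
table.load([("a", 1)])
first = table.query()
table.load([("b", 2)])
second = table.query()
```

In this sketch the user-facing calls (load and query) never mention a version; the version pointer is resolved internally, which is the behaviour the proposal asks of Hive.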
Management of obsolete versions
Obsolete versions can be deleted in either of the following ways:
1. A setting that simply deletes any version which is older than a threshold
and is not active, OR
2. Tracking the count of queries running on older versions and deleting those
that are not the latest and are not being used by any query. This would
require some sort of background thread monitoring the table for obsolete
versions. As shown in the article mentioned above, this would also require
incrementing a version's query count whenever that version is queried and
decrementing it once the query is done.
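The second option (per-version query counting) might look roughly like the sketch below. The VersionTracker class and its method names are hypothetical, and the counts live in process memory here, whereas the proposal would keep them in the metastore or ZooKeeper. The collect_obsolete method is what a background thread would call periodically.

```python
import threading
from contextlib import contextmanager


class VersionTracker:
    """Hypothetical per-version query counter used to find deletable versions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._readers = {}    # version -> number of queries currently using it
        self.current = None   # latest published version

    def publish(self, version):
        """Register a new version and make it the one new queries will use."""
        with self._lock:
            self._readers.setdefault(version, 0)
            self.current = version

    @contextmanager
    def reading(self):
        """Increment the current version's count for the duration of a query."""
        with self._lock:
            if self.current is None:
                raise RuntimeError("no version has been published yet")
            version = self.current
            self._readers[version] += 1
        try:
            yield version
        finally:
            with self._lock:
                self._readers[version] -= 1

    def collect_obsolete(self):
        """Return and forget versions that are not the latest and have no running queries."""
        with self._lock:
            obsolete = [v for v, n in self._readers.items()
                        if v != self.current and n == 0]
            for v in obsolete:
                del self._readers[v]
        return sorted(obsolete)


tracker = VersionTracker()
tracker.publish(1)
with tracker.reading() as version_in_use:
    tracker.publish(2)                          # a load finishes mid-query
    during_query = tracker.collect_obsolete()   # version 1 is still in use
after_query = tracker.collect_obsolete()        # version 1 is now deletable
```

The first option (age threshold) would need no counters, but it could delete a version that a long-running query is still reading unless the "not active" check is backed by something like the counts shown here.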
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)