[jira] Created: (HIVE-834) Hive merge & queries concurrency issue

Jerome Boulon (JIRA) Tue, 15 Sep 2009 13:42:26 -0700

Hive merge & queries concurrency issue
--------------------------------------


                 Key: HIVE-834
                 URL: https://issues.apache.org/jira/browse/HIVE-834
             Project: Hadoop Hive
          Issue Type: Improvement
          Components: Metastore, Query Processor, Server Infrastructure
            Reporter: Jerome Boulon


Today we load our Hive table every XX minutes, so at the end of the day 
or sooner we have to run a Hive merge in order to 1) reduce the number of files 
on HDFS and 2) improve Hive performance.

During that merge, a query run against that table may fail with a 
FileNotFound exception because of the merge.
The idea is to use some kind of versioning so that queries can keep running 
while Hive is doing a merge.

At a high level, the merge will do the following:
1) Create a new version V2, so new writers will write to the new version and 
readers will have to read from both
2.0) Put a merge flag on the V1 directory with a UUID/timeout/etc. to prevent 
any other merge while that one is running
2.1) New select queries will read from V1 and V2
2.2) New write queries will write to V2
3) Run the merge
4) Publish the new folder V3
5) Readers will now read from V2 and V3
6) Older versions can be removed in the background so running queries will not 
fail
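
The steps above can be sketched as a small in-memory model. This is only an 
illustration of the proposal, not existing Hive code: the class 
VersionedTable and all of its methods are hypothetical names, the merge flag 
is kept per table rather than in the V1 directory, and the flag is checked 
before V2 is created so that a rejected merge leaves no empty version behind.

```python
import time
import uuid


class VersionedTable:
    """Hypothetical in-memory model of the versioned merge proposal."""

    def __init__(self):
        self.files = {1: []}       # version number -> file names
        self.read_versions = [1]   # versions a newly started query must scan
        self.write_version = 1     # version new writers append to
        self.merge_flag = None     # (uuid, expiry) while a merge is running
        self.retired = []          # merged-away versions awaiting deletion

    def write(self, name):
        self.files[self.write_version].append(name)

    def snapshot(self):
        """Versions a query starting now should read (steps 2.1 and 5)."""
        return list(self.read_versions)

    def begin_merge(self, timeout_s=3600, now=None):
        """Steps 1 and 2.0: set the flag and open a new write version."""
        now = time.time() if now is None else now
        if self.merge_flag is not None and self.merge_flag[1] > now:
            raise RuntimeError("another merge is already running")
        sources = list(self.read_versions)   # everything readable gets merged
        new_write = max(self.files) + 1
        self.files[new_write] = []
        self.write_version = new_write
        self.read_versions = sources + [new_write]
        token = uuid.uuid4().hex
        self.merge_flag = (token, now + timeout_s)
        return token, sources

    def publish_merge(self, token, sources):
        """Steps 4 and 5: publish the merged version and swap the read set."""
        if self.merge_flag is None or self.merge_flag[0] != token:
            raise RuntimeError("merge flag missing or owned by another merge")
        merged = max(self.files) + 1
        self.files[merged] = ["v%d/merged" % merged]  # stand-in for step 3
        self.read_versions = [
            v for v in self.read_versions if v not in sources
        ] + [merged]
        self.retired.extend(sources)
        self.merge_flag = None
        return merged

    def cleanup(self, active_snapshots):
        """Step 6: delete retired versions no running query still reads."""
        in_use = {v for snap in active_snapshots for v in snap}
        keep = []
        for v in self.retired:
            if v in in_use:
                keep.append(v)
            else:
                del self.files[v]
        self.retired = keep


table = VersionedTable()
table.write("load-0001")
old_query = table.snapshot()             # query started before the merge
token, sources = table.begin_merge()
table.write("load-0002")                 # new writes land in V2
during = table.snapshot()                # query started during the merge
merged = table.publish_merge(token, sources)
after = table.snapshot()                 # query started after publication
table.cleanup([old_query])               # V1 kept: old_query still reads it
table.cleanup([])                        # no readers left: V1 is deleted
```

The key point of the design is that a query pins its set of versions when it 
starts, and a version is only deleted once no pinned set refers to it.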

In practice it's a little more complicated because we need to do that in a 
transaction, but it sounds feasible and would involve something like either 
ZooKeeper or database transactions.
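
As a rough sketch of the database option, the publish step could swap the set 
of live versions inside a single transaction, so readers never see a 
half-updated set. The example uses Python's built-in sqlite3 module purely as 
a stand-in for a real metastore database; the live_versions table, the "logs" 
table name and the publish/live helpers are assumptions made up for this 
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE live_versions ("
    " tbl TEXT, version INTEGER, PRIMARY KEY (tbl, version))"
)
with conn:
    conn.executemany(
        "INSERT INTO live_versions VALUES (?, ?)", [("logs", 1), ("logs", 2)]
    )


def publish(conn, tbl, merged_sources, merged_version):
    """Atomically replace merged_sources with merged_version."""
    with conn:  # commits on success, rolls back if an exception is raised
        for v in merged_sources:
            cur = conn.execute(
                "DELETE FROM live_versions WHERE tbl = ? AND version = ?",
                (tbl, v),
            )
            if cur.rowcount != 1:
                raise RuntimeError("version %d is not live" % v)
        conn.execute(
            "INSERT INTO live_versions VALUES (?, ?)", (tbl, merged_version)
        )


def live(conn, tbl):
    """Versions a new reader should scan, in ascending order."""
    rows = conn.execute(
        "SELECT version FROM live_versions WHERE tbl = ? ORDER BY version",
        (tbl,),
    )
    return [r[0] for r in rows]


publish(conn, "logs", [1], 3)        # V1 merged into V3; V2 stays live
live_after_publish = live(conn, "logs")

try:
    publish(conn, "logs", [2, 9], 4)  # V9 does not exist: whole swap aborts
except RuntimeError:
    pass
live_after_failed_publish = live(conn, "logs")
```

If any source version is missing, the whole swap is rolled back and the live 
set is left exactly as it was.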

Also, it would be nice to be able to trigger a two-level merge:
- quick merge during the day: file size less than XX MB (while your partition 
is still active, hot data)
- full merge at the end of the day (cold data)
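
The two levels could differ only in which files they pick up. A minimal 
sketch, assuming file sizes are already known: select_for_merge and its 
parameters are hypothetical, and the size limit is left as a required 
argument because the actual threshold is not decided here (the 64 MB used in 
the example is an arbitrary placeholder).

```python
MB = 1024 * 1024


def select_for_merge(file_sizes, mode, small_file_limit_mb=None):
    """Return the file names a merge of the given mode would combine.

    file_sizes maps file name -> size in bytes.
    mode is "quick" (small files only, hot data) or "full" (everything).
    """
    if mode == "full":
        return sorted(file_sizes)
    if mode == "quick":
        if small_file_limit_mb is None:
            raise ValueError("quick merge needs a size limit")
        limit = small_file_limit_mb * MB
        return sorted(n for n, size in file_sizes.items() if size < limit)
    raise ValueError("unknown merge mode: %r" % (mode,))


sizes = {"part-0": 5 * MB, "part-1": 200 * MB, "part-2": 63 * MB}
quick = select_for_merge(sizes, "quick", small_file_limit_mb=64)
full = select_for_merge(sizes, "full")
```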





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
