Hive merge & queries concurrency issue
--------------------------------------
Key: HIVE-834
URL: https://issues.apache.org/jira/browse/HIVE-834
Project: Hadoop Hive
Issue Type: Improvement
Components: Metastore, Query Processor, Server Infrastructure
Reporter: Jerome Boulon

Today we load our Hive table every XX minutes, so by the end of the day (or sooner) we have to run a Hive merge in order to 1) reduce the number of files on HDFS and 2) improve Hive performance.
During that merge, a query against the table may fail with a FileNotFound exception because the merge has replaced files the query still expects to find.

The idea is to use some kind of versioning so that queries can keep running while Hive is doing a merge.

At a high level, the merge would:
1) Create a new version V2, so new writers write to the new version and readers have to read from both
2.0) Put a merge flag on the V1 directory with a UUID/timeout/etc. to prevent any other merge while that one is running
2.1) New select queries read from V1 and V2
2.2) New write queries write to V2
3) Run the merge
4) Publish the new folder V3
5) Readers now read from V2 and V3
6) Older versions can be removed in the background so running queries do not fail

In practice it's a little more complicated because we need to do that in a transaction, but it sounds feasible and would involve something like either ZooKeeper or database transactions.

Also, it would be nice to be able to trigger a two-level merge:
- quick merge during the day: only files smaller than XX MB (while the partition is still active, hot data)
- full merge at the end of the day (cold data)
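
To make the proposal concrete, here is a minimal in-memory sketch of the versioned merge steps and of the two-level file selection. It is not Hive code: `VersionedTable`, `MergeInProgress`, `begin_merge`, `publish_merge` and `merge_candidates` are hypothetical names, the version directories are just strings, and a local `threading.Lock` stands in for whatever ZooKeeper or database transaction a real implementation would need. It also assumes a merge covers every version readable when it starts, which the description above does not spell out.

```python
import threading
import time
import uuid


class MergeInProgress(Exception):
    """Raised when an unexpired merge flag already exists for the table."""


class VersionedTable:
    """Toy catalog of version directories for one table (hypothetical model)."""

    def __init__(self):
        # Assumption: a local lock stands in for ZooKeeper / a DB transaction.
        self._lock = threading.Lock()
        self._next_id = 2
        self.read_versions = ["V1"]  # directories a new query must scan
        self.write_version = "V1"    # directory new writers append to
        self.merge_flag = None       # (token, expiry) while a merge runs
        self.retired = []            # old versions awaiting background removal

    def _new_version(self):
        name = f"V{self._next_id}"
        self._next_id += 1
        return name

    def snapshot(self):
        """A query copies the version list once, at start (steps 2.1 and 5)."""
        with self._lock:
            return list(self.read_versions)

    def begin_merge(self, timeout_s=3600, now=None):
        """Steps 1 and 2.0: open a new write version and set the merge flag."""
        now = time.time() if now is None else now
        with self._lock:
            if self.merge_flag is not None and self.merge_flag[1] > now:
                raise MergeInProgress(self.merge_flag[0])
            sources = list(self.read_versions)        # versions to be merged
            self.write_version = self._new_version()  # step 2.2: writers move on
            self.read_versions.append(self.write_version)
            token = str(uuid.uuid4())
            self.merge_flag = (token, now + timeout_s)
            return token, sources

    def publish_merge(self, token, sources):
        """Steps 4 and 5: swap the merged output in for the source versions."""
        with self._lock:
            if self.merge_flag is None or self.merge_flag[0] != token:
                raise ValueError("merge flag is missing or owned by another merge")
            merged = self._new_version()
            kept = [v for v in self.read_versions if v not in sources]
            self.read_versions = kept + [merged]
            self.retired.extend(sources)  # step 6: deletion happens later
            self.merge_flag = None
            return merged


def merge_candidates(file_sizes_mb, max_size_mb=None):
    """Quick merge: files under max_size_mb. Full merge (None): every file."""
    if max_size_mb is None:
        return sorted(file_sizes_mb)
    return sorted(n for n, size in file_sizes_mb.items() if size < max_size_mb)
```

In this sketch a query that called `snapshot()` before `publish_merge()` still holds the old version names, so the background cleanup of `retired` would have to wait until such queries finish. How that is tracked is left open here, as it is in the proposal.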