[
https://issues.apache.org/jira/browse/HIVE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773322#comment-13773322
]
Alan Gates commented on HIVE-5317:
----------------------------------
bq. Regarding use cases, it appears that this design won't be able to have fast
performance for fine-grained inserts. ...
Agreed, this will fail badly in a one-insert-at-a-time situation; that isn't
what we're going after. We would like to be able to handle a batch of inserts
every minute, but for the moment that seems like the floor.
bq. I didn't see any obvious problem, due to the versioned scans, but is this
design safe from the Halloween problem?
As a rule Hive jobs always define their input up front and then scan only once.
So even though an update is writing a new record, the delta file it's writing
into shouldn't be defined as part of its input. In the future when we move to
having one delta file rather than one per write (more details on that to
follow), this may be more of an issue, and we'll need to think about how to
avoid it.
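To make the "define input up front" point concrete, here's a minimal sketch of
the idea in plain Java (the DeltaFile class and planInput method are
illustrative assumptions, not Hive code): the reader fixes its list of delta
files once, before it starts scanning, so a delta it writes into during the
same statement is never part of its own input.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a delta file tagged with the range of transaction ids it covers.
class DeltaFile {
    final String path;
    final long minTxnId;   // first transaction id written into this delta
    final long maxTxnId;   // last transaction id written into this delta

    DeltaFile(String path, long minTxnId, long maxTxnId) {
        this.path = path;
        this.minTxnId = minTxnId;
        this.maxTxnId = maxTxnId;
    }
}

class ScanPlanner {
    // Pick the input set once, at planning time.  The delta this statement is
    // currently writing has maxTxnId >= readerStartTxn, so it is never chosen.
    // (The real check would also consult the open/aborted transaction lists.)
    static List<DeltaFile> planInput(List<DeltaFile> allDeltas, long readerStartTxn) {
        List<DeltaFile> input = new ArrayList<>();
        for (DeltaFile d : allDeltas) {
            if (d.maxTxnId < readerStartTxn) {
                input.add(d);
            }
        }
        return input;
    }
}
{code}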
bq. How do you ensure that the delta files are fully written (committed) to the
storage system before the metastore treats the transaction that created the
delta file as committed?
The OutputCommitter will move the new delta files from a temp directory to the
directory of the base file (as is standard in Hadoop apps). Only after this
will the Hive client communicate to the metastore that the transaction is
committed. If there is a failure after the files are moved from the temp
directory but before the commit is recorded, readers will still ignore them, as
their transaction id will be listed as aborted.
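Roughly, the ordering looks like this (a sketch only; MetastoreClient and
commitTransaction are placeholder names, not the real metastore API):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Placeholder for whatever call tells the metastore a transaction committed.
interface MetastoreClient {
    void commitTransaction(long txnId) throws IOException;
}

class DeltaCommitter {
    static void commit(Path tempDelta, Path baseDir, long txnId,
                       MetastoreClient metastore) throws IOException {
        // 1. Promote the delta out of the temp directory into the directory of
        //    the base file (in Hadoop this would be a rename on HDFS).
        Files.move(tempDelta, baseDir.resolve(tempDelta.getFileName()));

        // 2. Only after the move succeeds is the transaction marked committed.
        //    If we die between step 1 and step 2, the file is present but its
        //    transaction id stays aborted, so readers skip it.
        metastore.commitTransaction(txnId);
    }
}
{code}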
bq. It's not completely clear why you need exactly the transaction ID
information specified in the delta file names. E.g. would just the transaction
ID (start timestamp) be enough?
The reason for including the end id is so that readers can quickly decide
whether they need to scan that file at all, and potentially prune files from
their scans. Does that answer the question?
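As a concrete illustration of the pruning, assuming a file name pattern along
the lines of delta_<firstTxnId>_<lastTxnId> (the parsing below is an assumption
about the format, not the final spec):
{code:java}
import java.util.List;
import java.util.stream.Collectors;

class DeltaPruner {
    // A reader that knows the highest transaction id visible to it can drop a
    // delta whose first id is already past that point without opening the file.
    static boolean mightContainVisibleRows(String fileName, long highWatermark) {
        String[] parts = fileName.split("_");      // e.g. "delta_0000017_0000020"
        long firstTxnId = Long.parseLong(parts[1]);
        return firstTxnId <= highWatermark;
    }

    static List<String> prune(List<String> deltaFiles, long highWatermark) {
        return deltaFiles.stream()
                .filter(f -> mightContainVisibleRows(f, highWatermark))
                .collect(Collectors.toList());
    }
}
{code}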
bq. Is there any issue with correctness of task retry in the presence of
updates if a task fails?
As is standard Hadoop practice, output from tasks will be written to a temp
directory. Output from failed or killed tasks will never be promoted to the
base file directory and thus will never be seen by readers.
I'm working on updating the doc with answers to these. One of us will post the
updated doc soon.
> Implement insert, update, and delete in Hive with full ACID support
> -------------------------------------------------------------------
>
> Key: HIVE-5317
> URL: https://issues.apache.org/jira/browse/HIVE-5317
> Project: Hive
> Issue Type: New Feature
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: InsertUpdatesinHive.pdf
>
>
> Many customers want to be able to insert, update and delete rows from Hive
> tables with full ACID support. The use cases are varied, but the form of the
> queries that should be supported are:
> * INSERT INTO tbl SELECT …
> * INSERT INTO tbl VALUES ...
> * UPDATE tbl SET … WHERE …
> * DELETE FROM tbl WHERE …
> * MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN
> ...
> * SET TRANSACTION LEVEL …
> * BEGIN/END TRANSACTION
> Use Cases
> * Once an hour, a set of inserts and updates (up to 500k rows) for various
> dimension tables (e.g. customer, inventory, stores) needs to be processed. The
> dimension tables have primary keys and are typically bucketed and sorted on
> those keys.
> * Once a day, a small set of records (up to 100k rows) needs to be deleted for
> regulatory compliance.
> * Once an hour, a log of transactions is exported from an RDBMS and the fact
> tables need to be updated (up to 1m rows) to reflect the new data. The
> transactions are a combination of inserts, updates, and deletes. The table is
> partitioned and bucketed.