[ 
https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620692#comment-14620692
 ] 

Elliot West commented on HIVE-10165:
------------------------------------

Thanks [~ekoifman]. With regard to your observation, I agree that the use of 
locks is incorrect. I followed the pattern in the existing Streaming API, but of 
course that is concerned with inserts only. Using [this 
reference|http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a]
 I note that I should be using a semi-shared lock. I’d be grateful for any 
additional advice you can give on when each lock type/target should be 
employed. A potential concern of mine is that the system may not know the set 
of partitions when the transaction is initiated. In this case, would it suffice 
to use a lock with a broader scope (i.e. a table lock), or should I acquire 
additional locks each time I encounter a new partition?
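To check my understanding of the lock semantics from the slides and wiki, here is a minimal sketch of which lock type each operation would take. All names here are hypothetical for illustration; Hive's transaction manager makes this decision internally, and this is not any Hive API.

```java
// Sketch of my current reading of lock-type selection per operation,
// based on the referenced slides and the Hive Transactions wiki.
// The enums and method are hypothetical, not part of any Hive API.
public class LockChoiceSketch {
    public enum LockType { SHARED, SEMI_SHARED, EXCLUSIVE }
    public enum Operation { READ, INSERT, UPDATE, DELETE, DROP }

    public static LockType lockFor(Operation op) {
        switch (op) {
            case READ:
            case INSERT:
                // Readers and inserters can proceed concurrently.
                return LockType.SHARED;
            case UPDATE:
            case DELETE:
                // Semi-shared: compatible with shared locks, but not with
                // other semi-shared or exclusive locks on the same target.
                return LockType.SEMI_SHARED;
            default:
                // DDL such as DROP must exclude all other operations.
                return LockType.EXCLUSIVE;
        }
    }
}
```

If this mapping is right, the mutation API should be requesting semi-shared locks rather than the shared locks the insert-only path uses.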

As a side note, it appears as though the current locking documentation does not 
cover update/delete scenarios or semi-shared locks. I'll volunteer to update 
these pages once I have a clearer understanding of how these lock types apply 
to these operations and partitions:

* https://cwiki.apache.org/confluence/display/Hive/Locking
* https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Finally, as this issue is now resolved, should I submit patches using 
additional JIRA issues or reopen this one?

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 1.2.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: TODOC2.0, streaming_api
>             Fix For: 2.0.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.10.patch, 
> HIVE-10165.4.patch, HIVE-10165.5.patch, HIVE-10165.6.patch, 
> HIVE-10165.7.patch, HIVE-10165.9.patch, mutate-system-overview.png
>
>
> h3. Overview
> I'd like to extend the 
> [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
>  API so that it also supports the writing of record updates and deletes in 
> addition to the already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into 
> existing datasets. Traditionally we achieve this by: reading in a 
> ground-truth dataset and a modified dataset, grouping by a key, sorting by a 
> sequence and then applying a function to determine inserted, updated, and 
> deleted rows. However, in our current scheme we must rewrite all partitions 
> that may potentially contain changes. In practice the number of mutated 
> records is very small when compared with the records contained in a 
> partition. This approach results in a number of operational issues:
> * Excessive amount of write activity required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are 
> being updated.
> * Due to the scale of the updates (hundreds of partitions), the scope for 
> contention is high. 
> I believe we can address this problem by instead writing only the changed 
> records to a Hive transactional table. This should drastically reduce the 
> amount of data that we need to write and also provide a means for managing 
> concurrent access to the data. Our existing merge processes can read and 
> retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to 
> an updated form of the hive-hcatalog-streaming API which will then have the 
> required data to perform an update or insert in a transactional manner. 
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes  
> * Opens up Hive transactional functionality in an accessible manner to 
> processes that operate outside of Hive.
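
The merge-classification step described in the issue above can be sketched roughly as follows. All class and method names are hypothetical; a real job would perform this over grouped, sequence-sorted records in a distributed framework rather than in-memory maps.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the classify step from the issue description: given a
// ground-truth dataset and a modified dataset keyed by some identifier,
// decide per key whether the change is an INSERT, UPDATE, or DELETE.
// Names are hypothetical; unchanged rows are simply not emitted here.
public class MergeSketch {
    public enum Mutation { INSERT, UPDATE, DELETE }

    public static Map<String, Mutation> classify(
            Map<String, String> base, Map<String, String> modified) {
        Map<String, Mutation> result = new HashMap<>();
        for (Map.Entry<String, String> e : modified.entrySet()) {
            if (!base.containsKey(e.getKey())) {
                result.put(e.getKey(), Mutation.INSERT);
            } else if (!base.get(e.getKey()).equals(e.getValue())) {
                result.put(e.getKey(), Mutation.UPDATE);
            } // identical rows are no-ops and are skipped
        }
        for (String key : base.keySet()) {
            if (!modified.containsKey(key)) {
                result.put(key, Mutation.DELETE);
            }
        }
        return result;
    }
}
```

Only the small set of classified mutations would then be handed, with each record's {{ROW_ID}}/{{RecordIdentifier}}, to the extended streaming API, rather than rewriting whole partitions.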



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
