[ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elliot West updated HIVE-10165:
-------------------------------
    Attachment: ReflectiveOperationWriter.java

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, HIVE-10165.2.patch, ReflectiveOperationWriter.java
>
>
> h3. Overview
> I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest] API so that it also supports the writing of record updates and deletes in addition to the already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence, and then applying a function to determine inserted, updated, and deleted rows. However, in our current scheme we must rewrite all partitions that may potentially contain changes. In practice the number of mutated records is very small compared with the number of records contained in a partition. This approach results in a number of operational issues:
> * An excessive amount of write activity is required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are being updated.
> * Due to the scale of the updates (hundreds of partitions), the scope for contention is high.
> I believe we can address this problem by instead writing only the changed records to a Hive transactional table.
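The group-by-key classification step described above can be sketched in plain Java (stdlib only; this is illustrative code, not part of the patches, and the class and method names are assumptions):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the merge classification described in the
// Motivation section: compare a ground-truth dataset with a modified
// dataset, keyed by record key, and emit only the changed rows as
// INSERT/UPDATE/DELETE operations. Unchanged rows are never rewritten.
public class MergeClassifier {

    public enum Op { INSERT, UPDATE, DELETE }

    /**
     * Keys only in 'modified' are inserts, keys only in 'existing' are
     * deletes, and keys in both whose values differ are updates.
     */
    public static Map<String, Op> classify(Map<String, String> existing,
                                           Map<String, String> modified) {
        Map<String, Op> ops = new TreeMap<>();
        for (Map.Entry<String, String> e : modified.entrySet()) {
            String old = existing.get(e.getKey());
            if (old == null) {
                ops.put(e.getKey(), Op.INSERT);
            } else if (!old.equals(e.getValue())) {
                ops.put(e.getKey(), Op.UPDATE);
            } // unchanged rows produce no operation at all
        }
        for (String key : existing.keySet()) {
            if (!modified.containsKey(key)) {
                ops.put(key, Op.DELETE);
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> existing = Map.of("a", "1", "b", "2", "c", "3");
        Map<String, String> modified = Map.of("a", "1", "b", "9", "d", "4");
        // Only three of the four rows are touched: b=UPDATE, c=DELETE, d=INSERT
        System.out.println(classify(existing, modified));
    }
}
```

With updates and deletes available on the streaming API, the small map of operations this step produces is all that needs to be written, rather than every partition that might contain a change.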
> This should drastically reduce the amount of data that we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an updated form of the hive-hcatalog-streaming API, which will then have the data required to perform an update or insert in a transactional manner.
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes.
> * Opens up Hive transactional functionality in an accessible manner to processes that operate outside of Hive.
> h3. Implementation
> We've patched the API to provide visibility of the underlying {{OrcRecordUpdater}} and to allow extension of the {{AbstractRecordWriter}} by third parties outside of the package. We've also updated the user-facing interfaces to provide update and delete functionality. I've provided the modifications as three incremental patches. Generally speaking, each patch makes the API less backwards compatible but more consistent with respect to offering updates and deletes as well as writes (inserts). Ideally I hope that all three patches have merit, but only the first patch is absolutely necessary to enable the features we need on the API, and it does so in a backwards-compatible way. I'll summarise the contents of each patch:
> h4. [^HIVE-10165.0.patch] - Required
> This patch contains what we consider to be the minimum set of changes required to allow users to create {{RecordWriter}} subclasses that can insert, update, and delete records. These changes also maintain backwards compatibility, at the expense of confusing the API a little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within our data processing jobs our records are often available in a strongly typed and decoded form, such as a POJO or a Tuple object.
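For illustration only, a merge job might retain each record's identifier alongside its decoded form roughly like this (the {{MutableRecord}} wrapper is hypothetical and not part of the patches; the nested stand-in merely mirrors the shape of Hive's {{RecordIdentifier}}):

```java
// Hypothetical sketch: pair a record's RecordIdentifier with its
// strongly typed (Object) representation so the identifier can later
// be handed back to the streaming API for an update or delete.
public class MutableRecord {

    /** Minimal stand-in mirroring the shape of Hive's RecordIdentifier. */
    public static final class RecordIdentifier {
        public final long transactionId;
        public final int bucketId;
        public final long rowId;

        public RecordIdentifier(long transactionId, int bucketId, long rowId) {
            this.transactionId = transactionId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
    }

    public final RecordIdentifier id; // null for brand-new (inserted) rows
    public final Object row;          // decoded POJO/Tuple, not byte[]

    public MutableRecord(RecordIdentifier id, Object row) {
        this.id = id;
        this.row = row;
    }

    /** Updates and deletes need an identifier; inserts must not have one. */
    public boolean isMutationOfExistingRow() {
        return id != null;
    }
}
```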
> Therefore it seems to make sense that we are able to pass these through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding step. This of course still allows users to use {{byte[]}} if they wish.
> h4. [^HIVE-10165.1.patch] - Nice to have
> This patch builds on the changes made in the *required* patch and aims to make the API cleaner and more consistent while accommodating updates and deletes. It also adds some logic to prevent the user from submitting multiple operation types to a single {{TransactionBatch}}, as we found this creates data inconsistencies within the Hive table. This patch breaks backwards compatibility.
> h4. [^HIVE-10165.2.patch] - Nomenclature
> This final patch simply renames some of the existing types to more accurately convey their increased responsibilities. The API is no longer writing just new records; it is now also responsible for writing operations that are applied to existing records. This patch breaks backwards compatibility.
> h3. Example
> I've attached a simple example of typical usage of the API. This is not a patch and is intended as an illustration only.
> h3. Known issues
> I have not yet provided any unit tests for the extended functionality. I fully expect that these are required and will work on them if these patches have merit.
> *Note: Attachments to follow.*

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)