[ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elliot West updated HIVE-10165:
-------------------------------
    Attachment: ReflectiveOperationWriter.java

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, HIVE-10165.2.patch, ReflectiveOperationWriter.java
>
>
> h3. Overview
> I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest] API so that it also supports the writing of record updates and deletes in addition to the already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence, and then applying a function to determine inserted, updated, and deleted rows. However, in our current scheme we must rewrite all partitions that may potentially contain changes. In practice the number of mutated records is very small compared with the number of records contained in a partition. This approach results in a number of operational issues:
> * An excessive amount of write activity is required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are being updated.
> * Due to the scale of the updates (hundreds of partitions), the scope for contention is high.
> I believe we can address this problem by instead writing only the changed records to a Hive transactional table.
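The group-by-key classification step described above can be sketched in plain Java (stdlib only; this is illustrative code, not part of the patches, and the class and method names are assumptions):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the merge classification described in the
// Motivation section: compare a ground-truth dataset with a modified
// dataset, keyed by record key, and emit only the changed rows as
// INSERT/UPDATE/DELETE operations. Unchanged rows are never rewritten.
public class MergeClassifier {

    public enum Op { INSERT, UPDATE, DELETE }

    /**
     * Keys only in 'modified' are inserts, keys only in 'existing' are
     * deletes, and keys in both whose values differ are updates.
     */
    public static Map<String, Op> classify(Map<String, String> existing,
                                           Map<String, String> modified) {
        Map<String, Op> ops = new TreeMap<>();
        for (Map.Entry<String, String> e : modified.entrySet()) {
            String old = existing.get(e.getKey());
            if (old == null) {
                ops.put(e.getKey(), Op.INSERT);
            } else if (!old.equals(e.getValue())) {
                ops.put(e.getKey(), Op.UPDATE);
            } // unchanged rows produce no operation at all
        }
        for (String key : existing.keySet()) {
            if (!modified.containsKey(key)) {
                ops.put(key, Op.DELETE);
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> existing = Map.of("a", "1", "b", "2", "c", "3");
        Map<String, String> modified = Map.of("a", "1", "b", "9", "d", "4");
        // Only three of the four rows are touched: b=UPDATE, c=DELETE, d=INSERT
        System.out.println(classify(existing, modified));
    }
}
```

With updates and deletes available on the streaming API, the small map of operations this step produces is all that needs to be written, rather than every partition that might contain a change.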
> This should drastically reduce the amount of data that we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an updated form of the hive-hcatalog-streaming API, which will then have the data required to perform an update or insert in a transactional manner.
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes.
> * Opens up Hive transactional functionality in an accessible manner to processes that operate outside of Hive.
> h3. Implementation
> We've patched the API to provide visibility of the underlying {{OrcRecordUpdater}} and to allow extension of the {{AbstractRecordWriter}} by third parties outside of the package. We've also updated the user-facing interfaces to provide update and delete functionality. I've provided the modifications as three incremental patches. Generally speaking, each patch makes the API less backwards compatible but more consistent with respect to offering updates and deletes as well as writes (inserts). Ideally I hope that all three patches have merit, but only the first patch is absolutely necessary to enable the features we need on the API, and it does so in a backwards-compatible way. I'll summarise the contents of each patch:
> h4. [^HIVE-10165.0.patch] - Required
> This patch contains what we consider to be the minimum set of changes required to allow users to create {{RecordWriter}} subclasses that can insert, update, and delete records. These changes also maintain backwards compatibility, at the expense of confusing the API a little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within our data processing jobs our records are often available in a strongly typed and decoded form, such as a POJO or a Tuple object.
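For illustration only, a merge job might retain each record's identifier alongside its decoded form roughly like this (the {{MutableRecord}} wrapper is hypothetical and not part of the patches; the nested stand-in merely mirrors the shape of Hive's {{RecordIdentifier}}):

```java
// Hypothetical sketch: pair a record's RecordIdentifier with its
// strongly typed (Object) representation so the identifier can later
// be handed back to the streaming API for an update or delete.
public class MutableRecord {

    /** Minimal stand-in mirroring the shape of Hive's RecordIdentifier. */
    public static final class RecordIdentifier {
        public final long transactionId;
        public final int bucketId;
        public final long rowId;

        public RecordIdentifier(long transactionId, int bucketId, long rowId) {
            this.transactionId = transactionId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
    }

    public final RecordIdentifier id; // null for brand-new (inserted) rows
    public final Object row;          // decoded POJO/Tuple, not byte[]

    public MutableRecord(RecordIdentifier id, Object row) {
        this.id = id;
        this.row = row;
    }

    /** Updates and deletes need an identifier; inserts must not have one. */
    public boolean isMutationOfExistingRow() {
        return id != null;
    }
}
```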
> Therefore it seems to make sense that we are able to pass these through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding step. This of course still allows users to use {{byte[]}} if they wish.
> h4. [^HIVE-10165.1.patch] - Nice to have
> This patch builds on the changes made in the *required* patch and aims to make the API cleaner and more consistent while accommodating updates and deletes. It also adds some logic to prevent the user from submitting multiple operation types to a single {{TransactionBatch}}, as we found this creates data inconsistencies within the Hive table. This patch breaks backwards compatibility.
> h4. [^HIVE-10165.2.patch] - Nomenclature
> This final patch simply renames some of the existing types to more accurately convey their increased responsibilities. The API is no longer writing just new records; it is now also responsible for writing operations that are applied to existing records. This patch breaks backwards compatibility.
> h3. Example
> I've attached a simple example of typical usage of the API. This is not a patch and is intended as an illustration only.
> h3. Known issues
> I have not yet provided any unit tests for the extended functionality. I fully expect that these are required and will work on them if these patches have merit.
> *Note: Attachments to follow.*

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)