Re: Adding update/delete to the hive-hcatalog-streaming API

Alan Gates Thu, 26 Mar 2015 14:49:45 -0700

The missing piece for adding update and delete to the streaming API is a primary key. Updates and deletes in SQL work by scanning the table or partition where the record resides. This is assumed to be ok since we are not supporting transactional workloads and thus update/deletes are assumed to be infrequent. But a need to scan for each update or delete will not perform adequately in the streaming case.

I've had a few discussions with others recently who are thinking of adding merge-like functionality, where you would upload all changes to a temp table and then in one scan/transaction apply those changes. This is a common way to handle these situations for data warehouses, and is much easier than adding a primary key concept to Hive.
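
To illustrate the merge idea, here is a minimal sketch in plain Python (not Hive code, and not an existing Hive API). It assumes each record carries a unique key, that the staged changes fit in a dictionary keyed by that key, and that a value of None marks a delete; the name merge_partition and the record layout are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical record layout: (key, value). Keys are assumed unique.
Record = Tuple[str, str]


def merge_partition(
    base: Iterable[Record],
    changes: Dict[str, Optional[str]],
) -> Iterator[Record]:
    """Apply all staged changes in a single pass over the base records.

    changes maps a record key to its new value, or to None for a delete.
    Keys in changes that never appear in base are treated as inserts.
    """
    pending = dict(changes)  # copy so the caller's staging data is untouched
    for key, value in base:
        if key in pending:
            new_value = pending.pop(key)
            if new_value is None:
                continue  # delete: drop the record
            yield key, new_value  # update: emit the replacement
        else:
            yield key, value  # unchanged record passes through
    # Whatever is left was not matched during the scan, so it is new data.
    for key, new_value in pending.items():
        if new_value is not None:
            yield key, new_value


base = [("a", "1"), ("b", "2"), ("c", "3")]
staged = {"b": "20", "c": None, "d": "4"}
merged = list(merge_partition(base, staged))
```

The point of the sketch is that the cost is one scan of the partition per batch of changes, rather than one scan per update or delete.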


Alan.

Elliot West <mailto:tea...@gmail.com>
March 26, 2015 at 14:08
Hi,
I'd like to ascertain if it might be possible to add 'update' and 'delete' operations to the hive-hcatalog-streaming API. I've been looking at the API with interest for the last week as it appears to have the potential to help with some general data processing patterns that are prevalent where I work. Ultimately, we continuously load large amounts of data into Hadoop which is partitioned by some time interval - usually hour, day, or month depending on the data size. However, the records that reside in this data can change. We often receive some new information that mutates part of an existing record already stored in a partition in HDFS. Typically the number of mutations is very small compared to the number of records in each partition.
To handle this currently we re-read and re-write all partitions that could potentially be affected by new data. In practice a single hour's worth of new data can require the reading and writing of 1 month's worth of partitions. By storing the data in a transactional Hive table I believe that we can instead issue updates and deletes for only the affected rows. Although we do use Hive for analytics on this data, much of the processing that generates and consumes the data is performed using Cascading. Therefore I'd like to be able to read and write the data via an API which we'd aim to integrate into a Cascading Tap of some description. Our Cascading processes could determine the new, updated, and deleted records and then use the API to stream these changes to the transactional Hive table.
We have most of this working in a proof of concept, but as hive-hcatalog-streaming does not expose the delete/update methods of the OrcRecordUpdater we've had to hack together something unpleasant based on the original API.
As a first step I'd like to check if there is any appetite for adding such functionality to the API or if this goes against the original motivations of the project? If this suggestion sounds reasonable then I'd be keen to help move this forward.
Thanks - Elliot.
