Are you saying that when the records arrive you don't know which are
updates and which are inserts, and you're already doing processing to
determine that? If so, this is exactly the case we'd like to hit with
the merge functionality.
If you're already scanning the existing ORC file, obtaining the unique
identifier (which is the triple (rowid, txnid, bucketid)), and
determining which records have changed, then you could certainly change
the streaming interface to enable passing the records through in bulk
(along with operation markers to indicate insert/update/delete) and
write those to a delta file in one pass.
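To make that concrete, a bulk mutation call on the streaming side might
look something like the sketch below. This is purely illustrative:
MutatingTransactionBatch and Operation are hypothetical names and are
not part of hive-hcatalog-streaming; only RecordIdentifier exists today
(in org.apache.hadoop.hive.ql.io).

    import org.apache.hadoop.hive.ql.io.RecordIdentifier;

    /**
     * Hypothetical sketch of a mutation-aware streaming call: the client
     * passes records through in bulk with an operation marker, supplying
     * the (txnid, bucketid, rowid) identifier for updates and deletes,
     * and the writer appends everything to a single delta file within
     * the current transaction.
     */
    public interface MutatingTransactionBatch {

      enum Operation { INSERT, UPDATE, DELETE }

      /** Queue one mutation; recId may be null for an INSERT. */
      void write(Operation op, RecordIdentifier recId, byte[] record) throws Exception;

      /** Flush the queued mutations to the delta file and commit the transaction. */
      void commit() throws Exception;
    }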
Alan.
Elliot West <tea...@gmail.com>
March 26, 2015 at 15:10
Hi, thanks for your quick reply.
I see your point, but in my case would I not have the required
RecordIdentifiers available, given that I'm already reading the entire
partition to determine which records have changed? Admittedly, Hive
will not reveal the ROW__IDs to me, but I assume (perhaps incorrectly)
that these are in the underlying ORC file and can therefore be read by
my Cascading process. In this scenario, where I have the relevant
recIds to hand, why would it then be necessary to perform additional
scans when issuing the deletes/updates?
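For illustration, this is roughly how I had imagined pulling those
identifiers out of the underlying file with the ORC reader API. It is a
sketch only; it assumes the ACID wrapper column names that
OrcRecordUpdater writes (operation, originalTransaction, bucket, rowId,
currentTransaction, row) and takes a hypothetical file path argument.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Reader;
    import org.apache.hadoop.hive.ql.io.orc.RecordReader;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

    public class AcidRowIdScan {
      public static void main(String[] args) throws Exception {
        // args[0]: a base/delta ORC file beneath the table's partition directory.
        Reader reader = OrcFile.createReader(new Path(args[0]),
            OrcFile.readerOptions(new Configuration()));

        // ACID ORC files wrap each row in a struct carrying the record identifier
        // fields (assuming the column names OrcRecordUpdater writes).
        StructObjectInspector oi = (StructObjectInspector) reader.getObjectInspector();
        StructField txnField    = oi.getStructFieldRef("originalTransaction");
        StructField bucketField = oi.getStructFieldRef("bucket");
        StructField rowIdField  = oi.getStructFieldRef("rowId");

        RecordReader rows = reader.rows();
        Object row = null;
        while (rows.hasNext()) {
          row = rows.next(row);
          // This triple is the RecordIdentifier needed to issue an update or delete.
          System.out.println(oi.getStructFieldData(row, txnField) + "/"
              + oi.getStructFieldData(row, bucketField) + "/"
              + oi.getStructFieldData(row, rowIdField));
        }
        rows.close();
      }
    }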
Apologies if I'm misunderstanding something.
Thanks - Elliot.
On Thursday, March 26, 2015, Alan Gates <alanfga...@gmail.com> wrote:
Alan Gates <alanfga...@gmail.com>
March 26, 2015 at 14:48
The missing piece for adding update and delete to the streaming API is
a primary key. Updates and deletes in SQL work by scanning the table
or partition where the record resides. This is assumed to be acceptable
since we are not supporting transactional workloads, and thus updates
and deletes are assumed to be infrequent. But needing a scan for each
update or delete will not perform adequately in the streaming case.
I've had a few discussions with others recently who are thinking of
adding merge-like functionality, where you would upload all changes to
a temp table and then apply those changes in one scan/transaction.
This is a common way to handle these situations in data warehouses,
and is much easier than adding a primary key concept to Hive.
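To sketch the staging-table pattern (illustrative only: the table names,
columns, partition value and JDBC URL below are made up, and the final
statement is just one way to fold staged changes into a partition in a
single scan):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StagedMergeSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 connection.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // 1. Land the incoming changes, tagged with an operation marker, in a staging table.
        stmt.execute("CREATE TABLE staged_changes (op STRING, id BIGINT, payload STRING)");
        // ... bulk load the latest batch of changes into staged_changes ...

        // 2. Apply all staged changes against the affected partition in one scan:
        //    keep unchanged rows, take new values for inserts/updates, drop deletes.
        //    (Depending on Hive's rules about overwriting a partition that is also
        //    read, this may need to go via an intermediate table.)
        stmt.execute(
            "INSERT OVERWRITE TABLE target PARTITION (dt='2015-03-26') "
            + "SELECT COALESCE(c.id, t.id), COALESCE(c.payload, t.payload) "
            + "FROM (SELECT * FROM target WHERE dt='2015-03-26') t "
            + "FULL OUTER JOIN staged_changes c ON t.id = c.id "
            + "WHERE c.op IS NULL OR c.op <> 'DELETE'");

        stmt.close();
        conn.close();
      }
    }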
Alan.
Elliot West <tea...@gmail.com>
March 26, 2015 at 14:08
Hi,
I'd like to ascertain whether it might be possible to add 'update' and
'delete' operations to the hive-hcatalog-streaming API. I've been
looking at the API with interest for the last week, as it appears to
have the potential to help with some general data processing patterns
that are prevalent where I work. Ultimately, we continuously load
large amounts of data into Hadoop, partitioned by some time
interval - usually hour, day, or month depending on the data size.
However, the records that reside in this data can change. We often
receive some new information that mutates part of an existing record
already stored in a partition in HDFS. Typically the number of
mutations is very small compared to the number of records in each
partition.
To handle this, we currently re-read and re-write all partitions that
could potentially be affected by new data. In practice, a single hour's
worth of new data can require the reading and writing of a month's
worth of partitions. By storing the data in a transactional Hive table
I believe that we can instead issue updates and deletes for only the
affected rows. Although we do use Hive for analytics on this data,
much of the processing that generates and consumes the data is
performed using Cascading. Therefore I'd like to be able to read and
write the data via an API which we'd aim to integrate into a Cascading
Tap of some description. Our Cascading processes could determine the
new, updated, and deleted records and then use the API to stream these
changes to the transactional Hive table.
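For reference, the existing API only lets us express inserts; a minimal
usage sketch (the metastore URI, table, partition value and columns
below are made up) looks roughly like this:

    import java.util.Arrays;
    import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
    import org.apache.hive.hcatalog.streaming.HiveEndPoint;
    import org.apache.hive.hcatalog.streaming.StreamingConnection;
    import org.apache.hive.hcatalog.streaming.TransactionBatch;

    public class InsertOnlyStreamingSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical metastore URI, table and partition.
        HiveEndPoint endPoint = new HiveEndPoint(
            "thrift://metastore-host:9083", "default", "target", Arrays.asList("2015-03-26"));
        StreamingConnection conn = endPoint.newConnection(true);

        // Writer that parses delimited records into the table's columns.
        DelimitedInputWriter writer =
            new DelimitedInputWriter(new String[] {"id", "payload"}, ",", endPoint);

        TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
        batch.beginNextTransaction();
        batch.write("1,hello".getBytes());  // inserts only: no operation marker, no ROW__ID
        batch.commit();
        batch.close();
        conn.close();
      }
    }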
We have most of this working in a proof of concept, but as
hive-hcatalog-streaming does not expose the delete/update methods of
the OrcRecordUpdater, we've had to hack together something unpleasant
based on the original API.
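In essence the hack boils down to driving a RecordUpdater ourselves,
along these lines (a sketch only: Change and Kind are hypothetical
holder types, and constructing the RecordUpdater against the
partition's delta directory is exactly the awkward part the streaming
API keeps internal):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hive.ql.io.RecordUpdater;

    /** Sketch of the direct RecordUpdater usage the streaming API does not expose. */
    public class MutationSketch {

      enum Kind { INSERT, UPDATE, DELETE }

      static class Change {
        Kind kind;
        Object row;  // for UPDATE/DELETE the row must carry its original ROW__ID
      }

      static void apply(RecordUpdater updater, long txnId, List<Change> changes)
          throws IOException {
        for (Change c : changes) {
          switch (c.kind) {
            case INSERT: updater.insert(txnId, c.row); break;
            case UPDATE: updater.update(txnId, c.row); break;
            case DELETE: updater.delete(txnId, c.row); break;
          }
        }
        updater.flush();
      }
    }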
As a first step I'd like to check whether there is any appetite for
adding such functionality to the API, or whether this goes against the
original motivations of the project. If this suggestion sounds
reasonable then I'd be keen to help move this forward.
Thanks - Elliot.