Re: Adding update/delete to the hive-hcatalog-streaming API

2015-04-01 Thread Elliot West
Hi Alan,

Regarding the streaming changes, I've raised an issue and submitted patches
here: https://issues.apache.org/jira/browse/HIVE-10165

Thanks - Elliot.


RE: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Mich Talebzadeh
Thanks for that Elliot.

As a matter of interest, what is the source of the data in this case? Is the data delivered periodically, including new rows and deltas?

Cheers,

Mich Talebzadeh

 




Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Alan Gates




Elliot West 
March 26, 2015 at 15:58
Hi Alan,

Yes, this is precisely our situation. The issues I'm having with the 
current API are that I cannot intercept the creation of the 
OrcRecordUpdater to set the recordIdColumn in 
the AcidOutputFormat.Options instance. Additionally, I cannot extend 
the TransactionBatch interface to expose further operations and 
instead have to encode the operation type in the row parameter and 
decode it in the RecordWriter implementation - possible but not very 
elegant.

Yes, you'd have to make changes in the API itself to make this possible,
which is fine.  I'm happy to review patches for this.


Regarding the merge functionality, is this a new suggested feature of 
Hive? Is there a JIRA I could track?

Not yet, it's just discussions I've had with users like you who'd be 
interested in merge, though from the SQL side rather than streaming.  I 
am not aware of anyone working on it at the moment.


Alan.



Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi Mich,

Yes, we have a timestamp on each record. Our processes effectively group by
a key and order by timestamp.

Cheers - Elliot.
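
A rough sketch of the change detection described above: group incoming records by a business key, keep the latest version of each key by timestamp, then compare against the current snapshot of the partition. The record shape and classification rules are assumptions for illustration only; deletes would additionally need an explicit tombstone marker in the feed.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: classify an incoming batch against the current
// snapshot of a partition by grouping on a business key and keeping the
// latest version of each key by timestamp.
public class ChangeClassifier {

    // Minimal record shape assumed for the example.
    public static final class Record {
        final String key;       // business key
        final long timestamp;   // event time used to pick the latest version
        final String payload;   // remaining columns, flattened for brevity
        Record(String key, long timestamp, String payload) {
            this.key = key;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    public enum Op { INSERT, UPDATE, DELETE }

    // Group incoming records by key, keeping the one with the highest timestamp.
    static Map<String, Record> latestByKey(List<Record> incoming) {
        Map<String, Record> latest = new HashMap<>();
        for (Record r : incoming) {
            latest.merge(r.key, r, (existing, candidate) ->
                    existing.timestamp >= candidate.timestamp ? existing : candidate);
        }
        return latest;
    }

    // Compare the deduplicated batch against the existing snapshot of the partition.
    // DELETE is not produced here; it would need an explicit tombstone in the feed.
    static Map<Record, Op> classify(Map<String, Record> snapshot, List<Record> incoming) {
        Map<Record, Op> ops = new LinkedHashMap<>();
        for (Record r : latestByKey(incoming).values()) {
            Record existing = snapshot.get(r.key);
            if (existing == null) {
                ops.put(r, Op.INSERT);                      // key never seen before
            } else if (!existing.payload.equals(r.payload)) {
                ops.put(r, Op.UPDATE);                      // key exists, content changed
            }
            // identical payloads: no operation needed
        }
        return ops;
    }
}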


Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi Alan,

Yes, this is precisely our situation. The issues I'm having with the
current API are that I cannot intercept the creation of the
OrcRecordUpdater to set the recordIdColumn in the AcidOutputFormat.Options
instance. Additionally, I cannot extend the TransactionBatch interface to
expose further operations and instead have to encode the operation type in
the row parameter and decode it in the RecordWriter implementation -
possible but not very elegant.

Regarding the merge functionality, is this a new suggested feature of Hive?
Is there a JIRA I could track?

Thanks - Elliot.
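
For what it's worth, the workaround described above (encoding the operation in the row bytes and decoding it inside a custom RecordWriter implementation) might look roughly like the sketch below. The marker scheme and class names are purely illustrative, the wiring into an actual RecordWriter is omitted, and nothing here is part of the real hive-hcatalog-streaming API.

import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "encode the operation in the row" workaround:
// the client prefixes each serialized row with a one-byte operation marker,
// and the custom RecordWriter strips it off and dispatches accordingly.
public final class OpEncodedRow {

    public enum Op {
        INSERT((byte) 'I'), UPDATE((byte) 'U'), DELETE((byte) 'D');
        final byte marker;
        Op(byte marker) { this.marker = marker; }
    }

    // Client side: prepend the marker to the serialized row.
    public static byte[] encode(Op op, byte[] row) {
        byte[] out = new byte[row.length + 1];
        out[0] = op.marker;
        System.arraycopy(row, 0, out, 1, row.length);
        return out;
    }

    // Writer side: recover the operation from the first byte.
    public static Op decodeOp(byte[] encoded) {
        switch (encoded[0]) {
            case 'I': return Op.INSERT;
            case 'U': return Op.UPDATE;
            case 'D': return Op.DELETE;
            default:  throw new IllegalArgumentException(
                          "Unknown op marker: " + (char) encoded[0]);
        }
    }

    // Writer side: recover the original row bytes.
    public static byte[] decodeRow(byte[] encoded) {
        byte[] row = new byte[encoded.length - 1];
        System.arraycopy(encoded, 1, row, 0, row.length);
        return row;
    }

    // Example: an update to a delimited row.
    public static void main(String[] args) {
        byte[] row = "42,bob,2015-03-26".getBytes(StandardCharsets.UTF_8);
        byte[] wire = encode(Op.UPDATE, row);
        System.out.println(decodeOp(wire) + " -> "
                + new String(decodeRow(wire), StandardCharsets.UTF_8));
    }
}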




RE: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Mich Talebzadeh
Have you seen this article? It looks a bit dated, though.

Adding ACID to Apache Hive:
http://hortonworks.com/blog/adding-acid-to-apache-hive/

HTH,

Mich Talebzadeh

 




Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Alan Gates
Are you saying that when the records arrive you don't know updates from 
inserts and you're already doing processing to determine that?  If so, 
this is exactly the case we'd like to hit with the merge functionality.


If you're already scanning the existing ORC file and obtaining the 
unique identifier (which is the triple (rowid, txnid, bucketid)) and 
determining which records have changed then you could certainly change 
the streaming interface to enable passing the records through  in bulk 
(along with operation markers to indicate insert/update/delete) and 
write those in a delta file in one pass.


Alan.
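
To make the suggestion above concrete, a hypothetical shape for such an extended streaming interface is sketched below: records passed in bulk with an operation marker, and existing rows addressed by the (rowid, txnid, bucketid) triple. None of these types exist in hive-hcatalog-streaming, and the eventual design under HIVE-10165 may well differ.

import java.util.Collection;

// Hypothetical sketch of the interface change described above: a transaction
// batch that accepts records in bulk together with an operation marker, using
// the (rowId, transactionId, bucketId) triple to address existing rows.
// None of these types are the actual hive-hcatalog-streaming API.
public interface MutatingTransactionBatch {

    enum Operation { INSERT, UPDATE, DELETE }

    // Immutable address of an existing row in an ACID table.
    final class RowRef {
        public final long transactionId;  // transaction that created the row
        public final int bucketId;        // bucket the row was written to
        public final long rowId;          // row id within that transaction/bucket
        public RowRef(long transactionId, int bucketId, long rowId) {
            this.transactionId = transactionId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
    }

    // One change event: inserts carry only a row; updates and deletes carry a RowRef.
    final class Mutation {
        public final Operation op;
        public final RowRef ref;    // null for INSERT
        public final byte[] row;    // null for DELETE
        public Mutation(Operation op, RowRef ref, byte[] row) {
            this.op = op;
            this.ref = ref;
            this.row = row;
        }
    }

    void beginNextTransaction() throws Exception;

    // Apply a bulk of mutations within the current transaction, writing them
    // to a single delta file in one pass.
    void apply(Collection<Mutation> mutations) throws Exception;

    void commit() throws Exception;

    void abort() throws Exception;
}

An implementation would route each Mutation to the corresponding insert/update/delete call on the underlying record updater, which is exactly the hook the current API does not expose.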



RE: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Mich Talebzadeh
Hi Elliot,

How do you determine that a record in a partition has changed? Are you relying on a timestamp or something like that?

Thanks,

Mich Talebzadeh

 



Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi, thanks for your quick reply.

I see your point, but in my case would I not have the required
RecordIdentifiers available as I'm already reading the entire partition to
determine which records have changed?  Admittedly Hive will not reveal
the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
underlying ORCFile and therefore can be read by my Cascading process. In
this scenario (where I have the relevant recIds to hand) why would it then
be necessary to perform additional scans when issuing the deletes/updates?

Apologies if I'm misunderstanding something.

Thanks - Elliot.
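
For reference, the ACID ORC files do appear to carry this information alongside each row (it is what backs ROW__ID). A sketch of that per-row metadata is below; the field names and operation codes are illustrative and worth verifying against the Hive version in use, and no real reader API is shown.

// Sketch of the per-row metadata that Hive's ACID ORC layout stores alongside
// each row (the fields behind ROW__ID). Field names, types, and operation
// codes here are illustrative, not a binding on-disk specification.
public final class AcidRowMeta {
    public final int operation;            // e.g. 0 = insert, 1 = update, 2 = delete
    public final long originalTransaction; // transaction that first wrote the row
    public final int bucket;               // bucket id
    public final long rowId;               // row id within (originalTransaction, bucket)
    public final long currentTransaction;  // transaction that wrote this version

    public AcidRowMeta(int operation, long originalTransaction, int bucket,
                       long rowId, long currentTransaction) {
        this.operation = operation;
        this.originalTransaction = originalTransaction;
        this.bucket = bucket;
        this.rowId = rowId;
        this.currentTransaction = currentTransaction;
    }

    // The triple Alan refers to: enough to address a row for an update or delete.
    public String rowRef() {
        return originalTransaction + ":" + bucket + ":" + rowId;
    }
}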



RE: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Mich Talebzadeh
Hi,

The primary key is required for updates/deletes to uniquely identify the record that needs to be updated; otherwise you are going to have unpredictable results, as you may update or delete too many rows.

This is indeed a requirement for real-time delivery of data to data warehouses (DW) such as Sybase IQ using a replication server. The replicate database (in this case the DW) does not need to have that primary key, but the replication definition for the end table will have to have a primary key or unique index.

If the idea is to get data from an RDBMS (read: an ACID-compliant database) to Hive in real time, I would prefer this unique key to be built into the replication definition. Hive does not have to have a unique key, as any data getting to Hive from an RDBMS is expected to be transactional and there is no need to do a sanity check in Hive. What you are mentioning as a temp table is (if I gather correctly) referred to as staging in DW terms. However, there are now requirements for DWs to receive replicated data from transactional databases through SAP Replication Server or Oracle GoldenGate.

HTH,

Mich Talebzadeh

 




Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Alan Gates
The missing piece for adding update and delete to the streaming API is a 
primary key.  Updates and deletes in SQL work by scanning the table or 
partition where the record resides.  This is assumed to be ok since we 
are not supporting transactional workloads and thus update/deletes are 
assumed to be infrequent.  But a need to scan for each update or delete 
will not perform adequately in the streaming case.


I've had a few discussions with others recently who are thinking of 
adding merge like functionality, where you would upload all changes to a 
temp table and then in one scan/transaction apply those changes.  This 
is a common way to handle these situations for data warehouses, and is 
much easier than adding a primary key concept to Hive.


Alan.
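
A sketch of the staging-table pattern described above, expressed as HiveQL submitted over JDBC: land the change records in a staging table, then reconcile them against the target in a single pass. The connection details, table names, columns, and join key are all illustrative, and since Hive has no MERGE statement at this point the reconciliation is spelled out with an outer join.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of the staging-table merge pattern: land the change
// records in a staging table, then reconcile them against the target in one
// pass. Requires the Hive JDBC driver on the classpath; the connection URL,
// tables, columns, and join key are illustrative only.
public class StagingMergeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // 1. Land the latest batch of change records (with an 'op' column) in staging.
            stmt.execute("LOAD DATA INPATH '/staging/customer_changes' "
                       + "OVERWRITE INTO TABLE customer_changes");

            // 2. Reconcile in a single pass: keep untouched rows, take the staged
            //    version of changed rows, drop staged deletes. The result is written
            //    to a fresh table that would then be swapped in for the target.
            stmt.execute(
                  "INSERT OVERWRITE TABLE customer_reconciled "
                + "SELECT id, name, updated_at FROM ( "
                + "  SELECT t.id, t.name, t.updated_at "
                + "  FROM customer t "
                + "  LEFT OUTER JOIN customer_changes c ON t.id = c.id "
                + "  WHERE c.id IS NULL "
                + "  UNION ALL "
                + "  SELECT c.id, c.name, c.updated_at "
                + "  FROM customer_changes c WHERE c.op <> 'D' "
                + ") merged");
        }
    }
}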


Elliot West 
March 26, 2015 at 14:08
Hi,

I'd like to ascertain if it might be possible to add 'update' and 
'delete' operations to the hive-hcatalog-streaming API. I've been 
looking at the API with interest for the last week as it appears to 
have the potential to help with some general data processing patterns 
that are prevalent where I work. Ultimately, we continuously load 
large amounts of data into Hadoop which is partitioned by some time 
interval - usually hour, day, or month depending on the data size. 
However, the records that reside in this data can change. We often 
receive some new information that mutates part of an existing record 
already stored in a partition in HDFS. Typically the amount of 
mutations is very small compared to the number of records in each 
partition.


To handle this currently we re-read and re-write all partitions that 
could potentially be affected by new data. In practice a single hour's 
worth of new data can require the reading and writing of 1 month's 
worth of partitions. By storing the data in a transactional Hive table 
I believe that we can instead issue updates and deletes for only the 
affected rows. Although we do use Hive for analytics on this data, 
much of the processing that generates and consumes the data is 
performed using Cascading. Therefore I'd like to be able to read and 
write the data via an API which we'd aim to integrate into a Cascading 
Tap of some description. Our Cascading processes could determine the 
new, updated, and deleted records and then use the API to stream these 
changes to the transactional Hive table.


We have most of this working in a proof of concept, but as 
hive-hcatalog-streaming does not expose the delete/update methods of 
the OrcRecordUpdater we've had to hack together something unpleasant 
based on the original API.


As a first step I'd like to check if there is any appetite for adding 
such functionality to the API or if this goes against the original 
motivations of the project? If this suggestion sounds reasonable then 
I'd be keen to help move this forward.


Thanks - Elliot.
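
If an interface along the lines sketched earlier existed, the Cascading-side integration described above might drive it roughly as follows. This builds on the hypothetical MutatingTransactionBatch sketch and is not real API.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical end-to-end sketch of how a Cascading-side process might drive a
// mutation-capable streaming API, building on the MutatingTransactionBatch
// sketch shown earlier on this page. Nothing here is a real
// hive-hcatalog-streaming class.
public class StreamMutationsSketch {

    // Apply one batch of changes inside a single transaction.
    public static void streamChanges(MutatingTransactionBatch batch,
                                     List<MutatingTransactionBatch.Mutation> changes)
            throws Exception {
        batch.beginNextTransaction();
        try {
            batch.apply(changes);   // all inserts/updates/deletes land in one delta file
            batch.commit();
        } catch (Exception e) {
            batch.abort();          // discard the whole batch on failure
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        List<MutatingTransactionBatch.Mutation> changes = new ArrayList<>();

        // An insert carries only the new row...
        changes.add(new MutatingTransactionBatch.Mutation(
                MutatingTransactionBatch.Operation.INSERT, null,
                "43,carol,2015-03-26".getBytes(StandardCharsets.UTF_8)));

        // ...while an update addresses an existing row by its (txnid, bucketid, rowid) triple.
        changes.add(new MutatingTransactionBatch.Mutation(
                MutatingTransactionBatch.Operation.UPDATE,
                new MutatingTransactionBatch.RowRef(17L, 3, 1024L),
                "42,bob,2015-03-26".getBytes(StandardCharsets.UTF_8)));

        // streamChanges(batch, changes);  // 'batch' would come from the hypothetical API
    }
}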