Another solution is to use Hive over HBase. When you insert into such a table, Hive effectively does an upsert, because HBase overwrites the row for an existing key.
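As a minimal sketch of that idea (table and column names here are illustrative, not from the thread), a dimension table can be backed by HBase via the HBase storage handler; re-inserting an existing row key overwrites the old value rather than appending:

```sql
-- Hypothetical HBase-backed dimension table; requires the
-- hive-hbase-handler jar and a running HBase cluster.
CREATE TABLE customer_dim (
  customer_id STRING,   -- mapped to the HBase row key
  name        STRING,
  city        STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:name,cf:city"
)
TBLPROPERTIES ("hbase.table.name" = "customer_dim");

-- Inserting the same customer_id again overwrites the existing
-- HBase row, so this behaves as an upsert rather than an append.
INSERT INTO TABLE customer_dim
SELECT customer_id, name, city FROM customer_staging;
```

Note, as Gopal points out further down the thread, that this gives upsert semantics but not the transactional isolation of Hive ACID.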
2016-09-23 21:00 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> The fundamental question is: do you need these recurring updates to
> dimension tables throttling your Hive tables?
>
> Besides, why bother with ETL when one can do ELT?
>
> For the dimension table, just add two additional columns, namely
>
>   , op_type int
>   , op_time timestamp
>
> with op_type = 1/2/3 (INSERT/UPDATE/DELETE) and op_time = the timestamp
> from Hive for the original table. New records will be appended to the
> dimension table, so you will have the full Entity Life History (one
> INSERT, multiple UPDATEs and one DELETE) for a given primary key. Then
> you can do whatever you want, plus of course a full audit of every
> record (for example what happened to every trade, who changed what,
> etc.).
>
> In your join with the FACT table you will need to use analytic functions
> to find the last entry for a given primary key (ignoring deletes), or
> just use standard HQL.
>
> If you are going to bring HBase etc. into it, then the Spark solution I
> suggested earlier on may serve better.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
> On 23 September 2016 at 19:36, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> > Dimensions change, and I'd rather do update than recreate a snapshot.
>>
>> Slowly changing dimensions are the common use-case for Hive's ACID
>> MERGE.
>>
>> The feature you need is most likely covered by
>> https://issues.apache.org/jira/browse/HIVE-10924
>>
>> 2nd comment from that JIRA:
>>
>> "Once an hour, a set of inserts and updates (up to 500k rows) for
>> various dimension tables (e.g. customer, inventory, stores) needs to be
>> processed. The dimension tables have primary keys and are typically
>> bucketed and sorted on those keys."
>>
>> Any other approach would need a full snapshot re-materialization,
>> because ACID can generate DELETE + INSERT instead of rewriting the
>> original file for a 2% upsert.
>>
>> If you do not have any isolation concerns (as in, a query doing a read
>> when 50% of your update has applied), using HBase-backed dimension
>> tables in Hive is possible, but it does not offer the same
>> transactional consistency that the ACID MERGE will.
>>
>> Cheers,
>> Gopal
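Mich's append-only scheme above needs an analytic function at query time to pick the latest image of each key. A minimal HiveQL sketch, assuming a hypothetical `customer_dim` table with the `op_type`/`op_time` columns he describes:

```sql
-- Latest version of each customer, ignoring keys whose most recent
-- operation was a DELETE (op_type = 3). All table and column names
-- are illustrative, not from the thread.
SELECT customer_id, name, city
FROM (
  SELECT d.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY op_time DESC) AS rn
  FROM customer_dim d
) t
WHERE t.rn = 1          -- keep only the newest row per key
  AND t.op_type <> 3;   -- drop keys that were last deleted
```

This subquery can then be joined to the FACT table in place of the raw dimension table, or materialized as a view.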
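For reference, the ACID MERGE that Gopal points to (tracked in HIVE-10924; it eventually shipped in Hive 2.2, after this thread) expresses the hourly upsert roughly as follows. This is a sketch with hypothetical names; the target must be a bucketed ORC table with ACID enabled:

```sql
-- Target must be transactional, e.g. created with
-- CLUSTERED BY (customer_id) INTO 8 BUCKETS STORED AS ORC
-- TBLPROPERTIES ('transactional'='true').
MERGE INTO customer_dim AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op_type = 3 THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED THEN INSERT
  VALUES (s.customer_id, s.name, s.city);
```

Unlike the HBase or append-only approaches, this applies the whole batch as one transaction, so concurrent readers never see a half-applied update.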