Another solution is to use Hive over HBase. When you insert into such a table, Hive effectively does an upsert, because HBase overwrites the row for an existing key.
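As a minimal sketch of that idea (table and column names here are illustrative, not from the thread), a dimension table can be backed by HBase via the HBase storage handler; re-inserting an existing row key overwrites the old value rather than appending:

```sql
-- Hypothetical HBase-backed dimension table; requires the
-- hive-hbase-handler jar and a running HBase cluster.
CREATE TABLE customer_dim (
  customer_id STRING,   -- mapped to the HBase row key
  name        STRING,
  city        STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:name,cf:city"
)
TBLPROPERTIES ("hbase.table.name" = "customer_dim");

-- Inserting the same customer_id again overwrites the existing
-- HBase row, so this behaves as an upsert rather than an append.
INSERT INTO TABLE customer_dim
SELECT customer_id, name, city FROM customer_staging;
```

Note, as Gopal points out further down the thread, that this gives upsert semantics but not the transactional isolation of Hive ACID.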
2016-09-23 21:00 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> The fundamental question is: do you need these recurring updates to
> dimension tables throttling your Hive tables?
>
> Besides, why bother with ETL when one can do ELT?
>
> For the dimension table, just add two additional columns, namely
>
>   , op_type int
>   , op_time timestamp
>
> with op_type = 1/2/3 (INSERT/UPDATE/DELETE) and op_time = the timestamp
> from Hive for the original table. New records will be appended to the
> dimension table, so you will have the full Entity Life History (one
> INSERT, multiple UPDATEs and one DELETE) for a given primary key. Then
> you can do whatever you want, plus of course a full audit of every
> record (for example what happened to every trade, who changed what,
> etc.).
>
> In your join with the FACT table you will need to use analytic functions
> to find the last entry for a given primary key (ignoring deletes), or
> just use standard HQL.
>
> If you are going to bring HBase etc. into it, then the Spark solution I
> suggested earlier on may serve better.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
> On 23 September 2016 at 19:36, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> > Dimensions change, and I'd rather do update than recreate a snapshot.
>>
>> Slowly changing dimensions are the common use-case for Hive's ACID
>> MERGE.
>>
>> The feature you need is most likely covered by
>> https://issues.apache.org/jira/browse/HIVE-10924
>>
>> 2nd comment from that JIRA:
>>
>> "Once an hour, a set of inserts and updates (up to 500k rows) for
>> various dimension tables (e.g. customer, inventory, stores) needs to be
>> processed. The dimension tables have primary keys and are typically
>> bucketed and sorted on those keys."
>>
>> Any other approach would need a full snapshot re-materialization,
>> because ACID can generate DELETE + INSERT instead of rewriting the
>> original file for a 2% upsert.
>>
>> If you do not have any isolation concerns (as in, a query doing a read
>> when 50% of your update has applied), using HBase-backed dimension
>> tables in Hive is possible, but it does not offer the same
>> transactional consistency that the ACID MERGE will.
>>
>> Cheers,
>> Gopal
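Mich's append-only scheme above needs an analytic function at query time to pick the latest image of each key. A minimal HiveQL sketch, assuming a hypothetical `customer_dim` table with the `op_type`/`op_time` columns he describes:

```sql
-- Latest version of each customer, ignoring keys whose most recent
-- operation was a DELETE (op_type = 3). All table and column names
-- are illustrative, not from the thread.
SELECT customer_id, name, city
FROM (
  SELECT d.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY op_time DESC) AS rn
  FROM customer_dim d
) t
WHERE t.rn = 1          -- keep only the newest row per key
  AND t.op_type <> 3;   -- drop keys that were last deleted
```

This subquery can then be joined to the FACT table in place of the raw dimension table, or materialized as a view.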
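For reference, the ACID MERGE that Gopal points to (tracked in HIVE-10924; it eventually shipped in Hive 2.2, after this thread) expresses the hourly upsert roughly as follows. This is a sketch with hypothetical names; the target must be a bucketed ORC table with ACID enabled:

```sql
-- Target must be transactional, e.g. created with
-- CLUSTERED BY (customer_id) INTO 8 BUCKETS STORED AS ORC
-- TBLPROPERTIES ('transactional'='true').
MERGE INTO customer_dim AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op_type = 3 THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED THEN INSERT
  VALUES (s.customer_id, s.name, s.city);
```

Unlike the HBase or append-only approaches, this applies the whole batch as one transaction, so concurrent readers never see a half-applied update.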