Hi,

> It doesn't help if you need concurrent threads writing to a table but we are
> just using the row_number analytic and a max value subquery to generate
> sequences on our star schema warehouse.
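For anyone following along, the row_number + max-value-subquery pattern described above can be sketched in HiveQL roughly like this (the table and column names are invented for illustration, not taken from the original setup):

```sql
-- Hypothetical staging/fact tables; names are illustrative only.
-- New rows get ROW_NUMBER() offset by the current maximum key.
INSERT INTO TABLE sales_fact
SELECT
  ROW_NUMBER() OVER ()           -- dense numbering; empty OVER() forces one reducer
    + COALESCE(mx.max_sk, 0)     -- offset past the highest existing key
    AS sale_sk,
  s.item_id,
  s.sale_date,
  s.amount
FROM staging_sales s
CROSS JOIN (SELECT MAX(sale_sk) AS max_sk FROM sales_fact) mx;
```

The empty OVER() clause here is what funnels every row through a single reducer, which is the scaling limit discussed in this thread.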
Yup, you're right - row_number doesn't help with concurrent writes, and it doesn't even scale beyond a single CPU core. Because it produces a dense sequence, the OVER() clause forces a single-reducer chokepoint, which shows up as a significant performance problem as the total number of rows grows. That said, it is good enough for a one-off run over a very small number of rows.

> It has worked pretty well so far. To provide true sequence support would
> require changes on the hive meta database side as well as locking so nothing
> has been done on it in a long time

Since we added a sequence + locking for Hive ACID, there's a Surrogate Key prototype (for Hive 3.0) which uses the Txn manager sequence to seed concurrent sequence numbering. This should work for any format (Parquet, for instance), because all data formats can be made transactional in Hive 3 (though update/delete remains to be implemented for Parquet).

Here's a video describing that, from one of the recent meetups discussing Hive 3.x (around the 56-minute mark):

https://www.youtube.com/watch?v=b_KUVK7Lq14#t=56m

This is not an auto_increment key, but the numbering is forward-ordered. It is mostly useful for turning a complex join type into a simple equi-join (also a PK-FK join, which the CBO can optimize), when offloading an OLTP system like Oracle into a historical Hive EDW system.

So, if you have the same business key with historical information (say a UPC with a different price depending on day of sale and store location, or an EUR:USD conversion rate by day), you can generate a surrogate key for each change and turn a query which has

  item_id = item_id
  and sale_date between start_price_date and end_price_date
  and sale_location = price_state

into a simple = query on item_price_sk. The dimension table does not grow in size due to this transformation, but it acquires a primary key column which is auto-generated on INSERT.

Cheers,
Gopal
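P.S. A rough sketch of that join rewrite, using the column names from above plus invented table names (sales, item_price) for illustration:

```sql
-- Before: range join against the slowly-changing price dimension.
SELECT s.*, p.price
FROM sales s
JOIN item_price p
  ON s.item_id = p.item_id
 AND s.sale_date BETWEEN p.start_price_date AND p.end_price_date
 AND s.sale_location = p.price_state;

-- After: the fact row carries the surrogate key captured at load time,
-- so the lookup collapses to a PK-FK equi-join the CBO can optimize.
SELECT s.*, p.price
FROM sales s
JOIN item_price p
  ON s.item_price_sk = p.item_price_sk;
```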