Hi,

> It doesn't help if you need concurrent threads writing to a table but we are
> just using the row_number analytic and a max value subquery to generate
> sequences on our star schema warehouse.
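For anyone following along, the row_number + max-value-subquery pattern described above can be sketched in HiveQL roughly like this (the table and column names are invented for illustration, not taken from the original setup):

```sql
-- Hypothetical staging/fact tables; names are illustrative only.
-- New rows get ROW_NUMBER() offset by the current maximum key.
INSERT INTO TABLE sales_fact
SELECT
  ROW_NUMBER() OVER ()           -- dense numbering; empty OVER() forces one reducer
    + COALESCE(mx.max_sk, 0)     -- offset past the highest existing key
    AS sale_sk,
  s.item_id,
  s.sale_date,
  s.amount
FROM staging_sales s
CROSS JOIN (SELECT MAX(sale_sk) AS max_sk FROM sales_fact) mx;
```

The empty OVER() clause here is what funnels every row through a single reducer, which is the scaling limit discussed in this thread.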
Yup, you're right - row_number doesn't help with concurrent writes, and it doesn't even scale beyond a single CPU core. Because it produces a dense sequence, the OVER() clause forces a single-reducer chokepoint, which shows up as a significant performance problem as the total number of rows grows. That said, it is good enough for a one-off run over a very small number of rows.

> It has worked pretty well so far. To provide true sequence support would
> require changes on the hive meta database side as well as locking so nothing
> has been done on it in a long time

Since we added a sequence + locking for Hive ACID, there's a Surrogate Key prototype (for Hive 3.0) which uses the Txn manager sequence to seed concurrent sequence numbering. This should work for any format (Parquet, for instance), because all data formats can be made transactional in Hive 3 (though update/delete remains to be implemented for Parquet).

Here's a video describing that, from one of the recent meetups discussing Hive 3.x (around the 56-minute mark):

https://www.youtube.com/watch?v=b_KUVK7Lq14#t=56m

This is not an auto_increment key, but the numbering is forward-ordered. It is mostly useful for turning a complex join type into a simple equi-join (also a PK-FK join, which the CBO can optimize), when offloading an OLTP system like Oracle into a historical Hive EDW system.

So, if you have the same business key with historical information (say a UPC with a different price depending on day of sale and store location, or an EUR:USD conversion rate by day), you can generate a surrogate key for each change and turn a query which has

  item_id = item_id
  and sale_date between start_price_date and end_price_date
  and sale_location = price_state

into a simple = query on item_price_sk. The dimension table does not grow in size due to this transformation, but it acquires a primary key column which is auto-generated on INSERT.

Cheers,
Gopal
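P.S. A rough sketch of that join rewrite, using the column names from above plus invented table names (sales, item_price) for illustration:

```sql
-- Before: range join against the slowly-changing price dimension.
SELECT s.*, p.price
FROM sales s
JOIN item_price p
  ON s.item_id = p.item_id
 AND s.sale_date BETWEEN p.start_price_date AND p.end_price_date
 AND s.sale_location = p.price_state;

-- After: the fact row carries the surrogate key captured at load time,
-- so the lookup collapses to a PK-FK equi-join the CBO can optimize.
SELECT s.*, p.price
FROM sales s
JOIN item_price p
  ON s.item_price_sk = p.item_price_sk;
```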