[ 
https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-16258:
----------------------------------
    Component/s: Transactions

> Suggesting a non-standard extension to MERGE
> --------------------------------------------
>
>                 Key: HIVE-16258
>                 URL: https://issues.apache.org/jira/browse/HIVE-16258
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Carter Shanklin
>
> Some common data maintenance strategies, especially the Type 2 SCD update, 
> would become substantially easier with a small extension to the SQL standard 
> for MERGE, specifically the ability to say "when matched then insert". Per 
> the standard, matched records can only be updated or deleted.
> In the Type 2 SCD, when a new record comes in you update the old version of 
> the record and insert the new version of the same record. If this extension 
> were supported, sample Type 2 SCD code would look as follows:
> {code}
> merge into customer
> using new_customer_stage stage
> on stage.source_pk = customer.source_pk
> when not matched then insert values    /* Insert a net new record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
> true, null)
> when matched then update set           /* Update an old record to mark it as 
> out-of-date */
>   is_current = false, end_date = current_date()
> when matched then insert values        /* Insert a new current record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
> true, null);
> {code}
> Without this support, the user needs to devise some sort of workaround. A 
> common approach is to first left join the staging table against the table to 
> be updated, then to join these results to a helper table that will spit out 
> two records for each match and one record for each miss. One of the matching 
> records needs to have a join key that can never occur in the source data so 
> this requires precise knowledge of the source dataset.
> An example of this:
> {code}
> merge into customer
> using (
>   select
>     *,
>     coalesce(invalid_key, source_pk) as join_key
>   from (
>     select
>       stage.source_pk, stage.name, stage.state,
>       case when customer.source_pk is null then 1
>       when stage.name <> customer.name or stage.state <> customer.state then 2
>       else 0 end as scd_row_type
>     from
>       new_customer_stage stage
>     left join
>       customer
>     on (stage.source_pk = customer.source_pk and customer.is_current = true)
>   ) updates
>   join scd_types on scd_types.type = scd_row_type
> ) sub
> on sub.join_key = customer.source_pk
> when matched then update set
>   is_current = false,
>   end_date = current_date()
> when not matched then insert values
>   (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, 
> null);
> select * from customer order by source_pk;
> {code}
> This code is very complicated and will fail if the "invalid" key ever shows 
> up in the source dataset. This simple extension provides a lot of value and 
> likely very little maintenance overhead.
> /cc [~ekoifman]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to