lenmom opened a new issue #545: surrogate key gerneration requested.
URL: https://github.com/apache/incubator-doris/issues/545
 
 
   **scenario description:**
   supose we have a situation in **dim_sales_person(employee_id int , 
department int,...)**, and the fact table **fact_sales(order_id int, 
sales_person_id, department_id int,...)**.
   
   in case we calculate the detapartment sales amount based on the fact table 
**fact_sales**, if the sales person have changed department and we have no 
surrogate key gerneration mechanism,so all the sales metric for the sales 
person would be calculated to the persion's current department, and this is 
incorrect.
   
   a common solution for this scenario is to generate a new record for sales 
person who has changed department, and generate new record and insert to the 
**dim_sales_person** table, and the fact table would reference the new 
generated record's primary key. and set the previous record expire time to the 
day when the person's department has changed. the table schema look as follows:
   
   CREATE TABLE dim_sales_person
   (
      sales_person_sk           int   ,
      employee_id          int ,
      department          varchar(128),
      .
      .
      .
      effective_date       date,
      expiry_date          date
   )
   
   this is called slow changed dimention type 2,  and we should record the 
changing attibute history.
   
   In hive, it has already support such feture, and the solution looks like as 
follows:
   
   -- 将有地址变化的插入到dim_customer表,如果有相同数据存在有不过期的数据则不插入
   INSERT INTO dw_sales.dim_customer
   SELECT **row_number() over (ORDER BY t1.customer_number) + t2.sk_max**,
        t1.customer_number,
        t1.customer_name,
        t1.customer_street_address,
        t1.customer_zip_code,
        t1.customer_city,
        t1.customer_state,
        t1.version,
        t1.effective_date,
        t1.expiry_date
   FROM(SELECT
            t2.customer_number customer_number,
            t2.customer_name customer_name,
            t2.customer_street_address customer_street_address,
            t2.customer_zip_code,
            t2.customer_city,
            t2.customer_state,
            t1.version + 1 `version`,
            ${hivevar:pre_date} effective_date,
            ${hivevar:max_date} expiry_date
       FROM dw_sales.dim_customer t1
       INNER JOIN ods_sales.ods_customer t2 ON t1.customer_number = 
t2.customer_number
                    AND t1.expiry_date = ${hivevar:pre_date}
       LEFT JOIN dw_sales.dim_customer t3 ON t1.customer_number = 
t3.customer_number
                        AND t3.expiry_date = ${hivevar:max_date}
       WHERE t1.customer_street_address <> t2.customer_street_address 
                AND t3.customer_sk IS NULL
   ) t1
   CROSS JOIN
   (**SELECT COALESCE(MAX(customer_sk),0) sk_max FROM dw_sales.dim_customer**) 
t2;
   
   the key implementation is:
   **row_number() over (ORDER BY t1.customer_number)** + **t2.sk_max**
   
   via such solution, hive can handle surrogate key gerneration.
   
   **In summery:**
   because slow change dimention is an important feature in dataware house 
modeling, so it's suggested to support surrogate key gerneration in doris.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to