jessiedanwang opened a new issue, #6188:
URL: https://github.com/apache/iceberg/issues/6188
### Query engine
Spark
### Question
Is it possible to use both the effective date and the expiration date as
partition columns for SCD type 2 dimension data? The dimension dataset is
huge, and we would like to partition it on both dates so that queries can
filter out irrelevant data.
Here is an example:
create table mytable (id bigint autoincrement, name text, city text,
effective date);
insert into mytable values ('Jen', 'Austin', '2017-01-01');
insert into mytable values ('Mike', 'Austin', '2017-07-01');
Upsert mytable values ('Jen', 'Tokyo', '2018-01-01');
Contents of mytable:
id name city effective
1 Jen Tokyo 2018-01-01
2 Mike Austin 2017-07-01
Traditional SCD type 2 state based on the same event stream:
create table mytable_scd2 (id bigint autoincrement, dimid bigint, name text,
city text)
partitioned by (effective bigint, expiration bigint)
Contents of mytable_scd2:
id dimid name city effective expiration
1 1 Ken Austin '2017-01-01' '2018-01-01' <--- this row would change partition when expiration goes from null to a value
2 2 Mark Austin '2017-07-01' null
3 1 Ken Tokyo '2018-01-01' null
Given the example above, will the row (1 1 Ken Austin '2017-01-01'
'2018-01-01') move to a different partition when its expiration date is
updated from null to '2018-01-01'?
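
To make the scenario concrete, here is a minimal sketch in plain Python. It is not Iceberg or Spark code; it only models the example rows. It assumes identity partitioning, so the partition key of a row is simply its (effective, expiration) tuple. The names `Scd2Row`, `partition_key`, and `apply_change` are hypothetical helpers introduced for illustration.

```python
from datetime import date
from typing import List, NamedTuple, Optional, Tuple


class Scd2Row(NamedTuple):
    id: int
    dimid: int
    name: str
    city: str
    effective: date
    expiration: Optional[date]  # None means the row is still current


def partition_key(row: Scd2Row) -> Tuple[date, Optional[date]]:
    # Assumption: identity partitioning, so the key is just the column values.
    return (row.effective, row.expiration)


def apply_change(rows: List[Scd2Row], dimid: int, name: str, city: str,
                 effective: date) -> List[Scd2Row]:
    """Close the current row for dimid (if any) and append a new current row."""
    out = []
    for r in rows:
        if r.dimid == dimid and r.expiration is None:
            # Closing the row sets expiration, which changes its partition key.
            r = r._replace(expiration=effective)
        out.append(r)
    out.append(Scd2Row(len(out) + 1, dimid, name, city, effective, None))
    return out


rows: List[Scd2Row] = []
rows = apply_change(rows, 1, "Ken", "Austin", date(2017, 1, 1))
rows = apply_change(rows, 2, "Mark", "Austin", date(2017, 7, 1))
before = partition_key(rows[0])  # key of row 1 while it is still current

rows = apply_change(rows, 1, "Ken", "Tokyo", date(2018, 1, 1))
after = partition_key(rows[0])   # key of row 1 after it has been closed

print("before:", before)
print("after: ", after)
print("partition changed:", before != after)
```

Under this assumption, the key of row 1 differs before and after the update, so in this model the closed row belongs to a different partition than the one it was first written to.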