aokolnychyi commented on PR #2276:
URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1284715472

   @rdblue @sunchao, I am wondering about storage-partitioned joins if we have 
more than one spec.
   
   Whenever there are multiple specs in a table, we usually compute a common 
partition type, which represents a union of all specs in the table. If a file 
belongs to a partition spec that did not include a particular partition column 
in the common partition type, we would mark it as null.
   
   ```
   Columns: c1, c2, c3, c4, c5
   Spec 1: part1 as identity(c1)
   Spec 2: part1 as identity(c1), part2 as identity(c2)
   
   File A written with spec 1: (part1 = 'Q', part2 = null)
   File B written with spec 2: (part1 = 'Q', part2 = 'W') 
   ```
   
   Now suppose we have a target table that has two partition specs:
   
   ```
   Spec 1: identity(c1),  identity(c2)
   Spec 2: identity(c1), identity(c2), identity(c3)
   ```
   
   And a source table with only one spec:
   
   ```
   Spec 1: `identity(c1)`, `identity(c2)`,  `identity(c3)`
   ```
   
   ```
   - target table -
   File T_A (part1 = A, part2 = A, part3 = null) (c3 values are random) (uses 
old spec with 2 columns)
   File T_B (part1 = A, part2 = A, part3 = 5) (c3 values are all 5) (uses new 
spec with 3 columns)
   File T_C (part1 = A, part2 = A, part3 = null) (c3 values are null) (uses new 
spec with 3 columns)
   
   - source table - 
   File S_A (part1 = A, part2 = A, part3 = 5) (c3 values are all 5) (uses new 
spec with 3 columns)
   File S_B (part1 = A, part2 = A, part3 = null) (c3 values are all null) (uses 
new spec with 3 columns)
   ```
   
   Spark will treat `T_A` and and `S_B` as splits that belong to the same 
partition even though it may not be safe if the join condition includes `t.c3 = 
s.c3`.
   
   It seems we can report `KeyGroupPartitioning` to Spark only on columns that 
were present in all partition specs that are being queried. Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to