aokolnychyi commented on PR #2276: URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1284715472
@rdblue @sunchao, I am wondering about storage-partitioned joins if we have more than one spec. Whenever there are multiple specs in a table, we usually compute a common partition type, which represents a union of all specs in the table. If a file belongs to a partition spec that did not include a particular partition column in the common partition type, we would mark it as null. ``` Columns: c1, c2, c3, c4, c5 Spec 1: part1 as identity(c1) Spec 2: part1 as identity(c1), part2 as identity(c2) File A written with spec 1: (part1 = 'Q', part2 = null) File B written with spec 2: (part1 = 'Q', part2 = 'W') ``` Now suppose we have a target table that has two partition specs: ``` Spec 1: identity(c1), identity(c2) Spec 2: identity(c1), identity(c2), identity(c3) ``` And a source table with only one spec: ``` Spec 1: `identity(c1)`, `identity(c2)`, `identity(c3)` ``` ``` - target table - File T_A (part1 = A, part2 = A, part3 = null) (c3 values are random) (uses old spec with 2 columns) File T_B (part1 = A, part2 = A, part3 = 5) (c3 values are all 5) (uses new spec with 3 columns) File T_C (part1 = A, part2 = A, part3 = null) (c3 values are null) (uses new spec with 3 columns) - source table - File S_A (part1 = A, part2 = A, part3 = 5) (c3 values are all 5) (uses new spec with 3 columns) File S_B (part1 = A, part2 = A, part3 = null) (c3 values are all null) (uses new spec with 3 columns) ``` Spark will treat `T_A` and and `S_B` as splits that belong to the same partition even though it may not be safe if the join condition includes `t.c3 = s.c3`. It seems we can report `KeyGroupPartitioning` to Spark only on columns that were present in all partition specs that are being queried. Any thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
