viirya commented on PR #55927:
URL: https://github.com/apache/spark/pull/55927#issuecomment-4512408638
Thanks for the quick turnaround @sunchao — items 1 and 2 are addressed
nicely. Two small notes that are *not* merge blockers, just opportunistic
cleanup:
**On item 1**: the new `joinType match` arms are self-explanatory, but the
*why* — "the skew only appears on the preserved side because non-preserved
NULL-keyed rows are filtered out by `=` and never emitted" — is not in the
code. The existing `// Null-safe equality usually rewrites to non-null shuffle
keys...` comment above `canSpreadNullJoinKeys` explains the `<=>` / `NullType`
corner but no longer covers why we look at the preserved side specifically. A
one-line lead-in before the `match` would close the loop. Nit.
**On item 3**: the new comment at `case NullAwareHashPartitioning` is good,
but the invariant I was trying to nail down is one level up — the non-NULL
partition-id has to stay byte-for-byte identical to what
`HashPartitioning.partitionIdExpression` would produce, otherwise the symmetric
`HashShuffleSpec` ↔ `NullAwareHashShuffleSpec` compatibility in
`partitioning.scala` silently misaligns non-NULL rows between the two layouts.
The risk surface is "future change to the hash on either side." A
back-reference would catch it:
```scala
case h: NullAwareHashPartitioning =>
// Non-NULL keys must produce the same partition id as
HashPartitioning.partitionIdExpression
// — this is what makes HashShuffleSpec compatible with
NullAwareHashShuffleSpec when both
// distributions opt in. Keep the hash function in sync with
HashPartitioning.
val joinKeyProjection = ...
```
(Also noticed you removed `NullAwareHashPartitioning.partitionIdExpression`
— good catch, that was indeed dead code.)
Both are nits; LGTM either way.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]