peter-toth commented on code in PR #55927:
URL: https://github.com/apache/spark/pull/55927#discussion_r3260177280
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledJoin.scala:
##########
@@ -28,6 +28,21 @@ import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Dist
 trait ShuffledJoin extends JoinCodegenSupport {
   def isSkewJoin: Boolean
+  private def containsNullSafeJoinMarker(keys: Seq[Expression]): Boolean = {
+    keys.exists(_.exists(_.isInstanceOf[IsNull]))
+  }
+
+  private lazy val canSpreadNullJoinKeys: Boolean = {
Review Comment:
Without spreading, all NullType <=> keys hash to the same value
(Murmur3Hash(null) is deterministic), so all NULL rows are colocated on one
reducer.
The executor then runs:
- SortMergeJoinExec.scala:1116: while (advancedStreamed() &&
streamedRowKey.anyNull), which skips every NULL-keyed streamed row.
- SortMergeJoinExec.scala:1529: in a full outer join, leftRowKey.anyNull
triggers padding emission, never a match.
So even with NULL rows colocated, the executor's anyNull guard prevents
NULL=NULL from matching. The <=> semantics the user wanted (NULL matches NULL)
are never delivered for NullType. The rewrite was supposed to convert NULLs
to non-null sentinels so that the executor's guard wouldn't fire, but for
NullType the sentinel itself is NULL, so the guard fires anyway and the join
produces only padding (full outer) or nothing (inner).
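
To make the guard's effect concrete, here is a minimal, Spark-free sketch of a single reducer. It is only a model under stated assumptions: `Row`, `Joined`, `innerJoin`, and `fullOuterJoin` are hypothetical names, `key = None` stands in for a NULL join key, and the nested loops are a stand-in for the sort-merge scan, not the actual SortMergeJoinExec code.

```scala
object AnyNullGuardSketch {
  // Hypothetical row model: `key = None` stands in for a NULL join key.
  final case class Row(id: String, key: Option[Int])

  // A joined output row; a missing side represents outer-join padding.
  final case class Joined(left: Option[Row], right: Option[Row])

  // Models the executor-side check: a row with a NULL key never matches.
  def anyNull(row: Row): Boolean = row.key.isEmpty

  // Inner join on one reducer: NULL-keyed rows on either side are skipped.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[Joined] =
    for {
      l <- left if !anyNull(l)
      r <- right if !anyNull(r) && l.key == r.key
    } yield Joined(Some(l), Some(r))

  // Full outer join: every unmatched row, including every NULL-keyed row,
  // is emitted once with the other side padded.
  def fullOuterJoin(left: Seq[Row], right: Seq[Row]): Seq[Joined] = {
    val matched = innerJoin(left, right)
    val matchedLeft = matched.flatMap(_.left).toSet
    val matchedRight = matched.flatMap(_.right).toSet
    val leftPadding = left.filterNot(matchedLeft).map(l => Joined(Some(l), None))
    val rightPadding = right.filterNot(matchedRight).map(r => Joined(None, Some(r)))
    matched ++ leftPadding ++ rightPadding
  }
}
```

In this model, colocating all NULL-keyed rows on one reducer changes nothing: the inner join emits no row for them, and the full outer join emits one padded row per NULL-keyed input row.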
With spreading, NULL rows scatter across reducers. Each reducer's executor
sees some NULL rows from both sides, and the anyNull guard fires in the same
way: same padding emission, same lack of matching.
The output is identical with or without spreading: both produce the
broken-but-self-consistent "NULL=NULL doesn't match" behavior for NullType.
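
The equivalence argued above can be checked with a small, Spark-free simulation. This is a sketch under assumptions, not the PR's implementation: `colocate`, `spread`, `joinPartition`, and `fullOuterJoin` are hypothetical names, `key = None` models a NULL key, the row id is used as an assumed salt for spreading, and non-NULL keys are always partitioned by key so that matching rows still meet.

```scala
object NullSpreadingSketch {
  // Hypothetical row model: `key = None` stands in for a NULL join key.
  final case class Row(id: Int, key: Option[Int])

  val numReducers = 4

  // Without spreading: every NULL key lands on the same reducer.
  def colocate(row: Row): Int =
    row.key.fold(0)(k => Math.floorMod(k, numReducers))

  // With spreading: NULL-keyed rows are scattered by row id (an assumed salt).
  def spread(row: Row): Int =
    row.key.fold(Math.floorMod(row.id, numReducers))(k => Math.floorMod(k, numReducers))

  // Per-reducer full outer join with an anyNull-style guard.
  // Output pairs are (left id, right id); None marks padding.
  def joinPartition(left: Seq[Row], right: Seq[Row]): Seq[(Option[Int], Option[Int])] = {
    val matched = for {
      l <- left if l.key.nonEmpty
      r <- right if r.key.nonEmpty && l.key == r.key
    } yield (l.id, r.id)
    val matchedLeft = matched.map(_._1).toSet
    val matchedRight = matched.map(_._2).toSet
    matched.map { case (a, b) => (Option(a), Option(b)) } ++
      left.filterNot(l => matchedLeft(l.id)).map(l => (Option(l.id), Option.empty[Int])) ++
      right.filterNot(r => matchedRight(r.id)).map(r => (Option.empty[Int], Option(r.id)))
  }

  // Shuffle both sides with the given partitioner, then join each reducer.
  def fullOuterJoin(
      left: Seq[Row],
      right: Seq[Row],
      partitioner: Row => Int): Set[(Option[Int], Option[Int])] =
    (0 until numReducers).flatMap { p =>
      joinPartition(
        left.filter(r => partitioner(r) == p),
        right.filter(r => partitioner(r) == p))
    }.toSet
}
```

Calling `fullOuterJoin` once with `colocate` and once with `spread` on the same inputs yields the same set of output rows in this model, because a NULL-keyed row is padded no matter which reducer it lands on or which other NULL-keyed rows it meets there.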
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]