peter-toth commented on code in PR #55927:
URL: https://github.com/apache/spark/pull/55927#discussion_r3260177280


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledJoin.scala:
##########
@@ -28,6 +28,21 @@ import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Dist
 trait ShuffledJoin extends JoinCodegenSupport {
   def isSkewJoin: Boolean
 
+  private def containsNullSafeJoinMarker(keys: Seq[Expression]): Boolean = {
+    keys.exists(_.exists(_.isInstanceOf[IsNull]))
+  }
+
+  private lazy val canSpreadNullJoinKeys: Boolean = {

Review Comment:
   Without spreading, NullType <=> keys all hash to the same value
(Murmur3Hash(null) is deterministic), so all NULL rows are colocated on one
reducer. The executor then runs:
                                                                     
   - SortMergeJoinExec.scala:1116: while (advancedStreamed() &&
streamedRowKey.anyNull), which skips every NULL-keyed streamed row.
   - SortMergeJoinExec.scala:1529: in full-outer, leftRowKey.anyNull triggers
padding emission, never a match.
                                                                       
   So even with NULL rows colocated, the executor's anyNull guard prevents
NULL=NULL from matching. The <=> semantics the user wanted (NULL matches NULL)
are never delivered for NullType. The rewrite was supposed to convert NULLs to
non-null sentinels so the executor's guard wouldn't fire, but for NullType the
sentinel itself is NULL, so the guard fires anyway and the join produces only
padding (full outer) or nothing (inner).
                                                                       
   With spreading, NULL rows scatter across reducers. Each reducer's executor
sees some NULL rows from both sides. The anyNull guard fires the same way, with
the same padding emission and the same lack of matching.
    
   Output is identical with or without spreading: both produce the
broken-but-self-consistent "NULL=NULL doesn't match" behavior for NullType.
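
   To make the argument concrete, here is a minimal toy model of the inner-join
case. It is only a sketch and does not use Spark's classes: `Row`, `anyNull`,
`innerJoin`, `colocated`, and `spread` are hypothetical names, `None` stands in
for SQL NULL, and the round-robin placement is an assumed stand-in for whatever
spreading strategy the PR uses. It only illustrates that a guard which skips
NULL-containing keys yields the same matches regardless of where NULL rows land.

```scala
// Toy model only: these names are hypothetical and are not Spark APIs.
object NullKeyGuardSketch {
  // A join key is a sequence of optional values; None stands in for SQL NULL.
  type Key = Seq[Option[Int]]
  final case class Row(key: Key, payload: String)

  // Plays the role of the executor's anyNull check on the join key.
  def anyNull(key: Key): Boolean = key.exists(_.isEmpty)

  // Inner join on one reducer: rows whose key contains a NULL are skipped
  // before any comparison happens, so NULL never matches NULL.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(String, String)] =
    for {
      l <- left if !anyNull(l.key)
      r <- right if !anyNull(r.key) && r.key == l.key
    } yield (l.payload, r.payload)

  // Colocated: every row lands on a single reducer.
  def colocated(left: Seq[Row], right: Seq[Row]): Seq[(String, String)] =
    innerJoin(left, right)

  // Spread: NULL-keyed rows are dealt round-robin across reducers, while
  // non-null keys are still partitioned by key hash so matches stay together.
  def spread(left: Seq[Row], right: Seq[Row], reducers: Int): Seq[(String, String)] = {
    def partition(rows: Seq[Row]): Map[Int, Seq[Row]] =
      rows.zipWithIndex.groupBy { case (row, i) =>
        if (anyNull(row.key)) i % reducers
        else Math.floorMod(row.key.hashCode, reducers)
      }.map { case (p, rs) => p -> rs.map(_._1) }

    val l = partition(left)
    val r = partition(right)
    (0 until reducers).flatMap { p =>
      innerJoin(l.getOrElse(p, Seq.empty), r.getOrElse(p, Seq.empty))
    }
  }
}
```

   With two NULL-keyed left rows, one NULL-keyed right row, and one pair that
shares the key 1, both `colocated` and `spread` return only the key-1 match;
the NULL-keyed rows contribute nothing in either layout.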

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

