agrawaldevesh commented on a change in pull request #29104: URL: https://github.com/apache/spark/pull/29104#discussion_r459223391
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala ########## @@ -454,6 +478,28 @@ case class BroadcastHashJoinExec( val (matched, checkCondition, _) = getJoinCondition(ctx, input) val numOutput = metricTerm(ctx, "numOutputRows") + if (isNullAwareAntiJoin) { + require(leftKeys.length == 1, "leftKeys length should be 1") + require(rightKeys.length == 1, "rightKeys length should be 1") + require(joinType == LeftAnti, "joinType must be LeftAnti.") + require(buildSide == BuildRight, "buildSide must be BuildRight.") + require(SQLConf.get.nullAwareAntiJoinOptimizeEnabled, + "nullAwareAntiJoinOptimizeEnabled must be on for null aware anti join optimize.") + require(checkCondition == "", "null aware anti join optimize condition should be empty.") + + if (broadcastRelation.value.inputEmpty) { + return s""" + |// singleColumn NAAJ inputEmpty(true) accept all + |$numOutput.add(1); + |${consume(ctx, input)} + """.stripMargin + } else if (broadcastRelation.value.anyNullKeyExists) { + return s""" + |// singleColumn NAAJ inputEmpty(false) anyNullKeyExists(true) reject all Review comment: > @maryannxue @agrawaldevesh @maropu @viirya what do you think? Thanks for the empirical analysis @cloud-fan ... that always helps to seal an argument :-) I do agree that introducing a new node is sometimes more pain than worth, so I would be equally be okay with just modifying BHJ. I think they are both fine choices and I would leave it to @leanken to make a call. The ideal scenario would be to have minimal code diff in BHJ and introducing a new node is only a fallback if the diff cannot be reduced to our satisfaction. Either way, we ought to decide one way or the other soon since the PR is becoming slightly confusing with both approaches co-existing :-) Are there any performance benchmarks that we should be re-running for BHJ to ensure no regression ? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org