Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22326#discussion_r220516732 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +153,51 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * Correctly handle PythonUDF which need access both side of join side by changing the new join + * type to Cross. + */ +object PullOutPythonUDFInJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + def hasPythonUDF(expression: Expression): Boolean = { + expression.collectFirst { case udf: PythonUDF => udf }.isDefined --- End diff -- This doesn't matter. We can't evaluate python udf in the join condition, and need to pull it out, that's all. For python udf accessing attributes from only one side, these would be pushed down by other rules. If they don't (e.g. user disables filter pushdown rule), we need to pull them out here, too. Anyway it's orthogonal to this rule.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org