Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22326#discussion_r220453684

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +153,56 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * Correctly handle PythonUDFs that need to access both sides of the join, by changing the
+ * join type to Cross.
+ */
+object HandlePythonUDFInJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  def hasPythonUDF(expression: Expression): Boolean = {
+    expression.collectFirst { case udf: PythonUDF => udf }.isDefined
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case j @ Join(_, _, joinType, condition)
+        if condition.map(splitConjunctivePredicates).getOrElse(Nil).exists(hasPythonUDF) =>
+      if (!joinType.isInstanceOf[InnerLike] && joinType != LeftSemi) {
+        // The current strategy only supports InnerLike and LeftSemi joins because, for other
+        // join types, running the join condition as a filter after the join breaks SQL
+        // semantics. If we let such a plan pass through, it would still fail later with a
+        // PythonUDF RuntimeException saying `requires attributes from more than one child`;
+        // we throw here first to give a more readable error message.
+        throw new AnalysisException("Using PythonUDF in join condition of join type" +
+          s" $joinType is not supported.")
+      }
+      if (SQLConf.get.crossJoinEnabled) {
--- End diff --

> the udf check in CheckCartesianProducts doesn't work because we have pulled out the udf in the join condition.

Yes, but the point is exactly that we don't need that check if we just make the change here.

> It will also break the UT added in BatchEvalPythonExecSuite.

This is interesting, why?
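(For context, here is a minimal sketch of the "pull out the udf" rewrite being discussed, i.e. what plausibly happens inside the elided `crossJoinEnabled` branch above. This is a hypothetical illustration, not the PR's actual code: the function name `pullOutPythonUDFs` and the pre-split `udfPreds`/`plainPreds` parameters are invented for clarity, and only the standard Catalyst APIs are assumed.)

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression}
import org.apache.spark.sql.catalyst.plans.Cross
import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LogicalPlan}

// Hypothetical sketch, not the PR's elided implementation.
// udfPreds:   conjuncts of the join condition that contain a PythonUDF
// plainPreds: conjuncts that do not
def pullOutPythonUDFs(
    j: Join,
    udfPreds: Seq[Expression],
    plainPreds: Seq[Expression]): LogicalPlan = {
  // Keep the UDF-free conjuncts in the join itself and turn it into a Cross join,
  // since the remaining condition may no longer constrain the join.
  val newJoin = j.copy(joinType = Cross, condition = plainPreds.reduceOption(And))
  // Evaluate the PythonUDF conjuncts afterwards as a Filter, once rows from
  // both sides are available in a single operator.
  udfPreds.reduceOption(And).map(Filter(_, newJoin)).getOrElse(newJoin)
}
```

Under this shape the Cross join is introduced deliberately by the rule itself, gated on `spark.sql.crossJoin.enabled`, which is why a separate PythonUDF check in `CheckCartesianProducts` would be redundant.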