Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22326#discussion_r214857643

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -1208,9 +1208,26 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
             reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
           val newRight = rightJoinConditions.
             reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
    -      val newJoinCond = commonJoinCondition.reduceLeftOption(And)
    -
    -      Join(newLeft, newRight, joinType, newJoinCond)
    +      val (newJoinConditions, others) =
    +        commonJoinCondition.partition(canEvaluateWithinJoin)
    +      val newJoinCond = newJoinConditions.reduceLeftOption(And)
    +      // if condition expression is unevaluable, it will be removed from
    +      // the new join conditions, if all conditions is unevaluable, we should
    +      // change the join type to CrossJoin.
    +      val newJoinType =
    +        if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) {
    +          logWarning(s"The whole commonJoinCondition:$commonJoinCondition of the join " +
    +            s"plan:\n $j is unevaluable, it will be ignored and the join plan will be " +
    --- End diff --

    @dilipbiswal there are cases when "trivial conditions" are removed from a join, so we turn, for instance, an inner join into a cross join. The performance would be awful, you're right. The point is that I am not sure there is a better way to achieve this: since we have no clue what the UDF does, we need to compare all the rows from both sides, i.e. we need to perform a cartesian product.

    > Wondering if we should error out or pick a bad plan

    This is, indeed, arguable. I think that letting the user choose is a good idea. If the user runs the query and gets an `AnalysisException` because he/she is trying to perform a cartesian product, he/she can decide: "ok, I am doing the wrong thing, let's change it"; or he/she can say: "well, one of the 2 tables involved contains 10 rows, not a big deal, I want to perform it nonetheless", set `spark.sql.crossJoin.enabled=true`, and run it.

    > for join types other than inner and leftsemi, we still have the same issue, no ?

    I think the current PR properly handles only the inner join case; for the left semi case, this PR returns an incorrect result IIUC. This needs to be fixed as well.
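    To make the user-facing choice concrete, here is a minimal, hypothetical sketch (names and data are invented, not taken from the PR): it shows the opt-in referred to above, i.e. setting `spark.sql.crossJoin.enabled=true` when the user accepts paying for a cartesian product.

    ```scala
    // Hypothetical, minimal example (object name, DataFrames and data are made up):
    // when a plan reduces to a cartesian product, Spark refuses it by default and
    // raises an AnalysisException unless the user opts in via the config below.
    import org.apache.spark.sql.SparkSession

    object CrossJoinOptInExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cross-join-opt-in")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val customers = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
        val orders    = Seq((10, 1), (11, 3)).toDF("order_id", "customer_id")

        // With spark.sql.crossJoin.enabled=false (the default), a plan that ends up
        // as a cartesian product fails with an AnalysisException. The user who
        // decides "one of my 2 tables contains 10 rows, not a big deal" can opt in
        // and rerun:
        spark.conf.set("spark.sql.crossJoin.enabled", "true")

        // crossJoin is the explicit API; the discussion above is about plans that
        // only *become* cartesian products after unevaluable join conditions are
        // dropped by the optimizer.
        customers.crossJoin(orders).show()

        spark.stop()
      }
    }
    ```

    If I read the diff correctly, the same trade-off applies when the rule turns an inner join whose whole `commonJoinCondition` is unevaluable into a cross join: either the user opts in to the cartesian product, or the query should fail.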