Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17330#discussion_r107077129
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
    @@ -61,6 +63,37 @@ abstract class SubqueryExpression(
       }
     }
     
    +/**
    + * This expression is used to represent any form of subquery expression namely
    + * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
    + * expression equality works properly when LogicalPlan.sameResult is called
    + * on plans containing SubqueryExpression(s). This is only a transient expression
    + * that only lives in the scope of sameResult function call. In other words, analyzer,
    + * optimizer or planner never sees this expression type during transformation of
    + * plans.
    + */
    +case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
    +  extends UnaryExpression with Unevaluable {
    +  override def dataType: DataType = expr.dataType
    +  override def nullable: Boolean = expr.nullable
    +  override def child: Expression = expr
    +  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
    +
    +  // Hashcode is generated conservatively for now i.e it does not include the
    +  // sub query plan. Doing so causes issue when we canonicalize expressions to
    +  // re-order them based on hashcode.
    +  // TODO : improve the hashcode generation by considering the plan info.
    +  override def hashCode(): Int = {
    +    val h = Objects.hashCode(expr.children)
    --- End diff --
    
    @cloud-fan Hi Wenchen, let me take an example and perhaps that will help us.
    
    Let's assume there is a Filter operator that has three subquery expressions, like the following:
    ```scala
    Filter Scalar-subquery(.., splan1, ..) || Exists(.., splan2, ..) || In(.., ListQuery(.., splan3, ..))
    ```
    
    1. During sameResult on this plan, we perform this [logic](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L367-L375).
    2. We clean the args from both the source and target plans.
       2.1 We change the subquery expression to CanonicalizedSubqueryExpression so that we can do a customized equals that basically does a semantic equals of the two subquery expressions.
       2.2 Additionally, we change the subquery plans (splan1, splan2, splan3) to replace the outer attribute references with BindReferences. Since these are outer references, the inner plan is not aware of these attributes, so we fix them up as part of canonicalization. After this, when the equals method is called on CanonicalizedSubqueryExpression [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L372) and sameResult is called on splan1, splan2, splan3, it works fine because the outer references have been normalized to BindReferences. The local attributes referenced in the plan are handled in sameResult itself.
       2.3 As part of cleanArgs, after cleaning the expressions we call canonicalized [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L402). So here we try to re-order the three CanonicalizedSubqueryExpressions on the basis of their hashCode. When I had hashCode call expr.semanticHash(), I observed that the ordering of the expressions between the source and target was not consistent. I debugged and found that expr.semanticHash() considers the subquery plans (they are class arguments), and those plans still contain arbitrary expression ids, since at this point we have only normalized the outer references, not all of them. Thus, in the current implementation I am excluding the subplans while computing the hashcode and have marked it as a TODO.
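    
    To make the hashCode issue in 2.3 concrete, here is a small self-contained sketch. These are toy classes, not the actual Catalyst types (ToySubquery, ToyPlan and ToyAttr are hypothetical stand-ins). It shows why a hash that includes the subquery plan is unstable across two semantically identical plans that differ only in expression ids, while a hash over only the children is stable:
    ```scala
    import java.util.Objects
    
    // An attribute whose exprId differs between otherwise-equivalent plans,
    // analogous to Catalyst's arbitrary expression ids.
    case class ToyAttr(name: String, exprId: Long)
    
    // A subquery plan; its case-class hashCode includes the exprIds.
    case class ToyPlan(output: Seq[ToyAttr])
    
    // A subquery expression: outer-reference children plus an inner plan.
    case class ToySubquery(children: Seq[String], plan: ToyPlan) {
      // Conservative hash: only the children, never the plan.
      def conservativeHash: Int = Objects.hashCode(children)
      // Naive hash: includes the plan, and hence the arbitrary exprIds.
      def naiveHash: Int = Objects.hash(children, plan)
    }
    
    object HashStability extends App {
      // Semantically identical subqueries from a "source" and a "target" plan,
      // differing only in the exprIds the analyzer happened to assign.
      val source = ToySubquery(Seq("outer.a"), ToyPlan(Seq(ToyAttr("x", exprId = 1))))
      val target = ToySubquery(Seq("outer.a"), ToyPlan(Seq(ToyAttr("x", exprId = 42))))
    
      // The conservative hash agrees, so hash-based re-ordering stays consistent.
      assert(source.conservativeHash == target.conservativeHash)
      // The plan-based hash (almost certainly) disagrees, which is what broke the ordering.
      assert(source.naiveHash != target.naiveHash)
      println("conservative hash stable; plan-based hash unstable")
    }
    ```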
    
    Please let me know your thoughts, and whether I have misunderstood anything.

