[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-40362:
-------------------------
    Description: 
In the canonicalization code, which now runs in two stages, canonicalization 
involving commutative operators is broken when they are subexpressions of 
certain types of expressions that override precanonicalize, for example 
BinaryComparison.

Consider the following expression:

a + b > 10

        GT
       /  \
  a + b    10
In precanonicalize, the BinaryComparison operator first precanonicalizes its 
children and then may swap its operands based on a left/right hashCode 
comparison.

Say Add(a, b).hashCode is > Literal(10).hashCode; as a result the GT is 
converted to an LT.

But if the same tree is created as

        GT
       /  \
  b + a    10

The hashCode of Add(b, a) is not the same as that of Add(a, b), so it is 
possible that for this tree Add(b, a).hashCode is < Literal(10).hashCode, in 
which case the GT remains as is.

Thus two semantically identical trees result in different canonicalizations, 
one having GT and the other LT.

 

The problem occurs because for commutative expressions canonicalization 
normalizes the expression to a consistent hashCode, which precanonicalize does 
not: the hashCode of a commutative expression's precanonicalized form differs 
from that of its canonicalized form.
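To make the order-sensitivity concrete, here is a minimal plain-Scala sketch (hypothetical names, not Spark's actual classes) of a swap rule that flips a comparison when the left operand's hashCode exceeds the right's:

```scala
// Toy stand-in for an Add expression; case-class hashCode is order-sensitive.
case class Add(l: String, r: String)

// Toy stand-in for BinaryComparison.precanonicalize's swap rule: swap the
// operands (turning GT into LT) when left.hashCode > right.hashCode.
def normalizeGt(left: Any, right: Any): String =
  if (left.hashCode > right.hashCode) "LT" else "GT"

val ab = Add("a", "b")
val ba = Add("b", "a")

// ab and ba are semantically the same sum, but their hashCodes differ, so
// normalizeGt(ab, 10) and normalizeGt(ba, 10) can disagree.
val orderSensitive = ab.hashCode != ba.hashCode
```

Because the two hashCodes differ, the two writings of the same sum can land on opposite sides of 10.hashCode, which is exactly the GT-vs-LT divergence described above.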

 

 

The test 
{quote}test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
    val y = tr1.where('a.attr + 'c.attr > 10).analyze
    val fullCond = y.asInstanceOf[Filter].condition.clone()
    val addExpr = (fullCond match {
      case GreaterThan(x, _) => x
    }).clone().asInstanceOf[Add]
    val canonicalizedFullCond = fullCond.canonicalized

    // swap the operands of Add
    val newAddExpr = Add(addExpr.right, addExpr.left)

    // build a new condition which is the same as the previous one, but with
    // the operands of Add reversed
    val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

    assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails.

The fix I propose is that for commutative expressions, precanonicalize should 
be overridden and Canonicalize.reorderCommutativeOperators invoked on the 
expression there, instead of in canonicalize. Effectively, for commutative 
operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize 
should be the same.
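As a toy illustration of the proposed behavior (plain Scala with hypothetical classes, not Spark's Expression API): if a commutative node's precanonicalize already reorders its operands, both writings of the sum precanonicalize to the identical child, so a parent comparison's hashCode-based swap can no longer diverge:

```scala
// Hypothetical mini expression tree, not Spark internals.
sealed trait Expr { def precanonicalize: Expr }
case class Leaf(name: String) extends Expr {
  def precanonicalize: Expr = this
}
case class Add(operands: Seq[Expr]) extends Expr {
  // Mirrors the effect of Canonicalize.reorderCommutativeOperators: put
  // commutative operands into a deterministic order already at
  // precanonicalize time, so precanonicalize == canonicalize for Add.
  def precanonicalize: Expr =
    Add(operands.map(_.precanonicalize).sortBy(_.hashCode))
}

val e1 = Add(Seq(Leaf("a"), Leaf("b"))).precanonicalize
val e2 = Add(Seq(Leaf("b"), Leaf("a"))).precanonicalize
// e1 == e2: a parent GreaterThan now sees the same child for both writings.
```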

PR:

[https://github.com/apache/spark/pull/37824]

 

 

I am also trying a better fix, whereby for commutative expressions the murmur 
hashCode is calculated using unorderedHash so that it is order-independent 
(i.e. symmetric).

That approach works fine, but in the case of Least & Greatest, the Product's 
element is a Seq, and that breaks the consistency of the hashCode.
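A small sketch of the unordered-hash idea and of the Least/Greatest caveat (plain Scala, with hypothetical case classes standing in for the Spark expressions):

```scala
import scala.util.hashing.MurmurHash3

// Binary commutative node: its product elements are the two operands.
case class Add(l: String, r: String)
// Variadic node like Least/Greatest: its single product element is a Seq.
case class Least(children: Seq[String])

// unorderedHash over the operands is symmetric, which is the property wanted:
val h1 = MurmurHash3.unorderedHash(Add("a", "b").productIterator.toSeq)
val h2 = MurmurHash3.unorderedHash(Add("b", "a").productIterator.toSeq)
// h1 == h2

// For Least the only product element is the Seq itself, and a Seq's own
// hashCode is order-sensitive, so the unordered hash no longer helps:
val g1 = MurmurHash3.unorderedHash(Least(Seq("a", "b")).productIterator.toSeq)
val g2 = MurmurHash3.unorderedHash(Least(Seq("b", "a")).productIterator.toSeq)
// g1 != g2
```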

  was:
In the canonicalization code, which now runs in two stages, canonicalization 
involving commutative operators is broken when they are subexpressions of 
certain types of expressions that override precanonicalize.

Consider the following expression:

a + b > 10

This GreaterThan expression, when canonicalized as a whole for the first time, 
will skip the call to Canonicalize.reorderCommutativeOperators for the Add 
expression, because GreaterThan's canonicalization used precanonicalize on its 
children (the Add expression).

 

So if we create a new expression 

b + a > 10 and canonicalize it, the canonicalized versions of these two 
expressions will not match.

The test 
{quote}test("bug X") {
    val tr1 = LocalRelation('c.int, 'b.string, 'a.int)

   val y = tr1.where('a.attr + 'c.attr > 10).analyze
   val fullCond = y.asInstanceOf[Filter].condition.clone()
  val addExpr = (fullCond match {
    case GreaterThan(x, _) => x
  }).clone().asInstanceOf[Add]
val canonicalizedFullCond = fullCond.canonicalized

// swap the operands of add
val newAddExpr = Add(addExpr.right, addExpr.left)

// build a new condition which is the same as the previous one, but with
// the operands of Add reversed

val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized

assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
}
{quote}
This test fails.

The fix I propose is that for commutative expressions, precanonicalize should 
be overridden and Canonicalize.reorderCommutativeOperators invoked on the 
expression there, instead of in canonicalize. Effectively, for commutative 
operators (Add, Or, Multiply, And, etc.) canonicalize and precanonicalize 
should be the same.



PR:

https://github.com/apache/spark/pull/37824


> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40362
>                 URL: https://issues.apache.org/jira/browse/SPARK-40362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Asif
>            Priority: Major
>              Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
