[ https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun reassigned SPARK-40362:
-------------------------------------

    Assignee: Peter Toth

> Bug in Canonicalization of expressions like Add & Multiply, i.e. Commutative
> Operators
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40362
>                 URL: https://issues.apache.org/jira/browse/SPARK-40362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Asif
>            Assignee: Peter Toth
>            Priority: Major
>              Labels: spark-sql
>             Fix For: 3.3.1
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which now runs in two stages, canonicalization
> involving commutative operators is broken when they are subexpressions of
> certain kinds of expressions that override precanonicalize, for example
> BinaryComparison.
>
> Consider the following expression: a + b > 10
>
>        GT
>       /  \
>   a + b   10
>
> In precanonicalize, the BinaryComparison operator first precanonicalizes its
> children and then may swap its operands based on a left/right hashCode
> comparison. Say Add(a, b).hashCode is greater than 10.hashCode; as a result
> GT is converted to LT.
>
> But if the equivalent tree is created with the Add operands reversed:
>
>        GT
>       /  \
>   b + a   10
>
> The hashCode of Add(b, a) is not the same as that of Add(a, b), so it is
> possible that for this tree Add(b, a).hashCode is less than 10.hashCode, in
> which case GT remains as is.
>
> Thus two equivalent trees canonicalize differently: one ends up with GT, the
> other with LT.
>
> The problem occurs because, for commutative expressions, canonicalization
> normalizes the expression to a consistent hashCode, which is not the case
> with precanonicalize: the hashCodes of a commutative expression before and
> after canonicalization are different.
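[Editor's note] A minimal, self-contained Scala sketch of the mechanism described above, using hypothetical stand-in case classes (Attr, Lit, Add, GT, LT and the precanonicalize function are illustrative simplifications, not Spark's actual Catalyst classes). It shows how a hashCode-based operand swap can make two semantically equal trees diverge, and how MurmurHash3.unorderedHash (mentioned as the alternative fix below) is order independent:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical, simplified stand-ins for Catalyst expression nodes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class GT(l: Expr, r: Expr) extends Expr
case class LT(l: Expr, r: Expr) extends Expr

// Mimics BinaryComparison.precanonicalize: swap operands (flipping GT to LT)
// when the left child's hashCode exceeds the right child's.
def precanonicalize(e: GT): Expr =
  if (e.l.hashCode > e.r.hashCode) LT(e.r, e.l) else e

val a = Attr("a"); val b = Attr("b"); val ten = Lit(10)

// Two semantically identical predicates, differing only in Add operand order.
val c1 = precanonicalize(GT(Add(a, b), ten))
val c2 = precanonicalize(GT(Add(b, a), ten))
// Case-class hashCode is order dependent: Add(a, b).hashCode generally
// differs from Add(b, a).hashCode, so c1 and c2 can disagree on the swap
// (one stays GT, the other flips to LT).

// An order-independent (symmetric) hash removes that dependence:
val h1 = MurmurHash3.unorderedHash(Seq(a, b), 42)
val h2 = MurmurHash3.unorderedHash(Seq(b, a), 42)
assert(h1 == h2) // same hash regardless of operand order
```

With an unordered hash the swap decision no longer depends on which operand order the tree happened to be built with, which is the intent of the alternative fix sketched at the end of the issue.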
>
> The following test demonstrates the issue (the match pattern below is
> reconstructed; the original was lost to a JIRA "Unknown macro" rendering
> error):
>
> {quote}
> test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(add: Add, _) => add
>   }).clone().asInstanceOf[Add]
>
>   val canonicalizedFullCond = fullCond.canonicalized
>
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>
>   // build a new condition which is the same as the previous one,
>   // but with the operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
>
> This test fails.
>
> The fix I propose is that, for commutative expressions, precanonicalize
> should be overridden to invoke Canonicalize.reorderCommutativeOperators on
> the expression, instead of doing so in canonicalize. Effectively, for
> commutative operators (Add, Or, Multiply, And, etc.) canonicalize and
> precanonicalize should be the same.
>
> PR: https://github.com/apache/spark/pull/37824
>
> I am also trying a better fix, where the idea is that for commutative
> expressions the murmur hashCodes are calculated using unorderedHash, so that
> the hash is order independent (i.e. symmetric). This approach works fine,
> but in the case of Least & Greatest the Product's element is a Seq, and that
> breaks the consistency of the hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org