[ https://issues.apache.org/jira/browse/SPARK-40362 ]
Asif deleted comment on SPARK-40362:
--
was (Author: ashahid7):
I have locally added tests for
&&, ||, &, | and *, and they all fail for the same reason.
I am in the process of adding tests for xor, greatest & least, and of testing
the fix for them.
The fix is to make these expressions implement a trait:

trait CommutativeExpressionCanonicalization {
  this: Expression =>

  override lazy val canonicalized: Expression = preCanonicalized

  override lazy val preCanonicalized: Expression = {
    val canonicalizedChildren = children.map(_.preCanonicalized)
    Canonicalize.reorderCommutativeOperators {
      withNewChildren(canonicalizedChildren)
    }
  }
}
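For context, the shared failure mode can be sketched with a toy model in plain Scala. All names below (Expr, Leaf, AddE, Gt, Lt, preCanonicalize) are illustrative stand-ins, not Spark's actual classes: a comparison node that swaps its operands whenever the left child's hashCode exceeds the right's, without first reordering commutative children, canonicalizes two equivalent trees differently.

```scala
// Toy model of the described swap logic; illustrative names only,
// not Spark's actual Catalyst classes.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class AddE(l: Expr, r: Expr) extends Expr // commutative op, but order-sensitive hashCode
case class Gt(l: Expr, r: Expr) extends Expr
case class Lt(l: Expr, r: Expr) extends Expr

// Mimic the comparison's pre-canonicalization: flip GT to LT when the
// left operand's hashCode is larger, with no prior reordering of
// commutative children.
def preCanonicalize(e: Expr): Expr = e match {
  case Gt(l, r) if l.hashCode > r.hashCode => Lt(r, l)
  case other                               => other
}

val c1 = preCanonicalize(Gt(AddE(Leaf("a"), Leaf("b")), Leaf("10")))
val c2 = preCanonicalize(Gt(AddE(Leaf("b"), Leaf("a")), Leaf("10")))

// Semantically identical predicates end up with structurally different
// canonical forms, because AddE(a, b) and AddE(b, a) hash (and hence
// compare) differently.
assert(c1 != c2)
```

Reordering the commutative children before the swap decision, as the trait above does, removes this order sensitivity.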
> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative
> Operators
>
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Asif
> Priority: Major
> Labels: spark-sql
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> In the canonicalization code, which is now done in two stages,
> canonicalization involving commutative operators is broken if they are
> subexpressions of certain types of expressions that override
> preCanonicalized, for example BinaryComparison.
> Consider the following expression: a + b > 10
>          GT
>         /  \
>    a + b    10
> In preCanonicalized, the BinaryComparison operator first pre-canonicalizes
> its children and may then swap its operands based on a left/right hashCode
> inequality. Say Add(a, b).hashCode is greater than Literal(10).hashCode; as
> a result, GT is converted to LT.
> But if the same tree is created as
>          GT
>         /  \
>    b + a    10
> the hashCode of Add(b, a) is not the same as that of Add(a, b), so it is
> possible that for this tree Add(b, a).hashCode is less than
> Literal(10).hashCode, in which case GT remains as is.
> Thus two equivalent trees result in different canonicalizations, one having
> GT, the other LT.
>
> The problem occurs because for commutative expressions, full canonicalization
> normalizes the expression so that its hashCode is consistent; this is not the
> case with precanonicalize, because a commutative expression's hashCode after
> precanonicalize differs from its hashCode after full canonicalization.
>
>
> The test
> {quote}
> test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     case GreaterThan(x: Add, _) => x
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with
>   // the operands of Add reversed
>   val builtCondnCanonicalized =
>     GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for commutative expressions,
> preCanonicalized should be overridden to invoke
> Canonicalize.reorderCommutativeOperators on the expression there, instead of
> in canonicalized. Effectively, for commutative operators (Add, Or, Multiply,
> And, etc.) canonicalized and preCanonicalized should be the same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>
>
> I am also trying a better fix, where the idea is that for commutative
> expressions the murmur hashCodes are calculated using unorderedHash, so that
> the hash is order independent (i.e. symmetric).
> The above approach works fine, but in the case of Least & Greatest the
> Product's element is a Seq, and that interferes with the consistency of the
> hashCode.
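The order-independence idea can be seen with MurmurHash3 from the Scala standard library: ordered hashing (which a case class's structural hash resembles) is sensitive to element order, while unorderedHash is not. This is only a sketch of the idea with plain Seq values; the actual change would live inside Spark's expression hashing.

```scala
import scala.util.hashing.MurmurHash3

val ab = Seq[Any]("a", "b")
val ba = Seq[Any]("b", "a")

// Ordered hashing distinguishes the two operand orders
// (the two values below will typically differ):
val orderedAB = MurmurHash3.orderedHash(ab)
val orderedBA = MurmurHash3.orderedHash(ba)

// Unordered hashing is symmetric, so Add(a, b) and Add(b, a) would
// hash identically if commutative expressions used it:
val unorderedAB = MurmurHash3.unorderedHash(ab)
val unorderedBA = MurmurHash3.unorderedHash(ba)
assert(unorderedAB == unorderedBA)
```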
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org