Boxuan Li created SPARK-52226:
---------------------------------

             Summary: Strengthen data source v2 operators' equality checks
                 Key: SPARK-52226
                 URL: https://issues.apache.org/jira/browse/SPARK-52226
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Boxuan Li


[https://github.com/apache/spark/commit/e97ab1d9807134bb557ae73920af61e8534b2b08#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38]
 introduced `equals` overrides for a few data source v2 operators, while none 
of the other operators overrides `equals`.

This means the equivalence checks of BatchScanExec, ContinuousScanExec, and 
MicroBatchScanExec are much looser than those of all other operators. This does 
not seem intentional; it looks like an oversight to me. Different operators 
should follow the same basic contracts where possible, and where that is not 
possible, they should not differ too much from each other. Notably, the 
original author also left a TODO to "unify" them.

As a result, most operators have the strictest equivalence checks, while a few 
operators have loose ones. What could go wrong? Since Spark is extensible, it 
is possible to subclass Spark's operators with a modified runtime 
implementation that delivers the same results. In fact, that is what the 
[https://github.com/apache/incubator-gluten] project does: (most) Spark 
operators are subclassed by Gluten operators. Given the loose equivalence 
checks of these Spark operators, we could end up declaring a Spark operator and 
a Gluten operator equivalent.
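
To illustrate the concern, here is a minimal, self-contained sketch. All class 
names (Scan, LooseScanExec, NativeScanExec, StrictScanExec, 
NativeStrictScanExec) are hypothetical stand-ins, not Spark's or Gluten's 
actual classes, and the "loose" `equals` is only assumed to compare a single 
field. The runtime-class check in the "strict" variant is just one way to model 
a stricter contract; it is not necessarily how Spark implements or would fix 
this.

```scala
// Hypothetical, simplified stand-ins; NOT Spark's or Gluten's real classes.
final case class Scan(table: String)

// "Loose" operator: equals compares only the scan and accepts any subclass.
class LooseScanExec(val scan: Scan) {
  override def equals(other: Any): Boolean = other match {
    case that: LooseScanExec => this.scan == that.scan
    case _                   => false
  }
  override def hashCode(): Int = scan.hashCode()
}

// Subclass with a different runtime implementation but the same results.
class NativeScanExec(s: Scan) extends LooseScanExec(s)

// "Strict" variant (assumption): additionally requires the same runtime class.
class StrictScanExec(val scan: Scan) {
  override def equals(other: Any): Boolean = other match {
    case that: StrictScanExec =>
      this.getClass == that.getClass && this.scan == that.scan
    case _ => false
  }
  override def hashCode(): Int = (getClass, scan).hashCode()
}

class NativeStrictScanExec(s: Scan) extends StrictScanExec(s)

object EqualityDemo {
  def main(args: Array[String]): Unit = {
    val scan = Scan("t")
    // The loose check treats the parent and the subclass as equal.
    println(new LooseScanExec(scan) == new NativeScanExec(scan))        // true
    // The strict check tells them apart.
    println(new StrictScanExec(scan) == new NativeStrictScanExec(scan)) // false
  }
}
```

In this sketch, any code that deduplicates or reuses "equal" plan nodes could 
swap a LooseScanExec for a NativeScanExec (or vice versa) without noticing that 
their runtime implementations differ.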

If Spark had started with a clear contract that operators are "equal" as long 
as they deliver the same results, this would probably be fine. Instead, most 
operators do not follow that rule, and only the three operators mentioned above 
do. This is very easy to miss, and has caused unexpected behavior and bugs in 
downstream applications.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

