Holden Karau created SPARK-47672:
------------------------------------

             Summary: Avoid double evaluation of non-trivial projected elements 
from filter pushdown
                 Key: SPARK-47672
                 URL: https://issues.apache.org/jira/browse/SPARK-47672
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Holden Karau


Repro here [https://gist.github.com/holdenk/0f9660bcbd9e63aaff904f15d3439db1] 
 
You can work around this by setting an expensive UDF to non-deterministic but 
that's not ideal and won't fix expensive internal operations (like string 
matching).
 
Instead when we go to bubble up a filter, if we should not move a filter up 
above a projection of what we are filtering on.
 
https://issues.apache.org/jira/browse/SPARK-40045 partially fixed some of this 
by (roughly) ordering filter expressions by cost so that we're not evaluating 
more than ~2x (e.g. in old behavior bubbled up filter could become the first 
elem of the filter and then the cheap null checks would go away and we'd have 
expensive compute on everything not just filtered data), but we should "trust" 
the users projection + later use of that projection to indicate that a UDF is 
expensive and we should only evaluate it once inside of the projection and 
filter after.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to