Holden Karau created SPARK-47672: ------------------------------------ Summary: Avoid double evaluation of non-trivial projected elements from filter pushdown Key: SPARK-47672 URL: https://issues.apache.org/jira/browse/SPARK-47672 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Holden Karau
Repro here [https://gist.github.com/holdenk/0f9660bcbd9e63aaff904f15d3439db1] You can work around this by setting an expensive UDF to non-deterministic but that's not ideal and won't fix expensive internal operations (like string matching). Instead when we go to bubble up a filter, if we should not move a filter up above a projection of what we are filtering on. https://issues.apache.org/jira/browse/SPARK-40045 partially fixed some of this by (roughly) ordering filter expressions by cost so that we're not evaluating more than ~2x (e.g. in old behavior bubbled up filter could become the first elem of the filter and then the cheap null checks would go away and we'd have expensive compute on everything not just filtered data), but we should "trust" the users projection + later use of that projection to indicate that a UDF is expensive and we should only evaluate it once inside of the projection and filter after. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org