[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caican updated SPARK-40045:
---------------------------
    Description: 
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a'
{code}
Based on this SQL, the filters are currently evaluated in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)` and
// `(trim(data#23, None) = a)` come before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
   +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{code}
With this predicate order, every row has to be evaluated against the expensive predicates (md5, trim), even when it would already fail the cheap comparison (id#22L = 2), and this can make Spark tasks run slowly.

So I think expensive filtering predicates should automatically be moved to the far right, so that rows which fail the cheap predicates are never evaluated against them. As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
   +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{noformat}
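For reference, the current plan above can be reproduced with a plain {{explain()}} call. This is only a reproduction sketch: it assumes a SparkSession named {{spark}} in which the in-memory test catalog has already been registered under the name {{testcat}}.
{code:scala}
// Reproduction sketch: assumes `spark` is a SparkSession and the in-memory
// test catalog is already registered as `testcat` (test-setup assumption).
spark.sql(
  """select id, data FROM testcat.ns1.ns2.table
    |where id = 2
    |and md5(data) = '8cde774d6f7333752ed72cacddb05126'
    |and trim(data) = 'a'""".stripMargin).explain()
{code}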
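To make the proposal concrete, here is a minimal sketch of such a reordering, not Spark's actual optimizer code: it splits a conjunctive Filter condition into conjuncts, keeps the cheap ones first, and rebuilds the condition with the expensive ones on the far right. {{PredicateHelper}}, {{Md5}} and {{StringTrim}} are existing Catalyst classes; the object name and the {{isExpensive}} cost heuristic are hypothetical.
{code:scala}
// Hypothetical sketch of the proposed reordering, not Spark's actual code.
import org.apache.spark.sql.catalyst.expressions._

object ReorderConjunctsSketch extends PredicateHelper {

  // Hypothetical cost heuristic: a conjunct is "expensive" if it contains
  // a call to a costly function such as md5 or trim.
  private def isExpensive(conjunct: Expression): Boolean =
    conjunct.find {
      case _: Md5 | _: StringTrim => true
      case _ => false
    }.isDefined

  // e.g. turns `md5(data) = '...' AND trim(data) = 'a' AND id = 2`
  // into `id = 2 AND md5(data) = '...' AND trim(data) = 'a'`
  def reorder(condition: Expression): Expression = {
    val (expensive, cheap) =
      splitConjunctivePredicates(condition).partition(isExpensive)
    (cheap ++ expensive).reduceLeft(And)
  }
}
{code}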
> The order of filtering predicates is not reasonable
> ---------------------------------------------------
>
>                 Key: SPARK-40045
>                 URL: https://issues.apache.org/jira/browse/SPARK-40045
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.0, 3.3.0
>            Reporter: caican
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org