[jira] [Assigned] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-40045: -- Assignee: caican > The order of filtering predicates is not reasonable > --- > > Key: SPARK-40045 > URL: https://issues.apache.org/jira/browse/SPARK-40045 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: caican >Assignee: caican >Priority: Major > Fix For: 3.4.0 > > > {code:java} > select id, data FROM testcat.ns1.ns2.table > where id =2 > and md5(data) = '8cde774d6f7333752ed72cacddb05126' > and trim(data) = 'a' {code} > Based on the SQL, we currently get the filters in the following order: > {code:java} > // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))` comes before `(id#22L = 2)` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code} > In this predicate order, all data needs to participate in the evaluation, > even if some data does not meet the later filtering criteria and it may > causes spark tasks to execute slowly. > > So i think that filtering predicates that need to be evaluated should > automatically be placed to the far right to avoid data that does not meet the > criteria being evaluated. > > As shown below: > {noformat} > // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = > 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = > 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40045: Assignee: Apache Spark > The order of filtering predicates is not reasonable > --- > > Key: SPARK-40045 > URL: https://issues.apache.org/jira/browse/SPARK-40045 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: caican >Assignee: Apache Spark >Priority: Major > > {code:java} > select id, data FROM testcat.ns1.ns2.table > where id =2 > and md5(data) = '8cde774d6f7333752ed72cacddb05126' > and trim(data) = 'a' {code} > Based on the SQL, we currently get the filters in the following order: > {code:java} > // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))` comes before `(id#22L = 2)` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code} > In this predicate order, all data needs to participate in the evaluation, > even if some data does not meet the later filtering criteria and it may > causes spark tasks to execute slowly. > > So i think that filtering predicates that need to be evaluated should > automatically be placed to the far right to avoid data that does not meet the > criteria being evaluated. > > As shown below: > {noformat} > // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = > 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40045: Assignee: (was: Apache Spark) > The order of filtering predicates is not reasonable > --- > > Key: SPARK-40045 > URL: https://issues.apache.org/jira/browse/SPARK-40045 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: caican >Priority: Major > > {code:java} > select id, data FROM testcat.ns1.ns2.table > where id =2 > and md5(data) = '8cde774d6f7333752ed72cacddb05126' > and trim(data) = 'a' {code} > Based on the SQL, we currently get the filters in the following order: > {code:java} > // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))` comes before `(id#22L = 2)` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code} > In this predicate order, all data needs to participate in the evaluation, > even if some data does not meet the later filtering criteria and it may > causes spark tasks to execute slowly. > > So i think that filtering predicates that need to be evaluated should > automatically be placed to the far right to avoid data that does not meet the > criteria being evaluated. > > As shown below: > {noformat} > // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = > 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org