[
https://issues.apache.org/jira/browse/SPARK-53742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ji Jun Tang updated SPARK-53742:
--------------------------------
Issue Type: Improvement (was: Bug)
> Push down the filter used in the count_if function
> --------------------------------------------------
>
> Key: SPARK-53742
> URL: https://issues.apache.org/jira/browse/SPARK-53742
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Ji Jun Tang
> Priority: Minor
>
> By pushing down the filter condition in the count_if function, we can reduce
> the volume of data that needs to be processed.
>
> {code:java}
> spark.sql("create table t1(a int, b int, c int) using parquet")
> spark.sql("select count_if(a <> 1) from t1").explain("cost")
> {code}
> Current:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [count(if (NOT _common_expr_0#6) null else _common_expr_0#6) AS count_if((NOT (a = 1)))#4L], Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Project [NOT (a#0 = 1) AS _common_expr_0#6], Statistics(sizeInBytes=1.0 B)
>    +- Relation spark_catalog.default.t1[a#0,b#1,c#2] parquet, Statistics(sizeInBytes=0.0 B)
> {code}
> Expected:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [count(if (NOT _common_expr_2#22) null else _common_expr_2#22) AS count_if((NOT (a = 1)))#21L], Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Project [NOT (a#3 = 1) AS _common_expr_2#22], Statistics(sizeInBytes=1.0 B)
>    +- Filter (isnotnull(a#3) AND NOT (a#3 = 1)), Statistics(sizeInBytes=1.0 B)
>       +- Relation spark_catalog.default.t1[a#3,b#4,c#5] parquet, Statistics(sizeInBytes=0.0 B)
> {code}
>
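The reasoning behind the expected plan can be checked without Spark. The following is a minimal plain-Java sketch, not Spark code: the class and method names are hypothetical, and column a is modelled as a list of nullable integers. It assumes count_if is the only aggregate and the query has no GROUP BY; with other aggregates or grouping keys, filtering the input first would change their results. Rows where a <> 1 evaluates to false or null never contribute to count_if, so dropping them below the aggregate should leave the count unchanged.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical model of the rewrite; none of these names exist in Spark.
final class CountIfPushdownSketch {

    // SQL three-valued logic for `a <> 1`: a null input yields null (unknown).
    static Boolean notEqualsOne(Integer a) {
        return a == null ? null : a != 1;
    }

    // Models the current plan: every row reaches the aggregate,
    // which counts only the rows whose predicate is TRUE.
    static long countIfWithoutFilter(List<Integer> columnA) {
        long count = 0;
        for (Integer a : columnA) {
            if (Boolean.TRUE.equals(notEqualsOne(a))) {
                count++;
            }
        }
        return count;
    }

    // Models the expected plan: Filter (isnotnull(a) AND NOT (a = 1)) runs
    // first, then the same aggregate sees only the surviving rows.
    static long countIfWithFilter(List<Integer> columnA) {
        return columnA.stream()
                .filter(Objects::nonNull)                           // isnotnull(a)
                .filter(a -> a != 1)                                // NOT (a = 1)
                .filter(a -> Boolean.TRUE.equals(notEqualsOne(a)))  // count_if's own check
                .count();
    }

    public static void main(String[] args) {
        List<Integer> columnA = Arrays.asList(1, 2, null, 3, 1);
        System.out.println(countIfWithoutFilter(columnA)); // prints 2
        System.out.println(countIfWithFilter(columnA));    // prints 2
    }
}
```

Both methods return 2 for the sample column (1, 2, null, 3, 1) and 0 for an empty column. The empty case corresponds to a global aggregate still producing one row when the filter removes every input row, which is why the sketch is limited to queries without grouping.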