[
https://issues.apache.org/jira/browse/SPARK-56484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ke Jia updated SPARK-56484:
---------------------------
Summary: Filter's sizeInBytes estimation should not inflate over child's
sizeInBytes when CBO is on. (was: Filter IS NOT NULL expression should not
increase sizeInBytes over child when cbo is on.)
> Filter's sizeInBytes estimation should not inflate over child's sizeInBytes
> when CBO is on.
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-56484
> URL: https://issues.apache.org/jira/browse/SPARK-56484
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Ke Jia
> Priority: Major
>
> We observed a significant discrepancy in the logical plan's statistics
> estimation at the Filter node when running Q23a and Q23b in 10TB TPC-DS . For
> the customer table, the RelationV2 scan correctly identifies a sizeInBytes of
> 248.0 MiB based on actual metadata. However, after applying the Filter
> isnotnull(c_customer_sk) operator, the CBO inflates the estimated size to
> 743.9 MiB. Even though the rowCount remains unchanged , the heuristic
> recalculation of sizeInBytes triples the value. This data inflation after a
> filter causes the planner to exceed the 250 MiB threshold, incorrectly
> disabling the Broadcast Hash Join.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]