[ 
https://issues.apache.org/jira/browse/SPARK-56484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-56484:
---------------------------
    External issue URL:   (was: https://github.com/apache/spark/pull/55349)

> Filter's sizeInBytes estimation should not inflate over child's sizeInBytes 
> when CBO is on.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56484
>                 URL: https://issues.apache.org/jira/browse/SPARK-56484
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Ke Jia
>            Priority: Major
>
> We observed a significant discrepancy in the logical plan's statistics 
> estimation at the Filter node when running Q23a and Q23b on the 10 TB TPC-DS 
> benchmark. For the customer table, the RelationV2 scan correctly reports a 
> sizeInBytes of 248.0 MiB based on actual metadata. However, after the Filter 
> isnotnull(c_customer_sk) is applied, the CBO inflates the estimated size to 
> 743.9 MiB. Even though the rowCount remains unchanged, the heuristic 
> recalculation of sizeInBytes triples the value. Because of this inflation, 
> the Filter's estimated size exceeds the 250 MiB broadcast threshold, and the 
> planner incorrectly rules out a Broadcast Hash Join.
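
The arithmetic in the report can be checked with a minimal sketch. This is not Spark code: the function names `capped_filter_size` and `can_broadcast` are hypothetical, and the clamp shown is only one possible way to satisfy the invariant named in the issue title (a Filter's sizeInBytes should not exceed its child's). The three sizes are the figures quoted in the description.

```python
MIB = 1024 * 1024


def capped_filter_size(child_size_bytes: int, estimated_size_bytes: int) -> int:
    """Hypothetical clamp: a filter cannot output more bytes than its child."""
    return min(child_size_bytes, estimated_size_bytes)


def can_broadcast(size_bytes: int, threshold_bytes: int) -> bool:
    """Assumed rule: a plan is broadcastable when its size fits the threshold."""
    return 0 <= size_bytes <= threshold_bytes


# Figures quoted in the issue description.
child_size = int(248.0 * MIB)      # RelationV2 scan of the customer table
estimated_size = int(743.9 * MIB)  # CBO estimate after isnotnull(c_customer_sk)
threshold = 250 * MIB              # broadcast threshold from the report

capped_size = capped_filter_size(child_size, estimated_size)

print(can_broadcast(estimated_size, threshold))  # inflated estimate
print(can_broadcast(capped_size, threshold))     # estimate clamped to the child
```

With the inflated estimate the check fails; with the estimate clamped to the child's 248.0 MiB it passes, which matches the behavior the reporter expects.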



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
