Ke Jia created SPARK-56484:
------------------------------
Summary: Filter IS NOT NULL expression should not increase
sizeInBytes over child when cbo is on.
Key: SPARK-56484
URL: https://issues.apache.org/jira/browse/SPARK-56484
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.1.1
Reporter: Ke Jia
We observed a significant discrepancy in the logical plan's statistics
estimation at the Filter node when running Q23a and Q23b in 10TB TPC-DS . For
the customer table, the RelationV2 scan correctly identifies a sizeInBytes of
248.0 MiB based on actual metadata. However, after applying the Filter
isnotnull(c_customer_sk) operator, the CBO inflates the estimated size to 743.9
MiB. Even though the rowCount remains unchanged , the heuristic recalculation
of sizeInBytes triples the value. This data inflation after a filter causes the
planner to exceed the 250 MiB threshold, incorrectly disabling the Broadcast
Hash Join.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]