Ke Jia created SPARK-56484:
------------------------------

             Summary: Filter IS NOT NULL expression should not increase 
sizeInBytes over child when cbo is on.
                 Key: SPARK-56484
                 URL: https://issues.apache.org/jira/browse/SPARK-56484
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Ke Jia


We observed a significant discrepancy in the logical plan's statistics 
estimation at the Filter node when running Q23a and Q23b of the 10 TB TPC-DS 
benchmark. For the customer table, the RelationV2 scan correctly reports a 
sizeInBytes of 248.0 MiB based on actual metadata. However, after the Filter 
isnotnull(c_customer_sk) operator is applied, the CBO inflates the estimated 
size to 743.9 MiB. Even though the rowCount remains unchanged, the heuristic 
recalculation of sizeInBytes roughly triples the value. This inflation pushes 
the estimate past the 250 MiB threshold, so the planner incorrectly rules out 
the Broadcast Hash Join.
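
The arithmetic behind the report can be sketched as follows. This is a minimal 
illustration in plain Python, not Spark's estimation code: the helper names 
capped_filter_size and can_broadcast are hypothetical, and the 250 MiB limit is 
assumed to be the configured broadcast threshold for this workload (presumably 
spark.sql.autoBroadcastJoinThreshold). It shows that capping the Filter's 
estimate at the child's sizeInBytes would keep the customer table eligible for 
broadcast.

```python
MIB = 1024 * 1024


def capped_filter_size(child_size_bytes: int, estimated_size_bytes: int) -> int:
    """Hypothetical guard: a Filter cannot emit more data than its child,
    so its size estimate is capped at the child's sizeInBytes."""
    return min(child_size_bytes, estimated_size_bytes)


def can_broadcast(size_bytes: int, threshold_bytes: int) -> bool:
    """Simplified eligibility check: the plan size must fit under the threshold."""
    return 0 <= size_bytes <= threshold_bytes


# Figures taken from the report above.
child_size = int(248.0 * MIB)       # RelationV2 scan of the customer table
filter_estimate = int(743.9 * MIB)  # CBO estimate after isnotnull(c_customer_sk)
threshold = 250 * MIB               # assumed broadcast threshold

ratio = filter_estimate / child_size
uncapped_ok = can_broadcast(filter_estimate, threshold)
capped = capped_filter_size(child_size, filter_estimate)
capped_ok = can_broadcast(capped, threshold)

print(f"inflation factor: {ratio:.2f}")
print(f"broadcast with raw estimate: {uncapped_ok}")
print(f"broadcast with capped estimate: {capped_ok}")
```

Taking the minimum of the two values is only one possible remedy. It relies on 
the invariant that a Filter never adds rows or columns, so the child's size is 
always a valid upper bound on the Filter's output.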



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

