[ 
https://issues.apache.org/jira/browse/SPARK-52868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Kupchinskiy updated SPARK-52868:
--------------------------------------
    Description: 
Some data sources like Iceberg provide only partial column statistics. 
Specifically, Iceberg currently [lacks min/max stats 
implementation][[https://github.com/apache/iceberg/issues/11083]] , though it 
does provide distinctCount values.

The current filter estimation logic for EqualTo and InSet predicates contains a 
bug: when min/max stats are unavailable but distinctCount is present (the 
typical Iceberg scenario), it incorrectly estimates the filtered row count as 
zero. This causes two major issues:
 # *Broadcast OOM failures* - Large tables get misclassified as small and 
broadcast to all executors
 # *Suboptimal join ordering* - The join reordering logic makes poor decisions 
based on these incorrect estimates

Instead of assuming zero selectivity when min/max bounds are missing, the 
estimation should remain undefined, allowing the optimizer to fall back to 
safer strategies.

  was:
Some sources, iceberg for instance, provide limited set of column stats. In 
particular, there is still [no min and max stats 
implementation][[https://github.com/apache/iceberg/issues/11083]] 

And, current filter estimation logic for EqualTo and InSet predicate have a bug 
estimating row count based on column stats where min/max is empty but there is 
distinctCount defined (exactly iceberg case) as zero. That might cause OOM due 
to broadcasting poorly estimated tables as well as suboptimal order in join 
chains after reorder. 

 

Some data sources like Iceberg provide only partial column statistics. 
Specifically, Iceberg currently [lacks min/max stats 
implementation][[https://github.com/apache/iceberg/issues/11083]] , though it 
does provide distinctCount values.

The current filter estimation logic for EqualTo and InSet predicates contains a 
bug: when min/max stats are unavailable but distinctCount is present (the 
typical Iceberg scenario), it incorrectly estimates the filtered row count as 
zero. This causes two major issues:
 # *Broadcast OOM failures* - Large tables get misclassified as small and 
broadcast to all executors
 # *Suboptimal join ordering* - The join reordering logic makes poor decisions 
based on these incorrect estimates

Instead of assuming zero selectivity when min/max bounds are missing, the 
estimation should remain undefined, allowing the optimizer to fall back to 
safer strategies.


> CBO: OOM-risky iceberg table stats underestimation for some filters
> -------------------------------------------------------------------
>
>                 Key: SPARK-52868
>                 URL: https://issues.apache.org/jira/browse/SPARK-52868
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.6, 4.0.0
>            Reporter: Artem Kupchinskiy
>            Priority: Major
>              Labels: pull-request-available
>
> Some data sources like Iceberg provide only partial column statistics. 
> Specifically, Iceberg currently [lacks min/max stats 
> implementation][[https://github.com/apache/iceberg/issues/11083]] , though it 
> does provide distinctCount values.
> The current filter estimation logic for EqualTo and InSet predicates contains 
> a bug: when min/max stats are unavailable but distinctCount is present (the 
> typical Iceberg scenario), it incorrectly estimates the filtered row count as 
> zero. This causes two major issues:
>  # *Broadcast OOM failures* - Large tables get misclassified as small and 
> broadcast to all executors
>  # *Suboptimal join ordering* - The join reordering logic makes poor 
> decisions based on these incorrect estimates
> Instead of assuming zero selectivity when min/max bounds are missing, the 
> estimation should remain undefined, allowing the optimizer to fall back to 
> safer strategies.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to