[jira] [Commented] (IMPALA-12451) Cardinality underestimation can hurt bloom filter effectiveness

Kurt Deschler (Jira) Mon, 18 Sep 2023 13:17:06 -0700


    [ https://issues.apache.org/jira/browse/IMPALA-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766573#comment-17766573 ]


Kurt Deschler commented on IMPALA-12451:
----------------------------------------

Perhaps we should consider increasing RUNTIME_FILTER_MIN_SIZE or making that 
sizing more dynamic depending on the size of the query and overall query memory?

> Cardinality underestimation can hurt bloom filter effectiveness
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12451
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12451
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Riza Suminto
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>         Attachments: 53.txt, 79.txt
>
>
> The Impala planner selects the desired bloom filter size by estimating the 
> NDV (number of distinct values) of the values and applying a target FPP 
> (false-positive probability, currently 0.75 by default). Starting from 
> IMPALA-11924, the NDV itself is estimated by taking the minimum of the input 
> cardinality going into the join builder and the column's stats NDV.
> If the planner underestimates the input cardinality, it can select a bloom 
> filter size that is too small for the actual NDV seen at execution, 
> rendering the filter ineffective (it has a high actual false-positive rate). 
> Examples of this can be observed at RF004 of Q53 and RF006 of Q79 from a 
> TPC-DS 3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).
> To be specific:
> ||query||filter||column||stats NDV||est cardinality||selected size||actual cardinality||best min size||
> |Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
> |Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|
> The cardinality underestimation can be attributed to a bad selectivity 
> estimate on the build side of the join node producing those filters. 
> Selecting the correct bloom filter size will require either fixing this 
> selectivity estimate or adding an optimization that also considers the stats 
> NDV when the cardinality estimate appears to be severely underestimated.
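
To make the sizing mechanism in the description concrete, here is a minimal sketch using the textbook bloom filter approximation FPP = (1 - e^(-k*n/m))^k. It is not Impala's actual code: the fixed k = 8, the power-of-two rounding, the 16MB cap, and all function names are assumptions made for illustration. The Q53 inputs come from the table in the description, with 18.53K taken as roughly 18530.

```python
import math

K = 8  # assumed number of hash probes per value (illustrative, not from the issue)


def planned_ndv(est_cardinality, stats_ndv):
    """NDV used for sizing: the smaller of the build-side cardinality
    estimate and the column's stats NDV, as the description states."""
    return min(est_cardinality, stats_ndv)


def filter_bytes(ndv, target_fpp, min_bytes=8 * 1024, max_bytes=16 * 1024 * 1024):
    """Smallest power-of-two size (in bytes) that meets target_fpp for ndv
    values under the textbook model, clamped to [min_bytes, max_bytes].
    min_bytes plays the role of RUNTIME_FILTER_MIN_SIZE; max_bytes is a
    hypothetical cap."""
    if ndv <= 0:
        return min_bytes
    # Solve (1 - exp(-K*n/m))^K = fpp for m, the number of bits.
    bits = -K * ndv / math.log(1.0 - target_fpp ** (1.0 / K))
    needed = max(1.0, bits / 8.0)
    size = 1 << math.ceil(math.log2(needed))
    return max(min_bytes, min(size, max_bytes))


def expected_fpp(actual_ndv, size_bytes):
    """Modeled false-positive probability once actual_ndv distinct values
    have been inserted into a filter of size_bytes."""
    bits = size_bytes * 8
    return (1.0 - math.exp(-K * actual_ndv / bits)) ** K


if __name__ == "__main__":
    # Q53 / RF004 inputs from the table in the description.
    ndv = planned_ndv(est_cardinality=51, stats_ndv=185571)
    size = filter_bytes(ndv, target_fpp=0.75)
    print("planned NDV:", ndv)
    print("selected size (bytes):", size)
    print("modeled FPP at planned NDV:", expected_fpp(ndv, size))
    print("modeled FPP at actual NDV :", expected_fpp(18530, size))
```

In this model, the filter sized for 51 values is clamped up to the 8KB minimum, and its modeled false-positive probability rises sharply once about 18.5K distinct values are inserted instead. The model is only a rough guide and does not reproduce the "best min size" column, which comes from the report rather than from this formula.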



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
