[ https://issues.apache.org/jira/browse/IMPALA-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766573#comment-17766573 ]
Kurt Deschler commented on IMPALA-12451:
----------------------------------------

Perhaps we should consider increasing RUNTIME_FILTER_MIN_SIZE, or making that sizing more dynamic depending on the size of the query and the overall query memory?

> Cardinality underestimation can hurt bloom filter effectiveness
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12451
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12451
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Riza Suminto
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>         Attachments: 53.txt, 79.txt
>
>
> The Impala planner selects the desired bloom filter size from the estimated
> NDV of the values and a target FPP (currently 0.75 by default). Since
> IMPALA-11924, the NDV itself is estimated as the minimum of the input
> cardinality going to the join builder and the column's stats NDV.
> If the planner underestimates the input cardinality, it can select a bloom
> filter size that is too small for the actual NDV of the rows seen during
> execution, rendering the filter ineffective (its actual false-positive rate
> is high).
> Examples of this can be observed at RF004 of Q53 and RF006 of Q79 from a
> TPC-DS 3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).
> To be specific:
> ||query||filter||column||stats NDV||est cardinality||selected size||actual cardinality||best min size||
> |Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
> |Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|
> The cardinality underestimation can be attributed to a bad selectivity
> estimate on the build-hand side of the join node that produces these
> filters. Selecting the correct bloom filter size will require either fixing
> this selectivity estimate or adding an optimization that also considers the
> stats NDV when the cardinality estimate appears to be severely
> underestimated.
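
To make the sizing mechanism described in the issue concrete, here is a minimal Python sketch. It is not Impala's planner code. It assumes the textbook bloom filter formula with a fixed number of probes per key (K = 8 is an assumption), a power-of-two filter size, and a hypothetical max_bytes cap. The function names and all numbers in the usage are made up for illustration and are not taken from the attached profiles.

```python
import math

K = 8  # assumed number of probes per key (illustrative, not confirmed by the issue)


def planner_ndv(stats_ndv, est_cardinality):
    """NDV estimate as described in the issue: min of stats NDV and input cardinality."""
    return min(stats_ndv, est_cardinality)


def min_filter_bytes(ndv, fpp, min_bytes=8 * 1024, max_bytes=16 * 1024 * 1024):
    """Smallest power-of-two size in bytes that meets fpp for ndv keys.

    The result is clamped to [min_bytes, max_bytes]. min_bytes plays the
    role of RUNTIME_FILTER_MIN_SIZE; max_bytes is a hypothetical cap.
    """
    if ndv <= 0:
        return min_bytes
    # Bits needed by the textbook formula for K probes per key.
    bits = -K * ndv / math.log(1.0 - fpp ** (1.0 / K))
    log_bytes = max(0, math.ceil(math.log2(bits / 8)))
    return min(max(2 ** log_bytes, min_bytes), max_bytes)


def expected_fpp(ndv, size_bytes):
    """Expected false-positive probability once ndv keys have been inserted."""
    return (1.0 - math.exp(-K * ndv / (size_bytes * 8))) ** K


# Made-up scenario: the planner expects 1,000 build rows, execution sees 1,000,000.
est_ndv = planner_ndv(stats_ndv=2_000_000, est_cardinality=1_000)
selected = min_filter_bytes(est_ndv, fpp=0.75)
print("selected size:", selected, "bytes")
print("FPP at actual NDV:", expected_fpp(1_000_000, selected))
print("size needed for actual NDV:", min_filter_bytes(1_000_000, fpp=0.75), "bytes")
```

Under these assumptions the underestimated cardinality pins the filter at the minimum size, and the expected false-positive rate at the actual NDV approaches 1. Both remedies mentioned above, raising the minimum size or falling back to the stats NDV, address this by giving the filter more bits per key.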