[jira] [Commented] (IMPALA-12357) Skip scheduling runtime filter from PK-FK join with full build scan

ASF subversion and git services (Jira) Tue, 12 Sep 2023 00:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764046#comment-17764046
 ]


ASF subversion and git services commented on IMPALA-12357:
----------------------------------------------------------

Commit bd2df11709bf1b048e889be058fa758b51b97e76 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bd2df1170 ]

IMPALA-12357: Skip scheduling bloom filter from full-build scan

A PK-FK join between a dimension table and a fact table is a common
occurrence in queries. Such a join often does not involve any predicate
filter on the dimension table. Thus, a bloom filter built from this kind
of dimension table scan (PK) will most likely contain all values of
the fact table column (FK). It is ineffective to generate this filter
because it is unlikely to reject any rows, especially if the bloom
filter is large and has a high false positive probability (fpp)
estimate.

This patch skips scheduling a bloom filter from a join node that has
all of these characteristics:

1. The build side is a full table scan (has hard estimates).
2. The build scan has no predicate filter and does not consume any
   runtime filter.
3. The join node is assumed to have a PK-FK relationship.
4. The planned bloom filter has a resulting fpp estimate higher than
   the max_filter_error_rate_from_full_scan flag (defaults to 0.9).
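
The four conditions above can be read as a single conjunction. The
following is a minimal Python sketch of that check, not the planner
code from the patch: the BloomFilterCandidate class, its field names,
and the should_skip function are hypothetical names chosen for
illustration. Only the flag name and its 0.9 default come from the
commit message.

```python
from dataclasses import dataclass


@dataclass
class BloomFilterCandidate:
    """Hypothetical summary of a planned bloom filter (illustration only)."""
    build_is_full_scan: bool             # criterion 1
    build_has_predicates: bool           # criterion 2, first half
    build_consumes_runtime_filter: bool  # criterion 2, second half
    join_assumed_pk_fk: bool             # criterion 3
    estimated_fpp: float                 # criterion 4 input


def should_skip(candidate: BloomFilterCandidate,
                max_filter_error_rate_from_full_scan: float = 0.9) -> bool:
    """Return True only when all four criteria hold."""
    return (candidate.build_is_full_scan
            and not candidate.build_has_predicates
            and not candidate.build_consumes_runtime_filter
            and candidate.join_assumed_pk_fk
            and candidate.estimated_fpp > max_filter_error_rate_from_full_scan)
```

Because the conditions are combined with "and", a filter whose fpp
estimate is at or below the threshold is still scheduled even when the
first three criteria hold.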

The fourth criterion is an additional control that eliminates filters
based on an fpp threshold, because a low-fpp filter is sometimes still
effective in eliminating rows (i.e., rows with NULL values). Non-bloom
filters remain unchanged as they are relatively lighter to build and
evaluate than bloom filters.
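
To give a feel for when an fpp estimate crosses a threshold such as
0.9, here is a rough sketch using the textbook bloom filter
approximation fpp = (1 - exp(-k*n/m))^k, where n is the number of
distinct build-side values, m is the filter size in bits, and k is the
number of hash functions. The function name estimate_fpp and the
choice of k = 8 are assumptions made for illustration; the estimate
used by Impala's planner may differ.

```python
import math


def estimate_fpp(ndv: int, filter_size_bytes: int, num_hashes: int = 8) -> float:
    """Textbook bloom filter false positive estimate (illustrative only).

    ndv: number of distinct values inserted into the filter.
    filter_size_bytes: filter size in bytes.
    num_hashes: number of hash functions (k = 8 is an assumption here).
    """
    bits = filter_size_bytes * 8
    return (1.0 - math.exp(-num_hashes * ndv / bits)) ** num_hashes


# A 2 MiB filter (2097152 bytes) with one million vs. one hundred million
# distinct build-side values.
small_build = estimate_fpp(1_000_000, 2_097_152)
large_build = estimate_fpp(100_000_000, 2_097_152)
```

Under this approximation, the one-million-value build stays far below
a 0.9 threshold, while the hundred-million-value build saturates the
filter and its estimate approaches 1.0.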

Testing:
- Add a test case to testBloomFilterAssignment
- Pass core tests
- Ran TPC-DS 3TB with the following query options:
  * RUNTIME_FILTER_MIN_SIZE=8192
  * RUNTIME_FILTER_MAX_SIZE=2097152
  * MAX_NUM_RUNTIME_FILTERS=50
  * RUNTIME_FILTER_WAIT_TIME_MS=10000
  19 out of 103 queries show a reduction in the number of runtime bloom
  filters without any notable performance regression.

Change-Id: I494533bc06da84e606cbd1ae1619083333089a5e
Reviewed-on: http://gerrit.cloudera.org:8080/20366
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Skip scheduling runtime filter from PK-FK join with full build scan
> -------------------------------------------------------------------
>
>                 Key: IMPALA-12357
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12357
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Riza Suminto
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>         Attachments: Screen Shot 2023-08-04 at 3.13.56 PM.png
>
>
> A PK-FK inner join between a dimension table and a fact table is a common
> occurrence in a query. Often, such a join does not involve any
> predicate filter on the dimension table. Thus, runtime filter values coming
> from this kind of dimension table scan (PK) likely include all values
> of the fact table column (FK). It is ineffective to generate this filter
> because it is unlikely to reject any rows.
> The attached screenshot shows a visualization of RF 50, 52, 60, and 62
> targeting 49:SCAN from TPC-DS Q64. These runtime filters come from full
> dimension table scans on PK-FK joins. In theory, these filters should not
> reject any probe rows. The query profile, however, shows that these filters
> can still reject some probe rows with NULL values in their target column.
> Unfortunately, due to the low number of NULL vs. non-NULL values, all of
> those filters still ended up disabled by scanners because 49:SCAN deemed
> them ineffective.
> We can skip generating runtime filters that match all of these criteria:
>  # The build side is a full table scan
>  # No runtime filter targets the build scan
>  # There is a PK-FK constraint between the runtime filter origin column in
> the build side and the target column in the probe side.
> If the PK-FK constraint is not declared in the table schema, which happens
> most of the time, criterion 3 can be replaced by checking the runtime
> filter’s false positive probability (eliminating filters with a high false
> positive probability).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
