Quanlong Huang created IMPALA-13193:
---------------------------------------

             Summary: RuntimeFilter on parquet dictionary should evaluate null values
                 Key: IMPALA-13193
                 URL: https://issues.apache.org/jira/browse/IMPALA-13193
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Quanlong Huang


IMPALA-10910 and IMPALA-5509 introduced an optimization that evaluates runtime 
filters on parquet dictionary values. If none of the values pass the check, the 
whole row group is skipped. However, NULL values are not included in the 
parquet dictionary. Runtime filters that accept NULL values might therefore 
incorrectly reject a row group that contains NULLs when none of its dictionary 
values pass the check.
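
The faulty decision can be illustrated with a minimal Python sketch. This is not Impala's backend code: the function names, the has_nulls flag and the set-based filter are all hypothetical stand-ins chosen for illustration (real runtime filters are not exact sets), and the "fixed" variant only shows one possible way to account for NULLs.

```python
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical stand-in for a runtime filter: True means "value may match".
RuntimeFilter = Callable[[Optional[str]], bool]


def coalesce_filter(build_values: Iterable[Optional[str]]) -> RuntimeFilter:
    """Filter for the join key COALESCE(name, ''), built from the build side."""
    build_keys = {"" if v is None else v for v in build_values}
    return lambda v: ("" if v is None else v) in build_keys


def can_skip_row_group_buggy(dictionary: Sequence[str],
                             rf: RuntimeFilter) -> bool:
    # Only dictionary entries are probed. NULLs are never stored in the
    # dictionary, so a filter that accepts NULL is never asked about NULL.
    return not any(rf(v) for v in dictionary)


def can_skip_row_group_fixed(dictionary: Sequence[str],
                             has_nulls: bool,
                             rf: RuntimeFilter) -> bool:
    # Assumption: the reader can tell whether the column chunk holds NULLs.
    # If it does and the filter accepts NULL, the row group must be kept.
    if has_nulls and rf(None):
        return False
    return not any(rf(v) for v in dictionary)


# Mirrors the repro: dim_tbl holds a single NULL, the row group's
# dictionary holds only "abc", and two rows in the row group are NULL.
rf = coalesce_filter([None])
dictionary = ["abc"]

print(can_skip_row_group_buggy(dictionary, rf))        # True: rows wrongly dropped
print(can_skip_row_group_fixed(dictionary, True, rf))  # False: row group is kept
```

In the sketch the buggy check skips the row group because "abc" fails the filter, even though the NULL rows would have passed it.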

Here are steps to reproduce the bug:
{code:sql}
create table parq_tbl (id bigint, name string) stored as parquet;
insert into parq_tbl values (0, "abc"), (1, NULL), (2, NULL), (3, "abc");

create table dim_tbl (name string);
insert into dim_tbl values (NULL);

select * from parq_tbl p join dim_tbl d
  on COALESCE(p.name, '') = COALESCE(d.name, '');{code}
The SELECT query should return 2 rows (ids 1 and 2, whose NULL names coalesce to '' and match the NULL in dim_tbl), but it currently returns 0 rows.

A workaround is to disable this optimization:
{code:sql}
set PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT=0;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
