Thanks Lijie for starting this discussion. Excited to see runtime filter is to 
be implemented in Flink.
I have few questions about it:

1: As the FLIP said, `if the ndv cannot be estimated, use row count instead`. 
So, does row count comes from the statistic from underlying table? What if the 
the statistic is also unavailable considering users maynot always remember to 
generate statistic in production.
I'm wondering whether it make senese that just disable runtime filter if 
statistic is unavailable since in that case, we can hardly evaluate the 
benefits of runtime-filter.
 

2: The FLIP said: "We will inject the runtime filters only if the following 
requirements are met:xxx", but it also said, "Once this limit is exceeded, it 
will output a fake filter(which always returns true)" in 
`RuntimeFilterBuilderOperator` part; Seems they are contradictory, so i'm 
wondering what's the real behavior, no filter will be injected or fake filter?


3: Does it also mean runtime-filter can only take effect in blocking shuffle?



Best regards,
Yuxia

----- 原始邮件 -----
发件人: "ron9 liu" <ron9....@gmail.com>
收件人: "dev" <dev@flink.apache.org>
发送时间: 星期三, 2023年 6 月 14日 下午 5:29:28
主题: Re: [DISCUSS] FLIP-324: Introduce Runtime Filter for Flink Batch Jobs

Thanks Lijie start this discussion. Runtime Filter is a common optimization
to improve the join performance that has been adopted by many computing
engines such as Spark, Doris, etc... Flink is a streaming batch computing
engine, and we are continuously optimizing the performance of batches.
Runtime filter is a general performance optimization technique that can
improve the performance of Flink batch jobs, so we are introducing it on
batch as well.

Looking forward to all feedback.

Best,
Ron

Lijie Wang <wangdachui9...@gmail.com> 于2023年6月14日周三 17:17写道:

> Hi devs
>
> Ron Liu, Gen Luo and I would like to start a discussion about FLIP-324:
> Introduce Runtime Filter for Flink Batch Jobs[1]
>
> Runtime Filter is a common optimization to improve join performance. It is
> designed to dynamically generate filter conditions for certain Join queries
> at runtime to reduce the amount of scanned or shuffled data, avoid
> unnecessary I/O and network transmission, and speed up the query. Its
> working principle is building a filter(e.g. bloom filter) based on the data
> on the small table side(build side) first, then pass this filter to the
> large table side(probe side) to filter the irrelevant data on it, this can
> reduce the data reaching the join and improve performance.
>
> You can find more details in the FLIP-324[1]. Looking forward to your
> feedback.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-324%3A+Introduce+Runtime+Filter+for+Flink+Batch+Jobs
>
> Best,
> Ron & Gen & Lijie
>

Reply via email to