Re: [DISCUSS] FLIP-434: Support optimizations for pre-partitioned data sources

Jim Hughes Mon, 11 Mar 2024 20:11:41 -0700

Hi Jeyhun,

I like the idea!  Given FLIP-376[1], I wonder if it'd make sense to
generalize FLIP-434 to be about "pre-divided" data to cover "buckets" and
"partitions" (and maybe even situations where a data source is partitioned
and bucketed).

Separate from that, the page mentions TPC-H Q1 as an example.  For a join,
any two tables joined on the same bucket key should provide a concrete
example of a join.  Systems like Kafka Streams/ksqlDB call this
"co-partitioning"; for those systems, it is a requirement placed on the
input sources.  For Flink, with FLIP-434, the proposed planner rule
could remove the shuffle.

Definitely a fun idea; I look forward to hearing more!

Cheers,

Jim

1.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause
2.
https://docs.ksqldb.io/en/latest/developer-guide/joins/partition-data/#co-partitioning-requirements

On Sun, Mar 10, 2024 at 3:38 PM Jeyhun Karimov <je.kari...@gmail.com> wrote:

> Hi devs,
>
> I’d like to start a discussion on FLIP-434: Support optimizations for
> pre-partitioned data sources [1].
>
> The FLIP introduces taking advantage of pre-partitioned data sources for
> SQL/Table API (it is already supported as experimental feature in
> DataStream API [2]).
>
>
> Please find more details in the FLIP wiki document [1].
> Looking forward to your feedback.
>
> Regards,
> Jeyhun
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-434%3A+Support+optimizations+for+pre-partitioned+data+sources
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
>

Re: [DISCUSS] FLIP-434: Support optimizations for pre-partitioned data sources

Reply via email to