Hi Jeyhun, I like the idea! Given FLIP-376[1], I wonder if it'd make sense to generalize FLIP-434 to be about "pre-divided" data to cover "buckets" and "partitions" (and maybe even situations where a data source is partitioned and bucketed).
Separate from that, the page mentions TPC-H Q1 as an example. For a join, any two tables joined on the same bucket key should provide a concrete example of a join. Systems like Kafka Streams/ksqlDB call this "co-partitioning"; for those systems, it is a requirement placed on the input sources. For Flink, with FLIP-434, the proposed planner rule could remove the shuffle. Definitely a fun idea; I look forward to hearing more! Cheers, Jim 1. https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause 2. https://docs.ksqldb.io/en/latest/developer-guide/joins/partition-data/#co-partitioning-requirements On Sun, Mar 10, 2024 at 3:38 PM Jeyhun Karimov <je.kari...@gmail.com> wrote: > Hi devs, > > I’d like to start a discussion on FLIP-434: Support optimizations for > pre-partitioned data sources [1]. > > The FLIP introduces taking advantage of pre-partitioned data sources for > SQL/Table API (it is already supported as experimental feature in > DataStream API [2]). > > > Please find more details in the FLIP wiki document [1]. > Looking forward to your feedback. > > Regards, > Jeyhun > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-434%3A+Support+optimizations+for+pre-partitioned+data+sources > [2] > > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/ >