Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-03 Thread Jing Ge
Hi Timo, Fair enough, thanks for the clarification! Best regards, Jing On Fri, Nov 3, 2023 at 8:16 AM Timo Walther wrote: > If there are no objections, I would start the vote on Monday. > > Thanks for the feedback everyone! > > Regards, > Timo > > > On 02.11.23 13:49, Martijn Visser

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-03 Thread Timo Walther
If there are no objections, I would start the vote on Monday. Thanks for the feedback everyone! Regards, Timo On 02.11.23 13:49, Martijn Visser wrote: Hi all, From a user point of view, I think it makes sense to go for DISTRIBUTED BY with how Timo explained it. +1 for his proposal

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-02 Thread Martijn Visser
Hi all, From a user point of view, I think it makes sense to go for DISTRIBUTED BY with how Timo explained it. +1 for his proposal Best regards, Martijn On Thu, Nov 2, 2023 at 11:00 AM Timo Walther wrote: > > Hi Jing, > > I agree this is confusing. The Spark API calls it bucketBy in the >

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-02 Thread Timo Walther
Hi Jing, I agree this is confusing. The Spark API calls it bucketBy in the programmatic API. But anyway, we should discuss the SQL semantics here. It's like a "WHERE" is called "filter" in the programmatic world. Or a "SELECT" is called "projection" in code. And looking at all the Hive

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-01 Thread Jing Ge
Hi Timo, Gotcha, let's use passive verbs. I am actually thinking about "BUCKETED BY 6" or "BUCKETED INTO 6". Not really used in SQL, but afaiu Spark uses the concept[1]. [1] https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameWriter.bucketBy.html Best regards,

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-31 Thread Timo Walther
Hi Jark, here are the checks I had in mind so far. But we can also discuss this during the implementation in the PRs. Most of the tasks are very similar to PARTITIONED BY which is also a characteristic of a sink. 1) Check that DISTRIBUTED BY columns reference physical columns and at least

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-31 Thread Jark Wu
Hi Timo, Thank you for the update. The FLIP looks good to me now. I only have one more question. What does Flink check and throw exceptions for the bucketing? For example, do we check interfaces when executing create/alter DDL and when used as a source? Best, Jark On Tue, 31 Oct 2023 at 00:25,

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Timo Walther
Hi Jing, > Have you considered using BUCKET BY directly? Which vendor uses this syntax? Most vendors that I checked call this concept "distribution". In any case, the "BY" is optional, so certain DDL statements would declare it like "BUCKET INTO 6 BUCKETS"? And following the PARTITIONED,

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Jing Ge
Hi Timo, The FLIP looks great! Thanks for bringing it to our attention! In order to make sure we are on the same page, I would ask some questions: 1. DISTRIBUTED BY reminds me of DISTRIBUTE BY from Hive, as Benchao mentioned, which is used to distribute rows among reducers, i.e. focusing on the

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Timo Walther
Let me reply to the feedback from Yunfan: > Distribute by in DML is also supported by Hive I see DISTRIBUTED BY and DISTRIBUTE BY as two separate discussions. This discussion is about DDL. For DDL, we have more freedom as every vendor has custom syntax for CREATE TABLE clauses. Furthermore,

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Timo Walther
Hi Yunfan and Benchao, it seems the FLIP discussion thread got split into two parts. At least this is what I see in my mail program. I would kindly ask to answer in the other thread [1]. I will also reply there now to maintain the discussion link. Regards, Timo [1]

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Timo Walther
Hi Jark, my intention was to avoid too complex syntax in the first version. In the past years, we could enable use cases also without this clause, so we should be careful with overloading it with too much functionality in the first version. We can still iterate on it later, the interfaces are

Re: Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-28 Thread Benchao Li
Thanks Timo for preparing the FLIP. Regarding "By default, DISTRIBUTED BY assumes a list of columns for an implicit hash partitioning." Do you think it's useful to add some extensibility for the hash strategy? One scenario I can foresee is if we write bucketed data into Hive, and if Flink's hash
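To illustrate the concern about the hash strategy, here is a minimal Python sketch of implicit hash bucketing. It is not Flink or Hive code. The function names, the column encoding, and the two hash functions (`zlib.crc32` and `zlib.adler32`) are assumptions chosen only for illustration and are not the hash functions either system uses. The point it makes is that the bucket a row lands in depends on the hash function, so a writer and a reader of bucketed data have to agree on it.

```python
import zlib

NUM_BUCKETS = 6  # hypothetical bucket count, as in "INTO 6 BUCKETS"


def bucket_of(key_columns, num_buckets, hash_fn):
    """Map a tuple of key column values to a bucket index in [0, num_buckets)."""
    # Illustrative encoding only: join the column values with a separator byte.
    encoded = "\x1f".join(str(v) for v in key_columns).encode("utf-8")
    return hash_fn(encoded) % num_buckets


def crc32_hash(data):
    # Stand-in for "the writer's hash function" (assumption).
    return zlib.crc32(data)


def adler32_hash(data):
    # Stand-in for "the reader's hash function" (assumption).
    return zlib.adler32(data)


rows = [("alice", 1), ("bob", 2), ("carol", 3), ("dave", 4)]

# Bucket each row by its first column under both hash functions.
crc_assignment = {row: bucket_of((row[0],), NUM_BUCKETS, crc32_hash) for row in rows}
adler_assignment = {row: bucket_of((row[0],), NUM_BUCKETS, adler32_hash) for row in rows}

for row in rows:
    print(row, "crc32 bucket:", crc_assignment[row], "adler32 bucket:", adler_assignment[row])
```

Each function is deterministic on its own, but nothing guarantees that the two assignments match for a given row, which is why a pluggable or explicitly declared hash strategy might matter when another system reads the buckets.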

RE: Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-27 Thread yunfan zhang
Distribute by in DML is also supported by Hive. And it is also useful for Flink. Users can use this ability to increase cache hit rate in lookup join. And users can use "distribute by key, rand(1, 10)" to avoid data skew problem. And I think it is another way to solve FLIP-204 [1]. There is
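The "distribute by key, rand(1, 10)" idea can be sketched in Python as key salting. This is only an illustration under assumptions: the bucket count, the `zlib.crc32`-based hash, the helper names, and the reading of `rand(1, 10)` as a uniform integer from 1 to 10 are all hypothetical and do not describe Hive's or Flink's actual behavior.

```python
import random
import zlib
from collections import Counter

NUM_BUCKETS = 6  # hypothetical number of parallel tasks


def bucket(parts, num_buckets=NUM_BUCKETS):
    """Hash a tuple of values to a bucket index (illustrative hash only)."""
    encoded = "\x1f".join(str(p) for p in parts).encode("utf-8")
    return zlib.crc32(encoded) % num_buckets


def distribute(keys, salted, seed=0):
    """Count rows per bucket, optionally adding a random salt to each key."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    counts = Counter()
    for key in keys:
        if salted:
            # Analogue of "distribute by key, rand(1, 10)" (assumed semantics).
            counts[bucket((key, rng.randint(1, 10)))] += 1
        else:
            # Analogue of "distribute by key": one key always hits one bucket.
            counts[bucket((key,))] += 1
    return counts


# A skewed input: one hot key dominates.
keys = ["hot"] * 1000 + ["cold"] * 10

unsalted = distribute(keys, salted=False)
salted = distribute(keys, salted=True)

print("unsalted:", dict(unsalted))
print("salted:  ", dict(salted))
```

Without the salt, every row of the hot key lands in a single bucket. With the salt, the same key can be spread over up to ten (key, salt) combinations, at the cost that rows of one key are no longer co-located.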

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-27 Thread Jark Wu
Hi Timo, Thanks for starting this discussion. I really like it! The FLIP is already in good shape, I only have some minor comments. 1. Could we also support HASH and RANGE distribution kind on the DDL syntax? I noticed that HASH and UNKNOWN are introduced in the Java API, but not in the syntax.

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-26 Thread Jingsong Li
Many thanks, Timo, for starting this discussion. Big +1 for this. The design looks good to me! We can add some documentation for connector developers. For example: for a sink, if a keyBy is needed, please perform the keyBy in the connector itself. SupportsBucketing is just a marker interface.

[DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-26 Thread Timo Walther
Hi everyone, I would like to start a discussion on FLIP-376: Add DISTRIBUTED BY clause [1]. Many SQL vendors expose the concepts of Partitioning, Bucketing, and Clustering. This FLIP continues the work of previous FLIPs and would like to introduce the concept of "Bucketing" to Flink. This