Hi Timo,
Fair enough, thanks for the clarification!
Best regards,
Jing
On Fri, Nov 3, 2023 at 8:16 AM Timo Walther wrote:
If there are no objections, I would start with a voting on Monday.
Thanks for the feedback everyone!
Regards,
Timo
On 02.11.23 13:49, Martijn Visser wrote:
Hi all,
From a user point of view, I think it makes sense to go for
DISTRIBUTED BY with how Timo explained it. +1 for his proposal
Best regards,
Martijn
On Thu, Nov 2, 2023 at 11:00 AM Timo Walther wrote:
Hi Jing,
I agree this is confusing. The Spark API calls it bucketBy in the
programmatic API. But anyway, we should discuss the SQL semantics here.
It's similar to how a "WHERE" is called "filter" in the programmatic
world, or a "SELECT" is called "projection" in code.
And looking at all the Hive
Hi Timo,
Gotcha, let's use passive verbs. I am actually thinking about "BUCKETED BY
6" or "BUCKETED INTO 6".
It's not really used in SQL, but afaiu Spark uses the concept [1].
[1]
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameWriter.bucketBy.html
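For context, the core of the bucketing concept referenced here can be sketched in a few lines of Python. This is a hand-rolled illustration of hash bucketing, not Flink's or Spark's actual implementation; the column name and bucket count are made up:

```python
NUM_BUCKETS = 6

def bucket_for(key):
    """Assign a value to one of NUM_BUCKETS buckets by hashing it."""
    return hash(key) % NUM_BUCKETS

# Rows with equal keys always land in the same bucket,
# which is what enables co-located reads and joins later.
rows = [{"user_id": u} for u in ("alice", "bob", "carol", "dave")]
buckets = {}
for row in rows:
    buckets.setdefault(bucket_for(row["user_id"]), []).append(row)
```

Whatever the final keyword ends up being, the semantics are just this: a stable hash of the declared columns modulo a fixed bucket count.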
Best regards,
Hi Jark,
here are the checks I had in mind so far. But we can also discuss this
during the implementation in the PRs. Most of the tasks are very similar
to PARTITIONED BY which is also a characteristic of a sink.
1) Check that DISTRIBUTED BY columns reference physical columns and at
least
Hi Timo,
Thank you for the update. The FLIP looks good to me now.
I only have one more question.
What does Flink check, and for what does it throw exceptions, regarding bucketing?
For example, do we check the interfaces when executing CREATE/ALTER
DDL and when the table is used as a source?
Best,
Jark
On Tue, 31 Oct 2023 at 00:25,
Hi Jing,
> Have you considered using BUCKET BY directly?
Which vendor uses this syntax? Most vendors that I checked call this
concept "distribution".
In any case, the "BY" is optional, so certain DDL statements would
declare it like "BUCKET INTO 6 BUCKETS"? And following the PARTITIONED,
Hi Timo,
The FLIP looks great! Thanks for bringing it to our attention! In order to
make sure we are on the same page, I would ask some questions:
1. DISTRIBUTED BY reminds me of DISTRIBUTE BY from Hive, like Benchao mentioned,
which is used to distribute rows among reducers, i.e. focusing on the
Let me reply to the feedback from Yunfan:
> Distribute by in DML is also supported by Hive
I see DISTRIBUTED BY and DISTRIBUTE BY as two separate discussions. This
discussion is about DDL. For DDL, we have more freedom as every vendor
has custom syntax for CREATE TABLE clauses. Furthermore,
Hi Yunfan and Benchao,
it seems the FLIP discussion thread got split into two parts. At least
this is what I see in my mail program. I would kindly ask to answer in
the other thread [1].
I will also reply there now to maintain the discussion link.
Regards,
Timo
[1]
Hi Jark,
my intention was to avoid overly complex syntax in the first version. In
the past years, we could enable use cases also without this clause, so
we should be careful with overloading it with too much functionality in
the first version. We can still iterate on it later, the interfaces are
Thanks Timo for preparing the FLIP.
Regarding "By default, DISTRIBUTED BY assumes a list of columns for an
implicit hash partitioning."
Do you think it's useful to add some extensibility for the hash
strategy. One scenario I can foresee is if we write bucketed data into
Hive, and if Flink's hash
Distribute by in DML is also supported by Hive,
and it is also useful for Flink.
Users can use this ability to increase the cache hit rate in lookup joins.
And users can use "distribute by key, rand(1, 10)" to avoid data skew problems.
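The skew-mitigation trick mentioned here (distributing by the key plus a random salt) can be sketched as follows. This is a hedged illustration of key salting; the key name, salt range, and partition count are invented:

```python
import random

NUM_SALTS = 10

def salted_partition(key, num_partitions):
    """Spread a hot key across up to NUM_SALTS partitions via a random salt."""
    salt = random.randint(1, NUM_SALTS)  # mirrors rand(1, 10) in the SQL above
    return hash((key, salt)) % num_partitions

# A single hot key no longer maps to a single partition:
parts = {salted_partition("hot_key", 100) for _ in range(1000)}
```

The trade-off is the usual one for salting: downstream consumers that need all rows for a key must read from every salted partition.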
And I think it is another way to solve FLIP-204 [1]
There is
Hi Timo,
Thanks for starting this discussion. I really like it!
The FLIP is already in good shape, I only have some minor comments.
1. Could we also support HASH and RANGE distribution kind on the DDL
syntax?
I noticed that HASH and UNKNOWN are introduced in the Java API, but not in
the syntax.
Many thanks Timo for starting this discussion.
Big +1 for this.
The design looks good to me!
We can add some documentation for connector developers. For example:
for sinks, if a keyBy is needed, the connector should perform the keyBy
itself. SupportsBucketing is just a marker interface.
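The marker-interface idea mentioned above can be illustrated in Python. This is a rough sketch with invented names; Flink's real SupportsBucketing is a Java interface on the sink ability stack:

```python
class SupportsBucketing:
    """Marker base class: declares a capability, carries no methods."""
    pass

class MyBucketedSink(SupportsBucketing):
    def write(self, row):
        # The sink itself is responsible for any keyBy/shuffle it needs;
        # the planner only checks for the marker.
        pass

def sink_supports_bucketing(sink):
    # The planner-side check: does this sink declare bucketing support?
    return isinstance(sink, SupportsBucketing)
```

The point of a marker interface is exactly this: the framework can validate a DDL clause against a capability check without prescribing how the connector implements it.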
Hi everyone,
I would like to start a discussion on FLIP-376: Add DISTRIBUTED BY
clause [1].
Many SQL vendors expose the concepts of Partitioning, Bucketing, and
Clustering. This FLIP continues the work of previous FLIPs and would
like to introduce the concept of "Bucketing" to Flink.
This