Re: [DISCUSS] FLIP-381: Deprecate configuration getters/setters that return/set complex Java objects

2023-11-01 Thread Rui Fan
Thanks Junrui for driving this proposal!

ConfigOption is easy for Flink users to use, easy for Flink platform
maintainers to manage options with, and easy for Flink developers and the
Flink community to maintain.

So big +1 for this proposal!

Best,
Rui

On Thu, Nov 2, 2023 at 10:10 AM Junrui Lee  wrote:

> Hi devs,
>
> I would like to start a discussion on FLIP-381: Deprecate configuration
> getters/setters that return/set complex Java objects[1].
>
> Currently, the job configuration in Flink is spread out across different
> components, which leads to inconsistencies and confusion. To address this
> issue, it is necessary to migrate non-ConfigOption complex Java objects to
> use ConfigOption and adopt a single Configuration object to host all the
> configuration.
> However, there is a significant blocker in implementing this solution.
> These complex Java objects in StreamExecutionEnvironment, CheckpointConfig,
> and ExecutionConfig have already been exposed through the public API,
> making it challenging to modify the existing implementation.
>
> Therefore, I propose to deprecate these Java objects and their
> corresponding getter/setter interfaces, ultimately removing them in
> Flink 2.0.
>
> Your feedback and thoughts on this proposal are highly appreciated.
>
> Best regards,
> Junrui Lee
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278464992
>


[DISCUSS] FLIP-381: Deprecate configuration getters/setters that return/set complex Java objects

2023-11-01 Thread Junrui Lee
Hi devs,

I would like to start a discussion on FLIP-381: Deprecate configuration
getters/setters that return/set complex Java objects[1].

Currently, the job configuration in Flink is spread out across different
components, which leads to inconsistencies and confusion. To address this
issue, it is necessary to migrate non-ConfigOption complex Java objects to
use ConfigOption and adopt a single Configuration object to host all the
configuration.
However, there is a significant blocker in implementing this solution.
These complex Java objects in StreamExecutionEnvironment, CheckpointConfig,
and ExecutionConfig have already been exposed through the public API,
making it challenging to modify the existing implementation.

Therefore, I propose to deprecate these Java objects and their
corresponding getter/setter interfaces, ultimately removing them in
Flink 2.0.

Your feedback and thoughts on this proposal are highly appreciated.

Best regards,
Junrui Lee

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278464992
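
(To make the direction concrete, here is a minimal before/after sketch using
only existing public API; the authoritative list of affected getters/setters
is in the FLIP itself.)

import java.time.Duration;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.PipelineOptions;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Flip381Sketch {
    public static void main(String[] args) {
        // Today: settings are scattered across complex objects reached via getters.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getConfig().setAutoWatermarkInterval(200L);

        // Direction of the FLIP: a single Configuration hosts everything via ConfigOptions.
        Configuration conf = new Configuration();
        conf.set(ExecutionCheckpointingOptions.CHECKPOINTING_MODE, CheckpointingMode.EXACTLY_ONCE);
        conf.set(PipelineOptions.AUTO_WATERMARK_INTERVAL, Duration.ofMillis(200));
        StreamExecutionEnvironment envFromConf =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}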


RE: [DISCUSS] Confluent Avro support without Schema Registry access

2023-11-01 Thread Dale Lane
Thanks for the pointer to FLINK-33045 - I hadn’t spotted that. That sounds like 
it’d address one possible issue (where someone using Flink shouldn’t be 
registering new schemas, or perhaps doesn’t have the access/permission to do so).

I should be clear that I absolutely agree that using a schema registry is 
optimal. It should be the norm – it should be the default, preferred and 
recommended option.

However, I think that there may still be times where the schema registry isn’t 
available.

Maybe you’re using a mirrored copy of the topic on another Kafka cluster and 
don’t have the original Kafka cluster’s schema registry available. Maybe 
networking restrictions mean that where you are running Flink doesn’t have 
connectivity to the schema registry. Maybe the developer using Flink doesn’t 
have permission for or access to the schema registry. Maybe the schema registry 
is currently unavailable. Maybe the developer using Flink is developing their 
Flink job offline, disconnected from the environment where the schema registry 
is running (ahead of eventually deploying their finished Flink job somewhere it 
will have access to the schema registry).

It is in such circumstances that I think the approach the avro formatter offers 
is a useful fallback. Through the table schema, it lets you specify the shape 
of your data, allowing you to process it without requiring an external 
dependency.
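
For concreteness: the Confluent wire format is just a 1-byte magic (0x0) and a
4-byte schema ID in front of the Avro binary. A minimal sketch of the fallback
described above, decoding with a writer schema supplied locally (e.g. derived
from the table schema) using plain Apache Avro APIs — with the caveat, as
Martijn notes below, that a locally supplied schema can mis-decode data if the
real writer schema has evolved:

import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class ConfluentFramingSketch {
    // Decodes a Confluent-framed Avro message with a locally known writer schema,
    // skipping the 5-byte header instead of contacting the schema registry.
    static GenericRecord decode(byte[] message, Schema writerSchema) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(message);
        if (buf.get() != 0) { // magic byte, always 0x0 today
            throw new IllegalArgumentException("Not a Confluent-framed Avro message");
        }
        int schemaId = buf.getInt(); // registry ID; deliberately unused in this fallback
        BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(message, 5, message.length - 5, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
    }
}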

It seems to me that making it impossible to process Confluent Avro-encoded 
messages without access to an additional external component is too strict a 
restriction (as much as there are completely valid reasons for it to be a 
recommendation).

And, with a very small modification to the avro formatter, it’s a restriction 
we could remove.

Kind regards

Dale



From: Ryan Skraba 
Date: Monday, 30 October 2023 at 16:42
To: dev@flink.apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] Confluent Avro support without Schema 
Registry access
Hello!  I took a look at FLINK-33045, which is somewhat related: In
that improvement, the author wants to control who registers schemas.
The Flink job would know the Avro schema to use, and would look up the
ID to use in framing the Avro binary.  It uses but never changes the
schema registry.

Here it sounds like you want nearly the same thing with one more step:
if the Flink job is configured with the schema to use, it could also
be pre-configured with the ID that the schema registry knows.
Technically, it could be configured with a *set* of schemas mapped to
their IDs when the job starts, but I imagine this would be pretty
clunky.

I'm curious if you can share what customer use cases wouldn't want
access to the schema registry!  One of the reasons it exists is to
prevent systems from writing unreadable or corrupted data to a Kafka
topic (or other messaging system) -- which I think is what Martijn is
asking about.  There's unlikely to be a performance gain from hiding it.

Thanks for bringing this up for discussion!  Ryan

[FLINK-33045]: https://issues.apache.org/jira/browse/FLINK-33045
[Single Object Encoding]:
https://avro.apache.org/docs/1.11.1/specification/_print/#single-object-encoding-specification

On Fri, Oct 27, 2023 at 3:13 PM Dale Lane  wrote:
>
> > if you strip the magic byte, and the schema has
> > evolved when you're consuming it from Flink,
> > you can end up with deserialization errors given
> > that a field might have been deleted/added/
> > changed etc.
>
> Aren’t we already fairly dependent on the schema remaining consistent, 
> because otherwise we’d need to update the table schema as well?
>
> > it wouldn't work when you actually want to
> > write avro-confluent, because that requires a
> > check when producing if you're still being compliant.
>
> I’m not sure what you mean here, sorry. Are you thinking about issues if you 
> needed to mix-and-match with both formatters at the same time? (Rather than 
> just using the Avro formatter as I was describing)
>
> Kind regards
>
> Dale
>
>
>
> From: Martijn Visser 
> Date: Friday, 27 October 2023 at 14:03
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] Confluent Avro support without Schema 
> Registry access
> Hi Dale,
>
> I'm struggling to understand in what cases you want to read data
> serialized in connection with Confluent Schema Registry, but can't get
> access to the Schema Registry service. It seems like a rather exotic
> situation, and it defeats the purpose of using a Schema Registry in the
> first place? I also doubt that it's actually really useful: if you
> strip the magic byte, and the schema has evolved when you're consuming
> it from Flink, you can end up with deserialization errors given that a
> field might have been deleted/added/changed etc. Also, it wouldn't
> work when you actually want to write avro-confluent, because that
> requires a check when producing if you're still being compliant.
>
> Best regards,
>
> Martijn
>
> On Fri, Oct 27, 2023 at 2:53 PM Dale Lane  wrote:
> >
> > TLDR:
> > We currently require a 

Re: [DISCUSS] FLIP-378: Support Avro timestamp with local timezone

2023-11-01 Thread Peter Huang
Thanks @Leonard Xu . Two minor versions are definitely
needed to flip the configs.

On Mon, Oct 30, 2023 at 8:55 PM Leonard Xu  wrote:

> Thanks @Peter for driving this FLIP
>
> +1 from my side, the timestamp semantics mapping looks good to me.
>
> >  In the end, the legacy behavior will be dropped in
> > Flink 2.0
> I don’t think we can introduce this option in 1.19 and drop it
> in 2.0; the API removal requires at least two minor versions.
>
>
> Best,
> Leonard
>
> > 2023年10月31日 上午11:18,Peter Huang  写道:
> >
> > Hi Devs,
> >
> > Currently, Flink Avro Format doesn't support the Avro timestamp
> > (millis/micros) with local timezone type.
> > Although the Avro timestamp (millis/micros) type is supported and is
> > mapped to Flink's timestamp without timezone, this is not compliant with
> > the semantics defined in Consistent timestamp types in
> > Hadoop SQL engines
> > <
> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.n699ftkvhjlo
> >
> > .
> >
> > I propose to support Avro timestamps in compliance with the mapping
> > semantics [1], guarded by a configuration flag.
> > To keep backward compatibility, the current mapping is kept as the default
> > behavior. Users can explicitly turn on the new mapping
> > by setting the legacy flag to false. In the end, the legacy behavior will
> > be dropped in
> > Flink 2.0.
> >
> > Looking forward to your feedback.
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-378%3A+Support+Avro+timestamp+with+local+timezone
> >
> >
> > Best Regards
> >
> > Peter Huang
>
>
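
(For illustration, the opt-in could end up looking roughly like the sketch
below; the option key used here is a placeholder, not the final name from the
FLIP.)

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class Flip378Sketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tableEnv.executeSql(
                "CREATE TABLE events (\n"
                        // TIMESTAMP_LTZ carries instant semantics and would map to
                        // Avro timestamp-millis under the new, compliant mapping.
                        + "  ts TIMESTAMP_LTZ(3)\n"
                        + ") WITH (\n"
                        + "  'connector' = 'kafka',\n"
                        + "  'topic' = 'events',\n"
                        + "  'properties.bootstrap.servers' = 'localhost:9092',\n"
                        + "  'format' = 'avro',\n"
                        // Placeholder key: setting the legacy flag to false would
                        // opt in to the new mapping.
                        + "  'avro.timestamp-mapping.legacy' = 'false'\n"
                        + ")");
    }
}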


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Junrui Lee
Hi Zhanghao,

Thank you for the proposal.

+1 from my side. It would be more user-friendly to have the deprecated
options in the same section as the non-deprecated ones. Therefore, adding
them in the same section sounds good to me.

Best regards,
Junrui

Zhanghao Chen  于2023年11月1日周三 21:10写道:

> Hi Samrat and Ruan,
>
> Thanks for the suggestion. I'm actually in favor of adding the deprecated
> options in the same section as the non-deprecated ones. This would make
> it easier for users to find the descriptions of the replacement options. It
> would be a different story for options deprecated because the related
> API/module is entirely deprecated, e.g. the DataSet API. In that case, users
> would not search for a replacement for an individual option but rather need to
> migrate to a new API, and it would be better to move these options to a
> separate section. WDYT?
>
> Best,
> Zhanghao Chen
> 
> From: Samrat Deb 
> Sent: Wednesday, November 1, 2023 15:31
> To: dev@flink.apache.org 
> Cc: u...@flink.apache.org 
> Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well
>
> Thanks for the proposal,
> +1 for adding the deprecated identifier.
>
> [Thought] Can we have a separate section / page for deprecated configs?
> WDYT?
>
>
> Bests,
> Samrat
>
>
> On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <
> alexander.fedu...@gmail.com> wrote:
>
> > Hi Zhanghao,
> >
> > Thanks for the proposition.
> > In general +1, this sounds like a good idea as long as it is clear that the
> > usage of these settings is discouraged.
> > Just one minor concern - the configuration page is already very long, do
> > you have a rough estimate of how many more options would be added with
> this
> > change?
> >
> > Best,
> > Alexander Fedulov
> >
> > On Mon, 30 Oct 2023 at 18:24, Matthias Pohl
> > wrote:
> >
> > > Thanks for your proposal, Zhanghao Chen. I think it adds more
> > transparency
> > > to the configuration documentation.
> > >
> > > +1 from my side on the proposal
> > >
> > > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen <
> zhanghao.c...@outlook.com
> > >
> > > wrote:
> > >
> > > > Hi Flink users and developers,
> > > >
> > > > Currently, Flink won't generate docs for deprecated options. This
> > > might
> > > > confuse users when upgrading from an older version of Flink: they
> have
> > to
> > > > either carefully read the release notes or check the source code for
> > > > upgrade guidance on deprecated options.
> > > >
> > > > I propose to document deprecated options as well, with a
> "(deprecated)"
> > > > tag placed at the beginning of the option description to highlight
> the
> > > > deprecation status [1].
> > > >
> > > > Looking forward to your feedback on it.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-33240
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > >
> > >
> >
>
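
(For context, here is a sketch of how a deprecated option typically looks in
Flink code — the option names below are made up. The proposal is for the docs
generator to stop skipping such options and to prefix their description with
"(deprecated)".)

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class ExampleOptions {
    /** The replacement option users should migrate to. */
    public static final ConfigOption<Integer> RETRIES =
            ConfigOptions.key("example.retries")
                    .intType()
                    .defaultValue(3)
                    .withDescription("Number of retries.");

    /** Kept for compatibility; today the docs generator skips it entirely. */
    @Deprecated
    public static final ConfigOption<Integer> NUM_RETRIES =
            ConfigOptions.key("example.num-retries")
                    .intType()
                    .defaultValue(3)
                    .withDescription("Number of retries. Use 'example.retries' instead.");
}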


Re: [DISCUSS] Kubernetes Operator 1.7.0 release planning

2023-11-01 Thread Maximilian Michels
+1 for targeting the release as soon as possible. Given the effort
that Rui has undergone to decouple the autoscaling implementation, it
makes sense to also include an alternative implementation with the
release. In the long run, I wonder whether the standalone
implementation should even be part of the Kubernetes operator
repository. It can be hosted in a different repository and simply
consume the flink-autoscaler jar. But the same applies to the
flink-autoscaler module. For this release, we can keep everything
together.

I have a minor issue [1] I would like to include in the release.

-Max

[1] https://issues.apache.org/jira/browse/FLINK-33429

On Tue, Oct 31, 2023 at 11:14 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks Gyula for driving this release!
>
> I'd like to check with you and the community: could we
> postpone the code freeze by a week?
>
> I'm developing FLINK-33099 [1], and the prod code is done.
> I need some time to develop the tests. I hope this feature is included in
> 1.7.0 for two main reasons:
>
> 1. We have completed the decoupling of the autoscaler and
> kubernetes-operator in 1.7.0. During the decoupling period, we modified
> a large number of autoscaler-related interfaces. The standalone autoscaler
> is an autoscaler process that can run independently. It can help us confirm
> whether the new interface is reasonable.
> 2. Flink 1.18.0 was recently released; the standalone autoscaler allows more
> users to try the autoscaler and in-place rescaling.
>
> I have created a draft PR [2] for FLINK-33099; it just includes the prod code.
> I have run it manually and it works well. I will try my best to finish all
> unit tests before Friday, and will finish them before next Monday
> at the latest.
>
> WDYT?
>
> I'm deeply sorry for the request to postpone the release.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33099
> [2] https://github.com/apache/flink-kubernetes-operator/pull/698
>
> Best,
> Rui
>
> On Tue, Oct 31, 2023 at 4:10 PM Samrat Deb  wrote:
>
> > Thank you Gyula
> >
> > (+1 non-binding) in support of you taking on the role of release manager.
> >
> > > I think this is reasonable as I am not aware of any big features / bug
> > fixes being worked on right now. Given the size of the changes related to
> > the autoscaler module refactor we should try to focus the remaining time on
> > testing.
> >
> > I completely agree with you. Since the changes are quite extensive, it's
> > crucial to allocate more time for thorough testing and verification.
> >
> > Regarding working with you for the release, I might not have the necessary
> > privileges for that.
> >
> > However, I'd be more than willing to assist with testing the changes,
> > validating various features, and checking for any potential regressions in
> > the flink-kubernetes-operator.
> > Just let me know how I can support the testing efforts.
> >
> > Bests,
> > Samrat
> >
> >
> > On Tue, 31 Oct 2023 at 12:59 AM, Gyula Fóra  wrote:
> >
> > > Hi all!
> > >
> > > I would like to kick off the release planning for the operator 1.7.0
> > > release. We have added quite a lot of new functionality over the last few
> > > weeks and I think the operator is in a good state to kick this off.
> > >
> > > Based on the original release schedule we had Nov 1 as the proposed
> > feature
> > > freeze date and Nov 7 as the date for the release cut / rc1.
> > >
> > > I think this is reasonable as I am not aware of any big features / bug
> > > fixes being worked on right now. Given the size of the changes related to
> > > the autoscaler module refactor we should try to focus the remaining time
> > on
> > > testing.
> > >
> > > I am happy to volunteer as a release manager but I am of course open to
> > > working together with someone on this.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > Gyula
> > >
> >


[jira] [Created] (FLINK-33429) Metric collection during stabilization phase may error due to missing metrics

2023-11-01 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33429:
--

 Summary: Metric collection during stabilization phase may error 
due to missing metrics
 Key: FLINK-33429
 URL: https://issues.apache.org/jira/browse/FLINK-33429
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler
Affects Versions: kubernetes-operator-1.7.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The new code for the 1.7.0 release introduces metric collection during the 
stabilization phase to allow sampling the observed true processing rate. 
Metrics might not be fully initialized during that phase, as evidenced by 
errors like the following:

{noformat}
java.lang.RuntimeException: Could not find required metric 
NUM_RECORDS_OUT_PER_SEC for 667f5d5aa757fb217b92c06f0f5d2bf2 
{noformat}

To prevent these errors from shadowing actual errors, we should detect and 
ignore this recoverable exception.
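
One possible shape of the fix (illustrative sketch only; the helper names and
flags below are made up, not the actual patch):

{noformat}
// During stabilization, treat the missing-metric error as recoverable
// instead of letting it bubble up and shadow real failures.
try {
    collectMetrics(ctx);
} catch (RuntimeException e) {
    boolean missingMetric = e.getMessage() != null
            && e.getMessage().startsWith("Could not find required metric");
    if (stabilizing && missingMetric) {
        LOG.debug("Metrics not fully initialized yet, skipping this sample", e);
    } else {
        throw e;
    }
}
{noformat}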



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Zhanghao Chen
Hi Samrat and Ruan,

Thanks for the suggestion. I'm actually in favor of adding the deprecated 
options in the same section as the non-deprecated ones. This would make it 
easier for users to find the descriptions of the replacement options. It would 
be a different story for options deprecated because the related API/module is 
entirely deprecated, e.g. the DataSet API. In that case, users would not search 
for a replacement for an individual option but rather need to migrate to a new 
API, and it would be better to move these options to a separate section. WDYT?

Best,
Zhanghao Chen

From: Samrat Deb 
Sent: Wednesday, November 1, 2023 15:31
To: dev@flink.apache.org 
Cc: u...@flink.apache.org 
Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well

Thanks for the proposal,
+1 for adding the deprecated identifier.

[Thought] Can we have a separate section / page for deprecated configs?
WDYT?


Bests,
Samrat


On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> Hi Zhanghao,
>
> Thanks for the proposition.
> In general +1, this sounds like a good idea as long as it is clear that the
> usage of these settings is discouraged.
> Just one minor concern - the configuration page is already very long, do
> you have a rough estimate of how many more options would be added with this
> change?
>
> Best,
> Alexander Fedulov
>
> On Mon, 30 Oct 2023 at 18:24, Matthias Pohl
> wrote:
>
> > Thanks for your proposal, Zhanghao Chen. I think it adds more
> transparency
> > to the configuration documentation.
> >
> > +1 from my side on the proposal
> >
> > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen  >
> > wrote:
> >
> > > Hi Flink users and developers,
> > >
> > > Currently, Flink won't generate docs for deprecated options. This
> > might
> > > confuse users when upgrading from an older version of Flink: they have
> to
> > > either carefully read the release notes or check the source code for
> > > upgrade guidance on deprecated options.
> > >
> > > I propose to document deprecated options as well, with a "(deprecated)"
> > > tag placed at the beginning of the option description to highlight the
> > > deprecation status [1].
> > >
> > > Looking forward to your feedback on it.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-33240
> > >
> > > Best,
> > > Zhanghao Chen
> > >
> >
>


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Zhanghao Chen
Hi Alexander,

I haven't done a complete analysis yet, but a simple code search suggests that 
roughly 35 options would be added with this change. Also note that some old 
options defined in the ConfigConstants class won't be added here, as flink-docs 
won't discover these constant-based options.

Best,
Zhanghao Chen

From: Alexander Fedulov 
Sent: Tuesday, October 31, 2023 18:12
To: dev@flink.apache.org 
Cc: u...@flink.apache.org 
Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well

Hi Zhanghao,

Thanks for the proposition.
In general +1, this sounds like a good idea as long as it is clear that the usage 
of these settings is discouraged.
Just one minor concern - the configuration page is already very long, do you 
have a rough estimate of how many more options would be added with this change?

Best,
Alexander Fedulov

On Mon, 30 Oct 2023 at 18:24, Matthias Pohl  
wrote:
Thanks for your proposal, Zhanghao Chen. I think it adds more transparency
to the configuration documentation.

+1 from my side on the proposal

On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen 
<zhanghao.c...@outlook.com>
wrote:

> Hi Flink users and developers,
>
> Currently, Flink won't generate docs for deprecated options. This might
> confuse users when upgrading from an older version of Flink: they have to
> either carefully read the release notes or check the source code for
> upgrade guidance on deprecated options.
>
> I propose to document deprecated options as well, with a "(deprecated)"
> tag placed at the beginning of the option description to highlight the
> deprecation status [1].
>
> Looking forward to your feedback on it.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33240
>
> Best,
> Zhanghao Chen
>


[jira] [Created] (FLINK-33428) Flink sql cep support 'followed','notNext' and 'notFollowedBy' semantics

2023-11-01 Thread xiaoran (Jira)
xiaoran created FLINK-33428:
---

 Summary: Flink sql cep support 'followed','notNext' and 
'notFollowedBy' semantics
 Key: FLINK-33428
 URL: https://issues.apache.org/jira/browse/FLINK-33428
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API, Table SQL / Planner
Affects Versions: 1.16.0
Reporter: xiaoran


Currently, the CEP mode of the Flink API supports the next, notNext, followedBy, 
followedByAny, and notFollowedBy semantics, but Flink SQL only supports the next 
semantics. The notNext and followedBy semantics can be approximated through 
other alternatives, while the notFollowedBy semantics is not currently 
implemented at all. This semantics is commonly needed in business 
scenarios, such as detecting that a user placed an order but did not pay within 
15 minutes. Therefore, I suggest providing new functionality to support 
notFollowedBy in SQL mode, along with the other three semantics.

 

The syntax of enhanced MATCH_RECOGNIZE is proposed as follows:

MATCH_RECOGNIZE (
[ PARTITION BY <column> [, ... ] ]
[ ORDER BY <column> [, ... ] ]
[ MEASURES <expression> [AS] <alias> [, ... ] ]
[ ONE ROW PER MATCH [ { SHOW TIMEOUT MATCHES } ] |
  ALL ROWS PER MATCH [ { SHOW TIMEOUT MATCHES } ]
]
[ AFTER MATCH SKIP
  {
  PAST LAST ROW   |
  TO NEXT ROW   |
  TO [ { FIRST | LAST} ] <pattern variable>
  }
]
PATTERN ( <pattern> )
DEFINE <pattern variable> AS <condition> [, ... ]
)
[^ ] is proposed in <pattern> to express the notNext semantic. For 
example, A [^B] is translated to A.notNext(B).
{- -} is proposed in <pattern> to express the followedBy semantic. For 
example, A {- B*? -} C is translated to A.followedBy(C).
{- symbol1 -} with [^ ] is proposed in <pattern> to express the 
notFollowedBy semantic. For example, A {- B*? -} [^C] is translated to 
A.notFollowedBy(B).
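
For comparison, the order-without-payment case above looks like the following
in the DataStream CEP API (the Event type and its getType() accessor are
placeholders for illustration; note that notFollowedBy() at the end of a
pattern requires a within() window):

{noformat}
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// 'Event' and getType() are hypothetical placeholders.
Pattern<Event, ?> orderNotPaid =
    Pattern.<Event>begin("order")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "ORDER".equals(e.getType());
            }
        })
        .notFollowedBy("payment")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "PAY".equals(e.getType());
            }
        })
        .within(Time.minutes(15));
{noformat}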



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-11-01 Thread Jing Ge
Hi Timo,

Gotcha, let's use passive verbs. I am actually thinking about "BUCKETED BY
6" or "BUCKETED INTO 6".

Not really used in SQL, but afaiu Spark uses the concept[1].

[1]
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameWriter.bucketBy.html


Best regards,
Jing

On Mon, Oct 30, 2023 at 5:25 PM Timo Walther  wrote:

> Hi Jing,
>
>  > Have you considered using BUCKET BY directly?
>
> Which vendor uses this syntax? Most vendors that I checked call this
> concept "distribution".
>
> In any case, the "BY" is optional, so certain DDL statements would
> declare it like "BUCKET INTO 6 BUCKETS"? And following the PARTITIONED,
> we should use the passive voice.
>
>  > Did you mean users can use their own algorithm? How to do it?
>
> "own algorithm" only refers to deciding between a list of partitioning
> strategies (namely hash and range partitioning) if the connector offers
> more than one.
>
> Regards,
> Timo
>
>
> On 30.10.23 12:39, Jing Ge wrote:
> > Hi Timo,
> >
> > The FLIP looks great! Thanks for bringing it to our attention! In order
> to
> > make sure we are on the same page, I would like to ask some questions:
> >
> > 1. DISTRIBUTED BY reminds me of DISTRIBUTE BY from Hive, like Benchao
> mentioned,
> > which is used to distribute rows among reducers, i.e. focusing on the
> > shuffle during the computation. The FLIP is focusing more on storage, if
> I
> > am not mistaken. Have you considered using BUCKET BY directly?
> >
> > 2. According to the FLIP: " CREATE TABLE MyTable (uid BIGINT, name
> STRING)
> > DISTRIBUTED BY HASH(uid) INTO 6 BUCKETS
> >
> > - For advanced users, the algorithm can be defined explicitly.
> > - Currently, either HASH() or RANGE().
> >
> > "
> > Did you mean users can use their own algorithm? How to do it?
> >
> > Best regards,
> > Jing
> >
> > On Mon, Oct 30, 2023 at 11:13 AM Timo Walther 
> wrote:
> >
> >> Let me reply to the feedback from Yunfan:
> >>
> >>   > Distribute by in DML is also supported by Hive
> >>
> >> I see DISTRIBUTED BY and DISTRIBUTE BY as two separate discussions. This
> >> discussion is about DDL. For DDL, we have more freedom as every vendor
> >> has custom syntax for CREATE TABLE clauses. Furthermore, this is tightly
> >> connected to the connector implementation, not the engine. However, for
> >> DML we need to watch out for standard compliance and introduce changes
> >> with high caution.
> >>
> >> How a LookupTableSource interprets the DISTRIBUTED BY is
> >> connector-dependent in my opinion. In general this FLIP is a sink
> >> ability, but we could have a follow FLIP that helps in distributing load
> >> of lookup joins.
> >>
> >>   > to avoid data skew problem
> >>
> >> I understand the use case and that it is important to solve it
> >> eventually. Maybe a solution might be to introduce helper Polymorphic
> >> Table Functions [1] in the future instead of new syntax.
> >>
> >> [1]
> >>
> >>
> https://www.ifis.uni-luebeck.de/~moeller/Lectures/WS-19-20/NDBDM/12-Literature/Michels-SQL-2016.pdf
> >>
> >>
> >> Let me reply to the feedback from Benchao:
> >>
> >>   > Do you think it's useful to add some extensibility for the hash
> >> strategy
> >>
> >> The hash strategy is fully determined by the connector, not the Flink
> >> SQL engine. We are not using Flink's hash strategy in any way. If the
> >> hash strategy for the regular Flink file system connector should be
> >> changed, this should be expressed via a config option. Otherwise we should
> >> offer a dedicated `hive-filesystem` or `spark-filesystem` connector.
> >>
> >> Regards,
> >> Timo
> >>
> >>
> >> On 30.10.23 10:44, Timo Walther wrote:
> >>> Hi Jark,
> >>>
> >>> my intention was to avoid overly complex syntax in the first version. In
> >>> the past years, we could enable use cases even without this clause, so
> >>> we should be careful with overloading it with too much functionality in the
> >>> first version. We can still iterate on it later, the interfaces are
> >>> flexible enough to support more in the future.
> >>>
> >>> I agree that an explicit HASH and RANGE doesn't hurt, and the same goes
> >>> for making the bucket number optional.
> >>>
> >>> I updated the FLIP accordingly. Now the SupportsBucketing interface
> >>> declares more methods that help in validation and provide helpful error
> >>> messages to users.
> >>>
> >>> Let me know what you think.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 27.10.23 10:20, Jark Wu wrote:
>  Hi Timo,
> 
>  Thanks for starting this discussion. I really like it!
>  The FLIP is already in good shape, I only have some minor comments.
> 
>  1. Could we also support HASH and RANGE distribution kind on the DDL
>  syntax?
>  I noticed that HASH and UNKNOWN are introduced in the Java API, but
>  not in
>  the syntax.
> 
>  2. Can we make "INTO n BUCKETS" optional in CREATE TABLE and ALTER
> >> TABLE?
>  Some storage engines support automatically determining the bucket
> number
>  
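
(To summarize the DDL shapes discussed so far — per the current state of the
FLIP, with HASH/RANGE explicit and the bucket count optional; the final
grammar is of course subject to this thread:)

// Proposed syntax sketches; not executable on current Flink releases.
String s1 = "CREATE TABLE T1 (uid BIGINT, name STRING) DISTRIBUTED BY (uid) INTO 6 BUCKETS";     // connector picks the algorithm
String s2 = "CREATE TABLE T2 (uid BIGINT, name STRING) DISTRIBUTED BY HASH(uid) INTO 6 BUCKETS"; // explicit algorithm
String s3 = "CREATE TABLE T3 (uid BIGINT, name STRING) DISTRIBUTED BY RANGE(uid)";               // bucket count left to the connector
String s4 = "CREATE TABLE T4 (uid BIGINT, name STRING) DISTRIBUTED INTO 6 BUCKETS";              // no distribution key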

[jira] [Created] (FLINK-33427) Mark new relocated autoscaler configs IGNORE in the operator

2023-11-01 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-33427:
--

 Summary: Mark new relocated autoscaler configs IGNORE in the 
operator
 Key: FLINK-33427
 URL: https://issues.apache.org/jira/browse/FLINK-33427
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Gyula Fora
Assignee: Gyula Fora
 Fix For: kubernetes-operator-1.7.0


The operator currently only ignores "kubernetes.operator"-prefixed configs so 
as not to trigger upgrades. Autoscaler configs should also fall into this category.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-379: Dynamic source parallelism inference for batch jobs

2023-11-01 Thread Xia Sun
Thanks Lijie for the comments!
1. For Hive source, dynamic parallelism inference in batch scenarios is a
superset of static parallelism inference. As a follow-up task, we can
consider changing the default value of
'table.exec.hive.infer-source-parallelism' to false.

2. I think that both dynamic parallelism inference and static parallelism
inference have their own use cases. Currently, for streaming sources and
other sources that are not sensitive to dynamic information, the benefits
of dynamic parallelism inference may not be significant. In such cases, we
can continue to use static parallelism inference.

Thanks,
Xia

Lijie Wang  于2023年11月1日周三 14:52写道:

> Hi Xia,
>
> Thanks for driving this FLIP, +1 for the proposal.
>
> I have 2 questions about the relationship between static inference and
> dynamic inference:
>
> 1. AFAIK, currently the Hive table source enables static inference by
> default. In this case, which one (static vs dynamic) will take effect? I
> think it would be better if we could point this out in the FLIP.
>
> 2. As you mentioned above, dynamic inference is the most ideal way, so do
> we have plan to deprecate the static inference in the future?
>
> Best,
> Lijie
>
> Zhu Zhu  于2023年10月31日周二 20:19写道:
>
> > Thanks for opening the FLIP and kicking off this discussion, Xia!
> > The proposed changes make up an important missing part of the dynamic
> > parallelism inference of adaptive batch scheduler.
> >
> > Besides that, it is also one good step towards supporting dynamic
> > parallelism inference for streaming sources, e.g. allowing Kafka
> > sources to determine its parallelism automatically based on the
> > number of partitions.
> >
> > +1 for the proposal.
> >
> > Thanks,
> > Zhu
> >
> > Xia Sun  于2023年10月31日周二 16:01写道:
> >
> > > Hi everyone,
> > > I would like to start a discussion on FLIP-379: Dynamic source
> > parallelism
> > > inference for batch jobs[1].
> > >
> > > In general, there are three main ways to set source parallelism for
> batch
> > > jobs:
> > > (1) User-defined source parallelism.
> > > (2) Connector static parallelism inference.
> > > (3) Dynamic parallelism inference.
> > >
> > > Compared to manually setting parallelism, automatic parallelism
> inference
> > > is easier to use and can better adapt to varying data volumes each day.
> > > However, static parallelism inference cannot leverage runtime
> > information,
> > > resulting in inaccurate parallelism inference. Therefore, for batch
> jobs,
> > > dynamic parallelism inference is the most ideal, but currently, the
> > support
> > > for the adaptive batch scheduler is not very comprehensive.
> > >
> > > Therefore, we aim to introduce a general interface that enables the
> > > adaptive batch scheduler to dynamically infer the source parallelism at
> > > runtime. Please refer to the FLIP[1] document for more details about
> the
> > > proposed design and implementation.
> > >
> > > I also thank Zhu Zhu and LiJie Wang for their suggestions during the
> > > pre-discussion.
> > > Looking forward to your feedback and suggestions, thanks.
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-379%3A+Dynamic+source+parallelism+inference+for+batch+jobs
> > >
> > > Best regards,
> > > Xia
> > >
> >
>
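
(For readers who have not opened the FLIP: the proposal is roughly an interface
of the following shape that the adaptive batch scheduler calls at runtime. The
names below are illustrative only; the actual definition is in the FLIP.)

// Illustrative sketch only; see FLIP-379 for the real interface.
public interface DynamicParallelismInference {

    /** Called by the adaptive batch scheduler before the source vertex starts. */
    int inferParallelism(Context context);

    interface Context {
        /** Upper bound the scheduler allows for this source's parallelism. */
        int getParallelismUpperBound();

        /** Hint for how many bytes each subtask should ideally process. */
        long getDataVolumePerTask();
    }
}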


[jira] [Created] (FLINK-33426) If the directory does not have read permission, no exception is thrown when reading the path

2023-11-01 Thread wenhao.yu (Jira)
wenhao.yu created FLINK-33426:
-

 Summary: If the directory does not have read permission, no 
exception is thrown when reading the path.
 Key: FLINK-33426
 URL: https://issues.apache.org/jira/browse/FLINK-33426
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.17.1, 1.13.5
Reporter: wenhao.yu
 Fix For: 1.17.1


When I use the StreamExecutionEnvironment.readFile() API, I found that when 
reading an HDFS directory for which read permission has not been granted, no 
exception is thrown: the task keeps running, and the outside world cannot 
perceive the real task status.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Hang Ruan
Thanks for the proposal.

+1 from my side and +1 for putting them to a separate section.

Best,
Hang

Samrat Deb  于2023年11月1日周三 15:32写道:

> Thanks for the proposal,
> +1 for adding the deprecated identifier.
>
> [Thought] Can we have a separate section / page for deprecated configs?
> WDYT?
>
>
> Bests,
> Samrat
>
>
> On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <
> alexander.fedu...@gmail.com> wrote:
>
> > Hi Zhanghao,
> >
> > Thanks for the proposition.
> > In general +1, this sounds like a good idea as long as it is clear that the
> > usage of these settings is discouraged.
> > Just one minor concern - the configuration page is already very long, do
> > you have a rough estimate of how many more options would be added with
> this
> > change?
> >
> > Best,
> > Alexander Fedulov
> >
> > On Mon, 30 Oct 2023 at 18:24, Matthias Pohl
> > wrote:
> >
> > > Thanks for your proposal, Zhanghao Chen. I think it adds more
> > transparency
> > > to the configuration documentation.
> > >
> > > +1 from my side on the proposal
> > >
> > > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen <
> zhanghao.c...@outlook.com
> > >
> > > wrote:
> > >
> > > > Hi Flink users and developers,
> > > >
> > > > Currently, Flink won't generate docs for deprecated options. This
> > > might
> > > > confuse users when upgrading from an older version of Flink: they
> have
> > to
> > > > either carefully read the release notes or check the source code for
> > > > upgrade guidance on deprecated options.
> > > >
> > > > I propose to document deprecated options as well, with a
> "(deprecated)"
> > > > tag placed at the beginning of the option description to highlight
> the
> > > > deprecation status [1].
> > > >
> > > > Looking forward to your feedback on it.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-33240
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > >
> > >
> >
>


[jira] [Created] (FLINK-33425) Flink SQL doesn't support an inline field in struct type as primary key

2023-11-01 Thread Elakiya (Jira)
Elakiya created FLINK-33425:
---

 Summary: Flink SQL doesn't support an inline field in struct type 
as primary key
 Key: FLINK-33425
 URL: https://issues.apache.org/jira/browse/FLINK-33425
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Affects Versions: 1.16.2
Reporter: Elakiya


I have a Kafka topic named employee which uses a Confluent Avro schema and will 
emit payloads like the one below:

{
  "employee": {
    "id": "123456",
    "name": "sampleName"
  }
}
I am using the upsert-kafka connector to consume the events from the above 
Kafka topic with the Flink SQL DDL statement below, where I want to use 
the id field as the primary key. But I am unable to use the id field since it 
is nested inside the employee object, and currently Flink doesn't support this 
feature. I am using Apache Flink 1.16.2 and its dependent versions.

DDL Statement:
String statement = "CREATE TABLE Employee (\r\n" +
"  employee ROW(id STRING, name STRING),\r\n" +
"  PRIMARY KEY (employee.id) NOT ENFORCED\r\n" +
") WITH (\r\n" +
"  'connector' = 'upsert-kafka',\r\n" +
"  'topic' = 'employee',\r\n" +
"  'properties.bootstrap.servers' = 'kafka-cp-kafka:9092',\r\n" +
"  'key.format' = 'raw',\r\n" +
"  'value.format' = 'avro-confluent',\r\n" +
"  'value.avro-confluent.url' = 'http://kafka-cp-schema-registry:8081'\r\n" +
")";
A new feature to use a property of a Row datatype (in this case employee.id) 
as a primary key would be helpful in many scenarios.
Let me know if more details are required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Samrat Deb
Thanks for the proposal,
+1 for adding the deprecated identifier.

[Thought] Can we have a separate section / page for deprecated configs?
WDYT?


Bests,
Samrat


On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> Hi Zhanghao,
>
> Thanks for the proposition.
> In general +1, this sounds like a good idea as long as it is clear that the
> usage of these settings is discouraged.
> Just one minor concern - the configuration page is already very long, do
> you have a rough estimate of how many more options would be added with this
> change?
>
> Best,
> Alexander Fedulov
>
> On Mon, 30 Oct 2023 at 18:24, Matthias Pohl
> wrote:
>
> > Thanks for your proposal, Zhanghao Chen. I think it adds more
> transparency
> > to the configuration documentation.
> >
> > +1 from my side on the proposal
> >
> > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen  >
> > wrote:
> >
> > > Hi Flink users and developers,
> > >
> > > Currently, Flink won't generate docs for deprecated options. This
> > might
> > > confuse users when upgrading from an older version of Flink: they have
> to
> > > either carefully read the release notes or check the source code for
> > > upgrade guidance on deprecated options.
> > >
> > > I propose to document deprecated options as well, with a "(deprecated)"
> > > tag placed at the beginning of the option description to highlight the
> > > deprecation status [1].
> > >
> > > Looking forward to your feedback on it.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-33240
> > >
> > > Best,
> > > Zhanghao Chen
> > >
> >
>


[jira] [Created] (FLINK-33424) Resolve the problem that yarnClient cannot load yarn configurations

2023-11-01 Thread zhengzhili (Jira)
zhengzhili created FLINK-33424:
--

 Summary: Resolve the problem that yarnClient cannot load yarn 
configurations
 Key: FLINK-33424
 URL: https://issues.apache.org/jira/browse/FLINK-33424
 Project: Flink
  Issue Type: Bug
  Components: Client / Job Submission
Affects Versions: 1.17.1
Reporter: zhengzhili
 Fix For: 1.19.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-379: Dynamic source parallelism inference for batch jobs

2023-11-01 Thread Lijie Wang
Hi Xia,

Thanks for driving this FLIP, +1 for the proposal.

I have 2 questions about the relationship between static inference and
dynamic inference:

> 1. AFAIK, currently the Hive table source enables static inference by
> default. In this case, which one (static vs dynamic) will take effect? I
> think it would be better if we could point this out in the FLIP.

2. As you mentioned above, dynamic inference is the most ideal way, so do
we have plan to deprecate the static inference in the future?

Best,
Lijie

Zhu Zhu  于2023年10月31日周二 20:19写道:

> Thanks for opening the FLIP and kicking off this discussion, Xia!
> The proposed changes make up an important missing part of the dynamic
> parallelism inference of adaptive batch scheduler.
>
> Besides that, it is also one good step towards supporting dynamic
> parallelism inference for streaming sources, e.g. allowing Kafka
> sources to determine its parallelism automatically based on the
> number of partitions.
>
> +1 for the proposal.
>
> Thanks,
> Zhu
>
> Xia Sun  于2023年10月31日周二 16:01写道:
>
> > Hi everyone,
> > I would like to start a discussion on FLIP-379: Dynamic source
> parallelism
> > inference for batch jobs[1].
> >
> > In general, there are three main ways to set source parallelism for batch
> > jobs:
> > (1) User-defined source parallelism.
> > (2) Connector static parallelism inference.
> > (3) Dynamic parallelism inference.
> >
> > Compared to manually setting parallelism, automatic parallelism inference
> > is easier to use and can better adapt to varying data volumes each day.
> > However, static parallelism inference cannot leverage runtime
> information,
> > resulting in inaccurate parallelism inference. Therefore, for batch jobs,
> > dynamic parallelism inference is the most ideal, but currently, the
> support
> > for adaptive batch scheduler is not very comprehensive.
> >
> > Therefore, we aim to introduce a general interface that enables the
> > adaptive batch scheduler to dynamically infer the source parallelism at
> > runtime. Please refer to the FLIP[1] document for more details about the
> > proposed design and implementation.
> >
> > I also thank Zhu Zhu and LiJie Wang for their suggestions during the
> > pre-discussion.
> > Looking forward to your feedback and suggestions, thanks.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-379%3A+Dynamic+source+parallelism+inference+for+batch+jobs
> >
> > Best regards,
> > Xia
> >
>


[jira] [Created] (FLINK-33423) Resolve the problem that yarnClient cannot load yarn configurations

2023-11-01 Thread zhengzhili (Jira)
zhengzhili created FLINK-33423:
--

 Summary: Resolve the problem that yarnClient cannot load yarn 
configurations
 Key: FLINK-33423
 URL: https://issues.apache.org/jira/browse/FLINK-33423
 Project: Flink
  Issue Type: Bug
  Components: Client / Job Submission
Reporter: zhengzhili






--
This message was sent by Atlassian Jira
(v8.20.10#820010)