Re: Future of classical remoting in Pekko

2023-09-25 Thread He Pin
The related changes are :
https://github.com/apache/incubator-pekko/pull/643
https://github.com/apache/incubator-pekko/pull/651
https://github.com/apache/incubator-pekko/pull/652
https://github.com/apache/incubator-pekko/pull/656

It would be very helpful if you take another round of review, thanks.

On 2023/09/19 19:04:59 Ferenc Csaky wrote:
> I think that is totally fine, because any Pekko related changes can only be 
> added to the first patch release of 1.18 at this point, as there is an RC0 
> [1] already so the release process will be initiated soon.
> 
> I am glad the mentioned PR got merged, did not have the chance to review.
> 
> [1] https://lists.apache.org/thread/5x28rp3zct4p603hm4zdwx6kfr101w38
> 
> 
> 
> --- Original Message ---
> On Monday, September 18th, 2023 at 14:20, Matthew de Detrich 
>  wrote:
> 
> 
> > 
> > 
> > I think that the end of September is too soon for a Pekko 1.1.x, there are
> > still more things
> > that we would like to merge before making a release.
> > 
> > Good news is that the PR to migrate to netty4 for classic remoting has been
> > merged
> > (see https://github.com/apache/incubator-pekko/pull/643). Improvements are
> > also
> > still be done, so the next minor version release of Pekko (1.1.0) will
> > contain these
> > changes.
> > 
> > On Wed, Sep 13, 2023 at 11:22 AM Ferenc Csaky ferenc.cs...@pm.me.invalid
> > 
> > wrote:
> > 
> > > The target release date for 1.18 is the end of Sept [1], but I'm not sure
> > > everything will come together by then. Maybe it will pushed by a couple
> > > days.
> > > 
> > > I'm happy to help out, even making the Flink related changes when we're at
> > > that point.
> > > 
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/1.18+Release
> > > 
> > > --- Original Message ---
> > > On Tuesday, September 12th, 2023 at 17:43, He Pin he...@apache.org
> > > wrote:
> > > 
> > > > Hi Ferenc:
> > > > What's the ETA of the Flink 1.18? I think we should beable to
> > > > collaborate on this,and at work we are using Flink too.
> > > > 
> > > > On 2023/09/12 15:16:11 Ferenc Csaky wrote:
> > > > 
> > > > > Hi Matthew,
> > > > > 
> > > > > Thanks for bringing this up! Cca half a year ago I started to work on
> > > > > an Akka Artery migration, there is a draft PR for that 1. It might be 
> > > > > an
> > > > > option to revive that work and point it against Pekko instead. 
> > > > > Although I
> > > > > would highlight FLINK-29281 2 which will replace the whole RPC
> > > > > implementation in Flink to a gRPC-based one when it is done.
> > > > > 
> > > > > I am not sure about the progess on the gRPC work, it looks hanging for
> > > > > a while now, so I think if there is a chance to replace Netty3 with 
> > > > > Netty4
> > > > > in Pekko in the short term it would benefit Flink and then we can 
> > > > > decide if
> > > > > it would worth to upgrade to Artery, or how fast the gRPC solution 
> > > > > can be
> > > > > done and then it will not be necessary.
> > > > > 
> > > > > All in all, in the short term I think Flink would benefit to have that
> > > > > mentioned PR 3 merged, then the updated Pekko version could be 
> > > > > included in
> > > > > the first 1.18 patch probably to mitigate those pesky Netty3 CVEs 
> > > > > that are
> > > > > carried for a while ASAP.
> > > > > 
> > > > > Cheers,
> > > > > Ferenc
> > > > > 
> > > > > 1 https://github.com/apache/flink/pull/22271
> > > > > 2 https://issues.apache.org/jira/browse/FLINK-29281
> > > > > 3 https://github.com/apache/incubator-pekko/pull/643
> > > > > 
> > > > > --- Original Message ---
> > > > > On Tuesday, September 12th, 2023 at 10:29, Matthew de Detrich
> > > > > matthew.dedetr...@aiven.io.INVALID wrote:
> > > > > 
> > > > > > It's come to my attention that Flink is using Pekko's classical
> > > > > > remoting,
> > > > > > if this is the case then I would recommend making a response at
> > > > > > https://lists.apache.org/thread/19h2wrs2om91g5vhnftv583fo0ddfshm .
> > > > > > 
> > > > > > Quick summary of what is being discussed is what to do with Pekko's
> > > > > > classical remoting. Classic remoting is considered deprecated since
> > > > > > 2019,
> > > > > > an artifact that we inherited from Akka1. Ontop of this classical
> > > > > > remoting happens to be using netty3 which has known CVE's2, these
> > > > > > CVE's
> > > > > > were never fixed in the netty3 series.
> > > > > > 
> > > > > > The question is what should be done given this, i.e. some people in
> > > > > > the
> > > > > > Pekko community are wanting to drop classical remoting as quickly as
> > > > > > possible (i.e. even sooner then what semver allows but this is being
> > > > > > discussed) and others are wanting to leave it as it is (even with 
> > > > > > the
> > > > > > CVE's) since we don't want to incentivize and/or create impression
> > > > > > that we
> > > > > > are officially supporting it. There is also a currently open PR3
> > > > > > which
> > > > > > upgrades Pekko's c

[jira] [Created] (FLINK-33149) Bump snappy-java to 1.1.10.4

2023-09-25 Thread Ryan Skraba (Jira)
Ryan Skraba created FLINK-33149:
---

 Summary: Bump snappy-java to 1.1.10.4
 Key: FLINK-33149
 URL: https://issues.apache.org/jira/browse/FLINK-33149
 Project: Flink
  Issue Type: Bug
  Components: API / Core, Connectors / AWS, Connectors / HBase, 
Connectors / Kafka, Stateful Functions
Affects Versions: 1.18.0, 1.16.3, 1.17.2
Reporter: Ryan Skraba


Xerial published a security alert for a Denial of Service attack that [exists 
on 
1.1.10.1|https://github.com/xerial/snappy-java/security/advisories/GHSA-55g7-9cwv-5qfv].

This is included in flink-dist, but also in flink-statefun, and several 
connectors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33150) add the processing logic for the long type

2023-09-25 Thread wenhao.yu (Jira)
wenhao.yu created FLINK-33150:
-

 Summary:  add the processing logic for the long type
 Key: FLINK-33150
 URL: https://issues.apache.org/jira/browse/FLINK-33150
 Project: Flink
  Issue Type: New Feature
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.15.4
Reporter: wenhao.yu
 Fix For: 1.15.4


The AvroToRowDataConverters class has a convertToDate method that will report 
an error when it encounters time data represented by the long type, so add a 
code to handle the long type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33151) Prometheus Sink Connector - Create Github Repo

2023-09-25 Thread Danny Cranmer (Jira)
Danny Cranmer created FLINK-33151:
-

 Summary: Prometheus Sink Connector - Create Github Repo
 Key: FLINK-33151
 URL: https://issues.apache.org/jira/browse/FLINK-33151
 Project: Flink
  Issue Type: Sub-task
Reporter: Danny Cranmer


Create the \{{flink-connector-prometheus}} repo



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-09-25 Thread Yuepeng Pan
Hi all,




I and Fan Rui(CC’ed) created the FLIP-370[1] to support balanced tasks 
scheduling.




The current strategy of Flink to deploy tasks sometimes leads some 
TMs(TaskManagers) to have more tasks while others have fewer tasks, resulting 
in excessive resource utilization at some TMs that contain more tasks and 
becoming a bottleneck for the entire job processing. Developing strategies to 
achieve task load balancing for TMs and reducing job bottlenecks becomes very 
meaningful.




The raw design and discussions could be found in the Flink JIRA[2] and Google 
doc[3]. We really appreciate Zhu Zhu(CC’ed) for providing some valuable help 
and suggestions in advance. 




Please refer to the FLIP[1] document for more details about the proposed design 
and implementation. We welcome any feedback and opinions on this proposal.




[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling

[2] https://issues.apache.org/jira/browse/FLINK-31757

[3] 
https://docs.google.com/document/d/14WhrSNGBdcsRl3IK7CZO-RaZ5KXU2X1dWqxPEFr3iS8




Best,

Yuepeng Pan

[jira] [Created] (FLINK-33152) Prometheus Sink Connector - Integration tests

2023-09-25 Thread Lorenzo Nicora (Jira)
Lorenzo Nicora created FLINK-33152:
--

 Summary: Prometheus Sink Connector - Integration tests
 Key: FLINK-33152
 URL: https://issues.apache.org/jira/browse/FLINK-33152
 Project: Flink
  Issue Type: Sub-task
Reporter: Lorenzo Nicora


Integration tests against containerised Prometheus



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-25 Thread Piotr Nowojski
Hi Jarl and Dong,

I'm a bit confused about the difference between the two competing options.
Could one of you elaborate what's the difference between:
> 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides
> the effect of the previous invocation (if any) of
> `#setIsProcessingBacklog(true)` on the given source instance.

and

> 2) The semantics of `#setIsProcessingBacklog(false)` is that the given
> source instance will have watermarkLag=false.

?

Best,
Piotrek

czw., 21 wrz 2023 o 15:28 Dong Lin  napisał(a):

> Hi all,
>
> Jark and I discussed this FLIP offline and I will summarize our discussion
> below. It would be great if you could provide your opinion of the proposed
> options.
>
> Regarding the target use-cases:
> - We both agreed that MySQL CDC should have backlog=true when watermarkLag
> is large during the binlog phase.
> - Dong argued that other streaming sources with watermarkLag defined (e.g.
> Kafka) should also have backlog=true when watermarkLag is large. The
> pros/cons discussion below assume this use-case needs to be supported.
>
> The 1st option is what is currently proposed in FLIP-328, with the
> following key characteristics:
> 1) There is one job-level config (i.e.
> pipeline.backlog.watermark-lag-threshold) that applies to all sources with
> watermarkLag metric defined.
> 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides
> the effect of the previous invocation (if any) of
> `#setIsProcessingBacklog(true)` on the given source instance.
>
> The 2nd option is what Jark proposed in this email thread, with the
> following key characteristics:
> 1) Add source-specific config (both Java API and SQL source property) to
> every source for which we want to set backlog status based on the
> watermarkLag metric. For example, we might add separate Java APIs
> `#setWatermarkLagThreshold`  for MySQL CDC source, HybridSource,
> KafkaSource, PulsarSource etc.
> 2) The semantics of `#setIsProcessingBacklog(false)` is that the given
> source instance will have watermarkLag=false.
>
> Here are the key pros/cons of these two options.
>
> Cons of the 1st option:
> 1) The semantics of `#setIsProcessingBacklog(false)` is harder to
> understand for Flink operator developers than the corresponding semantics
> in option-2.
>
> Cons of the  2nd option:
> 1) More work for end-users. For a job with multiple sources that need to be
> configured with a watermark lag threshold, users need to specify multiple
> configs (one for each source) instead of specifying one job-level config.
>
> 2) More work for Flink operator developers. Overall there are more public
> APIs (one Java API and one SQL property for each source that needs to
> determine backlog based on watermark) exposed to end users. This also adds
> more burden for the Flink community to maintain these APIs.
>
> 3) It would be hard (e.g. require backward incompatible API change) to
> extend the Flink runtime to support job-level config to set watermark
> strategy in the future (e.g. support the
> pipeline.backlog.watermark-lag-threshold in option-1). This is because an
> existing source operator's code might have hardcoded an invocation of
> `#setIsProcessingBacklog(false)`, which means the backlog status must be
> set to true, which prevents Flink runtime from setting backlog=true when a
> new strategy is triggered.
>
> Overall, I am still inclined to choose option-1 because it is more
> extensible and simpler to use in the long term when we want to support/use
> multiple sources whose backlog status can change based on the watermark
> lag. While option-1's `#setIsProcessingBacklog` is a bit harder to
> understand than option-2, I think this overhead/cost is worthwhile as it
> makes end-users' life easier in the long term.
>
> Jark: thank you for taking the time to review this FLIP. Please feel free
> to comment if I missed anything in the pros/cons above.
>
> Jark and I have not reached agreement on which option is better. It will be
> really helpful if we can get more comments on these options.
>
> Thanks,
> Dong
>
>
> On Tue, Sep 19, 2023 at 11:26 AM Dong Lin  wrote:
>
> > Hi Jark,
> >
> > Thanks for the reply. Please see my comments inline.
> >
> > On Tue, Sep 19, 2023 at 10:12 AM Jark Wu  wrote:
> >
> >> Hi Dong,
> >>
> >> Sorry for the late reply.
> >>
> >> > The rationale is that if there is any strategy that is triggered and
> >> says
> >> > backlog=true, then job's backlog should be true. Otherwise, the job's
> >> > backlog status is false.
> >>
> >> I'm quite confused about this. Does that mean, if the source is in the
> >> changelog phase, the source has to continuously invoke
> >> "setIsProcessingBacklog(true)" (in an infinite loop?). Otherwise,
> >> the job's backlog status would be set to false by the framework?
> >>
> >
> > No, the source would not have to continuously invoke
> > setIsProcessingBacklog(true) in an infinite loop.
> >
> > Actually, I am not very sure why there is confusion that "

Re: [ANNOUNCE] Release 1.18.0, release candidate #0

2023-09-25 Thread Zakelly Lan
Hi Jing and everyone,

I have conducted three rounds of benchmarking with Java11, comparing
release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The
results are attached[3]. Most of the tests show no obvious regression.
However, I did observe significant change in several tests. Upon
reviewing the historical results from the previous pipeline, I also
discovered a substantial variance in those tests, as shown in the
timeline pictures included in the sheet[3]. I believe this variance
has existed for a long time and requires further investigation, and
fully measuring the variance requires more rounds (15 or more). I
think for now it is not a blocker for release 1.18. WDYT?


Best,
Zakelly

[1] 
https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb
[2] 
https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
[3] 
https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing

On Sun, Sep 24, 2023 at 11:29 AM ConradJam  wrote:
>
> +1 for testing with Java 17
>
> Jing Ge  于2023年9月24日周日 09:40写道:
>
> > +1 for testing with Java 17 too. Thanks Zakelly for your effort!
> >
> > Best regards,
> > Jing
> >
> > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan  wrote:
> >
> > > Hi Jing,
> > >
> > > I agree we could wait for the result with Java 11. And it should be
> > > available next Monday.
> > > Additionally, I could also build a pipeline with Java 17 later since
> > > it is supported in 1.18[1].
> > >
> > >
> > > Best regards,
> > > Zakelly
> > >
> > > [1]
> > >
> > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53
> > >
> > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge 
> > > wrote:
> > > >
> > > > Hi Zakelly,
> > > >
> > > > Thanks for your effort and the update! Since Java 8 has been
> > > deprecated[1],
> > > > let's wait for the result with Java 11. It should be available after
> > the
> > > > weekend and there should be no big surprise. WDYT?
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > [1]
> > > >
> > >
> > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade
> > > >
> > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan 
> > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I want to provide an update on the benchmark results that I have been
> > > > > working on. After spending some time preparing the environment and
> > > > > adjusting the benchmark script, I finally got a comparison between
> > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the old
> > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8. The
> > > > > report is attached[3]. Note that the test has only run once on jdk8,
> > > > > so the impact of single-test fluctuations is not ruled out.
> > > > > Additionally, I have noticed some significant fluctuations in
> > specific
> > > > > tests when reviewing previous benchmark scores, which I have also
> > > > > noted in the report. Taking all of these factors into consideration,
> > I
> > > > > think there is no obvious regression in release 1.18 *for now*. More
> > > > > tests including the one on openjdk11 are on the way. Hope it does not
> > > > > delay the release procedure.
> > > > >
> > > > > Please let me know if you have any concerns.
> > > > >
> > > > >
> > > > > Best,
> > > > > Zakelly
> > > > >
> > > > > [1]
> > > > >
> > >
> > https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9
> > > > > [2]
> > > > >
> > >
> > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> > > > > [3]
> > > > >
> > >
> > https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing
> > > > >
> > > > > The new environment for benchmark:
> > > > > ECS on Aliyun
> > > > > CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (8 Core available)
> > > > > Memory: 64GB
> > > > > OS: Alibaba Cloud Linux 3.2104 LTS 64bit
> > > > > Kernel: 5.10.134-15.al8.x86_64
> > > > > OpenJDK8 version: 1.8.0_372
> > > > >
> > > > > On Thu, Sep 21, 2023 at 12:04 PM Yuxin Tan 
> > > wrote:
> > > > > >
> > > > > > Hi, Zakelly,
> > > > > >
> > > > > > No benchmark tests currently are affected by this issue. We
> > > > > > may add benchmarks to guard it later. Thanks.
> > > > > >
> > > > > > Best,
> > > > > > Yuxin
> > > > > >
> > > > > >
> > > > > > Zakelly Lan  于2023年9月21日周四 11:56写道:
> > > > > >
> > > > > > > Hi Jing,
> > > > > > >
> > > > > > > Sure, I will run the benchmark with this fix.
> > > > > > >
> > > > > > > Hi Yunxin,
> > > > > > >
> > > > > > > I'm not familiar with the hybrid shuffle. Is there any specific
> > > > > > > benchmark test that may be affected by this issue? I will pay
> > > special
> > > > > > > attention to it.
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Zakelly
> > > > > > >
> > > > > > > On

[jira] [Created] (FLINK-33153) Kafka using latest-offset maybe missing data

2023-09-25 Thread tanjialiang (Jira)
tanjialiang created FLINK-33153:
---

 Summary: Kafka using latest-offset maybe missing data
 Key: FLINK-33153
 URL: https://issues.apache.org/jira/browse/FLINK-33153
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: kafka-4.1.0
Reporter: tanjialiang


When Kafka start with the latest-offset strategy, it does not fetch the latest 
snapshot offset and specify it for consumption. Instead, it sets the 
startingOffset to -1 (KafkaPartitionSplit.LATEST_OFFSET, which makes 
currentOffset = -1, and call the KafkaConsumer's  seekToEnd API). The 
currentOffset is only set to the consumed offset + 1 when the task consumes 
data, and this currentOffset is stored in the state during checkpointing. If 
there are very few messages in Kafka and a partition has not consumed any data, 
and I stop the task with a savepoint, then write data to that partition, and 
start the task with the savepoint, the task will resume from the saved state. 
Due to the startingOffset in the state being -1, it will cause the task to miss 
the data that was written before the recovery point.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread xiangyu feng
Hi all,

I would like to start the vote for FLIP-362:  Support minimum resource
limitation[1].
This FLIP was discussed in this thread [2].

The vote will be open for at least 72 hours unless there is an objection or
insufficient votes.

Regards,
Xiangyu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
[2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4


Re: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Yangze Guo
+1 (binding)

Best,
Yangze Guo

On Mon, Sep 25, 2023 at 5:39 PM xiangyu feng  wrote:
>
> Hi all,
>
> I would like to start the vote for FLIP-362:  Support minimum resource
> limitation[1].
> This FLIP was discussed in this thread [2].
>
> The vote will be open for at least 72 hours unless there is an objection or
> insufficient votes.
>
> Regards,
> Xiangyu
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
> [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4


Re: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Ahmed Hamdy
Thanks for the proposal
+1(non-binding)
Best regards
Ahmed

On Mon, 25 Sep 2023, 13:35 Yangze Guo,  wrote:

> +1 (binding)
>
> Best,
> Yangze Guo
>
> On Mon, Sep 25, 2023 at 5:39 PM xiangyu feng  wrote:
> >
> > Hi all,
> >
> > I would like to start the vote for FLIP-362:  Support minimum resource
> > limitation[1].
> > This FLIP was discussed in this thread [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > insufficient votes.
> >
> > Regards,
> > Xiangyu
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
> > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
>


回复: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Chen Zhanghao
Thanks for driving this. +1 (non-binding)

Best,
Zhanghao Chen

发件人: xiangyu feng 
发送时间: 2023年9月25日 17:38
收件人: dev@flink.apache.org 
主题: [VOTE] FLIP-362: Support minimum resource limitation

Hi all,

I would like to start the vote for FLIP-362:  Support minimum resource
limitation[1].
This FLIP was discussed in this thread [2].

The vote will be open for at least 72 hours unless there is an objection or
insufficient votes.

Regards,
Xiangyu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
[2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4


[jira] [Created] (FLINK-33154) flink on k8s,An error occurred during consuming rocketmq

2023-09-25 Thread Monody (Jira)
Monody created FLINK-33154:
--

 Summary: flink on k8s,An error occurred during consuming rocketmq
 Key: FLINK-33154
 URL: https://issues.apache.org/jira/browse/FLINK-33154
 Project: Flink
  Issue Type: Technical Debt
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.1.0
 Environment: 
flink-kubernetes-operator:https://github.com/apache/flink-kubernetes-operator#current-api-version-v1beta1
rocketmq-flink:https://github.com/apache/rocketmq-flink
Reporter: Monody


The following error occurs when flink consumes rocketmq. The flink job is 
running on k8s, and the projects used are:
The projects used by flink to consume rocketmq are:
The flink job runs normally on yarn, and no abnormality is found on the 
rocketmq server. Why does this happen? and how to solve it?
!https://user-images.githubusercontent.com/47728686/265662530-231c500c-fd64-4679-9b0f-ff4a025dd766.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)
Yang Wang created FLINK-33155:
-

 Summary: Flink ResourceManager continuously fails to start TM 
container on YARN when Kerberos enabled
 Key: FLINK-33155
 URL: https://issues.apache.org/jira/browse/FLINK-33155
 Project: Flink
  Issue Type: Bug
Reporter: Yang Wang


When Kerberos enabled(with key tab) and after one day(the container token 
expired), Flink fails to create the TaskManager container on YARN due to the 
following exception.

 
{code:java}
2023-09-25 16:48:50,030 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: 
Container container_1695106898104_0003_01_69 was invalid. Diagnostics: 
[2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=, renewer=, realUser=, issueDate=1695196431487, 
maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in 
cacheorg.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for hadoop: HDFS_DELEGATION_TOKEN 
owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com,
 renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, 
sequenceNumber=12, masterKeyId=3) can't be found in cacheat 
org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)at 
org.apache.hadoop.ipc.Client.call(Client.java:1491)at 
org.apache.hadoop.ipc.Client.call(Client.java:1388)at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)at 
org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
at 
org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)at 
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)at 
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)at 
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)at 
java.security.AccessController.doPrivileged(Native Method)at 
javax.security.auth.Subject.doAs(Subject.java:422)at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
   at java.lang.Thread.run(Thread.java:750) {code}
The root cause might be that we are reading the delegation token from JM local 
file[1]. It will expire after one day. When the old TaskManager container 
crashes and ResourceManager tries to create a new one, th

Re: [ANNOUNCE] Release 1.18.0, release candidate #0

2023-09-25 Thread Jing Ge
Thanks Zakelly for the update! Appreciate it!

@Piotr Nowojski  If you do not have any other
concerns, I will move forward to create 1.18 rc1 and start voting. WDYT?

Best regards,
Jing

On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan  wrote:

> Hi Jing and everyone,
>
> I have conducted three rounds of benchmarking with Java11, comparing
> release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The
> results are attached[3]. Most of the tests show no obvious regression.
> However, I did observe significant change in several tests. Upon
> reviewing the historical results from the previous pipeline, I also
> discovered a substantial variance in those tests, as shown in the
> timeline pictures included in the sheet[3]. I believe this variance
> has existed for a long time and requires further investigation, and
> fully measuring the variance requires more rounds (15 or more). I
> think for now it is not a blocker for release 1.18. WDYT?
>
>
> Best,
> Zakelly
>
> [1]
> https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb
> [2]
> https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> [3]
> https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing
>
> On Sun, Sep 24, 2023 at 11:29 AM ConradJam  wrote:
> >
> > +1 for testing with Java 17
> >
> > Jing Ge  于2023年9月24日周日 09:40写道:
> >
> > > +1 for testing with Java 17 too. Thanks Zakelly for your effort!
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan 
> wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > I agree we could wait for the result with Java 11. And it should be
> > > > available next Monday.
> > > > Additionally, I could also build a pipeline with Java 17 later since
> > > > it is supported in 1.18[1].
> > > >
> > > >
> > > > Best regards,
> > > > Zakelly
> > > >
> > > > [1]
> > > >
> > >
> https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53
> > > >
> > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge 
> > > > wrote:
> > > > >
> > > > > Hi Zakelly,
> > > > >
> > > > > Thanks for your effort and the update! Since Java 8 has been
> > > > deprecated[1],
> > > > > let's wait for the result with Java 11. It should be available
> after
> > > the
> > > > > weekend and there should be no big surprise. WDYT?
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade
> > > > >
> > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan <
> zakelly@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I want to provide an update on the benchmark results that I have
> been
> > > > > > working on. After spending some time preparing the environment
> and
> > > > > > adjusting the benchmark script, I finally got a comparison
> between
> > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the
> old
> > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8.
> The
> > > > > > report is attached[3]. Note that the test has only run once on
> jdk8,
> > > > > > so the impact of single-test fluctuations is not ruled out.
> > > > > > Additionally, I have noticed some significant fluctuations in
> > > specific
> > > > > > tests when reviewing previous benchmark scores, which I have also
> > > > > > noted in the report. Taking all of these factors into
> consideration,
> > > I
> > > > > > think there is no obvious regression in release 1.18 *for now*.
> More
> > > > > > tests including the one on openjdk11 are on the way. Hope it
> does not
> > > > > > delay the release procedure.
> > > > > >
> > > > > > Please let me know if you have any concerns.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Zakelly
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > >
> https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9
> > > > > > [2]
> > > > > >
> > > >
> > >
> https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> > > > > > [3]
> > > > > >
> > > >
> > >
> https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing
> > > > > >
> > > > > > The new environment for benchmark:
> > > > > > ECS on Aliyun
> > > > > > CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (8 Core
> available)
> > > > > > Memory: 64GB
> > > > > > OS: Alibaba Cloud Linux 3.2104 LTS 64bit
> > > > > > Kernel: 5.10.134-15.al8.x86_64
> > > > > > OpenJDK8 version: 1.8.0_372
> > > > > >
> > > > > > On Thu, Sep 21, 2023 at 12:04 PM Yuxin Tan <
> tanyuxinw...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > Hi, Zakelly,
> > > > > > >
> > > > > > > No benchmark tests currently are affected by this issue. We
> > > > > > > may add benchmarks to guard it later. Thanks.
> > > > > > >
> > > > > > > Best

Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-09-25 Thread Lincoln Lee
Hi Zhanghao,

Thanks for your explanation!

- For question 2, I can accept that the user must provide a pk when
modifying the parallelism of the cdc source.

- For question 1&3 and additional concerns, if there is no more general or
discussable solution at the time, I agree to address part of the tuning
requirements in current manner.

Thanks again for moving this FLIP forward!

Best,
Lincoln Lee


Chen Zhanghao  于2023年9月25日周一 12:04写道:

> Hi Lincoln,
>
> Thanks for the comments.
>
> - For concerns #1, I agree that we do not always produce optimal plan for
> both cases. However, it is difficult to do so and non-trivial complexity is
> expected. On the other hand, although our proposal generates a sub-optimal
> plan when the bottleneck is on the aggregate operator, it still provides
> more flexibility for performance tuning. Therefore, I think we can
> implement it in the straightforward way first, WDYT?
>
> - For concerns #2, I'd like to clarify a bit: exception will only be
> thrown i.f.f. the source may produce delete/update messages AND no primary
> key specified AND the source parallelism is different from the default
> parallelism. So for CDC without pk, we're still good if source parallelism
> is not specified.
>
> - For concerns #3, at the current point, I think making the name more
> specific is better as no other future use cases can be previsioned now.
> Since this is an internal API, we are free to refactor it later if needed.
>
> - For concerns about adaptive scheduler, the problems you mentioned do
> exist, but I don't think it relevant here. The planner may encode some
> hints to help the scheduler, but afterall, those efforts should be
> initiated on the scheduler side.
>
> Best,
> Zhanghao Chen
> 
> 发件人: Lincoln Lee 
> 发送时间: 2023年9月22日 23:19
> 收件人: dev@flink.apache.org 
> 主题: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL
> Sources
>
> Hi Zhanghao,
>
> Thanks for the FLIP and discussion!  Hope this reply isn't too late : )
> Firstly I'm fully agreed with the motivation of this FLIP and the value for
> the users, but there are a few things we should consider(please correct me
> if I'm misunderstanding):
>
> *1.  *It seems that the current solution only takes care of part of the
> requirement, the need to set source's parallelism may be different in
> different jobs,  for example, consider the following two job topologies(one
> {} simply represents a vertex):
> a. {source -> calc -> sink}
>
> b. {source -> calc} -> {aggregate} -> {sink}
>
> For job a, if there is a bottleneck in calc operator, but source
> parallelism cannot be scaled up (e.g., limited by kafka's partition
> number), so the calc operator cannot be scaled up to achieve higher
> throughput because the operators in source vertex are chained together,
> then current solution is reasonable (break the chain, add a shuffle).
>
> But for job b, if the bottleneck is the aggregate operator (not calc), it's
> more likely be better to scale up the aggregate operator/vertex and without
> breaking the {source -> calc} chain, as this will incur additional shuffle
> cost.
> So if we decide to add this new feature, I would recommend that both cases
> be taken care of.
>
>
> 2. the assumption that a cdc source must have pk(primary key) may not be
> reasonable, for example, mysql cdc supports the case without pk(
>
> https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys
> ),
> so we can not just raise an error here.
>
>
> 3. for the new SourceTransformationWrapper I have some concerns about the
> future evolution, if we need to add support for other operators, do we
> continue to add new xxWrappers?
>
> I've also revisited the previous discussion on FLIP-146[1], there were no
> clear conclusions or good ideas about similar support issues for the source
> before, and I also noticed that the new capability to change per-vertex
> parallelism via rest api in 1.18 (part of FLIP-291[2][3], but there is
> actually an issue about sql job's parallelism change which may require a
> hash shuffle to ensure the order of update stream, this needs to be
> followed up in FLIP-291, a jira will be created later).  So perhaps, we
> need to think about it more (the next version is not yet launched, so we
> still have time)
>
> [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [3] https://issues.apache.org/jira/browse/FLINK-31471
>
>
> Best,
> Lincoln Lee
>
>
> Chen Zhanghao  于2023年9月22日周五 16:00写道:
>
> > Thanks to everyone who participated in the discussion here. If no further
> > questions/concerns are raised, we'll start voting next Monday afternoon
> > (GMT+8).
> >
> > Best,
> > Zhanghao Chen
> > 
> > 发件人: Jane Chan 
> > 发送时间: 2023年9月22日 15:35
> > 收件人: dev@flink.apache.org 
> > 主题: R

Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-09-25 Thread Jing Ge
 Hi Zhanghao,

Thanks for driving the FLIP. This is a nice feature users are looking for.
>From users' perspective, would you like to add explicit description about
any potential(or none) compatibility issues if users want to use this new
feature and start existing jobs with savepoints or checkpoints?

Best regards,
Jing

On Sun, Sep 24, 2023 at 9:05 PM Chen Zhanghao 
wrote:

> Hi Lincoln,
>
> Thanks for the comments.
>
> - For concerns #1, I agree that we do not always produce optimal plan for
> both cases. However, it is difficult to do so and non-trivial complexity is
> expected. On the other hand, although our proposal generates a sub-optimal
> plan when the bottleneck is on the aggregate operator, it still provides
> more flexibility for performance tuning. Therefore, I think we can
> implement it in the straightforward way first, WDYT?
>
> - For concerns #2, I'd like to clarify a bit: exception will only be
> thrown i.f.f. the source may produce delete/update messages AND no primary
> key specified AND the source parallelism is different from the default
> parallelism. So for CDC without pk, we're still good if source parallelism
> is not specified.
>
> - For concerns #3, at the current point, I think making the name more
> specific is better as no other future use cases can be previsioned now.
> Since this is an internal API, we are free to refactor it later if needed.
>
> - For concerns about adaptive scheduler, the problems you mentioned do
> exist, but I don't think it relevant here. The planner may encode some
> hints to help the scheduler, but afterall, those efforts should be
> initiated on the scheduler side.
>
> Best,
> Zhanghao Chen
> 
> 发件人: Lincoln Lee 
> 发送时间: 2023年9月22日 23:19
> 收件人: dev@flink.apache.org 
> 主题: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL
> Sources
>
> Hi Zhanghao,
>
> Thanks for the FLIP and discussion!  Hope this reply isn't too late : )
> Firstly I'm fully agreed with the motivation of this FLIP and the value for
> the users, but there are a few things we should consider(please correct me
> if I'm misunderstanding):
>
> *1.  *It seems that the current solution only takes care of part of the
> requirement, the need to set source's parallelism may be different in
> different jobs,  for example, consider the following two job topologies(one
> {} simply represents a vertex):
> a. {source -> calc -> sink}
>
> b. {source -> calc} -> {aggregate} -> {sink}
>
> For job a, if there is a bottleneck in calc operator, but source
> parallelism cannot be scaled up (e.g., limited by kafka's partition
> number), so the calc operator cannot be scaled up to achieve higher
> throughput because the operators in source vertex are chained together,
> then current solution is reasonable (break the chain, add a shuffle).
>
> But for job b, if the bottleneck is the aggregate operator (not calc), it's
> more likely be better to scale up the aggregate operator/vertex and without
> breaking the {source -> calc} chain, as this will incur additional shuffle
> cost.
> So if we decide to add this new feature, I would recommend that both cases
> be taken care of.
>
>
> 2. the assumption that a cdc source must have pk(primary key) may not be
> reasonable, for example, mysql cdc supports the case without pk(
>
> https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys
> ),
> so we can not just raise an error here.
>
>
> 3. for the new SourceTransformationWrapper I have some concerns about the
> future evolution, if we need to add support for other operators, do we
> continue to add new xxWrappers?
>
> I've also revisited the previous discussion on FLIP-146[1], there were no
> clear conclusions or good ideas about similar support issues for the source
> before, and I also noticed that the new capability to change per-vertex
> parallelism via rest api in 1.18 (part of FLIP-291[2][3], but there is
> actually an issue about sql job's parallelism change which may require a
> hash shuffle to ensure the order of update stream, this needs to be
> followed up in FLIP-291, a jira will be created later).  So perhaps, we
> need to think about it more (the next version is not yet launched, so we
> still have time)
>
> [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [3] https://issues.apache.org/jira/browse/FLINK-31471
>
>
> Best,
> Lincoln Lee
>
>
> Chen Zhanghao  于2023年9月22日周五 16:00写道:
>
> > Thanks to everyone who participated in the discussion here. If no further
> > questions/concerns are raised, we'll start voting next Monday afternoon
> > (GMT+8).
> >
> > Best,
> > Zhanghao Chen
> > 
> > 发件人: Jane Chan 
> > 发送时间: 2023年9月22日 15:35
> > 收件人: dev@flink.apache.org 
> > 主题: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/S

Re: [Discuss] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Jing Ge
Hi Yangze,

Thanks for the clarification. The example of two batch jobs team up with
one streaming job is interesting.

Best regards,
Jing

On Wed, Sep 20, 2023 at 7:19 PM Yangze Guo  wrote:

> Thanks for the comments, Jing.
>
> > Will the minimum resource configuration also take effect for streaming
> jobs in application mode?
> > Since it is not recommended to configure slotmanager.number-of-slots.max
> for streaming jobs, does it make sense to disable it for common streaming
> jobs? At least disable the check for avoiding the oscillation?
>
> Yes. The minimum resource configuration will only disabled in
> standalone cluster atm. I agree it make sense to disable it for a pure
> streaming job, however:
> - By default, the minimum resource is configured to 0. If users do not
> proactively set it, either the oscillation check or the minimum
> restriction can be considered as disabled.
> - The minimum resource is a cluster-level configuration rather than a
> job-level configuration. If a user has an application with two batch
> jobs preceding the streaming job, they may also require this
> configuration to accelerate the execution of batch jobs.
>
> WDYT?
>
> Best,
> Yangze Guo
>
> On Thu, Sep 21, 2023 at 4:49 AM Jing Ge 
> wrote:
> >
> > Hi Xiangyu,
> >
> > Thanks for driving it! There is one thing I am not really sure if I
> > understand you correctly.
> >
> > According to the FLIP: "The minimum resource limitation will be
> implemented
> > in the DefaultResourceAllocationStrategy of FineGrainedSlotManager.
> >
> > Each time when SlotManager needs to reconcile the cluster resources or
> > fulfill job resource requirements, the DefaultResourceAllocationStrategy
> > will check if the minimum resource requirement has been fulfilled. If it
> is
> > not, DefaultResourceAllocationStrategy will request new
> PendingTaskManagers
> > and FineGrainedSlotManager will allocate new worker resources
> accordingly."
> >
> > "To avoid this oscillation, we need to check the worker number derived
> from
> > minimum and maximum resource configuration is consistent before starting
> > SlotManager."
> >
> > Will the minimum resource configuration also take effect for streaming
> jobs
> > in application mode? Since it is not recommended to
> > configure slotmanager.number-of-slots.max for streaming jobs, does it
> make
> > sense to disable it for common streaming jobs? At least disable the check
> > for avoiding the oscillation?
> >
> > Best regards,
> > Jing
> >
> >
> > On Tue, Sep 19, 2023 at 4:58 PM Chen Zhanghao  >
> > wrote:
> >
> > > Thanks for driving this, Xiangyu. We use Session clusters for quick SQL
> > > debugging internally, and found cold-start job submission slow due to
> lack
> > > of the exact minimum resource reservation feature proposed here. This
> > > should improve the experience a lot for running short lived-jobs in
> session
> > > clusters.
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > 发件人: Yangze Guo 
> > > 发送时间: 2023年9月19日 13:10
> > > 收件人: xiangyu feng 
> > > 抄送: dev@flink.apache.org 
> > > 主题: Re: [Discuss] FLIP-362: Support minimum resource limitation
> > >
> > > Thanks for driving this @Xiangyu. This is a feature that many users
> > > have requested for a long time. +1 for the overall proposal.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Sep 19, 2023 at 11:48 AM xiangyu feng 
> > > wrote:
> > > >
> > > > Hi Devs,
> > > >
> > > > I'm opening this thread to discuss FLIP-362: Support minimum resource
> > > limitation. The design doc can be found at:
> > > > FLIP-362: Support minimum resource limitation
> > > >
> > > > Currently, the Flink cluster only requests Task Managers (TMs) when
> > > there is a resource requirement, and idle TMs are released after a
> certain
> > > period of time. However, in certain scenarios, such as running short
> > > lived-jobs in session cluster and scheduling batch jobs stage by
> stage, we
> > > need to improve the efficiency of job execution by maintaining a
> certain
> > > number of available workers in the cluster all the time.
> > > >
> > > > After discussed with Yangze, we introduced this new feature. The new
> > > added public options and proposed changes are described in this FLIP.
> > > >
> > > > Looking forward to your feedback, thanks.
> > > >
> > > > Best regards,
> > > > Xiangyu
> > > >
> > >
>


Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2023-09-25 Thread Jing Ge
+1(binding) Thanks!

Best Regards,
Jing

On Sun, Sep 24, 2023 at 10:30 PM Shammon FY  wrote:

> Hi devs,
>
> Thanks for all the feedback on FLIP-314: Support Customized Job Lineage
> Listener [1] in thread [2].
>
> I would like to start a vote for it. The vote will be opened for at least
> 72 hours unless there is an objection or insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
> [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
>
> Best,
> Shammon FY
>


Re: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Jing Ge
+1(binding) Thanks!

Best regards,
Jing

On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao 
wrote:

> Thanks for driving this. +1 (non-binding)
>
> Best,
> Zhanghao Chen
> 
> 发件人: xiangyu feng 
> 发送时间: 2023年9月25日 17:38
> 收件人: dev@flink.apache.org 
> 主题: [VOTE] FLIP-362: Support minimum resource limitation
>
> Hi all,
>
> I would like to start the vote for FLIP-362:  Support minimum resource
> limitation[1].
> This FLIP was discussed in this thread [2].
>
> The vote will be open for at least 72 hours unless there is an objection or
> insufficient votes.
>
> Regards,
> Xiangyu
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
> [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
>


Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data

2023-09-25 Thread Venkatakrishnan Sowrirajan
+1 (non-binding)

On Sun, Sep 24, 2023, 6:49 PM Xintong Song  wrote:

> +1 (binding)
>
> Best,
>
> Xintong
>
>
>
> On Sat, Sep 23, 2023 at 10:16 PM Yuepeng Pan 
> wrote:
>
> > +1(non-binding), thank you for driving this proposal.
> >
> > Best,
> > Yuepeng Pan.
> > At 2023-09-22 14:07:45, "Dong Lin"  wrote:
> > >Hi all,
> > >
> > >We would like to start the vote for FLIP-327: Support switching from
> batch
> > >to stream mode to improve throughput when processing backlog data [1].
> > This
> > >FLIP was discussed in this thread [2].
> > >
> > >The vote will be open until at least Sep 27th (at least 72
> > >hours), following the consensus voting process.
> > >
> > >Cheers,
> > >Xuannan and Dong
> > >
> > >[1]
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-327*3A*Support*switching*from*batch*to*stream*mode*to*improve*throughput*when*processing*backlog*data__;JSsrKysrKysrKysrKysr!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKResPPc7$
> > >[2]
> https://urldefense.com/v3/__https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n__;!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKZEAz9yp$
> >
>


Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data

2023-09-25 Thread Ahmed Hamdy
+1(non binding)
Best regards
Ahmed Hamdy

On Mon, 25 Sep 2023, 19:57 Venkatakrishnan Sowrirajan, 
wrote:

> +1 (non-binding)
>
> On Sun, Sep 24, 2023, 6:49 PM Xintong Song  wrote:
>
> > +1 (binding)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Sat, Sep 23, 2023 at 10:16 PM Yuepeng Pan 
> > wrote:
> >
> > > +1(non-binding), thank you for driving this proposal.
> > >
> > > Best,
> > > Yuepeng Pan.
> > > At 2023-09-22 14:07:45, "Dong Lin"  wrote:
> > > >Hi all,
> > > >
> > > >We would like to start the vote for FLIP-327: Support switching from
> > batch
> > > >to stream mode to improve throughput when processing backlog data [1].
> > > This
> > > >FLIP was discussed in this thread [2].
> > > >
> > > >The vote will be open until at least Sep 27th (at least 72
> > > >hours), following the consensus voting process.
> > > >
> > > >Cheers,
> > > >Xuannan and Dong
> > > >
> > > >[1]
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-327*3A*Support*switching*from*batch*to*stream*mode*to*improve*throughput*when*processing*backlog*data__;JSsrKysrKysrKysrKysr!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKResPPc7$
> > > >[2]
> >
> https://urldefense.com/v3/__https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n__;!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKZEAz9yp$
> > >
> >
>


Re: [ANNOUNCE] Release 1.18.0, release candidate #0

2023-09-25 Thread Ryan Skraba
Hello!  There's a security fix that probably should be applied to 1.18
in the next RC1 : https://github.com/apache/flink/pull/23461 (bump to
snappy-java).  Do you think this would be possible to include?

All my best, Ryan

[1]: https://issues.apache.org/jira/browse/FLINK-33149 "Bump
snappy-java to 1.1.10.4"



On Mon, Sep 25, 2023 at 3:54 PM Jing Ge  wrote:
>
> Thanks Zakelly for the update! Appreciate it!
>
> @Piotr Nowojski  If you do not have any other
> concerns, I will move forward to create 1.18 rc1 and start voting. WDYT?
>
> Best regards,
> Jing
>
> On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan  wrote:
>
> > Hi Jing and everyone,
> >
> > I have conducted three rounds of benchmarking with Java11, comparing
> > release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The
> > results are attached[3]. Most of the tests show no obvious regression.
> > However, I did observe significant change in several tests. Upon
> > reviewing the historical results from the previous pipeline, I also
> > discovered a substantial variance in those tests, as shown in the
> > timeline pictures included in the sheet[3]. I believe this variance
> > has existed for a long time and requires further investigation, and
> > fully measuring the variance requires more rounds (15 or more). I
> > think for now it is not a blocker for release 1.18. WDYT?
> >
> >
> > Best,
> > Zakelly
> >
> > [1]
> > https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb
> > [2]
> > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> > [3]
> > https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing
> >
> > On Sun, Sep 24, 2023 at 11:29 AM ConradJam  wrote:
> > >
> > > +1 for testing with Java 17
> > >
> > > Jing Ge  于2023年9月24日周日 09:40写道:
> > >
> > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort!
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan 
> > wrote:
> > > >
> > > > > Hi Jing,
> > > > >
> > > > > I agree we could wait for the result with Java 11. And it should be
> > > > > available next Monday.
> > > > > Additionally, I could also build a pipeline with Java 17 later since
> > > > > it is supported in 1.18[1].
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Zakelly
> > > > >
> > > > > [1]
> > > > >
> > > >
> > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53
> > > > >
> > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge 
> > > > > wrote:
> > > > > >
> > > > > > Hi Zakelly,
> > > > > >
> > > > > > Thanks for your effort and the update! Since Java 8 has been
> > > > > deprecated[1],
> > > > > > let's wait for the result with Java 11. It should be available
> > after
> > > > the
> > > > > > weekend and there should be no big surprise. WDYT?
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade
> > > > > >
> > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan <
> > zakelly@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I want to provide an update on the benchmark results that I have
> > been
> > > > > > > working on. After spending some time preparing the environment
> > and
> > > > > > > adjusting the benchmark script, I finally got a comparison
> > between
> > > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the
> > old
> > > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8.
> > The
> > > > > > > report is attached[3]. Note that the test has only run once on
> > jdk8,
> > > > > > > so the impact of single-test fluctuations is not ruled out.
> > > > > > > Additionally, I have noticed some significant fluctuations in
> > > > specific
> > > > > > > tests when reviewing previous benchmark scores, which I have also
> > > > > > > noted in the report. Taking all of these factors into
> > consideration,
> > > > I
> > > > > > > think there is no obvious regression in release 1.18 *for now*.
> > More
> > > > > > > tests including the one on openjdk11 are on the way. Hope it
> > does not
> > > > > > > delay the release procedure.
> > > > > > >
> > > > > > > Please let me know if you have any concerns.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Zakelly
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > >
> > > >
> > https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9
> > > > > > > [2]
> > > > > > >
> > > > >
> > > >
> > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> > > > > > > [3]
> > > > > > >
> > > > >
> > > >
> > https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing
> > > > > > >
> > > > > > >

Re: [ANNOUNCE] Release 1.18.0, release candidate #0

2023-09-25 Thread Jing Ge
Hi Ryan,

Thanks for reaching out. It is fine to include it but we need to wait until
the CI passes. I am not sure how long it will take, since there seems to be
some infra issues.

Best regards,
Jing

On Mon, Sep 25, 2023 at 11:34 AM Ryan Skraba 
wrote:

> Hello!  There's a security fix that probably should be applied to 1.18
> in the next RC1 : https://github.com/apache/flink/pull/23461 (bump to
> snappy-java).  Do you think this would be possible to include?
>
> All my best, Ryan
>
> [1]: https://issues.apache.org/jira/browse/FLINK-33149 "Bump
> snappy-java to 1.1.10.4"
>
>
>
> On Mon, Sep 25, 2023 at 3:54 PM Jing Ge 
> wrote:
> >
> > Thanks Zakelly for the update! Appreciate it!
> >
> > @Piotr Nowojski  If you do not have any other
> > concerns, I will move forward to create 1.18 rc1 and start voting. WDYT?
> >
> > Best regards,
> > Jing
> >
> > On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan 
> wrote:
> >
> > > Hi Jing and everyone,
> > >
> > > I have conducted three rounds of benchmarking with Java11, comparing
> > > release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The
> > > results are attached[3]. Most of the tests show no obvious regression.
> > > However, I did observe significant change in several tests. Upon
> > > reviewing the historical results from the previous pipeline, I also
> > > discovered a substantial variance in those tests, as shown in the
> > > timeline pictures included in the sheet[3]. I believe this variance
> > > has existed for a long time and requires further investigation, and
> > > fully measuring the variance requires more rounds (15 or more). I
> > > think for now it is not a blocker for release 1.18. WDYT?
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > [1]
> > >
> https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb
> > > [2]
> > >
> https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711
> > > [3]
> > >
> https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing
> > >
> > > On Sun, Sep 24, 2023 at 11:29 AM ConradJam 
> wrote:
> > > >
> > > > +1 for testing with Java 17
> > > >
> > > > Jing Ge  于2023年9月24日周日 09:40写道:
> > > >
> > > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort!
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan  >
> > > wrote:
> > > > >
> > > > > > Hi Jing,
> > > > > >
> > > > > > I agree we could wait for the result with Java 11. And it should
> be
> > > > > > available next Monday.
> > > > > > Additionally, I could also build a pipeline with Java 17 later
> since
> > > > > > it is supported in 1.18[1].
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Zakelly
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > >
> https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53
> > > > > >
> > > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge
> 
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Zakelly,
> > > > > > >
> > > > > > > Thanks for your effort and the update! Since Java 8 has been
> > > > > > deprecated[1],
> > > > > > > let's wait for the result with Java 11. It should be available
> > > after
> > > > > the
> > > > > > > weekend and there should be no big surprise. WDYT?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jing
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade
> > > > > > >
> > > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan <
> > > zakelly@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I want to provide an update on the benchmark results that I
> have
> > > been
> > > > > > > > working on. After spending some time preparing the
> environment
> > > and
> > > > > > > > adjusting the benchmark script, I finally got a comparison
> > > between
> > > > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before
> the
> > > old
> > > > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on
> openjdk8.
> > > The
> > > > > > > > report is attached[3]. Note that the test has only run once
> on
> > > jdk8,
> > > > > > > > so the impact of single-test fluctuations is not ruled out.
> > > > > > > > Additionally, I have noticed some significant fluctuations in
> > > > > specific
> > > > > > > > tests when reviewing previous benchmark scores, which I have
> also
> > > > > > > > noted in the report. Taking all of these factors into
> > > consideration,
> > > > > I
> > > > > > > > think there is no obvious regression in release 1.18 *for
> now*.
> > > More
> > > > > > > > tests including the one on openjdk11 are on the way. Hope it
> > > does not
> > > > > > > > delay the release procedure.
> > > > > > > >
> > > > > > > > Please let me know if you have

RE: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released

2023-09-25 Thread frans.king
Hi Martijn.

Thanks for this.   Should there also be docker images available?  
https://hub.docker.com/r/apache/flink-statefun/tags goes up to 3.2.0. 

Frans

-Original Message-
From: Martijn Visser  
Sent: Tuesday, September 19, 2023 11:37 AM
To: dev@flink.apache.org; user ; user-zh 
; n...@flink.apache.org; annou...@apache.org
Subject: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released

Stateful Functions is a cross-platform stack for building Stateful Serverless 
applications, making it radically simpler to develop scalable, consistent, and 
elastic distributed applications. This new release upgrades the Flink runtime 
to 1.16.2.

Release highlight:
- Upgrade underlying Flink dependency to 1.16.2

Release blogpost:
https://flink.apache.org/2023/09/19/stateful-functions-3.3.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads/

Java SDK can be found at:
https://search.maven.org/artifact/org.apache.flink/statefun-sdk-java/3.3.0/jar

Python SDK can be found at:
https://pypi.org/project/apache-flink-statefun/

GoLang SDK can be found at:
https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.3.0

JavaScript SDK can be found at:
https://www.npmjs.com/package/apache-flink-statefun

Official Docker image for Flink Stateful Functions can be found at: 
https://hub.docker.com/r/apache/flink-statefun

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351276

We would like to thank all contributors of the Apache Flink community who made 
this release possible!

Regards,
Martijn Visser




Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-25 Thread Dong Lin
Hi Piotr,

Thanks for the comments. Let me try to explain it below.

Overall, the two competing options differ in how an invocation of
`#setIsProcessingBacklog(false)` affects the backlog status for the given
source (corresponding to the SplitEnumeratorContext instance on which this
method is invoked).

- With my approach, setIsProcessingBacklog(false) merely unsets effects of
any previous invocation of setIsProcessingBacklog(..) on the given source,
without necessarily forcing the source's backlog status to be false.
- With Jark’s approach, setIsProcessingBacklog(false) forces the source's
backlog status to be false.

There is no practical difference between these two options as of FLIP-309.
However, once we introduce additional strategy (e.g. job-level config) to
configure backlog status in FLIP-328, there will be tricky and important
differences between them.

More specifically, let’s say we want to introduce a job-level config such
as “”pipeline.backlog.watermark-lag-threshold” as mentioned in FLIP-328:

- With Jack’s approach, if MySQL CDC invokes setIsProcessingBacklog(false)
at the beginning of the “unbounded phase”, then that effectively means
isProcessingBacklog=false even if watermark lag exceeds the configured
threshold, preventing job-level config from taking effect during the
"unbounded phase".
- With my approach, even if MySQL CDC invokes setIsProcessingBacklog(false)
at the beginning of the “unbounded phase”, the source can still have
isProcessingBacklog=true when watermark lag is too high.

Would this clarify the difference between these two options?

Regards,
Dong


On Mon, Sep 25, 2023 at 5:15 PM Piotr Nowojski 
wrote:

> Hi Jarl and Dong,
>
> I'm a bit confused about the difference between the two competing options.
> Could one of you elaborate what's the difference between:
> > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides
> > the effect of the previous invocation (if any) of
> > `#setIsProcessingBacklog(true)` on the given source instance.
>
> and
>
> > 2) The semantics of `#setIsProcessingBacklog(false)` is that the given
> > source instance will have watermarkLag=false.
>
> ?
>
> Best,
> Piotrek
>
> czw., 21 wrz 2023 o 15:28 Dong Lin  napisał(a):
>
> > Hi all,
> >
> > Jark and I discussed this FLIP offline and I will summarize our
> discussion
> > below. It would be great if you could provide your opinion of the
> proposed
> > options.
> >
> > Regarding the target use-cases:
> > - We both agreed that MySQL CDC should have backlog=true when
> watermarkLag
> > is large during the binlog phase.
> > - Dong argued that other streaming sources with watermarkLag defined
> (e.g.
> > Kafka) should also have backlog=true when watermarkLag is large. The
> > pros/cons discussion below assume this use-case needs to be supported.
> >
> > The 1st option is what is currently proposed in FLIP-328, with the
> > following key characteristics:
> > 1) There is one job-level config (i.e.
> > pipeline.backlog.watermark-lag-threshold) that applies to all sources
> with
> > watermarkLag metric defined.
> > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides
> > the effect of the previous invocation (if any) of
> > `#setIsProcessingBacklog(true)` on the given source instance.
> >
> > The 2nd option is what Jark proposed in this email thread, with the
> > following key characteristics:
> > 1) Add source-specific config (both Java API and SQL source property) to
> > every source for which we want to set backlog status based on the
> > watermarkLag metric. For example, we might add separate Java APIs
> > `#setWatermarkLagThreshold`  for MySQL CDC source, HybridSource,
> > KafkaSource, PulsarSource etc.
> > 2) The semantics of `#setIsProcessingBacklog(false)` is that the given
> > source instance will have watermarkLag=false.
> >
> > Here are the key pros/cons of these two options.
> >
> > Cons of the 1st option:
> > 1) The semantics of `#setIsProcessingBacklog(false)` is harder to
> > understand for Flink operator developers than the corresponding semantics
> > in option-2.
> >
> > Cons of the  2nd option:
> > 1) More work for end-users. For a job with multiple sources that need to
> be
> > configured with a watermark lag threshold, users need to specify multiple
> > configs (one for each source) instead of specifying one job-level config.
> >
> > 2) More work for Flink operator developers. Overall there are more public
> > APIs (one Java API and one SQL property for each source that needs to
> > determine backlog based on watermark) exposed to end users. This also
> adds
> > more burden for the Flink community to maintain these APIs.
> >
> > 3) It would be hard (e.g. require backward incompatible API change) to
> > extend the Flink runtime to support job-level config to set watermark
> > strategy in the future (e.g. support the
> > pipeline.backlog.watermark-lag-threshold in option-1). This is because an
> > existing source operator's code might have hardcoded a

Re: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Shammon FY
+1(binding), thanks for the proposal.

Best,
Shammon FY

On Mon, Sep 25, 2023 at 11:20 PM Jing Ge  wrote:

> +1(binding) Thanks!
>
> Best regards,
> Jing
>
> On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao 
> wrote:
>
> > Thanks for driving this. +1 (non-binding)
> >
> > Best,
> > Zhanghao Chen
> > 
> > 发件人: xiangyu feng 
> > 发送时间: 2023年9月25日 17:38
> > 收件人: dev@flink.apache.org 
> > 主题: [VOTE] FLIP-362: Support minimum resource limitation
> >
> > Hi all,
> >
> > I would like to start the vote for FLIP-362:  Support minimum resource
> > limitation[1].
> > This FLIP was discussed in this thread [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > insufficient votes.
> >
> > Regards,
> > Xiangyu
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
> > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
> >
>


Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2023-09-25 Thread Feng Jin
+1(no-binding)


thanks for driving this proposal


Best,
Feng

On Mon, Sep 25, 2023 at 11:19 PM Jing Ge  wrote:

> +1(binding) Thanks!
>
> Best Regards,
> Jing
>
> On Sun, Sep 24, 2023 at 10:30 PM Shammon FY  wrote:
>
> > Hi devs,
> >
> > Thanks for all the feedback on FLIP-314: Support Customized Job Lineage
> > Listener [1] in thread [2].
> >
> > I would like to start a vote for it. The vote will be opened for at least
> > 72 hours unless there is an objection or insufficient votes.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
> > [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
> >
> > Best,
> > Shammon FY
> >
>


Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2023-09-25 Thread Leonard Xu
+1(binding)


Best,
Leonard

> On Sep 26, 2023, at 9:59 AM, Feng Jin  wrote:
> 
> +1(no-binding)
> 
> 
> thanks for driving this proposal
> 
> 
> Best,
> Feng
> 
> On Mon, Sep 25, 2023 at 11:19 PM Jing Ge  wrote:
> 
>> +1(binding) Thanks!
>> 
>> Best Regards,
>> Jing
>> 
>> On Sun, Sep 24, 2023 at 10:30 PM Shammon FY  wrote:
>> 
>>> Hi devs,
>>> 
>>> Thanks for all the feedback on FLIP-314: Support Customized Job Lineage
>>> Listener [1] in thread [2].
>>> 
>>> I would like to start a vote for it. The vote will be opened for at least
>>> 72 hours unless there is an objection or insufficient votes.
>>> 
>>> [1]
>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
>>> [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
>>> 
>>> Best,
>>> Shammon FY
>>> 
>> 



Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment

2023-09-25 Thread Dong Lin
Hi all,

Thanks for the review!

Becket and I discussed this FLIP offline and we agreed on several things
that need to be improved with this FLIP. I will summarize our discussion
with the problems and TODOs. We will update the FLIP and let you know once
the FLIP is ready for review again.

1) Investigate whether it is possible to update the existing GlobalWindows
in a backward-compatible way and re-use it for the same purpose
as EndOfStreamWindows, without introducing EndOfStreamWindows as a new
class.

Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger instance
which will not trigger window's computation even on end-of-inputs. We will
need to investigate its existing usage and see if we can re-use it in a
backward-compatible way.

2) Let JM know whether any operator in the upstream of the operator with
"isOutputOnEOF=true" will emit output via any side channel. The FLIP should
update the execution mode of those operators *only if* all outputs from
those operators are emitted only at the end of input.

More specifically, the upstream operator might involve a user-defined
operator that might emit output directly to an external service, where the
emission operation is not explicitly expressed as an operator's output edge
and thus not visible to JM. Similarly, it is also possible for the
user-defined operator to register a timer
via InternalTimerService#registerEventTimeTimer and emit output to an
external service inside Triggerable#onEventTime. There is a chance that
users still need related logic to output data in real-time, even if the
downstream operators have isOutputOnEOF=true.

One possible solution to address this problem is to add an extra
OperatorAttribute to specify whether this operator might output records in
such a way that does not go through operator's output (e.g. side output).
Then the JM can safely enable the runtime optimization currently described
in the FLIP when there is no such operator.

3) Create a follow-up FLIP that allows users to specify whether a source
with Boundedness=bounded should have isProcessingBacklog=true.

This capability would effectively introduce a 3rd strategy to set backlog
status (in addition to FLIP-309 and FLIP-328). It might be useful to note
that, even though the data in bounded sources are backlog data in most
practical use-cases, it is not necessarily true. For example, users might
want to start a Flink job to consume real-time data from a Kafka topic and
specify that the job stops after 24 hours, which means the source is
technically bounded while the data is fresh/real-time.

This capability is more generic and can cover more use-case than
EndOfStreamWindows. On the other hand, EndOfStreamWindows will still be
useful in cases where users already need to specify this window assigner in
a DataStream program, without bothering users to decide whether it is safe
to treat data in a bounded source as backlog data.


Regards,
Dong






On Mon, Sep 18, 2023 at 2:56 PM Yuxin Tan  wrote:

> Hi, Dong,
> Thanks for your efforts.
>
> +1 to this proposal,
> I believe this will improve the performance in some mixture circumstances
> of bounded and unbounded workloads.
>
> Best,
> Yuxin
>
>
> Xintong Song  于2023年9月18日周一 10:56写道:
>
> > Thanks for addressing my comments, Dong.
> >
> > LGTM.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Sat, Sep 16, 2023 at 3:34 PM Wencong Liu 
> wrote:
> >
> > > Hi Dong & Jinhao,
> > >
> > > Thanks for your clarification! +1
> > >
> > > Best regards,
> > > Wencong
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > At 2023-09-15 11:26:16, "Dong Lin"  wrote:
> > > >Hi Wencong,
> > > >
> > > >Thanks for your comments! Please see my reply inline.
> > > >
> > > >On Thu, Sep 14, 2023 at 12:30 PM Wencong Liu 
> > > wrote:
> > > >
> > > >> Dear Dong,
> > > >>
> > > >> I have thoroughly reviewed the proposal for FLIP-331 and believe it
> > > would
> > > >> be
> > > >> a valuable addition to Flink. However, I do have a few questions
> that
> > I
> > > >> would
> > > >> like to discuss:
> > > >>
> > > >>
> > > >> 1. The FLIP-331 proposed the EndOfStreamWindows that is implemented
> by
> > > >> TimeWindow with maxTimestamp = (Long.MAX_VALUE - 1), which naturally
> > > >> supports WindowedStream and AllWindowedStream to process all records
> > > >> belonging to a key in a 'global' window under both STREAMING and
> BATCH
> > > >> runtime execution mode.
> > > >>
> > > >>
> > > >> However, besides coGroup and keyBy().aggregate(), other operators on
> > > >> WindowedStream and AllWindowedStream, such as join/reduce, etc,
> > > currently
> > > >> are still implemented based on WindowOperator.
> > > >>
> > > >>
> > > >> In fact, these operators can also be implemented without using
> > > >> WindowOperator
> > > >> to prevent additional WindowAssigner#assignWindows or
> > > >> triggerContext#onElement
> > > >> invocation cost. Will there be plans to support these operators in
> the
> 

Re: [VOTE] FLIP-362: Support minimum resource limitation

2023-09-25 Thread Weihua Hu
+1 (binding)

Best,
Weihua


On Tue, Sep 26, 2023 at 10:00 AM Shammon FY  wrote:

> +1(binding), thanks for the proposal.
>
> Best,
> Shammon FY
>
> On Mon, Sep 25, 2023 at 11:20 PM Jing Ge 
> wrote:
>
> > +1(binding) Thanks!
> >
> > Best regards,
> > Jing
> >
> > On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao  >
> > wrote:
> >
> > > Thanks for driving this. +1 (non-binding)
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > 发件人: xiangyu feng 
> > > 发送时间: 2023年9月25日 17:38
> > > 收件人: dev@flink.apache.org 
> > > 主题: [VOTE] FLIP-362: Support minimum resource limitation
> > >
> > > Hi all,
> > >
> > > I would like to start the vote for FLIP-362:  Support minimum resource
> > > limitation[1].
> > > This FLIP was discussed in this thread [2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> objection
> > or
> > > insufficient votes.
> > >
> > > Regards,
> > > Xiangyu
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation
> > > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
> > >
> >
>


Re: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released

2023-09-25 Thread Martijn Visser
Hi Frans,

Good remark, I still need to provide the images to those who have
access to the Dockerhub, but I haven't been able to done that yet.
Hopefully I can do that at the end of the week.

Best regards,

Martijn

On Mon, Sep 25, 2023 at 2:04 PM  wrote:
>
> Hi Martijn.
>
> Thanks for this.   Should there also be docker images available?  
> https://hub.docker.com/r/apache/flink-statefun/tags goes up to 3.2.0.
>
> Frans
>
> -Original Message-
> From: Martijn Visser 
> Sent: Tuesday, September 19, 2023 11:37 AM
> To: dev@flink.apache.org; user ; user-zh 
> ; n...@flink.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released
>
> Stateful Functions is a cross-platform stack for building Stateful Serverless 
> applications, making it radically simpler to develop scalable, consistent, 
> and elastic distributed applications. This new release upgrades the Flink 
> runtime to 1.16.2.
>
> Release highlight:
> - Upgrade underlying Flink dependency to 1.16.2
>
> Release blogpost:
> https://flink.apache.org/2023/09/19/stateful-functions-3.3.0-release-announcement/
>
> The release is available for download at:
> https://flink.apache.org/downloads/
>
> Java SDK can be found at:
> https://search.maven.org/artifact/org.apache.flink/statefun-sdk-java/3.3.0/jar
>
> Python SDK can be found at:
> https://pypi.org/project/apache-flink-statefun/
>
> GoLang SDK can be found at:
> https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.3.0
>
> JavaScript SDK can be found at:
> https://www.npmjs.com/package/apache-flink-statefun
>
> Official Docker image for Flink Stateful Functions can be found at: 
> https://hub.docker.com/r/apache/flink-statefun
>
> The full release notes are available in Jira:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351276
>
> We would like to thank all contributors of the Apache Flink community who 
> made this release possible!
>
> Regards,
> Martijn Visser
>
>


回复: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-09-25 Thread Chen Zhanghao
Hi Jing,

I've updated Compatibility, Deprecation, and Migration Plan section to list all 
the potential compatibility issues for users who want to upgrade an existing 
job to use this feature: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150.

Best,
Zhanghao Chen

发件人: Jing Ge 
发送时间: 2023年9月25日 23:02
收件人: dev@flink.apache.org 
主题: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources

Hi Zhanghao,

Thanks for driving the FLIP. This is a nice feature users are looking for.
From users' perspective, would you like to add explicit description about
any potential(or none) compatibility issues if users want to use this new
feature and start existing jobs with savepoints or checkpoints?

Best regards,
Jing

On Sun, Sep 24, 2023 at 9:05 PM Chen Zhanghao 
wrote:

> Hi Lincoln,
>
> Thanks for the comments.
>
> - For concerns #1, I agree that we do not always produce optimal plan for
> both cases. However, it is difficult to do so and non-trivial complexity is
> expected. On the other hand, although our proposal generates a sub-optimal
> plan when the bottleneck is on the aggregate operator, it still provides
> more flexibility for performance tuning. Therefore, I think we can
> implement it in the straightforward way first, WDYT?
>
> - For concerns #2, I'd like to clarify a bit: exception will only be
> thrown i.f.f. the source may produce delete/update messages AND no primary
> key specified AND the source parallelism is different from the default
> parallelism. So for CDC without pk, we're still good if source parallelism
> is not specified.
>
> - For concerns #3, at the current point, I think making the name more
> specific is better as no other future use cases can be previsioned now.
> Since this is an internal API, we are free to refactor it later if needed.
>
> - For concerns about adaptive scheduler, the problems you mentioned do
> exist, but I don't think it relevant here. The planner may encode some
> hints to help the scheduler, but afterall, those efforts should be
> initiated on the scheduler side.
>
> Best,
> Zhanghao Chen
> 
> 发件人: Lincoln Lee 
> 发送时间: 2023年9月22日 23:19
> 收件人: dev@flink.apache.org 
> 主题: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL
> Sources
>
> Hi Zhanghao,
>
> Thanks for the FLIP and discussion!  Hope this reply isn't too late : )
> Firstly I'm fully agreed with the motivation of this FLIP and the value for
> the users, but there are a few things we should consider(please correct me
> if I'm misunderstanding):
>
> *1.  *It seems that the current solution only takes care of part of the
> requirement, the need to set source's parallelism may be different in
> different jobs,  for example, consider the following two job topologies(one
> {} simply represents a vertex):
> a. {source -> calc -> sink}
>
> b. {source -> calc} -> {aggregate} -> {sink}
>
> For job a, if there is a bottleneck in calc operator, but source
> parallelism cannot be scaled up (e.g., limited by kafka's partition
> number), so the calc operator cannot be scaled up to achieve higher
> throughput because the operators in source vertex are chained together,
> then current solution is reasonable (break the chain, add a shuffle).
>
> But for job b, if the bottleneck is the aggregate operator (not calc), it's
> more likely be better to scale up the aggregate operator/vertex and without
> breaking the {source -> calc} chain, as this will incur additional shuffle
> cost.
> So if we decide to add this new feature, I would recommend that both cases
> be taken care of.
>
>
> 2. the assumption that a cdc source must have pk(primary key) may not be
> reasonable, for example, mysql cdc supports the case without pk(
>
> https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys
> ),
> so we can not just raise an error here.
>
>
> 3. for the new SourceTransformationWrapper I have some concerns about the
> future evolution, if we need to add support for other operators, do we
> continue to add new xxWrappers?
>
> I've also revisited the previous discussion on FLIP-146[1], there were no
> clear conclusions or good ideas about similar support issues for the source
> before, and I also noticed that the new capability to change per-vertex
> parallelism via rest api in 1.18 (part of FLIP-291[2][3], but there is
> actually an issue about sql job's parallelism change which may require a
> hash shuffle to ensure the order of update stream, this needs to be
> followed up in FLIP-291, a jira will be created later).  So perhaps, we
> need to think about it more (the next version is not yet launched, so we
> still have time)
>
> [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [3] https://issues.apache.org/jira/browse/FLI

[jira] [Created] (FLINK-33156) Remove flakiness from tests in OperatorStateBackendTest.java

2023-09-25 Thread Asha Boyapati (Jira)
Asha Boyapati created FLINK-33156:
-

 Summary: Remove flakiness from tests in 
OperatorStateBackendTest.java
 Key: FLINK-33156
 URL: https://issues.apache.org/jira/browse/FLINK-33156
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.17.1
Reporter: Asha Boyapati
 Fix For: 1.17.1


This issue is similar to:
https://issues.apache.org/jira/browse/FLINK-32963

We are proposing to make the following tests stable:

{quote}org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreSync
org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreAsync{quote}

The tests are currently flaky because the order of elements returned by 
iterators is non-deterministic.

The following PR fixes the flaky test by making it independent of the order of 
elements returned by the iterator:
https://github.com/asha-boyapati/flink/pull/2

We detected this using the NonDex tool using the following commands:

{quote}mvn edu.illinois:nondex-maven-plugin:2.1.1:nondex -pl flink-runtime 
-DnondexRuns=10 
-Dtest=org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreSync

mvn edu.illinois:nondex-maven-plugin:2.1.1:nondex -pl flink-runtime 
-DnondexRuns=10 
-Dtest=org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreAsync{quote}

Please see the following Continuous Integration log that shows the flakiness:
https://github.com/asha-boyapati/flink/actions/runs/6193757385

Please see the following Continuous Integration log that shows that the 
flakiness is fixed by this change:
https://github.com/asha-boyapati/flink/actions/runs/619409



--
This message was sent by Atlassian Jira
(v8.20.10#820010)