Re: Future of classical remoting in Pekko
The related changes are: https://github.com/apache/incubator-pekko/pull/643 https://github.com/apache/incubator-pekko/pull/651 https://github.com/apache/incubator-pekko/pull/652 https://github.com/apache/incubator-pekko/pull/656 It would be very helpful if you could take another round of review, thanks. On 2023/09/19 19:04:59 Ferenc Csaky wrote: > I think that is totally fine, because any Pekko related changes can only be > added to the first patch release of 1.18 at this point, as there is an RC0 > [1] already so the release process will be initiated soon. > > I am glad the mentioned PR got merged, did not have the chance to review. > > [1] https://lists.apache.org/thread/5x28rp3zct4p603hm4zdwx6kfr101w38 > > > > --- Original Message --- > On Monday, September 18th, 2023 at 14:20, Matthew de Detrich > wrote: > > > > > > > > I think that the end of September is too soon for a Pekko 1.1.x, there are > > still more things > > that we would like to merge before making a release. > > > > Good news is that the PR to migrate to netty4 for classic remoting has been > > merged > > (see https://github.com/apache/incubator-pekko/pull/643). Improvements are > > also > > still being made, so the next minor version release of Pekko (1.1.0) will > > contain these > > changes. > > > > On Wed, Sep 13, 2023 at 11:22 AM Ferenc Csaky ferenc.cs...@pm.me.invalid > > > > wrote: > > > > > The target release date for 1.18 is the end of Sept [1], but I'm not sure > > > everything will come together by then. Maybe it will be pushed by a couple > > > of days. > > > > > > I'm happy to help out, even making the Flink related changes when we're at > > > that point. > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/1.18+Release > > > > > > --- Original Message --- > > > On Tuesday, September 12th, 2023 at 17:43, He Pin he...@apache.org > > > wrote: > > > > > > > Hi Ferenc: > > > > What's the ETA of Flink 1.18? 
I think we should be able to > > > > collaborate on this, and at work we are using Flink too. > > > > > > > > On 2023/09/12 15:16:11 Ferenc Csaky wrote: > > > > > > > > > Hi Matthew, > > > > > > > > > > Thanks for bringing this up! Circa half a year ago I started to work on > > > > > an Akka Artery migration, there is a draft PR for that 1. It might be > > > > > an > > > > > option to revive that work and point it against Pekko instead. > > > > > Although I > > > > > would highlight FLINK-29281 2 which will replace the whole RPC > > > > > implementation in Flink with a gRPC-based one when it is done. > > > > > > > > > > I am not sure about the progress on the gRPC work, it looks like it has been hanging for > > > > > a while now, so I think if there is a chance to replace Netty3 with > > > > > Netty4 > > > > > in Pekko in the short term it would benefit Flink, and then we can > > > > > decide if > > > > > it would be worth it to upgrade to Artery, or how fast the gRPC solution > > > > > can be > > > > > done and then it will not be necessary. > > > > > > > > > > All in all, in the short term I think Flink would benefit from having the > > > > > mentioned PR 3 merged, then the updated Pekko version could be > > > > > included in > > > > > the first 1.18 patch, to mitigate ASAP those pesky Netty3 CVEs > > > > > that have been > > > > > carried for a while. 
> > > > > > > > > > Cheers, > > > > > Ferenc > > > > > > > > > > 1 https://github.com/apache/flink/pull/22271 > > > > > 2 https://issues.apache.org/jira/browse/FLINK-29281 > > > > > 3 https://github.com/apache/incubator-pekko/pull/643 > > > > > > > > > > --- Original Message --- > > > > > On Tuesday, September 12th, 2023 at 10:29, Matthew de Detrich > > > > > matthew.dedetr...@aiven.io.INVALID wrote: > > > > > > > > > > > It's come to my attention that Flink is using Pekko's classical > > > > > > remoting; > > > > > > if this is the case then I would recommend making a response at > > > > > > https://lists.apache.org/thread/19h2wrs2om91g5vhnftv583fo0ddfshm . > > > > > > > > > > > > A quick summary of what is being discussed is what to do with Pekko's > > > > > > classical remoting. Classic remoting has been considered deprecated since > > > > > > 2019, > > > > > > an artifact that we inherited from Akka1. On top of this, classical > > > > > > remoting happens to be using netty3, which has known CVEs2; these > > > > > > CVEs > > > > > > were never fixed in the netty3 series. > > > > > > > > > > > > The question is what should be done given this, i.e. some people in > > > > > > the > > > > > > Pekko community want to drop classical remoting as quickly as > > > > > > possible (i.e. even sooner than what semver allows, but this is being > > > > > > discussed) and others want to leave it as it is (even with > > > > > > the > > > > > > CVEs) since we don't want to incentivize and/or create the impression > > > > > > that we > > > > > > are officially supporting it. There is also a currently open PR3 > > > > > > which > > > > > > upgrades Pekko's c
[jira] [Created] (FLINK-33149) Bump snappy-java to 1.1.10.4
Ryan Skraba created FLINK-33149: --- Summary: Bump snappy-java to 1.1.10.4 Key: FLINK-33149 URL: https://issues.apache.org/jira/browse/FLINK-33149 Project: Flink Issue Type: Bug Components: API / Core, Connectors / AWS, Connectors / HBase, Connectors / Kafka, Stateful Functions Affects Versions: 1.18.0, 1.16.3, 1.17.2 Reporter: Ryan Skraba Xerial published a security alert for a Denial of Service attack that [exists on 1.1.10.1|https://github.com/xerial/snappy-java/security/advisories/GHSA-55g7-9cwv-5qfv]. This is included in flink-dist, but also in flink-statefun, and several connectors. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33150) add the processing logic for the long type
wenhao.yu created FLINK-33150: - Summary: add the processing logic for the long type Key: FLINK-33150 URL: https://issues.apache.org/jira/browse/FLINK-33150 Project: Flink Issue Type: New Feature Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Affects Versions: 1.15.4 Reporter: wenhao.yu Fix For: 1.15.4 The AvroToRowDataConverters class has a convertToDate method that reports an error when it encounters time data represented by the long type, so code should be added to handle the long type.
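For context on the issue above: Avro's `date` logical type is normally encoded as an `int` counting days since the Unix epoch, so a converter that only accepts `Integer` fails when a writer emits the value as a `long`. The sketch below is a hypothetical stand-in for the kind of widened type check being requested — it is not the actual `AvroToRowDataConverters` code, and it assumes the `long` also encodes a day count rather than, say, epoch milliseconds:

```java
import java.time.LocalDate;

public class DateConverterSketch {
    // Convert an Avro date value (days since 1970-01-01) to a LocalDate.
    // Accepts both Integer and Long encodings of the day count.
    public static LocalDate convertToDate(Object avroValue) {
        final long epochDays;
        if (avroValue instanceof Integer) {
            epochDays = (Integer) avroValue;
        } else if (avroValue instanceof Long) {
            // The branch the issue asks for: handle long-typed day counts
            // instead of throwing.
            epochDays = (Long) avroValue;
        } else {
            throw new IllegalArgumentException(
                    "Unexpected date type: " + avroValue.getClass());
        }
        return LocalDate.ofEpochDay(epochDays);
    }
}
```

For example, `convertToDate(0)` and `convertToDate(0L)` would both yield 1970-01-01 under this sketch.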
[jira] [Created] (FLINK-33151) Prometheus Sink Connector - Create Github Repo
Danny Cranmer created FLINK-33151: - Summary: Prometheus Sink Connector - Create Github Repo Key: FLINK-33151 URL: https://issues.apache.org/jira/browse/FLINK-33151 Project: Flink Issue Type: Sub-task Reporter: Danny Cranmer Create the {{flink-connector-prometheus}} repo
[DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Hi all, Fan Rui (CC’ed) and I created FLIP-370[1] to support balanced task scheduling. Flink's current task-deployment strategy sometimes leaves some TMs (TaskManagers) with more tasks while others have fewer, so the TMs that contain more tasks see excessive resource utilization and become a bottleneck for the entire job. Developing strategies that balance the task load across TMs and reduce job bottlenecks is therefore very worthwhile. The raw design and discussions can be found in the Flink JIRA[2] and Google doc[3]. We really appreciate Zhu Zhu (CC’ed) providing valuable help and suggestions in advance. Please refer to the FLIP[1] document for more details about the proposed design and implementation. We welcome any feedback and opinions on this proposal. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling [2] https://issues.apache.org/jira/browse/FLINK-31757 [3] https://docs.google.com/document/d/14WhrSNGBdcsRl3IK7CZO-RaZ5KXU2X1dWqxPEFr3iS8 Best, Yuepeng Pan
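To illustrate the kind of balance the proposal above aims for, here is a minimal greedy sketch — purely illustrative, not the strategy actually proposed in FLIP-370 — in which each task is assigned to the TaskManager that currently holds the fewest tasks, which keeps per-TM task counts within one of each other:

```java
public class BalancedAssignmentSketch {
    // Greedily assign each of numTasks tasks to the TaskManager that
    // currently has the fewest assigned tasks; returns the resulting
    // task count per TM.
    public static int[] assign(int numTasks, int numTaskManagers) {
        int[] counts = new int[numTaskManagers];
        for (int t = 0; t < numTasks; t++) {
            int min = 0;
            for (int tm = 1; tm < numTaskManagers; tm++) {
                if (counts[tm] < counts[min]) {
                    min = tm;
                }
            }
            counts[min]++;
        }
        return counts;
    }
}
```

With 10 tasks over 4 TMs this yields counts differing by at most one (e.g. 3/3/2/2), whereas a skewed assignment could put most tasks on the first TMs and create the bottleneck described above.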
[jira] [Created] (FLINK-33152) Prometheus Sink Connector - Integration tests
Lorenzo Nicora created FLINK-33152: -- Summary: Prometheus Sink Connector - Integration tests Key: FLINK-33152 URL: https://issues.apache.org/jira/browse/FLINK-33152 Project: Flink Issue Type: Sub-task Reporter: Lorenzo Nicora Integration tests against containerised Prometheus
Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag
Hi Jark and Dong, I'm a bit confused about the difference between the two competing options. Could one of you elaborate on the difference between: > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides > the effect of the previous invocation (if any) of > `#setIsProcessingBacklog(true)` on the given source instance. and > 2) The semantics of `#setIsProcessingBacklog(false)` is that the given > source instance will have watermarkLag=false. ? Best, Piotrek On Thu, 21 Sep 2023 at 15:28, Dong Lin wrote: > Hi all, > > Jark and I discussed this FLIP offline and I will summarize our discussion > below. It would be great if you could provide your opinion of the proposed > options. > > Regarding the target use-cases: > - We both agreed that MySQL CDC should have backlog=true when watermarkLag > is large during the binlog phase. > - Dong argued that other streaming sources with watermarkLag defined (e.g. > Kafka) should also have backlog=true when watermarkLag is large. The > pros/cons discussion below assumes this use-case needs to be supported. > > The 1st option is what is currently proposed in FLIP-328, with the > following key characteristics: > 1) There is one job-level config (i.e. > pipeline.backlog.watermark-lag-threshold) that applies to all sources with > the watermarkLag metric defined. > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides > the effect of the previous invocation (if any) of > `#setIsProcessingBacklog(true)` on the given source instance. > > The 2nd option is what Jark proposed in this email thread, with the > following key characteristics: > 1) Add a source-specific config (both Java API and SQL source property) to > every source for which we want to set the backlog status based on the > watermarkLag metric. For example, we might add separate Java APIs > `#setWatermarkLagThreshold` for the MySQL CDC source, HybridSource, > KafkaSource, PulsarSource etc. 
> 2) The semantics of `#setIsProcessingBacklog(false)` is that the given > source instance will have watermarkLag=false. > > Here are the key pros/cons of these two options. > > Cons of the 1st option: > 1) The semantics of `#setIsProcessingBacklog(false)` is harder to > understand for Flink operator developers than the corresponding semantics > in option-2. > > Cons of the 2nd option: > 1) More work for end-users. For a job with multiple sources that need to be > configured with a watermark lag threshold, users need to specify multiple > configs (one for each source) instead of specifying one job-level config. > > 2) More work for Flink operator developers. Overall there are more public > APIs (one Java API and one SQL property for each source that needs to > determine backlog based on watermark) exposed to end users. This also adds > more burden for the Flink community to maintain these APIs. > > 3) It would be hard (e.g. require backward incompatible API change) to > extend the Flink runtime to support job-level config to set watermark > strategy in the future (e.g. support the > pipeline.backlog.watermark-lag-threshold in option-1). This is because an > existing source operator's code might have hardcoded an invocation of > `#setIsProcessingBacklog(false)`, which means the backlog status must be > set to true, which prevents Flink runtime from setting backlog=true when a > new strategy is triggered. > > Overall, I am still inclined to choose option-1 because it is more > extensible and simpler to use in the long term when we want to support/use > multiple sources whose backlog status can change based on the watermark > lag. While option-1's `#setIsProcessingBacklog` is a bit harder to > understand than option-2, I think this overhead/cost is worthwhile as it > makes end-users' life easier in the long term. > > Jark: thank you for taking the time to review this FLIP. Please feel free > to comment if I missed anything in the pros/cons above. 
> > Jark and I have not reached agreement on which option is better. It will be > really helpful if we can get more comments on these options. > > Thanks, > Dong > > > On Tue, Sep 19, 2023 at 11:26 AM Dong Lin wrote: > > > Hi Jark, > > > > Thanks for the reply. Please see my comments inline. > > > > On Tue, Sep 19, 2023 at 10:12 AM Jark Wu wrote: > > > >> Hi Dong, > >> > >> Sorry for the late reply. > >> > >> > The rationale is that if there is any strategy that is triggered and > >> says > >> > backlog=true, then job's backlog should be true. Otherwise, the job's > >> > backlog status is false. > >> > >> I'm quite confused about this. Does that mean, if the source is in the > >> changelog phase, the source has to continuously invoke > >> "setIsProcessingBacklog(true)" (in an infinite loop?). Otherwise, > >> the job's backlog status would be set to false by the framework? > >> > > > > No, the source would not have to continuously invoke > > setIsProcessingBacklog(true) in an infinite loop. > > > > Actually, I am not very sure why there is confusion that "
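To make the two competing semantics in the thread above concrete, here is a toy model — names and structure are illustrative only, not Flink's actual runtime API. Under option 1, each source keeps its own backlog flag and the job is in backlog mode if any flag is true, so `setIsProcessingBacklog(false)` merely retracts that source's earlier "true". Under option 2, an explicit call on a source is definitive for that source, overriding whatever a runtime strategy (e.g. watermark lag) would otherwise decide:

```java
import java.util.Map;

public class BacklogSemanticsSketch {
    // Option 1: the job-level backlog status is the OR of all per-source
    // flags; calling setIsProcessingBacklog(false) only clears this
    // source's own vote.
    public static boolean jobBacklogOption1(Map<String, Boolean> perSourceFlags) {
        return perSourceFlags.containsValue(true);
    }

    // Option 2: the last explicit call wins for the given source; an
    // explicit false pins the source to not-backlog even if a runtime
    // strategy (e.g. watermark-lag threshold) says otherwise. A null
    // explicitCall means "no explicit call was made".
    public static boolean sourceBacklogOption2(Boolean explicitCall,
                                               boolean runtimeStrategySaysBacklog) {
        return explicitCall != null ? explicitCall : runtimeStrategySaysBacklog;
    }
}
```

This makes the extensibility concern visible: under option 2, a source that hard-codes an explicit `false` can never be flipped to backlog=true by a future job-level strategy, whereas under option 1 the runtime retains that freedom.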
Re: [ANNOUNCE] Release 1.18.0, release candidate #0
Hi Jing and everyone, I have conducted three rounds of benchmarking with Java11, comparing release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The results are attached[3]. Most of the tests show no obvious regression. However, I did observe significant changes in several tests. Upon reviewing the historical results from the previous pipeline, I also discovered a substantial variance in those tests, as shown in the timeline pictures included in the sheet[3]. I believe this variance has existed for a long time and requires further investigation, and fully measuring the variance requires more rounds (15 or more). I think for now it is not a blocker for release 1.18. WDYT? Best, Zakelly [1] https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb [2] https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 [3] https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing On Sun, Sep 24, 2023 at 11:29 AM ConradJam wrote: > > +1 for testing with Java 17 > > Jing Ge wrote on Sunday, September 24, 2023 at 09:40: > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort! > > > > Best regards, > > Jing > > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan wrote: > > > > > Hi Jing, > > > > > > I agree we could wait for the result with Java 11. And it should be > > > available next Monday. > > > Additionally, I could also build a pipeline with Java 17 later since > > > it is supported in 1.18[1]. > > > > > > > > > Best regards, > > > Zakelly > > > > > > [1] > > > > > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53 > > > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge > > > wrote: > > > > > > > > Hi Zakelly, > > > > > > > > Thanks for your effort and the update! Since Java 8 has been > > > deprecated[1], > > > > let's wait for the result with Java 11. 
It should be available after > > the > > > > weekend and there should be no big surprise. WDYT? > > > > > > > > Best regards, > > > > Jing > > > > > > > > [1] > > > > > > > > > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade > > > > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan > > > wrote: > > > > > > > > > Hi everyone, > > > > > > > > > > I want to provide an update on the benchmark results that I have been > > > > > working on. After spending some time preparing the environment and > > > > > adjusting the benchmark script, I finally got a comparison between > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the old > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8. The > > > > > report is attached[3]. Note that the test has only run once on jdk8, > > > > > so the impact of single-test fluctuations is not ruled out. > > > > > Additionally, I have noticed some significant fluctuations in > > specific > > > > > tests when reviewing previous benchmark scores, which I have also > > > > > noted in the report. Taking all of these factors into consideration, > > I > > > > > think there is no obvious regression in release 1.18 *for now*. More > > > > > tests including the one on openjdk11 are on the way. Hope it does not > > > > > delay the release procedure. > > > > > > > > > > Please let me know if you have any concerns. 
> > > > > > > > > > > > > > > Best, > > > > > Zakelly > > > > > > > > > > [1] > > > > > > > > > > https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9 > > > > > [2] > > > > > > > > > > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > > > > > [3] > > > > > > > > > > https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing > > > > > > > > > > The new environment for benchmark: > > > > > ECS on Aliyun > > > > > CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (8 Core available) > > > > > Memory: 64GB > > > > > OS: Alibaba Cloud Linux 3.2104 LTS 64bit > > > > > Kernel: 5.10.134-15.al8.x86_64 > > > > > OpenJDK8 version: 1.8.0_372 > > > > > > > > > > On Thu, Sep 21, 2023 at 12:04 PM Yuxin Tan > > > wrote: > > > > > > > > > > > > Hi, Zakelly, > > > > > > > > > > > > No benchmark tests currently are affected by this issue. We > > > > > > may add benchmarks to guard it later. Thanks. > > > > > > > > > > > > Best, > > > > > > Yuxin > > > > > > > > > > > > > > > > > > Zakelly Lan 于2023年9月21日周四 11:56写道: > > > > > > > > > > > > > Hi Jing, > > > > > > > > > > > > > > Sure, I will run the benchmark with this fix. > > > > > > > > > > > > > > Hi Yunxin, > > > > > > > > > > > > > > I'm not familiar with the hybrid shuffle. Is there any specific > > > > > > > benchmark test that may be affected by this issue? I will pay > > > special > > > > > > > attention to it. > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > Zakelly > > > > > > > > > > > > > > On
[jira] [Created] (FLINK-33153) Kafka using latest-offset may be missing data
tanjialiang created FLINK-33153: --- Summary: Kafka using latest-offset may be missing data Key: FLINK-33153 URL: https://issues.apache.org/jira/browse/FLINK-33153 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: kafka-4.1.0 Reporter: tanjialiang When the Kafka source starts with the latest-offset strategy, it does not fetch the latest snapshot offset and specify it for consumption. Instead, it sets the startingOffset to -1 (KafkaPartitionSplit.LATEST_OFFSET, which makes currentOffset = -1 and calls the KafkaConsumer's seekToEnd API). The currentOffset is only set to the consumed offset + 1 when the task consumes data, and this currentOffset is stored in the state during checkpointing. If there are very few messages in Kafka and a partition has not consumed any data, and I stop the task with a savepoint, then write data to that partition, and start the task with the savepoint, the task will resume from the saved state. Because the startingOffset in the state is still -1, the task will miss the data that was written before the recovery point.
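The failure mode described above can be sketched as a toy model (not the actual `KafkaPartitionSplit` code): if the split's checkpointed offset is still the "latest" sentinel because no record was ever consumed, a restore seeks to the current end of the partition again and silently skips everything written between the savepoint and the restore.

```java
public class LatestOffsetSketch {
    static final long LATEST_SENTINEL = -1L; // "start from latest", no concrete offset yet

    // Offset the source resumes from after a savepoint restore. If the
    // checkpointed offset is the sentinel, the consumer seeks to the
    // current end of the partition; otherwise it resumes from the
    // checkpointed position.
    public static long resumeOffset(long checkpointedOffset, long endOffsetAtRestore) {
        return checkpointedOffset == LATEST_SENTINEL
                ? endOffsetAtRestore
                : checkpointedOffset;
    }
}
```

For example, with the partition end at offset 10 when the savepoint is taken and 15 when the job restarts, a split checkpointed with the sentinel resumes at 15 and drops the 5 records in between, while a split checkpointed at offset 10 resumes correctly. In this model, snapshotting the concrete end offset at startup instead of the sentinel would avoid the data loss.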
[VOTE] FLIP-362: Support minimum resource limitation
Hi all, I would like to start the vote for FLIP-362: Support minimum resource limitation[1]. This FLIP was discussed in this thread [2]. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. Regards, Xiangyu [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
Re: [VOTE] FLIP-362: Support minimum resource limitation
+1 (binding) Best, Yangze Guo On Mon, Sep 25, 2023 at 5:39 PM xiangyu feng wrote: > > Hi all, > > I would like to start the vote for FLIP-362: Support minimum resource > limitation[1]. > This FLIP was discussed in this thread [2]. > > The vote will be open for at least 72 hours unless there is an objection or > insufficient votes. > > Regards, > Xiangyu > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
Re: [VOTE] FLIP-362: Support minimum resource limitation
Thanks for the proposal +1(non-binding) Best regards Ahmed On Mon, 25 Sep 2023, 13:35 Yangze Guo, wrote: > +1 (binding) > > Best, > Yangze Guo > > On Mon, Sep 25, 2023 at 5:39 PM xiangyu feng wrote: > > > > Hi all, > > > > I would like to start the vote for FLIP-362: Support minimum resource > > limitation[1]. > > This FLIP was discussed in this thread [2]. > > > > The vote will be open for at least 72 hours unless there is an objection > or > > insufficient votes. > > > > Regards, > > Xiangyu > > > > [1] > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation > > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4 >
Re: [VOTE] FLIP-362: Support minimum resource limitation
Thanks for driving this. +1 (non-binding) Best, Zhanghao Chen From: xiangyu feng Sent: Monday, September 25, 2023 17:38 To: dev@flink.apache.org Subject: [VOTE] FLIP-362: Support minimum resource limitation Hi all, I would like to start the vote for FLIP-362: Support minimum resource limitation[1]. This FLIP was discussed in this thread [2]. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. Regards, Xiangyu [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4
[jira] [Created] (FLINK-33154) Flink on K8s: an error occurred while consuming RocketMQ
Monody created FLINK-33154: -- Summary: Flink on K8s: an error occurred while consuming RocketMQ Key: FLINK-33154 URL: https://issues.apache.org/jira/browse/FLINK-33154 Project: Flink Issue Type: Technical Debt Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Environment: flink-kubernetes-operator:https://github.com/apache/flink-kubernetes-operator#current-api-version-v1beta1 rocketmq-flink:https://github.com/apache/rocketmq-flink Reporter: Monody The following error occurs when Flink consumes RocketMQ; the projects used are listed in the Environment field above. The Flink job is running on K8s. The same job runs normally on YARN, and no abnormality is found on the RocketMQ server. Why does this happen, and how can it be solved? !https://user-images.githubusercontent.com/47728686/265662530-231c500c-fd64-4679-9b0f-ff4a025dd766.png!
[jira] [Created] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled
Yang Wang created FLINK-33155: - Summary: Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled Key: FLINK-33155 URL: https://issues.apache.org/jira/browse/FLINK-33155 Project: Flink Issue Type: Bug Reporter: Yang Wang When Kerberos enabled(with key tab) and after one day(the container token expired), Flink fails to create the TaskManager container on YARN due to the following exception. {code:java} 2023-09-25 16:48:50,030 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: Container container_1695106898104_0003_01_69 was invalid. Diagnostics: [2023-09-25 16:48:45.710]token (token for hadoop: HDFS_DELEGATION_TOKEN owner=, renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in cacheorg.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for hadoop: HDFS_DELEGATION_TOKEN owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com, renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in cacheat org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)at org.apache.hadoop.ipc.Client.call(Client.java:1491)at org.apache.hadoop.ipc.Client.call(Client.java:1388)at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498)at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588) at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)at java.security.AccessController.doPrivileged(Native Method)at javax.security.auth.Subject.doAs(Subject.java:422)at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224) at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {code} The root cause might be that we are reading the delegation token from JM local file[1]. It will expire after one day. When the old TaskManager container crashes and ResourceManager tries to create a new one, th
Re: [ANNOUNCE] Release 1.18.0, release candidate #0
Thanks Zakelly for the update! Appreciate it! @Piotr Nowojski If you do not have any other concerns, I will move forward to create 1.18 rc1 and start voting. WDYT? Best regards, Jing On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan wrote: > Hi Jing and everyone, > > I have conducted three rounds of benchmarking with Java11, comparing > release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The > results are attached[3]. Most of the tests show no obvious regression. > However, I did observe significant change in several tests. Upon > reviewing the historical results from the previous pipeline, I also > discovered a substantial variance in those tests, as shown in the > timeline pictures included in the sheet[3]. I believe this variance > has existed for a long time and requires further investigation, and > fully measuring the variance requires more rounds (15 or more). I > think for now it is not a blocker for release 1.18. WDYT? > > > Best, > Zakelly > > [1] > https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb > [2] > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > [3] > https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing > > On Sun, Sep 24, 2023 at 11:29 AM ConradJam wrote: > > > > +1 for testing with Java 17 > > > > Jing Ge 于2023年9月24日周日 09:40写道: > > > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort! > > > > > > Best regards, > > > Jing > > > > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan > wrote: > > > > > > > Hi Jing, > > > > > > > > I agree we could wait for the result with Java 11. And it should be > > > > available next Monday. > > > > Additionally, I could also build a pipeline with Java 17 later since > > > > it is supported in 1.18[1]. 
> > > > > > > > > > > > Best regards, > > > > Zakelly > > > > > > > > [1] > > > > > > > > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53 > > > > > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge > > > > wrote: > > > > > > > > > > Hi Zakelly, > > > > > > > > > > Thanks for your effort and the update! Since Java 8 has been > > > > deprecated[1], > > > > > let's wait for the result with Java 11. It should be available > after > > > the > > > > > weekend and there should be no big surprise. WDYT? > > > > > > > > > > Best regards, > > > > > Jing > > > > > > > > > > [1] > > > > > > > > > > > > > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade > > > > > > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan < > zakelly@gmail.com> > > > > wrote: > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > I want to provide an update on the benchmark results that I have > been > > > > > > working on. After spending some time preparing the environment > and > > > > > > adjusting the benchmark script, I finally got a comparison > between > > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the > old > > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8. > The > > > > > > report is attached[3]. Note that the test has only run once on > jdk8, > > > > > > so the impact of single-test fluctuations is not ruled out. > > > > > > Additionally, I have noticed some significant fluctuations in > > > specific > > > > > > tests when reviewing previous benchmark scores, which I have also > > > > > > noted in the report. Taking all of these factors into > consideration, > > > I > > > > > > think there is no obvious regression in release 1.18 *for now*. > More > > > > > > tests including the one on openjdk11 are on the way. Hope it > does not > > > > > > delay the release procedure. 
> > > > > > > > > > > > Please let me know if you have any concerns. > > > > > > > > > > > > > > > > > > Best, > > > > > > Zakelly > > > > > > > > > > > > [1] > > > > > > > > > > > > > > https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9 > > > > > > [2] > > > > > > > > > > > > > > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > > > > > > [3] > > > > > > > > > > > > > > https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing > > > > > > > > > > > > The new environment for benchmark: > > > > > > ECS on Aliyun > > > > > > CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (8 Core > available) > > > > > > Memory: 64GB > > > > > > OS: Alibaba Cloud Linux 3.2104 LTS 64bit > > > > > > Kernel: 5.10.134-15.al8.x86_64 > > > > > > OpenJDK8 version: 1.8.0_372 > > > > > > > > > > > > On Thu, Sep 21, 2023 at 12:04 PM Yuxin Tan < > tanyuxinw...@gmail.com> > > > > wrote: > > > > > > > > > > > > > > Hi, Zakelly, > > > > > > > > > > > > > > No benchmark tests currently are affected by this issue. We > > > > > > > may add benchmarks to guard it later. Thanks. > > > > > > > > > > > > > > Best
Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources
Hi Zhanghao, Thanks for your explanation! - For question 2, I can accept that the user must provide a pk when modifying the parallelism of the cdc source. - For questions 1 & 3 and the additional concerns, if there is no more general or discussable solution at this time, I agree to address part of the tuning requirements in the current manner. Thanks again for moving this FLIP forward! Best, Lincoln Lee Chen Zhanghao wrote on Mon, Sep 25, 2023 at 12:04: > Hi Lincoln, > > Thanks for the comments. > > - For concern #1, I agree that we do not always produce an optimal plan for > both cases. However, it is difficult to do so and non-trivial complexity is > expected. On the other hand, although our proposal generates a sub-optimal > plan when the bottleneck is on the aggregate operator, it still provides > more flexibility for performance tuning. Therefore, I think we can > implement it in the straightforward way first, WDYT? > > - For concern #2, I'd like to clarify a bit: an exception will be > thrown only if the source may produce delete/update messages AND no primary > key is specified AND the source parallelism differs from the default > parallelism. So for CDC without a pk, we're still good if the source parallelism > is not specified. > > - For concern #3, at the current point, I think making the name more > specific is better as no other future use cases can be envisioned now. > Since this is an internal API, we are free to refactor it later if needed. > > - For the concerns about the adaptive scheduler, the problems you mentioned do > exist, but I don't think they are relevant here. The planner may encode some > hints to help the scheduler, but after all, those efforts should be > initiated on the scheduler side. > > Best, > Zhanghao Chen > > From: Lincoln Lee > Sent: Sep 22, 2023, 23:19 > To: dev@flink.apache.org > Subject: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL > Sources > > Hi Zhanghao, > > Thanks for the FLIP and discussion!
Hope this reply isn't too late : ) > Firstly, I fully agree with the motivation of this FLIP and its value for > the users, but there are a few things we should consider (please correct me > if I'm misunderstanding): > > *1. *It seems that the current solution only takes care of part of the > requirement; the need to set the source's parallelism may differ in > different jobs. For example, consider the following two job topologies (one > {} simply represents a vertex): > a. {source -> calc -> sink} > > b. {source -> calc} -> {aggregate} -> {sink} > > For job a, if there is a bottleneck in the calc operator but the source > parallelism cannot be scaled up (e.g., limited by Kafka's partition > count), the calc operator cannot be scaled up to achieve higher > throughput because the operators in the source vertex are chained together, > so the current solution is reasonable (break the chain, add a shuffle). > > But for job b, if the bottleneck is the aggregate operator (not calc), it is > likely better to scale up the aggregate operator/vertex without > breaking the {source -> calc} chain, as breaking it will incur additional shuffle > cost. > So if we decide to add this new feature, I would recommend that both cases > be taken care of. > > > 2. The assumption that a CDC source must have a pk (primary key) may not be > reasonable; for example, mysql cdc supports the case without a pk ( > > https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys > ), > so we cannot just raise an error here. > > > 3. For the new SourceTransformationWrapper I have some concerns about its > future evolution: if we need to add support for other operators, do we > continue to add new xxWrappers?
> > I've also revisited the previous discussion on FLIP-146[1]; there were no > clear conclusions or good ideas about similar support for the source > before. I also noticed the new capability to change per-vertex > parallelism via the REST API in 1.18 (part of FLIP-291[2][3]; but there is > actually an issue with changing a SQL job's parallelism, which may require a > hash shuffle to ensure the order of the update stream; this needs to be > followed up in FLIP-291, and a Jira will be created later). So perhaps we > need to think about it more (the next version is not yet launched, so we > still have time). > > [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87 > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > [3] https://issues.apache.org/jira/browse/FLINK-31471 > > > Best, > Lincoln Lee > > > Chen Zhanghao wrote on Fri, Sep 22, 2023 at 16:00: > > > Thanks to everyone who participated in the discussion here. If no further > > questions/concerns are raised, we'll start voting next Monday afternoon > > (GMT+8). > > > > Best, > > Zhanghao Chen > > > > From: Jane Chan > > Sent: Sep 22, 2023, 15:35 > > To: dev@flink.apache.org > > Subject: R
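Zhanghao's clarification in the thread above boils down to a three-part condition for when changing a source's parallelism must be rejected. A minimal sketch of that rule follows; this is an illustration of the logic discussed, not the actual planner code, and the helper name and signature are hypothetical:

```python
def validate_source_parallelism(produces_updates: bool,
                                has_primary_key: bool,
                                source_parallelism: int,
                                default_parallelism: int) -> None:
    """Sketch of the FLIP-367 rule: raise only when ALL three hold:
    the source may emit delete/update messages, no primary key is
    defined, and a custom parallelism differs from the default."""
    if (produces_updates
            and not has_primary_key
            and source_parallelism != default_parallelism):
        raise ValueError(
            "changing the parallelism of an updating source without a "
            "primary key would break the ordering of the update stream")

# A CDC table without a pk is fine as long as parallelism is untouched:
validate_source_parallelism(True, False, 4, 4)
# An append-only source can always be rescaled:
validate_source_parallelism(False, False, 8, 4)
```

This matches the clarification that CDC sources without a primary key keep working as long as no custom source parallelism is configured.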
Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources
Hi Zhanghao, Thanks for driving the FLIP. This is a nice feature users are looking for. From users' perspective, would you like to add an explicit description of any potential (or no) compatibility issues if users want to use this new feature and start existing jobs from savepoints or checkpoints? Best regards, Jing On Sun, Sep 24, 2023 at 9:05 PM Chen Zhanghao wrote: > Hi Lincoln, > > Thanks for the comments. > > - For concern #1, I agree that we do not always produce an optimal plan for > both cases. However, it is difficult to do so and non-trivial complexity is > expected. On the other hand, although our proposal generates a sub-optimal > plan when the bottleneck is on the aggregate operator, it still provides > more flexibility for performance tuning. Therefore, I think we can > implement it in the straightforward way first, WDYT? > > - For concern #2, I'd like to clarify a bit: an exception will be > thrown only if the source may produce delete/update messages AND no primary > key is specified AND the source parallelism differs from the default > parallelism. So for CDC without a pk, we're still good if the source parallelism > is not specified. > > - For concern #3, at the current point, I think making the name more > specific is better as no other future use cases can be envisioned now. > Since this is an internal API, we are free to refactor it later if needed. > > - For the concerns about the adaptive scheduler, the problems you mentioned do > exist, but I don't think they are relevant here. The planner may encode some > hints to help the scheduler, but after all, those efforts should be > initiated on the scheduler side. > > Best, > Zhanghao Chen > > From: Lincoln Lee > Sent: Sep 22, 2023, 23:19 > To: dev@flink.apache.org > Subject: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL > Sources > > Hi Zhanghao, > > Thanks for the FLIP and discussion!
Hope this reply isn't too late : ) > Firstly, I fully agree with the motivation of this FLIP and its value for > the users, but there are a few things we should consider (please correct me > if I'm misunderstanding): > > *1. *It seems that the current solution only takes care of part of the > requirement; the need to set the source's parallelism may differ in > different jobs. For example, consider the following two job topologies (one > {} simply represents a vertex): > a. {source -> calc -> sink} > > b. {source -> calc} -> {aggregate} -> {sink} > > For job a, if there is a bottleneck in the calc operator but the source > parallelism cannot be scaled up (e.g., limited by Kafka's partition > count), the calc operator cannot be scaled up to achieve higher > throughput because the operators in the source vertex are chained together, > so the current solution is reasonable (break the chain, add a shuffle). > > But for job b, if the bottleneck is the aggregate operator (not calc), it is > likely better to scale up the aggregate operator/vertex without > breaking the {source -> calc} chain, as breaking it will incur additional shuffle > cost. > So if we decide to add this new feature, I would recommend that both cases > be taken care of. > > > 2. The assumption that a CDC source must have a pk (primary key) may not be > reasonable; for example, mysql cdc supports the case without a pk ( > > https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys > ), > so we cannot just raise an error here. > > > 3. For the new SourceTransformationWrapper I have some concerns about its > future evolution: if we need to add support for other operators, do we > continue to add new xxWrappers?
> > I've also revisited the previous discussion on FLIP-146[1]; there were no > clear conclusions or good ideas about similar support for the source > before. I also noticed the new capability to change per-vertex > parallelism via the REST API in 1.18 (part of FLIP-291[2][3]; but there is > actually an issue with changing a SQL job's parallelism, which may require a > hash shuffle to ensure the order of the update stream; this needs to be > followed up in FLIP-291, and a Jira will be created later). So perhaps we > need to think about it more (the next version is not yet launched, so we > still have time). > > [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87 > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > [3] https://issues.apache.org/jira/browse/FLINK-31471 > > > Best, > Lincoln Lee > > > Chen Zhanghao wrote on Fri, Sep 22, 2023 at 16:00: > > > Thanks to everyone who participated in the discussion here. If no further > > questions/concerns are raised, we'll start voting next Monday afternoon > > (GMT+8). > > > > Best, > > Zhanghao Chen > > > > From: Jane Chan > > Sent: Sep 22, 2023, 15:35 > > To: dev@flink.apache.org > > Subject: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/S
Re: [Discuss] FLIP-362: Support minimum resource limitation
Hi Yangze, Thanks for the clarification. The example of two batch jobs teaming up with one streaming job is interesting. Best regards, Jing On Wed, Sep 20, 2023 at 7:19 PM Yangze Guo wrote: > Thanks for the comments, Jing. > > > Will the minimum resource configuration also take effect for streaming > jobs in application mode? > > Since it is not recommended to configure slotmanager.number-of-slots.max > for streaming jobs, does it make sense to disable it for common streaming > jobs? At least disable the check for avoiding the oscillation? > > Yes. The minimum resource configuration will only be disabled in > standalone clusters atm. I agree it makes sense to disable it for a pure > streaming job, however: > - By default, the minimum resource is configured to 0. If users do not > proactively set it, either the oscillation check or the minimum > restriction can be considered as disabled. > - The minimum resource is a cluster-level configuration rather than a > job-level configuration. If a user has an application with two batch > jobs preceding the streaming job, they may also require this > configuration to accelerate the execution of the batch jobs. > > WDYT? > > Best, > Yangze Guo > > On Thu, Sep 21, 2023 at 4:49 AM Jing Ge > wrote: > > > > Hi Xiangyu, > > > > Thanks for driving it! There is one thing I am not really sure if I > > understand you correctly. > > > > According to the FLIP: "The minimum resource limitation will be > implemented > > in the DefaultResourceAllocationStrategy of FineGrainedSlotManager. > > > > Each time when SlotManager needs to reconcile the cluster resources or > > fulfill job resource requirements, the DefaultResourceAllocationStrategy > > will check if the minimum resource requirement has been fulfilled. If it > is > > not, DefaultResourceAllocationStrategy will request new > PendingTaskManagers > > and FineGrainedSlotManager will allocate new worker resources > accordingly."
> > > > "To avoid this oscillation, we need to check that the worker numbers derived > from > > the minimum and maximum resource configurations are consistent before starting > > SlotManager." > > > > Will the minimum resource configuration also take effect for streaming > jobs > > in application mode? Since it is not recommended to > > configure slotmanager.number-of-slots.max for streaming jobs, does it > make > > sense to disable it for common streaming jobs? At least disable the check > > for avoiding the oscillation? > > > > Best regards, > > Jing > > > > > > On Tue, Sep 19, 2023 at 4:58 PM Chen Zhanghao > > > wrote: > > > > > Thanks for driving this, Xiangyu. We use Session clusters for quick SQL > > > debugging internally, and found cold-start job submission slow due to the lack > > > of the exact minimum resource reservation feature proposed here. This > > > should improve the experience a lot for running short-lived jobs in session > > > clusters. > > > > > > Best, > > > Zhanghao Chen > > > > > > From: Yangze Guo > > > Sent: Sep 19, 2023, 13:10 > > > To: xiangyu feng > > > Cc: dev@flink.apache.org > > > Subject: Re: [Discuss] FLIP-362: Support minimum resource limitation > > > > > > Thanks for driving this @Xiangyu. This is a feature that many users > > > have requested for a long time. +1 for the overall proposal. > > > > > > Best, > > > Yangze Guo > > > > > > On Tue, Sep 19, 2023 at 11:48 AM xiangyu feng > > > wrote: > > > > > > > > Hi Devs, > > > > > > > > I'm opening this thread to discuss FLIP-362: Support minimum resource > > > limitation. The design doc can be found at: > > > > FLIP-362: Support minimum resource limitation > > > > > > > > Currently, the Flink cluster only requests Task Managers (TMs) when > > > there is a resource requirement, and idle TMs are released after a certain > > > period of time.
However, in certain scenarios, such as running short-lived > > > jobs in session clusters and scheduling batch jobs stage by > stage, we > > > need to improve the efficiency of job execution by maintaining a > certain > > > number of available workers in the cluster at all times. > > > > > > > > After discussing this with Yangze, we introduced this new feature. The newly > > > added public options and proposed changes are described in this FLIP. > > > > > > > > Looking forward to your feedback, thanks. > > > > > > > > Best regards, > > > > Xiangyu > > > > > > > >
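The minimum-resource behavior and the oscillation check described in the FLIP quotes above can be illustrated with a toy calculation. This is a simplifying sketch (the real DefaultResourceAllocationStrategy works on fine-grained resource profiles, and the function names here are hypothetical), but it shows how a configured minimum interacts with slot-driven demand:

```python
import math

def required_workers(pending_slots, slots_per_worker, min_workers):
    """Workers to keep around: enough for the pending slot demand,
    but never fewer than the configured minimum (toy FLIP-362 model)."""
    demand = math.ceil(pending_slots / slots_per_worker)
    return max(demand, min_workers)

def check_no_oscillation(min_workers, max_slots, slots_per_worker):
    """Startup consistency check mentioned in the thread: if the minimum
    worker number exceeds the maximum derived from the max-slots limit,
    the cluster would repeatedly allocate and release workers."""
    max_workers = max_slots // slots_per_worker
    if min_workers > max_workers:
        raise ValueError("minimum worker number exceeds maximum")

# An idle session cluster with min_workers=2 still keeps 2 TMs warm:
assert required_workers(pending_slots=0, slots_per_worker=4, min_workers=2) == 2
# A burst of 10 slot requests needs 3 workers of 4 slots each:
assert required_workers(pending_slots=10, slots_per_worker=4, min_workers=2) == 3
```

The warm minimum is what removes the cold-start delay for short-lived jobs that Zhanghao mentions, at the cost of idle resources.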
Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener
+1(binding) Thanks! Best Regards, Jing On Sun, Sep 24, 2023 at 10:30 PM Shammon FY wrote: > Hi devs, > > Thanks for all the feedback on FLIP-314: Support Customized Job Lineage > Listener [1] in thread [2]. > > I would like to start a vote for it. The vote will be opened for at least > 72 hours unless there is an objection or insufficient votes. > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener > [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc > > Best, > Shammon FY >
Re: [VOTE] FLIP-362: Support minimum resource limitation
+1(binding) Thanks! Best regards, Jing On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao wrote: > Thanks for driving this. +1 (non-binding) > > Best, > Zhanghao Chen > > From: xiangyu feng > Sent: Sep 25, 2023, 17:38 > To: dev@flink.apache.org > Subject: [VOTE] FLIP-362: Support minimum resource limitation > > Hi all, > > I would like to start the vote for FLIP-362: Support minimum resource > limitation[1]. > This FLIP was discussed in this thread [2]. > > The vote will be open for at least 72 hours unless there is an objection or > insufficient votes. > > Regards, > Xiangyu > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4 >
Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data
+1 (non-binding) On Sun, Sep 24, 2023, 6:49 PM Xintong Song wrote: > +1 (binding) > > Best, > > Xintong > > > > On Sat, Sep 23, 2023 at 10:16 PM Yuepeng Pan > wrote: > > > +1(non-binding), thank you for driving this proposal. > > > > Best, > > Yuepeng Pan. > > At 2023-09-22 14:07:45, "Dong Lin" wrote: > > >Hi all, > > > > > >We would like to start the vote for FLIP-327: Support switching from > batch > > >to stream mode to improve throughput when processing backlog data [1]. > > This > > >FLIP was discussed in this thread [2]. > > > > > >The vote will be open until at least Sep 27th (at least 72 > > >hours), following the consensus voting process. > > > > > >Cheers, > > >Xuannan and Dong > > > > > >[1] > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-327*3A*Support*switching*from*batch*to*stream*mode*to*improve*throughput*when*processing*backlog*data__;JSsrKysrKysrKysrKysr!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKResPPc7$ > > >[2] > https://urldefense.com/v3/__https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n__;!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKZEAz9yp$ > > >
Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data
+1(non binding) Best regards Ahmed Hamdy On Mon, 25 Sep 2023, 19:57 Venkatakrishnan Sowrirajan, wrote: > +1 (non-binding) > > On Sun, Sep 24, 2023, 6:49 PM Xintong Song wrote: > > > +1 (binding) > > > > Best, > > > > Xintong > > > > > > > > On Sat, Sep 23, 2023 at 10:16 PM Yuepeng Pan > > wrote: > > > > > +1(non-binding), thank you for driving this proposal. > > > > > > Best, > > > Yuepeng Pan. > > > At 2023-09-22 14:07:45, "Dong Lin" wrote: > > > >Hi all, > > > > > > > >We would like to start the vote for FLIP-327: Support switching from > > batch > > > >to stream mode to improve throughput when processing backlog data [1]. > > > This > > > >FLIP was discussed in this thread [2]. > > > > > > > >The vote will be open until at least Sep 27th (at least 72 > > > >hours), following the consensus voting process. > > > > > > > >Cheers, > > > >Xuannan and Dong > > > > > > > >[1] > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-327*3A*Support*switching*from*batch*to*stream*mode*to*improve*throughput*when*processing*backlog*data__;JSsrKysrKysrKysrKysr!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKResPPc7$ > > > >[2] > > > https://urldefense.com/v3/__https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n__;!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKZEAz9yp$ > > > > > >
Re: [ANNOUNCE] Release 1.18.0, release candidate #0
Hello! There's a security fix that probably should be applied to 1.18 in the next RC1: https://github.com/apache/flink/pull/23461 (bump to snappy-java). Do you think this would be possible to include? All my best, Ryan [1]: https://issues.apache.org/jira/browse/FLINK-33149 "Bump snappy-java to 1.1.10.4" On Mon, Sep 25, 2023 at 3:54 PM Jing Ge wrote: > > Thanks Zakelly for the update! Appreciate it! > > @Piotr Nowojski If you do not have any other > concerns, I will move forward to create 1.18 rc1 and start voting. WDYT? > > Best regards, > Jing > > On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan wrote: > > > Hi Jing and everyone, > > > > I have conducted three rounds of benchmarking with Java 11, comparing > > release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The > > results are attached[3]. Most of the tests show no obvious regression. > > However, I did observe significant changes in several tests. Upon > > reviewing the historical results from the previous pipeline, I also > > discovered a substantial variance in those tests, as shown in the > > timeline pictures included in the sheet[3]. I believe this variance > > has existed for a long time and requires further investigation, and > > fully measuring the variance requires more rounds (15 or more). I > > think for now it is not a blocker for release 1.18. WDYT? > > > > > > Best, > > Zakelly > > > > [1] > > https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb > > [2] > > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > > [3] > > https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing > > > > On Sun, Sep 24, 2023 at 11:29 AM ConradJam wrote: > > > > > > +1 for testing with Java 17 > > > > > > Jing Ge wrote on Sun, Sep 24, 2023 at 09:40: > > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort!
> > > > > > > > Best regards, > > > > Jing > > > > > > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan > > wrote: > > > > > > > > > Hi Jing, > > > > > > > > > > I agree we could wait for the result with Java 11. And it should be > > > > > available next Monday. > > > > > Additionally, I could also build a pipeline with Java 17 later since > > > > > it is supported in 1.18[1]. > > > > > > > > > > > > > > > Best regards, > > > > > Zakelly > > > > > > > > > > [1] > > > > > > > > > > > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53 > > > > > > > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge > > > > > wrote: > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > Thanks for your effort and the update! Since Java 8 has been > > > > > deprecated[1], > > > > > > let's wait for the result with Java 11. It should be available > > after > > > > the > > > > > > weekend and there should be no big surprise. WDYT? > > > > > > > > > > > > Best regards, > > > > > > Jing > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade > > > > > > > > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan < > > zakelly@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > I want to provide an update on the benchmark results that I have > > been > > > > > > > working on. After spending some time preparing the environment > > and > > > > > > > adjusting the benchmark script, I finally got a comparison > > between > > > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before the > > old > > > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on openjdk8. > > The > > > > > > > report is attached[3]. Note that the test has only run once on > > jdk8, > > > > > > > so the impact of single-test fluctuations is not ruled out. 
> > > > > > > Additionally, I have noticed some significant fluctuations in > > > > specific > > > > > > > tests when reviewing previous benchmark scores, which I have also > > > > > > > noted in the report. Taking all of these factors into > > consideration, > > > > I > > > > > > > think there is no obvious regression in release 1.18 *for now*. > > More > > > > > > > tests including the one on openjdk11 are on the way. Hope it > > does not > > > > > > > delay the release procedure. > > > > > > > > > > > > > > Please let me know if you have any concerns. > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > Zakelly > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/commit/2aeb99804ba56c008df0a1730f3246d3fea856b9 > > > > > > > [2] > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > > > > > > > [3] > > > > > > > > > > > > > > > > > > https://docs.google.com/spreadsheets/d/1-3Y974jYq_WrQNzLN-y_6lOU-NGXaIDTQBYTZd04tJ0/edit?usp=sharing > > > > > > > > > > > > > >
Re: [ANNOUNCE] Release 1.18.0, release candidate #0
Hi Ryan, Thanks for reaching out. It is fine to include it but we need to wait until the CI passes. I am not sure how long it will take, since there seems to be some infra issues. Best regards, Jing On Mon, Sep 25, 2023 at 11:34 AM Ryan Skraba wrote: > Hello! There's a security fix that probably should be applied to 1.18 > in the next RC1 : https://github.com/apache/flink/pull/23461 (bump to > snappy-java). Do you think this would be possible to include? > > All my best, Ryan > > [1]: https://issues.apache.org/jira/browse/FLINK-33149 "Bump > snappy-java to 1.1.10.4" > > > > On Mon, Sep 25, 2023 at 3:54 PM Jing Ge > wrote: > > > > Thanks Zakelly for the update! Appreciate it! > > > > @Piotr Nowojski If you do not have any other > > concerns, I will move forward to create 1.18 rc1 and start voting. WDYT? > > > > Best regards, > > Jing > > > > On Mon, Sep 25, 2023 at 2:20 AM Zakelly Lan > wrote: > > > > > Hi Jing and everyone, > > > > > > I have conducted three rounds of benchmarking with Java11, comparing > > > release 1.18 (commit: deb07e99560[1]) with commit 6d62f9918ea[2]. The > > > results are attached[3]. Most of the tests show no obvious regression. > > > However, I did observe significant change in several tests. Upon > > > reviewing the historical results from the previous pipeline, I also > > > discovered a substantial variance in those tests, as shown in the > > > timeline pictures included in the sheet[3]. I believe this variance > > > has existed for a long time and requires further investigation, and > > > fully measuring the variance requires more rounds (15 or more). I > > > think for now it is not a blocker for release 1.18. WDYT? 
> > > > > > > > > Best, > > > Zakelly > > > > > > [1] > > > > https://github.com/apache/flink/commit/deb07e99560b45033a629afc3f90666ad0a32feb > > > [2] > > > > https://github.com/apache/flink/commit/6d62f9918ea2cbb8a10c705a25a4ff6deab60711 > > > [3] > > > > https://docs.google.com/spreadsheets/d/1V0-duzNTgu7H6R7kioF-TAPhlqWl7Co6Q9ikTBuaULo/edit?usp=sharing > > > > > > On Sun, Sep 24, 2023 at 11:29 AM ConradJam > wrote: > > > > > > > > +1 for testing with Java 17 > > > > > > > > Jing Ge 于2023年9月24日周日 09:40写道: > > > > > > > > > +1 for testing with Java 17 too. Thanks Zakelly for your effort! > > > > > > > > > > Best regards, > > > > > Jing > > > > > > > > > > On Fri, Sep 22, 2023 at 1:01 PM Zakelly Lan > > > > wrote: > > > > > > > > > > > Hi Jing, > > > > > > > > > > > > I agree we could wait for the result with Java 11. And it should > be > > > > > > available next Monday. > > > > > > Additionally, I could also build a pipeline with Java 17 later > since > > > > > > it is supported in 1.18[1]. > > > > > > > > > > > > > > > > > > Best regards, > > > > > > Zakelly > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > https://github.com/apache/flink/commit/9c1318ca7fa5b2e7b11827068ad1288483aaa464#diff-8310c97396d60e96766a936ca8680f1e2971ef486cfc2bc55ec9ca5a5333c47fR53 > > > > > > > > > > > > On Fri, Sep 22, 2023 at 5:57 PM Jing Ge > > > > > > > wrote: > > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > Thanks for your effort and the update! Since Java 8 has been > > > > > > deprecated[1], > > > > > > > let's wait for the result with Java 11. It should be available > > > after > > > > > the > > > > > > > weekend and there should be no big surprise. WDYT? 
> > > > > > > > > > > > > > Best regards, > > > > > > > Jing > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.15/#jdk-upgrade > > > > > > > > > > > > > > On Fri, Sep 22, 2023 at 11:26 AM Zakelly Lan < > > > zakelly@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > I want to provide an update on the benchmark results that I > have > > > been > > > > > > > > working on. After spending some time preparing the > environment > > > and > > > > > > > > adjusting the benchmark script, I finally got a comparison > > > between > > > > > > > > release 1.18 (commit: 2aeb99804ba[1]) and the commit before > the > > > old > > > > > > > > codespeed server went down (commit: 6d62f9918ea[2]) on > openjdk8. > > > The > > > > > > > > report is attached[3]. Note that the test has only run once > on > > > jdk8, > > > > > > > > so the impact of single-test fluctuations is not ruled out. > > > > > > > > Additionally, I have noticed some significant fluctuations in > > > > > specific > > > > > > > > tests when reviewing previous benchmark scores, which I have > also > > > > > > > > noted in the report. Taking all of these factors into > > > consideration, > > > > > I > > > > > > > > think there is no obvious regression in release 1.18 *for > now*. > > > More > > > > > > > > tests including the one on openjdk11 are on the way. Hope it > > > does not > > > > > > > > delay the release procedure. > > > > > > > > > > > > > > > > Please let me know if you have
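Zakelly's comparison methodology above (single runs, known noisy tests, variance that "requires more rounds") amounts to flagging only those score changes that fall outside an assumed noise band. A hedged sketch of that kind of check follows; the 10% threshold and the higher-score-is-better convention are assumptions for illustration, not the actual benchmark pipeline's logic, and the benchmark names are made up:

```python
def flag_regressions(baseline, candidate, noise_threshold=0.10):
    """Return names of benchmarks whose candidate score dropped by more
    than noise_threshold relative to baseline (higher score = better)."""
    flagged = []
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None or base <= 0:
            continue  # skip missing or degenerate entries
        if (base - cand) / base > noise_threshold:
            flagged.append(name)
    return sorted(flagged)

baseline = {"serializerAvro": 1000.0, "tumblingWindow": 500.0}
candidate = {"serializerAvro": 980.0, "tumblingWindow": 420.0}
# Only the 16% drop falls outside the assumed 10% noise band:
print(flag_regressions(baseline, candidate))  # ['tumblingWindow']
```

As the thread notes, a single run cannot distinguish a flagged drop from normal fluctuation; repeating the comparison over many rounds is what separates real regressions from noise.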
RE: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released
Hi Martijn. Thanks for this. Should there also be docker images available? https://hub.docker.com/r/apache/flink-statefun/tags goes up to 3.2.0. Frans -Original Message- From: Martijn Visser Sent: Tuesday, September 19, 2023 11:37 AM To: dev@flink.apache.org; user ; user-zh ; n...@flink.apache.org; annou...@apache.org Subject: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released Stateful Functions is a cross-platform stack for building Stateful Serverless applications, making it radically simpler to develop scalable, consistent, and elastic distributed applications. This new release upgrades the Flink runtime to 1.16.2. Release highlight: - Upgrade underlying Flink dependency to 1.16.2 Release blogpost: https://flink.apache.org/2023/09/19/stateful-functions-3.3.0-release-announcement/ The release is available for download at: https://flink.apache.org/downloads/ Java SDK can be found at: https://search.maven.org/artifact/org.apache.flink/statefun-sdk-java/3.3.0/jar Python SDK can be found at: https://pypi.org/project/apache-flink-statefun/ GoLang SDK can be found at: https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.3.0 JavaScript SDK can be found at: https://www.npmjs.com/package/apache-flink-statefun Official Docker image for Flink Stateful Functions can be found at: https://hub.docker.com/r/apache/flink-statefun The full release notes are available in Jira: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351276 We would like to thank all contributors of the Apache Flink community who made this release possible! Regards, Martijn Visser
Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag
Hi Piotr, Thanks for the comments. Let me try to explain it below. Overall, the two competing options differ in how an invocation of `#setIsProcessingBacklog(false)` affects the backlog status for the given source (corresponding to the SplitEnumeratorContext instance on which this method is invoked). - With my approach, setIsProcessingBacklog(false) merely unsets the effects of any previous invocation of setIsProcessingBacklog(..) on the given source, without necessarily forcing the source's backlog status to be false. - With Jark's approach, setIsProcessingBacklog(false) forces the source's backlog status to be false. There is no practical difference between these two options as of FLIP-309. However, once we introduce an additional strategy (e.g. a job-level config) to configure the backlog status in FLIP-328, there will be tricky and important differences between them. More specifically, let's say we want to introduce a job-level config such as "pipeline.backlog.watermark-lag-threshold" as mentioned in FLIP-328: - With Jark's approach, if MySQL CDC invokes setIsProcessingBacklog(false) at the beginning of the "unbounded phase", then that effectively means isProcessingBacklog=false even if the watermark lag exceeds the configured threshold, preventing the job-level config from taking effect during the "unbounded phase". - With my approach, even if MySQL CDC invokes setIsProcessingBacklog(false) at the beginning of the "unbounded phase", the source can still have isProcessingBacklog=true when the watermark lag is too high. Does this clarify the difference between these two options? Regards, Dong On Mon, Sep 25, 2023 at 5:15 PM Piotr Nowojski wrote: > Hi Jark and Dong, > > I'm a bit confused about the difference between the two competing options.
> Could one of you elaborate what's the difference between: > > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides > > the effect of the previous invocation (if any) of > > `#setIsProcessingBacklog(true)` on the given source instance. > > and > > > 2) The semantics of `#setIsProcessingBacklog(false)` is that the given > > source instance will have watermarkLag=false. > > ? > > Best, > Piotrek > > czw., 21 wrz 2023 o 15:28 Dong Lin napisał(a): > > > Hi all, > > > > Jark and I discussed this FLIP offline and I will summarize our > discussion > > below. It would be great if you could provide your opinion of the > proposed > > options. > > > > Regarding the target use-cases: > > - We both agreed that MySQL CDC should have backlog=true when > watermarkLag > > is large during the binlog phase. > > - Dong argued that other streaming sources with watermarkLag defined > (e.g. > > Kafka) should also have backlog=true when watermarkLag is large. The > > pros/cons discussion below assume this use-case needs to be supported. > > > > The 1st option is what is currently proposed in FLIP-328, with the > > following key characteristics: > > 1) There is one job-level config (i.e. > > pipeline.backlog.watermark-lag-threshold) that applies to all sources > with > > watermarkLag metric defined. > > 2) The semantics of `#setIsProcessingBacklog(false)` is that it overrides > > the effect of the previous invocation (if any) of > > `#setIsProcessingBacklog(true)` on the given source instance. > > > > The 2nd option is what Jark proposed in this email thread, with the > > following key characteristics: > > 1) Add source-specific config (both Java API and SQL source property) to > > every source for which we want to set backlog status based on the > > watermarkLag metric. For example, we might add separate Java APIs > > `#setWatermarkLagThreshold` for MySQL CDC source, HybridSource, > > KafkaSource, PulsarSource etc. 
> > 2) The semantics of `#setIsProcessingBacklog(false)` is that the given > > source instance will have watermarkLag=false. > > > > Here are the key pros/cons of these two options. > > > > Cons of the 1st option: > > 1) The semantics of `#setIsProcessingBacklog(false)` is harder to > > understand for Flink operator developers than the corresponding semantics > > in option-2. > > > > Cons of the 2nd option: > > 1) More work for end-users. For a job with multiple sources that need to > be > > configured with a watermark lag threshold, users need to specify multiple > > configs (one for each source) instead of specifying one job-level config. > > > > 2) More work for Flink operator developers. Overall there are more public > > APIs (one Java API and one SQL property for each source that needs to > > determine backlog based on watermark) exposed to end users. This also > adds > > more burden for the Flink community to maintain these APIs. > > > > 3) It would be hard (e.g. require backward incompatible API change) to > > extend the Flink runtime to support job-level config to set watermark > > strategy in the future (e.g. support the > > pipeline.backlog.watermark-lag-threshold in option-1). This is because an > > existing source operator's code might have hardcoded a
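The difference between the two proposed semantics can be modeled in a few lines of plain Java. This is an illustrative sketch only (not Flink code; method names and the way strategies are combined are assumptions based on the descriptions above): under option 1, `setIsProcessingBacklog(false)` merely clears the source's own vote and another strategy such as the watermark-lag threshold can still force backlog=true, while under option 2 the explicit call wins outright.

```java
// Illustrative model (NOT the Flink API) of the two competing semantics for
// SplitEnumeratorContext#setIsProcessingBacklog discussed in this thread.
public class BacklogSemantics {

    // Option 1 (Dong): the per-source call only sets/unsets that source's own
    // vote; the effective status is true if ANY strategy reports backlog.
    static boolean option1(Boolean sourceVote, boolean lagExceedsThreshold) {
        boolean sourceSaysBacklog = sourceVote != null && sourceVote;
        return sourceSaysBacklog || lagExceedsThreshold;
    }

    // Option 2 (Jark): an explicit setIsProcessingBacklog(false) forces the
    // status to false, overriding strategies such as the lag threshold.
    static boolean option2(Boolean sourceVote, boolean lagExceedsThreshold) {
        if (sourceVote != null) {
            return sourceVote; // the explicit per-source call wins
        }
        return lagExceedsThreshold;
    }

    public static void main(String[] args) {
        // MySQL CDC enters the unbounded phase and calls
        // setIsProcessingBacklog(false), but the watermark lag is still high:
        System.out.println(option1(false, true)); // true: lag strategy still applies
        System.out.println(option2(false, true)); // false: job-level config bypassed
    }
}
```

Under this model, the "pesky" case from the thread is exactly `(sourceVote=false, lagExceedsThreshold=true)`, where the two options disagree.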
Re: [VOTE] FLIP-362: Support minimum resource limitation
+1(binding), thanks for the proposal. Best, Shammon FY On Mon, Sep 25, 2023 at 11:20 PM Jing Ge wrote: > +1(binding) Thanks! > > Best regards, > Jing > > On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao > wrote: > > > Thanks for driving this. +1 (non-binding) > > > > Best, > > Zhanghao Chen > > > > From: xiangyu feng > > Sent: September 25, 2023 17:38 > > To: dev@flink.apache.org > > Subject: [VOTE] FLIP-362: Support minimum resource limitation > > > > Hi all, > > > > I would like to start the vote for FLIP-362: Support minimum resource > > limitation[1]. > > This FLIP was discussed in this thread [2]. > > > > The vote will be open for at least 72 hours unless there is an objection > or > > insufficient votes. > > > > Regards, > > Xiangyu > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation > > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4 > > >
Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener
+1(no-binding) thanks for driving this proposal Best, Feng On Mon, Sep 25, 2023 at 11:19 PM Jing Ge wrote: > +1(binding) Thanks! > > Best Regards, > Jing > > On Sun, Sep 24, 2023 at 10:30 PM Shammon FY wrote: > > > Hi devs, > > > > Thanks for all the feedback on FLIP-314: Support Customized Job Lineage > > Listener [1] in thread [2]. > > > > I would like to start a vote for it. The vote will be opened for at least > > 72 hours unless there is an objection or insufficient votes. > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener > > [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc > > > > Best, > > Shammon FY > > >
Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener
+1(binding) Best, Leonard > On Sep 26, 2023, at 9:59 AM, Feng Jin wrote: > > +1(no-binding) > > > thanks for driving this proposal > > > Best, > Feng > > On Mon, Sep 25, 2023 at 11:19 PM Jing Ge wrote: > >> +1(binding) Thanks! >> >> Best Regards, >> Jing >> >> On Sun, Sep 24, 2023 at 10:30 PM Shammon FY wrote: >> >>> Hi devs, >>> >>> Thanks for all the feedback on FLIP-314: Support Customized Job Lineage >>> Listener [1] in thread [2]. >>> >>> I would like to start a vote for it. The vote will be opened for at least >>> 72 hours unless there is an objection or insufficient votes. >>> >>> [1] >>> >>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener >>> [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc >>> >>> Best, >>> Shammon FY >>> >>
Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment
Hi all, Thanks for the review! Becket and I discussed this FLIP offline and we agreed on several things that need to be improved in this FLIP. I will summarize our discussion with the problems and TODOs. We will update the FLIP and let you know once the FLIP is ready for review again. 1) Investigate whether it is possible to update the existing GlobalWindows in a backward-compatible way and re-use it for the same purpose as EndOfStreamWindows, without introducing EndOfStreamWindows as a new class. Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger instance which will not trigger the window's computation even at end-of-input. We will need to investigate its existing usage and see if we can re-use it in a backward-compatible way. 2) Let the JM know whether any operator in the upstream of the operator with "isOutputOnEOF=true" will emit output via any side channel. The FLIP should update the execution mode of those operators *only if* all outputs from those operators are emitted only at the end of input. More specifically, the upstream operator might involve a user-defined operator that might emit output directly to an external service, where the emission operation is not explicitly expressed as an operator's output edge and thus not visible to the JM. Similarly, it is also possible for the user-defined operator to register a timer via InternalTimerService#registerEventTimeTimer and emit output to an external service inside Triggerable#onEventTime. There is a chance that users still need related logic to output data in real-time, even if the downstream operators have isOutputOnEOF=true. One possible solution to address this problem is to add an extra OperatorAttribute to specify whether this operator might output records in such a way that does not go through the operator's output (e.g. side output). Then the JM can safely enable the runtime optimization currently described in the FLIP when there is no such operator. 
3) Create a follow-up FLIP that allows users to specify whether a source with Boundedness=bounded should have isProcessingBacklog=true. This capability would effectively introduce a 3rd strategy to set backlog status (in addition to FLIP-309 and FLIP-328). It might be useful to note that, even though the data in bounded sources are backlog data in most practical use-cases, this is not necessarily true. For example, users might want to start a Flink job to consume real-time data from a Kafka topic and specify that the job stops after 24 hours, which means the source is technically bounded while the data is fresh/real-time. This capability is more generic and can cover more use-cases than EndOfStreamWindows. On the other hand, EndOfStreamWindows will still be useful in cases where users already need to specify this window assigner in a DataStream program, without requiring users to decide whether it is safe to treat data in a bounded source as backlog data. Regards, Dong On Mon, Sep 18, 2023 at 2:56 PM Yuxin Tan wrote: > Hi, Dong, > Thanks for your efforts. > > +1 to this proposal, > I believe this will improve the performance in some mixture circumstances > of bounded and unbounded workloads. > > Best, > Yuxin > > > Xintong Song wrote on Mon, Sep 18, 2023 at 10:56: > > > Thanks for addressing my comments, Dong. > > > > LGTM. > > > > Best, > > > > Xintong > > > > > > > > On Sat, Sep 16, 2023 at 3:34 PM Wencong Liu > wrote: > > > > > Hi Dong & Jinhao, > > > > > > Thanks for your clarification! +1 > > > > > > Best regards, > > > Wencong > > > At 2023-09-15 11:26:16, "Dong Lin" wrote: > > > >Hi Wencong, > > > > > > > >Thanks for your comments! Please see my reply inline. 
> > > > > > > >On Thu, Sep 14, 2023 at 12:30 PM Wencong Liu > > > wrote: > > > > > > > >> Dear Dong, > > > >> > > > >> I have thoroughly reviewed the proposal for FLIP-331 and believe it > > > would > > > >> be > > > >> a valuable addition to Flink. However, I do have a few questions > that > > I > > > >> would > > > >> like to discuss: > > > >> > > > >> > > > >> 1. The FLIP-331 proposed the EndOfStreamWindows that is implemented > by > > > >> TimeWindow with maxTimestamp = (Long.MAX_VALUE - 1), which naturally > > > >> supports WindowedStream and AllWindowedStream to process all records > > > >> belonging to a key in a 'global' window under both STREAMING and > BATCH > > > >> runtime execution mode. > > > >> > > > >> > > > >> However, besides coGroup and keyBy().aggregate(), other operators on > > > >> WindowedStream and AllWindowedStream, such as join/reduce, etc, > > > currently > > > >> are still implemented based on WindowOperator. > > > >> > > > >> > > > >> In fact, these operators can also be implemented without using > > > >> WindowOperator > > > >> to prevent additional WindowAssigner#assignWindows or > > > >> triggerContext#onElement > > > >> invocation cost. Will there be plans to support these operators in > the >
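The safety check sketched in point 2) of the summary above can be illustrated in plain Java. This is a hypothetical model only (not the FLIP-331 implementation; the attribute names are invented for illustration): the JM may apply the isOutputOnEOF deployment optimization only if no upstream operator can emit records outside its visible output edges, e.g. directly to an external service or from a timer callback.

```java
import java.util.List;

// Illustrative sketch (NOT Flink code) of the proposed safety check: only
// enable the EOF runtime optimization when no upstream operator bypasses the
// operator's visible output edges (the "side channel" case from the thread).
public class EofOptimizationCheck {

    static class OperatorAttrs {
        final boolean outputOnEof;          // attribute proposed in FLIP-331
        final boolean emitsViaSideChannel;  // hypothetical extra attribute
        OperatorAttrs(boolean outputOnEof, boolean emitsViaSideChannel) {
            this.outputOnEof = outputOnEof;
            this.emitsViaSideChannel = emitsViaSideChannel;
        }
    }

    // Safe iff the EOF operator declares isOutputOnEOF=true and no upstream
    // operator emits output invisibly to the JM.
    static boolean canOptimize(List<OperatorAttrs> upstream, OperatorAttrs eofOp) {
        if (!eofOp.outputOnEof) {
            return false;
        }
        return upstream.stream().noneMatch(op -> op.emitsViaSideChannel);
    }

    public static void main(String[] args) {
        OperatorAttrs agg = new OperatorAttrs(true, false);
        OperatorAttrs plainMap = new OperatorAttrs(false, false);
        OperatorAttrs externalWriter = new OperatorAttrs(false, true);
        System.out.println(canOptimize(List.of(plainMap), agg));       // true
        System.out.println(canOptimize(List.of(externalWriter), agg)); // false
    }
}
```

With such an attribute, a user-defined operator that writes to an external service from Triggerable#onEventTime would declare `emitsViaSideChannel=true` and thereby disable the optimization for its downstream EOF operators.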
Re: [VOTE] FLIP-362: Support minimum resource limitation
+1 (binding) Best, Weihua On Tue, Sep 26, 2023 at 10:00 AM Shammon FY wrote: > +1(binding), thanks for the proposal. > > Best, > Shammon FY > > On Mon, Sep 25, 2023 at 11:20 PM Jing Ge > wrote: > > > +1(binding) Thanks! > > > > Best regards, > > Jing > > > > On Mon, Sep 25, 2023 at 3:48 AM Chen Zhanghao > > > wrote: > > > > > Thanks for driving this. +1 (non-binding) > > > > > > Best, > > > Zhanghao Chen > > > > > > From: xiangyu feng > > > Sent: September 25, 2023 17:38 > > > To: dev@flink.apache.org > > > Subject: [VOTE] FLIP-362: Support minimum resource limitation > > > > > > Hi all, > > > > > > I would like to start the vote for FLIP-362: Support minimum resource > > > limitation[1]. > > > This FLIP was discussed in this thread [2]. > > > > > > The vote will be open for at least 72 hours unless there is an > objection > > or > > > insufficient votes. > > > > > > Regards, > > > Xiangyu > > > > > > [1] > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-362%3A+Support+minimum+resource+limitation > > > [2] https://lists.apache.org/thread/m2v9n4yynm97v8swhqj2o5k0sqlb5ym4 > > > > > >
Re: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released
Hi Frans, Good remark, I still need to provide the images to those who have access to Docker Hub, but I haven't been able to do that yet. Hopefully I can do that at the end of the week. Best regards, Martijn On Mon, Sep 25, 2023 at 2:04 PM wrote: > > Hi Martijn. > > Thanks for this. Should there also be docker images available? > https://hub.docker.com/r/apache/flink-statefun/tags goes up to 3.2.0. > > Frans > > -Original Message- > From: Martijn Visser > Sent: Tuesday, September 19, 2023 11:37 AM > To: dev@flink.apache.org; user ; user-zh > ; n...@flink.apache.org; annou...@apache.org > Subject: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released > > Stateful Functions is a cross-platform stack for building Stateful Serverless > applications, making it radically simpler to develop scalable, consistent, > and elastic distributed applications. This new release upgrades the Flink > runtime to 1.16.2. > > Release highlight: > - Upgrade underlying Flink dependency to 1.16.2 > > Release blogpost: > https://flink.apache.org/2023/09/19/stateful-functions-3.3.0-release-announcement/ > > The release is available for download at: > https://flink.apache.org/downloads/ > > Java SDK can be found at: > https://search.maven.org/artifact/org.apache.flink/statefun-sdk-java/3.3.0/jar > > Python SDK can be found at: > https://pypi.org/project/apache-flink-statefun/ > > GoLang SDK can be found at: > https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.3.0 > > JavaScript SDK can be found at: > https://www.npmjs.com/package/apache-flink-statefun > > Official Docker image for Flink Stateful Functions can be found at: > https://hub.docker.com/r/apache/flink-statefun > > The full release notes are available in Jira: > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351276 > > We would like to thank all contributors of the Apache Flink community who > made this release possible! > > Regards, > Martijn Visser > >
Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources
Hi Jing, I've updated the Compatibility, Deprecation, and Migration Plan section to list all the potential compatibility issues for users who want to upgrade an existing job to use this feature: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150. Best, Zhanghao Chen From: Jing Ge Sent: September 25, 2023 23:02 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL Sources Hi Zhanghao, Thanks for driving the FLIP. This is a nice feature users are looking for. From the users' perspective, would you like to add an explicit description about any potential (or no) compatibility issues if users want to use this new feature and start existing jobs with savepoints or checkpoints? Best regards, Jing On Sun, Sep 24, 2023 at 9:05 PM Chen Zhanghao wrote: > Hi Lincoln, > > Thanks for the comments. > > - For concerns #1, I agree that we do not always produce an optimal plan for > both cases. However, it is difficult to do so and non-trivial complexity is > expected. On the other hand, although our proposal generates a sub-optimal > plan when the bottleneck is on the aggregate operator, it still provides > more flexibility for performance tuning. Therefore, I think we can > implement it in the straightforward way first, WDYT? > > - For concerns #2, I'd like to clarify a bit: an exception will only be > thrown iff the source may produce delete/update messages AND no primary > key is specified AND the source parallelism is different from the default > parallelism. So for CDC without pk, we're still good if the source parallelism > is not specified. > > - For concerns #3, at the current point, I think making the name more > specific is better as no other future use cases can be foreseen now. > Since this is an internal API, we are free to refactor it later if needed. > > - For concerns about the adaptive scheduler, the problems you mentioned do > exist, but I don't think it is relevant here. 
The planner may encode some > hints to help the scheduler, but after all, those efforts should be > initiated on the scheduler side. > > Best, > Zhanghao Chen > > From: Lincoln Lee > Sent: September 22, 2023 23:19 > To: dev@flink.apache.org > Subject: Re: [DISCUSS] FLIP-367: Support Setting Parallelism for Table/SQL > Sources > > Hi Zhanghao, > > Thanks for the FLIP and discussion! Hope this reply isn't too late : ) > Firstly, I fully agree with the motivation of this FLIP and the value for > the users, but there are a few things we should consider (please correct me > if I'm misunderstanding): > > *1. *It seems that the current solution only takes care of part of the > requirement; the need to set a source's parallelism may be different in > different jobs. For example, consider the following two job topologies (one > {} simply represents a vertex): > a. {source -> calc -> sink} > > b. {source -> calc} -> {aggregate} -> {sink} > > For job a, if there is a bottleneck in the calc operator, but the source > parallelism cannot be scaled up (e.g., limited by kafka's partition > number), the calc operator cannot be scaled up to achieve higher > throughput because the operators in the source vertex are chained together, > so the current solution is reasonable (break the chain, add a shuffle). > > But for job b, if the bottleneck is the aggregate operator (not calc), it would > likely be better to scale up the aggregate operator/vertex without > breaking the {source -> calc} chain, as this will incur additional shuffle > cost. > So if we decide to add this new feature, I would recommend that both cases > be taken care of. > > > 2. The assumption that a cdc source must have a pk (primary key) may not be > reasonable; for example, mysql cdc supports the case without a pk( > > https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#tables-without-primary-keys > ), > so we cannot just raise an error here. > > > 3. 
for the new SourceTransformationWrapper I have some concerns about the > future evolution, if we need to add support for other operators, do we > continue to add new xxWrappers? > > I've also revisited the previous discussion on FLIP-146[1], there were no > clear conclusions or good ideas about similar support issues for the source > before, and I also noticed that the new capability to change per-vertex > parallelism via rest api in 1.18 (part of FLIP-291[2][3], but there is > actually an issue about sql job's parallelism change which may require a > hash shuffle to ensure the order of update stream, this needs to be > followed up in FLIP-291, a jira will be created later). So perhaps, we > need to think about it more (the next version is not yet launched, so we > still have time) > > [1] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87 > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > [3] https://issues.apache.org/jira/browse/FLI
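The exception condition Zhanghao clarifies above (concern #2) can be written down as a small predicate. The sketch below is illustrative only (method and parameter names are invented, not the planner's actual code): a configured source parallelism is rejected only when all three conditions hold at once.

```java
// Plain-Java sketch (NOT the Flink planner code) of the rejection rule for
// FLIP-367: an exception is thrown iff the source may produce delete/update
// messages AND no primary key is specified AND the configured source
// parallelism differs from the default parallelism.
public class SourceParallelismCheck {

    static boolean rejected(boolean producesUpdatesOrDeletes,
                            boolean hasPrimaryKey,
                            int sourceParallelism,
                            int defaultParallelism) {
        return producesUpdatesOrDeletes
                && !hasPrimaryKey
                && sourceParallelism != defaultParallelism;
    }

    public static void main(String[] args) {
        // CDC source without pk, parallelism not overridden: still fine.
        System.out.println(rejected(true, false, 4, 4));  // false
        // CDC source without pk AND custom parallelism: rejected, because the
        // redistribution could reorder the update stream without a pk to hash by.
        System.out.println(rejected(true, false, 2, 4));  // true
        // Insert-only source: any parallelism is fine.
        System.out.println(rejected(false, false, 2, 4)); // false
    }
}
```

This also shows why Lincoln's mysql-cdc-without-pk example is not blocked outright: such a source only hits the error when the user additionally overrides its parallelism.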
[jira] [Created] (FLINK-33156) Remove flakiness from tests in OperatorStateBackendTest.java
Asha Boyapati created FLINK-33156: - Summary: Remove flakiness from tests in OperatorStateBackendTest.java Key: FLINK-33156 URL: https://issues.apache.org/jira/browse/FLINK-33156 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.17.1 Reporter: Asha Boyapati Fix For: 1.17.1 This issue is similar to: https://issues.apache.org/jira/browse/FLINK-32963 We are proposing to make the following tests stable: {quote}org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreSync org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreAsync{quote} The tests are currently flaky because the order of elements returned by iterators is non-deterministic. The following PR fixes the flaky test by making it independent of the order of elements returned by the iterator: https://github.com/asha-boyapati/flink/pull/2 We detected this using the NonDex tool using the following commands: {quote}mvn edu.illinois:nondex-maven-plugin:2.1.1:nondex -pl flink-runtime -DnondexRuns=10 -Dtest=org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreSync mvn edu.illinois:nondex-maven-plugin:2.1.1:nondex -pl flink-runtime -DnondexRuns=10 -Dtest=org.apache.flink.runtime.state.OperatorStateBackendTest#testSnapshotRestoreAsync{quote} Please see the following Continuous Integration log that shows the flakiness: https://github.com/asha-boyapati/flink/actions/runs/6193757385 Please see the following Continuous Integration log that shows that the flakiness is fixed by this change: https://github.com/asha-boyapati/flink/actions/runs/619409 -- This message was sent by Atlassian Jira (v8.20.10#820010)
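The fix pattern described in this ticket can be sketched generically. This is an illustrative example (not the actual OperatorStateBackendTest code): an assertion that consumes an iterator in a fixed order is flaky whenever the backing collection has no deterministic iteration order, while comparing the elements as sets is order-independent.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Generic illustration of the flakiness fixed in FLINK-33156: order-dependent
// vs. order-independent assertions over an iterator whose element order is
// non-deterministic (e.g. backed by a hash-based structure).
public class OrderIndependentAssertion {

    // Flaky style: passes only if the iterator happens to yield the expected order.
    static boolean orderDependentCheck(Iterator<String> it, List<String> expected) {
        for (String e : expected) {
            if (!it.hasNext() || !e.equals(it.next())) {
                return false;
            }
        }
        return !it.hasNext();
    }

    // Stable style: compare the elements as sets, ignoring iteration order.
    static boolean orderIndependentCheck(Iterator<String> it, List<String> expected) {
        Set<String> actual = new HashSet<>();
        it.forEachRemaining(actual::add);
        return actual.equals(new HashSet<>(expected));
    }

    public static void main(String[] args) {
        List<String> expected = List.of("a", "b", "c");
        // Simulate a non-deterministic iteration order with a shuffled source:
        System.out.println(orderDependentCheck(
                List.of("c", "a", "b").iterator(), expected));   // false -> flaky
        System.out.println(orderIndependentCheck(
                List.of("c", "a", "b").iterator(), expected));   // true  -> stable
    }
}
```

Tools like NonDex (used in the ticket) surface exactly this class of bug by randomizing the iteration order of hash-based collections across runs.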