[jira] [Updated] (FLINK-31901) AbstractBroadcastWrapperOperator should not block checkpoint barriers when processing cached records
[ https://issues.apache.org/jira/browse/FLINK-31901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhipeng Zhang updated FLINK-31901: -- Description: Currently `BroadcastUtils#withBroadcast` tries to cache the non-broadcast input until the broadcast inputs are all processed. After the broadcast variables are ready, we first process the cached records and then continue to process the newly arrived records. Processing cached elements is invoked via `Input#processElement` and `Input#processWatermark`. However, processing cached elements may take a long time since there may be many cached records, which could potentially block the checkpoint barrier. If we run the code snippet here [1], we get logs like the following.
{code:java}
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 1 at time: 1682319149462
processed cached records, cnt: 1 at time: 1682319149569
processed cached records, cnt: 2 at time: 1682319149614
processed cached records, cnt: 3 at time: 1682319149655
processed cached records, cnt: 4 at time: 1682319149702
processed cached records, cnt: 5 at time: 1682319149746
processed cached records, cnt: 6 at time: 1682319149781
processed cached records, cnt: 7 at time: 1682319149891
processed cached records, cnt: 8 at time: 1682319150011
processed cached records, cnt: 9 at time: 1682319150116
processed cached records, cnt: 10 at time: 1682319150199
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 2 at time: 1682319150378
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 3 at time: 1682319150606
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 4 at time: 1682319150704
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 5 at time: 1682319150785
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 6 at time: 1682319150859
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 7 at time: 1682319150935
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 8 at time: 1682319151007{code}
We can see that from line#2 to line#11 there are no checkpoints: the barriers are blocked until all cached elements are processed, which takes ~600ms, much longer than the checkpoint interval (i.e., 100ms).
[1] https://github.com/zhipeng93/flink-ml/tree/FLINK-31901-demo-case

was: Currently `BroadcastUtils#withBroadcast` tries to cache the non-broadcast input until the broadcast inputs are all processed. After the broadcast variables are ready, we first process the cached records and then continue to process the newly arrived records. Processing cached elements is invoked via `Input#processElement` and `Input#processWatermark`. However, processing cached elements may take a long time since there may be many cached records, which could potentially block the checkpoint barrier. If we run the code snippet here [1], we get logs like the following.
{code:java}
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 1 at time: 1682319149462
processed cached records, cnt: 1 at time: 1682319149569
processed cached records, cnt: 2 at time: 1682319149614
processed cached records, cnt: 3 at time: 1682319149655
processed cached records, cnt: 4 at time: 1682319149702
processed cached records, cnt: 5 at time: 1682319149746
processed cached records, cnt: 6 at time: 1682319149781
processed cached records, cnt: 7 at time: 1682319149891
processed cached records, cnt: 8 at time: 1682319150011
processed cached records, cnt: 9 at time: 1682319150116
processed cached records, cnt: 10 at time: 1682319150199
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 2 at time: 1682319150378
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 3 at time: 1682319150606
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 4 at time: 1682319150704
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 5 at time: 1682319150785
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 6 at time: 1682319150859
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 7 at time: 1682319150935
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 8 at time: 1682319151007{code}
We can see that from line#3 to line#12 there are no checkpoints: the barriers are blocked until all cached elements are processed, which takes ~600ms, much longer than the checkpoint interval (i.e., 100ms).
[1] https://github.com/zhipeng93/flink-ml/tree/FLINK-31901-demo-case

> AbstractBroadcastWrapperOperator should not block checkpoint barriers when
> processing cached records
>
>
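Editor's note: one way to avoid blocking the barrier, sketched below, is to drain the cache in bounded chunks rather than in one uninterruptible loop, so a checkpoint barrier can be handled between chunks. This is a minimal illustrative sketch, not the actual `AbstractBroadcastWrapperOperator` code; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical sketch: drain cached records in bounded chunks so that other
// work (e.g. handling a checkpoint barrier) can run between two chunks,
// instead of replaying the whole cache in one uninterruptible loop.
class ChunkedCacheDrainer<T> {
    private final Queue<T> cache = new ArrayDeque<>();
    private final int chunkSize;

    ChunkedCacheDrainer(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    void add(T record) {
        cache.add(record);
    }

    // Processes at most chunkSize cached records; returns true if records remain.
    boolean drainChunk(Consumer<T> processor) {
        for (int i = 0; i < chunkSize && !cache.isEmpty(); i++) {
            processor.accept(cache.poll());
        }
        return !cache.isEmpty();
    }
}

public class Main {
    public static void main(String[] args) {
        ChunkedCacheDrainer<Integer> drainer = new ChunkedCacheDrainer<>(2);
        for (int i = 0; i < 5; i++) {
            drainer.add(i);
        }
        int chunks = 0;
        do {
            chunks++;
            // In a real operator, a pending checkpoint barrier could be
            // processed here, between two chunks of cached records.
        } while (drainer.drainChunk(r -> { }));
        System.out.println("drained in " + chunks + " chunks");
    }
}
```

With a chunk size tuned well below the checkpoint interval's worth of work, barriers would no longer wait for the entire cache to drain.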
[jira] [Updated] (FLINK-32202) useless configuration
[ https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdong7 updated FLINK-32202: --- Description: According to the official Flink documentation, the parameter query.server.ports has been replaced by queryable-state.server.ports, but the parameter query.server.ports:6125 will be generated when Flink starts. Is this a historical problem? public static final ConfigOption SERVER_PORT_RANGE = ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new String[]\{"query.server.ports"}); was:According to the official Flink documentation, the parameter query.server.ports has been replaced by queryable-state.server.ports, but the parameter query.server.ports:6125 will be generated when Flink starts. Is this a historical problem? > useless configuration > - > > Key: FLINK-32202 > URL: https://issues.apache.org/jira/browse/FLINK-32202 > Project: Flink > Issue Type: Improvement > Components: API / Core >Affects Versions: 1.15.4 >Reporter: zhangdong7 >Priority: Minor > > According to the official Flink documentation, the parameter > query.server.ports has been replaced by queryable-state.server.ports, but the > parameter query.server.ports:6125 will be generated when Flink starts. Is > this a historical problem? > > > public static final ConfigOption SERVER_PORT_RANGE = > ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The > port range of the queryable state server. 
The specified range can be a > single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges > and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new > String[]\{"query.server.ports"}); -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32202) useless configuration
[ https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdong7 updated FLINK-32202: --- Description: According to the official Flink documentation, the parameter query.server.ports has been replaced by queryable-state.server.ports, but the parameter query.server.ports:6125 will be generated when Flink starts. Is this a historical problem? {code:java} public static final ConfigOption SERVER_PORT_RANGE = ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new String[]{"query.server.ports"});{code} was: According to the official Flink documentation, the parameter query.server.ports has been replaced by queryable-state.server.ports, but the parameter query.server.ports:6125 will be generated when Flink starts. Is this a historical problem? public static final ConfigOption SERVER_PORT_RANGE = ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new String[]\{"query.server.ports"}); > useless configuration > - > > Key: FLINK-32202 > URL: https://issues.apache.org/jira/browse/FLINK-32202 > Project: Flink > Issue Type: Improvement > Components: API / Core >Affects Versions: 1.15.4 >Reporter: zhangdong7 >Priority: Minor > > According to the official Flink documentation, the parameter > query.server.ports has been replaced by queryable-state.server.ports, but the > parameter query.server.ports:6125 will be generated when Flink starts. Is > this a historical problem? 
> > {code:java} > public static final ConfigOption SERVER_PORT_RANGE = > ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The > port range of the queryable state server. The specified range can be a > single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges > and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new > String[]{"query.server.ports"});{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
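Editor's note: the `withDeprecatedKeys("query.server.ports")` call quoted above means the old key is still honored as a fallback, which is why it can still appear at startup. The following self-contained sketch illustrates that resolution order (current key, then deprecated keys, then default); it is a simplification for illustration, not Flink's actual `ConfigOption` implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch (not Flink's actual code) of how an option with
// deprecated keys is resolved: the current key wins, then deprecated keys
// are consulted in order, then the default value applies.
public class Main {
    static String resolve(Map<String, String> conf, String currentKey,
                          List<String> deprecatedKeys, String defaultValue) {
        if (conf.containsKey(currentKey)) {
            return conf.get(currentKey);
        }
        for (String deprecated : deprecatedKeys) {
            if (conf.containsKey(deprecated)) {
                return conf.get(deprecated);
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("query.server.ports", "6125"); // old, deprecated key
        String ports = resolve(conf, "queryable-state.server.ports",
                Arrays.asList("query.server.ports"), "9067");
        // The deprecated key still takes effect, so the old entry is not
        // simply dead configuration.
        System.out.println(ports);
    }
}
```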
[jira] [Updated] (FLINK-32202) useless configuration
[ https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdong7 updated FLINK-32202: --- Environment: (was: public static final ConfigOption SERVER_PORT_RANGE = ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new String[]\{"query.server.ports"});) > useless configuration > - > > Key: FLINK-32202 > URL: https://issues.apache.org/jira/browse/FLINK-32202 > Project: Flink > Issue Type: Improvement > Components: API / Core >Affects Versions: 1.15.4 >Reporter: zhangdong7 >Priority: Minor > > According to the official Flink documentation, the parameter > query.server.ports has been replaced by queryable-state.server.ports, but the > parameter query.server.ports:6125 will be generated when Flink starts. Is > this a historical problem? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32202) useless configuration
zhangdong7 created FLINK-32202: -- Summary: useless configuration Key: FLINK-32202 URL: https://issues.apache.org/jira/browse/FLINK-32202 Project: Flink Issue Type: Improvement Components: API / Core Affects Versions: 1.15.4 Environment: public static final ConfigOption SERVER_PORT_RANGE = ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new String[]\{"query.server.ports"}); Reporter: zhangdong7 According to the official Flink documentation, the parameter query.server.ports has been replaced by queryable-state.server.ports, but the parameter query.server.ports:6125 will be generated when Flink starts. Is this a historical problem? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32201) Enable the distribution of shuffle descriptors via the blob server by connection number
Weihua Hu created FLINK-32201: - Summary: Enable the distribution of shuffle descriptors via the blob server by connection number Key: FLINK-32201 URL: https://issues.apache.org/jira/browse/FLINK-32201 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Weihua Hu

Flink supports distributing shuffle descriptors via the blob server to reduce JobManager overhead, but the default threshold to enable it is 1MB, which is almost never reached. Users need to set a proper value for this, which requires advanced knowledge before configuring it. I would like to enable this feature based on the number of connections of a group of shuffle descriptors. For example, consider a simple streaming job with two operators, each with 10,000 parallelism and connected via all-to-all distribution. In this job, we only get one set of shuffle descriptors, and this group has 10,000 * 10,000 connections. This means that the JobManager needs to send this set of shuffle descriptors to 10,000 tasks. Since it is also difficult for users to configure, I would like to give it a default value. The serialized shuffle descriptor sizes for different parallelisms are shown below.
|| Producer parallelism || Serialized shuffle descriptor size || Consumer parallelism || Total data size the JM needs to send ||
| 5,000 | 100KB | 5,000 | 500MB |
| 10,000 | 200KB | 10,000 | 2GB |
| 20,000 | 400KB | 20,000 | 8GB |
So, I would like to set the default value to 10,000 * 10,000. Any suggestions or concerns are appreciated. -- This message was sent by Atlassian Jira (v8.20.10#820010)
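Editor's note: the totals in the table follow from a single multiplication: the JobManager sends one serialized descriptor set to each consumer task, so total traffic is roughly descriptor size times consumer parallelism. A quick sanity check of that arithmetic (sizes in decimal KB; helper name is illustrative):

```java
// Sanity-checks the arithmetic behind the table above: total data the
// JobManager must send ~ (serialized descriptor size) x (consumer parallelism).
public class Main {
    static long totalKB(long descriptorKB, long consumerParallelism) {
        return descriptorKB * consumerParallelism;
    }

    public static void main(String[] args) {
        System.out.println(totalKB(100, 5_000));  // 500,000 KB, i.e. ~500MB
        System.out.println(totalKB(200, 10_000)); // 2,000,000 KB, i.e. ~2GB
        System.out.println(totalKB(400, 20_000)); // 8,000,000 KB, i.e. ~8GB
    }
}
```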
[GitHub] [flink-web] Matrix42 opened a new pull request, #656: fix logo redirect to wrong url
Matrix42 opened a new pull request, #656: URL: https://github.com/apache/flink-web/pull/656 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-32192) JsonBatchFileSystemITCase fail due to Process Exit Code: 239 (NoClassDefFoundError: akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1)
[ https://issues.apache.org/jira/browse/FLINK-32192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Nuyanzin updated FLINK-32192: Summary: JsonBatchFileSystemITCase fail due to Process Exit Code: 239 (NoClassDefFoundError: akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1) (was: JsonBatchFileSystemITCase fail due to Process Exit Code: 239 (because of NoClassDefFoundError akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1)) > JsonBatchFileSystemITCase fail due to Process Exit Code: 239 > (NoClassDefFoundError: > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1) > - > > Key: FLINK-32192 > URL: https://issues.apache.org/jira/browse/FLINK-32192 > Project: Flink > Issue Type: Sub-task > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Tests >Affects Versions: 1.18.0 >Reporter: Sergey Nuyanzin >Priority: Critical > Labels: test-stability > > [This > build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49288&view=logs&j=7e3d33c3-a462-5ea8-98b8-27e1aafe4ceb&t=ef77f8d1-44c8-5ee2-f175-1c88f61de8c0] > failed with a 239 exit code in test_cron_hadoop313 connect_1 with > JsonBatchFileSystemITCase crashed > {noformat} > May 24 01:02:14 01:02:14.069 [ERROR] Crashed tests: > May 24 01:02:14 01:02:14.069 [ERROR] > org.apache.flink.formats.json.JsonBatchFileSystemITCase > May 24 01:02:14 01:02:14.069 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748) > May 24 01:02:14 01:02:14.069 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121) > May 24 01:02:14 01:02:14.069 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:465) > May 24 01:02:14 01:02:14.069 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:442) > May 24 01:02:14 01:02:14.069 [ERROR] at > 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > May 24 01:02:14 01:02:14.069 [ERROR] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > May 24 01:02:14 01:02:14.069 [ERROR] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > May 24 01:02:14 01:02:14.069 [ERROR] at java.lang.Thread.run(Thread.java:748) > {noformat} > also in logs > {noformat} > 01:02:00,081 [flink-akka.actor.default-dispatcher-9] ERROR > org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: > Thread 'flink-akka.actor.default-dispatcher-9' produced an uncaught > exception. Stopping the process... > java.lang.NoClassDefFoundError: > akka/actor/dungeon/FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1 > at > akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException(FaultHandling.scala:336) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at > akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException$(FaultHandling.scala:336) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at > akka.actor.ActorCell.handleNonFatalOrInterruptedException(ActorCell.scala:410) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:523) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:535) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:295) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at akka.dispatch.Mailbox.run(Mailbox.scala:230) > ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at akka.dispatch.Mailbox.exec(Mailbox.scala:243) > [flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT] > at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > [?:1.8.0_292] > at > java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) > [?:1.8.0_292] > at > java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) > [?:1.8.0_292] > at > java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) > [?:1.8.0_292] > Caused by: java.lang.ClassNotFoundException: > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1 > at java.net.URLClassLoader.findClass(URLClassLoader.
[jira] [Updated] (FLINK-32150) ThreadDumpInfoTest crashed with exit code 239 on AZP (NoClassDefFoundError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder)
[ https://issues.apache.org/jira/browse/FLINK-32150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Nuyanzin updated FLINK-32150: Summary: ThreadDumpInfoTest crashed with exit code 239 on AZP (NoClassDefFoundError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder) (was: ThreadDumpInfoTest crashed with exit code 239 on AZP (because of NoClassDefFoundError org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder)) > ThreadDumpInfoTest crashed with exit code 239 on AZP (NoClassDefFoundError: > org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder) > - > > Key: FLINK-32150 > URL: https://issues.apache.org/jira/browse/FLINK-32150 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST, Tests >Affects Versions: 1.18.0 >Reporter: Sergey Nuyanzin >Priority: Critical > Labels: test-stability > Attachments: logs-ci-test_ci_core-1684238025.zip > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49057&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=8572 > {noformat} > May 16 12:03:24 12:03:24.449 [ERROR] > org.apache.flink.runtime.rest.messages.ThreadDumpInfoTest > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314) > May 16 12:03:24 12:03:24.449 [ERROR] at > 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:370) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:351) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:171) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:163) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:294) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192) > May 16 12:03:24 12:03:24.449 [ERROR] at > org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105) > May 16 12:03:24 12:03:24.449 [ERROR] at > 
org.apache.maven.cli.MavenCli.execute(MavenCli.java:960) > {noformat} > also in logs > {noformat} > 12:01:08,340 [flink-akka.remote.default-remote-dispatcher-5] ERROR > org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: > Thread 'flink-akka.remote.default-remote-dispatcher-5' produced an uncaught > exception. Stopping the process... > java.lang.NoClassDefFoundError: > org/jboss/netty/handler/codec/frame/LengthFieldBasedFrameDecoder > at > akka.remote.transport.netty.NettyTransport.akka$remote$transport$netty$NettyTransport$$newPipeline(NettyTransport.scala:402) > ~[flink-rpc-akka_40d517e5-dc06-423d-8404-00e8331e0610.jar:1.18-SNAPSHOT] > at > akka.remote.transport.nett
[jira] [Created] (FLINK-32200) OrcFileSystemITCase crashed with exit code 239 (NoClassDefFoundError: scala/concurrent/duration/Deadline)
Matthias Pohl created FLINK-32200: - Summary: OrcFileSystemITCase crashed with exit code 239 (NoClassDefFoundError: scala/concurrent/duration/Deadline) Key: FLINK-32200 URL: https://issues.apache.org/jira/browse/FLINK-32200 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.18.0 Reporter: Matthias Pohl https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49325&view=logs&j=4eda0b4a-bd0d-521a-0916-8285b9be9bb5&t=2ff6d5fa-53a6-53ac-bff7-fa524ea361a9&l=12302 {code} 12:24:14,883 [flink-akka.actor.internal-dispatcher-2] ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.internal-dispatcher-2' produced an uncaught exception. Stopping the process... java.lang.NoClassDefFoundError: scala/concurrent/duration/Deadline at scala.concurrent.duration.Deadline$.apply(Deadline.scala:30) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.duration.Deadline$.now(Deadline.scala:76) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.actor.CoordinatedShutdown.loop$1(CoordinatedShutdown.scala:737) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.actor.CoordinatedShutdown.$anonfun$run$7(CoordinatedShutdown.scala:762) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at 
akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48) [flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_292] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_292] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_292] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_292] 12:24:14,882 [flink-metrics-akka.actor.internal-dispatcher-2] ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-metrics-akka.actor.internal-dispatcher-2' produced an uncaught exception. Stopping the process... 
java.lang.NoClassDefFoundError: scala/concurrent/duration/Deadline at scala.concurrent.duration.Deadline$.apply(Deadline.scala:30) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.duration.Deadline$.now(Deadline.scala:76) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.actor.CoordinatedShutdown.loop$1(CoordinatedShutdown.scala:737) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.actor.CoordinatedShutdown.$anonfun$run$7(CoordinatedShutdown.scala:762) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT] at akka.dispatch.BatchingExecutor
[GitHub] [flink] 1996fanrui commented on pull request #22560: [FLINK-32023][API / DataStream] Support negative Duration for special usage, such as -1ms for execution.buffer-timeout
1996fanrui commented on PR #22560: URL: https://github.com/apache/flink/pull/22560#issuecomment-1563868797 I will merge this PR next Monday if no other comments. Hi @Myracle , could you help squash all commits? thanks~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] 1996fanrui commented on pull request #22560: [FLINK-32023][API / DataStream] Support negative Duration for special usage, such as -1ms for execution.buffer-timeout
1996fanrui commented on PR #22560: URL: https://github.com/apache/flink/pull/22560#issuecomment-1563867456 I will merge this PR next Monday if no other comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-32191) Support for configuring tcp keepalive related parameters.
[ https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726494#comment-17726494 ] Weihua Hu commented on FLINK-32191: --- There are two transport types, Nio and Epoll. For Epoll, we already have options for keepalive, such as "EpollChannelOption.TCP_KEEPIDLE". For Nio, however, the keepalive options were only introduced in JDK 11, such as "ExtendedSocketOptions.TCP_KEEPIDLE". Flink is still required to be compatible with JDK 8, even though JDK 8 has been deprecated. Hence, we need to inform users that these configurations will not be taken into account if Nio and JDK 8 are used together. Would you mind taking a look at this ticket when you are free? [~Weijie Guo][~wanglijie]
> Support for configuring tcp keepalive related parameters.
> -
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Reporter: dizhou cao
> Priority: Minor
>
> We encountered a case in our production environment where the netty client
> was unable to send data to the server due to an abnormality in the switch
> link. However, the client can only detect the abnormality after RTO timeout
> retransmission failure, which takes about 15 minutes in our production
> environment. This may result in a 15-minute job unavailability. We hope to
> perform failover and reschedule the job more quickly. Flink has already enabled
> keepalive, but the default keepalive idle time is 2 hours. We can adjust the
> timeout of TCP keepalive by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and
> TCP_KEEPCOUNT. These configurations are already supported at the Netty level. -- This message was sent by Atlassian Jira (v8.20.10#820010)
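Editor's note: the JDK 8/11 compatibility problem described in the comment above can be handled by looking up `jdk.net.ExtendedSocketOptions.TCP_KEEPIDLE` reflectively, so the code compiles and runs on JDK 8 and simply skips the option there. The class and field names are real JDK names (JDK 11+); the helper itself is an illustrative sketch, not Flink code.

```java
import java.io.IOException;
import java.net.SocketOption;
import java.nio.channels.SocketChannel;

// Sketch: resolve the extended keepalive option reflectively, because
// jdk.net.ExtendedSocketOptions only exists on JDK 11+. On JDK 8 the lookup
// fails and the caller must skip (and ideally warn about) the setting.
public class Main {
    @SuppressWarnings("unchecked")
    static SocketOption<Integer> tcpKeepIdleOrNull() {
        try {
            Class<?> ext = Class.forName("jdk.net.ExtendedSocketOptions");
            return (SocketOption<Integer>) ext.getField("TCP_KEEPIDLE").get(null);
        } catch (ReflectiveOperationException e) {
            return null; // JDK 8: option unavailable
        }
    }

    // Returns true if the option was actually applied; false means the
    // runtime (or transport) does not support it and it was ignored.
    static boolean applyKeepIdle(SocketChannel channel, int seconds) throws IOException {
        SocketOption<Integer> opt = tcpKeepIdleOrNull();
        if (opt != null && channel.supportedOptions().contains(opt)) {
            channel.setOption(opt, seconds);
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel ch = SocketChannel.open()) {
            System.out.println("TCP_KEEPIDLE applied: " + applyKeepIdle(ch, 60));
        }
    }
}
```

The boolean return value gives a natural hook for the "inform users" requirement: log a warning whenever it is false.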
[GitHub] [flink] XComp commented on pull request #22390: [FLINK-31785][runtime] Moves LeaderElectionService.stop into LeaderElection.close
XComp commented on PR #22390: URL: https://github.com/apache/flink/pull/22390#issuecomment-1563845998 Did a rebase to `master` after FLINK-31776 (PR #22384) was merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (FLINK-31776) Introducing sub-interface LeaderElectionService.LeaderElection
[ https://issues.apache.org/jira/browse/FLINK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-31776. --- Fix Version/s: 1.18.0 Resolution: Fixed master: 3b53f30d396e5a7f22330d6522ea26450e238628 > Introducing sub-interface LeaderElectionService.LeaderElection > --- > > Key: FLINK-31776 > URL: https://issues.apache.org/jira/browse/FLINK-31776 > Project: Flink > Issue Type: Sub-task >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > Fix For: 1.18.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism
[ https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726487#comment-17726487 ] Lijie Wang commented on FLINK-32199: Thanks [~JunRuiLi] , assigned to you :D > MetricStore does not remove metrics of nonexistent parallelism in > TaskMetricStore when scale down job parallelism > - > > Key: FLINK-32199 > URL: https://issues.apache.org/jira/browse/FLINK-32199 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.17.0 >Reporter: Junrui Li >Priority: Major > Fix For: 1.18.0, 1.16.3, 1.17.2 > > > After FLINK-29615, FLINK will update the subtask metrics store when scaling > down parallelism. However, task metrics are added in the form of > "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name". > Users will be able to find many redundant metrics through > JobVertexMetricsHandler, which will be very troublesome for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism
[ https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijie Wang updated FLINK-32199: --- Fix Version/s: 1.18.0 > MetricStore does not remove metrics of nonexistent parallelism in > TaskMetricStore when scale down job parallelism > - > > Key: FLINK-32199 > URL: https://issues.apache.org/jira/browse/FLINK-32199 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.17.0 >Reporter: Junrui Li >Priority: Major > Fix For: 1.18.0, 1.16.3, 1.17.2 > > > After FLINK-29615, FLINK will update the subtask metrics store when scaling > down parallelism. However, task metrics are added in the form of > "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name". > Users will be able to find many redundant metrics through > JobVertexMetricsHandler, which will be very troublesome for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism
[ https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijie Wang reassigned FLINK-32199: -- Assignee: Junrui Li > MetricStore does not remove metrics of nonexistent parallelism in > TaskMetricStore when scale down job parallelism > - > > Key: FLINK-32199 > URL: https://issues.apache.org/jira/browse/FLINK-32199 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.17.0 >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > Fix For: 1.18.0, 1.16.3, 1.17.2 > > > After FLINK-29615, FLINK will update the subtask metrics store when scaling > down parallelism. However, task metrics are added in the form of > "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name". > Users will be able to find many redundant metrics through > JobVertexMetricsHandler, which will be very troublesome for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism
[ https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726482#comment-17726482 ] Junrui Li commented on FLINK-32199: --- I've prepared a quick fix for it. Can you assign this ticket to me? [~wanglijie] :) > MetricStore does not remove metrics of nonexistent parallelism in > TaskMetricStore when scale down job parallelism > - > > Key: FLINK-32199 > URL: https://issues.apache.org/jira/browse/FLINK-32199 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.17.0 >Reporter: Junrui Li >Priority: Major > Fix For: 1.16.3, 1.17.2 > > > After FLINK-29615, FLINK will update the subtask metrics store when scaling > down parallelism. However, task metrics are added in the form of > "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name". > Users will be able to find many redundant metrics through > JobVertexMetricsHandler, which will be very troublesome for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism
Junrui Li created FLINK-32199: - Summary: MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism Key: FLINK-32199 URL: https://issues.apache.org/jira/browse/FLINK-32199 Project: Flink Issue Type: Bug Components: Runtime / Metrics Affects Versions: 1.17.0 Reporter: Junrui Li Fix For: 1.16.3, 1.17.2 After FLINK-29615, FLINK will update the subtask metrics store when scaling down parallelism. However, task metrics are added in the form of "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name". Users will be able to find many redundant metrics through JobVertexMetricsHandler, which will be very troublesome for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
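The redundancy described in FLINK-32199 comes from metric keys being prefixed with the subtask index, so after scaling down, keys for subtask indices at or above the new parallelism linger in the TaskMetricStore. A simplified, hypothetical sketch of the pruning such a fix needs (the real TaskMetricStore holds richer structures than a flat map; "subtaskIndex.metricName" keys here mirror the format quoted above):

```java
import java.util.Map;

class TaskMetricStorePruner {
    // Drops every metric whose "<subtaskIndex>." prefix refers to a subtask
    // that no longer exists under the new, lower parallelism.
    static void pruneAboveParallelism(Map<String, Integer> metrics, int newParallelism) {
        metrics.keySet().removeIf(key -> {
            int dot = key.indexOf('.');
            if (dot <= 0) {
                return false; // not a subtask-indexed key, keep it
            }
            try {
                return Integer.parseInt(key.substring(0, dot)) >= newParallelism;
            } catch (NumberFormatException e) {
                return false; // prefix is not an index, keep it
            }
        });
    }
}
```

After scaling a vertex from parallelism 4 down to 2, for example, keys starting with "2." and "3." would be removed while "0." and "1." survive.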
[jira] [Commented] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726479#comment-17726479 ] Sharon Xie commented on FLINK-32196: [~tzulitai] Thanks for analyzing the logs. As additional context, this happened after a security patch on the broker side. Though most of the jobs auto-recovered, we've found a couple that got stuck in the recovery step. So there is a chance that this is caused by an issue on the broker side, e.g. some broker-side transaction state was lost or a partition was left in a bad state. Any possible explanation here? A couple of other questions. > These are the ones to abort in step 2. Initializing the transaction > automatically aborts the transaction, as I mentioned in earlier comments. So > I believe this is also expected. In this case, does the transaction state [messages|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] in the kafka broker look right to you? It seems there is no change in those messages except the epoch and txnLastUpdateTimestamp. I guess the idea is to call transaction init with the old txnId and just let it time out. But is there some heartbeat that updates the transaction? Also, can you please explain a bit about why abortTransaction is not used? > What is NOT expected, though, is the bunch of kafka-producer-network-thread > threads being spawned per TID to abort in step 2. Is it common to have so many lingering transactions that need to be aborted? The job is not a high-throughput one: about 3 records/sec with a checkpointing interval of 10s. It takes ~30 min to run OOM, and it seems odd that the kafka sink would need so long to recover. 
> kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. > In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
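The hypothesised leak can be illustrated with a stripped-down model of the abort loop: each probed transactional id gets its own producer (and hence its own network thread), `initTransactions()` performs the abort implicitly rather than via `abortTransaction()`, and the loop only stops once an id yields epoch 0. The `TxnProducer` interface below is a hypothetical stand-in for the Kafka producer, not the real client API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the two producer calls the abort loop relies on.
interface TxnProducer {
    void initTransactions(); // per Kafka semantics, fences and aborts any open txn for this id
    short getEpoch();        // epoch 0 means no prior transaction existed for this id
}

class TransactionAborterSketch {
    // Models the probing loop: create a producer per candidate TID, let
    // initTransactions() abort implicitly, stop at the first epoch-0 TID.
    // One producer (and one network thread) is spawned per iteration, which
    // is the resource growth observed in the ticket.
    static List<String> abortLingering(String prefix, Function<String, TxnProducer> producerFactory) {
        List<String> visited = new ArrayList<>();
        for (int offset = 0; ; offset++) {
            String tid = prefix + "-" + offset;
            TxnProducer p = producerFactory.apply(tid);
            p.initTransactions(); // the abort happens here; abortTransaction() is never called
            visited.add(tid);
            if (p.getEpoch() == 0) {
                break; // no lingering transaction for this id, stop probing
            }
        }
        return visited;
    }
}
```

If the broker keeps reporting non-zero epochs (e.g. because of bad broker-side transaction state), the loop never terminates and producers pile up, matching the OOM symptom described above.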
[GitHub] [flink] flinkbot commented on pull request #22663: [FLINK-32198][runtime] Enforce single maxExceptions query parameter
flinkbot commented on PR #22663: URL: https://github.com/apache/flink/pull/22663#issuecomment-1563798860 ## CI report: * 3d10cec1ba099543e8ae4ceef670caae9f145553 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-32198) Enforce single maxExceptions query parameter
[ https://issues.apache.org/jira/browse/FLINK-32198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32198: --- Labels: pull-request-available (was: ) > Enforce single maxExceptions query parameter > > > Key: FLINK-32198 > URL: https://issues.apache.org/jira/browse/FLINK-32198 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Reporter: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > > While working on FLINK-31894 I realized that `UpperLimitExceptionParameter` > allows multiple values to be collected as a comma separated list even though > JobExceptionsHandler is only using the first one > [https://github.com/apache/flink/blob/1293958652053c0d163fde28e8dfefb5ee8f6101/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L101-L104] > A better approach would be to deny multiple `maxExceptions` params and let > the users know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] pgaref opened a new pull request, #22663: [FLINK-32198][runtime] Enforce single maxExceptions query parameter
pgaref opened a new pull request, #22663: URL: https://github.com/apache/flink/pull/22663 https://issues.apache.org/jira/browse/FLINK-32198 * throw ConversionException when multiple values are passed for maxExceptions parameter * added test to validate functionality -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-32198) Enforce single maxExceptions query parameter
Panagiotis Garefalakis created FLINK-32198: -- Summary: Enforce single maxExceptions query parameter Key: FLINK-32198 URL: https://issues.apache.org/jira/browse/FLINK-32198 Project: Flink Issue Type: Bug Components: Runtime / REST Reporter: Panagiotis Garefalakis While working on FLINK-31894 I realized that `UpperLimitExceptionParameter` allows multiple values to be collected as a comma separated list even though JobExceptionsHandler is only using the first one [https://github.com/apache/flink/blob/1293958652053c0d163fde28e8dfefb5ee8f6101/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L101-L104] A better approach would be to deny multiple `maxExceptions` params and let the users know. -- This message was sent by Atlassian Jira (v8.20.10#820010)
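A minimal sketch of the proposed validation: reject the request when more than one value arrives for the parameter instead of silently using only the first. The helper name below is hypothetical; in Flink the check would live in the REST query parameter conversion path:

```java
import java.util.List;

class SingleValueQueryParameter {
    // Accepts exactly one value for a query parameter such as maxExceptions;
    // anything else is surfaced to the caller instead of being dropped.
    static int parseSingle(String name, List<String> values) {
        if (values.size() != 1) {
            throw new IllegalArgumentException(
                    "Expected exactly one value for query parameter '" + name
                            + "', but got " + values.size());
        }
        return Integer.parseInt(values.get(0));
    }
}
```

Failing loudly here matches the ticket's intent: users who pass `?maxExceptions=10,20` learn immediately that only a single value is allowed.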
[GitHub] [flink-connector-jdbc] ewangsdc commented on pull request #3: [FLINK-15462][connectors] Add Trino dialect
ewangsdc commented on PR #3: URL: https://github.com/apache/flink-connector-jdbc/pull/3#issuecomment-1563767213 Since lots of companies use Kerberos authentication to connect to Trino, we’d appreciate it very much if you could please ensure that the Flink JDBC connector’s Trino dialect supports Kerberos authentication, like the Trino Python client (https://github.com/trinodb/trino-python-client) and the Spark/PySpark JDBC Trino connection options (https://trino.io/docs/current/client/jdbc.html). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version
[ https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320 ] Kurt Ostfeld edited comment on FLINK-3154 at 5/26/23 3:18 AM: -- [~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and preserves backward compatibility with existing savepoints and checkpoints: [https://github.com/apache/flink/pull/22660] This keeps the Kryo v2 project dependency for backwards compatibility only and otherwise uses Kryo v5.x. EDIT: I will fix the CI errors. was (Author: JIRAUSER38): [~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and preserves backward compatibility with existing savepoints and checkpoints: [https://github.com/apache/flink/pull/22660] This keeps the Kryo v2 project dependency for backwards compatibility only and otherwise uses Kryo v5.x. > Update Kryo version from 2.24.0 to latest Kryo LTS version > -- > > Key: FLINK-3154 > URL: https://issues.apache.org/jira/browse/FLINK-3154 > Project: Flink > Issue Type: Improvement > Components: API / Type Serialization System >Affects Versions: 1.0.0 >Reporter: Maximilian Michels >Priority: Not a Priority > Labels: pull-request-available > > Flink's Kryo version is outdated and could be updated to a newer version, > e.g. kryo-3.0.3. > From ML: we cannot bumping the Kryo version easily - the serialization format > changed (that's why they have a new major version), which would render all > Flink savepoints and checkpoints incompatible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32190) Bad link in Flink page
[ https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726456#comment-17726456 ] Lijie Wang commented on FLINK-32190: Thanks [~JunRuiLi], assigned to you :) > Bad link in Flink page > -- > > Key: FLINK-32190 > URL: https://issues.apache.org/jira/browse/FLINK-32190 > Project: Flink > Issue Type: Bug > Components: Documentation >Reporter: Claude Warren >Assignee: Junrui Li >Priority: Minor > > on the page: [https://flink.apache.org/use-cases/] > in the section: What are typical data analytics applications? > The first link: [Quality monitoring of Telco > networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/] > returns a 404 error: > h1. This site can’t be reached > Check if there is a typo in 2016.flink-forward.org. > DNS_PROBE_FINISHED_NXDOMAIN -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-32190) Bad link in Flink page
[ https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijie Wang reassigned FLINK-32190: -- Assignee: Junrui Li > Bad link in Flink page > -- > > Key: FLINK-32190 > URL: https://issues.apache.org/jira/browse/FLINK-32190 > Project: Flink > Issue Type: Bug > Components: Documentation >Reporter: Claude Warren >Assignee: Junrui Li >Priority: Minor > > on the page: [https://flink.apache.org/use-cases/] > in the section: What are typical data analytics applications? > The first link: [Quality monitoring of Telco > networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/] > returns a 404 error: > h1. This site can’t be reached > Check if there is a typo in 2016.flink-forward.org. > DNS_PROBE_FINISHED_NXDOMAIN -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] pgaref commented on pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers
pgaref commented on PR #22516: URL: https://github.com/apache/flink/pull/22516#issuecomment-1563756212 > LGTM overall, minus a few cosmetic comments. I think the commit message is also misleading > > ``` > Fail fast on bad FailureEnricher JM startup as given by the user > ``` > > since we still have `filterInvalidEnrichers` in the codepath Awesome! Thanks @dmvk ! Updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-32190) Bad link in Flink page
[ https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726454#comment-17726454 ] Junrui Li commented on FLINK-32190: --- It seems that the 404 error was caused by an expired external link. Maybe we can replace it with another link or remove this use case. [~wanglijie] What do you think? I'd like to take this ticket. :) > Bad link in Flink page > -- > > Key: FLINK-32190 > URL: https://issues.apache.org/jira/browse/FLINK-32190 > Project: Flink > Issue Type: Bug > Components: Documentation >Reporter: Claude Warren >Priority: Minor > > on the page: [https://flink.apache.org/use-cases/] > in the section: What are typical data analytics applications? > The first link: [Quality monitoring of Telco > networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/] > returns a 404 error: > h1. This site can’t be reached > Check if there is a typo in 2016.flink-forward.org. > DNS_PROBE_FINISHED_NXDOMAIN -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453 ] Yuxin Tan edited comment on FLINK-32194 at 5/26/23 3:02 AM: [~Sergey Nuyanzin] Could you help review this change? It is also like FLINK-32187. was (Author: tanyuxin): [~snuyanzin] Could you help review this change? It is also like [FLINK-32187|https://issues.apache.org/jira/browse/FLINK-32187]. > Elasticsearch connector should remove the dependency on flink-shaded > > > Key: FLINK-32194 > URL: https://issues.apache.org/jira/browse/FLINK-32194 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / ElasticSearch >Affects Versions: elasticsearch-4.0.0 >Reporter: Yuxin Tan >Assignee: Yuxin Tan >Priority: Major > Labels: pull-request-available > > The Elasticsearch connector depends on flink-shaded. With the externalization > of the connector, the connectors shouldn't rely on Flink-Shaded -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453 ] Yuxin Tan edited comment on FLINK-32194 at 5/26/23 2:58 AM: [~snuyanzin] Could you help review this change? It is also like [FLINK-32187|https://issues.apache.org/jira/browse/FLINK-32187]. was (Author: tanyuxin): [~snuyanzin] Could you help review this change? It is also like [#https://issues.apache.org/jira/browse/FLINK-32187]. > Elasticsearch connector should remove the dependency on flink-shaded > > > Key: FLINK-32194 > URL: https://issues.apache.org/jira/browse/FLINK-32194 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / ElasticSearch >Affects Versions: elasticsearch-4.0.0 >Reporter: Yuxin Tan >Assignee: Yuxin Tan >Priority: Major > Labels: pull-request-available > > The Elasticsearch connector depends on flink-shaded. With the externalization > of the connector, the connectors shouldn't rely on Flink-Shaded -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453 ] Yuxin Tan commented on FLINK-32194: --- [~snuyanzin] Could you help review this change? It is also like [#https://issues.apache.org/jira/browse/FLINK-32187]. > Elasticsearch connector should remove the dependency on flink-shaded > > > Key: FLINK-32194 > URL: https://issues.apache.org/jira/browse/FLINK-32194 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / ElasticSearch >Affects Versions: elasticsearch-4.0.0 >Reporter: Yuxin Tan >Assignee: Yuxin Tan >Priority: Major > Labels: pull-request-available > > The Elasticsearch connector depends on flink-shaded. With the externalization > of the connector, the connectors shouldn't rely on Flink-Shaded -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] pgaref commented on a diff in pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers
pgaref commented on code in PR #22516: URL: https://github.com/apache/flink/pull/22516#discussion_r1206189321 ## flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java: ## @@ -57,6 +59,7 @@ public class DefaultJobMasterServiceFactory implements JobMasterServiceFactory { private final FatalErrorHandler fatalErrorHandler; private final ClassLoader userCodeClassloader; private final ShuffleMaster shuffleMaster; +private final Collection failureEnrichers; Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] pgaref commented on a diff in pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers
pgaref commented on code in PR #22516: URL: https://github.com/apache/flink/pull/22516#discussion_r1206188285 ## flink-runtime/src/main/java/org/apache/flink/runtime/minicluster/MiniCluster.java: ## @@ -197,6 +198,9 @@ public class MiniCluster implements AutoCloseableAsync { @GuardedBy("lock") private HeartbeatServices heartbeatServices; +@GuardedBy("lock") +private Collection failureEnrichers; Review Comment: We might need it for e2e tests, but removing it for the time being -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype
[ https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726448#comment-17726448 ] Benchao Li commented on FLINK-32008: [~rskraba] Thanks for the digging. I agree that currently FileSystem's fallback (de)serializer does not fit for all formats, this should be one of them. It might be worth to add support to implement a {{BulkReaderFormatFactory}} for protobuf format. > Protobuf format throws exception with Map datatype > -- > > Key: FLINK-32008 > URL: https://issues.apache.org/jira/browse/FLINK-32008 > Project: Flink > Issue Type: Bug > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.17.0 >Reporter: Xuannan Su >Priority: Major > Attachments: flink-protobuf-example.zip > > > The protobuf format throws exception when working with Map data type. I > uploaded a example project to reproduce the problem. > > {code:java} > Caused by: java.lang.RuntimeException: One or more fetchers have encountered > exception > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:261) > at > org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:169) > at > org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:131) > at > org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417) > at > org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839) > 
at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788) > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received > unexpected exception while polling the records > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:114) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ... 1 more > Caused by: java.io.IOException: Failed to deserialize PB object. 
> at > org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:75) > at > org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:42) > at > org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.readRecord(DeserializationSchemaAdapter.java:197) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.nextRecord(DeserializationSchemaAdapter.java:210) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$Reader.readBatch(DeserializationSchemaAdapter.java:124) > at > org.apache.flink.connector.file.src.util.RecordMapperWrapperRecordIterator$1.readBatch(RecordMapperWrapperRecordIterator.java:82) > at > org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:67) > at > org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:162) > ... 6 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > a
[jira] [Commented] (FLINK-32132) Cast function CODEGEN does not work as expected when nullOnFailure enabled
[ https://issues.apache.org/jira/browse/FLINK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726445#comment-17726445 ] xiaogang zhou commented on FLINK-32132: --- [~luoyuxia] Can you please help review > Cast function CODEGEN does not work as expected when nullOnFailure enabled > -- > > Key: FLINK-32132 > URL: https://issues.apache.org/jira/browse/FLINK-32132 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Major > Labels: pull-request-available > > I am trying to generate code cast string to bigint, and got generated code > like: > > > {code:java} > // code placeholder > if (!isNull$14) { > result$15 = > org.apache.flink.table.data.binary.BinaryStringDataUtil.toLong(field$13.trim()); > } else { > result$15 = -1L; > } >castRuleResult$16 = result$15; >castRuleResultIsNull$17 = isNull$14; > } catch (java.lang.Throwable e) { >castRuleResult$16 = -1L; >castRuleResultIsNull$17 = true; > } > // --- End cast section > out.setLong(0, castRuleResult$16); {code} > such kind of handle does not provide a perfect solution as the default value > of long is set to -1L, which can be meaningful in some case. And can cause > some calculation error. > > And I understand the cast returns a bigint not null, But since there is a > exception, we should ignore the type restriction, so I suggest to modify the > CodeGenUtils.rowSetField like below: > > {code:java} > // code placeholder > if (fieldType.isNullable || > fieldExpr.nullTerm.startsWith("castRuleResultIsNull")) { > s""" > |${fieldExpr.code} > |if (${fieldExpr.nullTerm}) { > | $setNullField; > |} else { > | $writeField; > |} > """.stripMargin > } else { > s""" > |${fieldExpr.code} > |$writeField; >""".stripMargin > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
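The core objection in FLINK-32132 is that -1L is a legal bigint value, so using it as a failure sentinel makes a failed cast indistinguishable from real data. A small sketch contrasting that with a nullable result, which keeps the failure observable (a hypothetical helper for illustration, not the generated code):

```java
class TryCastSketch {
    // Returns null on parse failure instead of a -1L sentinel, so downstream
    // code can tell "cast failed" apart from a genuine -1 in the data.
    static Long toLongOrNull(String s) {
        try {
            return Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With the sentinel approach, `toLongOrNull("abc")` and `toLongOrNull("-1")` would both yield -1L and silently corrupt any aggregate computed over them; propagating the null term, as the proposed `rowSetField` change does, avoids that.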
[jira] [Assigned] (FLINK-31956) Extend the CompiledPlan to read from/write to Flink's FileSystem
[ https://issues.apache.org/jira/browse/FLINK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lincoln lee reassigned FLINK-31956: --- Assignee: Shuai Xu > Extend the CompiledPlan to read from/write to Flink's FileSystem > > > Key: FLINK-31956 > URL: https://issues.apache.org/jira/browse/FLINK-31956 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Client, Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: Jane Chan >Assignee: Shuai Xu >Priority: Major > Labels: pull-request-available > Fix For: 1.18.0 > > > At present, COMPILE/EXECUTE PLAN FOR '${plan.json}' only supports writing > to/reading from a local file without the scheme. We propose to extend the > support for Flink's FileSystem. > {code:sql} > -- before > COMPILE PLAN FOR '/tmp/foo/bar.json' > EXECUTE PLAN FOR '/tmp/foo/bar.json' > -- after > COMPILE PLAN FOR 'file:///tmp/foo/bar.json' > COMPILE PLAN FOR 'hdfs:///tmp/foo/bar.json' > COMPILE PLAN FOR 's3:///tmp/foo/bar.json' > COMPILE PLAN FOR 'oss:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'file:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'hdfs:///tmp/foo/bar.json' > EXECUTE PLAN FOR 's3:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'oss:///tmp/foo/bar.json' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31956) Extend the CompiledPlan to read from/write to Flink's FileSystem
[ https://issues.apache.org/jira/browse/FLINK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726444#comment-17726444 ] lincoln lee commented on FLINK-31956: - [~xu_shuai_] assigned to you :) > Extend the CompiledPlan to read from/write to Flink's FileSystem > > > Key: FLINK-31956 > URL: https://issues.apache.org/jira/browse/FLINK-31956 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Client, Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: Jane Chan >Assignee: Shuai Xu >Priority: Major > Labels: pull-request-available > Fix For: 1.18.0 > > > At present, COMPILE/EXECUTE PLAN FOR '${plan.json}' only supports writing > to/reading from a local file without the scheme. We propose to extend the > support for Flink's FileSystem. > {code:sql} > -- before > COMPILE PLAN FOR '/tmp/foo/bar.json' > EXECUTE PLAN FOR '/tmp/foo/bar.json' > -- after > COMPILE PLAN FOR 'file:///tmp/foo/bar.json' > COMPILE PLAN FOR 'hdfs:///tmp/foo/bar.json' > COMPILE PLAN FOR 's3:///tmp/foo/bar.json' > COMPILE PLAN FOR 'oss:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'file:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'hdfs:///tmp/foo/bar.json' > EXECUTE PLAN FOR 's3:///tmp/foo/bar.json' > EXECUTE PLAN FOR 'oss:///tmp/foo/bar.json' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
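As a side note on how such plan paths would be dispatched: the scheme detection implied by the proposal can be sketched with plain java.net.URI (the helper below is hypothetical, not the actual planner code). A path without a scheme keeps the current local-file behavior, while a scheme-qualified path would be routed to the matching Flink FileSystem:

```java
import java.net.URI;

class PlanPathSketch {
    // Hypothetical helper: which filesystem a COMPILE/EXECUTE PLAN path
    // would be routed to. A path without a scheme keeps the pre-change
    // local-file behavior.
    static String schemeOf(String planPath) {
        String scheme = URI.create(planPath).getScheme();
        return scheme == null ? "local" : scheme;
    }
}
```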
[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726426#comment-17726426 ] Tzu-Li (Gordon) Tai edited comment on FLINK-32196 at 5/26/23 12:06 AM: --- [~sharonxr55] a few things to clarify first: # When a KafkaSink subtask restores, there are some transactions that need to be committed (i.e. ones that are written in the Flink checkpoint), and # All other transactions are considered "lingering" which should be aborted (which is done by the loop you referenced). # Only after the above 2 steps complete, the subtask initialization is considered complete. So: > A healthy transaction goes from INITIALIZING to READY- to > COMMITTING_TRANSACTION to READY in the log I believe these transactions are the ones from step 1. Which is expected. > I've found that the transaction thread never progress beyond “Transition from > state INITIALIZING to READY” These are the ones to abort in step 2. Initializing the transaction automatically aborts the transaction, as I mentioned in earlier comments. So I believe this is also expected. What is NOT expected, though, is the bunch of {{kafka-producer-network-thread}} threads being spawned per TID to abort in step 2. Thanks for sharing the logs btw, it was helpful figuring out what was going on! Kafka's producer only spawns a single {{kafka-producer-network-thread}} per instance. And the abort loop for lingering transactions always tries to reuse the same producer instance without creating new ones, so I would expect to only see a single {{kafka-producer-network-thread}} throughout the whole loop. This doesn't seem to be the case. 
From the naming of these threads, it seems like for every TID that the KafkaSink is trying to abort, a new {{kafka-producer-network-thread}} thread is spawned: This is hinted by the naming of the threads (see the last portion of the thread name, where it's strictly incrementing; that's the TIDs of transactions the KafkaSink is trying to abort) {code:java} “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708579" “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708580” “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708581" “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708582” {code} The only way I see this happen is if the loop is creating new producer instances per attempted TID, but it doesn't make sense given the code. It could be something wrong with how the KafkaSink is using Java reflections to reset the TID on the reused producer, but I'll need to spend some time to look into this a bit deeper.
[jira] [Commented] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726426#comment-17726426 ] Tzu-Li (Gordon) Tai commented on FLINK-32196: - [~sharonxr55] a few things to clarify first: # When a KafkaSink subtask restores, there are some transactions that need to be committed (i.e. ones that are written in the Flink checkpoint), and # All other transactions are considered "lingering" which should be aborted (which is done by the loop you referenced). # Only after the above 2 steps complete, the subtask initialization is considered complete. So: > A healthy transaction goes from INITIALIZING to READY- to > COMMITTING_TRANSACTION to READY in the log I believe these transactions are the ones from step 1. Which is expected. > I've found that the transaction thread never progress beyond “Transition from > state INITIALIZING to READY” These are the ones to abort in step 2. Initializing the transaction automatically aborts the transaction, as I mentioned in earlier comments. So I believe this is also expected. What is NOT expected, though, is the bunch of {{kafka-producer-network-thread}} threads being spawned per TID to abort in step 2. Thanks for sharing the logs btw, it was helpful figuring out what was going on! Kafka's producer only spawns a single {{kafka-producer-network-thread}} per instance. And the abort loop for lingering transactions always tries to reuse the same producer instance without creating new ones, so I would expect to only see a single {{kafka-producer-network-thread}} throughout the whole loop. This doesn't seem to be the case. 
From the naming of these threads, it seems like for every TID that the KafkaSink is trying to abort, a new {{kafka-producer-network-thread}} thread is spawned: This is hinted by the naming of the threads (see the last portion of the thread name, where it's strictly incrementing; that's the TIDs of transactions the KafkaSink is trying to abort) {code} “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708579" “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708580” “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708581" “kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708582” {code} The only way I see this happen is if the loop is creating new producer instances per attempted TID, but it doesn't make sense given the code. It could be something funny with how the KafkaSink is using Java reflections to reset the TID on the reused producer, but I'll need to spend some time to look into this a bit deeper. > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. 
> In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
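For context, the restore-time abort loop discussed in this thread can be modeled in isolation as follows. This is a simplified, hypothetical sketch (the interface and names below are illustrative; the real logic lives in Flink's TransactionAborter and its reflection-based producer handling): a single producer instance is meant to be reused across transactional IDs, initTransactions() on a TID implicitly fences and aborts any lingering transaction for that TID, and the loop stops at the first TID whose epoch is 0.

```java
// Hypothetical model of the lingering-transaction abort loop; the interface
// and names are illustrative, not the actual Flink/Kafka APIs.
interface TxProducer {
    void setTransactionalId(String tid); // Flink resets this via reflection on the reused producer
    void initTransactions();             // implicitly fences/aborts a lingering txn for the TID
    short getEpoch();                    // 0 means the broker never saw this TID
}

class AbortLoopSketch {
    // Probes TIDs prefix-0, prefix-1, ... until one has epoch 0, reusing the
    // single producer instance passed in. Returns how many TIDs had a
    // lingering transaction (nonzero epoch) that got fenced/aborted.
    static int abortLingering(String prefix, TxProducer producer) {
        int aborted = 0;
        for (int id = 0; ; id++) {
            producer.setTransactionalId(prefix + "-" + id);
            producer.initTransactions();
            if (producer.getEpoch() == 0) {
                break; // no lingering txn for this TID; assume none beyond it
            }
            aborted++;
        }
        return aborted;
    }
}
```

Under this model only one producer (and thus one kafka-producer-network-thread) should exist for the whole loop; if each iteration accidentally created a fresh producer instance instead, a new network thread would appear per TID, which would match the strictly incrementing thread names reported in the logs above.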
[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy
tweise commented on code in PR #607: URL: https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1206092693 ## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java: ## @@ -82,7 +82,7 @@ public static void computeDataRateMetrics( scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, trueProcessingRate); scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond); } else { -LOG.error("Cannot compute true processing rate without numRecordsInPerSecond"); +LOG.warn("Cannot compute true processing rate without numRecordsInPerSecond"); Review Comment: Not actionable is also why I'm not sure about logging at warn level. Who is going to read these logs? Will logs be flooded with repeating entries that are unlikely to be addressed? I thought a message in the status (or one time event) might be more useful, after all the application would need to be modified to remedy the condition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-32197) FLIP 246: Multi Cluster Kafka Source
Mason Chen created FLINK-32197: -- Summary: FLIP 246: Multi Cluster Kafka Source Key: FLINK-32197 URL: https://issues.apache.org/jira/browse/FLINK-32197 Project: Flink Issue Type: New Feature Components: Connectors / Kafka Reporter: Mason Chen This introduces a new connector that extends the current KafkaSource to read from multiple Kafka clusters, which can change dynamically. For more details, please refer to [FLIP 246|https://cwiki.apache.org/confluence/display/FLINK/FLIP-246%3A+Multi+Cluster+Kafka+Source]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405 ] Sharon Xie edited comment on FLINK-32196 at 5/25/23 10:04 PM: -- Thank you [~tzulitai] for the quick response and information. {quote}Are you actually observing that there are lingering transactions not being aborted in Kafka? Or was that a speculation based on not seeing a abortTransaction() in the code? {quote} This is a speculation. So this may not be the root cause of the issue I'm seeing. {quote}If there are actually lingering transactions in Kafka after restore, do they get timeout by Kafka after transaction.timeout.ms? Or are they lingering beyond the timeout threshold? {quote} What I've observed is that the subtask gets stuck in the initializing state and there is a growing number of kafka-producer-network-thread and the job eventually runs OOM - In the [^kafka_sink_oom_logs.csv], you can see lots of producers get closed in the end . In the debug log, I've found that the transaction thread never progress beyond “Transition from state INITIALIZING to READY” and eventually times out. An example thread log is [^kafka_producer_network_thread_log.csv] . A healthy transaction goes from INITIALIZING to READY- to COMMITTING_TRANSACTION to READY in the log and the thread doesn't exit - example [^healthy_kafka_producer_thread.csv]. I've also queried the kafka's _transaction_state topic for the problematic transaction and [here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the messages in the topic. I'd appreciate any pointers or potential ways to explain the situation. > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. 
> In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This messa
[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405 ] Sharon Xie edited comment on FLINK-32196 at 5/25/23 10:04 PM: -- Thank you [~tzulitai] for the quick response and information. {quote}Are you actually observing that there are lingering transactions not being aborted in Kafka? Or was that a speculation based on not seeing a abortTransaction() in the code? {quote} This is a speculation. So this may not be the root cause of the issue I'm seeing. {quote}If there are actually lingering transactions in Kafka after restore, do they get timeout by Kafka after transaction.timeout.ms? Or are they lingering beyond the timeout threshold? {quote} What I've observed is that the subtask gets stuck in the initializing state and there is a growing number of kafka-producer-network-thread and the job eventually runs OOM - [^kafka_sink_oom_logs.csv] . In the debug log, I've found that the transaction thread never progress beyond “Transition from state INITIALIZING to READY” and eventually times out. An example thread log is [^kafka_producer_network_thread_log.csv] . A healthy transaction goes from INITIALIZING to READY- to COMMITTING_TRANSACTION to READY in the log and the thread doesn't exit - example [^healthy_kafka_producer_thread.csv]. I've also queried the kafka's _transaction_state topic for the problematic transaction and [here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the messages in the topic. I'd appreciate any pointers or potential ways to explain the situation. > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. 
> In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Attachment: kafka_sink_oom_logs.csv > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. > In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Summary: kafka sink under EO sometimes is unable to recover from a checkpoint (was: KafkaWriter recovery doesn't abort lingering transactions under the EO semantic) > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best hypothesis for the issue. > In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Description: We are seeing an issue where a Flink job using kafka sink under EO is unable to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and eventually runs OOM. The cause for OOM is that there is a kafka producer thread leak. Here is our best *hypothesis* for the issue. In `KafkaWriter` under the EO semantic, it intends to abort lingering transactions upon recovery [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] However, the actual implementation to abort those transactions in the `TransactionAborter` doesn't abort those transactions [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] Specifically `producer.abortTransaction()` is never called in that function. Instead it calls `producer.flush()`. Also The function is in for loop that only breaks when `producer.getEpoch() == 0` which is why we are seeing a producer thread leak as the recovery gets stuck in this for loop. was: We are seeing an issue where a Flink job using kafka sink under EO is unable to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and eventually runs OOM. The cause for OOM is that there is a kafka producer thread leak. Here is our best hypothesis for the issue. 
In `KafkaWriter` under the EO semantic, it intends to abort lingering transactions upon recovery [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] However, the actual implementation to abort those transactions in the `TransactionAborter` doesn't abort those transactions [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] Specifically `producer.abortTransaction()` is never called in that function. Instead it calls `producer.flush()`. Also The function is in for loop that only breaks when `producer.getEpoch() == 0` which is why we are seeing a producer thread leak as the recovery gets stuck in this for loop. > kafka sink under EO sometimes is unable to recover from a checkpoint > > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best *hypothesis* for the issue. 
> In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also The function is in for loop that only breaks when `producer.getEpoch() > == 0` which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405 ] Sharon Xie commented on FLINK-32196: Thank you [~tzulitai] for the quick response and information. {quote}Are you actually observing that there are lingering transactions not being aborted in Kafka? Or was that a speculation based on not seeing a abortTransaction() in the code? {quote} This is a speculation. So this may not be the root cause of the issue I'm seeing. {quote}If there are actually lingering transactions in Kafka after restore, do they get timeout by Kafka after transaction.timeout.ms? Or are they lingering beyond the timeout threshold? {quote} What I've observed is that the subtask gets stuck in the initializing state and there is a growing number of kafka-producer-network-thread and the job eventually runs OOM. In the debug log, I've found that the transaction thread never progress beyond “Transition from state INITIALIZING to READY” and eventually times out. An example thread log is [^kafka_producer_network_thread_log.csv] . A healthy transaction goes from INITIALIZING to READY- to COMMITTING_TRANSACTION to READY in the log and the thread doesn't exit - example [^healthy_kafka_producer_thread.csv]. I've also queried the kafka's _transaction_state topic for the problematic transaction and [here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the messages in the topic. I'd appreciate any pointers or potential ways to explain the situation. 
> KafkaWriter recovery doesn't abort lingering transactions under the EO > semantic > --- > > Key: FLINK-32196 > URL: https://issues.apache.org/jira/browse/FLINK-32196 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.6.4, 1.15.4 >Reporter: Sharon Xie >Priority: Major > Attachments: healthy_kafka_producer_thread.csv, > kafka_producer_network_thread_log.csv > > > We are seeing an issue where a Flink job using kafka sink under EO is unable > to recover from a checkpoint. The sink task is stuck in the `INITIALIZING` state and > eventually runs OOM. The cause for OOM is that there is a kafka producer > thread leak. > Here is our best hypothesis for the issue. > In `KafkaWriter` under the EO semantic, it intends to abort lingering > transactions upon recovery > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] > However, the actual implementation to abort those transactions in the > `TransactionAborter` doesn't abort those transactions > [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] > Specifically `producer.abortTransaction()` is never called in that function. > Instead it calls `producer.flush()`. > Also, the function is in a for loop that only breaks when `producer.getEpoch() > == 0`, which is why we are seeing a producer thread leak as the recovery gets > stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
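The termination condition in the hypothesis above (the loop only breaks when `producer.getEpoch() == 0`) can be sketched with a simplified stand-in. No real Kafka client is involved; `FakeProducer`, its epoch bookkeeping, and `abortLingering` are illustrative assumptions for this sketch, not Flink's actual `TransactionAborter` code:

```java
// Simplified model of the recovery loop described in the issue: probe
// sequential transactional ids and stop only when an unused id (epoch 0)
// is found. FakeProducer is a hypothetical mock, not a Kafka class.
import java.util.Map;

public class Main {

    static class FakeProducer {
        private final Map<String, Short> lingeringEpochs; // transactional.id -> epoch
        private short epoch;

        FakeProducer(Map<String, Short> lingeringEpochs) {
            this.lingeringEpochs = lingeringEpochs;
        }

        // Re-initializing a transactional.id fences and aborts any old
        // transaction for it; an id that was never used reports epoch 0.
        void initTransactions(String transactionalId) {
            epoch = lingeringEpochs.getOrDefault(transactionalId, (short) 0);
        }

        short getEpoch() {
            return epoch;
        }
    }

    // Probe "prefix-0", "prefix-1", ... and break only on epoch == 0,
    // mirroring the for loop the reporter suspects recovery gets stuck in
    // when that condition never holds.
    static int abortLingering(String prefix, FakeProducer producer) {
        int aborted = 0;
        for (int i = 0; ; i++) {
            producer.initTransactions(prefix + "-" + i);
            if (producer.getEpoch() == 0) {
                break;
            }
            aborted++;
        }
        return aborted;
    }

    public static void main(String[] args) {
        FakeProducer producer =
                new FakeProducer(Map.of("tx-0", (short) 1, "tx-1", (short) 2));
        int aborted = abortLingering("tx", producer);
        if (aborted != 2) {
            throw new AssertionError("expected 2, got " + aborted);
        }
        System.out.println("aborted " + aborted + " lingering transactions");
    }
}
```

In this model the loop always terminates because an unused id eventually reports epoch 0; the reported hang would correspond to that assumption not holding in practice.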
[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Attachment: healthy_kafka_producer_thread.csv -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Attachment: kafka_producer_network_thread_log.csv -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726390#comment-17726390 ] Tzu-Li (Gordon) Tai commented on FLINK-32196: - In terms of the lingering transactions you are observing, a few questions: # Are you actually observing that there are lingering transactions not being aborted in Kafka? Or was that a speculation based on not seeing an {{abortTransaction()}} in the code? # If there are actually lingering transactions in Kafka after restore, do they get timed out by Kafka after {{transaction.timeout.ms}}? Or are they lingering beyond the timeout threshold? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726385#comment-17726385 ] Tzu-Li (Gordon) Tai edited comment on FLINK-32196 at 5/25/23 8:56 PM: -- Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts transactions by relying on the fact that when you call `initTransactions()`, Kafka automatically aborts any old ongoing transactions under the same {{transactional.id}}. Could you elaborate on the producer leak? As far as I can tell, the loop is reusing the same producer instance; on every loop entry, the same producer instance is reset with a new {{transactional.id}} and {{initTransactions()}} is called to abort the transaction. So, there doesn't seem to be an issue with runaway producer instances, unless I'm misunderstanding something here. was (Author: tzulitai): Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts transactions by relying on the fact that when you call `initTransactions()`, Kafka automatically aborts any old ongoing transactions under the same {{transactional.id}}. Could you elaborate on the producer leak? As far as I can tell, the loop is reusing the same producer instance; on every loop entry, the same producer instance is reset with a new {{transactional.id}} and {{initTransactions()}} is called to abort the transaction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] pgaref commented on pull request #22587: [FLINK-31892][runtime] Introduce AdaptiveScheduler global failure enrichment/labeling
pgaref commented on PR #22587: URL: https://github.com/apache/flink/pull/22587#issuecomment-1563492807 > LGTM minus one code path in StopWithSavepoint 👍 Thanks for the review @dmvk ! Updated :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726385#comment-17726385 ] Tzu-Li (Gordon) Tai commented on FLINK-32196: - Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts transactions by relying on the fact that when you call `initTransactions()`, Kafka automatically aborts any old ongoing transactions under the same {{transactional.id}}. Could you elaborate on the producer leak? As far as I can tell, the loop is reusing the same producer instance; on every loop entry, the same producer instance is reset with a new {{transactional.id}} and {{initTransactions()}} is called to abort the transaction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
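Gordon's reading of the loop, a single producer instance re-keyed with the next transactional.id on every iteration rather than a fresh producer per id, can be modeled with a small instance counter. `ReusableProducer`, `abortAll`, and the counter are illustrative assumptions, not Flink's actual classes:

```java
// Model of "one producer, many transactional ids": the constructor counter
// lets us verify that the loop allocates exactly one producer instance.
import java.util.ArrayList;
import java.util.List;

public class Main {

    static int producersCreated = 0;

    // Hypothetical stand-in for a Kafka producer.
    static class ReusableProducer {
        String transactionalId;

        ReusableProducer() {
            producersCreated++;
        }

        void initTransactions() {
            // In real Kafka, this fences and aborts any old transaction
            // registered under the current transactional.id.
        }
    }

    // Re-key the SAME producer with each id instead of creating one per id.
    static List<String> abortAll(String prefix, int lingering) {
        List<String> visited = new ArrayList<>();
        ReusableProducer producer = new ReusableProducer(); // created once
        for (int i = 0; i < lingering; i++) {
            producer.transactionalId = prefix + "-" + i;
            producer.initTransactions();
            visited.add(producer.transactionalId);
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> ids = abortAll("tx", 3);
        if (producersCreated != 1) {
            throw new AssertionError("expected one producer, got " + producersCreated);
        }
        if (!ids.equals(List.of("tx-0", "tx-1", "tx-2"))) {
            throw new AssertionError(ids.toString());
        }
        System.out.println("one producer instance reused for " + ids.size() + " ids");
    }
}
```

Under this reading, a producer-thread leak would have to come from somewhere other than the abort loop itself, which is the open question in the thread.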
[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy
gyfora commented on code in PR #607: URL: https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205922751 ## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java: ## @@ -82,7 +82,7 @@ public static void computeDataRateMetrics( scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, trueProcessingRate); scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond); } else { -LOG.error("Cannot compute true processing rate without numRecordsInPerSecond"); +LOG.warn("Cannot compute true processing rate without numRecordsInPerSecond"); Review Comment: We could trigger an event, but putting this into the status may be overkill because this is not really actionable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy
gyfora commented on code in PR #607: URL: https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205921907 ## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobVertexScaler.java: ## @@ -221,7 +221,7 @@ private boolean detectIneffectiveScaleUp( message); if (conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_DETECTION_ENABLED)) { -LOG.info( +LOG.warn( Review Comment: There is already an event triggered for this, that should be enough for most users. The log just outputs some additional info -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy
tweise commented on code in PR #607: URL: https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205905605 ## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java: ## @@ -82,7 +82,7 @@ public static void computeDataRateMetrics( scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, trueProcessingRate); scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond); } else { -LOG.error("Cannot compute true processing rate without numRecordsInPerSecond"); +LOG.warn("Cannot compute true processing rate without numRecordsInPerSecond"); Review Comment: Same as above, the logging is probably not the best way to bring this to the intended audience and warn should probably be mostly reserved for operator internal problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-32148) Reduce info logging noise in the operator
[ https://issues.apache.org/jira/browse/FLINK-32148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32148: --- Labels: pull-request-available (was: ) > Reduce info logging noise in the operator > - > > Key: FLINK-32148 > URL: https://issues.apache.org/jira/browse/FLINK-32148 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.5.0 >Reporter: Gyula Fora >Assignee: Gyula Fora >Priority: Major > Labels: pull-request-available > > The operator controller/reconciler/observer logic currently logs a lot of > information on INFO level even when "nothing" happens. This should be > improved by reducing most of these to DEBUG -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy
tweise commented on code in PR #607: URL: https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205903992 ## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobVertexScaler.java: ## @@ -221,7 +221,7 @@ private boolean detectIneffectiveScaleUp( message); if (conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_DETECTION_ENABLED)) { -LOG.info( +LOG.warn( Review Comment: Why should this be warn? This looks more like something that should be reflected in the status? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under EO semantic
Sharon Xie created FLINK-32196: -- Summary: KafkaWriter recovery doesn't abort lingering transactions under EO semantic Key: FLINK-32196 URL: https://issues.apache.org/jira/browse/FLINK-32196 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.15.4, 1.6.4 Reporter: Sharon Xie We are seeing an issue where a Flink job using kafka sink under EO is unable to recover from a checkpoint. The sink task is stuck in the `INITIALIZING` state and eventually runs OOM. The cause for OOM is that there is a kafka producer thread leak. Here is our best hypothesis for the issue. In `KafkaWriter` under the EO semantic, it intends to abort lingering transactions upon recovery [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179] However, the actual implementation to abort those transactions in the `TransactionAborter` doesn't abort those transactions [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124] Specifically `producer.abortTransaction()` is never called in that function. Instead it calls `producer.flush()`. Also, the function is in a for loop that only breaks when `producer.getEpoch() == 0`, which is why we are seeing a producer thread leak as the recovery gets stuck in this for loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic
[ https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharon Xie updated FLINK-32196: --- Summary: KafkaWriter recovery doesn't abort lingering transactions under the EO semantic (was: KafkaWriter recovery doesn't abort lingering transactions under EO semantic) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32195) Add SQL Gateway custom headers support
Elkhan Dadashov created FLINK-32195: --- Summary: Add SQL Gateway custom headers support Key: FLINK-32195 URL: https://issues.apache.org/jira/browse/FLINK-32195 Project: Flink Issue Type: New Feature Components: Table SQL / Gateway Affects Versions: 1.17.0 Reporter: Elkhan Dadashov For some use cases, it might be necessary to set a few extra HTTP headers on requests to the Flink SQL Gateway, for example, a cookie for auth/session handling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
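The kind of per-request customization this feature request asks for can be sketched with Java's built-in HTTP client. The gateway URL, the `/v1/info` path, and the cookie value below are placeholders for illustration, not a confirmed SQL Gateway API:

```java
// Builds a request to a (hypothetical) SQL Gateway REST endpoint with an
// extra Cookie header attached. No network call is made here; we only
// construct and inspect the request object.
import java.net.URI;
import java.net.http.HttpRequest;

public class Main {

    static HttpRequest buildRequest(String gatewayUrl, String cookie) {
        return HttpRequest.newBuilder(URI.create(gatewayUrl + "/v1/info"))
                .header("Cookie", cookie) // the custom header the issue asks to support
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8083", "session=abc123");
        String got = req.headers().firstValue("Cookie").orElse("");
        if (!got.equals("session=abc123")) {
            throw new AssertionError("missing Cookie header: " + got);
        }
        System.out.println("Cookie header set: " + got);
    }
}
```

The feature itself would presumably let users configure such headers once (e.g. via client configuration) rather than building each request by hand.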
[jira] [Closed] (FLINK-32174) Update Cloudera product and link in doc page
[ https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Márton Balassi closed FLINK-32174. -- Resolution: Fixed a4de894 in main fda49dd in release-1.17 6a31cc4 in release-1.16 > Update Cloudera product and link in doc page > > > Key: FLINK-32174 > URL: https://issues.apache.org/jira/browse/FLINK-32174 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: Ferenc Csaky >Assignee: Ferenc Csaky >Priority: Minor > Labels: pull-request-available > Fix For: 1.18.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32174) Update Cloudera product and link in doc page
[ https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Márton Balassi updated FLINK-32174: --- Fix Version/s: 1.18.0 > Update Cloudera product and link in doc page > > > Key: FLINK-32174 > URL: https://issues.apache.org/jira/browse/FLINK-32174 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: Ferenc Csaky >Assignee: Ferenc Csaky >Priority: Minor > Labels: pull-request-available > Fix For: 1.18.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32174) Update Cloudera product and link in doc page
[ https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32174: --- Labels: pull-request-available (was: ) > Update Cloudera product and link in doc page > > > Key: FLINK-32174 > URL: https://issues.apache.org/jira/browse/FLINK-32174 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: Ferenc Csaky >Assignee: Ferenc Csaky >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] flinkbot commented on pull request #22662: KafkaSource: disable setting client id prefix if client.id is set
flinkbot commented on PR #22662: URL: https://github.com/apache/flink/pull/22662#issuecomment-1563271820 ## CI report: * 10dc163a5e357195fd35757422f45d32d5500234 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #22661: [WIP][FLINK-31783][runtime] Migrates DefaultLeaderElectionService from LeaderElectionDriver to the MultipleComponentLeaderElectionDriver int
flinkbot commented on PR #22661: URL: https://github.com/apache/flink/pull/22661#issuecomment-1563265317 ## CI report: * a87a2c01417b9c75b0c59b84187e759058f1b45a UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] hmedhat opened a new pull request, #22662: disable setting client id prefix if client.id is set
hmedhat opened a new pull request, #22662: URL: https://github.com/apache/flink/pull/22662 ## What is the purpose of the change Setting different client ids for each consumer makes it difficult to set Kafka quotas for that consumer group. This change allows disabling the client id prefix, at the expense of Kafka consumer metrics not being reported correctly. ## Brief change log Allows setting client.id at the Kafka consumer level, which causes client.id.prefix to be ignored and a warning to be printed. ## Verifying this change - Manually verified the change with a simple job and confirmed the client.id being set ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
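The precedence described in the PR can be sketched as a small resolution step. `resolveClientId`, the prefix scheme, and the subtask suffix are illustrative assumptions; only the `client.id` property name comes from Kafka itself:

```java
// Sketch of the proposed behavior: if the user supplies client.id
// explicitly, use it as-is (a real implementation would also log a warning
// that the prefix is ignored); otherwise derive one from the configured
// prefix plus the subtask id.
import java.util.Properties;

public class Main {

    static String resolveClientId(Properties props, String prefix, int subtaskId) {
        String userClientId = props.getProperty("client.id");
        if (userClientId != null) {
            // warn: client.id.prefix is ignored because client.id is set
            return userClientId;
        }
        return prefix + "-" + subtaskId;
    }

    public static void main(String[] args) {
        Properties explicit = new Properties();
        explicit.setProperty("client.id", "quota-group");
        // Explicit client.id wins, so all subtasks share one id for quotas.
        if (!resolveClientId(explicit, "my-source", 3).equals("quota-group")) {
            throw new AssertionError();
        }
        // Without it, the prefix scheme yields a distinct id per subtask.
        if (!resolveClientId(new Properties(), "my-source", 3).equals("my-source-3")) {
            throw new AssertionError();
        }
        System.out.println("client.id precedence ok");
    }
}
```

The trade-off mentioned in the PR follows directly: a shared client.id makes quota configuration easy but means per-subtask consumer metrics can no longer be distinguished.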
[jira] [Updated] (FLINK-31783) Replace LeaderElectionDriver in DefaultLeaderElectionService with MultipleComponentLeaderElectionDriver
[ https://issues.apache.org/jira/browse/FLINK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-31783: --- Labels: pull-request-available (was: ) > Replace LeaderElectionDriver in DefaultLeaderElectionService with > MultipleComponentLeaderElectionDriver > --- > > Key: FLINK-31783 > URL: https://issues.apache.org/jira/browse/FLINK-31783 > Project: Flink > Issue Type: Sub-task >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [flink] HuangZhenQiu commented on a diff in pull request #22491: [FLINK-31947] enable stdout for flink-console.sh
HuangZhenQiu commented on code in PR #22491: URL: https://github.com/apache/flink/pull/22491#discussion_r1181389434 ## flink-dist/src/main/flink-bin/bin/flink-console.sh: ## @@ -116,4 +117,5 @@ echo $$ >> "$pid" 2>/dev/null # Evaluate user options for local variable expansion FLINK_ENV_JAVA_OPTS=$(eval echo ${FLINK_ENV_JAVA_OPTS}) +exec 1>"${out}" Review Comment: The file size of the redirected output is a concern raised by the community, which makes sense to me. The lack of an out file in the web UI was reported by our internal users who use stream.print() to verify execution results. Given that stream.print() is a supported functionality, I feel it is inconsistent if Flink on Kubernetes behaves differently from Flink on YARN. Shall we revisit the problem together with Till and Zentol? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] XComp opened a new pull request, #22661: [FLINK-31783][runtime] Migrates DefaultLeaderElectionService from LeaderElectionDriver to the MultipleComponentLeaderElectionDriver interface
XComp opened a new pull request, #22661: URL: https://github.com/apache/flink/pull/22661 ## What is the purpose of the change `DefaultLeaderElectionService` should rely on `MultipleComponentLeaderElectionDriver` directly to enable support for multiple contenders within the same service later on. ## Brief change log * Resolved FLINK-30338 subtasks * Made `MultipleComponentLeaderElectionDriver.Listener` extend `AutoCloseable` ## Verifying this change * Migrated any existing tests to the `MultipleComponentLeaderElectionDriver` interface. The tests should still succeed. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version
[ https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320 ] Kurt Ostfeld edited comment on FLINK-3154 at 5/25/23 5:09 PM: -- [~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and preserves backward compatibility with existing savepoints and checkpoints: [https://github.com/apache/flink/pull/22660] This keeps the Kryo v2 project dependency for backwards compatibility only and otherwise uses Kryo v5.x. was (Author: JIRAUSER38): [~martijnvisser] this PR upgrades Flink from v2 to v5 and preserves backward compatibility with existing savepoints and checkpoints: [https://github.com/apache/flink/pull/22660] This keeps the Kryo v2 project dependency for backwards compatibility only and otherwise uses Kryo v5.x. > Update Kryo version from 2.24.0 to latest Kryo LTS version > -- > > Key: FLINK-3154 > URL: https://issues.apache.org/jira/browse/FLINK-3154 > Project: Flink > Issue Type: Improvement > Components: API / Type Serialization System >Affects Versions: 1.0.0 >Reporter: Maximilian Michels >Priority: Not a Priority > Labels: pull-request-available > > Flink's Kryo version is outdated and could be updated to a newer version, > e.g. kryo-3.0.3. > From ML: we cannot bump the Kryo version easily - the serialization format > changed (that's why they have a new major version), which would render all > Flink savepoints and checkpoints incompatible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version
[ https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320 ] Kurt Ostfeld commented on FLINK-3154: - [~martijnvisser] this PR upgrades Flink from Kryo v2 to Kryo v5 and preserves backward compatibility with existing savepoints and checkpoints: [https://github.com/apache/flink/pull/22660] This keeps the Kryo v2 project dependency for backwards compatibility only and otherwise uses Kryo v5.x.
[GitHub] [flink] flinkbot commented on pull request #22660: [FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w…
flinkbot commented on PR #22660: URL: https://github.com/apache/flink/pull/22660#issuecomment-1563235211 ## CI report: * d24bb0737d2a7b4b61f51843241ab78a47302d0a UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype
[ https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726314#comment-17726314 ] Ryan Skraba commented on FLINK-32008: - Oh, just taking a quick look – protobuf isn't supported by the filesystem connector in the [Flink|https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/table/filesystem/] docs. The real bug here might be that filesystem + protobuf doesn't fail immediately as an unsupported option! > Protobuf format throws exception with Map datatype > -- > > Key: FLINK-32008 > URL: https://issues.apache.org/jira/browse/FLINK-32008 > Project: Flink > Issue Type: Bug > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Affects Versions: 1.17.0 >Reporter: Xuannan Su >Priority: Major > Attachments: flink-protobuf-example.zip > > > The protobuf format throws exception when working with Map data type. I > uploaded a example project to reproduce the problem. > > {code:java} > Caused by: java.lang.RuntimeException: One or more fetchers have encountered > exception > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:261) > at > org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:169) > at > org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:131) > at > org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417) > at > org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) > at > 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788) > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received > unexpected exception while polling the records > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:114) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ... 1 more > Caused by: java.io.IOException: Failed to deserialize PB object. 
> at > org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:75) > at > org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:42) > at > org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.readRecord(DeserializationSchemaAdapter.java:197) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.nextRecord(DeserializationSchemaAdapter.java:210) > at > org.apache.flink.connector.file.table.DeserializationSchemaAdapter$Reader.readBatch(DeserializationSchemaAdapter.java:124) > at > org.apache.flink.connector.file.src.util.RecordMapperWrapperRecordIterator$1.readBatch(RecordMapperWrapperRecordIterator.java:82) > at > org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:67) > at > org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) > at > org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:162) > ... 6 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAcce
[jira] [Updated] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version
[ https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-3154: -- Labels: pull-request-available (was: )
[GitHub] [flink] kurtostfeld opened a new pull request, #22660: [FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w…
kurtostfeld opened a new pull request, #22660: URL: https://github.com/apache/flink/pull/22660 …ith backward compatibility for existing savepoints and checkpoints. ## What is the purpose of the change To upgrade the primary Kryo library used by Flink from v2.x to v5.x, while providing backwards compatibility with existing savepoints and checkpoints. This PR adds a new Kryo v5 dependency that is namespaced so that it can coexist with the legacy dependencies that would be kept for compatibility purposes. The existing chill-java library version 0.7.6 would also be deprecated and kept for compatibility purposes only. Some future version of Flink could eventually drop Kryo v2 and chill-java when backwards compatibility with Kryo v2 based state is no longer needed. Why upgrade Kryo: - Kryo v5.x is more compatible with newer versions of Java. I wrote a simple Flink app with Kryo serialized state: when running under Java 17 or Java 21 pre-release builds, the Kryo v2 library fails at runtime, while a prototype build of Flink with this PR can load state serialized with Kryo v5.x under both Java 17 and Java 21 pre-release builds without failures. - Kryo v5.x supports Java records, while Kryo v2.x does not. - Kryo v2.2.x was released over ten years ago, well before the release of Java 8. Kryo is a maintained project and there have been lots of improvements over the past ten years: Kryo v5 has faster runtime performance and more memory-efficient serialization protocols. Additionally, Kryo v5 has fixed lots of bugs. While the Flink project has worked around all Kryo v2 bugs, it would be an improvement to remove the workarounds. Finally, Kryo v2 depends on the chill-java library for functionality; almost all of that functionality is built into Kryo v5.x, so the chill-java dependency can be phased out. ## Brief change log This is a large PR with a lot of surface area and risk. I tried to keep the scope of these changes as narrow and simple as possible.
I copied all the Kryo v2 code to Kryo v5 equivalents and made necessary adjustments to get everything working. Some Flink serialization code references Flink Java classes by full package name, so I didn't modify the package names or class names of any existing serialization classes. ## Verifying this change There are lots of existing unit tests that cover backward compatibility with existing state and also the serialization framework. This obviously passes all those tests. Additionally, I wrote a Flink application to do a more thorough test of the Kryo upgrade that was difficult to convert into unit test form. https://github.com/kurtostfeld/flink-kryo-upgrade-demo ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): yes. This adds a new Kryo v5 dependency and keeps legacy dependencies for backwards compatibility purposes. - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: yes. Kryo v2 APIs are deprecated. Parallel Kryo v5 APIs are created with PublicEvolving - The serializers: yes. This absolutely affects the serializers. - The runtime per-record code paths (performance sensitive): yes. This should be faster but I haven't done any benchmark testing.
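The coexistence strategy described in this PR — new state goes through Kryo v5 while state written by the retained Kryo v2 path can still be read — can be sketched as a version-tagged dispatch. The single-byte tag and all names below are illustrative only, not the PR's actual snapshot format (Flink serializer snapshots carry their own versioning):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch: tag serialized snapshots with a format-version marker
// so restore can route legacy (Kryo v2) state to the retained legacy path and
// new (Kryo v5) state to the new path.
class VersionedDispatch {
    static final byte LEGACY = 2;   // state written by the old Kryo v2 path
    static final byte CURRENT = 5;  // state written by the new Kryo v5 path

    // New snapshots are always written with the current marker.
    static byte[] write(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(CURRENT);
        bos.write(payload);
        return bos.toByteArray();
    }

    // On restore, the leading byte decides which deserializer handles the rest.
    static String routeFor(byte[] snapshot) {
        return snapshot[0] == LEGACY ? "kryo-v2-path" : "kryo-v5-path";
    }
}
```

Under this scheme a future Flink version can drop the legacy branch once no snapshots with the old marker remain, which matches the PR's stated phase-out plan.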
[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype
[ https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726311#comment-17726311 ] Ryan Skraba commented on FLINK-32008: - Hello, thanks for the example project! That's so helpful for reproducing and debugging. The *current* file strategy for protobuf in Flink is to write one record serialized as binary per line, appending a *{{0x0a}}* byte after each record. Your example message is serialized as: {code} 12 06 0a 01 61 12 01 62 0a {code} The first *{{0a}}* is the protobuf encoding for the key field in the map. The last *{{0a}}* is a new line (which probably shouldn't be there). When reading from a file, splits are calculated and assigned to tasks using the *{{0a}}* as a delimiter, which is very likely to fail and is a fault in Flink's protobuf file implementation. I'm guessing this isn't limited to maps; we can expect this delimiter byte to occur in many different ways in the protobuf binary. If this hasn't been addressed, it's probably because it's pretty rare to store protobuf messages in a file container (as opposed to in a single message packet, or a table cell). Do you have a good use case that we can use to guide what we expect Flink to do with protobuf files? For info, there is nothing in the Protobuf encoding that can be used to [distinguish the start or end|https://protobuf.dev/programming-guides/techniques/#streaming] of a message. If we want to store multiple messages in the same container (file or any sequence of bytes), we have to manage the indices ourselves. The above link recommends writing the message size followed by the binary (in Java, this is using {{writeDelimitedTo}}/{{parseDelimitedFrom}} instead of {{writeTo}}/{{parseFrom}}, for example).
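The failure mode and the recommended fix can be illustrated without the protobuf library itself. The helpers below are a hypothetical sketch: a fixed 4-byte length is used for simplicity, whereas protobuf's `writeDelimitedTo`/`parseDelimitedFrom` use a varint length, but the framing principle is the same:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the framing problem (no protobuf dependency; names illustrative).
// A binary record may legitimately contain 0x0a, so splitting a file on 0x0a
// corrupts records; a length prefix frames records unambiguously.
class Framing {
    // The serialized map entry from the comment above; 0x0a appears at index 2
    // as the protobuf tag of the map key field, not as a record boundary.
    static final byte[] RECORD = {0x12, 0x06, 0x0a, 0x01, 0x61, 0x12, 0x01, 0x62};

    // A newline-delimited reader would see this many bogus boundaries.
    static int newlineBoundaries(byte[] data) {
        int n = 0;
        for (byte b : data) if (b == 0x0a) n++;
        return n;
    }

    // Length-prefix framing: write the record size, then the raw bytes.
    static byte[] lengthPrefixed(byte[] record) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(record.length);
        out.write(record);
        return bos.toByteArray();
    }

    // The reader consumes exactly the announced number of bytes, so payload
    // content (including any 0x0a bytes) can never be mistaken for a boundary.
    static byte[] readFramed(byte[] framed) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        return record;
    }
}
```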
[GitHub] [flink] flinkbot commented on pull request #22659: [FLINK-32186][runtime-web] Support subtask stack auto-search when red…
flinkbot commented on PR #22659: URL: https://github.com/apache/flink/pull/22659#issuecomment-1563200191 ## CI report: * 89518bc54ec90fb7d66ef42d4538cbbf7263b43d UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Updated] (FLINK-32186) Support subtask stack auto-search when redirecting from subtask backpressure tab
[ https://issues.apache.org/jira/browse/FLINK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32186: --- Labels: pull-request-available (was: ) > Support subtask stack auto-search when redirecting from subtask backpressure > tab > > > Key: FLINK-32186 > URL: https://issues.apache.org/jira/browse/FLINK-32186 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Affects Versions: 1.18.0 >Reporter: Yu Chen >Assignee: Yu Chen >Priority: Major > Labels: pull-request-available > Attachments: image-2023-05-25-15-52-54-383.png, > image-2023-05-25-16-11-00-374.png > > > Note that we have introduced a dump link on the backpressure page in > FLINK-29996 (Figure 1), which helps to check more easily what the > corresponding subtasks are doing. > But we still have to search for the corresponding call stack of the > back-pressured subtask in the whole TaskManager thread dump, which is not > convenient enough. > Therefore, I would like to trigger the editor's search automatically after > redirecting from the backpressure tab, which will scroll the thread dump to > the corresponding call stack of the back-pressured subtask > (as shown in Figure 2). > !image-2023-05-25-15-52-54-383.png|width=680,height=260! > Figure 1. ThreadDump Link in Backpressure Tab > !image-2023-05-25-16-11-00-374.png|width=680,height=353! > Figure 2. Trigger Auto-search after Redirecting from Backpressure Tab
[GitHub] [flink] yuchen-ecnu opened a new pull request, #22659: [FLINK-32186][runtime-web] Support subtask stack auto-search when red…
yuchen-ecnu opened a new pull request, #22659: URL: https://github.com/apache/flink/pull/22659 …irecting from subtask backpressure tab ## What is the purpose of the change To further optimize the scenario of troubleshooting sub-task backpressure, when the user clicks the `Dump` button from the backpressure panel of the operator, it carries the operator name and **automatically** scrolls to the thread stack of the corresponding subtask. ## Brief change log - Carrying the operator name in the URL when clicking the Dump button in the backpressure panel of the operator. - Triggers the search automatically when the editor in the ThreadDump page is initialized. ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable
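The two changes in the change log — carrying the operator name in the URL, then reading it back to trigger the search — can be sketched as below. The `search` parameter name and both helper functions are hypothetical, not the actual Flink web UI code:

```typescript
// Hypothetical sketch of the redirect flow: the backpressure tab builds a
// thread-dump URL carrying the operator name, and the thread-dump page reads
// the parameter back to seed the editor's auto-search on initialization.
function buildDumpUrl(base: string, operatorName: string): string {
  return `${base}?search=${encodeURIComponent(operatorName)}`;
}

function extractSearchTerm(url: string): string | null {
  const query = url.split("?")[1];
  if (!query) return null;
  for (const pair of query.split("&")) {
    const [key, value] = pair.split("=");
    if (key === "search" && value !== undefined) {
      return decodeURIComponent(value);
    }
  }
  return null; // no operator name carried: open the dump without auto-search
}
```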
[GitHub] [flink-connector-kafka] tzulitai commented on pull request #29: [FLINK-32021][Connectors/Kafka] Improvement the Javadoc for SpecifiedOffsetsInitializer and TimestampOffsetsInitializer
tzulitai commented on PR #29: URL: https://github.com/apache/flink-connector-kafka/pull/29#issuecomment-1563142803 Thanks for the PR @loserwang1024 and @RamanVerma for the reviews! Merging this.
[GitHub] [flink-connector-kafka] tzulitai commented on pull request #25: [FLINK-30859] Sync flink sql client test
tzulitai commented on PR #25: URL: https://github.com/apache/flink-connector-kafka/pull/25#issuecomment-1563139310 On a second look, it looks like this test does far more stuff than just testing the SQL Kafka Connector ... Again, like the state machine example PR, I'm not sure we really should be moving the whole e2e test as is, but rather: 1. modify the original E2E in `apache/flink` to not use Kafka 2. Move over only the parts related to Kafka in `apache/flink-connector-kafka`
[GitHub] [flink] snuyanzin merged pull request #22657: [BP-1.16][FLINK-30844][runtime] Increased TASK_CANCELLATION_TIMEOUT for testInterruptibleSharedLockInInvokeAndCancel
snuyanzin merged PR #22657: URL: https://github.com/apache/flink/pull/22657
[jira] [Updated] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32194: --- Labels: pull-request-available (was: ) > Elasticsearch connector should remove the dependency on flink-shaded > > > Key: FLINK-32194 > URL: https://issues.apache.org/jira/browse/FLINK-32194 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / ElasticSearch >Affects Versions: elasticsearch-4.0.0 >Reporter: Yuxin Tan >Assignee: Yuxin Tan >Priority: Major > Labels: pull-request-available > > The Elasticsearch connector depends on flink-shaded. With the externalization > of the connector, the connectors shouldn't rely on flink-shaded.
[GitHub] [flink-connector-elasticsearch] boring-cyborg[bot] commented on pull request #65: [FLINK-32194][connectors/elasticsearch] Remove the dependency on flink-shaded
boring-cyborg[bot] commented on PR #65: URL: https://github.com/apache/flink-connector-elasticsearch/pull/65#issuecomment-1563090057 Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)
[jira] [Created] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded
Yuxin Tan created FLINK-32194: - Summary: Elasticsearch connector should remove the dependency on flink-shaded Key: FLINK-32194 URL: https://issues.apache.org/jira/browse/FLINK-32194 Project: Flink Issue Type: Technical Debt Components: Connectors / ElasticSearch Affects Versions: elasticsearch-4.0.0 Reporter: Yuxin Tan Assignee: Yuxin Tan The Elasticsearch connector depends on flink-shaded. With the externalization of the connector, the connectors shouldn't rely on flink-shaded.
[jira] [Assigned] (FLINK-32193) AWS connector removes the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuxin Tan reassigned FLINK-32193: - Assignee: (was: Yuxin Tan) > AWS connector removes the dependency on flink-shaded > > > Key: FLINK-32193 > URL: https://issues.apache.org/jira/browse/FLINK-32193 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / AWS >Reporter: Yuxin Tan >Priority: Major > Fix For: aws-connector-4.2.0 > > > The AWS connector depends on flink-shaded. With the externalization of the > connector, connectors shouldn't rely on flink-shaded.
[jira] [Commented] (FLINK-32144) FileSourceTextLinesITCase times out on AZP
[ https://issues.apache.org/jira/browse/FLINK-32144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726264#comment-17726264 ] Lijie Wang commented on FLINK-32144: Adding this feature at the configuration level is an option. Another option, according to the [doc of junit5 migration|https://docs.google.com/document/d/1514Wa_aNB9bJUen4xm5uiuXOooOJTtXqS_Jqk9KJitU], is to let {{MiniClusterExtension}} implement the {{CustomExtension}} interface, and wrap it with {{EachCallbackWrapper}} for a per-test lifecycle, or with {{AllCallbackWrapper}} for a per-class lifecycle (currently some other extensions follow this rule, for example {{{}ZooKeeperExtension{}}}). But I found that we did it this way before FLINK-26252, and I don't know why it was changed to the current state in FLINK-26252. > FileSourceTextLinesITCase times out on AZP > -- > > Key: FLINK-32144 > URL: https://issues.apache.org/jira/browse/FLINK-32144 > Project: Flink > Issue Type: Bug > Components: Connectors / FileSystem, Tests >Affects Versions: 1.18.0 >Reporter: Sergey Nuyanzin >Priority: Critical > Labels: test-stability > > {noformat} > May 21 01:15:09 > == > May 21 01:15:09 Process produced no output for 900 seconds. > May 21 01:15:09 > == > May 21 01:15:09 > == > May 21 01:15:09 The following Java processes are running (JPS) > May 21 01:15:09 > == > ... 
> May 21 01:15:10 "ForkJoinPool-1-worker-25" #27 daemon prio=5 os_prio=0 > tid=0x7ff334c35800 nid=0x1e49 waiting on condition [0x7ff23c3e7000] > May 21 01:15:10java.lang.Thread.State: WAITING (parking) > May 21 01:15:10 at sun.misc.Unsafe.park(Native Method) > May 21 01:15:10 - parking to wait for <0xce8682e0> (a > java.util.concurrent.CompletableFuture$Signaller) > May 21 01:15:10 at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > May 21 01:15:10 at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) > May 21 01:15:10 at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313) > May 21 01:15:10 at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) > May 21 01:15:10 at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > May 21 01:15:10 at > org.apache.flink.connector.file.src.FileSourceTextLinesITCase$RecordCounterToFail.waitToFail(FileSourceTextLinesITCase.java:481) > May 21 01:15:10 at > org.apache.flink.connector.file.src.FileSourceTextLinesITCase$RecordCounterToFail.access$100(FileSourceTextLinesITCase.java:457) > May 21 01:15:10 at > org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testBoundedTextFileSource(FileSourceTextLinesITCase.java:145) > May 21 01:15:10 at > org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testBoundedTextFileSourceWithJobManagerFailover(FileSourceTextLinesITCase.java:110) > May 21 01:15:10 > ... > May 21 01:15:10 Killing process with pid=860 and all descendants > /__w/2/s/tools/ci/watchdog.sh: line 113: 860 Terminated $cmd > May 21 01:15:11 Process exited with EXIT CODE: 143. > May 21 01:15:11 Trying to KILL watchdog (856). 
> May 21 01:15:11 Searching for .dump, .dumpstream and related files in > '/__w/2/s' > May 21 01:15:15 Moving > '/__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports/2023-05-21T00-58-13_549-jvmRun2.dumpstream' > to target directory ('/__w/_temp/debug_files') > May 21 01:15:15 Moving > '/__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports/2023-05-21T00-58-13_549-jvmRun2.dump' > to target directory ('/__w/_temp/debug_files') > The STDIO streams did not close within 10 seconds of the exit event from > process '/bin/bash'. This may indicate a child process inherited the STDIO > streams and has not yet exited. > {noformat} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49181&view=logs&j=4eda0b4a-bd0d-521a-0916-8285b9be9bb5&t=2ff6d5fa-53a6-53ac-bff7-fa524ea361a9&l=14436
[jira] [Closed] (FLINK-31945) Table of contents in the blogs of the project website is missing some titles
[ https://issues.apache.org/jira/browse/FLINK-31945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn Visser closed FLINK-31945. -- Resolution: Fixed Fixed in apache/flink-web:asf-site via 6948c400ce9749fdb3a2f7ccc3a96ec76b2e4a47 > Table of contents in the blogs of the project website is missing some titles > > > Key: FLINK-31945 > URL: https://issues.apache.org/jira/browse/FLINK-31945 > Project: Flink > Issue Type: Improvement > Components: Project Website >Reporter: Zhilong Hong >Assignee: Zhilong Hong >Priority: Minor > Labels: pull-request-available > > The ToC in the blog pages of the project website doesn't have all the section > titles. The section titles of the first level are missing. > Solution: Add the following configuration item in config.toml. > {noformat} > [markup.tableOfContents] > startLevel = 0{noformat}
[GitHub] [flink-connector-kafka] tzulitai commented on a diff in pull request #25: [FLINK-30859] Sync flink sql client test
tzulitai commented on code in PR #25: URL: https://github.com/apache/flink-connector-kafka/pull/25#discussion_r1205621149

## flink-connector-kafka-e2e-tests/flink-sql-client-test/pom.xml:

@@ -0,0 +1,129 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+    <parent>
+        <groupId>org.apache.flink</groupId>
+        <artifactId>flink-connector-kafka-e2e-tests</artifactId>
+        <version>3.1-SNAPSHOT</version>
+    </parent>
+
+    <modelVersion>4.0.0</modelVersion>
+
+    <artifactId>flink-sql-client-test</artifactId>
+    <name>Flink : E2E Tests : SQL client</name>
+    <packaging>jar</packaging>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.flink</groupId>
+            <artifactId>flink-table-common</artifactId>
+            <version>${flink.version}</version>
+            <scope>provided</scope>
+        </dependency>
+
+        <dependency>
+            <groupId>org.apache.flink</groupId>
+            <artifactId>flink-end-to-end-tests-common-kafka</artifactId>
+            <version>${project.version}</version>
+        </dependency>

Review Comment: Is this comment still relevant?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-connector-kafka] tzulitai commented on a diff in pull request #24: [FLINK-30859] Sync kafka related examples from apache/flink:master
tzulitai commented on code in PR #24: URL: https://github.com/apache/flink-connector-kafka/pull/24#discussion_r1205609370

## flink-examples-kafka/flink-examples-streaming-state-machine-build-helper/src/main/resources/META-INF/NOTICE:

@@ -0,0 +1,9 @@
+flink-examples-streaming-state-machine
+Copyright 2014-2023 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+This project bundles the following dependencies under the Apache Software License 2.0. (http://www.apache.org/licenses/LICENSE-2.0.txt)
+
+- org.apache.kafka:kafka-clients:3.2.3

Review Comment: the Kafka version depended on is now 3.4.0

## flink-examples-kafka/pom.xml:

@@ -0,0 +1,45 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+
+    <modelVersion>4.0.0</modelVersion>
+
+    <parent>
+        <groupId>org.apache.flink</groupId>
+        <artifactId>flink-connector-kafka-parent</artifactId>
+        <version>3.1-SNAPSHOT</version>
+    </parent>
+
+    <packaging>pom</packaging>
+
+    <artifactId>flink-examples-kafka</artifactId>
+    <name>Flink : Formats : Kafka</name>

Review Comment: `Flink : Examples : Kafka`?
[GitHub] [flink-connector-jdbc] boring-cyborg[bot] commented on pull request #46: [hotfix][test] remove non-used code in tests
boring-cyborg[bot] commented on PR #46: URL: https://github.com/apache/flink-connector-jdbc/pull/46#issuecomment-1563027492 Awesome work, congrats on your first merged pull request!
[GitHub] [flink-connector-jdbc] MartijnVisser merged pull request #46: [hotfix][test] remove non-used code in tests
MartijnVisser merged PR #46: URL: https://github.com/apache/flink-connector-jdbc/pull/46
[jira] [Commented] (FLINK-31832) Add benchmarks for end to end restarting tasks
[ https://issues.apache.org/jira/browse/FLINK-31832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726242#comment-17726242 ] Lijie Wang commented on FLINK-31832: Done via: flink master: 7d529a5a167f52c1b76fd6f70918b27149cf8782 flink-benchmarks master: 8b32b75e06ec51ee6ec21304c6e473159e0ef3ff
> Add benchmarks for end to end restarting tasks
>
> Key: FLINK-31832
> URL: https://issues.apache.org/jira/browse/FLINK-31832
> Project: Flink
> Issue Type: Sub-task
> Components: Benchmarks, Runtime / Coordination
> Reporter: Weihua Hu
> Assignee: Weihua Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.18.0
>
> As discussed in https://issues.apache.org/jira/browse/FLINK-31771.
> We need a benchmark for job failover and end-to-end task restarting.
[GitHub] [flink-connector-jdbc] WenDing-Y commented on pull request #49: [FLINK-32068] connector jdbc support clickhouse
WenDing-Y commented on PR #49: URL: https://github.com/apache/flink-connector-jdbc/pull/49#issuecomment-1562933275 The PR is ready.
[jira] [Updated] (FLINK-32068) flink-connector-jdbc support clickhouse
[ https://issues.apache.org/jira/browse/FLINK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-32068: --- Labels: pull-request-available (was: )
> flink-connector-jdbc support clickhouse
>
> Key: FLINK-32068
> URL: https://issues.apache.org/jira/browse/FLINK-32068
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / JDBC
> Reporter: leishuiyu
> Assignee: leishuiyu
> Priority: Minor
> Labels: pull-request-available
>
> Flink SQL should support ClickHouse:
> * In batch scenarios, ClickHouse can serve as both a source and a sink
> * In streaming scenarios, ClickHouse can serve as a sink
[jira] [Updated] (FLINK-32193) AWS connector removes the dependency on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-32193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuxin Tan updated FLINK-32193: -- Issue Type: Technical Debt (was: Bug)
> AWS connector removes the dependency on flink-shaded
>
> Key: FLINK-32193
> URL: https://issues.apache.org/jira/browse/FLINK-32193
> Project: Flink
> Issue Type: Technical Debt
> Components: Connectors / AWS
> Reporter: Yuxin Tan
> Assignee: Yuxin Tan
> Priority: Major
> Fix For: aws-connector-4.2.0
>
> The AWS connector depends on flink-shaded. With the externalization of the connector, connectors shouldn't rely on flink-shaded.