[jira] [Updated] (FLINK-31901) AbstractBroadcastWrapperOperator should not block checkpoint barriers when processing cached records

2023-05-25 Thread Zhipeng Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhipeng Zhang updated FLINK-31901:
--
Description: 
Currently `BroadcastUtils#withBroadcast` tries to cache the non-broadcast 
input until all of the broadcast inputs are processed. After the broadcast 
variables are ready, we first process the cached records and then continue to 
process the newly arrived records.

 

Processing cached elements is invoked via `Input#processElement` and 
`Input#processWatermark`. However, processing the cached elements may take a 
long time since there may be many cached records, which can block the 
checkpoint barrier.

 

If we run the code snippet at [1], we get logs like the following.
{code:java}
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 1 at 
time: 1682319149462
processed cached records, cnt: 1 at time: 1682319149569
processed cached records, cnt: 2 at time: 1682319149614
processed cached records, cnt: 3 at time: 1682319149655
processed cached records, cnt: 4 at time: 1682319149702
processed cached records, cnt: 5 at time: 1682319149746
processed cached records, cnt: 6 at time: 1682319149781
processed cached records, cnt: 7 at time: 1682319149891
processed cached records, cnt: 8 at time: 1682319150011
processed cached records, cnt: 9 at time: 1682319150116
processed cached records, cnt: 10 at time: 1682319150199
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 2 at 
time: 1682319150378
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 3 at 
time: 1682319150606
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 4 at 
time: 1682319150704
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 5 at 
time: 1682319150785
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 6 at 
time: 1682319150859
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 7 at 
time: 1682319150935
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 8 at 
time: 1682319151007{code}
 

We can see that from line#2 to line#11 there are no checkpoints: the barriers 
are blocked until all cached elements are processed, which takes ~600ms, much 
longer than the checkpoint interval (i.e., 100ms).

 

  [1]https://github.com/zhipeng93/flink-ml/tree/FLINK-31901-demo-case
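
A minimal sketch of one possible mitigation, assuming the wrapper operator runs on the task mailbox thread and has access to a `MailboxExecutor` (the class, `CHUNK_SIZE`, and `processOne` below are hypothetical names, not the actual `AbstractBroadcastWrapperOperator` internals): drain the cache in bounded chunks and yield to the mailbox between chunks, so pending mails such as the checkpoint action can run.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import org.apache.flink.api.common.operators.MailboxExecutor;
import org.apache.flink.util.function.ThrowingConsumer;

class ChunkedCacheDrainer<T> {
    private static final int CHUNK_SIZE = 100; // hypothetical chunk size

    private final Queue<T> cachedRecords = new ArrayDeque<>();
    private final MailboxExecutor mailboxExecutor;
    private final ThrowingConsumer<T, Exception> processOne;

    ChunkedCacheDrainer(MailboxExecutor mailboxExecutor, ThrowingConsumer<T, Exception> processOne) {
        this.mailboxExecutor = mailboxExecutor;
        this.processOne = processOne;
    }

    void cache(T record) {
        cachedRecords.add(record);
    }

    void drainCache() throws Exception {
        while (!cachedRecords.isEmpty()) {
            for (int i = 0; i < CHUNK_SIZE && !cachedRecords.isEmpty(); i++) {
                processOne.accept(cachedRecords.poll());
            }
            // Run pending mails (e.g., the checkpoint barrier action) between chunks,
            // instead of blocking them until the whole cache is drained.
            while (mailboxExecutor.tryYield()) {
                // keep running mails while any are pending
            }
        }
    }
}
{code}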

  was:
Currently `BroadcastUtils#withBroadcast` tries to cache the non-broadcast 
input until all of the broadcast inputs are processed. After the broadcast 
variables are ready, we first process the cached records and then continue to 
process the newly arrived records.

 

Processing cached elements is invoked via `Input#processElement` and 
`Input#processWatermark`. However, processing the cached elements may take a 
long time since there may be many cached records, which can block the 
checkpoint barrier.

 

If we run the code snippet at [1], we get logs like the following.
{code:java}
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 1 at 
time: 1682319149462
processed cached records, cnt: 1 at time: 1682319149569
processed cached records, cnt: 2 at time: 1682319149614
processed cached records, cnt: 3 at time: 1682319149655
processed cached records, cnt: 4 at time: 1682319149702
processed cached records, cnt: 5 at time: 1682319149746
processed cached records, cnt: 6 at time: 1682319149781
processed cached records, cnt: 7 at time: 1682319149891
processed cached records, cnt: 8 at time: 1682319150011
processed cached records, cnt: 9 at time: 1682319150116
processed cached records, cnt: 10 at time: 1682319150199
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 2 at 
time: 1682319150378
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 3 at 
time: 1682319150606
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 4 at 
time: 1682319150704
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 5 at 
time: 1682319150785
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 6 at 
time: 1682319150859
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 7 at 
time: 1682319150935
OneInputBroadcastWrapperOperator doing checkpoint with checkpoint id: 8 at 
time: 1682319151007{code}
 

We can see that from line#3 to line#12 there are no checkpoints: the barriers 
are blocked until all cached elements are processed, which takes ~600ms, much 
longer than the checkpoint interval (i.e., 100ms).

 

  [1]https://github.com/zhipeng93/flink-ml/tree/FLINK-31901-demo-case


> AbstractBroadcastWrapperOperator should not block checkpoint barriers when 
> processing cached records
> 
>
> 

[jira] [Updated] (FLINK-32202) useless configuration

2023-05-25 Thread zhangdong7 (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdong7 updated FLINK-32202:
---
Description: 
According to the official Flink documentation, the parameter query.server.ports 
has been replaced by queryable-state.server.ports, but the entry 
query.server.ports: 6125 is still generated when Flink starts. Is this a 
historical leftover?

 

 

public static final ConfigOption<String> SERVER_PORT_RANGE = 
ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
 port range of the queryable state server. The specified range can be a single 
port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and 
ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
String[]{"query.server.ports"});

  was:According to the official Flink documentation, the parameter 
query.server.ports has been replaced by queryable-state.server.ports, but the 
entry query.server.ports: 6125 is still generated when Flink starts. Is this 
a historical leftover?


> useless configuration
> -
>
> Key: FLINK-32202
> URL: https://issues.apache.org/jira/browse/FLINK-32202
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Core
>Affects Versions: 1.15.4
>Reporter: zhangdong7
>Priority: Minor
>
> According to the official Flink documentation, the parameter 
> query.server.ports has been replaced by queryable-state.server.ports, but the 
> entry query.server.ports: 6125 is still generated when Flink starts. Is 
> this a historical leftover?
>  
>  
> public static final ConfigOption<String> SERVER_PORT_RANGE = 
> ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
>  port range of the queryable state server. The specified range can be a 
> single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges 
> and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
> String[]{"query.server.ports"});





[jira] [Updated] (FLINK-32202) useless configuration

2023-05-25 Thread zhangdong7 (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdong7 updated FLINK-32202:
---
Description: 
According to the official Flink documentation, the parameter query.server.ports 
has been replaced by queryable-state.server.ports, but the entry 
query.server.ports: 6125 is still generated when Flink starts. Is this a 
historical leftover?

 
{code:java}
public static final ConfigOption<String> SERVER_PORT_RANGE =
        ConfigOptions.key("queryable-state.server.ports")
                .stringType()
                .defaultValue("9067")
                .withDescription(
                        "The port range of the queryable state server. The specified range can be a single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and ports: \"50100-50200,50300-50400,51234\".")
                .withDeprecatedKeys(new String[] {"query.server.ports"});{code}
 

  was:
According to the official Flink documentation, the parameter query.server.ports 
has been replaced by queryable-state.server.ports, but the entry 
query.server.ports: 6125 is still generated when Flink starts. Is this a 
historical leftover?

 

 

public static final ConfigOption<String> SERVER_PORT_RANGE = 
ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
 port range of the queryable state server. The specified range can be a single 
port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and 
ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
String[]{"query.server.ports"});


> useless configuration
> -
>
> Key: FLINK-32202
> URL: https://issues.apache.org/jira/browse/FLINK-32202
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Core
>Affects Versions: 1.15.4
>Reporter: zhangdong7
>Priority: Minor
>
> According to the official Flink documentation, the parameter 
> query.server.ports has been replaced by queryable-state.server.ports, but the 
> entry query.server.ports: 6125 is still generated when Flink starts. Is 
> this a historical leftover?
>  
> {code:java}
> public static final ConfigOption<String> SERVER_PORT_RANGE = 
> ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
>  port range of the queryable state server. The specified range can be a 
> single port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges 
> and ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
> String[]{"query.server.ports"});{code}
>  





[jira] [Updated] (FLINK-32202) useless configuration

2023-05-25 Thread zhangdong7 (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdong7 updated FLINK-32202:
---
Environment: (was: public static final ConfigOption<String> 
SERVER_PORT_RANGE = 
ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
 port range of the queryable state server. The specified range can be a single 
port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and 
ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
String[]{"query.server.ports"});)

> useless configuration
> -
>
> Key: FLINK-32202
> URL: https://issues.apache.org/jira/browse/FLINK-32202
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Core
>Affects Versions: 1.15.4
>Reporter: zhangdong7
>Priority: Minor
>
> According to the official Flink documentation, the parameter 
> query.server.ports has been replaced by queryable-state.server.ports, but the 
> entry query.server.ports: 6125 is still generated when Flink starts. Is 
> this a historical leftover?





[jira] [Created] (FLINK-32202) useless configuration

2023-05-25 Thread zhangdong7 (Jira)
zhangdong7 created FLINK-32202:
--

 Summary: useless configuration
 Key: FLINK-32202
 URL: https://issues.apache.org/jira/browse/FLINK-32202
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Affects Versions: 1.15.4
 Environment: public static final ConfigOption<String> 
SERVER_PORT_RANGE = 
ConfigOptions.key("queryable-state.server.ports").stringType().defaultValue("9067").withDescription("The
 port range of the queryable state server. The specified range can be a single 
port: \"9123\", a range of ports: \"50100-50200\", or a list of ranges and 
ports: \"50100-50200,50300-50400,51234\".").withDeprecatedKeys(new 
String[]{"query.server.ports"});
Reporter: zhangdong7


According to the official Flink documentation, the parameter query.server.ports 
has been replaced by queryable-state.server.ports, but the entry 
query.server.ports: 6125 is still generated when Flink starts. Is this a 
historical leftover?





[jira] [Created] (FLINK-32201) Enable the distribution of shuffle descriptors via the blob server by connection number

2023-05-25 Thread Weihua Hu (Jira)
Weihua Hu created FLINK-32201:
-

 Summary: Enable the distribution of shuffle descriptors via the 
blob server by connection number
 Key: FLINK-32201
 URL: https://issues.apache.org/jira/browse/FLINK-32201
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Weihua Hu



Flink supports distributing shuffle descriptors via the blob server to reduce 
JobManager overhead. But the default threshold to enable it is 1MB, which is 
never reached. Users need to set a proper value themselves, but that requires 
advanced knowledge of the runtime.

I would like to enable this feature based on the number of connections of a 
group of shuffle descriptors. For example, take a simple streaming job with two 
operators, each with a parallelism of 10,000 and connected via all-to-all 
distribution. In this job we get only one set of shuffle descriptors, and this 
group has 10,000 * 10,000 connections. This means that the JobManager needs to 
send this set of shuffle descriptors to 10,000 tasks.

Since it is also difficult for users to configure, I would like to give it a 
default value. The serialized shuffle descriptor sizes for different 
parallelisms are shown below.


|| producer parallelism || serialized shuffle descriptor size || consumer parallelism || total data size that JM needs to send ||
| 5,000 | 100KB | 5,000 | 500MB |
| 10,000 | 200KB | 10,000 | 2GB |
| 20,000 | 400KB | 20,000 | 8GB |

So, I would like to set the default value to 10,000 * 10,000. 

Any suggestions or concerns are appreciated.
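
A minimal sketch of the proposed heuristic (the class and method names are hypothetical; only the 10,000 * 10,000 default comes from the proposal above):
{code:java}
public final class ShuffleDescriptorDistributionSketch {

    // Proposed default: switch to the blob server at 10,000 x 10,000 connections.
    private static final long CONNECTION_THRESHOLD = 10_000L * 10_000L;

    /** Whether to ship the descriptors of an all-to-all edge via the blob server. */
    static boolean useBlobServer(int producerParallelism, int consumerParallelism) {
        return (long) producerParallelism * consumerParallelism >= CONNECTION_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(useBlobServer(5_000, 5_000));   // false: 2.5e7 connections
        System.out.println(useBlobServer(10_000, 10_000)); // true:  1e8 connections
    }
}
{code}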







[GitHub] [flink-web] Matrix42 opened a new pull request, #656: fix logo redirect to wrong url

2023-05-25 Thread via GitHub


Matrix42 opened a new pull request, #656:
URL: https://github.com/apache/flink-web/pull/656

   (no comment)





[jira] [Updated] (FLINK-32192) JsonBatchFileSystemITCase fail due to Process Exit Code: 239 (NoClassDefFoundError: akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1)

2023-05-25 Thread Sergey Nuyanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Nuyanzin updated FLINK-32192:

Summary: JsonBatchFileSystemITCase fail due to Process Exit Code: 239 
(NoClassDefFoundError: 
akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1)
  (was: JsonBatchFileSystemITCase fail due to Process Exit Code: 239 (because 
of NoClassDefFoundError 
akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1))

> JsonBatchFileSystemITCase fail due to Process Exit Code: 239 
> (NoClassDefFoundError: 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1)
> -
>
> Key: FLINK-32192
> URL: https://issues.apache.org/jira/browse/FLINK-32192
> Project: Flink
>  Issue Type: Sub-task
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Tests
>Affects Versions: 1.18.0
>Reporter: Sergey Nuyanzin
>Priority: Critical
>  Labels: test-stability
>
> [This 
> build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49288&view=logs&j=7e3d33c3-a462-5ea8-98b8-27e1aafe4ceb&t=ef77f8d1-44c8-5ee2-f175-1c88f61de8c0]
>  failed with exit code 239 in test_cron_hadoop313 connect_1 because 
> JsonBatchFileSystemITCase crashed
> {noformat}
> May 24 01:02:14 01:02:14.069 [ERROR] Crashed tests:
> May 24 01:02:14 01:02:14.069 [ERROR] 
> org.apache.flink.formats.json.JsonBatchFileSystemITCase
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:465)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:442)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> May 24 01:02:14 01:02:14.069 [ERROR]  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> May 24 01:02:14 01:02:14.069 [ERROR]  at java.lang.Thread.run(Thread.java:748)
> {noformat}
> also in the logs:
> {noformat}
> 01:02:00,081 [flink-akka.actor.default-dispatcher-9] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'flink-akka.actor.default-dispatcher-9' produced an uncaught 
> exception. Stopping the process...
> java.lang.NoClassDefFoundError: 
> akka/actor/dungeon/FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1
> at 
> akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException(FaultHandling.scala:336)
>  ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at 
> akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException$(FaultHandling.scala:336)
>  ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at 
> akka.actor.ActorCell.handleNonFatalOrInterruptedException(ActorCell.scala:410)
>  ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:523) 
> ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:535) 
> ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:295) 
> ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at akka.dispatch.Mailbox.run(Mailbox.scala:230) 
> ~[flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243) 
> [flink-rpc-akka_af85cba1-bb7d-40d1-98e1-939c276575fd.jar:1.18-SNAPSHOT]
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) 
> [?:1.8.0_292]
> at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) 
> [?:1.8.0_292]
> at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) 
> [?:1.8.0_292]
> at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) 
> [?:1.8.0_292]
> Caused by: java.lang.ClassNotFoundException: 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1
> at java.net.URLClassLoader.findClass(URLClassLoader.

[jira] [Updated] (FLINK-32150) ThreadDumpInfoTest crashed with exit code 239 on AZP (NoClassDefFoundError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder)

2023-05-25 Thread Sergey Nuyanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Nuyanzin updated FLINK-32150:

Summary: ThreadDumpInfoTest crashed with exit code 239 on AZP 
(NoClassDefFoundError: 
org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder)  (was: 
ThreadDumpInfoTest crashed with exit code 239 on AZP (because of 
NoClassDefFoundError 
org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder))

> ThreadDumpInfoTest crashed with exit code 239 on AZP (NoClassDefFoundError: 
> org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder)
> -
>
> Key: FLINK-32150
> URL: https://issues.apache.org/jira/browse/FLINK-32150
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST, Tests
>Affects Versions: 1.18.0
>Reporter: Sergey Nuyanzin
>Priority: Critical
>  Labels: test-stability
> Attachments: logs-ci-test_ci_core-1684238025.zip
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49057&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=8572
> {noformat}
> May 16 12:03:24 12:03:24.449 [ERROR] 
> org.apache.flink.runtime.rest.messages.ThreadDumpInfoTest
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:370)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:351)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:171)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:163)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:294)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
> May 16 12:03:24 12:03:24.449 [ERROR]  at 
> org.apache.maven.cli.MavenCli.execute(MavenCli.java:960)
> {noformat}
> also in the logs:
> {noformat}
> 12:01:08,340 [flink-akka.remote.default-remote-dispatcher-5] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'flink-akka.remote.default-remote-dispatcher-5' produced an uncaught 
> exception. Stopping the process...
> java.lang.NoClassDefFoundError: 
> org/jboss/netty/handler/codec/frame/LengthFieldBasedFrameDecoder
> at 
> akka.remote.transport.netty.NettyTransport.akka$remote$transport$netty$NettyTransport$$newPipeline(NettyTransport.scala:402)
>  ~[flink-rpc-akka_40d517e5-dc06-423d-8404-00e8331e0610.jar:1.18-SNAPSHOT]
> at 
> akka.remote.transport.nett

[jira] [Created] (FLINK-32200) OrcFileSystemITCase crashed with exit code 239 (NoClassDefFoundError: scala/concurrent/duration/Deadline)

2023-05-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-32200:
-

 Summary: OrcFileSystemITCase crashed with exit code 239 
(NoClassDefFoundError: scala/concurrent/duration/Deadline)
 Key: FLINK-32200
 URL: https://issues.apache.org/jira/browse/FLINK-32200
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.18.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49325&view=logs&j=4eda0b4a-bd0d-521a-0916-8285b9be9bb5&t=2ff6d5fa-53a6-53ac-bff7-fa524ea361a9&l=12302

{code}
12:24:14,883 [flink-akka.actor.internal-dispatcher-2] ERROR 
org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: Thread 
'flink-akka.actor.internal-dispatcher-2' produced an uncaught exception. 
Stopping the process...
java.lang.NoClassDefFoundError: scala/concurrent/duration/Deadline
at scala.concurrent.duration.Deadline$.apply(Deadline.scala:30) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.duration.Deadline$.now(Deadline.scala:76) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at akka.actor.CoordinatedShutdown.loop$1(CoordinatedShutdown.scala:737) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.actor.CoordinatedShutdown.$anonfun$run$7(CoordinatedShutdown.scala:762) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
 ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
 ~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
 [flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) 
[?:1.8.0_292]
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) 
[?:1.8.0_292]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) 
[?:1.8.0_292]
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) 
[?:1.8.0_292]
12:24:14,882 [flink-metrics-akka.actor.internal-dispatcher-2] ERROR 
org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: Thread 
'flink-metrics-akka.actor.internal-dispatcher-2' produced an uncaught 
exception. Stopping the process...
java.lang.NoClassDefFoundError: scala/concurrent/duration/Deadline
at scala.concurrent.duration.Deadline$.apply(Deadline.scala:30) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.duration.Deadline$.now(Deadline.scala:76) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at akka.actor.CoordinatedShutdown.loop$1(CoordinatedShutdown.scala:737) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.actor.CoordinatedShutdown.$anonfun$run$7(CoordinatedShutdown.scala:762) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) 
~[flink-rpc-akka_318674dc-e98c-4e16-8705-faefda52bd1a.jar:1.18-SNAPSHOT]
at 
akka.dispatch.BatchingExecutor

[GitHub] [flink] 1996fanrui commented on pull request #22560: [FLINK-32023][API / DataStream] Support negative Duration for special usage, such as -1ms for execution.buffer-timeout

2023-05-25 Thread via GitHub


1996fanrui commented on PR #22560:
URL: https://github.com/apache/flink/pull/22560#issuecomment-1563868797

   I will merge this PR next Monday if no other comments.
   
   Hi @Myracle , could you help squash all commits? thanks~





[GitHub] [flink] 1996fanrui commented on pull request #22560: [FLINK-32023][API / DataStream] Support negative Duration for special usage, such as -1ms for execution.buffer-timeout

2023-05-25 Thread via GitHub


1996fanrui commented on PR #22560:
URL: https://github.com/apache/flink/pull/22560#issuecomment-1563867456

   I will merge this PR next Monday if no other comments.





[jira] [Commented] (FLINK-32191) Support for configuring tcp keepalive related parameters.

2023-05-25 Thread Weihua Hu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726494#comment-17726494
 ] 

Weihua Hu commented on FLINK-32191:
---

There are two transport types, NIO and epoll.

For epoll, we already have options for keepalive, such as 
"EpollChannelOption.TCP_KEEPIDLE".
But for NIO, the keepalive options were only introduced in JDK 11, such as 
"ExtendedSocketOptions.TCP_KEEPIDLE".

Flink is still required to be compatible with JDK 8, even though JDK 8 has been 
deprecated. Hence, we need to inform users that these configurations will not 
be taken into account if NIO and JDK 8 are used together.
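
A hedged sketch of how these options could be wired into Netty's epoll transport (the values are illustrative, not proposed defaults; with NIO on JDK 11+ the equivalent jdk.net.ExtendedSocketOptions would be needed instead):
{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;

public class KeepaliveOptionsSketch {
    static Bootstrap withTcpKeepalive(Bootstrap bootstrap) {
        return bootstrap
                .option(ChannelOption.SO_KEEPALIVE, true)
                // epoll-only options; unavailable with NIO on JDK 8, as noted above
                .option(EpollChannelOption.TCP_KEEPIDLE, 60)  // idle seconds before first probe
                .option(EpollChannelOption.TCP_KEEPINTVL, 10) // seconds between probes
                .option(EpollChannelOption.TCP_KEEPCNT, 5);   // failed probes before dropping
    }
}
{code}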

Would you mind taking a look at this ticket when you are free? [~Weijie 
Guo] [~wanglijie]


> Support for configuring tcp keepalive related parameters.
> -
>
> Key: FLINK-32191
> URL: https://issues.apache.org/jira/browse/FLINK-32191
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: dizhou cao
>Priority: Minor
>
> We encountered a case in our production environment where the netty client 
> was unable to send data to the server due to an abnormality in the switch 
> link. However, the client can only detect the abnormality after the RTO 
> retransmissions time out and fail, which takes about 15 minutes in our 
> production environment. This may result in 15 minutes of job unavailability. 
> We hope to perform failover and reschedule the job more quickly. Flink 
> already enables keepalive, but the default keepalive idle time is 2 hours. We 
> can adjust the timeout of TCP keepalive by configuring TCP_KEEPIDLE, 
> TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These configurations are already 
> supported at the Netty level.





[GitHub] [flink] XComp commented on pull request #22390: [FLINK-31785][runtime] Moves LeaderElectionService.stop into LeaderElection.close

2023-05-25 Thread via GitHub


XComp commented on PR #22390:
URL: https://github.com/apache/flink/pull/22390#issuecomment-1563845998

   Did a rebase to `master` after FLINK-31776 (PR #22384) was merged.





[jira] [Resolved] (FLINK-31776) Introducing sub-interface LeaderElectionService.LeaderElection

2023-05-25 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-31776.
---
Fix Version/s: 1.18.0
   Resolution: Fixed

master: 3b53f30d396e5a7f22330d6522ea26450e238628

>  Introducing sub-interface LeaderElectionService.LeaderElection
> ---
>
> Key: FLINK-31776
> URL: https://issues.apache.org/jira/browse/FLINK-31776
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>






[jira] [Commented] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism

2023-05-25 Thread Lijie Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726487#comment-17726487
 ] 

Lijie Wang commented on FLINK-32199:


Thanks [~JunRuiLi], assigned to you :D

> MetricStore does not remove metrics of nonexistent parallelism in 
> TaskMetricStore when scale down job parallelism
> -
>
> Key: FLINK-32199
> URL: https://issues.apache.org/jira/browse/FLINK-32199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.17.0
>Reporter: Junrui Li
>Priority: Major
> Fix For: 1.18.0, 1.16.3, 1.17.2
>
>
> After FLINK-29615, Flink updates the subtask metric stores when scaling 
> down parallelism. However, task metrics are added in the form of 
> "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name" 
> and are never removed, so users will find many redundant metrics through 
> JobVertexMetricsHandler, which is very troublesome.





[jira] [Updated] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism

2023-05-25 Thread Lijie Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang updated FLINK-32199:
---
Fix Version/s: 1.18.0

> MetricStore does not remove metrics of nonexistent parallelism in 
> TaskMetricStore when scale down job parallelism
> -
>
> Key: FLINK-32199
> URL: https://issues.apache.org/jira/browse/FLINK-32199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.17.0
>Reporter: Junrui Li
>Priority: Major
> Fix For: 1.18.0, 1.16.3, 1.17.2
>
>
> After FLINK-29615, Flink updates the subtask metric stores when scaling 
> down parallelism. However, task metrics are added in the form of 
> "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name" 
> and are never removed, so users will find many redundant metrics through 
> JobVertexMetricsHandler, which is very troublesome.





[jira] [Assigned] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism

2023-05-25 Thread Lijie Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang reassigned FLINK-32199:
--

Assignee: Junrui Li

> MetricStore does not remove metrics of nonexistent parallelism in 
> TaskMetricStore when scale down job parallelism
> -
>
> Key: FLINK-32199
> URL: https://issues.apache.org/jira/browse/FLINK-32199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.17.0
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
> Fix For: 1.18.0, 1.16.3, 1.17.2
>
>
> After FLINK-29615, Flink updates the subtask metric stores when scaling 
> down parallelism. However, task metrics are added in the form of 
> "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name" 
> and are never removed, so users will find many redundant metrics through 
> JobVertexMetricsHandler, which is very troublesome.





[jira] [Commented] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism

2023-05-25 Thread Junrui Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726482#comment-17726482
 ] 

Junrui Li commented on FLINK-32199:
---

I've prepared a quick fix for it. Can you assign this ticket to me? 
[~wanglijie] :)

> MetricStore does not remove metrics of nonexistent parallelism in 
> TaskMetricStore when scale down job parallelism
> -
>
> Key: FLINK-32199
> URL: https://issues.apache.org/jira/browse/FLINK-32199
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.17.0
>Reporter: Junrui Li
>Priority: Major
> Fix For: 1.16.3, 1.17.2
>
>
> After FLINK-29615, Flink updates the subtask metric stores when scaling 
> down parallelism. However, task metrics are added in the form of 
> "subtaskIndex + metric.name" or "subtaskIndex + operatorName + metric.name" 
> and are never removed, so users will find many redundant metrics through 
> JobVertexMetricsHandler, which is very troublesome.





[jira] [Created] (FLINK-32199) MetricStore does not remove metrics of nonexistent parallelism in TaskMetricStore when scale down job parallelism

2023-05-25 Thread Junrui Li (Jira)
Junrui Li created FLINK-32199:
-

 Summary: MetricStore does not remove metrics of nonexistent 
parallelism in TaskMetricStore when scale down job parallelism
 Key: FLINK-32199
 URL: https://issues.apache.org/jira/browse/FLINK-32199
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.17.0
Reporter: Junrui Li
 Fix For: 1.16.3, 1.17.2


After FLINK-29615, Flink updates the subtask metric stores when scaling down 
parallelism. However, task metrics are added in the form of "subtaskIndex 
+ metric.name" or "subtaskIndex + operatorName + metric.name" and are never 
removed, so users will find many redundant metrics through 
JobVertexMetricsHandler, which is very troublesome.
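
A standalone sketch of the cleanup idea (the map below is a simplification of the real MetricStore internals): after rescaling down, drop the per-subtask entries whose index no longer exists.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class StaleSubtaskMetricCleanup {
    public static void main(String[] args) {
        // Simplified stand-in for a vertex's task metric store: subtaskIndex -> metrics.
        Map<Integer, Map<String, String>> metricsBySubtask = new HashMap<>();
        for (int i = 0; i < 4; i++) {
            metricsBySubtask.put(i, new HashMap<>());
        }

        int newParallelism = 2; // the vertex was rescaled from 4 down to 2

        // Remove the stores of subtask indexes that no longer exist (here: 2 and 3).
        metricsBySubtask.keySet().removeIf(subtaskIndex -> subtaskIndex >= newParallelism);

        System.out.println(metricsBySubtask.keySet()); // [0, 1]
    }
}
{code}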





[jira] [Commented] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726479#comment-17726479
 ] 

Sharon Xie commented on FLINK-32196:


[~tzulitai] Thanks for analyzing the logs. As additional context, this happened 
after a security patch on the broker side. Though most of the jobs 
auto-recovered, we've found a couple that got stuck in the recovery step. So 
there is a chance that this is caused by an issue on the broker side, e.g. some 
broker-side transaction state was lost, or a bad partition state. Any possible 
explanation here?

A couple other questions.

> These are the ones to abort in step 2. Initializing the transaction 
> automatically aborts the transaction, as I mentioned in earlier comments. So 
> I believe this is also expected.

In this case, does the transaction state 
[messages|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] in 
the kafka broker look right to you? It seems there is no change in those 
messages except the epoch and txnLastUpdateTimestamp. I guess the idea is to 
call transaction init with the old txnId and just let it time out. But is there 
some heartbeat that updates the transaction? Also, can you please explain a bit 
why abortTransaction is not used?

> What is NOT expected, though, is the bunch of kafka-producer-network-thread 
> threads being spawned per TID to abort in step 2.
Is it common to have so many lingering transactions that need to be aborted? 
The job is not a high-throughput one: about 3 records/sec with a checkpointing 
interval of 10 sec. It takes ~30 min to hit the OOM, and it seems weird that 
the kafka sink would need so long to recover.

> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv
>
>
> We are seeing an issue where a Flink job using the kafka sink under EO is 
> unable to recover from a checkpoint. The sink task is stuck in the 
> `INITIALIZING` state and eventually runs OOM. The cause of the OOM is a kafka 
> producer thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation in the `TransactionAborter` doesn't abort 
> those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically, `producer.abortTransaction()` is never called in that function; 
> instead it calls `producer.flush()`.
> Also, the function is in a for loop that only breaks when `producer.getEpoch() 
> == 0`, which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.
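
For reference, a minimal standalone sketch of the abort-on-init pattern discussed in the comments above (the broker address and transactional id are placeholders): re-initializing a producer with the same transactional.id bumps the epoch, fences the old producer, and aborts any transaction left open under that id, which is why no explicit abortTransaction() call is needed.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class AbortLingeringTransaction {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");    // placeholder
        props.put("transactional.id", "sink-prefix-0-3"); // a lingering TID to clean up
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // Bumps the epoch for this TID and aborts its open transaction, if any.
            producer.initTransactions();
        }
    }
}
{code}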





[GitHub] [flink] flinkbot commented on pull request #22663: [FLINK-32198][runtime] Enforce single maxExceptions query parameter

2023-05-25 Thread via GitHub


flinkbot commented on PR #22663:
URL: https://github.com/apache/flink/pull/22663#issuecomment-1563798860

   
   ## CI report:
   
   * 3d10cec1ba099543e8ae4ceef670caae9f145553 UNKNOWN
   
   
   Bot commands: the @flinkbot bot supports the following commands:
   
   - `@flinkbot run azure` re-run the last Azure build
   





[jira] [Updated] (FLINK-32198) Enforce single maxExceptions query parameter

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32198:
---
Labels: pull-request-available  (was: )

> Enforce single maxExceptions query parameter
> 
>
> Key: FLINK-32198
> URL: https://issues.apache.org/jira/browse/FLINK-32198
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Reporter: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>
> While working on FLINK-31894 I realized that `UpperLimitExceptionParameter` 
> allows multiple values to be collected as a comma-separated list even though 
> JobExceptionsHandler only uses the first one: 
> [https://github.com/apache/flink/blob/1293958652053c0d163fde28e8dfefb5ee8f6101/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L101-L104]
> A better approach would be to reject multiple `maxExceptions` params and let 
> the users know.
>  





[GitHub] [flink] pgaref opened a new pull request, #22663: [FLINK-32198][runtime] Enforce single maxExceptions query parameter

2023-05-25 Thread via GitHub


pgaref opened a new pull request, #22663:
URL: https://github.com/apache/flink/pull/22663

   https://issues.apache.org/jira/browse/FLINK-32198
   
   * throw ConversionException when multiple values are passed for 
maxExceptions parameter
   * added test to validate functionality





[jira] [Created] (FLINK-32198) Enforce single maxExceptions query parameter

2023-05-25 Thread Panagiotis Garefalakis (Jira)
Panagiotis Garefalakis created FLINK-32198:
--

 Summary: Enforce single maxExceptions query parameter
 Key: FLINK-32198
 URL: https://issues.apache.org/jira/browse/FLINK-32198
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Reporter: Panagiotis Garefalakis


While working on FLINK-31894 I realized that `UpperLimitExceptionParameter` 
allows multiple values to be collected as a comma-separated list even though 
JobExceptionsHandler only uses the first one: 
[https://github.com/apache/flink/blob/1293958652053c0d163fde28e8dfefb5ee8f6101/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L101-L104]

A better approach would be to reject multiple `maxExceptions` params and let the 
users know.
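
A standalone sketch of the proposed rule (the helper is illustrative; the real fix belongs in the REST query parameter handling):
{code:java}
import java.util.List;

public class SingleQueryParameterCheck {
    /** Accept exactly one value for 'maxExceptions' instead of silently using the first. */
    static int parseMaxExceptions(List<String> values) {
        if (values.size() != 1) {
            throw new IllegalArgumentException(
                    "Expected a single 'maxExceptions' query parameter, got: " + values);
        }
        return Integer.parseInt(values.get(0));
    }

    public static void main(String[] args) {
        System.out.println(parseMaxExceptions(List.of("10"))); // 10
        parseMaxExceptions(List.of("10", "20"));               // throws IllegalArgumentException
    }
}
{code}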

 





[GitHub] [flink-connector-jdbc] ewangsdc commented on pull request #3: [FLINK-15462][connectors] Add Trino dialect

2023-05-25 Thread via GitHub


ewangsdc commented on PR #3:
URL: 
https://github.com/apache/flink-connector-jdbc/pull/3#issuecomment-1563767213

   
Since lots of companies use Kerberos authentication to connect to Trino, 
we'd appreciate it very much if you could ensure that the Flink JDBC 
connector's Trino dialect supports Kerberos authentication, like the Trino 
Python client (https://github.com/trinodb/trino-python-client) and the 
Spark/PySpark JDBC Trino connection options 
(https://trino.io/docs/current/client/jdbc.html).
   





[jira] [Comment Edited] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version

2023-05-25 Thread Kurt Ostfeld (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320
 ] 

Kurt Ostfeld edited comment on FLINK-3154 at 5/26/23 3:18 AM:
--

[~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and 
preserves backward compatibility with existing savepoints and checkpoints: 
[https://github.com/apache/flink/pull/22660]

 

This keeps the Kryo v2 project dependency for backwards compatibility only and 
otherwise uses Kryo v5.x.

 

EDIT: I will fix the CI errors.


was (Author: JIRAUSER38):
[~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and 
preserves backward compatibility with existing savepoints and checkpoints: 
[https://github.com/apache/flink/pull/22660]

 

This keeps the Kryo v2 project dependency for backwards compatibility only and 
otherwise uses Kryo v5.x.

> Update Kryo version from 2.24.0 to latest Kryo LTS version
> --
>
> Key: FLINK-3154
> URL: https://issues.apache.org/jira/browse/FLINK-3154
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System
>Affects Versions: 1.0.0
>Reporter: Maximilian Michels
>Priority: Not a Priority
>  Labels: pull-request-available
>
> Flink's Kryo version is outdated and could be updated to a newer version, 
> e.g. kryo-3.0.3.
> From the ML: we cannot bump the Kryo version easily; the serialization format 
> changed (that's why they have a new major version), which would render all 
> Flink savepoints and checkpoints incompatible.





[jira] [Commented] (FLINK-32190) Bad link in Flink page

2023-05-25 Thread Lijie Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726456#comment-17726456
 ] 

Lijie Wang commented on FLINK-32190:


Thanks [~JunRuiLi], assigned to you :)

> Bad link in Flink page
> --
>
> Key: FLINK-32190
> URL: https://issues.apache.org/jira/browse/FLINK-32190
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Claude Warren
>Assignee: Junrui Li
>Priority: Minor
>
> on the page: [https://flink.apache.org/use-cases/]
> in the section: What are typical data analytics applications?
> The first link: [Quality monitoring of Telco 
> networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/]
> returns a 404 error: 
> h1. This site can’t be reached
> Check if there is a typo in 2016.flink-forward.org.
> DNS_PROBE_FINISHED_NXDOMAIN





[jira] [Assigned] (FLINK-32190) Bad link in Flink page

2023-05-25 Thread Lijie Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang reassigned FLINK-32190:
--

Assignee: Junrui Li

> Bad link in Flink page
> --
>
> Key: FLINK-32190
> URL: https://issues.apache.org/jira/browse/FLINK-32190
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Claude Warren
>Assignee: Junrui Li
>Priority: Minor
>
> on the page: [https://flink.apache.org/use-cases/]
> in the section: What are typical data analytics applications?
> The first link: [Quality monitoring of Telco 
> networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/]
> returns a 404 error: 
> h1. This site can’t be reached
> Check if there is a typo in 2016.flink-forward.org.
> DNS_PROBE_FINISHED_NXDOMAIN





[GitHub] [flink] pgaref commented on pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers

2023-05-25 Thread via GitHub


pgaref commented on PR #22516:
URL: https://github.com/apache/flink/pull/22516#issuecomment-1563756212

   > LGTM overall, minus a few cosmetic comments. I think the commit message is 
also misleading
   > 
   > ```
   > Fail fast on bad FailureEnricher JM startup as given by the user
   > ```
   > 
   > since we still have `filterInvalidEnrichers` in the codepath
   
   Awesome! Thanks @dmvk ! Updated





[jira] [Commented] (FLINK-32190) Bad link in Flink page

2023-05-25 Thread Junrui Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726454#comment-17726454
 ] 

Junrui Li commented on FLINK-32190:
---

It seems that the 404 error was caused by an expired external link. Maybe we 
can replace it with another link or remove this use case. [~wanglijie] What do 
you think? I'd like to take this ticket. :)

> Bad link in Flink page
> --
>
> Key: FLINK-32190
> URL: https://issues.apache.org/jira/browse/FLINK-32190
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Claude Warren
>Priority: Minor
>
> on the page: [https://flink.apache.org/use-cases/]
> in the section: What are typical data analytics applications?
> The first link: [Quality monitoring of Telco 
> networks|http://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/]
> returns a 404 error: 
> h1. This site can’t be reached
> Check if there is a typo in 2016.flink-forward.org.
> DNS_PROBE_FINISHED_NXDOMAIN





[jira] [Comment Edited] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453
 ] 

Yuxin Tan edited comment on FLINK-32194 at 5/26/23 3:02 AM:


[~Sergey Nuyanzin] Could you help review this change? It is similar to 
FLINK-32187.


was (Author: tanyuxin):
[~snuyanzin] Could you help review this change? It is also like 
[FLINK-32187|https://issues.apache.org/jira/browse/FLINK-32187].

> Elasticsearch connector should remove the dependency on flink-shaded
> 
>
> Key: FLINK-32194
> URL: https://issues.apache.org/jira/browse/FLINK-32194
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / ElasticSearch
>Affects Versions: elasticsearch-4.0.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> The Elasticsearch connector depends on flink-shaded. With the externalization 
> of the connectors, connectors shouldn't rely on flink-shaded.





[jira] [Comment Edited] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453
 ] 

Yuxin Tan edited comment on FLINK-32194 at 5/26/23 2:58 AM:


[~snuyanzin] Could you help review this change? It is also like 
[FLINK-32187|https://issues.apache.org/jira/browse/FLINK-32187].


was (Author: tanyuxin):
[~snuyanzin] Could you help review this change? It is also like 
[#https://issues.apache.org/jira/browse/FLINK-32187].

> Elasticsearch connector should remove the dependency on flink-shaded
> 
>
> Key: FLINK-32194
> URL: https://issues.apache.org/jira/browse/FLINK-32194
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / ElasticSearch
>Affects Versions: elasticsearch-4.0.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> The Elasticsearch connector depends on flink-shaded. With the externalization 
> of the connectors, connectors shouldn't rely on flink-shaded.





[jira] [Commented] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726453#comment-17726453
 ] 

Yuxin Tan commented on FLINK-32194:
---

[~snuyanzin] Could you help review this change? It is also like 
[#https://issues.apache.org/jira/browse/FLINK-32187].

> Elasticsearch connector should remove the dependency on flink-shaded
> 
>
> Key: FLINK-32194
> URL: https://issues.apache.org/jira/browse/FLINK-32194
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / ElasticSearch
>Affects Versions: elasticsearch-4.0.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> The Elasticsearch connector depends on flink-shaded. With the externalization 
> of the connector, the connectors shouldn't rely on Flink-Shaded



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] pgaref commented on a diff in pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers

2023-05-25 Thread via GitHub


pgaref commented on code in PR #22516:
URL: https://github.com/apache/flink/pull/22516#discussion_r1206189321


##
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java:
##
@@ -57,6 +59,7 @@ public class DefaultJobMasterServiceFactory implements 
JobMasterServiceFactory {
 private final FatalErrorHandler fatalErrorHandler;
 private final ClassLoader userCodeClassloader;
 private final ShuffleMaster<?> shuffleMaster;
+private final Collection<FailureEnricher> failureEnrichers;

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22516: [FLINK-31993][runtime] Initialize and pass down FailureEnrichers

2023-05-25 Thread via GitHub


pgaref commented on code in PR #22516:
URL: https://github.com/apache/flink/pull/22516#discussion_r1206188285


##
flink-runtime/src/main/java/org/apache/flink/runtime/minicluster/MiniCluster.java:
##
@@ -197,6 +198,9 @@ public class MiniCluster implements AutoCloseableAsync {
 @GuardedBy("lock")
 private HeartbeatServices heartbeatServices;
 
+@GuardedBy("lock")
+private Collection<FailureEnricher> failureEnrichers;

Review Comment:
   We might need it for e2e tests, but removing it for the time being



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype

2023-05-25 Thread Benchao Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726448#comment-17726448
 ] 

Benchao Li commented on FLINK-32008:


[~rskraba] Thanks for the digging. I agree that the FileSystem connector's 
fallback (de)serializer currently does not fit all formats, and this should be 
one of them. It might be worth implementing a {{BulkReaderFormatFactory}} for 
the protobuf format, as sketched below.
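
For anyone picking this up, here is a rough skeleton of what such a factory 
could look like. Only {{BulkReaderFormatFactory}} and its 
{{createDecodingFormat}} signature come from flink-connector-files; the class 
name and the decoding logic are hypothetical placeholders, not an actual 
implementation.
{code:java}
import java.util.Collections;
import java.util.Set;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.connector.file.table.factories.BulkReaderFormatFactory;
import org.apache.flink.connector.file.table.format.BulkDecodingFormat;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.factories.DynamicTableFactory;

/** Hypothetical skeleton of a protobuf bulk reader factory (sketch only). */
public class ProtobufBulkReaderFormatFactory implements BulkReaderFormatFactory {

    @Override
    public String factoryIdentifier() {
        // Would need to align with the existing "protobuf" format identifier.
        return "protobuf";
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        return Collections.emptySet();
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        return Collections.emptySet();
    }

    @Override
    public BulkDecodingFormat<RowData> createDecodingFormat(
            DynamicTableFactory.Context context, ReadableConfig formatOptions) {
        // The real work: return a BulkDecodingFormat whose BulkFormat decodes
        // protobuf-encoded files directly, instead of going through the
        // line-oriented DeserializationSchemaAdapter fallback.
        throw new UnsupportedOperationException("Sketch only; not implemented.");
    }
}
{code}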

> Protobuf format throws exception with Map datatype
> --
>
> Key: FLINK-32008
> URL: https://issues.apache.org/jira/browse/FLINK-32008
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.17.0
>Reporter: Xuannan Su
>Priority: Major
> Attachments: flink-protobuf-example.zip
>
>
> The protobuf format throws an exception when working with the Map data type. I 
> uploaded an example project to reproduce the problem.
>  
> {code:java}
> Caused by: java.lang.RuntimeException: One or more fetchers have encountered 
> exception
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:261)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:169)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:131)
>     at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417)
>     at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>     at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
>     at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>     at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received 
> unexpected exception while polling the records
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:114)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     ... 1 more
> Caused by: java.io.IOException: Failed to deserialize PB object.
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:75)
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:42)
>     at 
> org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.readRecord(DeserializationSchemaAdapter.java:197)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.nextRecord(DeserializationSchemaAdapter.java:210)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$Reader.readBatch(DeserializationSchemaAdapter.java:124)
>     at 
> org.apache.flink.connector.file.src.util.RecordMapperWrapperRecordIterator$1.readBatch(RecordMapperWrapperRecordIterator.java:82)
>     at 
> org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:67)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:162)
>     ... 6 more
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     a

[jira] [Commented] (FLINK-32132) Cast function CODEGEN does not work as expected when nullOnFailure enabled

2023-05-25 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726445#comment-17726445
 ] 

xiaogang zhou commented on FLINK-32132:
---

[~luoyuxia] Can you please help review

> Cast function CODEGEN does not work as expected when nullOnFailure enabled
> --
>
> Key: FLINK-32132
> URL: https://issues.apache.org/jira/browse/FLINK-32132
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Planner
>Affects Versions: 1.16.1
>Reporter: xiaogang zhou
>Assignee: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> I am trying to generate code casting a string to bigint, and got generated code 
> like:
>  
>  
> {code:java}
> // code placeholder
> if (!isNull$14) {
> result$15 = 
> org.apache.flink.table.data.binary.BinaryStringDataUtil.toLong(field$13.trim());
> } else {
> result$15 = -1L;
> }
>castRuleResult$16 = result$15;
>castRuleResultIsNull$17 = isNull$14;
>  } catch (java.lang.Throwable e) {
>castRuleResult$16 = -1L;
>castRuleResultIsNull$17 = true;
>  }
>  // --- End cast section
> out.setLong(0, castRuleResult$16); {code}
> Such handling does not provide a perfect solution, as the default value of 
> long is set to -1L, which can be meaningful in some cases and can cause 
> calculation errors.
>  
> I understand that the cast returns a bigint not null, but since there is an 
> exception path, we should ignore the type restriction, so I suggest modifying 
> CodeGenUtils.rowSetField as below:
>  
> {code:java}
> // code placeholder
> if (fieldType.isNullable || 
> fieldExpr.nullTerm.startsWith("castRuleResultIsNull")) {
>   s"""
>  |${fieldExpr.code}
>  |if (${fieldExpr.nullTerm}) {
>  |  $setNullField;
>  |} else {
>  |  $writeField;
>  |}
> """.stripMargin
> } else {
>   s"""
>  |${fieldExpr.code}
>  |$writeField;
>""".stripMargin
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-31956) Extend the CompiledPlan to read from/write to Flink's FileSystem

2023-05-25 Thread lincoln lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lincoln lee reassigned FLINK-31956:
---

Assignee: Shuai Xu

> Extend the CompiledPlan to read from/write to Flink's FileSystem
> 
>
> Key: FLINK-31956
> URL: https://issues.apache.org/jira/browse/FLINK-31956
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Client, Table SQL / Planner
>Affects Versions: 1.18.0
>Reporter: Jane Chan
>Assignee: Shuai Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> At present, COMPILE/EXECUTE PLAN FOR '${plan.json}' only supports writing 
> to/reading from a local file without the scheme. We propose to extend the 
> support for Flink's FileSystem.
> {code:sql}
> -- before
> COMPILE PLAN FOR '/tmp/foo/bar.json' 
> EXECUTE PLAN FOR '/tmp/foo/bar.json' 
> -- after
> COMPILE PLAN FOR 'file:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 'hdfs:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 's3:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 'oss:///tmp/foo/bar.json'  
> EXECUTE PLAN FOR 'file:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 'hdfs:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 's3:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 'oss:///tmp/foo/bar.json' {code}
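
A minimal sketch of what this implies on the Java side, using Flink's 
{{FileSystem}} abstraction so the URI scheme picks the right implementation. The 
helper class itself is hypothetical; only the flink-core classes ({{Path}}, 
{{FileSystem}}, {{FSDataOutputStream}}) are real:
{code:java}
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

/** Hypothetical helper writing plan JSON via Flink's FileSystem abstraction. */
public class CompiledPlanFileUtil {

    public static void writePlan(String planJson, String uri) throws Exception {
        Path path = new Path(uri);
        // Resolves the FileSystem from the URI scheme:
        // file://, hdfs://, s3://, oss://, ...
        FileSystem fs = path.getFileSystem();
        try (FSDataOutputStream out =
                        fs.create(path, FileSystem.WriteMode.NO_OVERWRITE);
                OutputStreamWriter writer =
                        new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(planJson);
        }
    }
}
{code}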



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31956) Extend the CompiledPlan to read from/write to Flink's FileSystem

2023-05-25 Thread lincoln lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726444#comment-17726444
 ] 

lincoln lee commented on FLINK-31956:
-

[~xu_shuai_] assigned to you :)

> Extend the CompiledPlan to read from/write to Flink's FileSystem
> 
>
> Key: FLINK-31956
> URL: https://issues.apache.org/jira/browse/FLINK-31956
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Client, Table SQL / Planner
>Affects Versions: 1.18.0
>Reporter: Jane Chan
>Assignee: Shuai Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> At present, COMPILE/EXECUTE PLAN FOR '${plan.json}' only supports writing 
> to/reading from a local file without the scheme. We propose to extend the 
> support for Flink's FileSystem.
> {code:sql}
> -- before
> COMPILE PLAN FOR '/tmp/foo/bar.json' 
> EXECUTE PLAN FOR '/tmp/foo/bar.json' 
> -- after
> COMPILE PLAN FOR 'file:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 'hdfs:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 's3:///tmp/foo/bar.json' 
> COMPILE PLAN FOR 'oss:///tmp/foo/bar.json'  
> EXECUTE PLAN FOR 'file:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 'hdfs:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 's3:///tmp/foo/bar.json'
> EXECUTE PLAN FOR 'oss:///tmp/foo/bar.json' {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726426#comment-17726426
 ] 

Tzu-Li (Gordon) Tai edited comment on FLINK-32196 at 5/26/23 12:06 AM:
---

[~sharonxr55] a few things to clarify first:
 # When a KafkaSink subtask restores, there are some transactions that need to 
be committed (i.e. the ones that are recorded in the Flink checkpoint), and
 # All other transactions are considered "lingering" and should be aborted 
(which is done by the loop you referenced).
 # Only after the above two steps complete is the subtask initialization 
considered complete.

So:

> A healthy transaction goes from INITIALIZING to READY- to 
> COMMITTING_TRANSACTION to READY in the log

I believe these transactions are the ones from step 1, which is expected.

> I've found that the transaction thread never progress beyond “Transition from 
> state INITIALIZING to READY”

These are the ones to be aborted in step 2. Initializing a transaction 
automatically aborts it, as I mentioned in earlier comments, so I believe this 
is also expected.

 

What is NOT expected, though, is the number of {{kafka-producer-network-thread}} 
threads being spawned, one per TID to abort, in step 2. Thanks for sharing the 
logs, btw; they were helpful in figuring out what was going on!

Kafka's producer only spawns a single {{kafka-producer-network-thread}} per 
instance, and the abort loop for lingering transactions always tries to reuse 
the same producer instance without creating new ones, so I would expect to see 
only a single {{kafka-producer-network-thread}} throughout the whole loop. This 
doesn't seem to be the case.

From the naming of the threads, it seems a new {{kafka-producer-network-thread}} 
is spawned for every TID that the KafkaSink tries to abort; note the last 
portion of each thread name below, which is strictly incrementing (those are 
the TIDs of the transactions the KafkaSink is trying to abort):
{code:java}
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708579"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708580"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708581"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708582"
{code}
The only way I can see this happening is if the loop is creating a new producer 
instance per attempted TID, but that doesn't make sense given the code. It 
could be something wrong with how the KafkaSink uses Java reflection to reset 
the TID on the reused producer, but I'll need to spend some time looking into 
this a bit deeper.


was (Author: tzulitai):
[~sharonxr55] a few things to clarify first:
 # When a KafkaSink subtask restores, there are some transaction that needs to 
be committed (i.e. ones that are written in the Flink checkpoint), and
 # All other transactions are considered "lingering" which should be aborted 
(which is done by the loop you referenced).
 # Only after the above 2 step completes, the subtask initialization is 
considered complete.

So:

> A healthy transaction goes from INITIALIZING to READY- to 
> COMMITTING_TRANSACTION to READY in the log

I believe these transactions are the ones from step 1. Which is expected.

> I've found that the transaction thread never progress beyond “Transition from 
> state INITIALIZING to READY”

These are the ones to abort in step 2. Initializing the transaction 
automatically aborts the transaction, as I mentioned in earlier comments. So I 
believe this is also expected.

 

What is NOT expected, though, is the bunch of {{kafka-producer-network-thread}} 
threads being spawned per TID to abort in step 2. Thanks for sharing the logs 
btw, it was helpful figuring out what was going on!


Kafka's producer only spawns a single {{kafka-producer-network-thread}} per 
instance. And the abort loop for lingering transactions always tries to reuse 
the same producer instance without creating new ones, so I would expect to only 
see a single {{kafka-producer-network-thread}} throughout the whole loop. This 
doesn't seem to be the case. From the naming of these threads, it seems like 
for every TID that the KafkaSink is trying to abort, a new 
{{kafka-producer-network-thread}} thread is spawned:

This is hinted by the naming of the threads (see the last portion of the thread 
name, where it's strictly incrementing; that's the TIDs of transactions the 
KafkaSink is trying to abort)
{code}
“kafka-producer-network-thread | 
producer-tx-account-7a604e01-CONNECTION-75373861-0-1708579"
“kafka-producer-network-thread | 
producer-tx-account-7a604e01-CONNECTION-75373861-0-1708580”
“kafka-producer-network-thread | 
producer-tx-account-7a604e01-CONNECTION-75373861-0-1708581"
“kafka-producer-network-

[jira] [Commented] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726426#comment-17726426
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-32196:
-

[~sharonxr55] a few things to clarify first:
 # When a KafkaSink subtask restores, there are some transactions that need to 
be committed (i.e. the ones that are recorded in the Flink checkpoint), and
 # All other transactions are considered "lingering" and should be aborted 
(which is done by the loop you referenced).
 # Only after the above two steps complete is the subtask initialization 
considered complete.

So:

> A healthy transaction goes from INITIALIZING to READY- to 
> COMMITTING_TRANSACTION to READY in the log

I believe these transactions are the ones from step 1, which is expected.

> I've found that the transaction thread never progress beyond “Transition from 
> state INITIALIZING to READY”

These are the ones to be aborted in step 2. Initializing a transaction 
automatically aborts it, as I mentioned in earlier comments, so I believe this 
is also expected.

 

What is NOT expected, though, is the number of {{kafka-producer-network-thread}} 
threads being spawned, one per TID to abort, in step 2. Thanks for sharing the 
logs, btw; they were helpful in figuring out what was going on!


Kafka's producer only spawns a single {{kafka-producer-network-thread}} per 
instance, and the abort loop for lingering transactions always tries to reuse 
the same producer instance without creating new ones, so I would expect to see 
only a single {{kafka-producer-network-thread}} throughout the whole loop. This 
doesn't seem to be the case.

From the naming of the threads, it seems a new {{kafka-producer-network-thread}} 
is spawned for every TID that the KafkaSink tries to abort; note the last 
portion of each thread name below, which is strictly incrementing (those are 
the TIDs of the transactions the KafkaSink is trying to abort):
{code}
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708579"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708580"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708581"
"kafka-producer-network-thread | producer-tx-account-7a604e01-CONNECTION-75373861-0-1708582"
{code}

The only way I can see this happening is if the loop is creating a new producer 
instance per attempted TID, but that doesn't make sense given the code. It 
could be something odd with how the KafkaSink uses Java reflection to reset 
the TID on the reused producer, but I'll need to spend some time looking into 
this a bit deeper.
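
To make the suspected failure mode concrete, here is a minimal, self-contained 
sketch using the plain Kafka client. This is not the Flink code; the bootstrap 
address and TID naming are made up to mirror the logs. Every {{KafkaProducer}} 
instance owns exactly one {{kafka-producer-network-thread}}, so constructing a 
fresh producer per TID without closing it yields precisely the thread growth 
seen above:
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ProducerThreadLeakSketch {

    public static void main(String[] args) {
        for (int tid = 0; tid < 1_000; tid++) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Illustrative TID naming, mimicking the incrementing suffix in the logs.
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "producer-tx-demo-0-" + tid);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);

            // Each constructor call spawns a dedicated kafka-producer-network-thread.
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
            // initTransactions() aborts any lingering transaction under this TID.
            producer.initTransactions();
            // Without producer.close() here, the network thread stays alive and
            // the thread count grows by one per TID; with a single reused
            // producer, only one such thread would exist for the whole scan.
        }
    }
}
{code}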

> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy

2023-05-25 Thread via GitHub


tweise commented on code in PR #607:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1206092693


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java:
##
@@ -82,7 +82,7 @@ public static void computeDataRateMetrics(
 scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, 
trueProcessingRate);
 scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, 
numRecordsInPerSecond);
 } else {
-LOG.error("Cannot compute true processing rate without 
numRecordsInPerSecond");
+LOG.warn("Cannot compute true processing rate without 
numRecordsInPerSecond");

Review Comment:
It not being actionable is also why I'm not sure about logging at warn level. 
Who is going to read these logs? Will the logs be flooded with repeating 
entries that are unlikely to be addressed? I thought a message in the status 
(or a one-time event) might be more useful; after all, the application would 
need to be modified to remedy the condition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-32197) FLIP 246: Multi Cluster Kafka Source

2023-05-25 Thread Mason Chen (Jira)
Mason Chen created FLINK-32197:
--

 Summary: FLIP 246: Multi Cluster Kafka Source
 Key: FLINK-32197
 URL: https://issues.apache.org/jira/browse/FLINK-32197
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / Kafka
Reporter: Mason Chen


This introduces a new connector that builds on the current KafkaSource to read 
from multiple Kafka clusters, which can change dynamically.

For more details, please refer to [FLIP 
246|https://cwiki.apache.org/confluence/display/FLINK/FLIP-246%3A+Multi+Cluster+Kafka+Source].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405
 ] 

Sharon Xie edited comment on FLINK-32196 at 5/25/23 10:04 PM:
--

Thank you [~tzulitai] for the quick response and information.
{quote}Are you actually observing that there are lingering transactions not 
being aborted in Kafka? Or was that a speculation based on not seeing a 
abortTransaction() in the code?
{quote}
This is a speculation. So this may not be the root cause of the issue I'm 
seeing.
{quote}If there are actually lingering transactions in Kafka after restore, do 
they get timeout by Kafka after transaction.timeout.ms? Or are they lingering 
beyond the timeout threshold?
{quote}
What I've observed is that the subtask gets stuck in the initializing state, 
there is a growing number of kafka-producer-network-thread threads, and the job 
eventually runs OOM; in [^kafka_sink_oom_logs.csv] you can see lots of 
producers get closed in the end. In the debug log, I've found that the 
transaction thread never progresses beyond “Transition from state INITIALIZING 
to READY” and eventually times out. An example thread log is 
[^kafka_producer_network_thread_log.csv]. A healthy transaction goes from 
INITIALIZING to READY to COMMITTING_TRANSACTION to READY in the log and the 
thread doesn't exit; see [^healthy_kafka_producer_thread.csv] for an example.

I've also queried Kafka's {{__transaction_state}} topic for the problematic 
transaction; 
[here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the 
messages in the topic.

I'd appreciate any pointers or potential ways to explain the situation. 



was (Author: sharonxr55):
Thank you [~tzulitai] for the quick response and information.
{quote}Are you actually observing that there are lingering transactions not 
being aborted in Kafka? Or was that a speculation based on not seeing a 
abortTransaction() in the code?
{quote}
This is a speculation. So this may not be the root cause of the issue I'm 
seeing.
{quote}If there are actually lingering transactions in Kafka after restore, do 
they get timeout by Kafka after transaction.timeout.ms? Or are they lingering 
beyond the timeout threshold?
{quote}
What I've observed is that the subtask gets stuck in the initializing state and 
there is a growing number of kafka-producer-network-thread and the job 
eventually runs OOM -  [^kafka_sink_oom_logs.csv] . In the debug log, I've 
found that the transaction thread never progress beyond “Transition from state 
INITIALIZING to READY” and eventually times out. An example thread log is 
[^kafka_producer_network_thread_log.csv] . A healthy transaction goes from 
INITIALIZING to READY- to COMMITTING_TRANSACTION to READY in the log and the 
thread doesn't exit - example [^healthy_kafka_producer_thread.csv].

I've also queried the kafka's _transaction_state topic for the problematic 
transaction and 
[here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the 
messages in the topic.  

I'd appreciate any pointers or potential ways to explain the situation. 


> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405
 ] 

Sharon Xie edited comment on FLINK-32196 at 5/25/23 10:04 PM:
--

Thank you [~tzulitai] for the quick response and information.
{quote}Are you actually observing that there are lingering transactions not 
being aborted in Kafka? Or was that a speculation based on not seeing a 
abortTransaction() in the code?
{quote}
This is a speculation. So this may not be the root cause of the issue I'm 
seeing.
{quote}If there are actually lingering transactions in Kafka after restore, do 
they get timeout by Kafka after transaction.timeout.ms? Or are they lingering 
beyond the timeout threshold?
{quote}
What I've observed is that the subtask gets stuck in the initializing state, 
there is a growing number of kafka-producer-network-thread threads, and the job 
eventually runs OOM ([^kafka_sink_oom_logs.csv]). In the debug log, I've found 
that the transaction thread never progresses beyond “Transition from state 
INITIALIZING to READY” and eventually times out. An example thread log is 
[^kafka_producer_network_thread_log.csv]. A healthy transaction goes from 
INITIALIZING to READY to COMMITTING_TRANSACTION to READY in the log and the 
thread doesn't exit; see [^healthy_kafka_producer_thread.csv] for an example.

I've also queried Kafka's {{__transaction_state}} topic for the problematic 
transaction; 
[here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the 
messages in the topic.

I'd appreciate any pointers or potential ways to explain the situation. 



was (Author: sharonxr55):
Thank you [~tzulitai] for the quick response and information.
{quote}Are you actually observing that there are lingering transactions not 
being aborted in Kafka? Or was that a speculation based on not seeing a 
abortTransaction() in the code?
{quote}
This is a speculation. So this may not be the root cause of the issue I'm 
seeing.
{quote}If there are actually lingering transactions in Kafka after restore, do 
they get timeout by Kafka after transaction.timeout.ms? Or are they lingering 
beyond the timeout threshold?
{quote}
What I've observed is that the subtask gets stuck in the initializing state and 
there is a growing number of kafka-producer-network-thread and the job 
eventually runs OOM. In the debug log, I've found that the transaction thread 
never progress beyond “Transition from state INITIALIZING to READY” and 
eventually times out. An example thread log is 
[^kafka_producer_network_thread_log.csv] . A healthy transaction goes from 
INITIALIZING to READY- to COMMITTING_TRANSACTION to READY in the log and the 
thread doesn't exit - example [^healthy_kafka_producer_thread.csv].

I've also queried the kafka's _transaction_state topic for the problematic 
transaction and 
[here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the 
messages in the topic.  

I'd appreciate any pointers or potential ways to explain the situation. 


> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Attachment: kafka_sink_oom_logs.csv

> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv, kafka_sink_oom_logs.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Summary: kafka sink under EO sometimes is unable to recover from a 
checkpoint  (was: KafkaWriter recovery doesn't abort lingering transactions 
under the EO semantic)

> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) kafka sink under EO sometimes is unable to recover from a checkpoint

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Description: 
We are seeing an issue where a Flink job using the Kafka sink under EO is unable 
to recover from a checkpoint. The sink task is stuck in the `INITIALIZING` state 
and eventually runs OOM. The cause of the OOM is a Kafka producer thread leak.

Here is our best *hypothesis* for the issue.
Under the EO semantic, `KafkaWriter` intends to abort lingering transactions 
upon recovery: 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]

However, the actual implementation in `TransactionAborter` doesn't abort those 
transactions: 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]

Specifically, `producer.abortTransaction()` is never called in that function; 
instead it calls `producer.flush()`.

Also, the function is in a for loop that only breaks when `producer.getEpoch() 
== 0`, which is why we are seeing a producer thread leak as the recovery gets 
stuck in this loop.

  was:
We are seeing an issue where a Flink job using kafka sink under EO is unable to 
recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
eventually runs OOM. The cause for OOM is that there is a kafka producer thread 
leak.

Here is our best hypothesis for the issue.
In `KafkaWriter` under the EO semantic, it intends to abort lingering 
transactions upon recovery 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]

However, the actual implementation to abort those transactions in the 
`TransactionAborter` doesn't abort those transactions 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]

Specifically `producer.abortTransaction()` is never called in that function. 
Instead it calls `producer.flush()`.

Also The function is in for loop that only breaks when `producer.getEpoch() == 
0` which is why we are seeing a producer thread leak as the recovery gets stuck 
in this for loop.


> kafka sink under EO sometimes is unable to recover from a checkpoint
> 
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best *hypothesis* for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Sharon Xie (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726405#comment-17726405
 ] 

Sharon Xie commented on FLINK-32196:


Thank you [~tzulitai] for the quick response and information.
{quote}Are you actually observing that there are lingering transactions not 
being aborted in Kafka? Or was that a speculation based on not seeing a 
abortTransaction() in the code?
{quote}
This is a speculation. So this may not be the root cause of the issue I'm 
seeing.
{quote}If there are actually lingering transactions in Kafka after restore, do 
they get timeout by Kafka after transaction.timeout.ms? Or are they lingering 
beyond the timeout threshold?
{quote}
What I've observed is that the subtask gets stuck in the initializing state, 
there is a growing number of kafka-producer-network-thread threads, and the job 
eventually runs OOM. In the debug log, I've found that the transaction thread 
never progresses beyond “Transition from state INITIALIZING to READY” and 
eventually times out. An example thread log is 
[^kafka_producer_network_thread_log.csv]. A healthy transaction goes from 
INITIALIZING to READY to COMMITTING_TRANSACTION to READY in the log and the 
thread doesn't exit; see [^healthy_kafka_producer_thread.csv] for an example.

I've also queried Kafka's {{__transaction_state}} topic for the problematic 
transaction; 
[here|https://gist.github.com/sharonx/51e300e383455f016be1a95f0c855b97] are the 
messages in the topic.

I'd appreciate any pointers or potential ways to explain the situation. 


> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Attachment: healthy_kafka_producer_thread.csv

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: healthy_kafka_producer_thread.csv, 
> kafka_producer_network_thread_log.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Attachment: kafka_producer_network_thread_log.csv

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
> Attachments: kafka_producer_network_thread_log.csv
>
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726390#comment-17726390
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-32196:
-

In terms of the lingering transactions you are observing, a few questions:
 # Are you actually observing that there are lingering transactions not being 
aborted in Kafka? Or was that a speculation based on not seeing an 
{{abortTransaction()}} in the code?
 # If there are actually lingering transactions in Kafka after restore, do they 
get timed out by Kafka after {{transaction.timeout.ms}}? Or are they lingering 
beyond the timeout threshold?

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726385#comment-17726385
 ] 

Tzu-Li (Gordon) Tai edited comment on FLINK-32196 at 5/25/23 8:56 PM:
--

Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts 
transactions by relying on the fact that when you call {{initTransactions()}}, 
Kafka automatically aborts any old ongoing transactions under the same 
{{transactional.id}}.

Could you elaborate on the producer leak? As far as I can tell, the loop is 
reusing the same producer instance; on every loop entry, the same producer 
instance is reset with a new {{transactional.id}} and {{initTransactions()}} is 
called to abort the transaction. So there doesn't seem to be an issue with 
runaway producer instances, unless I'm misunderstanding something here.


was (Author: tzulitai):
Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts 
transactions by relying on the fact that when you call `initTransactions()`, 
Kafka automatically aborts any old ongoing transactions under the same 
{{{}transactional.id{}}}.

Could you re-elaborate the producer leak? As far as I can tell, the loop is 
reusing the same producer instance; on every loop entry, the same producer 
instance is reset with a new {{transactional.id}} and called 
{{initTransactions()}} to abort the transaction.

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
>
> We are seeing an issue where a Flink job using kafka sink under EO is unable 
> to recover from a checkpoint. The sink task stuck at `INITIALIZING` state and 
> eventually runs OOM. The cause for OOM is that there is a kafka producer 
> thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation to abort those transactions in the 
> `TransactionAborter` doesn't abort those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically `producer.abortTransaction()` is never called in that function. 
> Instead it calls `producer.flush()`.
> Also The function is in for loop that only breaks when `producer.getEpoch() 
> == 0` which is why we are seeing a producer thread leak as the recovery gets 
> stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] pgaref commented on pull request #22587: [FLINK-31892][runtime] Introduce AdaptiveScheduler global failure enrichment/labeling

2023-05-25 Thread via GitHub


pgaref commented on PR #22587:
URL: https://github.com/apache/flink/pull/22587#issuecomment-1563492807

   > LGTM minus one code path in StopWithSavepoint 👍
   
   Thanks for the review @dmvk ! Updated :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726385#comment-17726385
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-32196:
-

Hi [~sharonxr55], the {{abortTransactionOfSubtask}} you posted aborts 
transactions by relying on the fact that when you call {{initTransactions()}}, 
Kafka automatically aborts any old ongoing transactions under the same 
{{transactional.id}}.

Could you elaborate on the producer leak? As far as I can tell, the loop is 
reusing the same producer instance; on every loop entry, the same producer 
instance is reset with a new {{transactional.id}} and {{initTransactions()}} is 
called to abort the transaction.
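
For reference, a minimal standalone illustration of this "abort by 
re-initialization" behavior with the plain Kafka client (the broker address and 
TID are placeholders; the Flink code achieves the same effect while reusing one 
producer instance):
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class AbortByInitSketch {

    static void abortLingering(String transactionalId) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, transactionalId);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // initTransactions() bumps the epoch for this transactional.id; the
            // broker fences the previous producer incarnation and aborts any
            // transaction it left open.
            producer.initTransactions();
        }
    }
}
{code}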

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
>
> We are seeing an issue where a Flink job using the Kafka sink under EO is 
> unable to recover from a checkpoint. The sink task is stuck in the 
> `INITIALIZING` state and eventually runs OOM. The cause of the OOM is a 
> Kafka producer thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation in the `TransactionAborter` doesn't abort 
> those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically, `producer.abortTransaction()` is never called in that function; 
> instead it calls `producer.flush()`.
> Also, the function is in a for loop that only breaks when 
> `producer.getEpoch() == 0`, which is why we are seeing a producer thread 
> leak as the recovery gets stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy

2023-05-25 Thread via GitHub


gyfora commented on code in PR #607:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205922751


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java:
##
@@ -82,7 +82,7 @@ public static void computeDataRateMetrics(
             scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, trueProcessingRate);
             scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond);
         } else {
-            LOG.error("Cannot compute true processing rate without numRecordsInPerSecond");
+            LOG.warn("Cannot compute true processing rate without numRecordsInPerSecond");

Review Comment:
   We could trigger an event, but putting this into the status may be overkill 
because this is not really actionable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy

2023-05-25 Thread via GitHub


gyfora commented on code in PR #607:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205921907


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobVertexScaler.java:
##
@@ -221,7 +221,7 @@ private boolean detectIneffectiveScaleUp(
                 message);
 
         if (conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_DETECTION_ENABLED)) {
-            LOG.info(
+            LOG.warn(

Review Comment:
   There is already an event triggered for this; that should be enough for most 
users. The log just outputs some additional info.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy

2023-05-25 Thread via GitHub


tweise commented on code in PR #607:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205905605


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java:
##
@@ -82,7 +82,7 @@ public static void computeDataRateMetrics(
             scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, trueProcessingRate);
             scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond);
         } else {
-            LOG.error("Cannot compute true processing rate without numRecordsInPerSecond");
+            LOG.warn("Cannot compute true processing rate without numRecordsInPerSecond");

Review Comment:
   Same as above: logging is probably not the best way to bring this to the 
intended audience, and warn should probably be mostly reserved for 
operator-internal problems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-32148) Reduce info logging noise in the operator

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32148:
---
Labels: pull-request-available  (was: )

> Reduce info logging noise in the operator
> -
>
> Key: FLINK-32148
> URL: https://issues.apache.org/jira/browse/FLINK-32148
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.5.0
>Reporter: Gyula Fora
>Assignee: Gyula Fora
>Priority: Major
>  Labels: pull-request-available
>
> The operator controller/reconciler/observer logic currently logs a lot of 
> information on INFO level even when "nothing" happens. This should be 
> improved by reducing most of these logs to DEBUG.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-kubernetes-operator] tweise commented on a diff in pull request #607: [FLINK-32148] Make operator logging less noisy

2023-05-25 Thread via GitHub


tweise commented on code in PR #607:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/607#discussion_r1205903992


##
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobVertexScaler.java:
##
@@ -221,7 +221,7 @@ private boolean detectIneffectiveScaleUp(
                 message);
 
         if (conf.get(AutoScalerOptions.SCALING_EFFECTIVENESS_DETECTION_ENABLED)) {
-            LOG.info(
+            LOG.warn(

Review Comment:
   Why should this be warn? This looks more like something that should be 
reflected in the status?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under EO semantic

2023-05-25 Thread Sharon Xie (Jira)
Sharon Xie created FLINK-32196:
--

 Summary: KafkaWriter recovery doesn't abort lingering transactions 
under EO semantic
 Key: FLINK-32196
 URL: https://issues.apache.org/jira/browse/FLINK-32196
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.15.4, 1.6.4
Reporter: Sharon Xie


We are seeing an issue where a Flink job using the Kafka sink under EO is 
unable to recover from a checkpoint. The sink task is stuck in the 
`INITIALIZING` state and eventually runs OOM. The cause of the OOM is a Kafka 
producer thread leak.

Here is our best hypothesis for the issue.
In `KafkaWriter` under the EO semantic, it intends to abort lingering 
transactions upon recovery 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]

However, the actual implementation in the `TransactionAborter` doesn't abort 
those transactions 
[https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]

Specifically, `producer.abortTransaction()` is never called in that function; 
instead it calls `producer.flush()`.

Also, the function is in a for loop that only breaks when 
`producer.getEpoch() == 0`, which is why we are seeing a producer thread leak 
as the recovery gets stuck in this for loop.
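
For context, the abort loop in the `TransactionAborter` is roughly the 
following simplified paraphrase of the linked code (the method names follow 
the Flink-internal producer wrapper; treat the details as an approximation):
{code:java}
// Simplified paraphrase of TransactionAborter#abortTransactionOfSubtask.
for (long checkpointId = startCheckpointId; ; checkpointId++) {
    String transactionalId = buildTransactionalId(prefix, subtaskId, checkpointId);
    // Re-initializing with the same id aborts the lingering transaction
    // for that id on the broker side...
    producer.setTransactionalId(transactionalId);
    producer.initTransactions();
    producer.flush();
    // ...and an epoch of 0 means the id was never used, so there is
    // nothing further to abort for this subtask.
    if (producer.getEpoch() == 0) {
        break;
    }
}
{code}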



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32196) KafkaWriter recovery doesn't abort lingering transactions under the EO semantic

2023-05-25 Thread Sharon Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharon Xie updated FLINK-32196:
---
Summary: KafkaWriter recovery doesn't abort lingering transactions under 
the EO semantic  (was: KafkaWriter recovery doesn't abort lingering 
transactions under EO semantic)

> KafkaWriter recovery doesn't abort lingering transactions under the EO 
> semantic
> ---
>
> Key: FLINK-32196
> URL: https://issues.apache.org/jira/browse/FLINK-32196
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.6.4, 1.15.4
>Reporter: Sharon Xie
>Priority: Major
>
> We are seeing an issue where a Flink job using the Kafka sink under EO is 
> unable to recover from a checkpoint. The sink task is stuck in the 
> `INITIALIZING` state and eventually runs OOM. The cause of the OOM is a 
> Kafka producer thread leak.
> Here is our best hypothesis for the issue.
> In `KafkaWriter` under the EO semantic, it intends to abort lingering 
> transactions upon recovery 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaWriter.java#L175-L179]
> However, the actual implementation in the `TransactionAborter` doesn't abort 
> those transactions 
> [https://github.com/apache/flink/blob/release-1.15/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/TransactionAborter.java#L97-L124]
> Specifically, `producer.abortTransaction()` is never called in that function; 
> instead it calls `producer.flush()`.
> Also, the function is in a for loop that only breaks when 
> `producer.getEpoch() == 0`, which is why we are seeing a producer thread 
> leak as the recovery gets stuck in this for loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32195) Add SQL Gateway custom headers support

2023-05-25 Thread Elkhan Dadashov (Jira)
Elkhan Dadashov created FLINK-32195:
---

 Summary: Add SQL Gateway custom headers support
 Key: FLINK-32195
 URL: https://issues.apache.org/jira/browse/FLINK-32195
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Gateway
Affects Versions: 1.17.0
Reporter: Elkhan Dadashov


For some use cases, it may be necessary to set a few extra HTTP headers on 
requests to the Flink SQL Gateway, for example a cookie for auth/session.
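
A minimal sketch of what a client would do with such headers, using 
java.net.http (the gateway address, endpoint path, and header values below 
are illustrative assumptions):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GatewayHeadersExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/v1/sessions")) // assumed gateway endpoint
                .header("Cookie", "SESSION=abc123") // e.g. an auth/session cookie
                .POST(HttpRequest.BodyPublishers.ofString("{}"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
{code}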



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-32174) Update Cloudera product and link in doc page

2023-05-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi closed FLINK-32174.
--
Resolution: Fixed

a4de894 in main
fda49dd in release-1.17
6a31cc4 in release-1.16

> Update Cloudera product and link in doc page
> 
>
> Key: FLINK-32174
> URL: https://issues.apache.org/jira/browse/FLINK-32174
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ferenc Csaky
>Assignee: Ferenc Csaky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32174) Update Cloudera product and link in doc page

2023-05-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi updated FLINK-32174:
---
Fix Version/s: 1.18.0

> Update Cloudera product and link in doc page
> 
>
> Key: FLINK-32174
> URL: https://issues.apache.org/jira/browse/FLINK-32174
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ferenc Csaky
>Assignee: Ferenc Csaky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32174) Update Cloudera product and link in doc page

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32174:
---
Labels: pull-request-available  (was: )

> Update Cloudera product and link in doc page
> 
>
> Key: FLINK-32174
> URL: https://issues.apache.org/jira/browse/FLINK-32174
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ferenc Csaky
>Assignee: Ferenc Csaky
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #22662: KafkaSource: disable setting client id prefix if client.id is set

2023-05-25 Thread via GitHub


flinkbot commented on PR #22662:
URL: https://github.com/apache/flink/pull/22662#issuecomment-1563271820

   
   ## CI report:
   
   * 10dc163a5e357195fd35757422f45d32d5500234 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #22661: [WIP][FLINK-31783][runtime] Migrates DefaultLeaderElectionService from LeaderElectionDriver to the MultipleComponentLeaderElectionDriver int

2023-05-25 Thread via GitHub


flinkbot commented on PR #22661:
URL: https://github.com/apache/flink/pull/22661#issuecomment-1563265317

   
   ## CI report:
   
   * a87a2c01417b9c75b0c59b84187e759058f1b45a UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] hmedhat opened a new pull request, #22662: disable setting client id prefix if client.id is set

2023-05-25 Thread via GitHub


hmedhat opened a new pull request, #22662:
URL: https://github.com/apache/flink/pull/22662

   
   
   ## What is the purpose of the change
   Setting a different client id for each consumer makes it difficult to set 
Kafka quotas for that consumer group. This change allows disabling the 
consumer client id prefix, at the expense of Kafka consumer metrics no longer 
being reported correctly.
   
   
   ## Brief change log
   
   Allows setting client.id at the Kafka consumer level, which causes 
client.id.prefix to be ignored and a warning to be printed. A hedged usage 
sketch follows.
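
   A sketch of the intended usage (the builder calls are the existing 
`KafkaSource` API; the effect of the explicit `client.id` is what this PR 
proposes, not current behavior):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> source =
        KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("input-topic")
                .setGroupId("my-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // With this change, an explicit client.id suppresses the
                // generated client.id.prefix and logs a warning instead.
                .setProperty("client.id", "quota-managed-client")
                .build();
```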
   
   
   ## Verifying this change
   
   This change can be verified as follows:
 - Manually verified the change with a simple job and confirmed that the 
client.id is set

   ## Does this pull request potentially affect one of the following parts:
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
 - The serializers: (no)
 - The runtime per-record code paths (performance sensitive): (no)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
 - The S3 file system connector: (no )
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (no)
 - If yes, how is the feature documented? (not applicable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-31783) Replace LeaderElectionDriver in DefaultLeaderElectionService with MultipleComponentLeaderElectionDriver

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-31783:
---
Labels: pull-request-available  (was: )

> Replace LeaderElectionDriver in DefaultLeaderElectionService with 
> MultipleComponentLeaderElectionDriver
> ---
>
> Key: FLINK-31783
> URL: https://issues.apache.org/jira/browse/FLINK-31783
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] HuangZhenQiu commented on a diff in pull request #22491: [FLINK-31947] enable stdout for flink-console.sh

2023-05-25 Thread via GitHub


HuangZhenQiu commented on code in PR #22491:
URL: https://github.com/apache/flink/pull/22491#discussion_r1181389434


##
flink-dist/src/main/flink-bin/bin/flink-console.sh:
##
@@ -116,4 +117,5 @@ echo $$ >> "$pid" 2>/dev/null
 # Evaluate user options for local variable expansion
 FLINK_ENV_JAVA_OPTS=$(eval echo ${FLINK_ENV_JAVA_OPTS})
 
+exec 1>"${out}"

Review Comment:
   The file size of the redirection is a concern from the community, which 
makes sense to me. The lack of an .out file in the web UI was reported by our 
internal users, who use stream.print() to verify execution results. Given that 
stream.print() is supported functionality, I feel it lacks integrity if Flink 
on K8s behaves differently from Flink on YARN. Shall we revisit the problem 
together with Till and Zentol?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] XComp opened a new pull request, #22661: [FLINK-31783][runtime] Migrates DefaultLeaderElectionService from LeaderElectionDriver to the MultipleComponentLeaderElectionDriver interface

2023-05-25 Thread via GitHub


XComp opened a new pull request, #22661:
URL: https://github.com/apache/flink/pull/22661

   ## What is the purpose of the change
   
   `DefaultLeaderElectionService` should rely on 
`MultipleComponentLeaderElectionDriver` directly to enable support for multiple 
contenders within the same service later on.
   
   
   ## Brief change log
   
   * Resolved FLINK-30338 subtasks
   * Made `MultipleComponentLeaderElectionDriver.Listener` extend 
`AutoCloseable`
   
   
   ## Verifying this change
   
   * Migrated any existing tests to the `MultipleComponentLeaderElectionDriver` 
interface. The tests should still succeed.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version

2023-05-25 Thread Kurt Ostfeld (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320
 ] 

Kurt Ostfeld edited comment on FLINK-3154 at 5/25/23 5:09 PM:
--

[~martijnvisser] this PR upgrades Flink from Kryo v2.x to Kryo v5.x and 
preserves backward compatibility with existing savepoints and checkpoints: 
[https://github.com/apache/flink/pull/22660]

 

This keeps the Kryo v2 project dependency for backwards compatibility only and 
otherwise uses Kryo v5.x.


was (Author: JIRAUSER38):
[~martijnvisser] this PR upgrades Flink from v2 to v5 and preserves backward 
compatibility with existing savepoints and checkpoints: 
[https://github.com/apache/flink/pull/22660]

 

This keeps the Kryo v2 project dependency for backwards compatibility only and 
otherwise uses Kryo v5.x.

> Update Kryo version from 2.24.0 to latest Kryo LTS version
> --
>
> Key: FLINK-3154
> URL: https://issues.apache.org/jira/browse/FLINK-3154
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System
>Affects Versions: 1.0.0
>Reporter: Maximilian Michels
>Priority: Not a Priority
>  Labels: pull-request-available
>
> Flink's Kryo version is outdated and could be updated to a newer version, 
> e.g. kryo-3.0.3.
> From ML: we cannot bump the Kryo version easily - the serialization format 
> changed (that's why they have a new major version), which would render all 
> Flink savepoints and checkpoints incompatible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version

2023-05-25 Thread Kurt Ostfeld (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726320#comment-17726320
 ] 

Kurt Ostfeld commented on FLINK-3154:
-

[~martijnvisser] this PR upgrades Flink's Kryo dependency from v2 to v5 and preserves backward 
compatibility with existing savepoints and checkpoints: 
[https://github.com/apache/flink/pull/22660]

 

This keeps the Kryo v2 project dependency for backwards compatibility only and 
otherwise uses Kryo v5.x.

> Update Kryo version from 2.24.0 to latest Kryo LTS version
> --
>
> Key: FLINK-3154
> URL: https://issues.apache.org/jira/browse/FLINK-3154
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System
>Affects Versions: 1.0.0
>Reporter: Maximilian Michels
>Priority: Not a Priority
>  Labels: pull-request-available
>
> Flink's Kryo version is outdated and could be updated to a newer version, 
> e.g. kryo-3.0.3.
> From ML: we cannot bump the Kryo version easily - the serialization format 
> changed (that's why they have a new major version), which would render all 
> Flink savepoints and checkpoints incompatible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] flinkbot commented on pull request #22660: [FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w…

2023-05-25 Thread via GitHub


flinkbot commented on PR #22660:
URL: https://github.com/apache/flink/pull/22660#issuecomment-1563235211

   
   ## CI report:
   
   * d24bb0737d2a7b4b61f51843241ab78a47302d0a UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype

2023-05-25 Thread Ryan Skraba (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726314#comment-17726314
 ] 

Ryan Skraba commented on FLINK-32008:
-

Oh, just taking a quick look: protobuf isn't listed as supported by the 
filesystem connector in the 
[Flink|https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/table/filesystem/]
 docs.  The real bug here might be that filesystem + protobuf doesn't fail 
immediately as an unsupported option!

> Protobuf format throws exception with Map datatype
> --
>
> Key: FLINK-32008
> URL: https://issues.apache.org/jira/browse/FLINK-32008
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.17.0
>Reporter: Xuannan Su
>Priority: Major
> Attachments: flink-protobuf-example.zip
>
>
> The protobuf format throws exception when working with Map data type. I 
> uploaded a example project to reproduce the problem.
>  
> {code:java}
> Caused by: java.lang.RuntimeException: One or more fetchers have encountered 
> exception
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:261)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:169)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:131)
>     at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417)
>     at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>     at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
>     at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>     at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received 
> unexpected exception while polling the records
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:114)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     ... 1 more
> Caused by: java.io.IOException: Failed to deserialize PB object.
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:75)
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:42)
>     at 
> org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.readRecord(DeserializationSchemaAdapter.java:197)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$LineBytesInputFormat.nextRecord(DeserializationSchemaAdapter.java:210)
>     at 
> org.apache.flink.connector.file.table.DeserializationSchemaAdapter$Reader.readBatch(DeserializationSchemaAdapter.java:124)
>     at 
> org.apache.flink.connector.file.src.util.RecordMapperWrapperRecordIterator$1.readBatch(RecordMapperWrapperRecordIterator.java:82)
>     at 
> org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:67)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:162)
>     ... 6 more
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAcce

[jira] [Updated] (FLINK-3154) Update Kryo version from 2.24.0 to latest Kryo LTS version

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-3154:
--
Labels: pull-request-available  (was: )

> Update Kryo version from 2.24.0 to latest Kryo LTS version
> --
>
> Key: FLINK-3154
> URL: https://issues.apache.org/jira/browse/FLINK-3154
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System
>Affects Versions: 1.0.0
>Reporter: Maximilian Michels
>Priority: Not a Priority
>  Labels: pull-request-available
>
> Flink's Kryo version is outdated and could be updated to a newer version, 
> e.g. kryo-3.0.3.
> From ML: we cannot bump the Kryo version easily - the serialization format 
> changed (that's why they have a new major version), which would render all 
> Flink savepoints and checkpoints incompatible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] kurtostfeld opened a new pull request, #22660: [FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w…

2023-05-25 Thread via GitHub


kurtostfeld opened a new pull request, #22660:
URL: https://github.com/apache/flink/pull/22660

   …ith backward compatibility for existing savepoints and checkpoints.
   
   ## What is the purpose of the change
   
   To upgrade the primary Kryo library used by Flink from v2.x to v5.x, while 
providing backwards compatibility with existing savepoints and checkpoints. This 
PR adds a new Kryo v5 dependency that is namespaced so that it can coexist with 
the legacy dependencies that would be kept for compatibility purposes. The 
existing chill-java library version 0.7.6 would also be deprecated and kept for 
compatibility purposes only. Some future version of Flink could eventually drop 
Kryo v2 and chill-java when backwards compatibility with Kryo v2 based state is 
no longer needed.
   
   Why upgrade Kryo:
   
   - Kryo v5.x is more compatible with newer versions of Java. I wrote a simple 
Flink app with Kryo-serialized state: under Java 17 and Java 21 pre-release 
builds, the Kryo v2 library fails at runtime, while a prototype build of Flink 
with this PR successfully loads state serialized with Kryo v5.x under both.
   - Kryo v5.x supports Java records, while Kryo v2.x does not.
   - Kryo v2.2.x was released over ten years ago, well before the release of 
Java 8. Kryo is a maintained project, and there have been lots of improvements 
over the past ten years: Kryo v5 has faster runtime performance, more 
memory-efficient serialization protocols, and fixes for many bugs. While the 
Flink project has worked around all known Kryo v2 bugs, it would be an 
improvement to remove the workarounds. Finally, Kryo v2 depends on the 
chill-java library for functionality that is almost entirely built into Kryo 
v5.x, so the chill-java dependency can be phased out. (A reference round-trip 
with the v5 API is sketched below.)
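
   As a point of reference, here is a self-contained Kryo v5 round-trip, 
independent of Flink (the demo class and values are made up):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class KryoV5Demo {
    static class Event {
        String name;
        long timestamp;
    }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        kryo.register(Event.class); // Kryo v5 requires registration by default

        Event in = new Event();
        in.name = "checkpoint";
        in.timestamp = 42L;

        // Serialize to a byte array...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Output output = new Output(bytes)) {
            kryo.writeObject(output, in);
        }
        // ...and read it back.
        try (Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()))) {
            Event out = kryo.readObject(input, Event.class);
            System.out.println(out.name + "@" + out.timestamp);
        }
    }
}
```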
   
   ## Brief change log
   
   This is a large PR with a lot of surface area and risk. I tried to keep the 
scope of these changes as narrow and simple as possible. I copied all the Kryo 
v2 code to Kryo v5 equivalents and made necessary adjustments to get everything 
working. Some Flink serialization code references Flink Java classes by full 
package name, so I didn't modify the package names or class names of any 
existing serialization classes.
   
   ## Verifying this change
   
   There are lots of existing unit tests that cover backward compatibility with 
existing state and also the serialization framework. This obviously passes all 
those tests. Additionally, I wrote a Flink application to do a more thorough 
test of the Kryo upgrade that was difficult to convert into unit test form.
   
   https://github.com/kurtostfeld/flink-kryo-upgrade-demo
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): yes. This adds a new 
Kryo v5 dependency and keeps legacy dependencies for backwards compatibility 
purposes.
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes. Kryo v2 APIs are deprecated. Parallel Kryo v5 APIs 
are created with PublicEvolving
 - The serializers: yes. This absolutely affects the serializers.
 - The runtime per-record code paths (performance sensitive): yes. This 
should be faster but I haven't done any benchmark testing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-32008) Protobuf format throws exception with Map datatype

2023-05-25 Thread Ryan Skraba (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726311#comment-17726311
 ] 

Ryan Skraba commented on FLINK-32008:
-

Hello, thanks for the example project!  That's so helpful to reproduce and 
debug.

The *current* file strategy for protobuf in Flink is to write one record 
serialized as binary per line, adding a trailing *{{0x0a}}* newline byte.  
Your example message is serialized as:

{code}
12 06 0a 01 61 12 01 62 0a
{code}

The first *{{0a}}* is the protobuf encoding for the key field in the map.  The 
last *{{0a}}* is a new line (which probably shouldn't be there).

When reading from a file, splits are calculated and assigned to tasks using 
the *{{0a}}* byte as a delimiter, which is very likely to fail and is a fault 
in Flink's protobuf file implementation.

I'm guessing this isn't limited to maps; we can expect this delimiter byte to 
occur in many different ways in the protobuf binary.

If this hasn't been addressed, it's probably because it's pretty rare to store 
protobuf messages in a file container (as opposed to in a single message 
packet, or a table cell).  Do you have a good use case that we can use to guide 
what we expect Flink to do with protobuf files?

For info, there is nothing in the Protobuf encoding that can be used to 
[distinguish the start or 
end|https://protobuf.dev/programming-guides/techniques/#streaming] of a 
message.  If we want to store multiple messages in the same container (file or 
any sequence of bytes), we have to manage the indices ourselves.  The above 
link recommends writing the message size followed by the binary (in Java, this 
is done using {{writeDelimitedTo}}/{{parseDelimitedFrom}} instead of 
{{writeTo}}/{{parseFrom}}, for example).
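
For illustration, the delimited pattern looks like this in Java ({{MyProto}} 
stands in for any generated message class, and {{process}} is a placeholder; 
only {{writeDelimitedTo}}/{{parseDelimitedFrom}} are the real protobuf API):
{code:java}
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Writing: each call emits a varint-encoded size, then the message bytes.
try (FileOutputStream out = new FileOutputStream("messages.bin")) {
    message1.writeDelimitedTo(out);
    message2.writeDelimitedTo(out);
}

// Reading: parseDelimitedFrom returns null at end of stream.
try (FileInputStream in = new FileInputStream("messages.bin")) {
    MyProto m;
    while ((m = MyProto.parseDelimitedFrom(in)) != null) {
        process(m);
    }
}
{code}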

> Protobuf format throws exception with Map datatype
> --
>
> Key: FLINK-32008
> URL: https://issues.apache.org/jira/browse/FLINK-32008
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.17.0
>Reporter: Xuannan Su
>Priority: Major
> Attachments: flink-protobuf-example.zip
>
>
> The protobuf format throws exception when working with Map data type. I 
> uploaded a example project to reproduce the problem.
>  
> {code:java}
> Caused by: java.lang.RuntimeException: One or more fetchers have encountered 
> exception
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:261)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:169)
>     at 
> org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:131)
>     at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417)
>     at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>     at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
>     at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
>     at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received 
> unexpected exception while polling the records
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165)
>     at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:114)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     ... 1 more
> Caused by: java.io.IOException: Failed to deserialize PB object.
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema.deserialize(PbRowDataDeserializationSchema.java:75)
>     at 
> org.apache.flink.formats.protobuf.deserialize.PbRowDataDeserializationSchema

[GitHub] [flink] flinkbot commented on pull request #22659: [FLINK-32186][runtime-web] Support subtask stack auto-search when red…

2023-05-25 Thread via GitHub


flinkbot commented on PR #22659:
URL: https://github.com/apache/flink/pull/22659#issuecomment-1563200191

   
   ## CI report:
   
   * 89518bc54ec90fb7d66ef42d4538cbbf7263b43d UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-32186) Support subtask stack auto-search when redirecting from subtask backpressure tab

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32186:
---
Labels: pull-request-available  (was: )

> Support subtask stack auto-search when redirecting from subtask backpressure 
> tab
> 
>
> Key: FLINK-32186
> URL: https://issues.apache.org/jira/browse/FLINK-32186
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.18.0
>Reporter: Yu Chen
>Assignee: Yu Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-05-25-15-52-54-383.png, 
> image-2023-05-25-16-11-00-374.png
>
>
> Note that we have introduced a dump link on the backpressure page in 
> FLINK-29996 (Figure 1), which makes it easier to check what the 
> corresponding subtasks are doing.
> But we still have to search for the corresponding call stack of the 
> back-pressured subtask in the whole TaskManager thread dump, which is not 
> convenient enough.
> Therefore, I would like to trigger the editor's search automatically 
> after redirecting from the backpressure tab, which will scroll the 
> thread dump to the corresponding call stack of the back-pressured subtask 
> (as shown in Figure 2).
> !image-2023-05-25-15-52-54-383.png|width=680,height=260!
> Figure 1. ThreadDump Link in Backpressure Tab
> !image-2023-05-25-16-11-00-374.png|width=680,height=353!
> Figure 2. Trigger Auto-search after Redirecting from Backpressure Tab



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink] yuchen-ecnu opened a new pull request, #22659: [FLINK-32186][runtime-web] Support subtask stack auto-search when red…

2023-05-25 Thread via GitHub


yuchen-ecnu opened a new pull request, #22659:
URL: https://github.com/apache/flink/pull/22659

   …irecting from subtask backpressure tab
   
   ## What is the purpose of the change
   
   To further improve troubleshooting of subtask backpressure: when the user 
clicks the `Dump` button in the backpressure panel of an operator, the link 
carries the operator name and **automatically** scrolls to the thread stack of 
the corresponding subtask.
   
   
   ## Brief change log
   
   - Carry the operator name in the URL when the Dump button is clicked in the 
backpressure panel of the operator.
   - Trigger the search automatically when the editor on the ThreadDump page 
is initialized.
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-connector-kafka] tzulitai commented on pull request #29: [FLINK-32021][Connectors/Kafka] Improvement the Javadoc for SpecifiedOffsetsInitializer and TimestampOffsetsInitializer

2023-05-25 Thread via GitHub


tzulitai commented on PR #29:
URL: 
https://github.com/apache/flink-connector-kafka/pull/29#issuecomment-1563142803

   Thanks for the PR @loserwang1024 and @RamanVerma for the reviews!
   
   Merging this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-connector-kafka] tzulitai commented on pull request #25: [FLINK-30859] Sync flink sql client test

2023-05-25 Thread via GitHub


tzulitai commented on PR #25:
URL: 
https://github.com/apache/flink-connector-kafka/pull/25#issuecomment-1563139310

   On a second look, it looks like this test does far more stuff than just 
testing the SQL Kafka Connector ...
   
   Again, like the state machine example PR, I'm not sure we really should be 
moving the whole e2e test as is, but rather:
   
   1. modify the original E2E in `apache/flink` to not use Kafka
   2. Move over only the parts related to Kafka in 
`apache/flink-connector-kafka`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] snuyanzin merged pull request #22657: [BP-1.16][FLINK-30844][runtime] Increased TASK_CANCELLATION_TIMEOUT for testInterruptibleSharedLockInInvokeAndCancel

2023-05-25 Thread via GitHub


snuyanzin merged PR #22657:
URL: https://github.com/apache/flink/pull/22657


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32194:
---
Labels: pull-request-available  (was: )

> Elasticsearch connector should remove the dependency on flink-shaded
> 
>
> Key: FLINK-32194
> URL: https://issues.apache.org/jira/browse/FLINK-32194
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / ElasticSearch
>Affects Versions: elasticsearch-4.0.0
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
>  Labels: pull-request-available
>
> The Elasticsearch connector depends on flink-shaded. With the externalization 
> of the connector, connectors shouldn't rely on flink-shaded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-connector-elasticsearch] boring-cyborg[bot] commented on pull request #65: [FLINK-32194][connectors/elasticsearch] Remove the dependency on flink-shaded

2023-05-25 Thread via GitHub


boring-cyborg[bot] commented on PR #65:
URL: 
https://github.com/apache/flink-connector-elasticsearch/pull/65#issuecomment-1563090057

   Thanks for opening this pull request! Please check out our contributing 
guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-32194) Elasticsearch connector should remove the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)
Yuxin Tan created FLINK-32194:
-

 Summary: Elasticsearch connector should remove the dependency on 
flink-shaded
 Key: FLINK-32194
 URL: https://issues.apache.org/jira/browse/FLINK-32194
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / ElasticSearch
Affects Versions: elasticsearch-4.0.0
Reporter: Yuxin Tan
Assignee: Yuxin Tan


The Elasticsearch connector depends on flink-shaded. With the externalization 
of the connector, connectors shouldn't rely on flink-shaded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-32193) AWS connector removes the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan reassigned FLINK-32193:
-

Assignee: (was: Yuxin Tan)

> AWS connector removes the dependency on flink-shaded
> 
>
> Key: FLINK-32193
> URL: https://issues.apache.org/jira/browse/FLINK-32193
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / AWS
>Reporter: Yuxin Tan
>Priority: Major
> Fix For: aws-connector-4.2.0
>
>
> The AWS connector depends on flink-shaded. With the externalization of the 
> connector, connectors shouldn't rely on flink-shaded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32144) FileSourceTextLinesITCase times out on AZP

2023-05-25 Thread Lijie Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726264#comment-17726264
 ] 

Lijie Wang commented on FLINK-32144:


Adding this feature at the configuration level is an option. 

Another option, according to the [doc of junit5 
migration|https://docs.google.com/document/d/1514Wa_aNB9bJUen4xm5uiuXOooOJTtXqS_Jqk9KJitU],
 is to let {{MiniClusterExtension}} implement the {{CustomExtension}} 
interface and wrap it with {{EachCallbackWrapper}} for a per-test lifecycle, 
or with {{AllCallbackWrapper}} for a per-class lifecycle (currently some other 
extensions follow this rule, for example {{{}ZooKeeperExtension{}}}).

But I found that we did it this way before FLINK-26252, and I don't know why 
it was changed to the current state in FLINK-26252.
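
For reference, per-class registration today looks roughly like this (the 
constructor and builder names are from flink-test-utils; treat the exact 
signatures as an assumption):
{code:java}
import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
import org.apache.flink.test.junit5.MiniClusterExtension;
import org.junit.jupiter.api.extension.RegisterExtension;

class MyITCase {
    // A static field gives a per-class lifecycle; an instance field is per-test.
    @RegisterExtension
    static final MiniClusterExtension MINI_CLUSTER =
            new MiniClusterExtension(
                    new MiniClusterResourceConfiguration.Builder()
                            .setNumberTaskManagers(1)
                            .setNumberSlotsPerTaskManager(2)
                            .build());
}
{code}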

> FileSourceTextLinesITCase times out on AZP
> --
>
> Key: FLINK-32144
> URL: https://issues.apache.org/jira/browse/FLINK-32144
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.18.0
>Reporter: Sergey Nuyanzin
>Priority: Critical
>  Labels: test-stability
>
> {noformat}
>  May 21 01:15:09 
> ==
> May 21 01:15:09 Process produced no output for 900 seconds.
> May 21 01:15:09 
> ==
> May 21 01:15:09 
> ==
> May 21 01:15:09 The following Java processes are running (JPS)
> May 21 01:15:09 
> ==
> ...
> May 21 01:15:10 "ForkJoinPool-1-worker-25" #27 daemon prio=5 os_prio=0 
> tid=0x7ff334c35800 nid=0x1e49 waiting on condition [0x7ff23c3e7000]
> May 21 01:15:10java.lang.Thread.State: WAITING (parking)
> May 21 01:15:10   at sun.misc.Unsafe.park(Native Method)
> May 21 01:15:10   - parking to wait for  <0xce8682e0> (a 
> java.util.concurrent.CompletableFuture$Signaller)
> May 21 01:15:10   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> May 21 01:15:10   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> May 21 01:15:10   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
> May 21 01:15:10   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> May 21 01:15:10   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> May 21 01:15:10   at 
> org.apache.flink.connector.file.src.FileSourceTextLinesITCase$RecordCounterToFail.waitToFail(FileSourceTextLinesITCase.java:481)
> May 21 01:15:10   at 
> org.apache.flink.connector.file.src.FileSourceTextLinesITCase$RecordCounterToFail.access$100(FileSourceTextLinesITCase.java:457)
> May 21 01:15:10   at 
> org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testBoundedTextFileSource(FileSourceTextLinesITCase.java:145)
> May 21 01:15:10   at 
> org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testBoundedTextFileSourceWithJobManagerFailover(FileSourceTextLinesITCase.java:110)
> May 21 01:15:10   
> ...
> May 21 01:15:10 Killing process with pid=860 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   860 Terminated  $cmd
> May 21 01:15:11 Process exited with EXIT CODE: 143.
> May 21 01:15:11 Trying to KILL watchdog (856).
> May 21 01:15:11 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> May 21 01:15:15 Moving 
> '/__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports/2023-05-21T00-58-13_549-jvmRun2.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> May 21 01:15:15 Moving 
> '/__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports/2023-05-21T00-58-13_549-jvmRun2.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=49181&view=logs&j=4eda0b4a-bd0d-521a-0916-8285b9be9bb5&t=2ff6d5fa-53a6-53ac-bff7-fa524ea361a9&l=14436



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-31945) Table of contents in the blogs of the project website is missing some titles

2023-05-25 Thread Martijn Visser (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn Visser closed FLINK-31945.
--
Resolution: Fixed

Fixed in apache/flink-web:asf-site via 6948c400ce9749fdb3a2f7ccc3a96ec76b2e4a47

> Table of contents in the blogs of the project website is missing some titles
> 
>
> Key: FLINK-31945
> URL: https://issues.apache.org/jira/browse/FLINK-31945
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhilong Hong
>Assignee: Zhilong Hong
>Priority: Minor
>  Labels: pull-request-available
>
> The ToC in the blog pages of the project website doesn't have all the section 
> titles. The section titles of the first level are missing. 
> Solution: Add the following configuration item in config.toml.
> {noformat}
> [markup.tableOfContents]
>   startLevel = 0{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-connector-kafka] tzulitai commented on a diff in pull request #25: [FLINK-30859] Sync flink sql client test

2023-05-25 Thread via GitHub


tzulitai commented on code in PR #25:
URL: 
https://github.com/apache/flink-connector-kafka/pull/25#discussion_r1205621149


##
flink-connector-kafka-e2e-tests/flink-sql-client-test/pom.xml:
##
@@ -0,0 +1,129 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+       <parent>
+               <groupId>org.apache.flink</groupId>
+               <artifactId>flink-connector-kafka-e2e-tests</artifactId>
+               <version>3.1-SNAPSHOT</version>
+       </parent>
+       <modelVersion>4.0.0</modelVersion>
+
+       <artifactId>flink-sql-client-test</artifactId>
+       <name>Flink : E2E Tests : SQL client</name>
+       <packaging>jar</packaging>
+
+       <dependencies>
+               <dependency>
+                       <groupId>org.apache.flink</groupId>
+                       <artifactId>flink-table-common</artifactId>
+                       <version>${flink.version}</version>
+                       <scope>provided</scope>
+               </dependency>
+
+               <dependency>
+                       <groupId>org.apache.flink</groupId>
+                       <artifactId>flink-end-to-end-tests-common-kafka</artifactId>
+                       <version>${project.version}</version>
+               </dependency>
+
+               <dependency>
Review Comment:
   Is this comment still relevant?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-connector-kafka] tzulitai commented on a diff in pull request #24: [FLINK-30859] Sync kafka related examples from apache/flink:master

2023-05-25 Thread via GitHub


tzulitai commented on code in PR #24:
URL: 
https://github.com/apache/flink-connector-kafka/pull/24#discussion_r1205609370


##
flink-examples-kafka/flink-examples-streaming-state-machine-build-helper/src/main/resources/META-INF/NOTICE:
##
@@ -0,0 +1,9 @@
+flink-examples-streaming-state-machine
+Copyright 2014-2023 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+This project bundles the following dependencies under the Apache Software 
License 2.0. (http://www.apache.org/licenses/LICENSE-2.0.txt)
+
+- org.apache.kafka:kafka-clients:3.2.3

Review Comment:
   the Kafka dependency version is now 3.4.0



##
flink-examples-kafka/pom.xml:
##
@@ -0,0 +1,45 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
+ http://maven.apache.org/maven-v4_0_0.xsd">
+
+  <modelVersion>4.0.0</modelVersion>
+
+  <parent>
+    <groupId>org.apache.flink</groupId>
+    <artifactId>flink-connector-kafka-parent</artifactId>
+    <version>3.1-SNAPSHOT</version>
+  </parent>
+
+  <packaging>pom</packaging>
+
+  <artifactId>flink-examples-kafka</artifactId>
+  <name>Flink : Formats : Kafka</name>

Review Comment:
   `Flink : Examples : Kafka`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-connector-jdbc] boring-cyborg[bot] commented on pull request #46: [hotfix][test] remove non-used code in tests

2023-05-25 Thread via GitHub


boring-cyborg[bot] commented on PR #46:
URL: 
https://github.com/apache/flink-connector-jdbc/pull/46#issuecomment-1563027492

   Awesome work, congrats on your first merged pull request!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-connector-jdbc] MartijnVisser merged pull request #46: [hotfix][test] remove non-used code in tests

2023-05-25 Thread via GitHub


MartijnVisser merged PR #46:
URL: https://github.com/apache/flink-connector-jdbc/pull/46


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-31832) Add benchmarks for end-to-end restarting tasks

2023-05-25 Thread Lijie Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726242#comment-17726242
 ] 

Lijie Wang commented on FLINK-31832:


Done via:
flink master: 7d529a5a167f52c1b76fd6f70918b27149cf8782
flink-benchmarks master: 8b32b75e06ec51ee6ec21304c6e473159e0ef3ff

> Add benchmarks for end-to-end restarting tasks
> ---
>
> Key: FLINK-31832
> URL: https://issues.apache.org/jira/browse/FLINK-31832
> Project: Flink
>  Issue Type: Sub-task
>  Components: Benchmarks, Runtime / Coordination
>Reporter: Weihua Hu
>Assignee: Weihua Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.0
>
>
> As discussed in https://issues.apache.org/jira/browse/FLINK-31771, we need a
> benchmark for job failover and end-to-end task restarts.
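For illustration only, a self-contained JMH-style sketch of the shape such an end-to-end restart benchmark could take. Every class and method name below is hypothetical; the actual benchmark is the one merged in the flink-benchmarks commit referenced above.
{code:java}
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class EndToEndRestartBenchmark {

    // Hypothetical stand-in for a deployed job whose tasks get restarted.
    private FakeJob job;

    @Setup
    public void setUp() {
        job = new FakeJob();
    }

    // One measured failover cycle: trigger a task failure, then block
    // until every task is deployed and running again.
    @Benchmark
    public void restartTasksEndToEnd() throws InterruptedException {
        job.triggerTaskFailure();
        job.awaitAllTasksRunning();
    }

    @TearDown
    public void tearDown() {
        job.shutdown();
    }

    // Placeholder implementation so the sketch compiles stand-alone.
    static final class FakeJob {
        void triggerTaskFailure() { /* e.g. cancel one task attempt */ }

        void awaitAllTasksRunning() throws InterruptedException {
            Thread.sleep(1); // stand-in for polling the scheduler state
        }

        void shutdown() { /* release cluster resources */ }
    }
}
{code}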



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [flink-connector-jdbc] WenDing-Y commented on pull request #49: [FLINK-32068] connector jdbc support clickhouse

2023-05-25 Thread via GitHub


WenDing-Y commented on PR #49:
URL: 
https://github.com/apache/flink-connector-jdbc/pull/49#issuecomment-1562933275

   The PR is ready.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-32068) flink-connector-jdbc support for ClickHouse

2023-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32068:
---
Labels: pull-request-available  (was: )

> flink-connector-jdbc support for ClickHouse
> 
>
> Key: FLINK-32068
> URL: https://issues.apache.org/jira/browse/FLINK-32068
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / JDBC
>Reporter: leishuiyu
>Assignee: leishuiyu
>Priority: Minor
>  Labels: pull-request-available
>
> Flink SQL should support ClickHouse (see the sketch below):
>  * In the batch scenario, ClickHouse can serve as both a source and a sink.
>  * In the streaming scenario, ClickHouse can serve as a sink.
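A hedged sketch of what this could look like from the SQL side, reusing the table options that already exist in flink-connector-jdbc ('connector', 'url', 'table-name'); the ClickHouse JDBC URL and dialect support are what the PR would add, and all database, table, and column names below are made up for illustration:
{code:sql}
-- ClickHouse-backed table via the JDBC connector (sketch, names invented).
CREATE TABLE orders_in_clickhouse (
    order_id BIGINT,
    amount   DECIMAL(10, 2),
    ts       TIMESTAMP(3)
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:clickhouse://localhost:8123/default',
    'table-name' = 'orders'
);

-- Streaming scenario: ClickHouse as a sink only.
INSERT INTO orders_in_clickhouse
SELECT order_id, amount, ts FROM some_upstream_table;
{code}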



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32193) AWS connector removes the dependency on flink-shaded

2023-05-25 Thread Yuxin Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Tan updated FLINK-32193:
--
Issue Type: Technical Debt  (was: Bug)

> AWS connector removes the dependency on flink-shaded
> 
>
> Key: FLINK-32193
> URL: https://issues.apache.org/jira/browse/FLINK-32193
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / AWS
>Reporter: Yuxin Tan
>Assignee: Yuxin Tan
>Priority: Major
> Fix For: aws-connector-4.2.0
>
>
> The AWS connector depends on flink-shaded. With the externalization of the
> connectors, they should no longer rely on flink-shaded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   >