[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-07-01 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861194#comment-17861194 ]

Matthias Pohl edited comment on FLINK-35042 at 7/1/24 3:18 PM:
---

AdaptiveScheduler: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741

This is the same behavior that can be observed in the build [~Weijie Guo] 
shared in the FLINK-35042 description.


was (Author: mapohl):
AdaptiveScheduler: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741

This one times out while waiting for the second full restart of the job (while 
there are RPC errors preventing the job from reaching the Executing state again). 
The job only fully recovers after the 2nd TM becomes available, resulting in 
no 2nd restart at all.

This is the same behavior that can be observed in the build [~Weijie Guo] 
shared in the FLINK-35042 description.

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
> Issue Type: Bug
> Components: Build System / CI
> Affects Versions: 1.20.0
> Reporter: Weijie Guo
> Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I checked the JM log; it seems that a TaskManager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: Unnamed (4/4) (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04  at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04  at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04  at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04  at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04  at 
> 

[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-07-01 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861194#comment-17861194 ]

Matthias Pohl edited comment on FLINK-35042 at 7/1/24 3:16 PM:
---

AdaptiveScheduler: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741

This one times out while waiting for the second full restart of the job (while 
there are RPC errors preventing the job from reaching the Executing state again). 
The job only fully recovers after the 2nd TM becomes available, resulting in 
no 2nd restart at all.

This is the same behavior that can be observed in the build [~Weijie Guo] 
shared in the FLINK-35042 description.
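
To illustrate the timeout described above, here is a minimal Python sketch of a bounded polling loop. It is not the actual test script; `wait_for_restarts`, `read_count`, and the simulated restart counts are hypothetical names and values chosen for illustration. It only shows why a wait for a second restart can never succeed if the job recovers without restarting again.

```python
def wait_for_restarts(read_restart_count, expected, max_polls):
    """Poll until the restart count reaches `expected` or the polls run out.

    A real script would sleep between polls; that is omitted here so the
    sketch runs instantly.
    """
    for _ in range(max_polls):
        if read_restart_count() >= expected:
            return True
    return False


# Simulated restart counter (assumed values): the job restarts once and then
# recovers without a second restart, so the counter stays at 1.
observed = iter([0, 0, 1, 1, 1, 1, 1, 1])


def read_count():
    return next(observed, 1)


first = wait_for_restarts(read_count, expected=1, max_polls=5)
second = wait_for_restarts(read_count, expected=2, max_polls=5)
print(first, second)
```

The first wait returns once the counter reaches 1. The second exhausts its polls because the counter never reaches 2, which corresponds to the timeout seen in the build.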


was (Author: mapohl):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741

This one times out while waiting for the second full restart of the job (while 
there are RPC errors preventing the job from reaching the Executing state again). 
The job only fully recovers after the 2nd TM becomes available, resulting in 
no 2nd restart at all.


[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-07-01 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861194#comment-17861194 ]

Matthias Pohl edited comment on FLINK-35042 at 7/1/24 3:11 PM:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741

This one times out while waiting for the second full restart of the job (while 
there are RPC errors preventing the job from reaching the Executing state again). 
The job only fully recovers after the 2nd TM becomes available, resulting in 
no 2nd restart at all.


was (Author: mapohl):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60569&view=logs&j=fb37c667-81b7-5c22-dd91-846535e99a97&t=011e961e-597c-5c96-04fe-7941c8b83f23&l=12741


[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-14 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854944#comment-17854944 ]

Matthias Pohl edited comment on FLINK-35042 at 6/14/24 6:37 AM:


I noticed that the build failure in the description is unrelated to FLINK-34150 
because it appeared on April 8, 2024, whereas FLINK-34150 was only merged on May 
10, 2024.

But the build failure I shared might be related. So, it could be that these two 
are actually two different issues.
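
The date argument above can be checked with a few lines of Python. This is only a sketch of the reasoning; both dates are the ones stated in the comment, and the variable names are illustrative.

```python
from datetime import date

failure_date = date(2024, 4, 8)  # build failure from the issue description
merge_date = date(2024, 5, 10)   # FLINK-34150 merge date stated above

# A change cannot have caused a failure that occurred before it was merged.
predates_merge = failure_date < merge_date
days_before = (merge_date - failure_date).days
print(predates_merge, days_before)
```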


was (Author: mapohl):
I noticed that the build failure in the description is unrelated to FLINK-34150 
because it appeared on April 8, 2024 whereas FLINK-24150 only was merged on May 
10, 2024.

But the build failure I shared might be related. So, it could be that these two 
are actually two different issues.


[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-13 Thread Matthias Pohl (Jira)


[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854789#comment-17854789 ]

Matthias Pohl edited comment on FLINK-35042 at 6/13/24 4:16 PM:


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237&view=logs&j=ef799394-2d67-5ff4-b2e5-410b80c9c0af&t=9e5768bc-daae-5f5f-1861-e58617922c7a&l=9817]

For that one, it looks like the test never reached the expected number of produced values:
{code:java}
Jun 13 13:04:25 Waiting for Dispatcher REST endpoint to come up...
Jun 13 13:04:26 Dispatcher REST endpoint is up.
Jun 13 13:04:28 [INFO] 1 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:04:28 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:32 [INFO] 2 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:04:32 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:37 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:04:37 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:37 Submitting job.
Jun 13 13:04:57 Job (be9bc06a08a4c0fc3bf2c9e1c92219d4) is running.
Jun 13 13:04:57 Waiting for job (be9bc06a08a4c0fc3bf2c9e1c92219d4) to have at least 3 completed checkpoints ...
Jun 13 13:05:06 Killing TM
Jun 13 13:05:06 TaskManager 122377 killed.
Jun 13 13:05:06 Starting TM
Jun 13 13:05:08 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:05:08 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:08 Waiting for restart to happen
Jun 13 13:05:08 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:13 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:18 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:23 Killing 2 TMs
Jun 13 13:05:24 TaskManager 121771 killed.
Jun 13 13:05:24 TaskManager 122908 killed.
Jun 13 13:05:24 Starting 2 TMs
Jun 13 13:05:26 [INFO] 2 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:05:26 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:31 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180.
Jun 13 13:05:31 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:31 Waiting for restart to happen
Jun 13 13:05:31 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:36 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:41 Waiting until all values have been produced
Jun 13 13:05:43 Number of produced values 0/6
[...] {code}
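
As a rough aid for reading such logs, the following Python sketch extracts the restart and produced-value counters from the test output. The sample lines are copied from the log above; the regular expressions and the `summarize` helper are my own assumptions about the message format, not part of the Flink test scripts.

```python
import re

# Sample lines copied from the test output quoted above.
LOG = """\
Jun 13 13:05:08 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:13 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:18 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:31 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:36 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:43 Number of produced values 0/6
"""

# Assumed message formats, inferred only from the lines above.
RESTART = re.compile(r"Still waiting for restarts\. Expected: (\d+) Current: (\d+)")
PRODUCED = re.compile(r"Number of produced values (\d+)/(\d+)")


def summarize(log):
    """Return (expected, current) restart pairs and (produced, total) pairs."""
    restarts = [(int(m.group(1)), int(m.group(2))) for m in RESTART.finditer(log)]
    produced = [(int(m.group(1)), int(m.group(2))) for m in PRODUCED.finditer(log)]
    return restarts, produced


restarts, produced = summarize(LOG)
print(restarts)
print(produced)
```

In this excerpt the restart counter lags the expected value in every logged line, and no values have been produced at the last line shown.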


was (Author: mapohl):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237&view=logs&j=ef799394-2d67-5ff4-b2e5-410b80c9c0af&t=9e5768bc-daae-5f5f-1861-e58617922c7a&l=9817
