[ https://issues.apache.org/jira/browse/FLINK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629601#comment-17629601 ]

Weijie Guo edited comment on FLINK-28695 at 11/7/22 6:22 AM:
-------------------------------------------------------------

[~Brian Zhou] If you suspect that the problem is caused by connection reuse,
you can try setting `taskmanager.network.max-num-tcp-connections` to a larger
value. What I'm puzzled about is that even if the TCP connection between two
TMs is reused, Kubernetes should theoretically still be able to deliver the
messages to the new pod correctly. It also sounds like you can reproduce this
problem; could you provide me with a minimal way to reproduce it so I can
investigate further?
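
For reference, a minimal sketch of that setting in flink-conf.yaml (the value
below is only an illustration, not a recommendation; I believe the default is 1):

{code}
# flink-conf.yaml
# Maximum number of TCP connections between a pair of TaskManagers for data
# communication; a larger value reduces connection sharing between channels.
# The value 8 here is an arbitrary example.
taskmanager.network.max-num-tcp-connections: 8
{code}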


> Fail to send partition request to restarted taskmanager
> -------------------------------------------------------
>
>                 Key: FLINK-28695
>                 URL: https://issues.apache.org/jira/browse/FLINK-28695
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Runtime / Network
>    Affects Versions: 1.15.0, 1.15.1
>            Reporter: Simonas
>            Priority: Major
>         Attachments: deployment.txt, job_log.txt, jobmanager_config.txt, 
> jobmanager_logs.txt, pod_restart.txt, taskmanager_config.txt
>
>
> After upgrading to *1.15.1* we started getting the following error while running the JOB:
>  
> {code:java}
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed.
>     at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
>     .... {code}
> {code:java}
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source){code}
> After investigation we managed to narrow it down to the exact behavior
> under which this issue happens:
>  # Deploying the JOB on a fresh Kubernetes session cluster with multiple
> TaskManagers TM1 and TM2 is successful. The JOB has multiple partitions
> running on both TM1 and TM2.
>  # One TaskManager, TM2 (XXX.XXX.XX.32), fails due to an unrelated issue,
> for example an OOM exception.
>  # The Kubernetes POD with the mentioned TaskManager TM2 is restarted. The
> POD retains the same IP address as before (see the sketch after this list).
>  # The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
>  # The JOB is restarted because it was running on the failed TaskManager TM2.
>  # The TM1 data channel to TM2 is closed and we get LocalTransportException:
> Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed during the
> JOB running stage.
>  # When we explicitly delete the POD with TM2, a new POD with a different IP
> address is created and the JOB is able to start again.
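> To make the in-place restart in step 3 vs. the POD deletion in step 7
> concrete, here is a minimal sketch of how one could trigger the two cases
> (POD names are hypothetical; it assumes the attached Deployment with the
> default restartPolicy: Always, so a crashed container is restarted inside
> the same POD and keeps its IP, while a deleted POD comes back with a new one):
> {code}
> # Note the current POD IPs of the TaskManagers
> kubectl get pods -o wide
>
> # Steps 2-3: simulate the TM2 failure by killing the Flink process
> # (PID 1) inside the container; Kubernetes restarts the container in
> # the same POD, so the POD IP stays the same
> kubectl exec flink-taskmanager-tm2 -- kill 1
>
> # Step 7: delete the POD instead; the Deployment recreates it with a
> # different POD IP and the JOB recovers
> kubectl delete pod flink-taskmanager-tm2
> {code}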
> It is important to note that we did not encounter this issue with the
> previous *1.14.4* version; TaskManager restarts did not cause such errors.
> Please see the attached Kubernetes deployments and the reduced logs from the
> JobManager. The TaskManager logs did show errors before the failure, but do
> not show anything significant after the restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
