[ https://issues.apache.org/jira/browse/FLINK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636390#comment-17636390 ]
nobleyd commented on FLINK-28695:
---------------------------------

[~pnowojski] Hi, we have the same problem, and adjusting `taskmanager.network.max-num-tcp-connections` works fine.

> Fail to send partition request to restarted taskmanager
> -------------------------------------------------------
>
>                 Key: FLINK-28695
>                 URL: https://issues.apache.org/jira/browse/FLINK-28695
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Runtime / Network
>    Affects Versions: 1.15.0, 1.15.1
>            Reporter: Simonas
>            Assignee: Rui Fan
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: deployment.txt, image-2022-11-20-16-16-45-705.png, job_log.txt, jobmanager_config.txt, jobmanager_logs.txt, pod_restart.txt, taskmanager_config.txt
>
> After upgrading to *1.15.1* we started getting an error while running the JOB:
>
> {code:java}
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed. at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> .... {code}
> {code:java}
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source){code}
> After investigation we managed to narrow it down to the exact behavior under which this issue happens:
> # Deploying the JOB on a fresh Kubernetes session cluster with multiple TaskManagers TM1 and TM2 is successful. The job has multiple partitions running on both TM1 and TM2.
> # One TaskManager, TM2 (XXX.XXX.XX.32), fails for an unrelated reason, for example an OOM exception.
> # The Kubernetes POD with the mentioned TaskManager TM2 is restarted. The POD retains the same IP address as before.
> # The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
> # The JOB is restarted because it was running on the failed TaskManager TM2.
> # The TM1 data channel to TM2 is closed and we get LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed during the JOB running stage.
> # When we explicitly delete the POD with TM2, a new POD with a different IP address is created and the JOB is able to start again.
> It is important to note that we did not encounter this issue with the previous *1.14.4* version, and TaskManager restarts did not cause such errors.
> Please see the attached Kubernetes deployments and the reduced logs from the JobManager. The TaskManager logs show errors leading up to the failure, but do not show anything significant after the restart.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
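For reference, the workaround mentioned in nobleyd's comment above is a cluster-level network setting (its default is 1). A minimal illustrative flink-conf.yaml snippet might look like the following; the value 8 is only an example chosen here, not a recommendation taken from this ticket:

{code:yaml}
# flink-conf.yaml on the TaskManagers
# Allow more than one TCP connection between each pair of TaskManagers.
# The value 8 is an illustrative choice, not taken from this ticket.
taskmanager.network.max-num-tcp-connections: 8
{code}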