[ https://issues.apache.org/jira/browse/FLINK-29117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
geonyeong kim updated FLINK-29117:
----------------------------------
    Description:
Hello.
I am planning to deploy and use FlinkDeployment through the Flink Kubernetes operator.
CRD, operator, webhook, etc. are all set up, and we actually deployed a FlinkDeployment to confirm normal operation.

*However, strangely, connecting to the ResourceManager fails if you create more than one TaskManager pod replica.*

I thought it might be a problem with Akka, timeouts, etc., so I increased the values as listed below, but the connection continues to fail.
- akka.retry-gate-closed-for: 10000
- akka.server-socket-worker-pool.pool-size-min: 6
- akka.server-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-min: 6
- blob.client.connect.timeout: 30000

The log of the TaskManager is as follows.

{code:java}
Association with remote system [akka.tcp://flink@10.238.80.92:6123] has failed, address is now gated for [10000] ms. Reason: [Disassociated]
Could not resolve ResourceManager address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1.
Tried to associate with unreachable remote address [akka.tcp://flink@10.238.80.92:6123]. Address is now gated for 10000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
{code}

*If you go into the TaskManager pod and run a TCP check, the connection is open.*

*Below are the Flink versions I used.*
- flink image: 1.15.1
- flink kubernetes operator: 1.1.0

*I would appreciate it if you could check the problem quickly.*
*If it's a bug, please tell me how to work around it in the current situation.*

  was:
Hello.
I am planning to deploy and use FlinkDeployment through the Flink Kubernetes operator.
CRD, operator, webhook, etc. are all set up, and we actually deployed a FlinkDeployment to confirm normal operation.

*However, strangely, connecting to the ResourceManager fails if you create more than one TaskManager pod replica.*

I thought it might be a problem with Akka, timeouts, etc., so I increased the values as listed below, but the connection continues to fail.
- akka.retry-gate-closed-for: 10000
- akka.server-socket-worker-pool.pool-size-min: 6
- akka.server-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-min: 6
- blob.client.connect.

The log of the TaskManager is as follows.

{code:java}
Association with remote system [akka.tcp://flink@10.238.80.92:6123] has failed, address is now gated for [10000] ms. Reason: [Disassociated]
Could not resolve ResourceManager address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1.
Tried to associate with unreachable remote address [akka.tcp://flink@10.238.80.92:6123]. Address is now gated for 10000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
{code}

*If you go into the TaskManager pod and run a TCP check, the connection is open.*

*Below are the Flink versions I used.*
- flink image: 1.15.1
- flink kubernetes operator: 1.1.0

*I would appreciate it if you could check the problem quickly.*
*If it's a bug, please tell me how to work around it in the current situation.*


> Tried to associate with unreachable remote resourcemanager address
> ------------------------------------------------------------------
>
>                 Key: FLINK-29117
>                 URL: https://issues.apache.org/jira/browse/FLINK-29117
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, flink-contrib, flink-docker, Kubernetes Operator
>   Affects Versions: 1.15.1, kubernetes-operator-1.1.0
>            Reporter: geonyeong kim
>            Priority: Critical
>         Attachments: taskmanager_log.png
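
The TCP check mentioned in the report can be reproduced from inside a TaskManager pod with something like the following. This is only a minimal sketch and not part of the original report: `tcp_port_open` is a hypothetical helper name, and the host and port are the ones that appear in the TaskManager log. Note that a successful connect only shows that the port is reachable at the TCP level. It does not by itself rule out the Akka-level quarantine reported in the log.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    # Address and port copied from the TaskManager log in the report.
    print(tcp_port_open("10.238.80.92", 6123))
```

Running the same check from every TaskManager replica may help show whether the failure is tied to specific pods or affects all of them.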