[jira] [Updated] (FLINK-29117) Tried to associate with unreachable remote resourcemanager address

geonyeong kim (Jira) Fri, 26 Aug 2022 01:53:06 -0700


     [ https://issues.apache.org/jira/browse/FLINK-29117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


geonyeong kim updated FLINK-29117:
----------------------------------
    Description: 
Hello.

I am planning to deploy and use FlinkDeployment resources through the Flink 
Kubernetes Operator.

The CRD, operator, webhook, etc. are all set up, and we deployed a 
FlinkDeployment to confirm that it operates normally.

*However, strangely, connecting to the ResourceManager fails as soon as there 
is more than one TaskManager pod replica.*

I thought it might be a problem with Akka, timeouts, etc., so I increased the 
values as shown below, but the connection continues to fail.
 - akka.retry-gate-closed-for: 10000
 - akka.server-socket-worker-pool.pool-size-min: 6
 - akka.server-socket-worker-pool.pool-size-max: 10
 - akka.client-socket-worker-pool.pool-size-max: 10
 - akka.client-socket-worker-pool.pool-size-min: 6
 - blob.client.connect.timeout: 30000
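
For readers reproducing this setup, the sketch below renders the settings listed above as a `flinkConfiguration` map with string values. The report does not say where the options were set, so placing them under `spec.flinkConfiguration` of the FlinkDeployment is an assumption, and `render_flink_configuration` is a hypothetical helper name used only for illustration.

```python
# Values copied from the list above. Where they were actually configured is
# not stated in the report; spec.flinkConfiguration is an assumption.
SETTINGS = {
    "akka.retry-gate-closed-for": 10000,
    "akka.server-socket-worker-pool.pool-size-min": 6,
    "akka.server-socket-worker-pool.pool-size-max": 10,
    "akka.client-socket-worker-pool.pool-size-max": 10,
    "akka.client-socket-worker-pool.pool-size-min": 6,
    "blob.client.connect.timeout": 30000,
}


def render_flink_configuration(settings, indent=2):
    """Render settings as a YAML-style fragment with quoted string values."""
    pad = " " * indent
    lines = ["flinkConfiguration:"]
    for key, value in settings.items():
        # Every value is quoted so that it is read as a string.
        lines.append(f'{pad}{key}: "{value}"')
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_flink_configuration(SETTINGS))
```

The printed fragment would still need to be indented to fit under `spec:` in the actual manifest.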

The log of the taskmanager is as follows.

 
{code:java}
Association with remote system [akka.tcp://flink@10.238.80.92:6123] has failed, 
address is now gated for [10000] ms. Reason: [Disassociated] 
Could not resolve ResourceManager address 
akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1, retrying in 
10000 ms: Could not connect to rpc endpoint under address 
akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1. 
Tried to associate with unreachable remote address 
[akka.tcp://flink@10.238.80.92:6123]. Address is now gated for 10000 ms, all 
messages to this address will be delivered to dead letters. Reason: [The remote 
system has quarantined this system. No further associations to the remote 
system are possible until this system is restarted.]  {code}
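
The log above contains two distinct failure reasons for the same address: a plain `Disassociated` and a quarantine notice. As a rough aid for spotting this in longer logs, the sketch below extracts each gated address, gate duration, and reason. The regular expression is fitted only to the two message shapes quoted above, and `GateEvent` and `find_gate_events` are hypothetical names, not part of Flink or Akka.

```python
import re
from typing import List, NamedTuple


class GateEvent(NamedTuple):
    address: str
    gate_ms: int
    reason: str


# Matches "[akka.tcp://...] ... gated for [10000] ms ... Reason: [...]",
# with or without brackets around the duration.
_GATE_RE = re.compile(
    r"\[(akka\.tcp://[^\]]+)\].*?gated for \[?(\d+)\]? ms.*?Reason: \[([^\]]*)\]"
)


def find_gate_events(log_text: str) -> List[GateEvent]:
    """Return every 'address is now gated' event found in the log text."""
    # Collapse line wraps so that messages split across lines still match.
    flat = " ".join(log_text.split())
    return [
        GateEvent(m.group(1), int(m.group(2)), m.group(3))
        for m in _GATE_RE.finditer(flat)
    ]


SAMPLE_LOG = """
Association with remote system [akka.tcp://flink@10.238.80.92:6123] has failed,
address is now gated for [10000] ms. Reason: [Disassociated]
Tried to associate with unreachable remote address
[akka.tcp://flink@10.238.80.92:6123]. Address is now gated for 10000 ms, all
messages to this address will be delivered to dead letters. Reason: [The remote
system has quarantined this system. No further associations to the remote
system are possible until this system is restarted.]
"""

if __name__ == "__main__":
    for event in find_gate_events(SAMPLE_LOG):
        kind = "QUARANTINED" if "quarantined" in event.reason else "gated"
        print(kind, event.address, event.gate_ms)
```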
*When I go into the TaskManager pod and run a TCP check, the connection is open.*
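
The report does not say which tool was used for the TCP check. A minimal sketch of such a check, assuming Python is available inside the pod, could look like this (`tcp_port_open` is a hypothetical helper name):

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_port_open("10.238.80.92", 6123)` from inside the TaskManager pod would correspond to the check described above. A successful result only shows that the port accepts TCP connections; it does not show that an Akka association can be established, which is the step that fails in the log.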

*Below are the Flink versions I used.*
 - flink image: 1.15.1
 - flink kubernetes operator: 1.1.0

 

*I would appreciate it if you could look into the problem quickly.*
*If it is a bug, please tell me how to work around it in the current situation.*


> Tried to associate with unreachable remote resourcemanager address
> ------------------------------------------------------------------
>
>                 Key: FLINK-29117
>                 URL: https://issues.apache.org/jira/browse/FLINK-29117
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, flink-contrib, flink-docker, 
> Kubernetes Operator
>    Affects Versions: 1.15.1, kubernetes-operator-1.1.0
>            Reporter: geonyeong kim
>            Priority: Critical
>         Attachments: taskmanager_log.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
