[ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12384:
---------------------------
    Description: 
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=60000 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 60000
[tm] 2019-05-01 13:30:53,525 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: CONNECTED{code}
Repro:

Start an etcd cluster with three members, e.g. with etcd-operator. Start zetcd 
in front of it. Configure the Flink session cluster to connect through zetcd.

Ensure the job can start successfully.

Now kill the etcd pods one by one, letting the quorum re-establish in between 
so that the cluster stays healthy.

Now restart the job/tm pods. You'll end up in this no-man's-land.
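For reference, the repro above can be sketched against a Kubernetes cluster roughly like this (the label selectors and sleep interval are illustrative assumptions, not taken from the actual deployment):

```shell
# Roll the etcd members one at a time, letting the quorum re-establish
# before killing the next member (labels/names are hypothetical).
for pod in $(kubectl get pods -l app=etcd -o name); do
  kubectl delete "$pod"
  sleep 60   # give the replacement member time to rejoin the quorum
done

# Then restart the Flink jobmanager/taskmanager pods; on restart they
# end up in the "Connected to an old server; r-o mode will be
# unavailable" state shown in the log above.
kubectl delete pods -l app=analytics-job
kubectl delete pods -l app=analytics-tm
```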

 

—

Workaround: clean out the etcd cluster and remove all its data. However, this 
resets all time windows and state, even though they are saved in GCS, so it's a 
poor workaround.

 

–

 

flink-conf.yaml
{code:java}
parallelism.default: 1
rest.address: analytics-job
jobmanager.rpc.address: analytics-job # = resource manager's address too
jobmanager.heap.size: 1024m
jobmanager.rpc.port: 6123
jobmanager.slot.request.timeout: 30000
resourcemanager.rpc.port: 6123
high-availability.jobmanager.port: 6123
blob.server.port: 6124
queryable-state.server.ports: 6125
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
web.log.path: /var/lib/log/flink/jobmanager.log
rest.port: 8081
rest.bind-address: 0.0.0.0
web.submit.enable: false
high-availability: zookeeper
high-availability.storageDir: gs://example_analytics/flink/zetcd/
high-availability.zookeeper.quorum: analytics-zetcd:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.client.acl: open
state.backend: rocksdb
state.checkpoints.num-retained: 3
state.checkpoints.dir: gs://example_analytics/flink/checkpoints
state.savepoints.dir: gs://example_analytics/flink/savepoints
state.backend.incremental: true
state.backend.async: true
fs.hdfs.hadoopconf: /opt/flink/hadoop
log.file: /var/lib/log/flink/jobmanager.log{code}
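Unrelated to the r-o mode problem itself, but the JAAS warning and the Curator "Authentication failed" error in the log can usually be silenced by disabling SASL for the ZooKeeper client. A sketch of the addition to flink-conf.yaml, assuming the {{zookeeper.sasl.disable}} option is available in the Flink version in use:

```yaml
# Hypothetical addition: stop the shaded ZooKeeper client from
# attempting SASL/JAAS authentication (zetcd does not support it).
zookeeper.sasl.disable: true
```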


> Rolling the etcd servers causes "Connected to an old server; r-o mode will be 
> unavailable"
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12384
>                 URL: https://issues.apache.org/jira/browse/FLINK-12384
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Henrik
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
