[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

Henrik (JIRA) Thu, 16 May 2019 01:53:40 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841133#comment-16841133
 ]


Henrik commented on FLINK-12384:
--------------------------------

Thanks for the ping [~gjy]

I've updated the issue with that information.

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be 
> unavailable"
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12384
>                 URL: https://issues.apache.org/jira/browse/FLINK-12384
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Henrik
>            Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - 
> Initiating client connection, connectString=analytics-zetcd:2181 
> sessionTimeout=60000 
> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
> socket connection to server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Using 
> configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR 
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
> Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to 
> start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
> connection established to 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating 
> session
> [tm] 2019-05-01 13:30:53,500 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
> Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
> establishment complete on server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
> 0xbf06a739001d446, negotiated timeout = 60000
> [tm] 2019-05-01 13:30:53,525 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: CONNECTED{code}
> Repro:
> Start an etcd cluster with three members, e.g. with etcd-operator. Start 
> zetcd in front. Configure the session cluster to use zetcd as its ZooKeeper 
> quorum.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in 
> between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>  
> —
> Workaround: clean out the etcd cluster and remove all its data. However, this 
> resets all time windows and state, despite that state being saved in GCS, so 
> it's a crappy workaround.
>  
> --
>  
> flink-conf.yaml
> {{ parallelism.default: 1}}
> {{ rest.address: analytics-job}}
> {{ jobmanager.rpc.address: analytics-job # = resource manager's address too}}
> {{ jobmanager.heap.size: 1024m}}
> {{ jobmanager.rpc.port: 6123}}
> {{ jobmanager.slot.request.timeout: 30000}}
> {{ resourcemanager.rpc.port: 6123}}
> {{ high-availability.jobmanager.port: 6123}}
> {{ blob.server.port: 6124}}
> {{ queryable-state.server.ports: 6125}}
> {{ taskmanager.heap.size: 1024m}}
> {{ taskmanager.numberOfTaskSlots: 1}}
> {{ web.log.path: /var/lib/log/flink/jobmanager.log}}
> {{ rest.port: 8081}}
> {{ rest.bind-address: 0.0.0.0}}
> {{ web.submit.enable: false}}
> {{ high-availability: zookeeper}}
> {{ high-availability.storageDir: 
> gs://project-id-example_analytics/flink/zetcd/}}
> {{ high-availability.zookeeper.quorum: analytics-zetcd:2181}}
> {{ high-availability.zookeeper.path.root: /flink}}
> {{ high-availability.zookeeper.client.acl: open}}
> {{ state.backend: rocksdb}}
> {{ state.checkpoints.num-retained: 3}}
> {{ state.checkpoints.dir: 
> gs://project-id-example_analytics/flink/checkpoints}}
> {{ state.savepoints.dir: 
> gs://project-id-example_analytics/flink/savepoints}}
> {{ state.backend.incremental: true}}
> {{ state.backend.async: true}}
> {{ fs.hdfs.hadoopconf: /opt/flink/hadoop}}
> {{ log.file: /var/lib/log/flink/jobmanager.log}}
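
For anyone trying to reproduce this, the "kill the etcd pods one by one, letting the quorum re-establish in between" step from the description could be scripted roughly as below. This is only a sketch and not part of the original report: `roll_members` and `kubectl_delete_pod` are hypothetical helper names, the pod names are placeholders, and the health check is left to the caller (for example, something wrapping `etcdctl endpoint health`) because how to reach etcd depends on the cluster setup.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def roll_members(
    members: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart members one at a time, waiting for a healthy quorum in between.

    Returns the members that were rolled, in order. Raises TimeoutError if the
    quorum does not report healthy within timeout_seconds after a restart.
    """
    rolled: List[str] = []
    for member in members:
        restart(member)
        waited = 0.0
        # Do not touch the next member until the caller's check passes.
        while not is_healthy():
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"quorum not healthy after restarting {member}"
                )
            sleep(poll_seconds)
            waited += poll_seconds
        rolled.append(member)
    return rolled


def kubectl_delete_pod(name: str) -> None:
    """Hypothetical restart hook: delete the pod and let its controller recreate it.

    Assumes kubectl is on PATH and the pod lives in the current namespace.
    """
    subprocess.run(["kubectl", "delete", "pod", name], check=True)
```

A call would look like `roll_members(["etcd-0", "etcd-1", "etcd-2"], kubectl_delete_pod, my_health_check)`, where the pod names and `my_health_check` are placeholders for whatever the etcd deployment actually uses. After it returns, restart the job/tm pods as in the last repro step.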



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
