[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841133#comment-16841133 ]
Henrik commented on FLINK-12384:
--------------------------------

Thanks for the ping [~gjy] I've updated the issue with that information.

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12384
>                 URL: https://issues.apache.org/jira/browse/FLINK-12384
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Henrik
>            Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
> [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 60000
> [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
> {code}
> Repro:
> Start an etcd cluster, e.g. with etcd-operator, with three members. Start zetcd in front. Configure the session cluster to go against zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>
> --
> Workaround: clean out the etcd cluster and remove all its data. However, this resets all time windows and state, despite having that saved in GCS, so it's a crappy workaround.
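To check whether a restarted TaskManager has hit the state shown in the log excerpt, one option is to scan its log for the two messages that stand out there. The following is only a minimal sketch: the marker strings are copied from the excerpt, the function name `find_symptoms` and the line pattern are assumptions made for illustration, and whether these messages reliably identify the fault has not been confirmed.

```python
import re

# Assumed markers, copied from the log excerpt in this report. Whether they
# reliably indicate the fault is not confirmed.
SYMPTOMS = (
    "Authentication failed",
    "Connected to an old server; r-o mode will be unavailable",
)

# Matches lines shaped like:
# "[tm] 2019-05-01 13:30:53,401 ERROR some.Logger - message"
LOG_LINE = re.compile(
    r"^\[(?P<source>\w+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<message>.*)$"
)


def find_symptoms(lines):
    """Return (timestamp, level, message) for each line containing a marker."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected layout
        message = match.group("message")
        if any(marker in message for marker in SYMPTOMS):
            hits.append((match.group("timestamp"), match.group("level"), message))
    return hits


if __name__ == "__main__":
    sample = [
        "[tm] 2019-05-01 13:30:53,401 ERROR "
        "org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - "
        "Authentication failed",
        "[tm] 2019-05-01 13:30:53,418 INFO "
        "org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - "
        "Trying to start actor system at 10.1.2.173:0",
    ]
    for timestamp, level, message in find_symptoms(sample):
        print(timestamp, level, message)
```

Each log entry has to be on a single line for the pattern to match, so wrapped output from a mail client would need to be rejoined first.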
>
> --
>
> flink-conf.yaml
> {{ parallelism.default: 1}}
> {{ rest.address: analytics-job}}
> {{ jobmanager.rpc.address: analytics-job # = resource manager's address too}}
> {{ jobmanager.heap.size: 1024m}}
> {{ jobmanager.rpc.port: 6123}}
> {{ jobmanager.slot.request.timeout: 30000}}
> {{ resourcemanager.rpc.port: 6123}}
> {{ high-availability.jobmanager.port: 6123}}
> {{ blob.server.port: 6124}}
> {{ queryable-state.server.ports: 6125}}
> {{ taskmanager.heap.size: 1024m}}
> {{ taskmanager.numberOfTaskSlots: 1}}
> {{ web.log.path: /var/lib/log/flink/jobmanager.log}}
> {{ rest.port: 8081}}
> {{ rest.bind-address: 0.0.0.0}}
> {{ web.submit.enable: false}}
> {{ high-availability: zookeeper}}
> {{ high-availability.storageDir: gs://project-id-example_analytics/flink/zetcd/}}
> {{ high-availability.zookeeper.quorum: analytics-zetcd:2181}}
> {{ high-availability.zookeeper.path.root: /flink}}
> {{ high-availability.zookeeper.client.acl: open}}
> {{ state.backend: rocksdb}}
> {{ state.checkpoints.num-retained: 3}}
> {{ state.checkpoints.dir: gs://project-id-example_analytics/flink/checkpoints}}
> {{ state.savepoints.dir: gs://project-id-example_analytics/flink/savepoints}}
> {{ state.backend.incremental: true}}
> {{ state.backend.async: true}}
> {{ fs.hdfs.hadoopconf: /opt/flink/hadoop}}
> {{ log.file: /var/lib/log/flink/jobmanager.log}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)