Hi Jace,

Thanks – I’ve got a few more questions:
· What output do you see if you run ‘sudo service clearwater-etcd stop && sudo service clearwater-etcd start’ on your homestead node?
· Can you let me know the values of etcd_cluster (in /etc/clearwater/local_config) on each of your nodes?

Ellie

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of jace.li...@itri.org.tw
Sent: 01 November 2016 03:45
To: clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] etcd_process execution failed on each node.

Hi Ellie,

Thanks for your reply. I recently reinstalled my deployment with the newest version, Porygon (originally Onix), but the issue still appears, even after rebooting the nodes and using monit restart <process>.

• Can you send me the clearwater-etcd.log?

Here is my clearwater-etcd.log (on the homestead node, IP 192.168.2.205):

root@hs1:/var/log/clearwater-etcd# cat clearwater-etcd.log
2016-11-01 11:27:54.603372 I | etcdmain: etcd Version: 2.2.5
2016-11-01 11:27:54.603449 I | etcdmain: Git SHA: bc9ddf2
2016-11-01 11:27:54.603457 I | etcdmain: Go Version: go1.5.3
2016-11-01 11:27:54.603463 I | etcdmain: Go OS/Arch: linux/amd64
2016-11-01 11:27:54.603478 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2016-11-01 11:27:54.603528 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-11-01 11:27:54.603622 I | etcdmain: listening for peers on http://192.168.2.205:2380
2016-11-01 11:27:54.603664 I | etcdmain: listening for client requests on http://0.0.0.0:4000
2016-11-01 11:27:54.606028 I | etcdserver: recovered store from snapshot at index 1060134
2016-11-01 11:27:54.606049 I | etcdserver: name = 192-168-2-205
2016-11-01 11:27:54.606056 I | etcdserver: data dir = /var/lib/clearwater-etcd/192.168.2.205
2016-11-01 11:27:54.606063 I | etcdserver: member dir = /var/lib/clearwater-etcd/192.168.2.205/member
2016-11-01 11:27:54.606070 I | etcdserver: heartbeat = 100ms
2016-11-01 11:27:54.606076 I | etcdserver: election = 1000ms
2016-11-01 11:27:54.606082 I | etcdserver: snapshot count = 10000
2016-11-01 11:27:54.606096 I | etcdserver: advertise client URLs = http://192.168.2.205:4000
2016-11-01 11:27:54.606124 I | etcdserver: loaded cluster information from store: <nil>
2016-11-01 11:27:54.720487 I | etcdserver: restarting member 1226bb321c91a88e in cluster 877b90a46cdaaa83 at commit index 1069551
2016-11-01 11:27:54.721099 I | raft: 1226bb321c91a88e became follower at term 1125
2016-11-01 11:27:54.721123 I | raft: newRaft 1226bb321c91a88e [peers: [1226bb321c91a88e,4cb5fd19beaa1750,59c1b019f66e6a49,8ac8820f24de7303,96a11f6c3323dc6b,a4a5d4f826d5740a], term: 1125, commit: 1069551, applied: 1060134, lastindex: 1069555, lastterm: 1125]
2016-11-01 11:27:54.744919 E | rafthttp: failed to dial 4cb5fd19beaa1750 on stream Message (the member has been permanently removed from the cluster)
2016-11-01 11:27:54.745228 E | rafthttp: failed to dial 4cb5fd19beaa1750 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2016-11-01 11:27:54.746580 E | rafthttp: failed to dial 59c1b019f66e6a49 on stream Message (the member has been permanently removed from the cluster)
2016-11-01 11:27:54.747532 I | etcdserver: starting server... [version: 2.2.5, cluster version: 2.2]
2016-11-01 11:27:54.748399 E | etcdserver: the member has been permanently removed from the cluster
2016-11-01 11:27:54.748418 I | etcdserver: the data-dir used by this member must be removed.
2016-11-01 11:27:54.748503 E | rafthttp: failed to dial a4a5d4f826d5740a on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2016-11-01 11:27:54.748526 E | rafthttp: failed to dial a4a5d4f826d5740a on stream Message (net/http: request canceled while waiting for connection)
2016-11-01 11:27:54.748544 D | etcdhttp: [GET] /v2/keys/clearwater/site1/configuration/apply_config?quorum=true remote:192.168.2.205:38261
2016-11-01 11:27:54.748642 E | rafthttp: failed to dial 59c1b019f66e6a49 on stream MsgApp v2 (net/http: request canceled)
2016-11-01 11:27:54.748702 E | rafthttp: failed to dial 8ac8820f24de7303 on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2016-11-01 11:27:54.748723 E | rafthttp: failed to dial 8ac8820f24de7303 on stream Message (net/http: request canceled while waiting for connection)
2016-11-01 11:27:54.748761 E | rafthttp: failed to dial 96a11f6c3323dc6b on stream MsgApp v2 (net/http: request canceled)
2016-11-01 11:27:54.748788 E | rafthttp: failed to dial 96a11f6c3323dc6b on stream Message (net/http: request canceled)
2016-11-01 11:27:54.910648 E | etcdhttp: got unexpected response error (etcdserver: server stopped)
2016-11-01 11:27:56.614126 I | etcdmain: etcd Version: 2.2.5
2016-11-01 11:27:56.614168 I | etcdmain: Git SHA: bc9ddf2
2016-11-01 11:27:56.614175 I | etcdmain: Go Version: go1.5.3
2016-11-01 11:27:56.614181 I | etcdmain: Go OS/Arch: linux/amd64
2016-11-01 11:27:56.614191 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2016-11-01 11:27:56.614239 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-11-01 11:27:56.614331 I | etcdmain: listening for peers on http://192.168.2.205:2380
2016-11-01 11:27:56.614377 I | etcdmain: listening for client requests on http://0.0.0.0:4000
2016-11-01 11:27:56.616237 I | etcdserver: recovered store from snapshot at index 1060134
2016-11-01 11:27:56.616251 I | etcdserver: name = 192-168-2-205
2016-11-01 11:27:56.616257 I | etcdserver: data dir = /var/lib/clearwater-etcd/192.168.2.205
2016-11-01 11:27:56.616264 I | etcdserver: member dir = /var/lib/clearwater-etcd/192.168.2.205/member
2016-11-01 11:27:56.616270 I | etcdserver: heartbeat = 100ms
2016-11-01 11:27:56.616275 I | etcdserver: election = 1000ms
2016-11-01 11:27:56.616287 I | etcdserver: snapshot count = 10000
2016-11-01 11:27:56.616297 I | etcdserver: advertise client URLs = http://192.168.2.205:4000
2016-11-01 11:27:56.616317 I | etcdserver: loaded cluster information from store: <nil>
2016-11-01 11:27:56.714780 I | etcdserver: restarting member 1226bb321c91a88e in cluster 877b90a46cdaaa83 at commit index 1069551
2016-11-01 11:27:56.717156 I | raft: 1226bb321c91a88e became follower at term 1125
2016-11-01 11:27:56.717202 I | raft: newRaft 1226bb321c91a88e [peers: [1226bb321c91a88e,4cb5fd19beaa1750,59c1b019f66e6a49,8ac8820f24de7303,96a11f6c3323dc6b,a4a5d4f826d5740a], term: 1125, commit: 1069551, applied: 1060134, lastindex: 1069555, lastterm: 1125]
2016-11-01 11:27:56.729486 E | rafthttp: failed to dial 4cb5fd19beaa1750 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2016-11-01 11:27:56.729790 E | rafthttp: failed to dial 4cb5fd19beaa1750 on stream Message (the member has been permanently removed from the cluster)
2016-11-01 11:27:56.743622 E | rafthttp: failed to dial 8ac8820f24de7303 on stream Message (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-11-01 11:27:56.743702 E | rafthttp: failed to dial 8ac8820f24de7303 on stream MsgApp v2 (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-11-01 11:27:56.744016 E | rafthttp: failed to dial 59c1b019f66e6a49 on stream Message (the member has been permanently removed from the cluster)
2016-11-01 11:27:56.744319 E | rafthttp: failed to dial 59c1b019f66e6a49 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2016-11-01 11:27:56.744598 I | etcdserver: starting server... [version: 2.2.5, cluster version: 2.2]
2016-11-01 11:27:56.745081 E | etcdserver: the member has been permanently removed from the cluster
2016-11-01 11:27:56.745096 I | etcdserver: the data-dir used by this member must be removed.
2016-11-01 11:27:56.745224 E | rafthttp: failed to dial 96a11f6c3323dc6b on stream MsgApp v2 (net/http: request canceled)
2016-11-01 11:27:56.745259 E | rafthttp: failed to dial 96a11f6c3323dc6b on stream Message (net/http: request canceled)
2016-11-01 11:27:56.745302 E | rafthttp: failed to dial a4a5d4f826d5740a on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2016-11-01 11:27:56.745331 E | rafthttp: failed to dial a4a5d4f826d5740a on stream Message (net/http: request canceled while waiting for connection)

• What are the values of etcd_cluster on your nodes?

Here is the result of clearwater-etcdctl member list on bono (where etcd_process runs well). Meanwhile, etcd_process on sprout and homestead fails to start.
4cb5fd19beaa1750: name=192-168-2-206 peerURLs=http://192.168.2.206:2380 clientURLs=http://192.168.2.206:4000 (ellis)
59c1b019f66e6a49: name=192-168-2-204 peerURLs=http://192.168.2.204:2380 clientURLs=http://192.168.2.204:4000 (homer)
96a11f6c3323dc6b: name=192-168-2-201 peerURLs=http://192.168.2.201:2380 clientURLs=http://192.168.2.201:4000 (bono)
a4a5d4f826d5740a: name=192-168-2-203 peerURLs=http://192.168.2.203:2380 clientURLs=http://192.168.2.203:4000 (ralf)

And on the homestead node (where etcd_process failed to start):

root@hs1:/var/log/clearwater-etcd# clearwater-etcdctl member list
Error: dial tcp 192.168.2.205:4000: getsockopt: connection refused

I think the member ID of the homestead node is 1226bb321c91a88e, as that is what appeared in etcd.log and in the member list before etcd_process failed.

• When you restart your VMs, does this change the IP addresses of your VMs at all?

I assigned static IPs to my VMs; the IPs stay the same as I assigned them.

I hope this information helps.

Jace.

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of Eleanor Merry (projectclearwater.org)
Sent: Friday, October 28, 2016 9:31 PM
To: clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] etcd_process execution failed on each node.

Hi Jace,

We see the zmq_msg_recv message when the clearwater alarm agent isn’t running. This is benign, as we don’t mandate that the clearwater alarm agent is installed (we’re tracking this issue at https://github.com/Metaswitch/clearwater-infrastructure/issues/391). I therefore don’t think it’s the cause of your etcd issues.

To help resolve this, can you send me some more information?

• Can you send me the clearwater-etcd.log?
• What are the values of etcd_cluster on your nodes?
• When you restart your VMs, does this change the IP addresses of your VMs at all?
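[Editor's note: the symptom described above — homestead's member ID absent from bono's view of the cluster — can be checked mechanically. A minimal sketch, using the member-list output pasted in this thread as sample data; it is not a substitute for running clearwater-etcdctl on a live node:]

```shell
#!/bin/sh
# Sample `clearwater-etcdctl member list` output from bono, as pasted in this thread.
MEMBER_LIST='4cb5fd19beaa1750: name=192-168-2-206 peerURLs=http://192.168.2.206:2380 clientURLs=http://192.168.2.206:4000
59c1b019f66e6a49: name=192-168-2-204 peerURLs=http://192.168.2.204:2380 clientURLs=http://192.168.2.204:4000
96a11f6c3323dc6b: name=192-168-2-201 peerURLs=http://192.168.2.201:2380 clientURLs=http://192.168.2.201:4000
a4a5d4f826d5740a: name=192-168-2-203 peerURLs=http://192.168.2.203:2380 clientURLs=http://192.168.2.203:4000'

# Homestead's member ID, as seen in its own etcd log.
HOMESTEAD_ID=1226bb321c91a88e

# If the ID is absent from the surviving members' view, the node has been
# removed from the cluster (which matches the "permanently removed" errors
# in homestead's clearwater-etcd.log).
if echo "$MEMBER_LIST" | grep -q "^$HOMESTEAD_ID:"; then
  echo "member still registered"
else
  echo "member removed from cluster"
fi
```

With the data above this prints "member removed from cluster", consistent with homestead's log errors.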
You can run /usr/share/clearwater/bin/clearwater-version (in release 108 onwards) to show when the Project Clearwater packages were built – we don’t have anything that reports what specific Project Clearwater release you’re running though.

Thanks,
Ellie

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of jace.li...@itri.org.tw
Sent: 25 October 2016 02:45
To: Richard Whitehouse; clearwater@lists.projectclearwater.org
Subject: Re: [Project Clearwater] etcd_process execution failed on each node.

Hi Richard,

I don’t think my problem is the issue <https://github.com/Metaswitch/clearwater-etcd/issues/320> you mentioned in your last mail, as my clearwater-etcd.log is quite different from that one, and my Clearwater version is already the latest Onix, release 108. (PS: I followed the manual install instructions and tried the upgrade procedure, but nothing needed to be updated.)

In my case, etcd_process looks unable to use its socket to start successfully and communicate with the other nodes, as the boot.log shows:

cat /var/log/boot.log
………..
zmq_msg_recv: Resource temporarily unavailable
Configuring monit for only localhost access
Error: dial tcp 192.168.2.205:4000: getsockopt: no route to host
Rejoining cluster...
Etcd failed to come up – exiting

I think the root cause is “zmq_msg_recv: Resource temporarily unavailable” (something about a socket error), since I ran the following tests on two deployments.

-------------------First deployment---------------
(1) Six new VMs with official Ubuntu 14.04.02 64-bit server.
(2) Rebooted when the OS install finished (no “zmq_msg_recv: Resource temporarily unavailable” in the boot.log at this point).
(3) Ran the manual installation of Clearwater, then restarted the VMs. The zmq_msg_recv message shows! etcd_process failed to start.
(4) Tried to upgrade to release Onix 108, but everything was already the newest.
(5) etcd_process is still not functional.

------------------Second deployment-----------------
(1) In this deployment I used my six VM images of Clearwater nodes that I set up around October 2015. That deployment had been running smoothly.
(2) Ran the upgrade procedure to the latest version, Onix 108.
(3) “zmq_msg_recv: Resource temporarily unavailable” shows during the upgrade process.
(4) Rebooted after the upgrade finished.
(5) “zmq_msg_recv: Resource temporarily unavailable” shows during boot, and etcd_process became non-functional.

I would like to know which version of Python you use – zmq_msg_recv seems to come from a Python binary. Also, how do I check my Clearwater release version, and how do I install a new machine with a specific version? I think the old version might be good enough for me.

Thank you very much!

Jace.

From: Richard Whitehouse [mailto:richard.whiteho...@metaswitch.com]
Sent: Friday, October 21, 2016 5:43 PM
To: clearwater@lists.projectclearwater.org; 梁維恩 <jace.li...@itri.org.tw>
Subject: RE: etcd_process execution failed on each node.

Jace Liang,

What version of Project Clearwater are you running? From your problem description, you might be hitting https://github.com/Metaswitch/clearwater-etcd/issues/320, which we’ve fixed in the latest release, Onix (release 108). This can cause the etcd cluster to lose quorum and thus prevent it starting up correctly.

If this is the case, you’ve got two options:

1) You can delete your existing installation and install release-108 instead. That’s probably the simplest solution, but you’ll lose all of your data.

2) Alternatively, you can upgrade your nodes to release-108 and then restore the cluster.
As it’s lost quorum, you’ll need to follow the instructions for multiple node recovery, which are documented at http://clearwater.readthedocs.io/en/stable/Handling_Failed_Nodes.html#multiple-failed-nodes

Hope this helps,

Richard

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of jace.li...@itri.org.tw
Sent: 21 October 2016 03:23
To: clearwater@lists.projectclearwater.org
Subject: [Project Clearwater] etcd_process execution failed on each node.

Dear All,

Recently I found that my six VMs (one per node) have trouble running etcd_process. I think I got the config files right, because at first etcd_process was running well and “clearwater-etcdctl cluster-health” showed everything healthy, but at some point it suddenly failed. I tried “monit restart etcd_process”, but it still fails.

Here are some command results for more information (on the ellis node):

root@ellis1:/var/log# clearwater-etcdctl cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 192.168.2.206:4000: getsockopt: connection refused

(.206 is homestead’s IP)

cat /var/log/boot.log
………..
zmq_msg_recv: Resource temporarily unavailable
Configuring monit for only localhost access
Error: dial tcp 192.168.2.205:4000: getsockopt: no route to host
Rejoining cluster...
Etcd failed to come up - exiting

root@ellis1:/var/log/clearwater-etcd# cat clearwater-etcd.log
………….
…………
2016-10-21 10:11:46.686827 I | etcdmain: etcd Version: 2.2.5
2016-10-21 10:11:46.686888 I | etcdmain: Git SHA: bc9ddf2
2016-10-21 10:11:46.686895 I | etcdmain: Go Version: go1.5.3
2016-10-21 10:11:46.686902 I | etcdmain: Go OS/Arch: linux/amd64
2016-10-21 10:11:46.686913 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2016-10-21 10:11:46.686953 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-10-21 10:11:46.687015 I | etcdmain: listening for peers on http://192.168.2.206:2380
2016-10-21 10:11:46.687039 I | etcdmain: listening for client requests on http://192.168.2.206:4000
2016-10-21 10:11:46.689639 I | etcdserver: recovered store from snapshot at index 10001
2016-10-21 10:11:46.689654 I | etcdserver: name = 192-168-2-206
2016-10-21 10:11:46.689660 I | etcdserver: data dir = /var/lib/clearwater-etcd/192.168.2.206
2016-10-21 10:11:46.689668 I | etcdserver: member dir = /var/lib/clearwater-etcd/192.168.2.206/member
2016-10-21 10:11:46.689674 I | etcdserver: heartbeat = 100ms
2016-10-21 10:11:46.689680 I | etcdserver: election = 1000ms
2016-10-21 10:11:46.689686 I | etcdserver: snapshot count = 10000
2016-10-21 10:11:46.689696 I | etcdserver: advertise client URLs = http://192.168.2.206:4000
2016-10-21 10:11:46.689717 I | etcdserver: loaded cluster information from store: <nil>
2016-10-21 10:11:46.726159 I | etcdserver: restarting member 4cb5fd19beaa1750 in cluster 877b90a46cdaaa83 at commit index 14044
2016-10-21 10:11:46.727646 I | raft: 4cb5fd19beaa1750 became follower at term 814
2016-10-21 10:11:46.727690 I | raft: newRaft 4cb5fd19beaa1750 [peers: [1226bb321c91a88e,4cb5fd19beaa1750,8ac8820f24de7303,a4a5d4f826d5740a], term: 814, commit: 14044, applied: 10001, lastindex: 14045, lastterm: 814]
2016-10-21 10:11:46.734589 I | rafthttp: the connection with 1226bb321c91a88e became active
2016-10-21 10:11:46.739284 E | rafthttp: failed to dial 8ac8820f24de7303 on stream Message (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-10-21 10:11:46.740156 E | rafthttp: failed to dial 8ac8820f24de7303 on stream MsgApp v2 (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-10-21 10:11:46.745962 I | etcdserver: starting server... [version: 2.2.5, cluster version: 2.2]
2016-10-21 10:11:46.747252 E | rafthttp: failed to dial a4a5d4f826d5740a on stream Message (dial tcp 192.168.2.203:2380: getsockopt: connection refused)
2016-10-21 10:11:46.747394 E | rafthttp: failed to dial a4a5d4f826d5740a on stream MsgApp v2 (dial tcp 192.168.2.203:2380: getsockopt: connection refused)
2016-10-21 10:11:46.756637 I | rafthttp: the connection with 1226bb321c91a88e became inactive
2016-10-21 10:11:46.756660 E | rafthttp: failed to read 1226bb321c91a88e on stream Message (net/http: request canceled)
2016-10-21 10:11:46.756687 N | etcdserver: removed member 1226bb321c91a88e from cluster 877b90a46cdaaa83
2016-10-21 10:11:46.756766 D | etcdserver: skipped updating attributes of removed member 1226bb321c91a88e
2016-10-21 10:11:46.756853 C | etcdserver: nodeToMember should never fail: raftAttributes key doesn't exist
panic: nodeToMember should never fail: raftAttributes key doesn't exist

It seems the nodes cannot connect to each other, but when I test with ping they can still reach each other. Can anyone give us some advice or a solution?

Thank you.

--
This email may contain confidential information. Please do not use or disclose it in any way and delete it if you are not the intended recipient.
_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org