Hi, the cluster had been working fine the whole time. It turns out the PVE cluster filesystem (/etc/pve) is not mounted. Here you can also see some of the logs you asked for ;)
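A quick way to double-check that pmxcfs is really unmounted, as a minimal sketch assuming the standard /etc/pve mountpoint and findmnt from util-linux:

# findmnt /etc/pve

findmnt prints the mount entry (FSTYPE fuse) while the cluster filesystem is mounted, and prints nothing at all once it is gone.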
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
 Main PID: 2083 (corosync)
   CGroup: /system.slice/corosync.service
           └─2083 corosync

Mar 08 09:41:28 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:32 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:39 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13

● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
  Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 22868 (code=killed, signal=KILL)

Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.

It seems that "[TOTEM ] Failed to receive the leave message. failed: 13" was the problem.

--
Regards,
Daniel

On 08.03.17, 10:53, "pve-user on behalf of Thomas Lamprecht" <pve-user-boun...@pve.proxmox.com on behalf of t.lampre...@proxmox.com> wrote:

On 03/08/2017 10:40 AM, Daniel wrote:
> Hi there,
>
> A colleague removed one server from the datacenter, and after that the whole cluster is broken:

Did this server act as a multicast querier? That could explain the behavior.

Check whether your switch has IGMP snooping enabled; if so, you could disable it temporarily to see whether that fixes the problem (this may have a performance impact on the whole network, as multicast messages then get delivered to all network members).
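Multicast between the nodes can also be tested directly; a minimal sketch using omping, run on each node at roughly the same time (host01/host02/host03 are placeholders for your actual node names):

# omping -c 600 -i 1 -q host01 host02 host03

Every node then reports its unicast and multicast loss rate against every other node; noticeable loss on the multicast lines points at the switch/IGMP-snooping/querier setup.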
You may also try to enable a querier on one node:

# echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier

> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:01 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 230
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:40800->[10.0.2.110]:161
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:55768->[10.0.2.110]:161
> Mar 8 10:35:02 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 240
> Mar 8 10:35:03 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 250
> Mar 8 10:35:04 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 260
> Mar 8 10:35:05 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 270
> Mar 8 10:35:06 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 280
> Mar 8 10:35:07 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 290
> Mar 8 10:35:08 host01 /usr/share/filebeat/bin/filebeat[20736]: logp.go:230: Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.write_bytes=4907 libbeat.publisher.published_events=76 libbeat.logstash.published_and_acked_events=76 publish.events=76 libbeat.logstash.publish.read_bytes=222 registrar.states.update=76 registrar.writes=6
> Mar 8 10:35:08 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 300
> Mar 8 10:35:09 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 310
> Mar 8 10:35:10 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 320
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
>
> So /etc/pve/ is not mounted anymore and I can't restart anything.
> Does anyone have an idea what could have happened?

What's your corosync and pve-cluster status?

systemctl status corosync pve-cluster

Looks like corosync is dead/broken and does not let our cluster filesystem join.

cheers and good luck,
Thomas
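Should the multicast path turn out to be the culprit, a hedged sketch of the usual recovery sequence once the network is fixed (standard Proxmox VE commands, but verify against your own setup before running them on a production node):

# systemctl restart corosync
# systemctl restart pve-cluster
# pvecm status

Restart corosync first so the node can rejoin the cluster, then pve-cluster so pmxcfs stops retrying cpg_join and remounts /etc/pve; pvecm status should then show quorum and all expected members again.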