Re: [Pacemaker] Losing corosync communication clusterwide

Tomasz Kontusz writes:

> Hanging corosync sounds like libqb problems: trusty comes with 0.16,
> which likes to hang from time to time. Try building libqb 0.17.

It was already reported on the Ubuntu tracker[1].

Regards.

Footnotes:
[1] https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Losing corosync communication clusterwide

> On 11 Nov 2014, at 10:12 pm, Daniel Dehennin wrote:
>
> Andrew Beekhof writes:
>
> [...]
>
>>> I have fencing configured and working, modulo fencing VMs on dead host[1].
>>
>> Are you saying that the host and the VMs running inside it are both part of
>> the same cluster?
>
> Yes, one of the VMs needs to access the GFS2 filesystem like the nodes;
> the other VM is a quorum node (standby=on).

That sounds like a recipe for disaster, to be honest. If you want VMs to be part of a cluster, it would be advisable to have their host(s) be in a different one.
Re: [Pacemaker] Losing corosync communication clusterwide

Andrew Beekhof writes:

[...]

>> I have fencing configured and working, modulo fencing VMs on dead host[1].
>
> Are you saying that the host and the VMs running inside it are both part of
> the same cluster?

Yes, one of the VMs needs to access the GFS2 filesystem like the nodes;
the other VM is a quorum node (standby=on).

Regards.

--
Daniel Dehennin
Re: [Pacemaker] Losing corosync communication clusterwide

> On 11 Nov 2014, at 4:39 am, Daniel Dehennin wrote:
>
> emmanuel segura writes:
>
>> I think you don't have fencing configured in your cluster.
>
> I have fencing configured and working, modulo fencing VMs on dead host[1].

Are you saying that the host and the VMs running inside it are both part of the same cluster?

> Footnotes:
> [1] http://oss.clusterlabs.org/pipermail/pacemaker/2014-November/022965.html
Re: [Pacemaker] Losing corosync communication clusterwide

Tomasz Kontusz writes:

> Hanging corosync sounds like libqb problems: trusty comes with 0.16,
> which likes to hang from time to time. Try building libqb 0.17.

Thanks, I'll look at this.

Is there a way to get back to a normal state without rebooting all machines and interrupting services? I thought about a lightweight version of something like:

1. stop pacemaker on all nodes without doing anything with resources; they all continue to run
2. stop corosync on all nodes
3. start corosync on all nodes
4. start pacemaker on all nodes; as the services are already running, nothing needs to be done

I looked in the documentation but failed to find any kind of cluster management best practices.

Regards.

--
Daniel Dehennin
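The four steps above could be sketched as a script along these lines. This is only a sketch: the node names nebula1-3 are hypothetical, and it additionally flips Pacemaker's maintenance-mode cluster property, which tells Pacemaker to leave all resources unmanaged so that stopping the stack does not stop the services themselves.

```shell
# Sketch of the restart sequence above (hypothetical node names).
# DRY_RUN=1 prints the plan instead of executing it.
NODES="${NODES:-nebula1 nebula2 nebula3}"

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

restart_cluster_stack() {
    # Make Pacemaker leave running resources alone during the restart.
    run crm_attribute --type crm_config --name maintenance-mode --update true
    for n in $NODES; do                  # steps 1 and 2: stop pacemaker, then corosync
        run ssh "$n" "service pacemaker stop && service corosync stop"
    done
    for n in $NODES; do                  # steps 3 and 4: start corosync, then pacemaker
        run ssh "$n" "service corosync start && service pacemaker start"
    done
    run crm_attribute --type crm_config --name maintenance-mode --delete
}

DRY_RUN=1
restart_cluster_stack
```

With DRY_RUN=1 the function only prints the intended commands; review the plan, then run it without DRY_RUN to execute for real.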
Re: [Pacemaker] Losing corosync communication clusterwide

emmanuel segura writes:

> I think you don't have fencing configured in your cluster.

I have fencing configured and working, modulo fencing VMs on dead host[1].

Regards.

Footnotes:
[1] http://oss.clusterlabs.org/pipermail/pacemaker/2014-November/022965.html

--
Daniel Dehennin
Re: [Pacemaker] Losing corosync communication clusterwide

Hanging corosync sounds like libqb problems: trusty comes with 0.16, which likes to hang from time to time. Try building libqb 0.17.

Daniel Dehennin wrote:

> Hello,
>
> I just had an issue on my pacemaker setup: my dlm/clvm/gfs2 stack was
> blocked.
>
> The “dlm_tool ls” command told me “wait ringid”.
>
> The corosync-* commands hang (like corosync-quorumtool).
>
> The pacemaker “crm_mon” displays nothing wrong.
>
> I'm using Ubuntu Trusty Tahr:
>
> - corosync 2.3.3-1ubuntu1
> - pacemaker 1.1.10+git20130802-1ubuntu2.1
>
> My cluster was manually rebooted.
>
> Any idea how to debug such a situation?
>
> Regards.

--
Sent from K-9 Mail.
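libqb builds with plain autotools, so getting 0.17 onto Trusty from source might look roughly like the following. A sketch only: the v0.17.0 tag name and the /usr prefix are assumptions and should be checked against the ClusterLabs/libqb releases.

```shell
# Sketch: build libqb 0.17 from source on Trusty.
# DRY_RUN=1 prints the plan instead of executing it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

build_libqb() {
    run sudo apt-get install -y build-essential git autoconf automake libtool pkg-config
    run git clone https://github.com/ClusterLabs/libqb.git
    run cd libqb
    run git checkout v0.17.0          # assumed tag name; check the release list
    run ./autogen.sh
    run ./configure --prefix=/usr     # replace the packaged 0.16 in place
    run make
    run sudo make install
    run sudo ldconfig                 # so corosync picks up the rebuilt library
}

DRY_RUN=1
build_libqb
```

Corosync would need a restart afterwards so it runs against the new library rather than the still-mapped 0.16.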
Re: [Pacemaker] Losing corosync communication clusterwide

I think you don't have fencing configured in your cluster.

2014-11-10 17:02 GMT+01:00 Daniel Dehennin:

> Daniel Dehennin writes:
>
>> Hello,
>
> Hello,
>
>> I just had an issue on my pacemaker setup: my dlm/clvm/gfs2 stack was
>> blocked.
>>
>> The “dlm_tool ls” command told me “wait ringid”.
>
> It happened again:
>
> root@nebula2:~# dlm_tool ls
> dlm lockspaces
> name          datastores
> id            0x1b61ba6a
> flags         0x0004 kern_stop
> change        member 4 joined 1 remove 0 failed 0 seq 3,3
> members       1084811078 1084811079 1084811080 108489
> new change    member 3 joined 0 remove 1 failed 1 seq 4,4
> new status    wait ringid
> new members   1084811078 1084811079 1084811080
>
> name          clvmd
> id            0x4104eefa
> flags         0x0004 kern_stop
> change        member 4 joined 1 remove 0 failed 0 seq 3,3
> members       1084811078 1084811079 1084811080 108489
> new change    member 3 joined 0 remove 1 failed 1 seq 4,4
> new status    wait ringid
> new members   1084811078 1084811079 1084811080
>
> root@nebula2:~# dlm_tool status
> cluster nodeid 1084811079 quorate 1 ring seq 21372 21372
> daemon now 8351 fence_pid 0
> fence 108489 nodedown pid 0 actor 0 fail 1415634527 fence 0 now 1415634734
> node 1084811078 M add 5089 rem 0 fail 0 fence 0 at 0 0
> node 1084811079 M add 5089 rem 0 fail 0 fence 0 at 0 0
> node 1084811080 M add 5089 rem 0 fail 0 fence 0 at 0 0
> node 108489 X add 5766 rem 8144 fail 8144 fence 0 at 0 0
>
> Any idea?

--
This is my life and I live it for as long as God wills.
Re: [Pacemaker] Losing corosync communication clusterwide

Daniel Dehennin writes:

> Hello,

Hello,

> I just had an issue on my pacemaker setup: my dlm/clvm/gfs2 stack was
> blocked.
>
> The “dlm_tool ls” command told me “wait ringid”.

It happened again:

root@nebula2:~# dlm_tool ls
dlm lockspaces
name          datastores
id            0x1b61ba6a
flags         0x0004 kern_stop
change        member 4 joined 1 remove 0 failed 0 seq 3,3
members       1084811078 1084811079 1084811080 108489
new change    member 3 joined 0 remove 1 failed 1 seq 4,4
new status    wait ringid
new members   1084811078 1084811079 1084811080

name          clvmd
id            0x4104eefa
flags         0x0004 kern_stop
change        member 4 joined 1 remove 0 failed 0 seq 3,3
members       1084811078 1084811079 1084811080 108489
new change    member 3 joined 0 remove 1 failed 1 seq 4,4
new status    wait ringid
new members   1084811078 1084811079 1084811080

root@nebula2:~# dlm_tool status
cluster nodeid 1084811079 quorate 1 ring seq 21372 21372
daemon now 8351 fence_pid 0
fence 108489 nodedown pid 0 actor 0 fail 1415634527 fence 0 now 1415634734
node 1084811078 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 1084811079 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 1084811080 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 108489 X add 5766 rem 8144 fail 8144 fence 0 at 0 0

Any idea?

--
Daniel Dehennin
[Pacemaker] Losing corosync communication clusterwide

Hello,

I just had an issue on my pacemaker setup: my dlm/clvm/gfs2 stack was blocked.

The “dlm_tool ls” command told me “wait ringid”.

The corosync-* commands hang (like corosync-quorumtool).

The pacemaker “crm_mon” displays nothing wrong.

I'm using Ubuntu Trusty Tahr:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

My cluster was manually rebooted.

Any idea how to debug such a situation?

Regards.

--
Daniel Dehennin
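On the "how to debug" question, the usual first step when the corosync-* tools hang is to find out what the corosync process itself is blocked on. A generic sketch (not from the thread): the log path is an assumption, since corosync logs to syslog unless a logfile is configured, and reading /proc/PID/stack needs root.

```shell
# Sketch: first-pass checks for a hung corosync.
# DRY_RUN=1 prints the commands instead of running them.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

debug_corosync_hang() {
    run corosync-cfgtool -s                   # ring status, if corosync still answers
    pid=$(pidof corosync || echo '<pid>')
    run cat "/proc/$pid/stack"                # kernel side: what syscall is it stuck in? (root)
    run gdb -batch -p "$pid" -ex 'thread apply all bt'   # user side: thread backtraces
    run tail -n 100 /var/log/corosync/corosync.log       # assumed path; only if file logging is on
}

DRY_RUN=1
debug_corosync_hang
```

If the backtraces show corosync spinning or blocked inside libqb, that would support the libqb-hang theory and give something concrete to attach to a bug report.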