Re: [ClusterLabs] epic fail
On 07/24/2017 11:34 AM, Ken Gaillot wrote:
> On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
>> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
>>> Lsof/fuser show the PID of the process holding FS open as "kernel".
>>
>> That could be the NFS server running in the kernel.
>
> Dimitri,
>
> Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
> Did pacemaker try to stop it before stopping DRBD?

See the other post w/ the log. Sorry for trimming it off of the first
one -- I can repost the whole thing if it makes it easier. Yes, it
successfully stopped dovecot @ 14:03:46, nfs_server @ 14:03:47, removed
all the symlinks, and failed to unmount /raid @ 14:03:47.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] epic fail
On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
>> Lsof/fuser show the PID of the process holding FS open as "kernel".
>
> That could be the NFS server running in the kernel.

Dimitri,

Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
Did pacemaker try to stop it before stopping DRBD?

--
Ken Gaillot
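For readers following along: the ordering Ken asks about is expressed with ordering and colocation constraints. A minimal sketch using pcs syntax, taking the resource names that appear in the thread's logs (drbd_filesystem, server_nfs) as the assumed resource IDs:

```
# Sketch only -- verify the real resource IDs and existing constraints
# with `pcs constraint show` before touching a live CIB.

# Stop order is the reverse of start order, so this makes pacemaker
# stop server_nfs before it tries to unmount drbd_filesystem:
pcs constraint order drbd_filesystem then server_nfs

# And keep the NFS server on the node where the filesystem is mounted:
pcs constraint colocation add server_nfs with drbd_filesystem INFINITY
```

Without the ordering constraint, the stop of the Filesystem resource can race the in-kernel NFS server, which matches the "target is busy" failures in the logs.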
Re: [ClusterLabs] epic fail
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote:
> Standby is not necessary, it's just a cautious step that allows the
> admin to verify that all resources moved off correctly. The restart
> that yum does should be sufficient for pacemaker to move everything.
>
> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

Right, the pacemaker upgrade itself might not be the biggest problem.
I've seen other package upgrades cause RA monitors to return results
like $OCF_NOT_RUNNING or $OCF_ERR_INSTALLED. This of course causes the
cluster to react, so I prefer the node standby option :)

In this case pacemaker was trying to stop the resources, the stop
action failed, and the upgrading node was killed off by the second node
trying to clean up the mess. The resources should have come up on the
second node after that.

--
Valentin
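The standby-first sequence Valentin prefers can be sketched with pcs (the node name here is hypothetical; `crm node standby` is the crmsh equivalent on older stacks):

```
pcs cluster standby node1     # drain resources off node1
crm_mon -1                    # confirm everything is running on the peer
yum update                    # packages (pacemaker, RAs, kernel) can now change safely
pcs cluster unstandby node1   # let node1 host resources again
```

The point of the `crm_mon -1` step is exactly the verification Ken describes: catching a failed resource migration before the package transaction starts, rather than mid-transaction.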
Re: [ClusterLabs] epic fail
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stopping NFS server ...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS server and services...
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS server and services.
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS Mount Daemon...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFSv4 ID-name mapping service...
> Jul 22 14:03:46 zebrafish rpc.mountd[2655]: Caught signal 15, un-registering and exiting.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFSv4 ID-name mapping service.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS Mount Daemon.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: threads
> Jul 22 14:03:46 zebrafish kernel: nfsd: last server has exited, flushing export cache
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS status monitor for NFSv2/3 locking...
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS status monitor for NFSv2/3 locking.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-statd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-idmapd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-mountd
> Jul 22 14:03:46 zebrafish systemd: Stopping RPC bind service...
> Jul 22 14:03:46 zebrafish systemd: Stopped RPC bind service.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpcbind
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-gssd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: umount (1/10 attempts)
> Jul 22 14:03:47 zebrafish nfsserver(server_nfs)[6614]: INFO: NFS server stopped
> Jul 22 14:03:47 zebrafish crmd[1078]: notice: Result of stop operation for server_nfs on zebrafish: 0 (ok)
> Jul 22 14:03:47 zebrafish crmd[1078]: notice: Initiating stop operation floating_ip_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Result of stop operation for server_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Initiating stop operation symlink_etc_pki_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish IPaddr2(floating_ip)[6769]: INFO: IP status = ok, IP_CIP=
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Initiating stop operation symlink_var_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Result of stop operation for floating_ip on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_pki)[6821]: INFO: removed '/etc/pki'
> Jul 22 14:03:48 zebrafish symlink(symlink_var_dovecot)[6822]: INFO: removed '/var/spool/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Result of stop operation for symlink_var_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Initiating stop operation symlink_etc_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Result of stop operation for symlink_etc_pki on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_dovecot)[6863]: INFO: removed '/etc/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Result of stop operation for symlink_etc_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]: notice: Initiating stop operation drbd_filesystem_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM ...

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] epic fail
NFS server/share is also managed by pacemaker and the order is set
right?

Best regards,
Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – professional hosting and server services
at fair prices.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350, registered with the Municipal Court in Prague

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010 0024 0033 0446

> On 24 Jul 2017, at 18:01, Dimitri Maziuk wrote:
>
> On 07/24/2017 10:38 AM, Ken Gaillot wrote:
>
>> A restart shouldn't lead to fencing in any case where something's not
>> going seriously wrong. I'm not familiar with the "kernel is using it"
>> message, I haven't run into that before.
>
> I posted it at least once before.
>
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running stop for /dev/drbd0 on /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to unmount /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, active_time=260 secs
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid, giving up!
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes
Re: [ClusterLabs] epic fail
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> Lsof/fuser show the PID of the process holding FS open as "kernel".

That could be the NFS server running in the kernel.

--
Valentin
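When fuser/lsof point at "kernel", the in-kernel NFS server is the usual suspect. A few checks that distinguish a lingering knfsd from a stale mount, as a diagnostic sketch (standard Linux nfsd procfs paths assumed; /proc/fs/nfsd is only populated while the nfsd filesystem is mounted):

```
cat /proc/fs/nfsd/threads    # non-zero: nfsd threads are still running
exportfs -v                  # exports still registered pin the filesystem
grep /raid /proc/mounts      # check whether /raid is still mounted at all
```

If nfsd threads are gone and exports are flushed but umount still reports "target is busy", the holder is something else entirely, and `fuser -vm /raid` on the mount point is the next step.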
Re: [ClusterLabs] epic fail
On 07/24/2017 10:38 AM, Ken Gaillot wrote:
> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

I posted it at least once before.

> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, active_time=260 secs
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't unmount /raid, giving up!
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ the device is found by lsof(8) or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with KILL ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice: drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]: notice:
Re: [ClusterLabs] epic fail
On Mon, 2017-07-24 at 17:13 +0200, Kristián Feldsam wrote:
> Hmm, so when you know that it happens also when putting the node in
> standby, then why do you run yum update on a live cluster? It must be
> clear that the node will be fenced.

Standby is not necessary, it's just a cautious step that allows the
admin to verify that all resources moved off correctly. The restart
that yum does should be sufficient for pacemaker to move everything.

A restart shouldn't lead to fencing in any case where something's not
going seriously wrong. I'm not familiar with the "kernel is using it"
message, I haven't run into that before.

The only case where special handling was needed before a yum update is
a node running pacemaker_remote instead of the full cluster stack,
before pacemaker 1.1.15.

> Would you post your pacemaker config? + some logs?
>
> > On 24 Jul 2017, at 17:04, Dimitri Maziuk wrote:
> >
> > On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> >
> > > Would there be an interest, though? And would that be meaningful?
> >
> > IMO the only reason to put a node in standby is if you want to
> > reboot the active node with no service interruption. For anything
> > else, including a reboot with service interruption (during a
> > maintenance window), it's a no.
> >
> > This is akin to "your mouse has moved, windows needs to be
> > restarted". Except the mouse thing is a joke whereas those
> > "standby" clowns appear to be serious.
> >
> > With this particular failure, something in the Redhat patched
> > kernel (NFS?) does not release the DRBD filesystem. It happens when
> > I put the node in standby as well, the only difference is not
> > messing up the RPM database, which isn't that hard to fix. Since I
> > have several centos 6 + DRBD + NFS + heartbeat R1 pairs running
> > happily for years, I have to conclude that centos 7 is simply the
> > wrong tool for this particular job.
> >
> > --
> > Dimitri Maziuk
> > Programmer/sysadmin
> > BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] epic fail
Hmm, so when you know that it happens also when putting the node in
standby, then why do you run yum update on a live cluster? It must be
clear that the node will be fenced.

Would you post your pacemaker config? + some logs?

Best regards,
Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: supp...@feldhost.cz
www.feldhost.cz

> On 24 Jul 2017, at 17:04, Dimitri Maziuk wrote:
>
> On 07/24/2017 09:40 AM, Jan Pokorný wrote:
>
>> Would there be an interest, though? And would that be meaningful?
>
> IMO the only reason to put a node in standby is if you want to reboot
> the active node with no service interruption. For anything else,
> including a reboot with service interruption (during a maintenance
> window), it's a no.
>
> This is akin to "your mouse has moved, windows needs to be
> restarted". Except the mouse thing is a joke whereas those "standby"
> clowns appear to be serious.
>
> With this particular failure, something in the Redhat patched kernel
> (NFS?) does not release the DRBD filesystem. It happens when I put
> the node in standby as well, the only difference is not messing up
> the RPM database, which isn't that hard to fix. Since I have several
> centos 6 + DRBD + NFS + heartbeat R1 pairs running happily for years,
> I have to conclude that centos 7 is simply the wrong tool for this
> particular job.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] epic fail
On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> Would there be an interest, though? And would that be meaningful?

IMO the only reason to put a node in standby is if you want to reboot
the active node with no service interruption. For anything else,
including a reboot with service interruption (during a maintenance
window), it's a no.

This is akin to "your mouse has moved, windows needs to be restarted".
Except the mouse thing is a joke whereas those "standby" clowns appear
to be serious.

With this particular failure, something in the Redhat patched kernel
(NFS?) does not release the DRBD filesystem. It happens when I put the
node in standby as well, the only difference is not messing up the RPM
database, which isn't that hard to fix. Since I have several centos 6 +
DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
conclude that centos 7 is simply the wrong tool for this particular
job.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] epic fail
On 23/07/17 14:40 +0200, Valentin Vidic wrote:
> On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
>> So yesterday I ran yum update that pulled in the new pacemaker and
>> tried to restart it. The node went into its usual "can't unmount
>> drbd because kernel is using it" and got stonith'ed in the middle of
>> the yum transaction. The end result: DRBD reports split brain, HA
>> daemons don't start on boot, RPM database is FUBAR. I've had enough.
>> I'm rebuilding this cluster as centos 6 + heartbeat R1.
>
> It seems you did not put the node into standby before the upgrade as
> it still had resources running. What was the old/new pacemaker
> version there?

Thinking out loud, it shouldn't be too hard to deliver an RPM
plugin [1] with RPM-shipped pacemaker (it doesn't make much sense
otherwise) that would hook into RPM transactions, putting the node into
standby first, so as to cover the corner case where one updates a live
cluster. Something akin to systemd_inhibit.so.

Would there be an interest, though? And would that be meaningful?

[1] http://rpm.org/devel_doc/plugins.html

--
Jan (Poki)
Re: [ClusterLabs] epic fail
On 2017-07-23 08:27 AM, Dmitri Maziuk wrote:
> So yesterday I ran yum update that pulled in the new pacemaker and
> tried to restart it. The node went into its usual "can't unmount drbd
> because kernel is using it" and got stonith'ed in the middle of the
> yum transaction. The end result: DRBD reports split brain, HA daemons
> don't start on boot, RPM database is FUBAR. I've had enough. I'm
> rebuilding this cluster as centos 6 + heartbeat R1.
>
> Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been
> warned.
>
> Dima

Is DRBD set to 'fencing resource-and-stonith'? If so, then the only way
to get a split-brain is if something is configured wrong in pacemaker,
or if something caused crm-fence-peer.sh to report success when it
didn't actually succeed...

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay
Gould
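The DRBD option Digimer mentions lives in the disk and handlers sections. A minimal 8.4-style fragment, with handler paths as typically shipped by drbd-utils — treat this as a sketch to compare against `drbdadm dump`, not a drop-in config:

```
# /etc/drbd.d/global_common.conf (sketch)
common {
    disk {
        # On replication-link loss, freeze I/O and call the fence
        # handler; the survivor does not diverge until the peer is fenced.
        fencing resource-and-stonith;
    }
    handlers {
        # Place/remove a pacemaker location constraint on the peer:
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```

With this in place, a stonith mid-transaction should still not produce a split brain unless the fence handler misreported success, which is Digimer's point.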
Re: [ClusterLabs] epic fail
You cannot update a running cluster! First you need to put the node in
standby, check that all resources have stopped, and then do what you
need. This was unfortunately your fail :(

Best regards,
Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: supp...@feldhost.cz
www.feldhost.cz

> On 23 Jul 2017, at 14:27, Dmitri Maziuk wrote:
>
> So yesterday I ran yum update that pulled in the new pacemaker and
> tried to restart it. The node went into its usual "can't unmount drbd
> because kernel is using it" and got stonith'ed in the middle of the
> yum transaction. The end result: DRBD reports split brain, HA daemons
> don't start on boot, RPM database is FUBAR. I've had enough. I'm
> rebuilding this cluster as centos 6 + heartbeat R1.
>
> Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been
> warned.
>
> Dima
Re: [ClusterLabs] epic fail
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
> So yesterday I ran yum update that pulled in the new pacemaker and
> tried to restart it. The node went into its usual "can't unmount drbd
> because kernel is using it" and got stonith'ed in the middle of the
> yum transaction. The end result: DRBD reports split brain, HA daemons
> don't start on boot, RPM database is FUBAR. I've had enough. I'm
> rebuilding this cluster as centos 6 + heartbeat R1.

It seems you did not put the node into standby before the upgrade as it
still had resources running. What was the old/new pacemaker version
there?

--
Valentin
[ClusterLabs] epic fail
So yesterday I ran yum update that pulled in the new pacemaker and
tried to restart it. The node went into its usual "can't unmount drbd
because kernel is using it" and got stonith'ed in the middle of the yum
transaction. The end result: DRBD reports split brain, HA daemons don't
start on boot, RPM database is FUBAR. I've had enough. I'm rebuilding
this cluster as centos 6 + heartbeat R1.

Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been
warned.

Dima