Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 11:34 AM, Ken Gaillot wrote:
> On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
>> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
>>> Lsof/fuser show the PID of the process holding FS open as "kernel".
>>
>> That could be the NFS server running in the kernel.
> 
> Dimitri,
> 
> Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
> Did pacemaker try to stop it before stopping DRBD?
> 

See the other post w/ the log. Sorry for trimming it off of the first
one -- I can repost the whole thing if it makes it easier.

Yes, it successfully stopped dovecot @ 14:03:46 and nfs_server @
14:03:47, removed all the symlinks, and failed to unmount /raid @ 14:03:47.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 18:09 +0200, Valentin Vidic wrote:
> On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> > Lsof/fuser show the PID of the process holding FS open as "kernel".
> 
> That could be the NFS server running in the kernel.

Dimitri,

Is the NFS server also managed by pacemaker? Is it ordered after DRBD?
Did pacemaker try to stop it before stopping DRBD?
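For a stack like the one described (DRBD, then the filesystem, then NFS), the ordering Ken is asking about would typically be expressed with constraints along these lines. The resource names below are hypothetical, and a symmetrical order constraint also enforces the reverse order on stop (NFS stops before the unmount, the unmount before DRBD demotion):

```shell
# Hypothetical resource names; adjust to the actual CIB.
pcs constraint order promote drbd_clone then start drbd_filesystem
pcs constraint order start drbd_filesystem then start server_nfs
# Keep the pieces on the same node:
pcs constraint colocation add drbd_filesystem with master drbd_clone INFINITY
pcs constraint colocation add server_nfs with drbd_filesystem INFINITY
```

With constraints like these in place, a stop of the stack tears down NFS before attempting the unmount, which is what the posted log shows happening.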
-- 
Ken Gaillot 







Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote:
> Standby is not necessary, it's just a cautious step that allows the
> admin to verify that all resources moved off correctly. The restart that
> yum does should be sufficient for pacemaker to move everything.
> 
> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

Right, the pacemaker upgrade might not be the biggest problem.  I've seen
other package upgrades cause RA monitors to return results like
$OCF_NOT_RUNNING or $OCF_ERR_INSTALLED.  That of course causes the
cluster to react, so I prefer the node standby option :)
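For reference, the standby approach amounts to something like the following sketch; the exact commands depend on the pcs/crmsh version in use, and `node1` is a placeholder:

```shell
pcs cluster standby node1      # move all resources off node1
pcs status                     # verify everything restarted on the peer
yum update                     # run the upgrade with no resources active
pcs cluster unstandby node1    # rejoin as a full cluster member
```

crmsh users would use `crm node standby` / `crm node online` instead.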

In this case pacemaker was trying to stop the resources, the stop
action failed, and the upgrading node was killed off by the second
node trying to clean up the mess.  The resources should have come up
on the second node after that.

-- 
Valentin



Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stopping NFS 
> server ...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS server and services...
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS server and services.
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS Mount Daemon...
> Jul 22 14:03:46 zebrafish systemd: Stopping NFSv4 ID-name mapping service...
> Jul 22 14:03:46 zebrafish rpc.mountd[2655]: Caught signal 15, un-registering 
> and exiting.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFSv4 ID-name mapping service.
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS Mount Daemon.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: threads
> Jul 22 14:03:46 zebrafish kernel: nfsd: last server has exited, flushing 
> export cache
> Jul 22 14:03:46 zebrafish systemd: Stopping NFS status monitor for NFSv2/3 
> locking
> Jul 22 14:03:46 zebrafish systemd: Stopped NFS status monitor for NFSv2/3 
> locking..
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-statd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-idmapd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: nfs-mountd
> Jul 22 14:03:46 zebrafish systemd: Stopping RPC bind service...
> Jul 22 14:03:46 zebrafish systemd: Stopped RPC bind service.
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpcbind
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: rpc-gssd
> Jul 22 14:03:46 zebrafish nfsserver(server_nfs)[6614]: INFO: Stop: umount 
> (1/10 attempts)
> Jul 22 14:03:47 zebrafish nfsserver(server_nfs)[6614]: INFO: NFS server 
> stopped
> Jul 22 14:03:47 zebrafish crmd[1078]:  notice: Result of stop operation for 
> server_nfs on zebrafish: 0 (ok)
> Jul 22 14:03:47 zebrafish crmd[1078]:  notice: Initiating stop operation 
> floating_ip_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> server_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_etc_pki_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish IPaddr2(floating_ip)[6769]: INFO: IP status = ok, 
> IP_CIP=
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_var_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> floating_ip on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_pki)[6821]: INFO: removed 
> '/etc/pki'
> Jul 22 14:03:48 zebrafish symlink(symlink_var_dovecot)[6822]: INFO: removed 
> '/var/spool/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_var_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> symlink_etc_dovecot_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_etc_pki on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish symlink(symlink_etc_dovecot)[6863]: INFO: removed 
> '/etc/dovecot'
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Result of stop operation for 
> symlink_etc_dovecot on zebrafish: 0 (ok)
> Jul 22 14:03:48 zebrafish crmd[1078]:  notice: Initiating stop operation 
> drbd_filesystem_stop_0 locally on zebrafish
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
> stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
> unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
...



Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
The NFS server/share is also managed by pacemaker, and the ordering is set correctly?

Best regards, Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – professional hosting and server services at fair prices.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350, registered with the Municipal Court in Prague

Bank: Fio banka a.s.
Account no.: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 18:01, Dimitri Maziuk  wrote:
> 
> On 07/24/2017 10:38 AM, Ken Gaillot wrote:
> 
>> A restart shouldn't lead to fencing in any case where something's not
>> going seriously wrong. I'm not familiar with the "kernel is using it"
>> message, I haven't run into that before.
> 
> I posted it at least once before.
> 
>> 
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
>> stop for /dev/drbd0 on /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
>> unmount /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 
>> 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, 
>> active_time=260 secs
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid, giving up!
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes 

Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> Lsof/fuser show the PID of the process holding FS open as "kernel".

That could be the NFS server running in the kernel.

-- 
Valentin



Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 10:38 AM, Ken Gaillot wrote:

> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

I posted it at least once before.

> 
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
> stop for /dev/drbd0 on /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
> unmount /raid
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 
> 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, 
> active_time=260 secs
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
> unmount /raid, giving up!
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with KILL ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info about 
> processes that use ]
> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
> 

Re: [ClusterLabs] epic fail

2017-07-24 Thread Ken Gaillot
On Mon, 2017-07-24 at 17:13 +0200, Kristián Feldsam wrote:
> Hmm, so if you know it also happens when putting the node in standby,
> why did you run yum update on a live cluster? It must have been clear
> that the node would be fenced.

Standby is not necessary, it's just a cautious step that allows the
admin to verify that all resources moved off correctly. The restart that
yum does should be sufficient for pacemaker to move everything.

A restart shouldn't lead to fencing in any case where something's not
going seriously wrong. I'm not familiar with the "kernel is using it"
message, I haven't run into that before.

The only case where special handling was needed before a yum update is a
node running pacemaker_remote instead of the full cluster stack, before
pacemaker 1.1.15.

> Would you post your pacemaker config? + some logs?
> 
> 
> > On 24 Jul 2017, at 17:04, Dimitri Maziuk 
> > wrote:
> > 
> > On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> > 
> > > Would there be an interest, though?  And would that be meaningful?
> > 
> > IMO the only reason to put a node in standby is if you want to
> > reboot
> > the active node with no service interruption. For anything else,
> > including a reboot with service interruption (during maintenance
> > window), it's a no.
> > 
> > This is akin to "your mouse has moved, windows needs to be
> > restarted".
> > Except the mouse thing is a joke whereas those "standby" clowns
> > appear
> > to be serious.
> > 
> > With this particular failure, something in the Redhat patched kernel
> > (NFS?) does not release the DRBD filesystem. It happens when I put
> > the
> > node in standby as well, the only difference is not messing up the
> > RPM
> > database which isn't that hard to fix. Since I have several centos 6
> > +
> > DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
> > conclude that centos 7 is simply the wrong tool for this particular
> > job.
> > 
> > -- 
> > Dimitri Maziuk
> > Programmer/sysadmin
> > BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu







Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
Hmm, so if you know it also happens when putting the node in standby, why did
you run yum update on a live cluster? It must have been clear that the node would be fenced.

Would you post your pacemaker config? + some logs?


> On 24 Jul 2017, at 17:04, Dimitri Maziuk  wrote:
> 
> On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> 
>> Would there be an interest, though?  And would that be meaningful?
> 
> IMO the only reason to put a node in standby is if you want to reboot
> the active node with no service interruption. For anything else,
> including a reboot with service interruption (during maintenance
> window), it's a no.
> 
> This is akin to "your mouse has moved, windows needs to be restarted".
> Except the mouse thing is a joke whereas those "standby" clowns appear
> to be serious.
> 
> With this particular failure, something in the Redhat patched kernel
> (NFS?) does not release the DRBD filesystem. It happens when I put the
> node in standby as well, the only difference is not messing up the RPM
> database which isn't that hard to fix. Since I have several centos 6 +
> DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
> conclude that centos 7 is simply the wrong tool for this particular job.
> 
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> 


Re: [ClusterLabs] epic fail

2017-07-24 Thread Dimitri Maziuk
On 07/24/2017 09:40 AM, Jan Pokorný wrote:

> Would there be an interest, though?  And would that be meaningful?

IMO the only reason to put a node in standby is if you want to reboot
the active node with no service interruption. For anything else,
including a reboot with service interruption (during maintenance
window), it's a no.

This is akin to "your mouse has moved, Windows needs to be restarted".
Except the mouse thing is a joke whereas those "standby" clowns appear
to be serious.

With this particular failure, something in the Red Hat patched kernel
(NFS?) does not release the DRBD filesystem. It happens when I put the
node in standby as well; the only difference is not messing up the RPM
database, which isn't that hard to fix. Since I have several CentOS 6 +
DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
conclude that CentOS 7 is simply the wrong tool for this particular job.



Re: [ClusterLabs] epic fail

2017-07-24 Thread Jan Pokorný
On 23/07/17 14:40 +0200, Valentin Vidic wrote:
> On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
>> So yesterday I ran yum update that pulled in the new pacemaker and tried to
>> restart it. The node went into its usual "can't unmount drbd because kernel
>> is using it" and got stonith'ed in the middle of yum transaction. The end
>> result: DRBD reports split brain, HA daemons don't start on boot, RPM
>> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6
>> + heartbeat R1.
> 
> It seems you did not put the node into standby before the upgrade as it
> still had resources running.  What was the old/new pacemaker version there?

Thinking out loud, it shouldn't be too hard to deliver an RPM
plugin[1] with RPM-shipped pacemaker (it doesn't make much sense
otherwise) that would hook into RPM transactions, putting the node
into standby first, so as to cover the corner case where one updates
the live cluster.  Something akin to systemd_inhibit.so.
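No such plugin exists as of this thread. Sketched in shell, the behaviour it would wrap around a package transaction might look like this; `do_standby`/`do_unstandby` are placeholders for real cluster calls such as `crm_standby -v on` / `crm_standby -v off`:

```shell
# standby_guard CMD...: put this node in standby, run the package
# transaction, then bring the node back online.
standby_guard() {
    do_standby || return 1   # refuse to run the transaction unguarded
    "$@"                     # the package transaction itself
    rc=$?
    do_unstandby             # best effort: bring the node back online
    return $rc
}

# Placeholder wiring; a real plugin would talk to the cluster directly.
do_standby()   { crm_standby -v on; }
do_unstandby() { crm_standby -v off; }
```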

Would there be an interest, though?  And would that be meaningful?

[1] http://rpm.org/devel_doc/plugins.html

-- 
Jan (Poki)




Re: [ClusterLabs] epic fail

2017-07-23 Thread Digimer
On 2017-07-23 08:27 AM, Dmitri Maziuk wrote:
> So yesterday I ran yum update that pulled in the new pacemaker and tried
> to restart it. The node went into its usual "can't unmount drbd because
> kernel is using it" and got stonith'ed in the middle of yum transaction.
> The end result: DRBD reports split brain, HA daemons don't start on
> boot, RPM database is FUBAR. I've had enough. I'm rebuilding this
> cluster as centos 6 + heartbeat R1.
> 
> Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been warned.
> 
> Dima

Is DRBD set to 'fencing resource-and-stonith'? If so, then the only way
to get a split-brain is if something is configured wrong in pacemaker or
if something caused crm-fence-peer.sh to report success when it didn't
actually succeed...
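For reference, the DRBD 8.4 configuration Digimer is asking about looks roughly like this; the resource name is a placeholder and the handler paths are the usual package defaults, so verify against the local installation:

```
resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With this set, DRBD freezes I/O and asks pacemaker to fence the peer before proceeding, which is what makes an unnoticed split brain unlikely when everything is wired correctly.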

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



Re: [ClusterLabs] epic fail

2017-07-23 Thread Kristián Feldsam
You cannot update a running cluster! First you need to put the node in standby,
check that all resources have stopped, and then do what you need. This was
unfortunately your mistake :(


> On 23 Jul 2017, at 14:27, Dmitri Maziuk  wrote:
> 
> So yesterday I ran yum update that pulled in the new pacemaker and tried to 
> restart it. The node went into its usual "can't unmount drbd because kernel 
> is using it" and got stonith'ed in the middle of yum transaction. The end 
> result: DRBD reports split brain, HA daemons don't start on boot, RPM 
> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6 + 
> heartbeat R1.
> 
> Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been warned.
> 
> Dima
> 
> 


Re: [ClusterLabs] epic fail

2017-07-23 Thread Valentin Vidic
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
> So yesterday I ran yum update that pulled in the new pacemaker and tried to
> restart it. The node went into its usual "can't unmount drbd because kernel
> is using it" and got stonith'ed in the middle of yum transaction. The end
> result: DRBD reports split brain, HA daemons don't start on boot, RPM
> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6
> + heartbeat R1.

It seems you did not put the node into standby before the upgrade as it
still had resources running.  What was the old/new pacemaker version there?

-- 
Valentin



[ClusterLabs] epic fail

2017-07-23 Thread Dmitri Maziuk
So yesterday I ran yum update that pulled in the new pacemaker and tried
to restart it. The node went into its usual "can't unmount drbd because
kernel is using it" state and got stonith'ed in the middle of the yum
transaction. The end result: DRBD reports split brain, HA daemons don't
start on boot, and the RPM database is FUBAR. I've had enough. I'm
rebuilding this cluster as CentOS 6 + heartbeat R1.


CentOS 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been warned.

Dima

