Hi,

On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
>
> Hello,
>
> 2nd post in 10 years, let's see if this one gets an answer unlike the first
> one...
>
> One of the main use cases for pacemaker here is DRBD replicated
> active/active mailbox servers (dovecot/exim) on Debian machines.
> We've been doing this for a loong time, as evidenced by the oldest pair
> still running Wheezy with heartbeat and pacemaker 1.1.7.
>
> The majority of cluster pairs is on Jessie with corosync and backported
> pacemaker 1.1.16.
>
> Yesterday we had a hiccup, resulting in half the machines losing
> their upstream router for 50 seconds which in turn caused the pingd RA to
> trigger a fail-over of the DRBD RA and associated resource group
> (filesystem/IP) to the other node.
>
> The old cluster performed flawlessly, the newer clusters all wound up with
> DRBD and FS resources being BLOCKED as the processes holding open the
> filesystem didn't get killed fast enough.
>
> Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> "signal_processes" routine.
>
> So with the old Filesystem RA using fuser we get something like this and
> thousands of processes killed per second:
> ---
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output:
> (res_Filesystem_mb07:stop:stdout) 3478 3593 3597 3618 3654 3705 3708
> 3716 3736 3781 3792 3804 3963 3964 3972 3974 3978 3980 3981 3982
> 3985 3987 3991 3996 4002 4008 4013 4030
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output:
> (res_Filesystem_mb07:stop:stderr)
> cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output:
> (res_Filesystem_mb07:stop:stdout) 4032 4058 4086 4107 4199 4230 4320
> 4336 4362 4420 4429 4432 4435 4450 4468 4470 4471 4498 4510 4519
> 4584 4592 4604 4607 4632 4638 4640 4649 4676 4722 4765
> ---
>
> Whereas the new RA (newer isn't better) that goes around killing processes
> individually with beautiful logging was a total fail at about 4 processes
> per second killed...
> ---
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending
> signal TERM to: mail 4226 4909 0 09:43 ? S 0:00
> dovecot/imap
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending
> signal TERM to: mail 4229 4909 0 09:43 ? S 0:00
> dovecot/imap [idling]
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending
> signal TERM to: mail 4238 4909 0 09:43 ? S 0:00
> dovecot/imap
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending
> signal TERM to: mail 4239 4909 0 09:43 ? S 0:00
> dovecot/imap
> ---
>
> So my questions are:
>
> 1. Am I the only one with more than a handful of processes per FS who
> can't afford to wait hours for the new routine to finish?

The change was introduced about five years ago.

> 2. Can we have the old FUSER (kill) mode back?

Yes. I'll make a pull request.

Sorry for the trouble.

Thanks,

Dejan

> Regards,
>
> Christian
> --
> Christian Balzer Network/Systems Engineer
> ch...@gol.com Rakuten Communications
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
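
For anyone comparing the two behaviours described in this thread, here is a minimal Python sketch of the difference between signalling a list of PIDs quietly and signalling them with one lookup and one log line per process. It is not the Filesystem RA's code (the RA is a shell script), and the names signal_processes and ps_line are made up for the illustration. It assumes that the slowdown comes from the per-process lookup and logging, which the thread suggests but does not confirm.

```python
import os
import signal
import subprocess


def ps_line(pid):
    """Return the process line of `ps -f -p PID`, or the bare PID if ps shows none."""
    out = subprocess.run(["ps", "-f", "-p", str(pid)],
                         capture_output=True, text=True).stdout
    lines = out.strip().splitlines()
    # ps prints a header line first; anything after it describes the process.
    return lines[-1] if len(lines) > 1 else str(pid)


def signal_processes(pids, sig=signal.SIGTERM, describe=None, log=print):
    """Send sig to each PID and return the PIDs that were signalled.

    With describe=None nothing is looked up or logged per process.
    With describe set (for example ps_line), one description is fetched
    and one log line is written for every PID before the signal is sent,
    which costs an extra process start per PID when describe runs ps.
    """
    name = signal.Signals(sig).name[3:]  # "SIGTERM" -> "TERM"
    signalled = []
    for pid in pids:
        if describe is not None:
            log(f"sending signal {name} to: {describe(pid)}")
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # the process exited before we got to it
        signalled.append(pid)
    return signalled
```

Calling signal_processes(pids) takes the quiet path, while signal_processes(pids, describe=ps_line) writes one line per process in the style of the second log excerpt quoted above. With thousands of processes holding a filesystem open, the second form starts thousands of extra ps processes before the last signal is delivered.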