On 03/08/2017 09:58 AM, Jeffrey Westgate wrote: > Ok. > > Been running monit for a few days, and atop (running a script to capture an > atop output every 10 seconds for an hour, rotate the log, and do it again; > runs from midnight to midnight, changes the date, and does it again). I > correlate between the atop logs, Nagios alerts, and monit, to try to find a > trigger. Like trying to find a particular snowflake in Alaska in January. > > Have had a handful of episodes with all the monitors running. We have > determined nothing. Nothing significantly changes from normal/regular to high > host load. > > It's a VMware/ESXi-hosted VM, so we moved it to a different host and > different datastore (so, effectively new CPU, memory, nic, disk, video... > basically all "new" hardware). Still have episodes. > > Was running the "VMware provided" vmtools. Removed and replaced with > open-vm-tools this morning. Just had another episode. > > Was running atop interactively when the episode started - the only thing that > seems to change is the host load goes up. Momentary spike in "avio" for the > disk -- all the way up to 25 msecs. Lasted for one ten-second slice from atop. > > No zombies, no wait, no spike in network, transport, mem use, disk > reads/writes... nothing I can see (and by I, I mean "we", as we have three > people looking). > > I've got other boxes running the same OS - updated them at the same time, so > patch level is all the same. No similar issues. The only thing I have different > is these two are running pacemaker, corosync, keepalived. Maybe when they > were updated, they needed a library I don't have? > > Running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags > there. So - not OS, not IO, not hardware (virtual as it is...) ... only > leaves software. > > Maybe pacemaker is just incompatible with: > > Scientific Linux release 6.5 (Carbon) > kernel 2.6.32-642.15.1.el6.x86_64 > > ??
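For anyone wanting to reproduce that capture setup, here is a minimal sketch of such a wrapper. This is my reconstruction, not the poster's actual script; the log directory, file naming, and one-file-per-hour layout are assumptions:

```shell
#!/bin/sh
# Hypothetical rewrite of the capture scheme described above: atop writes
# one raw log per hour at 10-second resolution; old files can be pruned.
LOGDIR="${LOGDIR:-/var/log/atop-capture}"

logname() {
    # One raw file per hour, e.g. /var/log/atop-capture/atop_20170308_09.raw
    echo "$LOGDIR/atop_$(date +%Y%m%d_%H).raw"
}

capture_hour() {
    # atop -w writes binary samples readable later with "atop -r <file>";
    # 360 samples at a 10-second interval covers exactly one hour.
    mkdir -p "$LOGDIR"
    atop -w "$(logname)" 10 360
}

# Run from midnight to midnight, e.g. looped by cron or a simple wrapper:
#   while :; do capture_hour; done
```

Replaying is then `atop -r <file>`, stepping forward/back through the 10-second slices around an episode with `t`/`T`.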
That does sound bizarre. I haven't tried 6.5 in a while, but it's certainly compatible with the current 6.8. IIRC, you updated to the 6.8 pacemaker packages ... Did you also update the OS and/or other cluster-related packages to 6.8? > At this point it's more of a curiosity than an out-and-out problem, as > performance does not seem to be impacted noticeably. Packet-in, packet-out > seems unperturbed. Same cannot be said for us administrators... > > > > > ________________________________________ > From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org] > Sent: Friday, March 03, 2017 7:27 AM > To: users@clusterlabs.org > Subject: Users Digest, Vol 26, Issue 10 > > Send Users mailing list submissions to > users@clusterlabs.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.clusterlabs.org/mailman/listinfo/users > or, via email, send a message with subject or body 'help' to > users-requ...@clusterlabs.org > > You can reach the person managing the list at > users-ow...@clusterlabs.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Users digest..." > > > Today's Topics: > > 1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error > retrying (Ulrich Windl) > 2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join > error retrying (emmanuel segura) > 3. Antw: Re: Never join a list without a problem... > (Jeffrey Westgate) > > > ---------------------------------------------------------------------- > > ------------------------------ > > Message: 3 > Date: Fri, 3 Mar 2017 13:27:25 +0000 > From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov> > To: "users@clusterlabs.org" <users@clusterlabs.org> > Subject: [ClusterLabs] Antw: Re: Never join a list without a > problem... > Message-ID: > > <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net> > > Content-Type: text/plain; charset="us-ascii" > > Appreciate the offer - not familiar with monit.
> > Going to try running atop through logrotate for the day, keep 12, rotate > hourly (to control space utilization) and see if I can catch anything that > way. My biggest issue is we've not caught it as it starts, so we don't ever > see anything amiss. > > If this doesn't work, then I will likely take you up on how to script monit > to catch something. > > Thanks -- > > Jeff > ________________________________________ > From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org] > Sent: Friday, March 03, 2017 4:51 AM > To: users@clusterlabs.org > Subject: Users Digest, Vol 26, Issue 9 > > Today's Topics: > > 1. Re: Never join a list without a problem... (Jeffrey Westgate) > 2. Re: PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to > PCMK_OCF_UNKNOWN_ERROR (Ken Gaillot) > 3. Re: Cannot clone clvmd resource (Eric Ren) > 4. Re: Cannot clone clvmd resource (Eric Ren) > 5. Antw: Re: Never join a list without a problem... (Ulrich Windl) > 6. Antw: Re: Cannot clone clvmd resource (Ulrich Windl) > 7. Re: Insert delay between the startup of VirtualDomain > (Dejan Muhamedagic) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 2 Mar 2017 16:32:02 +0000 > From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov> > To: Adam Spiers <aspi...@suse.com>, "Cluster Labs - All topics related > to open-source clustering welcomed" <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID: > > <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net> > > Content-Type: text/plain; charset="iso-8859-1" > > Since we have both pieces of the load-balanced cluster doing the same thing - > for still-as-yet unidentified reasons - we've put atop on one and sysdig on > the other. Running atop at 10-second slices, hoping it will catch something. > While configuring it yesterday, that server went into its 'episode', but > there was nothing in the atop log to show anything. Nothing else changed > except the cpu load average. No increase in any other parameter. > > frustrating. > > > ________________________________________ > From: Adam Spiers [aspi...@suse.com] > Sent: Wednesday, March 01, 2017 5:33 AM > To: Cluster Labs - All topics related to open-source clustering welcomed > Cc: Jeffrey Westgate > Subject: Re: [ClusterLabs] Never join a list without a problem... > > Ferenc Wágner <wf...@niif.hu> wrote: >> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes: >> >>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes >>> longer, and we cannot set a clock by it - while the machine is 95% >>> idle (or more according to 'top'), the host load shoots up to 50 or >>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes >>> to come back down to baseline, which is mostly 0.00. (attached >>> hostload.pdf) This happens to both machines, randomly, and is >>> concerning, as we'd like to find what's causing it and resolve it. >> >> Try running atop (http://www.atoptool.nl/). It collects and logs >> process accounting info, allowing you to step back in time and check >> resource usage in the past. > > Nice, I didn't know atop could also log the collected data for future > analysis.
> > If you want to capture even more detail, sysdig is superb: > > http://www.sysdig.org/ > > > > ------------------------------ > > Message: 2 > Date: Thu, 2 Mar 2017 17:31:33 -0600 > From: Ken Gaillot <kgail...@redhat.com> > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are > mapped to PCMK_OCF_UNKNOWN_ERROR > Message-ID: <8b8dd955-8e35-6824-a80c-2556d833f...@redhat.com> > Content-Type: text/plain; charset=windows-1252 > > On 03/01/2017 05:28 PM, Andrew Beekhof wrote: >> On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg >> <lars.ellenb...@linbit.com> wrote: >>> When I recently tried to make use of the DEGRADED monitoring results, >>> I found out that it still does not work. >>> >>> Because LRMD chooses to filter them in ocf2uniform_rc(), >>> and maps them to PCMK_OCF_UNKNOWN_ERROR. >>> >>> See patch suggestion below. >>> >>> It also filters away the other "special" rc values. >>> Do we really not want to see them in crmd/pengine? >> >> I would think we do. >> >>> Why does LRMD think it needs to outsmart the pengine? >> >> Because the person that implemented the feature incorrectly assumed >> the rc would be passed back unmolested. >> >>> >>> Note: I did build it, but did not use this yet, >>> so I have no idea if the rest of the implementation of the DEGRADED >>> stuff works as intended or if there are other things missing as well. >> >> failcount might be the other place that needs some massaging. >> specifically, not incrementing it when a degraded rc comes through > > I think that's already taken care of.
> >>> Thoughts? >> >> looks good to me >> >>> >>> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c >>> index 724edb7..39a7dd1 100644 >>> --- a/lrmd/lrmd.c >>> +++ b/lrmd/lrmd.c >>> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char >>> *stdout_data) >>> static int >>> ocf2uniform_rc(int rc) >>> { >>> - if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) { >>> - return PCMK_OCF_UNKNOWN_ERROR; > > Let's simply use > PCMK_OCF_OTHER_ERROR here, since that's guaranteed to > be the high end. > > Lars, do you want to test that? > >>> + switch (rc) { >>> + default: >>> + return PCMK_OCF_UNKNOWN_ERROR; >>> + >>> + case PCMK_OCF_OK: >>> + case PCMK_OCF_UNKNOWN_ERROR: >>> + case PCMK_OCF_INVALID_PARAM: >>> + case PCMK_OCF_UNIMPLEMENT_FEATURE: >>> + case PCMK_OCF_INSUFFICIENT_PRIV: >>> + case PCMK_OCF_NOT_INSTALLED: >>> + case PCMK_OCF_NOT_CONFIGURED: >>> + case PCMK_OCF_NOT_RUNNING: >>> + case PCMK_OCF_RUNNING_MASTER: >>> + case PCMK_OCF_FAILED_MASTER: >>> + >>> + case PCMK_OCF_DEGRADED: >>> + case PCMK_OCF_DEGRADED_MASTER: >>> + return rc; >>> + >>> +#if 0 >>> + /* What about these??
*/ >> yes, these should get passed back as-is too >> >>> + /* 150-199 reserved for application use */ >>> + PCMK_OCF_CONNECTION_DIED = 189, /* Operation failure implied by >>> disconnection of the LRM API to a local or remote node */ >>> + >>> + PCMK_OCF_EXEC_ERROR = 192, /* Generic problem invoking the agent */ >>> + PCMK_OCF_UNKNOWN = 193, /* State of the service is unknown - >>> used for recording in-flight operations */ >>> + PCMK_OCF_SIGNAL = 194, >>> + PCMK_OCF_NOT_SUPPORTED = 195, >>> + PCMK_OCF_PENDING = 196, >>> + PCMK_OCF_CANCELLED = 197, >>> + PCMK_OCF_TIMEOUT = 198, >>> + PCMK_OCF_OTHER_ERROR = 199, /* Keep the same codes as PCMK_LSB */ >>> +#endif >>> } >>> - >>> - return rc; >>> } >>> >>> static int > > > > ------------------------------ > > Message: 3 > Date: Fri, 3 Mar 2017 09:48:34 +0800 > From: Eric Ren <z...@suse.com> > To: kgail...@redhat.com, Cluster Labs - All topics related to > open-source clustering welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Cannot clone clvmd resource > Message-ID: <2bb79eb7-300a-509d-b65f-29b5899c4...@suse.com> > Content-Type: text/plain; charset=windows-1252; format=flowed > > On 03/02/2017 06:20 AM, Ken Gaillot wrote: >> On 03/01/2017 03:49 PM, Anne Nicolas wrote: >>> Hi there >>> >>> >>> I'm testing quite an easy configuration to work on clvm. I'm just >>> getting crazy as it seems clvmd cannot be cloned on other nodes. >>> >>> clvmd starts well on node1 but fails on both node2 and node3. >> Your config looks fine, so I'm going to guess there's some local >> difference on the nodes. >> >>> In pacemaker journalctl I get the following message >>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd: >>> No such file or directory >>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat >>> /cmirrord: No such file or directory >> I have no idea where the above is coming from. pidofproc is an LSB >> function, but (given journalctl) I'm assuming you're using systemd.
I >> don't think anything in pacemaker or resource-agents uses pidofproc (at >> least not currently, not sure about the older version you're using). > I guess Anne is using LVM2 on SUSE release. In our lvm2 package, there are > cLVM related > resource agents for clvmd and cmirrord. They're using pidofproc. > > Eric > >> >>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd >>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms >>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok >>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true) >>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop >>> p-dlm_stop_0 on node3 (local) >>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm >>> action:stop call_id:235 >>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop >>> p-dlm_stop_0 on node2 >>> >>> Here is my configuration >>> >>> node 739312139: node1 >>> node 739312140: node2 >>> node 739312141: node3 >>> primitive admin_addr IPaddr2 \ >>> params ip=172.17.2.10 \ >>> op monitor interval=10 timeout=20 \ >>> meta target-role=Started >>> primitive p-clvmd ocf:lvm2:clvmd \ >>> op start timeout=90 interval=0 \ >>> op stop timeout=100 interval=0 \ >>> op monitor interval=30 timeout=90 >>> primitive p-dlm ocf:pacemaker:controld \ >>> op start timeout=90 interval=0 \ >>> op stop timeout=100 interval=0 \ >>> op monitor interval=60 timeout=90 >>> primitive stonith-sbd stonith:external/sbd >>> group g-clvm p-dlm p-clvmd >>> clone c-clvm g-clvm meta interleave=true >>> property cib-bootstrap-options: \ >>> have-watchdog=true \ >>> dc-version=1.1.13-14.7-6f22ad7 \ >>> cluster-infrastructure=corosync \ >>> cluster-name=hacluster \ >>> stonith-enabled=true \ >>> placement-strategy=balanced \ >>> no-quorum-policy=freeze \ >>> last-lrm-refresh=1488404073 >>> rsc_defaults rsc-options: \ >>> resource-stickiness=1 \ >>> migration-threshold=10 >>> op_defaults op-options: \ 
>>> timeout=600 \ >>> record-pending=true >>> >>> Thanks in advance for your input >>> >>> Cheers >>> >> >> _______________________________________________ >> Users mailing list: Users@clusterlabs.org >> http://lists.clusterlabs.org/mailman/listinfo/users >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org >> > > > > > ------------------------------ > > Message: 4 > Date: Fri, 3 Mar 2017 11:12:01 +0800 > From: Eric Ren <z...@suse.com> > To: Cluster Labs - All topics related to open-source clustering > welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Cannot clone clvmd resource > Message-ID: <c004860e-376e-4bc3-1d35-d60428b41...@suse.com> > Content-Type: text/plain; charset=windows-1252; format=flowed > > On 03/02/2017 07:09 AM, Anne Nicolas wrote: >> >> On 01/03/2017 at 23:20, Ken Gaillot wrote: >>> On 03/01/2017 03:49 PM, Anne Nicolas wrote: >>>> Hi there >>>> >>>> >>>> I'm testing quite an easy configuration to work on clvm. I'm just >>>> getting crazy as it seems clvmd cannot be cloned on other nodes. >>>> >>>> clvmd starts well on node1 but fails on both node2 and node3. >>> Your config looks fine, so I'm going to guess there's some local >>> difference on the nodes. >>> >>>> In pacemaker journalctl I get the following message >>>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd: >>>> No such file or directory >>>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat >>>> /cmirrord: No such file or directory >>> I have no idea where the above is coming from. pidofproc is an LSB >>> function, but (given journalctl) I'm assuming you're using systemd. I >>> don't think anything in pacemaker or resource-agents uses pidofproc (at >>> least not currently, not sure about the older version you're using). >> >> Thanks for your feedback.
I finally checked the RA script and found the >> error >> >> in the clvm2 RA script on non-working nodes I got >> # Common variables >> DAEMON="${sbindir}/clvmd" >> CMIRRORD="${sbindir}/cmirrord" >> LVMCONF="${sbindir}/lvmconf" >> >> on the working node >> DAEMON="/usr/sbin/clvmd" >> CMIRRORD="/usr/sbin/cmirrord" >> >> Looks like the path variables were not interpreted. I just have to >> check why I got those versions > A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since > SLE12-SP2 > and openSUSE leap42.2, we recommend using > '/usr/lib/ocf/resource.d/heartbeat/clvm' > instead, which is from the 'resource-agents' package. > > Eric >> >> Thanks again for your answer >> >>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd >>>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms >>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok >>>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true) >>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop >>>> p-dlm_stop_0 on node3 (local) >>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm >>>> action:stop call_id:235 >>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop >>>> p-dlm_stop_0 on node2 >>>> >>>> Here is my configuration >>>> >>>> node 739312139: node1 >>>> node 739312140: node2 >>>> node 739312141: node3 >>>> primitive admin_addr IPaddr2 \ >>>> params ip=172.17.2.10 \ >>>> op monitor interval=10 timeout=20 \ >>>> meta target-role=Started >>>> primitive p-clvmd ocf:lvm2:clvmd \ >>>> op start timeout=90 interval=0 \ >>>> op stop timeout=100 interval=0 \ >>>> op monitor interval=30 timeout=90 >>>> primitive p-dlm ocf:pacemaker:controld \ >>>> op start timeout=90 interval=0 \ >>>> op stop timeout=100 interval=0 \ >>>> op monitor interval=60 timeout=90 >>>> primitive stonith-sbd stonith:external/sbd >>>> group g-clvm p-dlm p-clvmd >>>> clone c-clvm g-clvm meta interleave=true
>>>> property cib-bootstrap-options: \ >>>> have-watchdog=true \ >>>> dc-version=1.1.13-14.7-6f22ad7 \ >>>> cluster-infrastructure=corosync \ >>>> cluster-name=hacluster \ >>>> stonith-enabled=true \ >>>> placement-strategy=balanced \ >>>> no-quorum-policy=freeze \ >>>> last-lrm-refresh=1488404073 >>>> rsc_defaults rsc-options: \ >>>> resource-stickiness=1 \ >>>> migration-threshold=10 >>>> op_defaults op-options: \ >>>> timeout=600 \ >>>> record-pending=true >>>> >>>> Thanks in advance for your input >>>> >>>> Cheers >>>> >>> > > > > > ------------------------------ > > Message: 5 > Date: Fri, 03 Mar 2017 08:04:22 +0100 > From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> > To: <users@clusterlabs.org> > Subject: [ClusterLabs] Antw: Re: Never join a list without a > problem... > Message-ID: <58b91576020000a100024...@gwsmtp1.uni-regensburg.de> > Content-Type: text/plain; charset=UTF-8 > >>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> wrote on 02.03.2017 at 17:32 in message > <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>: >> Since we have both pieces of the load-balanced cluster doing the same thing - >> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the >> other. Running atop at 10-second slices, hoping it will catch something. >> While configuring it yesterday, that server went into its 'episode', but >> there was nothing in the atop log to show anything. Nothing else changed >> except the cpu load average. No increase in any other parameter. >> >> frustrating. > > Hi!
> > You could try the monit-approach (I could provide an RPM with a > "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it). > > The part that monitors unusual load looks like this here: > check system host.domain.org > if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh" > if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh" > if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh" > if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh" > if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh" > if swap usage > 50% then exec "/var/lib/monit/log-top.sh" > if cpu usage > 99% for 15 cycles then alert > if cpu usage (user) > 90% for 30 cycles then alert > if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh" > if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh" > group local > ### all numbers are a matter of taste ;-) > And my script (for lack of better ideas) looks like this: > #!/bin/sh > { > echo "========== $(/bin/date) ==========" > /usr/bin/mpstat > echo "---" > /usr/bin/vmstat > echo "---" > /usr/bin/top -b -n 1 -Hi > } >> /var/log/monit/top.log > > Regards, > Ulrich > >> >> >> ________________________________________ >> From: Adam Spiers [aspi...@suse.com] >> Sent: Wednesday, March 01, 2017 5:33 AM >> To: Cluster Labs - All topics related to open-source clustering welcomed >> Cc: Jeffrey Westgate >> Subject: Re: [ClusterLabs] Never join a list without a problem... >> >> Ferenc Wágner <wf...@niif.hu> wrote: >>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes: >>> >>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes >>>> longer, and we cannot set a clock by it - while the machine is 95% >>>> idle (or more according to 'top'), the host load shoots up to 50 or >>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes >>>> to come back down to baseline, which is mostly 0.00. (attached >>>> hostload.pdf) This happens to both machines, randomly, and is >>>> concerning, as we'd like to find what's causing it and resolve it. >>> >>> Try running atop (http://www.atoptool.nl/). It collects and logs >>> process accounting info, allowing you to step back in time and check >>> resource usage in the past. >> >> Nice, I didn't know atop could also log the collected data for future >> analysis. >> >> If you want to capture even more detail, sysdig is superb: >> >> http://www.sysdig.org/ > > > > > > ------------------------------ > > Message: 6 > Date: Fri, 03 Mar 2017 08:27:23 +0100 > From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> > To: <users@clusterlabs.org> > Subject: [ClusterLabs] Antw: Re: Cannot clone clvmd resource > Message-ID: <58b91adb020000a100024...@gwsmtp1.uni-regensburg.de> > Content-Type: text/plain; charset=US-ASCII > >>>> Eric Ren <z...@suse.com> wrote on 03.03.2017 at 04:12 in message > <c004860e-376e-4bc3-1d35-d60428b41...@suse.com>: > [...] >> A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since >> SLE12-SP2 >> and openSUSE leap42.2, we recommend using >> '/usr/lib/ocf/resource.d/heartbeat/clvm' >> instead, which is from 'resource-agents' package. > > [...] > It seems some release notes were not clear enough: I found out that we are > also using ocf:lvm2:clvmd here (SLES11 SP4).
When trying to diff, I found > this: > # diff -u /usr/lib/ocf/resource.d/{lvm2,heartbeat}/clvmd |less > diff: /usr/lib/ocf/resource.d/heartbeat/clvmd: No such file or directory > # rpm -qf /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/lvm2/ > resource-agents-3.9.5-49.2 > lvm2-clvm-2.02.98-0.42.3 > > I'm confused! > > Regards, > Ulrich > > > > > > > ------------------------------ > > Message: 7 > Date: Fri, 3 Mar 2017 11:51:09 +0100 > From: Dejan Muhamedagic <deja...@fastmail.fm> > To: Cluster Labs - All topics related to open-source clustering > welcomed <users@clusterlabs.org> > Subject: Re: [ClusterLabs] Insert delay between the startup of > VirtualDomain > Message-ID: <20170303105109.GA16526@tuttle.homenet> > Content-Type: text/plain; charset=iso-8859-1 > > Hi, > > On Wed, Mar 01, 2017 at 01:47:21PM +0100, Oscar Segarra wrote: >> Hi Dejan, >> >> In my environment, it is possible to launch the check from the hypervisor. >> A simple telnet against a specific port may be enough to check if the service >> is ready. > > telnet is not so practical for scripts, better use ssh or > the mysql client. > >> In this simple scenario (and check) how can I instruct the second server to >> wait until the mysql server is up? > > That's what the ordering constraints in pacemaker are for. You > don't need to do anything special. > > Thanks, > > Dejan > >> >> Thanks a lot >> >> On 1 Mar. 2017 at 1:08 p.m., "Dejan Muhamedagic" <deja...@fastmail.fm> >> wrote: >> >>> Hi, >>> >>> On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote: >>>> Hi, >>>> >>>> Yes, >>>> >>>> Database server can be considered started up when it accepts mysql client >>>> connections >>>> Applications server can be considered started as soon as the listening >>> port >>>> is up and accepting connections >>>> >>>> Can you provide an example of how to achieve this? >>> >>> Is it possible to connect to the database from the supervisor? >>> Then something like this would do: >>> >>> mysql -h vm_ip_address ... < /dev/null >>> >>> If not, then if ssh works: >>> >>> echo mysql ... | ssh vm_ip_address >>> >>> I'm afraid I cannot help you more with mysql details and what to >>> put in place of the '...' above, but it should do whatever is necessary >>> to test if the database reached the functional state. You can >>> find an example in ocf:heartbeat:mysql: just look for the >>> "test_table" parameter. Of course, you'll need to put that in a >>> script and test output and so on. I guess that there's enough >>> information on the internet on how to do that. >>> >>> Good luck! >>> >>> Dejan >>> >>>> Thanks a lot. >>>> >>>> >>>> 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>: >>>>> >>>>> Hi, >>>>> >>>>> On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote: >>>>>> Hi, >>>>>> >>>>>> In my environment I have 5 guests that have to be started up in a >>>>>> specified order, starting with the MySQL database server. >>>>>> >>>>>> I have set the order constraints and VirtualDomains start in the >>> right >>>>>> order but, the problem I have, is that the second host starts up >>> faster >>>>>> than the database server and therefore applications running on the >>> second >>>>>> host raise errors due to database connectivity problems. >>>>>> >>>>>> I'd like to introduce a delay between the startup of the >>> VirtualDomain of >>>>>> the database server and the startup of the second guest. >>>>> >>>>> Do you have a way to check if this server is up? If so... >>>>> The start action of VirtualDomain won't exit until the monitor >>>>> action returns success. And there's a parameter called >>>>> monitor_scripts (see the meta-data). Note that these programs >>>>> (scripts) are run at the supervisor host and not in the guest. >>>>> It's all a bit involved, but should be doable. >>>>> >>>>> Thanks, >>>>> >>>>> Dejan >>>>> >>>>>> Is there any way to get this? >>>>>> >>>>>> Thanks a lot.
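To make the monitor_scripts idea concrete, here is a rough sketch of such a readiness probe, run on the hypervisor. The guest address and the choice of `mysqladmin ping` are my illustrative assumptions, not anything from the thread — adapt to whatever readiness test fits:

```shell
#!/bin/sh
# Hypothetical guest-DB readiness probe for VirtualDomain's monitor_scripts
# parameter. It runs on the hypervisor, not in the guest; GUEST_IP is a
# made-up example address.
GUEST_IP="${GUEST_IP:-192.168.122.10}"

db_ready() {
    # "mysqladmin ping" exits 0 once mysqld accepts connections; a refused
    # connection, a timeout, or a missing client binary all exit nonzero.
    mysqladmin --connect-timeout=5 -h "$1" ping >/dev/null 2>&1
}

# As a monitor script, the file would simply end by reporting readiness
# through its exit status:
#
#   db_ready "$GUEST_IP"    # 0 = database reachable, start may complete
```

Wired into monitor_scripts, the VirtualDomain start action keeps polling until the probe succeeds, so the dependent guest's start is held back until the database answers — no explicit sleep needed.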