On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok. 
> 
> Been running monit for a few days, and atop (running a script to capture 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and starts over).  I 
> correlate the atop logs, Nagios alerts, and monit output to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
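> 
> (Roughly, the capture loop is along these lines -- a sketch only; the paths 
> and file naming here are illustrative, not the exact script:)
> 
>   #!/bin/sh
>   LOGDIR=/var/log/atop-capture            # illustrative path
>   mkdir -p "$LOGDIR"
>   while :; do
>       STAMP=$(date +%Y%m%d-%H%M)
>       # -w writes raw samples that "atop -r <file>" can replay later;
>       # 10-second interval, 360 samples = one hour per file
>       atop -w "$LOGDIR/atop-$STAMP.raw" 10 360
>       # keep only the most recent 24 raw files
>       ls -1t "$LOGDIR"/atop-*.raw | tail -n +25 | xargs -r rm -f
>   done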
> 
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.
> 
> It's a VMware/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, NIC, disk, video... 
> basically all "new" hardware).  Still have episodes.
> 
> Was running the VMware-provided vmtools.  Removed it and replaced it with 
> open-vm-tools this morning.  Just had another episode.
> 
> Was running atop interactively when the episode started - the only thing that 
> seems to change is that the host load goes up.  There was a momentary spike in 
> "avio" for the disk -- all the way up to 25 msec -- but it lasted for only one 
> ten-second slice in atop.
> 
> No zombies, no wait, no spike in network, transport, memory use, or disk 
> reads/writes... nothing I can see (and by "I" I mean "we", as we have three 
> people looking).
> 
> I've got other boxes running the same OS - updated them at the same time, so 
> the patch level is all the same.  No similar issues.  The only difference is 
> that these two are running pacemaker, corosync, and keepalived.  Maybe when 
> they were updated, they needed a library I don't have?
> 
> Running     /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags 
> there.  So - not the OS, not I/O, not hardware (virtual as it is)... which 
> only leaves software.
> 
> Maybe pacemaker is just incompatible with:
> 
> Scientific Linux release 6.5 (Carbon)
> kernel  2.6.32-642.15.1.el6.x86_64
> 
> ??

That does sound bizarre. I haven't tried 6.5 in a while, but it's
certainly compatible with the current 6.8.

IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
the OS and/or other cluster-related packages to 6.8?
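
A quick way to compare what the boxes actually have installed (the package
names below are the usual ones -- adjust for your repos):

  rpm -qa 'pacemaker*' corosync corosynclib resource-agents keepalived
  cat /etc/redhat-release
  uname -r

And since you wondered about a missing library, something along the lines of

  for d in /usr/sbin/pacemakerd /usr/sbin/corosync /usr/sbin/keepalived; do
      echo "== $d"; ldd "$d" | grep 'not found'
  done

should turn up any unresolved shared libraries (no output means none missing).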

> At this point it's more of a curiosity than an out-and-out problem, as 
> performance does not seem to be noticeably impacted.  Packet-in, packet-out 
> seems unperturbed. The same cannot be said of us administrators...
> 
> 
> 
> 
> ________________________________________
> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
> Sent: Friday, March 03, 2017 7:27 AM
> To: users@clusterlabs.org
> Subject: Users Digest, Vol 26, Issue 10
> 
> 
> 
> Today's Topics:
> 
>    1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>       retrying (Ulrich Windl)
>    2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>       error retrying (emmanuel segura)
>    3. Antw: Re:  Never join a list without a problem...
>       (Jeffrey Westgate)
> 
> 
> ----------------------------------------------------------------------
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 3 Mar 2017 13:27:25 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: "users@clusterlabs.org" <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>         problem...
> Message-ID:
>         
> <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
> 
> Content-Type: text/plain; charset="us-ascii"
> 
> Appreciate the offer - not familiar with monit.
> 
> Going to try running atop through logrotate for the day, keep 12, rotate 
> hourly (to control space utilization) and see if I can catch anything that 
> way.  My biggest issue is that we've not caught it as it starts, so we don't 
> ever see anything amiss.
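> 
> (Something along these lines, assuming the capture writes plain-text atop 
> output to a single file -- the stanza and paths are illustrative only, and 
> it's driven from an hourly cron job since older logrotate versions have no 
> hourly keyword:)
> 
>   # /etc/logrotate.d/atop-capture
>   /var/log/atop-capture/atop.log {
>       rotate 12
>       missingok
>       notifempty
>       copytruncate
>       compress
>   }
> 
>   # /etc/cron.hourly/atop-capture-rotate
>   #!/bin/sh
>   exec /usr/sbin/logrotate -f /etc/logrotate.d/atop-capture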
> 
> If this doesn't work, then I will likely take you up on how to script monit 
> to catch something.
> 
> Thanks --
> 
> Jeff
> ________________________________________
> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
> Sent: Friday, March 03, 2017 4:51 AM
> To: users@clusterlabs.org
> Subject: Users Digest, Vol 26, Issue 9
> 
> 
> 
> Today's Topics:
> 
>    1. Re: Never join a list without a problem... (Jeffrey Westgate)
>    2. Re: PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to
>       PCMK_OCF_UNKNOWN_ERROR (Ken Gaillot)
>    3. Re: Cannot clone clvmd resource (Eric Ren)
>    4. Re: Cannot clone clvmd resource (Eric Ren)
>    5. Antw: Re:  Never join a list without a problem... (Ulrich Windl)
>    6. Antw: Re:  Cannot clone clvmd resource (Ulrich Windl)
>    7. Re: Insert delay between the startup of VirtualDomain
>       (Dejan Muhamedagic)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 2 Mar 2017 16:32:02 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: Adam Spiers <aspi...@suse.com>, "Cluster Labs - All topics related
>         to      open-source clustering welcomed" <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID:
>         
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
> 
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Since we have both pieces of the load-balanced cluster doing the same thing - 
> for as-yet unidentified reasons - we've put atop on one and sysdig on the 
> other.  Running atop at 10-second slices, hoping it will catch something. 
> While configuring it yesterday, that server went into its 'episode', but 
> there was nothing in the atop log to show anything.  Nothing else changed 
> except the CPU load average.  No increase in any other parameter.
> 
> frustrating.
> 
> 
> ________________________________________
> From: Adam Spiers [aspi...@suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> 
> Ferenc Wágner <wf...@niif.hu> wrote:
>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00.  (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>> process accounting info, allowing you to step back in time and check
>> resource usage in the past.
> 
> Nice, I didn't know atop could also log the collected data for future
> analysis.
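> 
> (For anyone following along, the raw-log mode is roughly -- interval and 
> sample counts here are just an example:)
> 
>     atop -w /var/log/atop.raw 10 360     # one sample every 10s for an hour
>     atop -r /var/log/atop.raw            # replay; 't'/'T' step forward/back
>     atop -r /var/log/atop.raw -b 10:00 -e 11:00   # limit to a time window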
> 
> If you want to capture even more detail, sysdig is superb:
> 
>     http://www.sysdig.org/
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Thu, 2 Mar 2017 17:31:33 -0600
> From: Ken Gaillot <kgail...@redhat.com>
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are
>         mapped to PCMK_OCF_UNKNOWN_ERROR
> Message-ID: <8b8dd955-8e35-6824-a80c-2556d833f...@redhat.com>
> Content-Type: text/plain; charset=windows-1252
> 
> On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
>> On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
>> <lars.ellenb...@linbit.com> wrote:
>>> When I recently tried to make use of the DEGRADED monitoring results,
>>> I found out that it does still not work.
>>>
>>> Because LRMD choses to filter them in ocf2uniform_rc(),
>>> and maps them to PCMK_OCF_UNKNOWN_ERROR.
>>>
>>> See patch suggestion below.
>>>
>>> It also filters away the other "special" rc values.
>>> Do we really not want to see them in crmd/pengine?
>>
>> I would think we do.
>>
>>> Why does LRMD think it needs to outsmart the pengine?
>>
>> Because the person that implemented the feature incorrectly assumed
>> the rc would be passed back unmolested.
>>
>>>
>>> Note: I did build it, but did not use this yet,
>>> so I have no idea if the rest of the implementation of the DEGRADED
>>> stuff works as intended or if there are other things missing as well.
>>
>> failcount might be the other place that needs some massaging.
>> specifically, not incrementing it when a degraded rc comes through
> 
> I think that's already taken care of.
> 
>>> Thoughts?
>>
>> looks good to me
>>
>>>
>>> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
>>> index 724edb7..39a7dd1 100644
>>> --- a/lrmd/lrmd.c
>>> +++ b/lrmd/lrmd.c
>>> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char 
>>> *stdout_data)
>>>  static int
>>>  ocf2uniform_rc(int rc)
>>>  {
>>> -    if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) {
>>> -        return PCMK_OCF_UNKNOWN_ERROR;
> 
> Let's simply use > PCMK_OCF_OTHER_ERROR here, since that's guaranteed to
> be the high end.
> 
> Lars, do you want to test that?
> 
>>> +    switch (rc) {
>>> +    default:
>>> +           return PCMK_OCF_UNKNOWN_ERROR;
>>> +
>>> +    case PCMK_OCF_OK:
>>> +    case PCMK_OCF_UNKNOWN_ERROR:
>>> +    case PCMK_OCF_INVALID_PARAM:
>>> +    case PCMK_OCF_UNIMPLEMENT_FEATURE:
>>> +    case PCMK_OCF_INSUFFICIENT_PRIV:
>>> +    case PCMK_OCF_NOT_INSTALLED:
>>> +    case PCMK_OCF_NOT_CONFIGURED:
>>> +    case PCMK_OCF_NOT_RUNNING:
>>> +    case PCMK_OCF_RUNNING_MASTER:
>>> +    case PCMK_OCF_FAILED_MASTER:
>>> +
>>> +    case PCMK_OCF_DEGRADED:
>>> +    case PCMK_OCF_DEGRADED_MASTER:
>>> +           return rc;
>>> +
>>> +#if 0
>>> +           /* What about these?? */
>>
>> yes, these should get passed back as-is too
>>
>>> +    /* 150-199 reserved for application use */
>>> +    PCMK_OCF_CONNECTION_DIED = 189, /* Operation failure implied by 
>>> disconnection of the LRM API to a local or remote node */
>>> +
>>> +    PCMK_OCF_EXEC_ERROR    = 192, /* Generic problem invoking the agent */
>>> +    PCMK_OCF_UNKNOWN       = 193, /* State of the service is unknown - 
>>> used for recording in-flight operations */
>>> +    PCMK_OCF_SIGNAL        = 194,
>>> +    PCMK_OCF_NOT_SUPPORTED = 195,
>>> +    PCMK_OCF_PENDING       = 196,
>>> +    PCMK_OCF_CANCELLED     = 197,
>>> +    PCMK_OCF_TIMEOUT       = 198,
>>> +    PCMK_OCF_OTHER_ERROR   = 199, /* Keep the same codes as PCMK_LSB */
>>> +#endif
>>>      }
>>> -
>>> -    return rc;
>>>  }
>>>
>>>  static int
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 3 Mar 2017 09:48:34 +0800
> From: Eric Ren <z...@suse.com>
> To: kgail...@redhat.com,        Cluster Labs - All topics related to
>         open-source clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Cannot clone clvmd resource
> Message-ID: <2bb79eb7-300a-509d-b65f-29b5899c4...@suse.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
> 
> On 03/02/2017 06:20 AM, Ken Gaillot wrote:
>> On 03/01/2017 03:49 PM, Anne Nicolas wrote:
>>> Hi there
>>>
>>>
>>> I'm testing quite a simple configuration to get clvm working. I'm just
>>> going crazy, as it seems clvmd cannot be cloned on the other nodes.
>>>
>>> clvmd starts well on node1 but fails on both node2 and node3.
>> Your config looks fine, so I'm going to guess there's some local
>> difference on the nodes.
>>
>>> In pacemaker journalctl I get the following message
>>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
>>> No such file or directory
>>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
>>> /cmirrord: No such file or directory
>> I have no idea where the above is coming from. pidofproc is an LSB
>> function, but (given journalctl) I'm assuming you're using systemd. I
>> don't think anything in pacemaker or resource-agents uses pidofproc (at
>> least not currently, not sure about the older version you're using).
> I guess Anne is using LVM2 on a SUSE release. In our lvm2 package, there are 
> cLVM-related resource agents for clvmd and cmirrord. They're using pidofproc.
> 
> Eric
> 
>>
>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
>>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
>>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
>>> p-dlm_stop_0 on node3 (local)
>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
>>> action:stop call_id:235
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
>>> p-dlm_stop_0 on node2
>>>
>>> Here is my configuration
>>>
>>> node 739312139: node1
>>> node 739312140: node2
>>> node 739312141: node3
>>> primitive admin_addr IPaddr2 \
>>>          params ip=172.17.2.10 \
>>>          op monitor interval=10 timeout=20 \
>>>          meta target-role=Started
>>> primitive p-clvmd ocf:lvm2:clvmd \
>>>          op start timeout=90 interval=0 \
>>>          op stop timeout=100 interval=0 \
>>>          op monitor interval=30 timeout=90
>>> primitive p-dlm ocf:pacemaker:controld \
>>>          op start timeout=90 interval=0 \
>>>          op stop timeout=100 interval=0 \
>>>          op monitor interval=60 timeout=90
>>> primitive stonith-sbd stonith:external/sbd
>>> group g-clvm p-dlm p-clvmd
>>> clone c-clvm g-clvm meta interleave=true
>>> property cib-bootstrap-options: \
>>>          have-watchdog=true \
>>>          dc-version=1.1.13-14.7-6f22ad7 \
>>>          cluster-infrastructure=corosync \
>>>          cluster-name=hacluster \
>>>          stonith-enabled=true \
>>>          placement-strategy=balanced \
>>>          no-quorum-policy=freeze \
>>>          last-lrm-refresh=1488404073
>>> rsc_defaults rsc-options: \
>>>          resource-stickiness=1 \
>>>          migration-threshold=10
>>> op_defaults op-options: \
>>>          timeout=600 \
>>>          record-pending=true
>>>
>>> Thanks in advance for your input
>>>
>>> Cheers
>>>
>>
>>
> 
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Fri, 3 Mar 2017 11:12:01 +0800
> From: Eric Ren <z...@suse.com>
> To: Cluster Labs - All topics related to open-source clustering
>         welcomed        <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Cannot clone clvmd resource
> Message-ID: <c004860e-376e-4bc3-1d35-d60428b41...@suse.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
> 
> On 03/02/2017 07:09 AM, Anne Nicolas wrote:
>>
>> On 01/03/2017 at 23:20, Ken Gaillot wrote:
>>> On 03/01/2017 03:49 PM, Anne Nicolas wrote:
>>>> Hi there
>>>>
>>>>
>>>> I'm testing quite a simple configuration to get clvm working. I'm just
>>>> going crazy, as it seems clvmd cannot be cloned on the other nodes.
>>>>
>>>> clvmd starts well on node1 but fails on both node2 and node3.
>>> Your config looks fine, so I'm going to guess there's some local
>>> difference on the nodes.
>>>
>>>> In pacemaker journalctl I get the following message
>>>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
>>>> No such file or directory
>>>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
>>>> /cmirrord: No such file or directory
>>> I have no idea where the above is coming from. pidofproc is an LSB
>>> function, but (given journalctl) I'm assuming you're using systemd. I
>>> don't think anything in pacemaker or resource-agents uses pidofproc (at
>>> least not currently, not sure about the older version you're using).
>>
>> Thanks for your feedback. I finally checked the RA script and found the
>> error.
>>
>> In the clvm2 RA script on the non-working nodes I got:
>> # Common variables
>> DAEMON="${sbindir}/clvmd"
>> CMIRRORD="${sbindir}/cmirrord"
>> LVMCONF="${sbindir}/lvmconf"
>>
>> On the working node:
>> DAEMON="/usr/sbin/clvmd"
>> CMIRRORD="/usr/sbin/cmirrord"
>>
>> Looks like the path variables were not interpreted. I just have to
>> check why I got those versions.
> A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since 
> SLE12-SP2 and openSUSE Leap 42.2, we recommend using 
> '/usr/lib/ocf/resource.d/heartbeat/clvm' instead, which comes from the 
> 'resource-agents' package.
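> 
> (In crm shell terms that would be something like the following -- just a 
> sketch, with the resource name and operation values taken from Anne's 
> configuration:
> 
>   crm configure primitive p-clvmd ocf:heartbeat:clvm \
>           op start timeout=90 interval=0 \
>           op stop timeout=100 interval=0 \
>           op monitor interval=30 timeout=90
> 
> in place of the ocf:lvm2:clvmd primitive.)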
> 
> Eric
>>
>> Thanks again for your answer
>>
>>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
>>>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
>>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
>>>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
>>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
>>>> p-dlm_stop_0 on node3 (local)
>>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
>>>> action:stop call_id:235
>>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
>>>> p-dlm_stop_0 on node2
>>>>
>>>> Here is my configuration
>>>>
>>>> node 739312139: node1
>>>> node 739312140: node2
>>>> node 739312141: node3
>>>> primitive admin_addr IPaddr2 \
>>>>          params ip=172.17.2.10 \
>>>>          op monitor interval=10 timeout=20 \
>>>>          meta target-role=Started
>>>> primitive p-clvmd ocf:lvm2:clvmd \
>>>>          op start timeout=90 interval=0 \
>>>>          op stop timeout=100 interval=0 \
>>>>          op monitor interval=30 timeout=90
>>>> primitive p-dlm ocf:pacemaker:controld \
>>>>          op start timeout=90 interval=0 \
>>>>          op stop timeout=100 interval=0 \
>>>>          op monitor interval=60 timeout=90
>>>> primitive stonith-sbd stonith:external/sbd
>>>> group g-clvm p-dlm p-clvmd
>>>> clone c-clvm g-clvm meta interleave=true
>>>> property cib-bootstrap-options: \
>>>>          have-watchdog=true \
>>>>          dc-version=1.1.13-14.7-6f22ad7 \
>>>>          cluster-infrastructure=corosync \
>>>>          cluster-name=hacluster \
>>>>          stonith-enabled=true \
>>>>          placement-strategy=balanced \
>>>>          no-quorum-policy=freeze \
>>>>          last-lrm-refresh=1488404073
>>>> rsc_defaults rsc-options: \
>>>>          resource-stickiness=1 \
>>>>          migration-threshold=10
>>>> op_defaults op-options: \
>>>>          timeout=600 \
>>>>          record-pending=true
>>>>
>>>> Thanks in advance for your input
>>>>
>>>> Cheers
>>>>
>>>
>>>
> 
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Fri, 03 Mar 2017 08:04:22 +0100
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>         problem...
> Message-ID: <58b91576020000a100024...@gwsmtp1.uni-regensburg.de>
> Content-Type: text/plain; charset=UTF-8
> 
>>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> wrote on 02.03.2017 at 17:32
> in message
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>> Since we have both pieces of the load-balanced cluster doing the same thing -
>> for as-yet unidentified reasons - we've put atop on one and sysdig on the
>> other.  Running atop at 10-second slices, hoping it will catch something.
>> While configuring it yesterday, that server went into its 'episode', but
>> there was nothing in the atop log to show anything.  Nothing else changed
>> except the CPU load average.  No increase in any other parameter.
>>
>> frustrating.
> 
> Hi!
> 
> You could try the monit approach (I could provide an RPM with a
> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
> 
> The part that monitors unusual load looks like this here:
>   check system host.domain.org
>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>     if cpu usage > 99% for 15 cycles then alert
>     if cpu usage (user) > 90% for 30 cycles then alert
>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>     group local
> ### all numbers are a matter of taste ;-)
> And my script (for lack of better ideas) looks like this:
> #!/bin/sh
> {
>     echo "========== $(/bin/date) =========="
>     /usr/bin/mpstat
>     echo "---"
>     /usr/bin/vmstat
>     echo "---"
>     /usr/bin/top -b -n 1 -Hi
> } >> /var/log/monit/top.log
> 
> Regards,
> Ulrich
> 
>>
>>
>> ________________________________________
>> From: Adam Spiers [aspi...@suse.com]
>> Sent: Wednesday, March 01, 2017 5:33 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Cc: Jeffrey Westgate
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>
>> Ferenc W?gner <wf...@niif.hu> wrote:
>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>>
>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>> to come back down to baseline, which is mostly 0.00.  (attached
>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>> concerning, as we'd like to find what's causing it and resolve it.
>>>
>>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>>> process accounting info, allowing you to step back in time and check
>>> resource usage in the past.
>>
>> Nice, I didn't know atop could also log the collected data for future
>> analysis.
>>
>> If you want to capture even more detail, sysdig is superb:
>>
>>     http://www.sysdig.org/
>>
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Fri, 03 Mar 2017 08:27:23 +0100
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re:  Cannot clone clvmd resource
> Message-ID: <58b91adb020000a100024...@gwsmtp1.uni-regensburg.de>
> Content-Type: text/plain; charset=US-ASCII
> 
>>>> Eric Ren <z...@suse.com> wrote on 03.03.2017 at 04:12 in message
> <c004860e-376e-4bc3-1d35-d60428b41...@suse.com>:
> [...]
>> A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since
>> SLE12-SP2 and openSUSE Leap 42.2, we recommend using
>> '/usr/lib/ocf/resource.d/heartbeat/clvm' instead, which comes from the
>> 'resource-agents' package.
> 
> [...]
> It seems some release notes were not clear enough: I found out that we are 
> also using ocf:lvm2:clvmd here (SLES11 SP4). When trying to diff, I found 
> this:
> # diff -u /usr/lib/ocf/resource.d/{lvm2,heartbeat}/clvmd |less
> diff: /usr/lib/ocf/resource.d/heartbeat/clvmd: No such file or directory
> # rpm -qf /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/lvm2/
> resource-agents-3.9.5-49.2
> lvm2-clvm-2.02.98-0.42.3
> 
> I'm confused!
> 
> Regards,
> Ulrich
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Fri, 3 Mar 2017 11:51:09 +0100
> From: Dejan Muhamedagic <deja...@fastmail.fm>
> To: Cluster Labs - All topics related to open-source clustering
>         welcomed        <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Insert delay between the startup of
>         VirtualDomain
> Message-ID: <20170303105109.GA16526@tuttle.homenet>
> Content-Type: text/plain; charset=iso-8859-1
> 
> Hi,
> 
> On Wed, Mar 01, 2017 at 01:47:21PM +0100, Oscar Segarra wrote:
>> Hi Dejan,
>>
>> In my environment, it is possible to launch the check from the hypervisor.
>> A simple telnet against a specific port may be enough to check if the service
>> is ready.
> 
> telnet is not so practical for scripts, better use ssh or
> the mysql client.
> 
>> In this simple scenario (and check), how can I instruct the second server to
>> wait until the mysql server is up?
> 
> That's what the ordering constraints in pacemaker are for. You
> don't need to do anything special.
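> 
> A minimal sketch with made-up resource names, via the crm shell:
> 
>     crm configure order o-db-before-app inf: vm-database vm-appserver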
> 
> Thanks,
> 
> Dejan
> 
>>
>> Thanks a lot
>>
>> On 1 Mar 2017 at 1:08 p.m., "Dejan Muhamedagic" <deja...@fastmail.fm>
>> wrote:
>>
>>> Hi,
>>>
>>> On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
>>>> Hi,
>>>>
>>>> Yes,
>>>>
>>>> The database server can be considered started up when it accepts mysql client
>>>> connections.
>>>> The application server can be considered started as soon as the listening port
>>>> is up and accepting connections.
>>>>
>>>> Can you provide an example of how to achieve this?
>>>
>>> Is it possible to connect to the database from the supervisor?
>>> Then something like this would do:
>>>
>>> mysql -h vm_ip_address ... < /dev/null
>>>
>>> If not, then if ssh works:
>>>
>>> echo mysql ... | ssh vm_ip_address
>>>
>>> I'm afraid I cannot help you more with the mysql details and what to
>>> put in place of the '...' above, but it should do whatever is necessary
>>> to test whether the database has reached a functional state. You can
>>> find an example in ocf:heartbeat:mysql: just look for the
>>> "test_table" parameter. Of course, you'll need to put that in a
>>> script and test its output and so on. I guess there's enough
>>> information on the internet on how to do that.
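>>>
>>> Very roughly, such a wrapper might look like the following (vm_ip_address
>>> and the credentials are placeholders):
>>>
>>>   #!/bin/sh
>>>   # exit 0 only once the database inside the guest answers a trivial query
>>>   mysql -h vm_ip_address -u monitor -pSECRET -e 'SELECT 1' >/dev/null 2>&1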
>>>
>>> Good luck!
>>>
>>> Dejan
>>>
>>>> Thanks a lot.
>>>>
>>>>
>>>> 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
>>>>
>>>>> Hi,
>>>>>
>>>>> On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
>>>>>> Hi,
>>>>>>
>>>>>> In my environment I have 5 guests that have to be started up in a
>>>>>> specified order, starting with the MySQL database server.
>>>>>>
>>>>>> I have set the order constraints and the VirtualDomains start in the right
>>>>>> order, but the problem I have is that the second host starts up faster
>>>>>> than the database server, and therefore applications running on the second
>>>>>> host raise errors due to database connectivity problems.
>>>>>>
>>>>>> I'd like to introduce a delay between the startup of the VirtualDomain of
>>>>>> the database server and the startup of the second guest.
>>>>>
>>>>> Do you have a way to check if this server is up? If so...
>>>>> The start action of VirtualDomain won't exit until the monitor
>>>>> action returns success. And there's a parameter called
>>>>> monitor_scripts (see the meta-data). Note that these programs
>>>>> (scripts) are run at the supervisor host and not in the guest.
>>>>> It's all a bit involved, but should be doable.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Is there any way to get this?
>>>>>>
>>>>>> Thanks a lot.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
