Ok. 

Been running monit for a few days, plus atop (a script captures atop output 
every 10 seconds for an hour, rotates the log, and starts again; it runs from 
midnight to midnight, rolls the date, and repeats).  I correlate the atop 
logs, Nagios alerts, and monit output to try to find a trigger.  Like trying 
to find one particular snowflake in Alaska in January.
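For the curious, a minimal sketch of such a capture loop (the paths, file 
naming, and retention here are my illustration, not the exact script):

#!/bin/sh
# One raw atop log per hour: 10-second samples, 360 samples per file.
# Raw files can be replayed later with 'atop -r <file>'.
LOGDIR=/var/log/atop
mkdir -p "$LOGDIR"
while :; do
    # atop -w <rawfile> <interval_seconds> <sample_count>
    /usr/bin/atop -w "$LOGDIR/atop_$(date +%Y%m%d_%H).raw" 10 360
    # keep two days of raw files so the disk doesn't fill
    find "$LOGDIR" -name 'atop_*.raw' -mtime +2 -delete
done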

We've had a handful of episodes with all the monitors running, and have 
determined nothing.  Nothing changes significantly between normal operation 
and high host load.

It's a VMware/ESXi-hosted VM, so we moved it to a different host and a 
different datastore (so, effectively new CPU, memory, NIC, disk, video... 
basically all "new" hardware).  Still have episodes.

We were running the VMware-provided vmtools; removed them and replaced them 
with open-vm-tools this morning.  Just had another episode.

I was running atop interactively when the episode started.  The only thing 
that seems to change is that the host load goes up, apart from a momentary 
spike in "avio" for the disk -- all the way up to 25 msec -- which lasted for 
exactly one ten-second atop slice.

No zombies, no wait, no spike in network, transport, memory use, or disk 
reads/writes... nothing I can see (and by "I" I mean "we", as we have three 
people looking).
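For reference, this is how we replay a raw atop log around an episode (the 
file name and time window below are made up for illustration):

# jump straight to the window where the load rose
atop -r /var/log/atop/atop_20170303_09.raw -b 09:20 -e 09:40
# inside the viewer: 't' steps forward one 10-second sample, 'T' steps
# back; 'd', 'm', 'n' switch to the disk, memory, and network views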

I've got other boxes running the same OS, updated at the same time, so the 
patch level is identical -- and they have no similar issues.  The only 
difference is that these two are running pacemaker, corosync, and keepalived.  
Maybe when those were updated they started needing a library I don't have?
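Two quick checks for that theory (a sketch; the box names are placeholders):

# compare installed package sets between a problem box and a healthy one
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/pkgs.$(hostname -s)
# copy both lists to one box, then:
diff /tmp/pkgs.problem-box /tmp/pkgs.healthy-box

# make sure the cluster daemons resolve every shared library they link
ldd /usr/sbin/pacemakerd /usr/sbin/corosync /usr/sbin/keepalived | grep 'not found'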

I'm also running /usr/sbin/iotop -obtqqq > /var/log/iotop.log (-o only 
processes doing I/O, -b batch mode, -t timestamps, -qqq suppress all headers) 
-- no red flags there.  So: not OS, not I/O, not hardware (virtual as it 
is...)... which only leaves software.

Maybe pacemaker is just incompatible with:

Scientific Linux release 6.5 (Carbon)
kernel  2.6.32-642.15.1.el6.x86_64

??

At this point it's more of a curiosity than an out-and-out problem, as 
performance does not seem to be noticeably impacted.  Packet-in, packet-out 
seems unperturbed.  The same cannot be said for us administrators...




________________________________________
From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
Sent: Friday, March 03, 2017 7:27 AM
To: users@clusterlabs.org
Subject: Users Digest, Vol 26, Issue 10

Send Users mailing list submissions to
        users@clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.clusterlabs.org/mailman/listinfo/users
or, via email, send a message with subject or body 'help' to
        users-requ...@clusterlabs.org

You can reach the person managing the list at
        users-ow...@clusterlabs.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Users digest..."


Today's Topics:

   1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
      retrying (Ulrich Windl)
   2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
      error retrying (emmanuel segura)
   3. Antw: Re:  Never join a list without a problem...
      (Jeffrey Westgate)


----------------------------------------------------------------------

------------------------------

Message: 3
Date: Fri, 3 Mar 2017 13:27:25 +0000
From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
To: "users@clusterlabs.org" <users@clusterlabs.org>
Subject: [ClusterLabs] Antw: Re:  Never join a list without a
        problem...
Message-ID:
        
<a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>

Content-Type: text/plain; charset="us-ascii"

Appreciate the offer - not familiar with monit.

Going to try running atop through logrotate for the day -- keep 12, rotate 
hourly (to control space utilization) -- and see if I can catch anything that 
way.  My biggest issue is that we've never caught it as it starts, so we 
never see anything amiss.
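Something like the stanza below would match that plan, assuming atop's text 
output is appended to a log file and logrotate is forced from an hourly cron 
job (the paths are assumptions on my part):

# /etc/logrotate.d/atop-capture
/var/log/atop/atop.log {
    rotate 12
    missingok
    notifempty
    copytruncate
}

# /etc/cron.d entry; -f forces rotation each hour, since older
# logrotate versions have no 'hourly' directive
0 * * * * root /usr/sbin/logrotate -f /etc/logrotate.d/atop-capture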

If this doesn't work, then I will likely take you up on how to script monit to 
catch something.

Thanks --

Jeff
________________________________________
From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
Sent: Friday, March 03, 2017 4:51 AM
To: users@clusterlabs.org
Subject: Users Digest, Vol 26, Issue 9

Send Users mailing list submissions to
        users@clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.clusterlabs.org/mailman/listinfo/users
or, via email, send a message with subject or body 'help' to
        users-requ...@clusterlabs.org

You can reach the person managing the list at
        users-ow...@clusterlabs.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Users digest..."


Today's Topics:

   1. Re: Never join a list without a problem... (Jeffrey Westgate)
   2. Re: PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to
      PCMK_OCF_UNKNOWN_ERROR (Ken Gaillot)
   3. Re: Cannot clone clvmd resource (Eric Ren)
   4. Re: Cannot clone clvmd resource (Eric Ren)
   5. Antw: Re:  Never join a list without a problem... (Ulrich Windl)
   6. Antw: Re:  Cannot clone clvmd resource (Ulrich Windl)
   7. Re: Insert delay between the startup of VirtualDomain
      (Dejan Muhamedagic)


----------------------------------------------------------------------

Message: 1
Date: Thu, 2 Mar 2017 16:32:02 +0000
From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
To: Adam Spiers <aspi...@suse.com>, "Cluster Labs - All topics related
        to      open-source clustering welcomed" <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Never join a list without a problem...
Message-ID:
        
<a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>

Content-Type: text/plain; charset="iso-8859-1"

Since we have both pieces of the load-balanced cluster doing the same thing - 
for still-as-yet-unidentified reasons - we've put atop on one and sysdig on 
the other.  Running atop at 10-second slices, hoping it will catch something.  
While configuring it yesterday, that server went into its 'episode', but 
there was nothing in the atop log to show anything.  Nothing else changed 
except the cpu load average.  No increase in any other parameter.

frustrating.


________________________________________
From: Adam Spiers [aspi...@suse.com]
Sent: Wednesday, March 01, 2017 5:33 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: Jeffrey Westgate
Subject: Re: [ClusterLabs] Never join a list without a problem...

Ferenc Wágner <wf...@niif.hu> wrote:
>Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>
>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>> longer, and we cannot set a clock by it - while the machine is 95%
>> idle (or more according to 'top'), the host load shoots up to 50 or
>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>> to come back down to baseline, which is mostly 0.00.  (attached
>> hostload.pdf) This happens to both machines, randomly, and is
>> concerning, as we'd like to find what's causing it and resolve it.
>
>Try running atop (http://www.atoptool.nl/).  It collects and logs
>process accounting info, allowing you to step back in time and check
>resource usage in the past.

Nice, I didn't know atop could also log the collected data for future
analysis.

If you want to capture even more detail, sysdig is superb:

    http://www.sysdig.org/
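A rolling sysdig capture that can be replayed after an episode might look 
like this (the sizes and paths are illustrative):

# keep 10 rotating trace files of 100 MB each
sysdig -w /var/tmp/trace.scap -C 100 -W 10
# later, replay a capture file (rotation appends numeric suffixes to
# the name) and list the busiest processes in that window
sysdig -r /var/tmp/trace.scap -c topprocs_cpu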



------------------------------

Message: 2
Date: Thu, 2 Mar 2017 17:31:33 -0600
From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are
        mapped to PCMK_OCF_UNKNOWN_ERROR
Message-ID: <8b8dd955-8e35-6824-a80c-2556d833f...@redhat.com>
Content-Type: text/plain; charset=windows-1252

On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
> On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
> <lars.ellenb...@linbit.com> wrote:
>> When I recently tried to make use of the DEGRADED monitoring results,
>> I found out that it does still not work.
>>
>> Because LRMD choses to filter them in ocf2uniform_rc(),
>> and maps them to PCMK_OCF_UNKNOWN_ERROR.
>>
>> See patch suggestion below.
>>
>> It also filters away the other "special" rc values.
>> Do we really not want to see them in crmd/pengine?
>
> I would think we do.
>
>> Why does LRMD think it needs to outsmart the pengine?
>
> Because the person that implemented the feature incorrectly assumed
> the rc would be passed back unmolested.
>
>>
>> Note: I did build it, but did not use this yet,
>> so I have no idea if the rest of the implementation of the DEGRADED
>> stuff works as intended or if there are other things missing as well.
>
> failcount might be the other place that needs some massaging.
> specifically, not incrementing it when a degraded rc comes through

I think that's already taken care of.

>> Thoughts?
>
> looks good to me
>
>>
>> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
>> index 724edb7..39a7dd1 100644
>> --- a/lrmd/lrmd.c
>> +++ b/lrmd/lrmd.c
>> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char *stdout_data)
>>  static int
>>  ocf2uniform_rc(int rc)
>>  {
>> -    if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) {
>> -        return PCMK_OCF_UNKNOWN_ERROR;

Let's simply use > PCMK_OCF_OTHER_ERROR here, since that's guaranteed to
be the high end.

Lars, do you want to test that?

>> +    switch (rc) {
>> +    default:
>> +           return PCMK_OCF_UNKNOWN_ERROR;
>> +
>> +    case PCMK_OCF_OK:
>> +    case PCMK_OCF_UNKNOWN_ERROR:
>> +    case PCMK_OCF_INVALID_PARAM:
>> +    case PCMK_OCF_UNIMPLEMENT_FEATURE:
>> +    case PCMK_OCF_INSUFFICIENT_PRIV:
>> +    case PCMK_OCF_NOT_INSTALLED:
>> +    case PCMK_OCF_NOT_CONFIGURED:
>> +    case PCMK_OCF_NOT_RUNNING:
>> +    case PCMK_OCF_RUNNING_MASTER:
>> +    case PCMK_OCF_FAILED_MASTER:
>> +
>> +    case PCMK_OCF_DEGRADED:
>> +    case PCMK_OCF_DEGRADED_MASTER:
>> +           return rc;
>> +
>> +#if 0
>> +           /* What about these?? */
>
> yes, these should get passed back as-is too
>
>> +    /* 150-199 reserved for application use */
>> +    PCMK_OCF_CONNECTION_DIED = 189, /* Operation failure implied by 
>> disconnection of the LRM API to a local or remote node */
>> +
>> +    PCMK_OCF_EXEC_ERROR    = 192, /* Generic problem invoking the agent */
>> +    PCMK_OCF_UNKNOWN       = 193, /* State of the service is unknown - used 
>> for recording in-flight operations */
>> +    PCMK_OCF_SIGNAL        = 194,
>> +    PCMK_OCF_NOT_SUPPORTED = 195,
>> +    PCMK_OCF_PENDING       = 196,
>> +    PCMK_OCF_CANCELLED     = 197,
>> +    PCMK_OCF_TIMEOUT       = 198,
>> +    PCMK_OCF_OTHER_ERROR   = 199, /* Keep the same codes as PCMK_LSB */
>> +#endif
>>      }
>> -
>> -    return rc;
>>  }
>>
>>  static int



------------------------------

Message: 3
Date: Fri, 3 Mar 2017 09:48:34 +0800
From: Eric Ren <z...@suse.com>
To: kgail...@redhat.com,        Cluster Labs - All topics related to
        open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Cannot clone clvmd resource
Message-ID: <2bb79eb7-300a-509d-b65f-29b5899c4...@suse.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 03/02/2017 06:20 AM, Ken Gaillot wrote:
> On 03/01/2017 03:49 PM, Anne Nicolas wrote:
>> Hi there
>>
>>
>> I'm testing quite an easy configuration to work on clvm. I'm just
>> getting crazy as it seems clvmd cannot be cloned on other nodes.
>>
>> clvmd starts well on node1 but fails on both node2 and node3.
> Your config looks fine, so I'm going to guess there's some local
> difference on the nodes.
>
>> In pacemaker journalctl I get the following message
>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
>> No such file or directory
>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
>> /cmirrord: No such file or directory
> I have no idea where the above is coming from. pidofproc is an LSB
> function, but (given journalctl) I'm assuming you're using systemd. I
> don't think anything in pacemaker or resource-agents uses pidofproc (at
> least not currently, not sure about the older version you're using).
I guess Anne is using LVM2 on a SUSE release. In our lvm2 package, there are 
cLVM-related resource agents for clvmd and cmirrord. They're using pidofproc.

Eric

>
>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
>> p-dlm_stop_0 on node3 (local)
>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
>> action:stop call_id:235
>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
>> p-dlm_stop_0 on node2
>>
>> Here is my configuration
>>
>> node 739312139: node1
>> node 739312140: node2
>> node 739312141: node3
>> primitive admin_addr IPaddr2 \
>>          params ip=172.17.2.10 \
>>          op monitor interval=10 timeout=20 \
>>          meta target-role=Started
>> primitive p-clvmd ocf:lvm2:clvmd \
>>          op start timeout=90 interval=0 \
>>          op stop timeout=100 interval=0 \
>>          op monitor interval=30 timeout=90
>> primitive p-dlm ocf:pacemaker:controld \
>>          op start timeout=90 interval=0 \
>>          op stop timeout=100 interval=0 \
>>          op monitor interval=60 timeout=90
>> primitive stonith-sbd stonith:external/sbd
>> group g-clvm p-dlm p-clvmd
>> clone c-clvm g-clvm meta interleave=true
>> property cib-bootstrap-options: \
>>          have-watchdog=true \
>>          dc-version=1.1.13-14.7-6f22ad7 \
>>          cluster-infrastructure=corosync \
>>          cluster-name=hacluster \
>>          stonith-enabled=true \
>>          placement-strategy=balanced \
>>          no-quorum-policy=freeze \
>>          last-lrm-refresh=1488404073
>> rsc_defaults rsc-options: \
>>          resource-stickiness=1 \
>>          migration-threshold=10
>> op_defaults op-options: \
>>          timeout=600 \
>>          record-pending=true
>>
>> Thanks in advance for your input
>>
>> Cheers
>>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




------------------------------

Message: 4
Date: Fri, 3 Mar 2017 11:12:01 +0800
From: Eric Ren <z...@suse.com>
To: Cluster Labs - All topics related to open-source clustering
        welcomed        <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Cannot clone clvmd resource
Message-ID: <c004860e-376e-4bc3-1d35-d60428b41...@suse.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 03/02/2017 07:09 AM, Anne Nicolas wrote:
>
> On 01/03/2017 at 23:20, Ken Gaillot wrote:
>> On 03/01/2017 03:49 PM, Anne Nicolas wrote:
>>> Hi there
>>>
>>>
>>> I'm testing quite an easy configuration to work on clvm. I'm just
>>> getting crazy as it seems clvmd cannot be cloned on other nodes.
>>>
>>> clvmd starts well on node1 but fails on both node2 and node3.
>> Your config looks fine, so I'm going to guess there's some local
>> difference on the nodes.
>>
>>> In pacemaker journalctl I get the following message
>>> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
>>> No such file or directory
>>> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
>>> /cmirrord: No such file or directory
>> I have no idea where the above is coming from. pidofproc is an LSB
>> function, but (given journalctl) I'm assuming you're using systemd. I
>> don't think anything in pacemaker or resource-agents uses pidofproc (at
>> least not currently, not sure about the older version you're using).
>
> Thanks for your feedback. I finally checked the RA script and found the
> error
>
> in clvm2 RA script on non working nodes I got
> # Common variables
> DAEMON="${sbindir}/clvmd"
> CMIRRORD="${sbindir}/cmirrord"
> LVMCONF="${sbindir}/lvmconf"
>
> on working node
> DAEMON="/usr/sbin/clvmd"
> CMIRRORD="/usr/sbin/cmirrord"
>
> Looks like the path variables were not interpolated. I just have to
> check why I got those versions
A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since 
SLE12-SP2 and openSUSE Leap 42.2, we recommend using 
'/usr/lib/ocf/resource.d/heartbeat/clvm' instead, which comes from the 
'resource-agents' package.

Eric
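A quick way to spot the un-substituted agent on a node (a sketch):

# a literal '${sbindir}' in the output means the path variables were
# never expanded when the package was built -- i.e. the buggy agent
grep -n '^DAEMON=\|^CMIRRORD=\|^LVMCONF=' /usr/lib/ocf/resource.d/lvm2/clvmd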
>
> Thanks again for your answer
>
>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
>>> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
>>> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
>>> p-dlm_stop_0 on node3 (local)
>>> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
>>> action:stop call_id:235
>>> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
>>> p-dlm_stop_0 on node2
>>>
>>> Here is my configuration
>>>
>>> node 739312139: node1
>>> node 739312140: node2
>>> node 739312141: node3
>>> primitive admin_addr IPaddr2 \
>>>          params ip=172.17.2.10 \
>>>          op monitor interval=10 timeout=20 \
>>>          meta target-role=Started
>>> primitive p-clvmd ocf:lvm2:clvmd \
>>>          op start timeout=90 interval=0 \
>>>          op stop timeout=100 interval=0 \
>>>          op monitor interval=30 timeout=90
>>> primitive p-dlm ocf:pacemaker:controld \
>>>          op start timeout=90 interval=0 \
>>>          op stop timeout=100 interval=0 \
>>>          op monitor interval=60 timeout=90
>>> primitive stonith-sbd stonith:external/sbd
>>> group g-clvm p-dlm p-clvmd
>>> clone c-clvm g-clvm meta interleave=true
>>> property cib-bootstrap-options: \
>>>          have-watchdog=true \
>>>          dc-version=1.1.13-14.7-6f22ad7 \
>>>          cluster-infrastructure=corosync \
>>>          cluster-name=hacluster \
>>>          stonith-enabled=true \
>>>          placement-strategy=balanced \
>>>          no-quorum-policy=freeze \
>>>          last-lrm-refresh=1488404073
>>> rsc_defaults rsc-options: \
>>>          resource-stickiness=1 \
>>>          migration-threshold=10
>>> op_defaults op-options: \
>>>          timeout=600 \
>>>          record-pending=true
>>>
>>> Thanks in advance for your input
>>>
>>> Cheers
>>>
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>




------------------------------

Message: 5
Date: Fri, 03 Mar 2017 08:04:22 +0100
From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
To: <users@clusterlabs.org>
Subject: [ClusterLabs] Antw: Re:  Never join a list without a
        problem...
Message-ID: <58b91576020000a100024...@gwsmtp1.uni-regensburg.de>
Content-Type: text/plain; charset=UTF-8

>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> wrote on 02.03.2017 at
17:32 in message
<a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet-unidentified reasons - we've put atop on one and sysdig on
> the other.  Running atop at 10-second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything.  Nothing else changed
> except the cpu load average.  No increase in any other parameter.
>
> frustrating.

Hi!

You could try the monit-approach (I could provide an RPM with a
"recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).

The part that monitors unusual load looks like this here:
  check system host.domain.org
    if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage > 99% for 15 cycles then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
    group local
### all numbers are a matter of taste ;-)
And my script (for lack of better ideas) looks like this:
#!/bin/sh
{
    echo "========== $(/bin/date) =========="
    /usr/bin/mpstat
    echo "---"
    /usr/bin/vmstat
    echo "---"
    /usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log

Regards,
Ulrich

>
>
> ________________________________________
> From: Adam Spiers [aspi...@suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner <wf...@niif.hu> wrote:
>>Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00.  (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>>Try running atop (http://www.atoptool.nl/).  It collects and logs
>>process accounting info, allowing you to step back in time and check
>>resource usage in the past.
>
> Nice, I didn't know atop could also log the collected data for future
> analysis.
>
> If you want to capture even more detail, sysdig is superb:
>
>     http://www.sysdig.org/
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org






------------------------------

Message: 6
Date: Fri, 03 Mar 2017 08:27:23 +0100
From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
To: <users@clusterlabs.org>
Subject: [ClusterLabs] Antw: Re:  Cannot clone clvmd resource
Message-ID: <58b91adb020000a100024...@gwsmtp1.uni-regensburg.de>
Content-Type: text/plain; charset=US-ASCII

>>> Eric Ren <z...@suse.com> wrote on 03.03.2017 at 04:12 in message
<c004860e-376e-4bc3-1d35-d60428b41...@suse.com>:
[...]
> A bugfix for this issue has been released in lvm2 2.02.120-70.1. And, since
> SLE12-SP2
> and openSUSE leap42.2, we recommend using
> '/usr/lib/ocf/resource.d/heartbeat/clvm'
> instead, which is from 'resource-agents' package.

[...]
It seems some release notes were not clear enough: I found out that we are also 
using ocf:lvm2:clvmd here (SLES11 SP4). When trying to diff, I found this:
# diff -u /usr/lib/ocf/resource.d/{lvm2,heartbeat}/clvmd |less
diff: /usr/lib/ocf/resource.d/heartbeat/clvmd: No such file or directory
# rpm -qf /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/lvm2/
resource-agents-3.9.5-49.2
lvm2-clvm-2.02.98-0.42.3

I'm confused!

Regards,
Ulrich






------------------------------

Message: 7
Date: Fri, 3 Mar 2017 11:51:09 +0100
From: Dejan Muhamedagic <deja...@fastmail.fm>
To: Cluster Labs - All topics related to open-source clustering
        welcomed        <users@clusterlabs.org>
Subject: Re: [ClusterLabs] Insert delay between the startup of
        VirtualDomain
Message-ID: <20170303105109.GA16526@tuttle.homenet>
Content-Type: text/plain; charset=iso-8859-1

Hi,

On Wed, Mar 01, 2017 at 01:47:21PM +0100, Oscar Segarra wrote:
> Hi Dejan,
>
> In my environment, it is possible to launch the check from the hypervisor.
> A simple telnet against a specific port may be enough to check if the service
> is ready.

telnet is not so practical for scripts; better to use ssh or
the mysql client.

> In this simple scenario (and check), how can I instruct the second server to
> wait until the mysql server is up?

That's what the ordering constraints in pacemaker are for. You
don't need to do anything special.

Thanks,

Dejan
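For example, a readiness probe run from the hypervisor could be as simple as 
the following (the host name and any credentials are placeholders of mine, 
not from this thread):

# exits 0 only once mysqld inside the guest accepts connections
mysqladmin -h vm_ip_address ping
# or, if the port is not reachable from the hypervisor, via ssh:
echo 'SELECT 1;' | ssh vm_ip_address mysql

Wrapped in a retry loop, that is the sort of check the VirtualDomain 
monitor_scripts parameter can run; once start does not return until the 
database answers, the ordering constraints sequence the second guest 
correctly.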

>
> Thanks a lot
>
> On 1 Mar 2017 at 1:08 p.m., "Dejan Muhamedagic" <deja...@fastmail.fm>
> wrote:
>
> > Hi,
> >
> > On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
> > > Hi,
> > >
> > > Yes,
> > >
> > > Database server can be considered started up when it accepts mysql client
> > > connections
> > > Application server can be considered started as soon as the listening port
> > > is up and accepting connections
> > >
> > > Can you provide any example about how to achieve this?
> >
> > Is it possible to connect to the database from the supervisor?
> > Then something like this would do:
> >
> > mysql -h vm_ip_address ... < /dev/null
> >
> > If not, then if ssh works:
> >
> > echo mysql ... | ssh vm_ip_address
> >
> > I'm afraid I cannot help you more with mysql details and what to
> > put in '...' stead above, but it should do whatever is necessary
> > to test if the database reached the functional state. You can
> > find an example in ocf:heartbeat:mysql: just look for the
> > "test_table" parameter. Of course, you'll need to put that in a
> > script and test output and so on. I guess that there's enough
> > information on the internet on how to do that.
> >
> > Good luck!
> >
> > Dejan
> >
> > > Thanks a lot.
> > >
> > >
> > > 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> > >
> > > > Hi,
> > > >
> > > > On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> > > > > Hi,
> > > > >
> > > > > In my environment I have 5 guests that have to be started up in a
> > > > > specified order, starting with the MySQL database server.
> > > > >
> > > > > I have set the order constraints and VirtualDomains start in the
> > right
> > > > > order but, the problem I have, is that the second host starts up
> > faster
> > > > > than the database server and therefore applications running on the
> > second
> > > > > host raise errors due to database connectivity problems.
> > > > >
> > > > > I'd like to introduce a delay between the startup of the
> > VirtualDomain of
> > > > > the database server and the startup of the second guest.
> > > >
> > > > Do you have a way to check if this server is up? If so...
> > > > The start action of VirtualDomain won't exit until the monitor
> > > > action returns success. And there's a parameter called
> > > > monitor_scripts (see the meta-data). Note that these programs
> > > > (scripts) are run at the supervisor host and not in the guest.
> > > > It's all a bit involved, but should be doable.
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > > Is there any way to get this?
> > > > >
> > > > > Thanks a lot.
> > > >
> > > > > _______________________________________________
> > > > > Users mailing list: Users@clusterlabs.org
> > > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > > >
> > > > > Project Home: http://www.clusterlabs.org
> > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > Bugs: http://bugs.clusterlabs.org
> > > >
> > > >
> > > > _______________________________________________
> > > > Users mailing list: Users@clusterlabs.org
> > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org
> > > >
> >
> > > _______________________________________________
> > > Users mailing list: Users@clusterlabs.org
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> >
> >
> > _______________________________________________
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >

> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




------------------------------

_______________________________________________
Users mailing list
Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users


End of Users Digest, Vol 26, Issue 9
************************************



------------------------------

_______________________________________________
Users mailing list
Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users


End of Users Digest, Vol 26, Issue 10
*************************************

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
