Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-28 Thread Dimitri Maziuk
On 04/27/2016 11:01 AM, Dejan Muhamedagic wrote:

>> ... I failed to
>> convince Andrew that wget'ing http://localhost/server-status/ is a
>> wrong thing to do in the first place (apache RA).
> 
> I'm not sure why it would be wrong, but neither can I vouch that
> there's no better way to do a basic apache functionality test. At
> any rate, the test URL can be defined using a parameter.

Define "basic apache functionality".

If the goal is to see that httpd is answering, HTTP code 404 or 302 is
just as good as 200 OK; the only failures are a connection timeout or a
TCP RST. If that is how the current version of the RA behaves -- I
didn't look -- then using http://floating_ip/ for the test URL should be
good enough. Certainly way better than the default of the normally
disabled /server-status @ 127.0.0.1.
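
A minimal sketch of the "any HTTP answer counts" idea, in Python rather than
the RA's own shell. It is not taken from the apache RA; the function name
httpd_is_answering and the timeout default are assumptions for illustration.
Redirects are deliberately not followed, so a 302 counts as an answer, and
proxies are bypassed so the check talks to the server directly.

```python
import http.client
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow 3xx responses; they then surface as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def httpd_is_answering(url, timeout=5.0):
    """Return True if the server sent any HTTP response, whatever its status."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore proxy environment variables
        _NoRedirect(),
    )
    try:
        with opener.open(url, timeout=timeout):
            return True  # 2xx
    except urllib.error.HTTPError:
        # 3xx/4xx/5xx: the server still answered, which is all that is asked.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused or reset connection, timeout, garbled reply: not answering.
        return False
```

With this, pointing the check at the floating IP only fails when nothing
answers on the socket, not when a particular page is missing.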

If you wanted to further shave off a bit of the load you could assume
that if it's listening it's answering. That could be the "lightweight"
check if there's an easy way to get this out of /proc or something.
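
One possible way to get "is it listening" out of /proc, sketched in Python and
assuming Linux: /proc/net/tcp and /proc/net/tcp6 list one socket per line,
with the local address as hex IP:port in the second column and the state in
the fourth, where 0A means LISTEN. The function names are made up for this
example.

```python
def listening_ports(proc_net_tcp_text):
    """Return the set of local TCP ports in LISTEN state.

    Expects the contents of /proc/net/tcp or /proc/net/tcp6.
    """
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A":  # TCP_LISTEN
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


def is_listening(port, paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Return True if any socket on this host listens on the given port."""
    for path in paths:
        try:
            with open(path) as handle:
                if port in listening_ports(handle.read()):
                    return True
        except OSError:
            continue  # e.g. no IPv6 table on this kernel
    return False
```

This only shows that some process holds a listening socket on the port; it
says nothing about whether that process still accepts or serves requests.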

(As I recall what prompted that back then was that at the time Andrew's
Cluster from Scratch failed to mention that you need to install
wget/curl and enable /server-status in the first place.)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-27 Thread Dejan Muhamedagic
Hi Dmitri,

On Tue, Apr 26, 2016 at 10:20:45AM -0500, Dmitri Maziuk wrote:
> On 2016-04-26 00:58, Klaus Wenninger wrote:
> 
> >But what you are attempting doesn't sound entirely proprietary.
> >So once you have something that looks like it might be useful
> >for others as well let the community participate and free yourself
> >from having to always take care of your private copy ;-)
> 
> Presumably you could try a pull request but last time I failed to
> convince Andrew that wget'ing http://localhost/server-status/ is a
> wrong thing to do in the first place (apache RA).

I'm not sure why it would be wrong, but neither can I vouch that
there's no better way to do a basic apache functionality test. At
any rate, the test URL can be defined using a parameter.

> So your pull
> request may never get merged.

True. That should really depend only on the quality of the
contribution. Sometimes other obstacles, which may not always
look justified from the outside, can get in the way. Though it
may not always look like that, maintainers are only human too ;-)

Thanks,

Dejan

> Which I suppose is better than my mon scripts: those are
> private-copy-only with no place in the heartbeat packages to try and
> share them.
> 
> Dimitri



Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-26 Thread Dmitri Maziuk

On 2016-04-26 00:58, Klaus Wenninger wrote:


> But what you are attempting doesn't sound entirely proprietary.
> So once you have something that looks like it might be useful
> for others as well let the community participate and free yourself
> from having to always take care of your private copy ;-)


Presumably you could try a pull request but last time I failed to 
convince Andrew that wget'ing http://localhost/server-status/ is a wrong 
thing to do in the first place (apache RA). So your pull request may 
never get merged.


Which I suppose is better than my mon scripts: those are 
private-copy-only with no place in the heartbeat packages to try and 
share them.


Dimitri





Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-26 Thread Klaus Wenninger
On 04/26/2016 06:04 AM, Ken Gaillot wrote:
> On 04/25/2016 10:23 AM, Dmitri Maziuk wrote:
>> On 2016-04-24 16:20, Ken Gaillot wrote:
>>
>>> Correct, you would need to customize the RA.
>> Well, you wouldn't because your custom RA will be overwritten by the
>> next RPM update.
> Correct again :)
>
> I should have mentioned that the convention is to copy the script to a
> different name before editing it. The recommended approach is to create
> a new provider for your organization. For example, copy the RA to a new
> directory /usr/lib/ocf/resource.d/local, so it would be used in
> pacemaker as ocf:local:mysql. You can use anything in place of "local".
>
But what you are attempting doesn't sound entirely proprietary.
So once you have something that looks like it might be useful
for others as well let the community participate and free yourself
from having to always take care of your private copy ;-)
>> Dimitri




Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-25 Thread Ken Gaillot
On 04/25/2016 10:23 AM, Dmitri Maziuk wrote:
> On 2016-04-24 16:20, Ken Gaillot wrote:
> 
>> Correct, you would need to customize the RA.
> 
> Well, you wouldn't because your custom RA will be overwritten by the
> next RPM update.

Correct again :)

I should have mentioned that the convention is to copy the script to a
different name before editing it. The recommended approach is to create
a new provider for your organization. For example, copy the RA to a new
directory /usr/lib/ocf/resource.d/local, so it would be used in
pacemaker as ocf:local:mysql. You can use anything in place of "local".
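
A small sketch of that copy step in Python. The source provider directory
"heartbeat" and the helper name copy_ra_to_provider are assumptions for
illustration; adjust them to wherever the stock agent lives on your system.
The helper refuses to overwrite an existing copy so local edits are not lost.

```python
import os
import shutil


def copy_ra_to_provider(agent, src_provider="heartbeat", dst_provider="local",
                        ocf_root="/usr/lib/ocf"):
    """Copy a stock resource agent into a separate provider directory.

    Returns the path of the new copy, usable as ocf:<dst_provider>:<agent>.
    """
    src = os.path.join(ocf_root, "resource.d", src_provider, agent)
    dst_dir = os.path.join(ocf_root, "resource.d", dst_provider)
    dst = os.path.join(dst_dir, agent)
    if os.path.exists(dst):
        raise FileExistsError(dst)  # keep an already customized copy intact
    os.makedirs(dst_dir, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 also preserves the executable bit
    return dst
```

Calling copy_ra_to_provider("mysql") as root would create
/usr/lib/ocf/resource.d/local/mysql, which package updates of the stock
agents should leave alone. The copy has to exist on every cluster node.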

> Dimitri




Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-25 Thread Dmitri Maziuk

On 2016-04-24 16:20, Ken Gaillot wrote:


> Correct, you would need to customize the RA.


Well, you wouldn't because your custom RA will be overwritten by the 
next RPM update.


Dimitri





Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-24 Thread Ken Gaillot
On 04/22/2016 01:13 PM, Dimitri Maziuk wrote:
> On 04/22/2016 12:58 PM, Ken Gaillot wrote:
> 
>>> Consider that monitoring - at least as part of the action -
>>> should check if what your service is actually providing is
>>> working according to some functional and nonfunctional
>>> constraints as to simulate the experience of the consumer of
>>> your services.
> 
> Goedel and Turing say the only one who can answer that is the
> actual consumer. So a simple check for what you *can* check would
> be very nice indeed.
> 
>> Also, you can provide multiple levels of monitoring:
>> 
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_multiple_monitor_operations
>>
>>
>> 
>> For example, you could provide a very simple check that just makes sure
>> MySQL is responding on its port, and run that frequently with a
>> low timeout. And your existing thorough monitor could be run less
>> frequently with a high timeout.
> 
> Looking at this, it seems you have to actually rewrite the RA to
> switch on $OCF_CHECK_LEVEL -- unless the stock RA already provides
> the "simple check" you need, is that correct?
> 
> E.g. this page:
> http://linux-ha.org/doc/man-pages/re-ra-apache.html suggests that
> apache RA does not and all you can do in practice is run the same
> curl http://localhost/server-status check with different 
> frequencies. Would that be what we actually have ATM?

Correct, you would need to customize the RA. Given how long you said a
check can take, I assumed you already had a custom check that did
something more detailed than the stock mysql RA.
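
To illustrate what "switching on $OCF_CHECK_LEVEL" could look like in a
customized agent, here is a sketch in Python (real agents are usually shell).
The monitor function and its two callbacks are hypothetical; the exit codes
0, 1 and 7 are the standard OCF values for success, generic error and not
running, and treating level 10 and above as "deep" follows the usual
convention rather than anything the stock mysql RA is known to do.

```python
import os

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor(shallow_check, deep_check, environ=None):
    """Run the cheap check always and the expensive one only at level >= 10.

    shallow_check and deep_check are callables returning True on success.
    """
    environ = os.environ if environ is None else environ
    try:
        level = int(environ.get("OCF_CHECK_LEVEL", "0"))
    except ValueError:
        level = 0  # treat a malformed level as the default shallow level
    if not shallow_check():
        return OCF_NOT_RUNNING
    if level >= 10 and not deep_check():
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

The cluster would then carry two monitor operations for the resource, one
without OCF_CHECK_LEVEL at a short interval and one with OCF_CHECK_LEVEL=10
at a longer interval and timeout.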



Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-24 Thread Dimitri Maziuk
On 04/22/2016 12:58 PM, Ken Gaillot wrote:

>> Consider that monitoring - at least as part of the action - should check
>> if what your service is actually providing is working according to some
>> functional and nonfunctional constraints as to simulate the experience
>> of the consumer of your services.

Goedel and Turing say the only one who can answer that is the actual
consumer. So a simple check for what you *can* check would be very nice
indeed.

> Also, you can provide multiple levels of monitoring:
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_multiple_monitor_operations
> 
> For example, you could provide a very simple check that just makes sure
> MySQL is responding on its port, and run that frequently with a low
> timeout. And your existing thorough monitor could be run less frequently
> with a high timeout.

Looking at this, it seems you have to actually rewrite the RA to switch
on $OCF_CHECK_LEVEL -- unless the stock RA already provides the "simple
check" you need, is that correct?

E.g. this page: http://linux-ha.org/doc/man-pages/re-ra-apache.html
suggests that apache RA does not and all you can do in practice is run
the same curl http://localhost/server-status check with different
frequencies. Would that be what we actually have ATM?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-22 Thread Ken Gaillot
On 04/22/2016 08:57 AM, Klaus Wenninger wrote:
> On 04/22/2016 03:29 PM, John Gogu wrote:
>> Hello community,
>> I am facing the following situation with a two-node Pacemaker DB cluster
>> (3 resources configured in the cluster: 1 MySQL DB resource, 1
>> Apache resource, 1 IP resource):
>> - every 61 seconds a MySQL monitoring action is started, with a
>> 1200 sec timeout.
> You can increase the timeout for monitoring.
>>
>> In some situations, due to high load on the machines, the monitoring
>> action runs into a timeout, and the cluster performs a failover even
>> though the DB is up and running. Do you have a hint on how monitoring
>> actions can be prioritized automatically?
>>
> Consider that monitoring - at least as part of the action - should check
> whether what your service is actually providing works according to some
> functional and nonfunctional constraints, so as to simulate the
> experience of the consumer of your services. So you probably don't want
> that to happen prioritized.
> So if you relaxed the timing requirements of your monitoring to
> something that would be acceptable in terms of the definition of the
> service you are providing, and you are still running into trouble, the
> service quality you are providing wouldn't be that spiffing either...

Also, you can provide multiple levels of monitoring:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_multiple_monitor_operations

For example, you could provide a very simple check that just makes sure
MySQL is responding on its port, and run that frequently with a low
timeout. And your existing thorough monitor could be run less frequently
with a high timeout.
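
As an illustration of such a "responding on its port" check, a Python sketch
that only attempts a TCP connect with a short timeout. The function name
port_is_accepting and the 2 second default are assumptions; for MySQL the
port would typically be 3306 unless configured otherwise.

```python
import socket


def port_is_accepting(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Because it never logs in or runs a query, this check stays cheap under load,
which is what makes it suitable for the frequent, low-timeout monitor.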

FYI there was a bug related to multiple monitors apparently introduced
in 1.1.10, such that a higher-level monitor failure might not trigger a
resource failure. It was recently fixed in the upstream master branch
(which will be in the soon-to-be-released 1.1.15-rc1).

>> Thank you and best regards,
>> John



Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-22 Thread Klaus Wenninger
On 04/22/2016 03:29 PM, John Gogu wrote:
> Hello community,
> I am facing the following situation with a two-node Pacemaker DB cluster
> (3 resources configured in the cluster: 1 MySQL DB resource, 1
> Apache resource, 1 IP resource):
> - every 61 seconds a MySQL monitoring action is started, with a
> 1200 sec timeout.
You can increase the timeout for monitoring.
>
> In some situations, due to high load on the machines, the monitoring
> action runs into a timeout, and the cluster performs a failover even
> though the DB is up and running. Do you have a hint on how monitoring
> actions can be prioritized automatically?
>
Consider that monitoring - at least as part of the action - should check
whether what your service is actually providing works according to some
functional and nonfunctional constraints, so as to simulate the
experience of the consumer of your services. So you probably don't want
that to happen prioritized.
So if you relaxed the timing requirements of your monitoring to
something that would be acceptable in terms of the definition of the
service you are providing, and you are still running into trouble, the
service quality you are providing wouldn't be that spiffing either...
> Thank you and best regards,
> John
