Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
No stonith configured.  Not explicitly anyway.
Does that factor into this somehow?

I've tested stonith, but we aren't doing it for customers.  Maybe in the future 
if someone cries or pays us money.
Our solution is deployed onto too many different machines.  A couple of bare 
metal.  A couple of VMs.  We don't want customers to need to figure out stonith 
and we can't test all possible configurations and write instructions.  So, they 
get one-size-fits-all.


Regards.
Mark K Vallevand   mark.vallev...@unisys.com  
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY 
MATERIAL and is thus for use only by the intended recipient. If you received 
this in error, please contact the sender and delete the e-mail and its 
attachments from all computers.


-Original Message-
From: Digimer [mailto:li...@alteeve.ca] 
Sent: Friday, October 16, 2015 11:51 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.

On 16/10/15 12:37 PM, Vallevand, Mark K wrote:
> Fencing, yes.  I have pcmk-redirect for each node in cluster.conf.

Do you have stonith configured (and tested!) in Pacemaker as well?


Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
We know.  We've worked out our application-specific answer to split brain.  
But, proper fencing is on our to-do list.
Currently we only deploy 2-node systems.  There is one application and its 
agent.  One resource is configured.  
We have this in cluster.conf
   
  
So, we don’t get quorum issues.
We are also experimenting with a second, redundant network for clustering use.  
It works, but we aren't deploying yet.
Haven't seen split-brain yet, except in early, fumble-fingered experiments.  

Reading the tutorial.  Always interested in understanding more.  Thanks.


Regards.
Mark K Vallevand   mark.vallev...@unisys.com  
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


-Original Message-
From: Digimer [mailto:li...@alteeve.ca] 
Sent: Friday, October 16, 2015 12:35 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.

On 16/10/15 01:14 PM, Vallevand, Mark K wrote:
> No stonith configured.  Not explicitly anyway.
> Does that factor into this somehow?

Yes, you will eventually have a split-brain.

All fencing in cman does with 'fence_pcmk' is say "hey, if you need to
fence, ask pacemaker to do it". That's useless if pacemaker can't fence.

> I've tested stonith, but we aren't doing it for customers.  Maybe in the 
> future if someone cries or pays us money.
> Our solution is deployed onto too many different machines.  A couple of bare 
> metal.  A couple of VMs.  We don't want customers to need to figure out 
> stonith and we can't test all possible configurations and write instructions. 
>  So, they get one-size-fits-all.

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Concept.3B_Fencing

You are doing a disservice to your customers. Without fencing, you
*will* have a bad day, it's just a question of when. I can't tell you
how many times I've heard "but it worked fine for over a year!".

Stonith is worth the hassle.


Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Digimer
On 16/10/15 12:37 PM, Vallevand, Mark K wrote:
> Fencing, yes.  I have pcmk-redirect for each node in cluster.conf.

Do you have stonith configured (and tested!) in Pacemaker as well?

> I run with default cman settings for corosync.  No totem clause.  That gives 
> the 20s detection.  Not sure what the defaults really are.
> I added  to 
> cluster.conf and get about a 5s detection.
> 
> The corosync man page says:
>    token
>        This timeout specifies in milliseconds until a token loss is
>        declared after not receiving a token.  This is the time spent
>        detecting a failure of a processor in the current configuration.
>        Reforming a new configuration takes about 50 milliseconds in
>        addition to this timeout.
>
>        The default is 1000 milliseconds.
>
>    token_retransmit
>        This timeout specifies in milliseconds after how long before
>        receiving a token the token is retransmitted.  This will be
>        automatically calculated if token is modified.  It is not
>        recommended to alter this value without guidance from the
>        corosync community.
>
>        The default is 238 milliseconds.
>
>    hold
>        This timeout specifies in milliseconds how long the token should
>        be held by the representative when the protocol is under low
>        utilization.  It is not recommended to alter this value without
>        guidance from the corosync community.
>
>        The default is 180 milliseconds.
>
>    token_retransmits_before_loss_const
>        This value identifies how many token retransmits should be
>        attempted before forming a new configuration.  If this value is
>        set, retransmit and hold will be automatically calculated from
>        retransmits_before_loss and token.
>
>        The default is 4 retransmissions.
> 
> But, I don't know what cman sets these to.  But, they aren't these values.  
> And, they aren't the values in the cman man page, which says this:

Maybe it's changed by the ubuntu packagers? I don't know, I don't use
debian or ubuntu.

>    Cman uses different defaults for some of the corosync parameters
>    listed in corosync.conf(5).  If you wish to use a non-default
>    setting, they can be configured in cluster.conf as shown above.
>    Cman uses the following default values:
>
>    <totem
>        vsftype="none"
>        token="10000"
>        token_retransmits_before_loss_const="20"
>        join="60"
>        consensus="4800"
>        rrp_mode="none"
>    />
>
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Friday, October 16, 2015 11:18 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Cluster node loss detection.
> 
> On 16/10/15 11:40 AM, Vallevand, Mark K wrote:
>> Thanks.  I wasn't completely aware of corosync's role in this.  I see new 
>> things in the docs every time I read them.
>>
>> I looked up the corosync settings at one time and did it again:
>>  token loss 3000ms
>>  retransmits 10
>> So 30s.  Redid my simple testing and got detection times of 22s, 26s, and 
>> 25s using very crude methods.
>> Any warnings about setting these values to something else?
>> We require our customers to use an isolated, private network for cluster 
>> communications.  All taken care of in our instructions and cluster 
>> configuration scripts.  Network traffic will not be a factor.  So, I'm 
>> thinking 1000ms and 5 retransmits as an experiment.
> 
> That is very high. I think the default is something like 236ms x 4 losses.
> 
> You do have fencing, right?
> 
>> I was pretty sure that DLM was just being informed by clustering, but I 
>> needed to ask.
>>
>> Again, thanks.
>>  
>>
>> Regards.
>> Mark K Vallevand   mark.vallev...@unisys.com 
>>  
>> Never try and teach a pig to sing: it's a waste of time, and it annoys the 
>> pig.
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Stopped node detection.

2015-10-16 Thread Vallevand, Mark K
Ubuntu 12.04 LTS
pacemaker 1.1.10
cman 3.1.7
corosync 1.4.6

If my cluster has no resources, it seems like it takes 20s for a stopped node 
to be detected.  Is the value really 20s and is it a parameter that can be 
adjusted?


Regards.
Mark K Vallevand   mark.vallev...@unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


[ClusterLabs] Alternative to resource monitor polling?

2015-10-16 Thread Vallevand, Mark K
Is there an alternative to resource monitor polling to detect a resource 
failure?
If, for example, a resource failure is detected by our own software, could it 
signal clustering that a resource has failed?

Regards.
Mark K Vallevand   mark.vallev...@unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


[ClusterLabs] Antw: Stopped node detection.

2015-10-16 Thread Ulrich Windl
>>> "Vallevand, Mark K" wrote on 15.10.2015 at 22:55 in message
<2f280811793d43418745268be7397...@us-exch13-5.na.uis.unisys.com>:
> Ubuntu 12.04 LTS
> pacemaker 1.1.10
> cman 3.1.7
> corosync 1.4.6
> 
> If my cluster has no resources, it seems like it takes 20s for a stopped 
> node to be detected.  Is the value really 20s and is it a parameter that can 
> be adjusted?

What should happen if you have a 5-second network outage (e.g. when quickly 
replugging a cable)? You can set this down to any crazy value I think, but be 
prepared to get what you asked for.



[ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
It looks like it takes 20s for a cluster to detect that a node has been lost.
The detection seems to correlate to dlm reporting its lost connection to the 
node.
Not sure if correlation is causation.
Anyway, can someone tell me where that 20s might be coming from and if it is 
adjustable?


Ubuntu 12.04 LTS
pacemaker 1.1.10
cman 3.1.7
corosync 1.4.6

Thanks!

Regards.
Mark K Vallevand   mark.vallev...@unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


Re: [ClusterLabs] Stopped node detection.

2015-10-16 Thread Ken Gaillot
On 10/15/2015 03:55 PM, Vallevand, Mark K wrote:
> Ubuntu 12.04 LTS
> pacemaker 1.1.10
> cman 3.1.7
> corosync 1.4.6
> 
> If my cluster has no resources, it seems like it takes 20s for a stopped node 
> to be detected.  Is the value really 20s and is it a parameter that can be 
> adjusted?

The corosync token timeout is the main factor, so check your corosync.conf.

Pacemaker will then try to fence the node (if it was stopped uncleanly),
so that will take some time depending on what fencing you're using.

Generally this takes much less than 20s, but maybe you have a longer
timeout configured, or fencing is not working, or something like that.
The logs should have some clues, post them if you can't find it.



[ClusterLabs] A question about resource monitoring.

2015-10-16 Thread Vallevand, Mark K
Is there an alternative to resource monitoring?  Maybe a 'supplement' to 
resource polling is a better way to say it.
If my application self-detects an error and wants to report it (rather than 
wait for the monitor to poll it), can it report that to clustering?
Suggestions are welcome.

And a quick follow up question.  Are there any practical reasons for not having 
a very short resource monitor period?

Regards.
Mark K Vallevand   mark.vallev...@unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
Thanks.  I wasn't completely aware of corosync's role in this.  I see new 
things in the docs every time I read them.

I looked up the corosync settings at one time and did it again:
token loss 3000ms
retransmits 10
So 30s.  Redid my simple testing and got detection times of 22s, 26s, and 25s 
using very crude methods.
Any warnings about setting these values to something else?
We require our customers to use an isolated, private network for cluster 
communications.  All taken care of in our instructions and cluster 
configuration scripts.  Network traffic will not be a factor.  So, I'm thinking 
1000ms and 5 retransmits as an experiment.

I was pretty sure that DLM was just being informed by clustering, but I needed 
to ask.

Again, thanks.


Regards.
Mark K Vallevand   mark.vallev...@unisys.com  
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


-Original Message-
From: Digimer [mailto:li...@alteeve.ca] 
Sent: Friday, October 16, 2015 10:04 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.

On 16/10/15 10:51 AM, Vallevand, Mark K wrote:
> It looks like it takes 20s for a cluster to detect that a node has been
> lost.

Loss is detected by corosync, and it declares loss after X lost totem
tokens, each token being declared lost after Y milliseconds. By default,
node loss should be detected in about 1 second of no network traffic,
but you need to check corosync's settings.
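
A rough way to sanity-check a detection time against the configured timeouts is sketched below in Python. It is only an approximation under stated assumptions: it treats `token` as the whole loss-detection window (as the corosync.conf(5) wording suggests), adds the `consensus` timeout on top, and assumes consensus falls back to 1.2 x token when it is not set. The function name is hypothetical, and real behaviour depends on the corosync and cman versions in use.

```python
def estimate_detection_ms(token_ms, consensus_ms=None):
    """Rough estimate of the time until a lost node is declared, in ms.

    Assumptions (not confirmed by this thread):
      * detection takes roughly token + consensus
      * consensus defaults to 1.2 * token when not configured
    """
    if consensus_ms is None:
        # Integer arithmetic avoids floating-point rounding surprises.
        consensus_ms = token_ms * 12 // 10
    return token_ms + consensus_ms


if __name__ == "__main__":
    # corosync's documented default token timeout is 1000 ms.
    print(estimate_detection_ms(1000))
    # An explicitly configured consensus value is used as given.
    print(estimate_detection_ms(3000, consensus_ms=4800))
```

Comparing the estimate with a measured failover time gives a quick hint as to whether the values actually in effect are the ones you think you configured.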

> The detection seems to correlate to dlm reporting its lost connection to
> the node.

Negative. DLM is informed when a node is declared lost and blocks until
fenced/stonithd tells it that the peer has been successfully fenced.
After which time, it reaps lost locks and recovers.

> Not sure if correlation is causation.

Correlation.

> Anyway, can someone tell me where that 20s might be coming from and if
> it is adjustable? 
> 
> Ubuntu 12.04 LTS
> pacemaker 1.1.10
>  cman 3.1.7
> corosync 1.4.6
> 
> Thanks!
> 
>  
> 
> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com
> 
> Never try and teach a pig to sing: it's a waste of time, and it annoys
> the pig.
> 
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is thus for use only by the intended recipient. If you
> received this in error, please contact the sender and delete the e-mail
> and its attachments from all computers.

This suffix has zero legal bearing, just saying. Anything posted to this
list is 100% open and public.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Digimer
On 16/10/15 11:40 AM, Vallevand, Mark K wrote:
> Thanks.  I wasn't completely aware of corosync's role in this.  I see new 
> things in the docs every time I read them.
> 
> I looked up the corosync settings at one time and did it again:
>   token loss 3000ms
>   retransmits 10
> So 30s.  Redid my simple testing and got detection times of 22s, 26s, and 25s 
> using very crude methods.
> Any warnings about setting these values to something else?
> We require our customers to use an isolated, private network for cluster 
> communications.  All taken care of in our instructions and cluster 
> configuration scripts.  Network traffic will not be a factor.  So, I'm 
> thinking 1000ms and 5 retransmits as an experiment.

That is very high. I think the default is something like 236ms x 4 losses.

You do have fencing, right?

> I was pretty sure that DLM was just being informed by clustering, but I 
> needed to ask.
> 
> Again, thanks.
>   
> 
> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com 
>  
> Never try and teach a pig to sing: it's a waste of time, and it annoys the 
> pig.


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
Oops.  Cman starts corosync.  Cman has corosync settings of token loss 10000ms 
and retransmit 10.  According to the man page, anyway.
Experimenting.


Regards.
Mark K Vallevand   mark.vallev...@unisys.com  
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.



Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Vallevand, Mark K
Fencing, yes.  I have pcmk-redirect for each node in cluster.conf.

I run with default cman settings for corosync.  No totem clause.  That gives 
the 20s detection.  Not sure what the defaults really are.
I added  to 
cluster.conf and get about a 5s detection.

The corosync man page says:
   token
       This timeout specifies in milliseconds until a token loss is
       declared after not receiving a token.  This is the time spent
       detecting a failure of a processor in the current configuration.
       Reforming a new configuration takes about 50 milliseconds in
       addition to this timeout.

       The default is 1000 milliseconds.

   token_retransmit
       This timeout specifies in milliseconds after how long before
       receiving a token the token is retransmitted.  This will be
       automatically calculated if token is modified.  It is not
       recommended to alter this value without guidance from the
       corosync community.

       The default is 238 milliseconds.

   hold
       This timeout specifies in milliseconds how long the token should
       be held by the representative when the protocol is under low
       utilization.  It is not recommended to alter this value without
       guidance from the corosync community.

       The default is 180 milliseconds.

   token_retransmits_before_loss_const
       This value identifies how many token retransmits should be
       attempted before forming a new configuration.  If this value is
       set, retransmit and hold will be automatically calculated from
       retransmits_before_loss and token.

       The default is 4 retransmissions.
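
The 238 ms and 180 ms defaults quoted above are consistent with both values being derived from `token` and `token_retransmits_before_loss_const`. A minimal Python sketch of that derivation follows. It assumes the formulas used in corosync's totem configuration code (token / (retransmits + 0.2) for the retransmit timeout, then 0.8 x retransmit minus one timer tick for hold). The function name and the 10 ms tick are assumptions for illustration, so check the source of your corosync version before relying on the numbers.

```python
def derived_timeouts(token_ms, retransmits_before_loss=4, tick_ms=10):
    """Return (token_retransmit, hold) in ms derived from the token timeout.

    Assumed formulas (hypothetical helper, not a corosync API):
      token_retransmit = token / (retransmits_before_loss + 0.2)
      hold             = 0.8 * token_retransmit - one timer tick
    """
    retransmit = int(token_ms / (retransmits_before_loss + 0.2))
    hold = int(retransmit * 0.8 - tick_ms)
    return retransmit, hold


if __name__ == "__main__":
    # Man page defaults: token=1000, token_retransmits_before_loss_const=4
    print(derived_timeouts(1000))  # (238, 180)
```

With the documented defaults this reproduces the man page figures, which is why changing `token` alone is usually enough and the derived values are best left to corosync.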

But, I don't know what cman sets these to.  But, they aren't these values.  
And, they aren't the values in the cman man page, which says this:
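   Cman uses different defaults for some of the corosync parameters
   listed in corosync.conf(5).  If you wish to use a non-default
   setting, they can be configured in cluster.conf as shown above.
   Cman uses the following default values:

   <totem
       vsftype="none"
       token="10000"
       token_retransmits_before_loss_const="20"
       join="60"
       consensus="4800"
       rrp_mode="none"
   />
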
So, it looks like setting the corosync parameters in cluster.conf has some 
effect.  Cman seems to pass them to corosync.
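
To see which totem settings a given cluster.conf overrides, a small Python sketch like the one below can help. The sample document, the cluster name, and the `token="3000"` value are hypothetical stand-ins, not the settings used in this thread, and `totem_overrides` is an illustrative helper rather than part of cman or corosync.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf content; names and values are illustrative only.
SAMPLE_CONF = """
<cluster name="example" config_version="2">
  <totem token="3000"/>
</cluster>
"""


def totem_overrides(conf_xml):
    """Return the attributes set on the <totem> element, or {} if absent.

    An empty result means no totem clause is present, so cman's built-in
    defaults would apply.
    """
    root = ET.fromstring(conf_xml.strip())
    totem = root.find("totem")
    return dict(totem.attrib) if totem is not None else {}


if __name__ == "__main__":
    print(totem_overrides(SAMPLE_CONF))  # {'token': '3000'}
    print(totem_overrides("<cluster/>"))  # {}
```

On a real node the same function could be fed the contents of /etc/cluster/cluster.conf to confirm that the totem clause is actually present before testing detection times.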

Onward!


Regards.
Mark K Vallevand   mark.vallev...@unisys.com  
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.


-Original Message-
From: Digimer [mailto:li...@alteeve.ca] 
Sent: Friday, October 16, 2015 11:18 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.

On 16/10/15 11:40 AM, Vallevand, Mark K wrote:
> Thanks.  I wasn't completely aware of corosync's role in this.  I see new 
> things in the docs every time I read them.
> 
> I looked up the corosync settings at one time and did it again:
>   token loss 3000ms
>   retransmits 10
> So 30s.  Redid my simple testing and got detection times of 22s, 26s, and 25s 
> using very crude methods.
> Any warnings about setting these values to something else?
> We require our customers to use an isolated, private network for cluster 
> communications.  All taken care of in our instructions and cluster 
> configuration scripts.  Network traffic will not be a factor.  So, I'm 
> thinking 1000ms and 5 retransmits as an experiment.

That is very high. I think the default is something like 236ms x 4 losses.

You do have fencing, right?

> I was pretty sure that DLM was just being informed by clustering, but I 
> needed to ask.
> 
> Again, thanks.
>   
> 
> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com 
>  
> Never try and teach a pig to sing: it's a waste of time, and it annoys the 
> pig.


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Cluster node loss detection.

2015-10-16 Thread Digimer
On 16/10/15 01:14 PM, Vallevand, Mark K wrote:
> No stonith configured.  Not explicitly anyway.
> Does that factor into this somehow?

Yes, you will eventually have a split-brain.

All fencing in cman does with 'fence_pcmk' is say "hey, if you need to
fence, ask pacemaker to do it". That's useless if pacemaker can't fence.

> I've tested stonith, but we aren't doing it for customers.  Maybe in the 
> future if someone cries or pays us money.
> Our solution is deployed onto too many different machines.  A couple of bare 
> metal.  A couple of VMs.  We don't want customers to need to figure out 
> stonith and we can't test all possible configurations and write instructions. 
>  So, they get one-size-fits-all.

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Concept.3B_Fencing

You are doing a disservice to your customers. Without fencing, you
*will* have a bad day, it's just a question of when. I can't tell you
how many times I've heard "but it worked fine for over a year!".

Stonith is worth the hassle.

> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com 
>  
> Never try and teach a pig to sing: it's a waste of time, and it annoys the 
> pig.
> 
> 
> -Original Message-
> From: Digimer [mailto:li...@alteeve.ca] 
> Sent: Friday, October 16, 2015 11:51 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Cluster node loss detection.
> 
> On 16/10/15 12:37 PM, Vallevand, Mark K wrote:
>> Fencing, yes.  I have pcmk-redirect for each node in cluster.conf.
> 
> Do you have stonith configured (and tested!) in Pacemaker as well?
> 
>> I run with default cman settings for corosync.  No totem clause.  That gives 
>> the 20s detection.  Not sure what the defaults really are.
>> I added  to 
>> cluster.conf and get about a 5s detection.
>>
>> The corosync man page says:
>>    token  This timeout specifies in milliseconds until a token loss is
>>           declared after not receiving a token.  This is the time spent
>>           detecting a failure of a processor in the current
>>           configuration.  Reforming a new configuration takes about 50
>>           milliseconds in addition to this timeout.
>>
>>           The default is 1000 milliseconds.
>>
>>    token_retransmit
>>           This timeout specifies in milliseconds after how long before
>>           receiving a token the token is retransmitted.  This will be
>>           automatically calculated if token is modified.  It is not
>>           recommended to alter this value without guidance from the
>>           corosync community.
>>
>>           The default is 238 milliseconds.
>>
>>    hold   This timeout specifies in milliseconds how long the token
>>           should be held by the representative when the protocol is
>>           under low utilization.  It is not recommended to alter this
>>           value without guidance from the corosync community.
>>
>>           The default is 180 milliseconds.
>>
>>    token_retransmits_before_loss_const
>>           This value identifies how many token retransmits should be
>>           attempted before forming a new configuration.  If this value
>>           is set, retransmit and hold will be automatically calculated
>>           from retransmits_before_loss and token.
>>
>>           The default is 4 retransmissions.
>>
>> But, I don't know what cman sets these to.  But, they aren't these values.  
>> And, they aren't the values in the cman man page, which says this:
> 
> Maybe it's changed by the ubuntu packagers? I don't know, I don't use
> debian or ubuntu.
> 
>>   Cman uses different defaults for some of the corosync parameters
>>   listed in corosync.conf(5).  If you wish to use a non-default
>>   setting, they can be configured in cluster.conf as shown above.
>>   Cman uses the following default values:
>>
>>   <totem
>>     vsftype="none"
>>     token="1"
>>     token_retransmits_before_loss_const="20"
>>     join="60"
>>     consensus="4800"
>>     rrp_mode="none"
>>   />
>>
>> From: Digimer [mailto:li...@alteeve.ca] 
>> Sent: Friday, October 16, 2015 11:18 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Subject: Re: [ClusterLabs] Cluster node loss detection.
>>
>> On 16/10/15 11:40 AM, Vallevand, Mark K wrote:
>>> Thanks.  I wasn't completely aware of corosync's role in this.  I see new 
>>> things in the docs every time I read them.
>>>
>>> I looked up the corosync settings at one time and did it again:
>>> token loss 3000ms
>>> retransmits 10
>>> So 30s.  Redid my simple testing