Re: [ClusterLabs] What do these logs mean in corosync.log
On Mon, 2018-02-12 at 23:25 +0800, lkxjtu wrote:
> These logs are all printed when the system is abnormal, and I am very
> confused about what they mean. Does anyone know what they mean? Thank you
> very much.
> corosync version 2.4.0
> pacemaker version 1.1.16
>
> 1)
> Feb 01 10:57:58 [18927] paas-controller-192-167-0-2 crmd: warning: find_xml_node: Could not find parameters in resource-agent.

This looks like one of the OCF resource agents used by the cluster does not
have a <parameters> section in its meta-data as it should.

> 2)
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
>
> 3)
> Feb 11 22:57:17 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
> Feb 11 22:57:25 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms

--
Ken Gaillot
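As background, this is roughly what the meta-data of an OCF agent is expected
to contain; the agent name "myagent" and its single parameter below are purely
illustrative and not taken from the thread:

    <?xml version="1.0"?>
    <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
    <resource-agent name="myagent" version="1.0">
      <version>1.0</version>
      <shortdesc lang="en">Illustrative agent</shortdesc>
      <longdesc lang="en">Hypothetical agent used only to show the layout.</longdesc>
      <!-- the <parameters> block is what find_xml_node is looking for -->
      <parameters>
        <parameter name="config" unique="1" required="0">
          <shortdesc lang="en">Config file</shortdesc>
          <longdesc lang="en">Path to a configuration file.</longdesc>
          <content type="string" default=""/>
        </parameter>
      </parameters>
      <actions>
        <action name="start"     timeout="20s"/>
        <action name="stop"      timeout="20s"/>
        <action name="monitor"   timeout="20s" interval="10s"/>
        <action name="meta-data" timeout="5s"/>
      </actions>
    </resource-agent>

An agent whose meta-data omits the <parameters> element entirely is the kind
of thing that produces the "Could not find parameters in resource-agent"
warning quoted above, even if the agent otherwise works.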
Re: [ClusterLabs] What do these logs mean in corosync.log
lkxjtu,

I will just comment on the corosync log.

> These logs are all printed when the system is abnormal, and I am very
> confused about what they mean. Does anyone know what they mean? Thank you
> very much.
> corosync version 2.4.0
> pacemaker version 1.1.16
>
> 1)
> Feb 01 10:57:58 [18927] paas-controller-192-167-0-2 crmd: warning: find_xml_node: Could not find parameters in resource-agent.
>
> 2)
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf

The question is: how often do you get these lines? If there are only a few of
them, it is nothing to worry about; it just means that a corosync message (or
messages) was lost and corosync tries to resend it. But if you have a lot of
these, followed by a new membership forming, it means you either:

- are using multicast, but messages got lost for some reason (usually the
  switches) -> try UDPU
- have a network MTU smaller than 1500 bytes where fragmentation is not
  allowed -> try reducing totem.netmtu

Honza

> 3)
> Feb 11 22:57:17 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
> Feb 11 22:57:25 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms
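For anyone who ends up in the second case, these are roughly the totem
settings Honza is referring to; the values shown are illustrative, not
recommendations from the thread:

    totem {
        version: 2

        # Unicast UDP instead of multicast, in case the switches drop
        # multicast frames (the first suggestion above).
        transport: udpu

        # Lower the totem MTU if the network path cannot carry 1500-byte
        # frames unfragmented (the second suggestion); 1200 is just an
        # example value.
        netmtu: 1200
    }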
Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?
Eric,

> General question. I tried to set up a cman + corosync + pacemaker cluster
> using two corosync rings. When I start the cluster, everything works fine,
> except when I do a 'corosync-cfgtool -s' it only shows one ring. I tried
> manually editing the /etc/cluster/cluster.conf file adding two sections,

AFAIK cluster.conf should be edited so that altname is used. So something
like in this example:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/cluster_administration/s1-config-rrp-cli-ca

I don't think you have to add altmulticast.

Honza

> but then cman complained that I didn't have a multicast address specified,
> even though I did. I tried editing the /etc/corosync/corosync.conf file,
> and then I could get two rings, but the nodes would not both join the
> cluster. Bah! I did some reading and saw that cman didn't support multiple
> rings years ago. Did it never get updated?
>
> [sig]
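Sketching roughly what the linked Red Hat document describes, an altname
child element per clusternode in cluster.conf, something like the following;
the host names and config_version are placeholders, and fencing is omitted:

    <cluster name="mycluster" config_version="3">
      <clusternodes>
        <clusternode name="node1-eth1.example.com" nodeid="1">
          <!-- second ring: this node's address on the alternate network -->
          <altname name="node1-eth2.example.com"/>
        </clusternode>
        <clusternode name="node2-eth1.example.com" nodeid="2">
          <altname name="node2-eth2.example.com"/>
        </clusternode>
      </clusternodes>
      <cman/>
    </cluster>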
Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?
On 2018-02-12 07:10 AM, Eric Robinson wrote:
> General question. I tried to set up a cman + corosync + pacemaker
> cluster using two corosync rings. When I start the cluster, everything
> works fine, except when I do a 'corosync-cfgtool -s' it only shows one
> ring. I tried manually editing the /etc/cluster/cluster.conf file adding
> two sections, but then cman complained that I didn't have a
> multicast address specified, even though I did. I tried editing the
> /etc/corosync/corosync.conf file, and then I could get two rings, but
> the nodes would not both join the cluster. Bah! I did some reading and
> saw that cman didn't support multiple rings years ago. Did it never get
> updated?
>
> [sig]

It's been a while since I tested it (couldn't use it because of issues with
GFS2), but yes, it worked. Don't edit corosync.conf; all corosync
configuration is handled in cman's cluster.conf. I believe you need to
specify the '<altname>' element for the second ring.

If you still have trouble, let me know and I'll see if I can find my old
notes.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein's
brain than in the near certainty that people of equal talent have lived and
died in cotton fields and sweatshops." - Stephen Jay Gould
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for a message duplication. Web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via RHEL 6
>> pcs interface) with multiple master/slave resources, disabled fencing
>> and the single sync interface.
>
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
> I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
> setup with that. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue
remains.

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6 should
be using the cman plugin with corosync 1. May I ask why you don't use EL7 if
you want such a recent stack?

>> All is ok mainly. But there is some problem of the cluster activity
>> performance when the master node is powered off (hard): the slave node
>> detects that the master one is down after about 100-3500 ms. And the
>> main question is how to avoid this 3 sec delay that occurred sometimes.
>
> Kind of interesting that you ever get a detection below 2000ms with the
> token-timeout set to that value. (Given you are doing a hard-shutdown
> that doesn't give corosync time to sign off.)
> You've derived these times from the corosync-logs!?
>
> Regards,
> Klaus
>
>> On the slave node i have a little script that checks the connection to
>> the master node. It detects a problem of a sync breakage within about
>> 100 ms. But corosync requires a much more time sometimes to figure out
>> the situation and mark the master node as offline one. It shows 'ok'
>> ring status.
>>
>> If i understand correctly then
>> 1 the pacemaker actions (crm_resource --move) will not perform until
>> corosync is not refreshed its ring state
>> 2 the detection of a problem (from a corosync side) can be speeded up
>> via timeout tuning in the corosync.conf
>> 3 there is no way to ask corosync to recheck its ring status or mark a
>> ring as failed manually
>>
>> But maybe i'm missing something.
>>
>> All i want is to move resources faster.
>> In my little script i tried to force the cluster software to move
>> resources to the slave node. But i've no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>> corosync.conf:
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster
>>     transport: udpu
>>     token: 2000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: main-node
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: reserve-node
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>> Regards,
>> Maxim.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein's
brain than in the near certainty that people of equal talent have lived and
died in cotton fields and sweatshops." - Stephen Jay Gould
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/12/2018 04:34 PM, Maxim wrote:
> 12.02.2018 16:15, Klaus Wenninger wrote:
>> On 02/12/2018 01:02 PM, Maxim wrote:
>> fencing-disabled is probably due to it being a test-setup ...
>> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
>> I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
>> setup with that. You've obviously edited corosync.conf to reflect that ...
>
> It is ok. Fencing is not required at the time.
> It works well with the latest stable corosync and pacemaker that were
> built manually (not from the RHEL 6 repos).
> And the attached config was generated by this pcs (I've removed the
> 'logging' section from there to decrease the message size).
>
>>> All is ok mainly. But there is some problem of the cluster
>>> activity performance when the master node is powered off (hard):
>>> the slave node detects that the master one is down after about
>>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>>> that occurred sometimes.
>>
>> Kind of interesting that you ever get a detection below 2000ms with
>> the token-timeout set to that value. (Given you are doing a
>> hard-shutdown that doesn't give corosync time to sign off.) You've
>> derived these times from the corosync-logs!?
>>
>> Regards, Klaus
>
> Not actually. After your message I've conducted some more investigations
> with quite active logging on the master node to get the real time when the
> node is going down. And... you are right. The delay is close to 4
> seconds. So there is a [floating] bug in my script.
> Thank you for your insight, Klaus =)
>
> But nevertheless, is there any mechanism to force the slave corosync "to
> think" that the master corosync is down?
> [I have seen the abilities of corosync-cfgtool but, it seems, it doesn't
> contain similar functionality]
> Or maybe are there some other ways?

Maybe a few notes on the other way ;-)

In general it is not easy to have a reliable answer to the question of
whether the other node is down within just, let's say, 100 ms. Think of
network hiccups, scheduling issues and the like ...

But if you are willing to accept false positives, you can reduce the token
timeout of corosync instead of having another script that tries to do the
job corosync is (amongst other things) made for. (At least that is how I
understood what you are aiming to do.)

Regards,
Klaus

> Regards, Maxim
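To make the suggestion concrete, a sketch of the relevant corosync.conf
change; 1000 ms below is only an illustrative value (it happens to be the
corosync 2 default), and the right number depends on how many false positives
the cluster can tolerate:

    totem {
        version: 2
        secauth: off
        cluster_name: cluster
        transport: udpu
        # A lower token timeout means a silent peer is declared lost sooner,
        # so failover starts earlier, but short network hiccups are more
        # likely to trigger a spurious membership change.
        token: 1000
    }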
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
12.02.2018 16:15, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
> fencing-disabled is probably due to it being a test-setup ...
> RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
> I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
> setup with that. You've obviously edited corosync.conf to reflect that ...

It is ok. Fencing is not required at the time.
It works well with the latest stable corosync and pacemaker that were
built manually (not from the RHEL 6 repos).
And the attached config was generated by this pcs (I've removed the
'logging' section from there to decrease the message size).

>> All is ok mainly. But there is some problem of the cluster
>> activity performance when the master node is powered off (hard):
>> the slave node detects that the master one is down after about
>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>> that occurred sometimes.
>
> Kind of interesting that you ever get a detection below 2000ms with
> the token-timeout set to that value. (Given you are doing a
> hard-shutdown that doesn't give corosync time to sign off.) You've
> derived these times from the corosync-logs!?
>
> Regards, Klaus

Not actually. After your message I've conducted some more investigations
with quite active logging on the master node to get the real time when the
node is going down. And... you are right. The delay is close to 4 seconds.
So there is a [floating] bug in my script.
Thank you for your insight, Klaus =)

But nevertheless, is there any mechanism to force the slave corosync "to
think" that the master corosync is down?
[I have seen the abilities of corosync-cfgtool but, it seems, it doesn't
contain similar functionality]
Or maybe are there some other ways?

Regards, Maxim
[ClusterLabs] What do these logs mean in corosync.log
These logs are all printed when the system is abnormal, and I am very confused
about what they mean. Does anyone know what they mean? Thank you very much.

corosync version 2.4.0
pacemaker version 1.1.16

1)
Feb 01 10:57:58 [18927] paas-controller-192-167-0-2 crmd: warning: find_xml_node: Could not find parameters in resource-agent.

2)
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice [TOTEM ] orf_token_rtr Retransmit List: 19f1cf

3)
Feb 11 22:57:17 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
Feb 11 22:57:25 [5206] paas-controller-192-20-20-6 cib: info: crm_compress_string: Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms
Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
On 02/12/2018 01:02 PM, Maxim wrote:
> Hello,
>
> [Sorry for a message duplication. Web mail client ruined the
> formatting of the previous e-mail =( ]
>
> There is a simple configuration of two cluster nodes (built via RHEL 6
> pcs interface) with multiple master/slave resources, disabled fencing
> and the single sync interface.

fencing-disabled is probably due to it being a test-setup ...
RHEL 6 pcs being made for configuring a cman-pacemaker-setup,
I'm not sure if it is advisable to do a setup for a corosync-2 pacemaker
setup with that. You've obviously edited corosync.conf to reflect that ...

> All is ok mainly. But there is some problem of the cluster activity
> performance when the master node is powered off (hard): the slave node
> detects that the master one is down after about 100-3500 ms. And the
> main question is how to avoid this 3 sec delay that occurred sometimes.

Kind of interesting that you ever get a detection below 2000ms with the
token-timeout set to that value. (Given you are doing a hard-shutdown that
doesn't give corosync time to sign off.) You've derived these times from the
corosync-logs!?

Regards,
Klaus

> On the slave node i have a little script that checks the connection to
> the master node. It detects a problem of a sync breakage within about
> 100 ms. But corosync requires a much more time sometimes to figure out
> the situation and mark the master node as offline one. It shows 'ok'
> ring status.
>
> If i understand correctly then
> 1 the pacemaker actions (crm_resource --move) will not perform until
> corosync is not refreshed its ring state
> 2 the detection of a problem (from a corosync side) can be speeded up
> via timeout tuning in the corosync.conf
> 3 there is no way to ask corosync to recheck its ring status or mark a
> ring as failed manually
>
> But maybe i'm missing something.
>
> All i want is to move resources faster.
> In my little script i tried to force the cluster software to move
> resources to the slave node. But i've no success so far.
>
> Could you please share your thoughts about the situation.
> Thank you in advance.
>
> Cluster software:
> corosync - 2.4.3
> pacemaker - 1.1.18
> libqb - 1.0.2
>
> corosync.conf:
> totem {
>     version: 2
>     secauth: off
>     cluster_name: cluster
>     transport: udpu
>     token: 2000
> }
>
> nodelist {
>     node {
>         ring0_addr: main-node
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: reserve-node
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> Regards,
> Maxim.
[ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?
General question. I tried to set up a cman + corosync + pacemaker cluster
using two corosync rings. When I start the cluster, everything works fine,
except when I do a 'corosync-cfgtool -s' it only shows one ring. I tried
manually editing the /etc/cluster/cluster.conf file, adding two sections, but
then cman complained that I didn't have a multicast address specified, even
though I did. I tried editing the /etc/corosync/corosync.conf file, and then I
could get two rings, but the nodes would not both join the cluster. Bah! I did
some reading and saw that cman didn't support multiple rings years ago. Did it
never get updated?

[sig]
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
Hello,

[Sorry for a message duplication. Web mail client ruined the formatting of
the previous e-mail =( ]

There is a simple configuration of two cluster nodes (built via the RHEL 6
pcs interface) with multiple master/slave resources, disabled fencing and a
single sync interface.

All is ok mainly. But there is some problem with the cluster's reaction time
when the master node is powered off (hard): the slave node detects that the
master one is down after about 100-3500 ms. And the main question is how to
avoid this 3-second delay that occurs sometimes.

On the slave node I have a little script that checks the connection to the
master node. It detects a broken sync link within about 100 ms. But corosync
sometimes requires much more time to figure out the situation and mark the
master node as offline; it still shows an 'ok' ring status.

If I understand correctly, then:
1. the pacemaker actions (crm_resource --move) will not be performed until
   corosync has refreshed its ring state
2. the detection of a problem (from the corosync side) can be sped up via
   timeout tuning in corosync.conf
3. there is no way to ask corosync to recheck its ring status or mark a ring
   as failed manually

But maybe I'm missing something.

All I want is to move resources faster.
In my little script I tried to force the cluster software to move resources
to the slave node, but I've had no success so far.

Could you please share your thoughts about the situation?
Thank you in advance.

Cluster software:
corosync - 2.4.3
pacemaker - 1.1.18
libqb - 1.0.2

corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: cluster
    transport: udpu
    token: 2000
}

nodelist {
    node {
        ring0_addr: main-node
        nodeid: 1
    }

    node {
        ring0_addr: reserve-node
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Regards,
Maxim.
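As an aside on point 1, a manual move is usually done roughly like this; the
resource name my_ms_resource is hypothetical, the node name comes from the
corosync.conf above, and the exact flags vary a little between pacemaker
versions (check crm_resource --help; for master/slave resources newer versions
also offer a --master modifier):

    # Ask pacemaker to move the resource to the surviving node. This works by
    # injecting a location constraint into the CIB.
    crm_resource --resource my_ms_resource --move --node reserve-node

    # Remember to drop that constraint afterwards, or the resource stays
    # pinned to reserve-node.
    crm_resource --resource my_ms_resource --clear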
Re: [ClusterLabs] Error when linking to libqb in shared library
Jan Pokorný writes:
> I guess you are linking your python extension with one of the
> pacemaker libraries (directly or indirectly to libcrmcommon), and in
> that case, you need to rebuild pacemaker with the patched libqb[*] for
> the whole arrangement to work. Likewise in that case, as you may be
> aware, the "API" is quite uncommitted at this point, stability hasn't
> been of importance so far (because of the handles into pacemaker being
> mostly abstracted through built-in CLI tools for the outside players
> so far, which I agree is encumbered with tedious round-trips, etc.).
> There's a huge debt in this area, so some discretion and perhaps
> feedback on which functions are indeed proper-API-worthy is advised.

The ultimate goal of my project is indeed to be able to propose or begin a
discussion around a stable API for Pacemaker, to eventually move away from
command-line tools as the only way to interact with the cluster.

Thank you, I'll investigate the proposed changes.

Cheers,
Kristoffer

> [*]
> shortcut 1: just recompile pacemaker with those extra
>             /usr/include/qb/qblog.h modifications as of the
>             referenced commit
> shortcut 2: if the above can be tolerated widely, this is certainly
>             for local development only: recompile pacemaker with
>             CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
>
> Hope this helps.
>
> --
> Jan (Poki)

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] Error when linking to libqb in shared library
[let's move this to the developers list]

On 12/02/18 07:22 +0100, Kristoffer Grönlund wrote:
> (and especially the libqb developers)
>
> I started hacking on a python library written in C which links to
> pacemaker, and so to libqb as well, but I'm encountering a strange
> problem which I don't know how to solve.
>
> When I try to import the library in python, I see this error:
>
> --- command ---
> PYTHONPATH='/home/krig/projects/work/libpacemakerclient/build/python'
> /usr/bin/python3
> /home/krig/projects/python-pacemaker/build/../python/clienttest.py
> --- stderr ---
> python3: utils.c:66: common: Assertion `"implicit callsite section is
> observable, otherwise target's and/or libqb's build is at fault, preventing
> reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
> ---
>
> This appears to be coming from the following libqb macro:
>
> https://github.com/ClusterLabs/libqb/blob/master/include/qb/qblog.h#L352
>
> There is a long comment above the macro which, if nothing else, tells me
> that I'm not the first person to have issues with it, but it doesn't
> really tell me what I'm doing wrong...
>
> Does anyone know what the issue is, and if so, what I could do to
> resolve it?

Something similar has been reported already:

https://github.com/ClusterLabs/libqb/pull/266#issuecomment-356855212

and the fix is proposed:

https://github.com/ClusterLabs/libqb/pull/288/commits/f9f180cdbcb189b6590e541502b1de658c81005e
https://github.com/ClusterLabs/libqb/pull/288

But the suitability depends on the particular use case.

I guess you are linking your python extension with one of the pacemaker
libraries (directly or indirectly to libcrmcommon), and in that case you need
to rebuild pacemaker with the patched libqb[*] for the whole arrangement to
work. Likewise in that case, as you may be aware, the "API" is quite
uncommitted at this point; stability hasn't been of importance so far
(because the handles into pacemaker are mostly abstracted through the
built-in CLI tools for outside players so far, which I agree is encumbered
with tedious round-trips, etc.). There's a huge debt in this area, so some
discretion, and perhaps feedback on which functions are indeed
proper-API-worthy, is advised.

[*]
shortcut 1: just recompile pacemaker with those extra
            /usr/include/qb/qblog.h modifications as of the referenced commit
shortcut 2: if the above can be tolerated widely, this is certainly
            for local development only: recompile pacemaker with
            CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION

Hope this helps.

--
Jan (Poki)
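For anyone trying shortcut 2, a rough sketch of the rebuild against a
pacemaker source checkout; the source path is illustrative, and as noted above
this is meant for local development only:

    # rebuild pacemaker with libqb's callsite-section machinery compiled out
    cd ~/src/pacemaker
    ./autogen.sh
    ./configure CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
    make
    sudo make install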
[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown
Hello

There is a simple configuration of two cluster nodes (built via RHEL 6 pcs
interface) with multiple master/slave resources, disabled fencing and the
single sync interface.

All is ok mainly. But there is some problem of the cluster activity
performance when the master node is powered off (hard): the slave node
detects that the master one is down after about 100-3500 ms. And the main
question is how to avoid this 3 sec delay that occurred sometimes.

On the slave node i have a little script that checks the connection to the
master node. It detects a problem of a sync breakage within about 100 ms.
But corosync requires a much more time sometimes to figure out the situation
and mark the master node as offline one. It shows 'ok' ring status.

If i understand correctly then
1 the pacemaker actions (crm_resource --move) will not perform until corosync
  is not refreshed its ring state
2 the detection of a problem (from a corosync side) can be speeded up via
  timeout tuning in the corosync.conf
3 there is no way to ask corosync to recheck its ring status or mark a ring
  as failed manually

But maybe i'm missing something.

All i want is to move resources faster.
In my little script i tried to force the cluster software to move resources
to the slave node. But i've no success so far.

Could you please share your thoughts about the situation.
Thank you in advance.

Cluster software:
corosync - 2.4.3
pacemaker - 1.1.18
libqb - 1.0.2

corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: cluster
    transport: udpu
    token: 2000
}

nodelist {
    node {
        ring0_addr: main-node
        nodeid: 1
    }

    node {
        ring0_addr: reserve-node
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Regards,
Maxim.
Re: [ClusterLabs] Issues with DB2 HADR Resource Agent
Thanks Ondrej for the response. I also figured out the same and reduced the
HADR_TIMEOUT and increased the promote timeout, which helped in resolving the
issue.

Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India

From: Ondrej Famera
To: Dileep V Nair
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Date: 02/12/2018 11:46 AM
Subject: Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000, which
> I guess is a reasonable value. What I am noticing is that it does not wait
> for the PEER_WINDOW. Before that itself the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker gives an error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
> Regards,
>
> Dileep V Nair

Hi Dileep,

sorry for the late response. The DB2 should not get into the 'REMOTE_CATCHUP'
phase or the DB2 resource agent will indeed not promote. From my experience
it usually gets into that state when the DB2 on the standby was restarted
during or after the PEER_WINDOW timeout.

When the primary DB2 fails, the standby should end up in some state that
matches the one on line 770 of the DB2 resource agent, and the promote
operation is attempted.

770 STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ClusterLabs_resource-2Dagents_blob_master_heartbeat_db2-23L770=DwIDBA=jf_iaSHvJObTbx-siA1ZOg=syjI0TzCX7--Qy0vFS1xy17vob_50Cur84Jg-YprJuw=dhvUwjWghTBfDEHmzU3P5eaU9Ce3DkCRdRPNd71L1bU=3vPiNA4KGdZzc0xJOYv5hMCObjWdlxZDO_bLb86YaGM=

The DB2 on the standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher if
this was the case. So if you see that DB2 was restarted after the primary
failed, increase the promote timeout. If DB2 was not restarted, then the
question is why DB2 has decided to change the status in this way.

Let me know if the above helped.

--
Ondrej Faměra
@Red Hat
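For reference, a sketch of how the promote timeout can be raised with pcs on
an existing DB2 master/slave resource; the resource name db2_hadr and the
900-second value are illustrative only, not taken from Dileep's cluster:

    # raise the promote operation timeout so a slow HADR takeover is not
    # killed (and the standby restarted) mid-promote
    pcs resource update db2_hadr op promote timeout=900s

    # verify the operation settings afterwards
    pcs resource show db2_hadr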