Re: [ClusterLabs] What does these logs mean in corosync.log

2018-02-12 Thread Ken Gaillot
On Mon, 2018-02-12 at 23:25 +0800, lkxjtu wrote:
> These logs are both printed when the system is abnormal; I am very confused
> about what they mean. Does anyone know what they mean? Thank you very much.
> corosync version   2.4.0
> pacemaker version  1.1.16
> 
> 1)
> Feb 01 10:57:58 [18927] paas-controller-192-167-0-2   crmd: 
> warning: find_xml_node:    Could not find parameters in resource-
> agent.

This looks like one of the OCF resource agents used by the cluster does
not have a "<parameters>" section in its meta-data as it should.

> 2)
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1bb
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice 
> [TOTEM ] orf_token_rtr Retransmit List: 19f1cf
> 
> 3)
> Feb 11 22:57:17 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11533
> (ratio 20:1) in 51ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11522
> (ratio 20:1) in 53ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11537
> (ratio 20:1) in 45ms
> Feb 11 22:57:21 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11514
> (ratio 20:1) in 47ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11536
> (ratio 20:1) in 50ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11551
> (ratio 20:1) in 51ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11524
> (ratio 20:1) in 54ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11545
> (ratio 20:1) in 60ms
> Feb 11 22:57:22 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11536
> (ratio 20:1) in 54ms
> Feb 11 22:57:25 [5206] paas-controller-192-20-20-6    cib:
> info: crm_compress_string:   Compressed 233922 bytes into 11522
> (ratio 20:1) in 61ms
> 
> 
>  
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] What does these logs mean in corosync.log

2018-02-12 Thread Jan Friesse

lkxjtu,
I will comment only on the corosync log.


These logs are both printed when the system is abnormal; I am very confused about what
they mean. Does anyone know what they mean? Thank you very much.
corosync version   2.4.0
pacemaker version  1.1.16

1)
Feb 01 10:57:58 [18927] paas-controller-192-167-0-2   crmd:  warning: 
find_xml_node:Could not find parameters in resource-agent.

2)
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf



The question is: how often do you get these lines? If there are only a few of 
them, it's nothing to worry about; it just means that corosync message(s) 
was/were lost and Corosync tries to resend them.


But if you have a lot of these, followed by a new membership forming, it 
means you either:
- are using multicast, but messages got lost for some reason (usually 
switches) -> try UDPU (see the totem snippet below)
- have a network MTU smaller than 1500 bytes and fragmentation is not 
allowed -> try reducing totem.netmtu (also shown below)
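
For illustration only (the values are examples, not recommendations), both 
knobs live in the totem section of corosync.conf:

totem {
    version: 2
    # switch from multicast to unicast UDP
    transport: udpu
    # lower the effective packet size if the path cannot carry 1500-byte frames
    netmtu: 1200
}

Changing either option typically requires restarting corosync on all nodes.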


Honza



3)
Feb 11 22:57:17 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
Feb 11 22:57:25 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Does CMAN Still Not Support Multipe CoroSync Rings?

2018-02-12 Thread Jan Friesse

Eric,

General question. I tried to set up a cman + corosync + pacemaker cluster using two corosync rings. When I start the cluster, everything works fine, except when I do a 'corosync-cfgtool -s' it only shows one ring. I tried manually editing the /etc/cluster/cluster.conf file adding two  


AFAIK cluster.conf should be edited so that altname is used, so something 
like in this example: 
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/cluster_administration/s1-config-rrp-cli-ca


I don't think you have to add altmulticast.
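
A rough sketch of what the linked example boils down to (node names are 
placeholders; check the Red Hat document above for the authoritative syntax):

<clusternodes>
  <clusternode name="node1.example.com" nodeid="1">
    <!-- address/hostname on the second network -->
    <altname name="node1-alt.example.com"/>
  </clusternode>
  <clusternode name="node2.example.com" nodeid="2">
    <altname name="node2-alt.example.com"/>
  </clusternode>
</clusternodes>

i.e. one altname per clusternode pointing at the second-network hostname/IP, 
rather than a whole second interface definition.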

Honza

sections, but then cman complained that I didn't have a multicast 
address specified, even though I did. I tried editing the 
/etc/corosdync/corosync.conf file, and then I could get two rings, but 
the nodes would not both join the cluster. Bah! I did some reading and 
saw that cman didn't support multiple rings years ago. Did it never get 
updated?


[sig]




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Does CMAN Still Not Support Multipe CoroSync Rings?

2018-02-12 Thread Digimer
On 2018-02-12 07:10 AM, Eric Robinson wrote:
> General question. I tried to set up a cman + corosync + pacemaker
> cluster using two corosync rings. When I start the cluster, everything
> works fine, except when I do a ‘corosync-cfgtool -s’ it only shows one
> ring. I tried manually editing the /etc/cluster/cluster.conf file adding
> two  sections, but then cman complained that I didn’t have a
> multicast address specified, even though I did. I tried editing the
> /etc/corosdync/corosync.conf file, and then I could get two rings, but
> the nodes would not both join the cluster. Bah! I did some reading and
> saw that cman didn’t support multiple rings years ago. Did it never get
> updated?   
> 
>  
> 
> sig

It's been a while since I tested it (couldn't use it because of issues
with GFS2), but yes, it worked. Don't edit corosync.conf; all corosync
config is handled in cman's cluster.conf. I believe you need to specify
the '<altname>' element for the second ring.

If you still have trouble, let me know and I'll see if I can find my old
notes.


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Digimer
On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for a message duplication. Web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via RHEL 6
>> pcs interface) with multiple master/slave resources, disabled fencing
>> and the single sync interface.
> 
> fencing-disabled is probably due to it being a test setup ...
> RHEL 6 pcs is made for configuring a cman/pacemaker setup, so
> I'm not sure it is advisable to set up a corosync-2 pacemaker
> cluster with it. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue
remains.

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
should be using the cman plugin with corosync 1. May I ask why you
don't use EL7 if you want such a recent stack?

>> All is ok mainly. But there is some problem of the cluster activity
>> performance when the master node is powered off (hard): the slave node
>> detects that the master one is down after about 100-3500 ms. And the
>> main question is how to avoid this 3 sec delay that occurred sometimes.
> 
> Kind of interesting that you ever get a detection below 2000ms with the
> token-timeout set to that value. (Given you are doing a hard-shutdown
> that doesn't give corosync time to sign off.)
> You've derived these times from the corosync-logs!?
> 
> Regards,
> Klaus
> 
>>
>> On the slave node i have a little script that checks the connection to
>> the master node. It detects a problem of a sync breakage within about
>> 100 ms. But corosync requires a much more time sometimes to figure out
>> the situation and mark the master node as offline one. It shows 'ok'
>> ring status.
>>
>> If i understand correctly then
>> 1 the pacemaker actions (crm_resource --move) will not perform until
>> corosync is not refreshed its ring state
>> 2 the detection of a problem (from a corosync side) can be speeded up
>> via timeout tuning in the corosync.conf
>> 3 there is no way to ask corosync to recheck its ring status or mark a
>> ring as failed manually
>>
>> But maybe i'm missing something.
>>
>> All i want is to move resources faster.
>> In my little script i tried to force the cluster software to move
>> resources to the slave node. But i've no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>>
>> corosync.conf:
>> totem {
>>   version: 2
>>   secauth: off
>>   cluster_name: cluster
>>   transport: udpu
>>   token: 2000
>> }
>>
>> nodelist {
>>  node {
>>  ring0_addr: main-node
>>  nodeid: 1
>>     }
>>
>>  node {
>>  ring0_addr: reserve-node
>>  nodeid: 2
>>  }
>> }
>>
>> quorum {
>>  provider: corosync_votequorum
>>  two_node: 1
>> }
>>
>>
>> Regards,
>> Maxim.
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 04:34 PM, Maxim wrote:
> 12.02.2018 16:15, Klaus Wenninger пишет:
>> On 02/12/2018 01:02 PM, Maxim  wrote:
> > fencing-disabled is probably due to it being a test-setup ... RHEL 6
> > pcs being made for configuring a cman-pacemaker-setup I'm not sure if
> > it is advisable to do a setup for a corosync-2 pacemaker setup with
> > that. You've obviously edited corosync.conf to reflect that ...
> It is ok. Fencing is not required at the time.
> It works well with latest stable corosync and pacemaker that were
> built manually (not from RHEL 6 repos).
> And the attached config was generated by this pcs (i've removed
> 'logging' section from there to decrease a message size).
>
>>
> >>
> >> All is ok mainly. But there is some problem of the cluster
> >> activity performance when the master node is powered off (hard):
> >> the slave node detects that the master one is down after about
> >> 100-3500 ms. And the main question is how to avoid this 3 sec delay
> >> that occurred sometimes.
> >
> > Kind of interesting that you ever get a detection below 2000ms with
> > the token-timeout set to that value. (Given you are doing a
> > hard-shutdown that doesn't give corosync time to sign off.) You've
> > derived these times from the corosync-logs!?
> >
> > Regards, Klaus
> >
> Not actually. After your message I've conducted some more investigations
> with quite active logging on the master node to get the real time when the
> node is going down. And... you are right. The delay is close to 4
> seconds. So there is a [floating] bug in my script.
> Thank you for your insight, Klaus =)
>
> But nevertheless, is there any mechanism to force the slave corosync "to
> think" that the master corosync is down?
> [I have seen the abilities of corosync-cfgtool but it seems it doesn't
> contain similar functionality]
> Or maybe there are some other ways?

Maybe a few notes on the other way ;-)
In general it is not easy to get a reliable answer
to the question of whether the other node is down within,
let's say, 100 ms.
Think of network hiccups, scheduling issues and
the like ...
But if you are willing to accept false positives
you can reduce the token timeout of corosync
instead of having another script that tries to do
the job corosync is (amongst other things) made
for (at least that is how I understood what you
are aiming to do). For example:
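
totem {
    ...
    # token timeout in ms: lower = faster failure detection,
    # but more risk of declaring a healthy-but-busy node dead
    token: 1000
}

(1000 is just an illustrative number; how low you can safely go depends on
your network and load.)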

Regards,
Klaus

>
> Regards, Maxim
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


  

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Maxim

12.02.2018 16:15, Klaus Wenninger пишет:

On 02/12/2018 01:02 PM, Maxim  wrote:

> fencing-disabled is probably due to it being a test-setup ... RHEL 6
> pcs being made for configuring a cman-pacemaker-setup I'm not sure if
> it is advisable to do a setup for a corosync-2 pacemaker setup with
> that. You've obviously edited corosync.conf to reflect that ...
It is ok; fencing is not required at this time.
It works well with the latest stable corosync and pacemaker, which were built 
manually (not from the RHEL 6 repos).
And the attached config was generated by this pcs (I've removed the 
'logging' section from it to reduce the message size).





>>
>> All is ok mainly. But there is some problem of the cluster
>> activity performance when the master node is powered off (hard):
>> the slave node detects that the master one is down after about
>> 100-3500 ms. And the main question is how to avoid this 3 sec delay
>> that occurred sometimes.
>
> Kind of interesting that you ever get a detection below 2000ms with
> the token-timeout set to that value. (Given you are doing a
> hard-shutdown that doesn't give corosync time to sign off.) You've
> derived these times from the corosync-logs!?
>
> Regards, Klaus
>
Not actually. After your message I've conducted some more investigations 
with quite active logging on the master node to get the real time when the 
node is going down. And... you are right. The delay is close to 4 
seconds. So there is a [floating] bug in my script.

Thank you for your insight, Klaus =)

But nevertheless, is there any mechanism to force the slave corosync "to 
think" that the master corosync is down?
[I have seen the abilities of corosync-cfgtool but it seems it doesn't 
contain similar functionality]

Or maybe there are some other ways?

Regards, Maxim


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] What does these logs mean in corosync.log

2018-02-12 Thread lkxjtu
These logs are both printed when the system is abnormal; I am very confused about what 
they mean. Does anyone know what they mean? Thank you very much.
corosync version   2.4.0
pacemaker version  1.1.16

1)
Feb 01 10:57:58 [18927] paas-controller-192-167-0-2   crmd:  warning: 
find_xml_node:Could not find parameters in resource-agent.

2)
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf

3)
Feb 11 22:57:17 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
Feb 11 22:57:25 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Klaus Wenninger
On 02/12/2018 01:02 PM, Maxim wrote:
> Hello,
>
> [Sorry for a message duplication. Web mail client ruined the
> formatting of the previous e-mail =( ]
>
> There is a simple configuration of two cluster nodes (built via RHEL 6
> pcs interface) with multiple master/slave resources, disabled fencing
> and the single sync interface.

fencing-disabled is probably due to it being a test setup ...
RHEL 6 pcs is made for configuring a cman/pacemaker setup, so
I'm not sure it is advisable to set up a corosync-2 pacemaker
cluster with it. You've obviously edited corosync.conf to
reflect that ...
 
>
> All is ok mainly. But there is some problem of the cluster activity
> performance when the master node is powered off (hard): the slave node
> detects that the master one is down after about 100-3500 ms. And the
> main question is how to avoid this 3 sec delay that occurred sometimes.

Kind of interesting that you ever get a detection below 2000ms with the
token-timeout set to that value. (Given you are doing a hard-shutdown
that doesn't give corosync time to sign off.)
You've derived these times from the corosync-logs!?

Regards,
Klaus

>
> On the slave node i have a little script that checks the connection to
> the master node. It detects a problem of a sync breakage within about
> 100 ms. But corosync requires a much more time sometimes to figure out
> the situation and mark the master node as offline one. It shows 'ok'
> ring status.
>
> If i understand correctly then
> 1 the pacemaker actions (crm_resource --move) will not perform until
> corosync is not refreshed its ring state
> 2 the detection of a problem (from a corosync side) can be speeded up
> via timeout tuning in the corosync.conf
> 3 there is no way to ask corosync to recheck its ring status or mark a
> ring as failed manually
>
> But maybe i'm missing something.
>
> All i want is to move resources faster.
> In my little script i tried to force the cluster software to move
> resources to the slave node. But i've no success so far.
>
> Could you please share your thoughts about the situation.
> Thank you in advance.
>
>
> Cluster software:
> corosync - 2.4.3
> pacemaker - 1.1.18
> libqb - 1.0.2
>
>
> corosync.conf:
> totem {
>   version: 2
>   secauth: off
>   cluster_name: cluster
>   transport: udpu
>   token: 2000
> }
>
> nodelist {
>  node {
>  ring0_addr: main-node
>  nodeid: 1
>     }
>
>  node {
>  ring0_addr: reserve-node
>  nodeid: 2
>  }
> }
>
> quorum {
>  provider: corosync_votequorum
>  two_node: 1
> }
>
>
> Regards,
> Maxim.
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Does CMAN Still Not Support Multipe CoroSync Rings?

2018-02-12 Thread Eric Robinson
General question. I tried to set up a cman + corosync + pacemaker cluster using 
two corosync rings. When I start the cluster, everything works fine, except 
when I do a 'corosync-cfgtool -s' it only shows one ring. I tried manually 
editing the /etc/cluster/cluster.conf file adding two  sections, but 
then cman complained that I didn't have a multicast address specified, even 
though I did. I tried editing the /etc/corosdync/corosync.conf file, and then I 
could get two rings, but the nodes would not both join the cluster. Bah! I did 
some reading and saw that cman didn't support multiple rings years ago. Did it 
never get updated?

[sig]

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread Maxim

Hello,

[Sorry for a message duplication. Web mail client ruined the formatting 
of the previous e-mail =( ]


There is a simple configuration of two cluster nodes (built via RHEL 6 
pcs interface) with multiple master/slave resources, disabled fencing 
and the single sync interface.


All is mainly ok. But there is a problem with the cluster's responsiveness 
when the master node is powered off (hard): the slave node detects that the 
master is down only after about 100-3500 ms. And the main question is how to 
avoid this ~3 s delay that occurs sometimes.


On the slave node I have a little script that checks the connection to 
the master node. It detects a sync breakage within about 100 ms. But corosync 
sometimes requires much more time to figure out the situation and mark the 
master node as offline. It still shows an 'ok' ring status.


If I understand correctly, then
1. the pacemaker actions (crm_resource --move) will not be performed until 
corosync has refreshed its ring state
2. the detection of a problem (on the corosync side) can be sped up 
via timeout tuning in corosync.conf
3. there is no way to ask corosync to recheck its ring status or to mark a 
ring as failed manually


But maybe I'm missing something.

All I want is to move resources faster.
In my little script I tried to force the cluster software to move 
resources to the slave node. But I've had no success so far.


Could you please share your thoughts about the situation.
Thank you in advance.


Cluster software:
corosync - 2.4.3
pacemaker - 1.1.18
libqb - 1.0.2


corosync.conf:
totem {
  version: 2
  secauth: off
  cluster_name: cluster
  transport: udpu
  token: 2000
}

nodelist {
 node {
 ring0_addr: main-node
 nodeid: 1
}

 node {
 ring0_addr: reserve-node
 nodeid: 2
 }
}

quorum {
 provider: corosync_votequorum
 two_node: 1
}


Regards,
Maxim.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Error when linking to libqb in shared library

2018-02-12 Thread Kristoffer Grönlund
Jan Pokorný  writes:

> I guess you are linking your python extension with one of the
> pacemaker libraries (directly on indirectly to libcrmcommon), and in
> that case, you need to rebuild pacemaker with the patched libqb[*] for
> the whole arrangement to work.  Likewise in that case, as you may be
> aware, the "API" is quite uncommitted at this point, stability hasn't
> been of importance so far (because of the handles into pacemaker being
> mostly abstracted through built-in CLI tools for the outside players
> so far, which I agree is encumbered with tedious round-trips, etc.).
> There's a huge debt in this area, so some discretion and perhaps
> feedback which functions are indeed proper-API-worth is advised.

The ultimate goal of my project is indeed to be able to propose or begin
a discussion around a stable API for Pacemaker to eventually move away
from command-line tools as the only way to interact with the cluster.

Thank you, I'll investigate the proposed changes.

Cheers,
Kristoffer

>
> [*]
> shortcut 1: just recompile pacemaker with those extra
> /usr/include/qb/qblog.h modifications as of the
>   referenced commit)
> shortcut 2: if the above can be tolerated widely, this is certainly
> for local development only: recompile pacemaker with
>   CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
>
> Hope this helps.
>
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
// Kristoffer Grönlund
// kgronl...@suse.com
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Error when linking to libqb in shared library

2018-02-12 Thread Jan Pokorný
[let's move this to developers list]

On 12/02/18 07:22 +0100, Kristoffer Grönlund wrote:
> (and especially the libqb developers)
> 
> I started hacking on a python library written in C which links to
> pacemaker, and so to libqb as well, but I'm encountering a strange
> problem which I don't know how to solve.
> 
> When I try to import the library in python, I see this error:
> 
> --- command ---
> PYTHONPATH='/home/krig/projects/work/libpacemakerclient/build/python' 
> /usr/bin/python3 
> /home/krig/projects/python-pacemaker/build/../python/clienttest.py
> --- stderr ---
> python3: utils.c:66: common: Assertion `"implicit callsite section is 
> observable, otherwise target's and/or libqb's build is at fault, preventing 
> reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
> ---
> 
> This appears to be coming from the following libqb macro:
> 
> https://github.com/ClusterLabs/libqb/blob/master/include/qb/qblog.h#L352
> 
> There is a long comment above the macro which if nothing else tells me
> that I'm not the first person to have issues with it, but it doesn't
> really tell me what I'm doing wrong...
> 
> Does anyone know what the issue is, and if so, what I could do to
> resolve it?

Something similar has been reported already:
https://github.com/ClusterLabs/libqb/pull/266#issuecomment-356855212

and the fix is proposed:
https://github.com/ClusterLabs/libqb/pull/288/commits/f9f180cdbcb189b6590e541502b1de658c81005e
https://github.com/ClusterLabs/libqb/pull/288

But the suitability depends on particular usecase.

I guess you are linking your python extension with one of the
pacemaker libraries (directly or indirectly to libcrmcommon), and in
that case, you need to rebuild pacemaker with the patched libqb[*] for
the whole arrangement to work.  Likewise in that case, as you may be
aware, the "API" is quite uncommitted at this point; stability hasn't
been of importance so far (because the handles into pacemaker have been
mostly abstracted through built-in CLI tools for the outside players
so far, which I agree is encumbered with tedious round-trips, etc.).
There's a huge debt in this area, so some discretion, and perhaps
feedback on which functions are indeed proper-API-worthy, is advised.

[*]
shortcut 1: just recompile pacemaker with those extra
/usr/include/qb/qblog.h modifications as of the
referenced commit)
shortcut 2: if the above can be tolerated widely, this is certainly
for local development only: recompile pacemaker with
CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
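
For shortcut 2, a rough sketch of such a local rebuild (the configure options
and install prefix are illustrative; adjust to your distro's packaging):

  # rebuild pacemaker against the system libqb with the section workaround
  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker
  ./autogen.sh
  ./configure CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
  make && sudo make install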

Hope this helps.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

2018-02-12 Thread называется как хочется

Hello

There is a simple configuration of two cluster nodes (built via RHEL 6 pcs
interface) with multiple master/slave resources, disabled fencing and the single
sync interface.

All is mainly ok. But there is a problem with the cluster's responsiveness
when the master node is powered off (hard): the slave node detects that the
master is down only after about 100-3500 ms. And the main question is how to
avoid this ~3 s delay that occurs sometimes.

On the slave node I have a little script that checks the connection to the
master node. It detects a sync breakage within about 100 ms. But corosync
sometimes requires much more time to figure out the situation and mark the
master node as offline. It still shows an 'ok' ring status.

If I understand correctly, then
1. the pacemaker actions (crm_resource --move) will not be performed until
corosync has refreshed its ring state
2. the detection of a problem (on the corosync side) can be sped up via
timeout tuning in corosync.conf
3. there is no way to ask corosync to recheck its ring status or to mark a
ring as failed manually

But maybe I'm missing something.
All I want is to move resources faster. In my little script I tried to force
the cluster software to move resources to the slave node. But I've had no
success so far.

Could you please share your thoughts about the situation. Thank you in advance.

Cluster software:
corosync - 2.4.3
pacemaker - 1.1.18
libqb - 1.0.2

corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: cluster
    transport: udpu
    token: 2000
}

nodelist {
    node {
        ring0_addr: main-node
        nodeid: 1
    }

    node {
        ring0_addr: reserve-node
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}


Regards, Maxim.
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-12 Thread Dileep V Nair

Thanks Ondrej for the response. I also figured out the same, reduced the
HADR_TIMEOUT, and increased the promote timeout, which helped resolve
the issue.
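
For anyone finding this thread later: HADR_TIMEOUT is a DB2 database
configuration parameter and has to match on primary and standby, so the change
looks roughly like this (database name and value are placeholders, and the new
value typically only takes effect after HADR is restarted):

  db2 update db cfg for SAMPLE using HADR_TIMEOUT 60
  db2 get db cfg for SAMPLE | grep -i HADR_TIMEOUT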



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India


From:   Ondrej Famera 
To: Dileep V Nair 
Cc: Cluster Labs - All topics related to open-source clustering
welcomed 
Date:   02/12/2018 11:46 AM
Subject:Re: [ClusterLabs] Issues with DB2 HADR Resource Agent



On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000, which
> I guess is a reasonable value. What I am noticing is that it does not wait
> for the PEER_WINDOW. Before that the DB already goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker gives an error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
>
> Regards,
>
> *Dileep V Nair*

Hi Dileep,

sorry for the late response. The DB2 should not get into the
'REMOTE_CATCHUP' phase, or the DB2 resource agent will indeed not
promote. From my experience it usually gets into that state when the DB2
on the standby was restarted during or after the PEER_WINDOW timeout.

When the primary DB2 fails, the standby should end up in a state that
matches the one on line 770 of the DB2 resource agent, and the promote
operation is attempted.

  770  STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770


The DB2 on the standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher
if this was the case.

So if you see that DB2 was restarted after the primary failed, increase the
promote timeout. If DB2 was not restarted, then the question is why DB2
decided to change the status in this way.
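
As a sketch, with pcs that would be something along the lines of (the resource
name db2_HADR is hypothetical, use your actual DB2 primitive's id; crmsh users
would adjust the promote op via 'crm configure edit' instead):

  # raise the promote operation timeout on the DB2 resource
  pcs resource update db2_HADR op promote timeout=900s
  pcs resource show db2_HADR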

Let me know if above helped.

--
Ondrej Faměra
@Red Hat



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org