Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-15 Thread Ken Gaillot
On Tue, 2017-08-15 at 08:42 +0200, Jan Friesse wrote:
> Ken Gaillot wrote:
> > On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> >> On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> In Pacemaker-1.1.17, an attribute updated while pacemaker is starting is
> >>> not displayed in crm_mon.
> >>> In Pacemaker-1.1.16 it is displayed, so the two versions give different results.
> >>>
> >>> https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> >>> This commit is the cause, but is the following result (3.) the expected
> >>> behavior?
> >>
> >> This turned out to be an odd one. The sequence of events is:
> >>
> >> 1. When the node leaves the cluster, the DC (correctly) wipes all its
> >> transient attributes from attrd and the CIB.
> >>
> >> 2. Pacemaker is newly started on the node, and a transient attribute is
> >> set before the node joins the cluster.
> >>
> >> 3. The node joins the cluster, and its transient attributes (including
> >> the new value) are sync'ed with the rest of the cluster, in both attrd
> >> and the CIB. So far, so good.
> >>
> >> 4. Because this is the node's first join since its crmd started, its
> >> crmd wipes all of its transient attributes again. The idea is that the
> >> node may have restarted so quickly that the DC hasn't yet done it (step
> >> 1 here), so clear them now to avoid any problems with old values.
> >> However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> >
> > Whoops, clarification: the node may have restarted so quickly that
> > corosync didn't notice it left, so the DC would never have gotten the
> 
> Corosync always notices when a node leaves, no matter whether the node is
> gone longer than the token timeout or returns within it.

Looking back at the original commit, it has a comment "OpenAIS has a
nasty habit of not being able to tell if a node is returning or didn't
leave in the first place", so it looks like it's only relevant on legacy
stacks.

> 
> > "peer lost" message that triggers wiping its transient attributes.
> >
> > I suspect the crmd wipes only the CIB in this case because we assumed
> > attrd would be empty at this point -- missing exactly this case where a
> > value was set between start-up and first join.
> >
> >> 5. With the older pacemaker version, both the joining node and the DC
> >> would request a full write-out of all values from attrd. Because step 4
> >> only wiped the CIB, this ends up restoring the new value. With the newer
> >> pacemaker version, this step is no longer done, so the value winds up
> >> staying in attrd but not in CIB (until the next write-out naturally
> >> occurs).
> >>
> >> I don't have a solution yet, but step 4 is clearly the problem (rather
> >> than the new code that skips step 5, which is still a good idea
> >> performance-wise). I'll keep working on it.
> >>
> >>> [test case]
> >>> 1. Start pacemaker on two nodes at the same time and update the attribute 
> >>> during startup.
> >>> In this case, the attribute is displayed in crm_mon.
> >>>
> >>> [root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; attrd_updater -n KEY -U V-1' ; \
> >>> ssh -f node3 'systemctl start pacemaker ; attrd_updater -n KEY -U V-3'
> >>> [root@node1 ~]# crm_mon -QA1
> >>> Stack: corosync
> >>> Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> >>>
> >>> 2 nodes configured
> >>> 0 resources configured
> >>>
> >>> Online: [ node1 node3 ]
> >>>
> >>> No active resources
> >>>
> >>>
> >>> Node Attributes:
> >>> * Node node1:
> >>> + KEY   : V-1
> >>> * Node node3:
> >>> + KEY   : V-3
> >>>
> >>>
> >>> 2. Restart pacemaker on node1, and update the attribute during startup.
> >>>
> >>> [root@node1 ~]# systemctl stop pacemaker
> >>> [root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U V-10
> >>>
> >>>
> >>> 3. The attribute is registered in attrd but it is not registered in CIB,
> >>> so the updated attribute is not displayed in crm_mon.
> >>>
> >>> [root@node1 ~]# attrd_updater -Q -n KEY -A
> >>> name="KEY" host="node3" value="V-3"
> >>> name="KEY" host="node1" value="V-10"
> >>>
> >>> [root@node1 ~]# crm_mon -QA1
> >>> Stack: corosync
> >>> Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> >>>
> >>> 2 nodes configured
> >>> 0 resources configured
> >>>
> >>> Online: [ node1 node3 ]
> >>>
> >>> No active resources
> >>>
> >>>
> >>> Node Attributes:
> >>> * Node node1:
> >>> * Node node3:
> >>> + KEY   : V-3
> >>>
> >>>
> >>> Best Regards
> >>>

Re: [ClusterLabs] Antw: Re: Notification agent and Notification recipients

2017-08-15 Thread Sriram
Thanks for clarifying.

Regards,
Sriram.

On Mon, Aug 14, 2017 at 7:34 PM, Klaus Wenninger wrote:

> On 08/14/2017 03:19 PM, Sriram wrote:
>
> Yes, I had precreated the script file with the required permission.
>
> [root@node1 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4140 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@node2 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@node3 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
>
> Later I observed that the user "hacluster" is not able to create the log file
> at /usr/share/pacemaker/alert_file.log.
> I am sorry, I should have noticed this in the log before posting the
> query. After I changed the path to /tmp/alert_file.log, the file is created
> now.
> Thanks for pointing it out.
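>
> (I assume pre-creating the log file with ownership that lets hacluster
> write to it would also have worked, e.g.:
>
> [root@node1 ~]# touch /usr/share/pacemaker/alert_file.log
> [root@node1 ~]# chown hacluster:haclient /usr/share/pacemaker/alert_file.log
>
> but I have not verified that.)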
>
> I have one more clarification,
>
> if the resource is running in node2,
> [root@node2 tmp]# pcs resource
>  TRR (ocf::heartbeat:TimingRedundancyRA): Started node2
>
> And I executed the below command to make it standby.
> [root@node2 tmp] # pcs node standby node2
>
> Resource shifted to node3, because of higher location constraint.
> [root@node2 tmp]# pcs resource
>  TRR (ocf::heartbeat:TimingRedundancyRA): Started node3
>
>
> I got the log file created under node2(resource stopped) and
> node3(resource started).
>
> Node1 was not notified about the resource shift; I mean, no log file was
> created there.
> I suppose that is because alerts are designed to notify external agents
> about cluster events, not to provide internal notifications.
>
> Is my understanding correct?
>
>
> Quite simple: the crmd of node1 just didn't have anything to do with
> shifting the resource from node2 -> node3. There is no additional
> information passed between the nodes just to create a full set of
> notifications on every node. If you want a full log (or whatever your
> alert-agent produces) in one place, that is up to your alert-agent.
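>
> One common pattern for that (just a suggestion, nothing pacemaker does
> for you): have the agent send each event to syslog via logger and
> aggregate the logs centrally, e.g.:
>
> logger -t pacemaker-alert \
>   "node=$CRM_alert_node kind=$CRM_alert_kind desc=$CRM_alert_desc"
>
> (Some CRM_alert_* variables are only set for certain alert kinds.)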
>
>
> Regards,
> Klaus
>
>
> Regards,
> Sriram.
>
>
>
> On Mon, Aug 14, 2017 at 5:42 PM, Klaus Wenninger wrote:
>
>> On 08/14/2017 12:32 PM, Sriram wrote:
>>
>> Hi Ken,
>>
>> I used the alerts as well, but they don't seem to be working.
>>
>> Please check the below configuration
>> [root@node1 alerts]# pcs config show
>> Cluster Name:
>> Corosync Nodes:
>> Pacemaker Nodes:
>>  node1 node2 node3
>>
>> Resources:
>>  Resource: TRR (class=ocf provider=heartbeat type=TimingRedundancyRA)
>>   Operations: start interval=0s timeout=60s (TRR-start-interval-0s)
>>   stop interval=0s timeout=20s (TRR-stop-interval-0s)
>>   monitor interval=10 timeout=20 (TRR-monitor-interval-10)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>>   Resource: TRR
>> Enabled on: node1 (score:100) (id:location-TRR-node1-100)
>> Enabled on: node2 (score:200) (id:location-TRR-node2-200)
>> Enabled on: node3 (score:300) (id:location-TRR-node3-300)
>> Ordering Constraints:
>> Colocation Constraints:
>> Ticket Constraints:
>>
>> Alerts:
>>  Alert: alert_file (path=/usr/share/pacemaker/alert_file.sh)
>>   Options: debug_exec_order=false
>>   Meta options: timeout=15s
>>   Recipients:
>>Recipient: recipient_alert_file_id (value=/usr/share/pacemaker/alert_file.log)
>>
>>
>> Did you pre-create the file with proper rights? Be aware that the
>> alert-agent
>> is called as user hacluster.
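>>
>> For reference, a minimal sketch of such an agent (the CRM_alert_* names
>> are the standard environment variables pacemaker exports to alert
>> agents; this is only a sketch with error handling omitted, not the
>> sample agent shipped with pacemaker):
>>
>> #!/bin/sh
>> # Append one line per cluster event to the recipient path.
>> # Runs as user "hacluster", so that user needs write access to the file.
>> logfile="${CRM_alert_recipient:-/tmp/alert_file.log}"
>> ts="${CRM_alert_timestamp:-$(date)}"
>> case "$CRM_alert_kind" in
>>   node)     echo "$ts node: $CRM_alert_node is now $CRM_alert_desc" >>"$logfile" ;;
>>   resource) echo "$ts resource: $CRM_alert_task $CRM_alert_rsc on $CRM_alert_node: $CRM_alert_desc" >>"$logfile" ;;
>>   fencing)  echo "$ts fencing: $CRM_alert_desc" >>"$logfile" ;;
>>   *)        echo "$ts unhandled alert kind: $CRM_alert_kind" >>"$logfile" ;;
>> esac
>> exit 0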
>>
>>
>> Resources Defaults:
>>  resource-stickiness: INFINITY
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  dc-version: 1.1.15-11.el7_3.4-e174ec8
>>  default-action-timeout: 240
>>  have-watchdog: false
>>  no-quorum-policy: ignore
>>  placement-strategy: balanced
>>  stonith-enabled: false
>>  symmetric-cluster: false
>>
>> Quorum:
>>   Options:
>>
>>
>> /usr/share/pacemaker/alert_file.sh does not get called when I
>> trigger a failover scenario.
>> Please let me know if I'm missing anything.
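>>
>> For completeness, I created the alert with commands along these lines
>> (reconstructed from the config above; the exact recipient syntax may
>> differ between pcs versions):
>>
>> pcs alert create path=/usr/share/pacemaker/alert_file.sh id=alert_file \
>>     options debug_exec_order=false meta timeout=15s
>> pcs alert recipient add alert_file value=/usr/share/pacemaker/alert_file.log \
>>     id=recipient_alert_file_id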
>>
>>
>> Do you get any logs - like for startup of resources - or nothing at all?
>>
>> Regards,
>> Klaus
>>
>>
>>
>>
>> Regards,
>> Sriram.
>>
>> On Tue, Aug 8, 2017 at 8:29 PM, Ken Gaillot wrote:
>>
>>> On Tue, 2017-08-08 at 17:40 +0530, Sriram wrote:
>>> > Hi Ulrich,
>>> >
>>> >
>>> > Please see inline.
>>> >
>>> > On Tue, Aug 8, 2017 at 2:01 PM, Ulrich Windl wrote:
>>> > >>> Sriram wrote on 08.08.2017 at 09:30 in message
>>> > 

Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-15 Thread Jan Friesse

Ken Gaillot wrote:

On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:

On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:

Hi,

In Pacemaker-1.1.17, an attribute updated while pacemaker is starting is not
displayed in crm_mon.
In Pacemaker-1.1.16 it is displayed, so the two versions give different results.

https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
This commit is the cause, but is the following result (3.) the expected behavior?


This turned out to be an odd one. The sequence of events is:

1. When the node leaves the cluster, the DC (correctly) wipes all its
transient attributes from attrd and the CIB.

2. Pacemaker is newly started on the node, and a transient attribute is
set before the node joins the cluster.

3. The node joins the cluster, and its transient attributes (including
the new value) are sync'ed with the rest of the cluster, in both attrd
and the CIB. So far, so good.

4. Because this is the node's first join since its crmd started, its
crmd wipes all of its transient attributes again. The idea is that the
node may have restarted so quickly that the DC hasn't yet done it (step
1 here), so clear them now to avoid any problems with old values.
However, the crmd wipes only the CIB -- not attrd (arguably a bug).


Whoops, clarification: the node may have restarted so quickly that
corosync didn't notice it left, so the DC would never have gotten the


Corosync always notices when a node leaves, no matter whether the node is
gone longer than the token timeout or returns within it.



"peer lost" message that triggers wiping its transient attributes.

I suspect the crmd wipes only the CIB in this case because we assumed
attrd would be empty at this point -- missing exactly this case where a
value was set between start-up and first join.


5. With the older pacemaker version, both the joining node and the DC
would request a full write-out of all values from attrd. Because step 4
only wiped the CIB, this ends up restoring the new value. With the newer
pacemaker version, this step is no longer done, so the value winds up
staying in attrd but not in CIB (until the next write-out naturally
occurs).

I don't have a solution yet, but step 4 is clearly the problem (rather
than the new code that skips step 5, which is still a good idea
performance-wise). I'll keep working on it.


[test case]
1. Start pacemaker on two nodes at the same time and update the attribute 
during startup.
In this case, the attribute is displayed in crm_mon.

[root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; attrd_updater -n KEY -U V-1' ; \
ssh -f node3 'systemctl start pacemaker ; attrd_updater -n KEY -U V-3'
[root@node1 ~]# crm_mon -QA1
Stack: corosync
Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum

2 nodes configured
0 resources configured

Online: [ node1 node3 ]

No active resources


Node Attributes:
* Node node1:
+ KEY   : V-1
* Node node3:
+ KEY   : V-3


2. Restart pacemaker on node1, and update the attribute during startup.

[root@node1 ~]# systemctl stop pacemaker
[root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U V-10


3. The attribute is registered in attrd but it is not registered in CIB,
so the updated attribute is not displayed in crm_mon.

[root@node1 ~]# attrd_updater -Q -n KEY -A
name="KEY" host="node3" value="V-3"
name="KEY" host="node1" value="V-10"

[root@node1 ~]# crm_mon -QA1
Stack: corosync
Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum

2 nodes configured
0 resources configured

Online: [ node1 node3 ]

No active resources


Node Attributes:
* Node node1:
* Node node3:
+ KEY   : V-3


Best Regards

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org