On 05/03/2016 05:30 PM, Jehan-Guillaume de Rorthais wrote:
> On Tue, 3 May 2016 21:10:12 +0200,
> Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
>
>> On Mon, 2 May 2016 17:59:55 -0500,
>> Ken Gaillot <kgail...@redhat.com> wrote:
>>
>>> On 04/28/2016 04:47 AM, Jehan-Guillaume de Rorthais wrote:
>>>> Hello all,
>>>>
>>>> While testing and experimenting with our RA for PostgreSQL, I found the
>>>> meta_notify_active_* variables seem to be always empty. Here is an example
>>>> of these variables as they are seen from our RA during a
>>>> migration/switchover:
>>>>
>>>>
>>>> {
>>>>     'type' => 'pre',
>>>>     'operation' => 'demote',
>>>>     'active' => [],
>>>>     'inactive' => [],
>>>>     'start' => [],
>>>>     'stop' => [],
>>>>     'demote' => [
>>>>         {
>>>>             'rsc' => 'pgsqld:1',
>>>>             'uname' => 'hanode1'
>>>>         }
>>>>     ],
>>>>
>>>>     'master' => [
>>>>         {
>>>>             'rsc' => 'pgsqld:1',
>>>>             'uname' => 'hanode1'
>>>>         }
>>>>     ],
>>>>
>>>>     'promote' => [
>>>>         {
>>>>             'rsc' => 'pgsqld:0',
>>>>             'uname' => 'hanode3'
>>>>         }
>>>>     ],
>>>>     'slave' => [
>>>>         {
>>>>             'rsc' => 'pgsqld:0',
>>>>             'uname' => 'hanode3'
>>>>         },
>>>>         {
>>>>             'rsc' => 'pgsqld:2',
>>>>             'uname' => 'hanode2'
>>>>         }
>>>>     ],
>>>>
>>>> }
>>>>
>>>> In case this comes from our side, here is the code building this:
>>>>
>>>>
>>>> https://github.com/dalibo/PAF/blob/6e86284bc647ef1e81f01f047f1862e40ba62906/lib/OCF_Functions.pm#L444
>>>>
>>>> But looking at the variable itself in debug logs, I always find it empty,
>>>> in various situations (switchover, recover, failover).
>>>>
>>>> If I understand the documentation correctly, I would expect 'active' to
>>>> list all three resources, shouldn't it? Currently, to bypass this, we
>>>> consider: active == master + slave
>>>
>>> You're right, it should. The pacemaker code that generates the "active"
>>> variables is the same used for "demote" etc., so it seems unlikely the
>>> issue is on pacemaker's side. Especially since your code treats active
>>> etc. differently from demote etc., it seems like it must be in there
>>> somewhere, but I don't see where.
>>
>> The code treats active, inactive, start and stop all together, for any
>> cloned resource. If the resource is a multistate, it adds promote, demote,
>> slave and master.
>>
>> Note that from this piece of code, the 7 other notify vars are set
>> correctly: start, stop, inactive, promote, demote, slave, master. Only
>> active is always missing.
>>
>> I'll investigate and try to find where the bug is hiding.
>
> So I added a piece of code to dump **all** the environment variables to a
> temp file as early as possible **to avoid any interaction with our perl
> module** in the code of the RA, i.e.:
>
>   BEGIN {
>     use Time::HiRes qw(time);
>     my $now = time;
>     # One dump file per invocation, named after the current timestamp.
>     open my $fh, ">", "/tmp/test-$now.env.txt"
>       or die "cannot open dump file: $!";
>     printf($fh "%-20s = ''%s''\n", $_, $ENV{$_}) foreach sort keys %ENV;
>     close $fh;
>   }
>
> Then I started my cluster and set maintenance-mode=false while no resources
> were running. So the debug files contain the probe action, start on all
> nodes, one promote on the master and the first monitors. The "*active"
> variables are always empty anywhere in the cluster. Please find attached the
> result of the following command on the master node:
>
>   for i in test-*; do echo "===== $i ====="; grep OCF_ "$i"; done
>
> debug-env.txt
>
> I'm using Pacemaker 1.1.13-10.el7_2.2-44eb2dd under CentOS 7.2.1511.
>
> For completeness, I added the Pacemaker configuration I use for my 3 node
> dev/test cluster.
>
> Let me know if you think of further investigations and tests I could run on
> this issue. I'm out of ideas for tonight (and I really would prefer having
> this bug on my side).

From your environment dumps, what I think is happening is that you are
getting multiple notifications (start, pre-promote, post-promote) in a
single cluster transition. So the variables reflect the initial state of
that transition: none of the instances are active, all three are being
started (so the nodes are in the "*_start_*" variables), and one is being
promoted.

The starts will be done before the promote. If one of the starts fails,
the transition will be aborted, and a new one will be calculated. So, if
you get to the promote, you can assume anything in "*_start_*" is now
active.

> On a side note, I noticed with these debug files that the notify
> variables were also available outside of notify actions (start and notify
> here). Are they always available during "transition actions" (start, stop,
> promote, demote)? Looking at the mysql RA, they are using
> OCF_RESKEY_CRM_meta_notify_master_uname during the start action. So I
> suppose it's safe?

Good question, I've never tried that before. I'm reluctant to say it's
guaranteed; it's possible that seeing them in the start action is a side
effect of the current implementation and could theoretically change in
the future. But if mysql is relying on it, I suppose it's well-established
already, making changing it unlikely ...
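
For illustration only, the "anything in *_start_* is now active" reasoning could be sketched in shell roughly as follows. The function name effective_active_unames is hypothetical and not part of Pacemaker or of any existing RA. Subtracting the "*_stop_*" list is my own assumption on top of what is said above, and the sketch assumes the uname variables hold space-separated node names.

```shell
# Hypothetical helper (not part of Pacemaker or any shipped RA): print the
# nodes that can be treated as active once the promote notification is
# reached within a single transition.
# Assumption: the *_uname variables are space-separated lists of node names.
# Assumption: instances listed in *_stop_* are no longer active.
# The body runs in a subshell so its variables do not leak into the caller.
effective_active_unames() (
    active="${OCF_RESKEY_CRM_meta_notify_active_uname:-}"
    started="${OCF_RESKEY_CRM_meta_notify_start_uname:-}"
    stopped="${OCF_RESKEY_CRM_meta_notify_stop_uname:-}"

    result=""
    for node in $active $started; do
        # Skip nodes whose instance is stopped in this transition.
        case " $stopped " in *" $node "*) continue ;; esac
        # Skip nodes already collected, so the output has no duplicates.
        case " $result " in *" $node "*) continue ;; esac
        result="${result:+$result }$node"
    done
    printf '%s\n' "$result"
)
```

With an empty "active" list and all three nodes in the "start" list, as in the dumps discussed above, calling effective_active_unames prints the three node names on one line.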