Re: [ClusterLabs] Problem with stonith and starting services

2017-07-14 Thread Cesar Hernandez

> 
> 
> So if this is really the reason it would probably be worth
> finding out what is really happening.
> 
Thanks. Yes, I think this is really the reason. I fixed it one week ago and it
hasn't happened again.


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with stonith and starting services

2017-07-14 Thread Klaus Wenninger
On 07/12/2017 05:16 PM, Cesar Hernandez wrote:
>
>> On 6 Jul 2017, at 17:34, Ken Gaillot wrote:
>>
>> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
 It looks like a bug when the fenced node rejoins quickly enough that it
 is a member again before its fencing confirmation has been sent. I know
 there have been plenty of clusters with nodes that quickly reboot and
 slow fencing devices, so that seems unlikely, but I don't see another
 explanation.

>>> Could it be caused by node 2 rebooting and coming back up before the stonith
>>> script has finished?
>> That *shouldn't* cause any problems, but I'm not sure what's happening
>> in this case.
>
> So, this was the cause of the problem...
> Before the two servers I have now, I had done three other cluster installations
> with a different internet hosting provider. With that provider, a machine took
> more than 2 minutes to reboot via the fencing script (slow boot process and a
> slow remote API to respond), so I added a "sleep 90" before the end of the
> script and it always worked perfectly.
>
> Now, with a different provider, I used the same script, just swapping in the
> new provider's API. In this case a machine takes approximately 10 seconds to do
> a full reboot, and the API is also faster (just 2 or 3 seconds to respond), so
> the machine was up again in less than 20 seconds.
>
> I suppose the problem comes when the node that has been rebooted (node2, for
> example) sees that node1 is still waiting for the fencing script to finish
> (because of the sleep 90); it just gets confused and Pacemaker exits.
>
> I changed that sleep 90 to a sleep 5 and it hasn't happened again.
I guess Pacemaker should be able to cope with a situation like that.

With SBD fencing (e.g. fence_sbd) you would actually have quite a similar case:
the fence agent writes the poison pill into the disk slot of the node to be
fenced, and the victim node usually reads it within a second. But because the
guaranteed response times of a shared disk in an enterprise environment can be
very long, the fence agent still waits for 60 s or so to be really sure that
the other side has swallowed the pill.

So if this is really the reason it would probably be worth
finding out what is really happening.

Regards,
Klaus

>
> Thanks a lot to everyone for the help
>
> Cheers
> Cesar
>
>
>


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-12 Thread Cesar Hernandez


> On 6 Jul 2017, at 17:34, Ken Gaillot wrote:
> 
> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>> 
>>> 
>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>> is a member again before its fencing confirmation has been sent. I know
>>> there have been plenty of clusters with nodes that quickly reboot and
>>> slow fencing devices, so that seems unlikely, but I don't see another
>>> explanation.
>>> 
>> 
>> Could it be caused by node 2 rebooting and coming back up before the stonith
>> script has finished?
> 
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.


So, this was the cause of the problem...
Before the two servers I have now, I had done three other cluster installations
with a different internet hosting provider. With that provider, a machine took
more than 2 minutes to reboot via the fencing script (slow boot process and a
slow remote API to respond), so I added a "sleep 90" before the end of the
script and it always worked perfectly.

Now, with a different provider, I used the same script, just swapping in the
new provider's API. In this case a machine takes approximately 10 seconds to do
a full reboot, and the API is also faster (just 2 or 3 seconds to respond), so
the machine was up again in less than 20 seconds.

I suppose the problem comes when the node that has been rebooted (node2, for
example) sees that node1 is still waiting for the fencing script to finish
(because of the sleep 90); it just gets confused and Pacemaker exits.

I changed that sleep 90 to a sleep 5 and it hasn't happened again.
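
For illustration, a minimal sketch of a fence script along those lines, with the
ssh reboot, the provider-API reboot and the trailing sleep. The node name,
instance id and AWS CLI call here are placeholders and assumptions, not the
actual script from this thread:

#!/bin/sh
# Hypothetical reboot-style fence script of the kind described above.
NODE="node2"                        # node to fence (placeholder)
INSTANCE_ID="i-0123456789abcdef0"   # its EC2 instance id (placeholder)

# 1) try an in-band reboot first
ssh -o ConnectTimeout=5 "$NODE" reboot -f

# 2) then reboot through the provider API as well
aws ec2 reboot-instances --instance-ids "$INSTANCE_ID" || exit 1

# 3) settle time before reporting success; this is the fixed delay discussed
#    above ("sleep 90" with the slow provider, "sleep 5" with the fast one)
sleep 5
exit 0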

Thanks a lot to everyone for the help

Cheers
Cesar





Re: [ClusterLabs] Problem with stonith and starting services

2017-07-07 Thread Cesar Hernandez
>>> 
>> 
>> Could it be caused by node 2 rebooting and coming back up before the stonith
>> script has finished?
> 
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.

Maybe that is the cause... My other server installations had a slow stonith
device and slow-booting servers, so I added a "sleep 90" to the fencing script.
Now I have a fast stonith device and fast-booting servers, so node2 comes back
up before the fencing script has finished, because of that sleep.
I'm trying right now without the sleep to see whether it happens again.
Thanks
Thanks


Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Klaus Wenninger
On 07/06/2017 04:48 PM, Ken Gaillot wrote:
> On 07/06/2017 09:26 AM, Klaus Wenninger wrote:
>> On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
 If node2 is getting the notification of its own fencing, it wasn't
 successfully fenced. Successful fencing would render it incapacitated
 (powered down, or at least cut off from the network and any shared
 resources).
>>> Maybe I don't understand you, or maybe you don't understand me... ;)
>>> This is the syslog of the machine, where you can see that it has rebooted
>>> successfully, and as I said, it has rebooted successfully every time:
>> It is not just a question of whether it was rebooted at all.
>> Your fence agent must not report success until this has definitely
>> happened and the node is down.
>> Otherwise you will see that message, and the node will try to
>> somehow cope with the fact that the rest of the cluster
>> obviously thinks it is already down.
> But the "allegedly fenced" message comes in after the node has rebooted,
> so it would seem that everything was in the proper sequence.

True for this message, but we don't see whether they exchanged anything
between the fencing OK and the actual reboot.

>
> It looks like a bug when the fenced node rejoins quickly enough that it
> is a member again before its fencing confirmation has been sent. I know
> there have been plenty of clusters with nodes that quickly reboot and
> slow fencing devices, so that seems unlikely, but I don't see another
> explanation.

Anyway - maybe putting a delay at the end of the fence agent might give some
insight. In this case we have a combination of a quick reboot and a quick
fencing device ... possibly leading to nasty races ...

>
>>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys 
>>> cpuset
>>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu
>>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys 
>>> cpuacct
>>> Jul  5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 
>>> (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 
>>> SMP Debian 3.16.39-1 (2016-12-30)
>>> Jul  5 10:41:54 node2 kernel: [0.00] Command line: 
>>> BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 
>>> root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd 
>>> console=ttyS0 console=hvc0
>>> Jul  5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical 
>>> RAM map:
>>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>>> 0x-0x0009dfff] usable
>>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>>> 0x0009e000-0x0009] reserved
>>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>>> 0x000e-0x000f] reserved
>>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>>> 0x0010-0x3fff] usable
>>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>>> 0xfc00-0x] reserved
>>> Jul  5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) 
>>> protection: active
>>> Jul  5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present.
>>>
>>> ...
>>>
>>> Jul  5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 
>>> 67
>>>
>>> ...
>>>
>>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync Cluster Engine 
>>> ('UNKNOWN'): started and ready to provide service.
>>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync built-in features: 
>>> nss
>>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Successfully read main 
>>> configuration file '/etc/corosync/corosync.conf'.
>>>
>>> ...
>>>
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: Defaulting to uname -n for the 
>>> local classic openais (with plugin) node name
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: Membership 4308: quorum acquired
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
>>> node2[1108352940] - state is now member (was (null))
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
>>> node11[794540] - state is now member (was (null))
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: The local CRM is operational
>>> Jul  5 10:41:57 node2 crmd[608]:   notice: State transition S_STARTING -> 
>>> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Watching for stonith 
>>> topology changes
>>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Membership 4308: quorum 
>>> acquired
>>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: plugin_handle_membership: 
>>> Node node11[794540] - state is now member (was (null))
>>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: On loss of CCM Quorum: 
>>> Ignore
>>> Jul  5 10:41:58 node2 stonith-ng[604]:   notice: Added 'st-fence_propio:0' 
>>> to the device list (1 active devices)
>>> Jul  5 10:41:59 node2 stonith

Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Ken Gaillot
On 07/06/2017 09:26 AM, Klaus Wenninger wrote:
> On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
>>> If node2 is getting the notification of its own fencing, it wasn't
>>> successfully fenced. Successful fencing would render it incapacitated
>>> (powered down, or at least cut off from the network and any shared
>>> resources).
>>
>> Maybe I don't understand you, or maybe you don't understand me... ;)
>> This is the syslog of the machine, where you can see that it has rebooted
>> successfully, and as I said, it has rebooted successfully every time:
> 
> It is not just a question of whether it was rebooted at all.
> Your fence agent must not report success until this has definitely
> happened and the node is down.
> Otherwise you will see that message, and the node will try to
> somehow cope with the fact that the rest of the cluster
> obviously thinks it is already down.

But the "allegedly fenced" message comes in after the node has rebooted,
so it would seem that everything was in the proper sequence.

It looks like a bug when the fenced node rejoins quickly enough that it
is a member again before its fencing confirmation has been sent. I know
there have been plenty of clusters with nodes that quickly reboot and
slow fencing devices, so that seems unlikely, but I don't see another
explanation.

>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys 
>> cpuset
>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu
>> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys 
>> cpuacct
>> Jul  5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 
>> (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 
>> SMP Debian 3.16.39-1 (2016-12-30)
>> Jul  5 10:41:54 node2 kernel: [0.00] Command line: 
>> BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 
>> root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd 
>> console=ttyS0 console=hvc0
>> Jul  5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical 
>> RAM map:
>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>> 0x-0x0009dfff] usable
>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>> 0x0009e000-0x0009] reserved
>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>> 0x000e-0x000f] reserved
>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>> 0x0010-0x3fff] usable
>> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
>> 0xfc00-0x] reserved
>> Jul  5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) 
>> protection: active
>> Jul  5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present.
>>
>> ...
>>
>> Jul  5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 
>> 67
>>
>> ...
>>
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync Cluster Engine 
>> ('UNKNOWN'): started and ready to provide service.
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync built-in features: 
>> nss
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Successfully read main 
>> configuration file '/etc/corosync/corosync.conf'.
>>
>> ...
>>
>> Jul  5 10:41:57 node2 crmd[608]:   notice: Defaulting to uname -n for the 
>> local classic openais (with plugin) node name
>> Jul  5 10:41:57 node2 crmd[608]:   notice: Membership 4308: quorum acquired
>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
>> node2[1108352940] - state is now member (was (null))
>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
>> node11[794540] - state is now member (was (null))
>> Jul  5 10:41:57 node2 crmd[608]:   notice: The local CRM is operational
>> Jul  5 10:41:57 node2 crmd[608]:   notice: State transition S_STARTING -> 
>> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Watching for stonith 
>> topology changes
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Membership 4308: quorum 
>> acquired
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: plugin_handle_membership: 
>> Node node11[794540] - state is now member (was (null))
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: On loss of CCM Quorum: 
>> Ignore
>> Jul  5 10:41:58 node2 stonith-ng[604]:   notice: Added 'st-fence_propio:0' 
>> to the device list (1 active devices)
>> Jul  5 10:41:59 node2 stonith-ng[604]:   notice: Operation reboot of node2 
>> by node11 for crmd.2141@node11.61c3e613: OK
>> Jul  5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by 
>> node11 for node11!
>> Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
>> crmd (conn=0x228d970, async-conn=0x228d970) left
>> Jul  5 10:41:59 node2 pacemakerd[597]:  warning: The crmd process (608) can 
>>> no longer be respawned, shutting the cluster down.

Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Klaus Wenninger
On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
>> If node2 is getting the notification of its own fencing, it wasn't
>> successfully fenced. Successful fencing would render it incapacitated
>> (powered down, or at least cut off from the network and any shared
>> resources).
>
> Maybe I don't understand you, or maybe you don't understand me... ;)
> This is the syslog of the machine, where you can see that it has rebooted
> successfully, and as I said, it has rebooted successfully every time:

It is not just a question of whether it was rebooted at all.
Your fence agent must not report success until this has definitely
happened and the node is down.
Otherwise you will see that message, and the node will try to
somehow cope with the fact that the rest of the cluster
obviously thinks it is already down.

>
> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuset
> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu
> Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys 
> cpuacct
> Jul  5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 
> (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP 
> Debian 3.16.39-1 (2016-12-30)
> Jul  5 10:41:54 node2 kernel: [0.00] Command line: 
> BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 
> root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd 
> console=ttyS0 console=hvc0
> Jul  5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical RAM 
> map:
> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
> 0x-0x0009dfff] usable
> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
> 0x0009e000-0x0009] reserved
> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
> 0x000e-0x000f] reserved
> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
> 0x0010-0x3fff] usable
> Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
> 0xfc00-0x] reserved
> Jul  5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) protection: 
> active
> Jul  5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present.
>
> ...
>
> Jul  5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
>
> ...
>
> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync Cluster Engine 
> ('UNKNOWN'): started and ready to provide service.
> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync built-in features: 
> nss
> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Successfully read main 
> configuration file '/etc/corosync/corosync.conf'.
>
> ...
>
> Jul  5 10:41:57 node2 crmd[608]:   notice: Defaulting to uname -n for the 
> local classic openais (with plugin) node name
> Jul  5 10:41:57 node2 crmd[608]:   notice: Membership 4308: quorum acquired
> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
> node2[1108352940] - state is now member (was (null))
> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
> node11[794540] - state is now member (was (null))
> Jul  5 10:41:57 node2 crmd[608]:   notice: The local CRM is operational
> Jul  5 10:41:57 node2 crmd[608]:   notice: State transition S_STARTING -> 
> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Watching for stonith 
> topology changes
> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Membership 4308: quorum 
> acquired
> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: plugin_handle_membership: 
> Node node11[794540] - state is now member (was (null))
> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: On loss of CCM Quorum: Ignore
> Jul  5 10:41:58 node2 stonith-ng[604]:   notice: Added 'st-fence_propio:0' to 
> the device list (1 active devices)
> Jul  5 10:41:59 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
> node11 for crmd.2141@node11.61c3e613: OK
> Jul  5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by 
> node11 for node11!
> Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
> crmd (conn=0x228d970, async-conn=0x228d970) left
> Jul  5 10:41:59 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping pengine: Sent -15 
> to process 607
> Jul  5 10:41:59 node2 pengine[607]:   notice: Invoking handler for signal 15: 
> Terminated
> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping attrd: Sent -15 to 
> process 606
> Jul  5 10:41:59 node2 attrd[606]:   notice: Invoking handler for signal 15: 
> Terminated
> Jul  5 10:41:59 node2 attrd[606]:   notice: Exiting...
> Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Cl

Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Cesar Hernandez

> 
> If node2 is getting the notification of its own fencing, it wasn't
> successfully fenced. Successful fencing would render it incapacitated
> (powered down, or at least cut off from the network and any shared
> resources).


Maybe I don't understand you, or maybe you don't understand me... ;)
This is the syslog of the machine, where you can see that it has rebooted
successfully, and as I said, it has rebooted successfully every time:

Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuset
Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu
Jul  5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuacct
Jul  5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 
(debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP 
Debian 3.16.39-1 (2016-12-30)
Jul  5 10:41:54 node2 kernel: [0.00] Command line: 
BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 
root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd 
console=ttyS0 console=hvc0
Jul  5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical RAM 
map:
Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
0x-0x0009dfff] usable
Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
0x0009e000-0x0009] reserved
Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
0x000e-0x000f] reserved
Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
0x0010-0x3fff] usable
Jul  5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 
0xfc00-0x] reserved
Jul  5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) protection: 
active
Jul  5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present.

...

Jul  5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67

...

Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync Cluster Engine 
('UNKNOWN'): started and ready to provide service.
Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync built-in features: nss
Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Successfully read main 
configuration file '/etc/corosync/corosync.conf'.

...

Jul  5 10:41:57 node2 crmd[608]:   notice: Defaulting to uname -n for the local 
classic openais (with plugin) node name
Jul  5 10:41:57 node2 crmd[608]:   notice: Membership 4308: quorum acquired
Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
node2[1108352940] - state is now member (was (null))
Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node 
node11[794540] - state is now member (was (null))
Jul  5 10:41:57 node2 crmd[608]:   notice: The local CRM is operational
Jul  5 10:41:57 node2 crmd[608]:   notice: State transition S_STARTING -> 
S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Watching for stonith topology 
changes
Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Membership 4308: quorum 
acquired
Jul  5 10:41:57 node2 stonith-ng[604]:   notice: plugin_handle_membership: Node 
node11[794540] - state is now member (was (null))
Jul  5 10:41:57 node2 stonith-ng[604]:   notice: On loss of CCM Quorum: Ignore
Jul  5 10:41:58 node2 stonith-ng[604]:   notice: Added 'st-fence_propio:0' to 
the device list (1 active devices)
Jul  5 10:41:59 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
node11 for crmd.2141@node11.61c3e613: OK
Jul  5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by 
node11 for node11!
Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
crmd (conn=0x228d970, async-conn=0x228d970) left
Jul  5 10:41:59 node2 pacemakerd[597]:  warning: The crmd process (608) can no 
longer be respawned, shutting the cluster down.
Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping pengine: Sent -15 to 
process 607
Jul  5 10:41:59 node2 pengine[607]:   notice: Invoking handler for signal 15: 
Terminated
Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping attrd: Sent -15 to 
process 606
Jul  5 10:41:59 node2 attrd[606]:   notice: Invoking handler for signal 15: 
Terminated
Jul  5 10:41:59 node2 attrd[606]:   notice: Exiting...
Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
attrd (conn=0x2280ef0, async-conn=0x2280ef0) left




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Ken Gaillot
On 07/06/2017 08:54 AM, Cesar Hernandez wrote:
> 
>>
>> So, the above log means that node1 decided that node2 needed to be
>> fenced, requested fencing of node2, and received a successful result for
>> the fencing, and yet node2 was not killed.
>>
>> Your fence agent should not return success until node2 has verifiably
>> been stopped. If there is some way to query the AWS API whether node2 is
>> running or not, that would be sufficient (merely checking that the node
>> is not responding to some command such as ping is not sufficient).
> 
> Thanks. But node2 has always been successfully fenced... so this is not the 
> problem

If node2 is getting the notification of its own fencing, it wasn't
successfully fenced. Successful fencing would render it incapacitated
(powered down, or at least cut off from the network and any shared
resources).



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-06 Thread Ken Gaillot
On 07/04/2017 08:28 AM, Cesar Hernandez wrote:
> 
>>
>> Agreed, I don't think it's multicast vs unicast.
>>
>> I can't see from this what's going wrong. Possibly node1 is trying to
>> re-fence node2 when it comes back. Check that the fencing resources are
>> configured correctly, and check whether node1 sees the first fencing
>> succeed.
> 
> 
> Thanks. I checked the fencing resource and it always returns OK; it's a custom
> script I have used on other installations and it has always worked.
> I think the clue is in the two messages that appear when it fails:
> 
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
> node1 for node1!
> 
> Does anyone know what they are related to? There doesn't seem to be much
> information about them on the Internet.
> 
> Thanks
> Cesar

"We were allegedly just fenced" means that the node just received a
notification from stonithd that another node successfully fenced it.
Clearly, this is a problem, because a node that is truly fenced should
be unable to receive any communications from the cluster. As such, the
cluster services immediately exit and stay down.

So, the above log means that node1 decided that node2 needed to be
fenced, requested fencing of node2, and received a successful result for
the fencing, and yet node2 was not killed.

Your fence agent should not return success until node2 has verifiably
been stopped. If there is some way to query the AWS API whether node2 is
running or not, that would be sufficient (merely checking that the node
is not responding to some command such as ping is not sufficient).
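
For example, with an EC2-backed node a power-cycle style agent could verify that
the node is actually down before reporting success, roughly like this. A sketch
only; the instance id variable, the AWS CLI calls used here and the timeouts are
assumptions, not the script from this thread:

# Power the instance off and only report success once AWS confirms it stopped.
aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --force || exit 1
for i in $(seq 1 60); do
    state=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[0].Instances[0].State.Name' --output text)
    if [ "$state" = "stopped" ]; then
        aws ec2 start-instances --instance-ids "$INSTANCE_ID"  # complete the reboot
        exit 0
    fi
    sleep 5
done
exit 1   # never confirmed the node was down, so report failure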



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Cesar Hernandez
> 
> 
>>> 
>>> But you definitely shouldn't have a fencing-agent that claims to have fenced
>>> a node if it is not sure - rather the other way round if in doubt.
>> 
>> 
> 
> True! Which is why I said it is dangerous.
> But your fence agent is even more dangerous ;-)
> 
> 

Well... my startup fencing does, first, an "ssh $NODE reboot -f" and then another
reboot using the AWS API. Every time the node was reachable, the ssh reboot
succeeded. And the rest of the time, the AWS reboot works 100% of the time.

So I think it's not dangerous to always return OK from the fencing script when
I'm 100% (or 99.%) sure that fencing works ;)

Anyway, it's not the cause of this behavior. I'll keep trying... I'm also
thinking of posting this issue on the developers' list.

Thanks


Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Klaus Wenninger
On 07/05/2017 04:50 PM, Cesar Hernandez wrote:
>> Probably not a good idea - and the reason for what you are experiencing ;-)
>> If you have problems starting the nodes within a certain time window,
>> disabling startup-fencing might be an option to consider, although it is
>> dangerous. But you definitely shouldn't have a fence agent that claims to
>> have fenced a node if it is not sure - rather the other way round if in doubt.
>
> Thanks. But I think it is not a good idea to disable startup fencing: I have
> shared disks (DRBD) and stonith is very important in this scenario.
>

True! Which is why I said it is dangerous.
But your fence agent is even more dangerous ;-)





Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Cesar Hernandez

> Probably not a good idea - and the reason for what you are experiencing ;-)
> If you have problems starting the nodes within a certain time window,
> disabling startup-fencing might be an option to consider, although it is
> dangerous. But you definitely shouldn't have a fence agent that claims to
> have fenced a node if it is not sure - rather the other way round if in doubt.


Thanks. But I think it is not a good idea to disable startup fencing: I have
shared disks (DRBD) and stonith is very important in this scenario.




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Klaus Wenninger
On 07/05/2017 04:22 PM, Cesar Hernandez wrote:
>
>> Are you logging which ones went OK and which failed?
>> Does the script return failure if both go wrong?
> The script always returns OK
Probably not a good idea - and the reason for what you are experiencing ;-)
If you have problems starting the nodes within a certain time window,
disabling startup-fencing might be an option to consider, although it is
dangerous. But you definitely shouldn't have a fence agent that claims to have
fenced a node if it is not sure - rather the other way round if in doubt.
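
For reference, the option in question is the startup-fencing cluster property
(true by default); with crmsh it would be disabled roughly like this - again,
only something to consider if you understand the risk of an unseen node being
assumed safely down:

crm configure property startup-fencing=false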



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Cesar Hernandez


> Are you logging which ones went OK and which failed?
> Does the script return failure if both go wrong?

The script always returns OK



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-05 Thread Klaus Wenninger
On 07/05/2017 08:50 AM, Cesar Hernandez wrote:
>> Might be kind of a strange race as well ... but without knowing what the
>> script actually does ...
>>
> The script first tries to reboot the node using ssh, something like "ssh $NODE
> reboot -f", and then performs a remote reboot using the AWS API.
Are you logging which ones went OK and which failed?
Does the script return failure if both go wrong?

> Thanks 
>
>
>



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-04 Thread Cesar Hernandez

> Might be kind of a strange race as well ... but without knowing what the
> script actually does ...
> 

The script first tries to reboot the node using ssh, something like "ssh $NODE
reboot -f", and then performs a remote reboot using the AWS API.

Thanks 




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-04 Thread Klaus Wenninger
On 07/04/2017 04:52 PM, Cesar Hernandez wrote:
>> The first line is the consequence of the 2nd.
>> And the "allegedly fenced" line says that node2 has just seen some fencing
>> resource positively report having fenced it - which is why crmd exits in a
>> way that it is not respawned by pacemakerd.
> Thanks. But my script has a logfile, I've checked it, and neither node has
> fenced itself: node1 has always fenced node2, and node2 has always fenced
> node1.

I didn't say it had fenced itself. Actually, it says that node1 has fenced
node2 on behalf of node1. But obviously that is not true, as otherwise node2
wouldn't be alive to receive that info.
Might be kind of a strange race as well ... but without knowing what the
script actually does ...

Regards,
Klaus

> Cheers
> Cesar
>




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-04 Thread Cesar Hernandez

> The first line is the consequence of the 2nd.
> And the "allegedly fenced" line says that node2 has just seen some fencing
> resource positively report having fenced it - which is why crmd exits in a
> way that it is not respawned by pacemakerd.

Thanks. But my script has a logfile, I've checked it, and neither node has
fenced itself: node1 has always fenced node2, and node2 has always fenced
node1.

Cheers
Cesar




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-04 Thread Klaus Wenninger
On 07/04/2017 03:28 PM, Cesar Hernandez wrote:
>> Agreed, I don't think it's multicast vs unicast.
>>
>> I can't see from this what's going wrong. Possibly node1 is trying to
>> re-fence node2 when it comes back. Check that the fencing resources are
>> configured correctly, and check whether node1 sees the first fencing
>> succeed.
>
> Thanks. I checked the fencing resource and it always returns OK; it's a custom
> script I have used on other installations and it has always worked.
> I think the clue is in the two messages that appear when it fails:
>
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
> node1 for node1!
>
> Does anyone know what they are related to? There doesn't seem to be much
> information about them on the Internet.

The first line is the consequence of the 2nd.
And the "allegedly fenced" line says that node2 has just seen some fencing
resource positively report having fenced it - which is why crmd exits in a way
that it is not respawned by pacemakerd.
As for the reason, I can only guess ...
Maybe your script is not checking which node it should actually fence and just
assumes something!?
You can configure which targets a certain fencing resource is responsible for
using pcmk_host_map, pcmk_host_check & pcmk_host_list.
That way it is possible that the fencing resource was configured differently
in the setup where your script previously worked.
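
For example, with crmsh a stonith resource restricted to specific targets could
be defined roughly like this (the resource and agent names here are assumptions
based on the logs; the pcmk_host_* parameters are the relevant part):

crm configure primitive st-fence_propio stonith:fence_propio \
    params pcmk_host_check="static-list" pcmk_host_list="node1 node2"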

Regards,
Klaus

>
> Thanks
> Cesar
>
>


Re: [ClusterLabs] Problem with stonith and starting services

2017-07-04 Thread Cesar Hernandez

> 
> Agreed, I don't think it's multicast vs unicast.
> 
> I can't see from this what's going wrong. Possibly node1 is trying to
> re-fence node2 when it comes back. Check that the fencing resources are
> configured correctly, and check whether node1 sees the first fencing
> succeed.


Thanks. I checked the fencing resource and it always returns OK; it's a custom
script I have used on other installations and it has always worked.
I think the clue is in the two messages that appear when it fails:

Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can no 
longer be respawned, shutting the cluster down.
Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
node1 for node1!

Does anyone know what they are related to? There doesn't seem to be much
information about them on the Internet.

Thanks
Cesar




Re: [ClusterLabs] Problem with stonith and starting services

2017-07-03 Thread Ken Gaillot
On 07/03/2017 02:34 AM, Cesar Hernandez wrote:
> Hi
> 
> I have installed a Pacemaker cluster with two nodes. The same type of
> installation has been done many times before and the following error never
> appeared. The situation is the following:
> 
> both nodes running cluster services
> stop pacemaker&corosync on node 1
> stop pacemaker&corosync on node 2
> start corosync&pacemaker on node 1
> 
> Then node 1 starts, sees node2 down, and fences it, as expected. But the
> problem comes when node 2 is rebooted and starts the cluster services:
> sometimes corosync starts but the pacemaker service starts and then stops.
> The syslog shows the following error in these cases:
> 
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
> 
> The preceding messages show some warnings that I'm not sure are related to
> the shutdown:
> 
> 
> Jul  3 09:07:04 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
> node1 for crmd.2413@node1.608d8118: OK
> Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
> node1 for node1!
> Jul  3 09:07:04 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
> crmd (conn=0x1471800, async-conn=0x1471800) left
> 
> 
> On node1, all resources become unrunnable and it stays that way until I
> manually start the pacemaker service on node2.
> As I said, the same type of installation has been done before on other servers
> and this never happened. The only difference is that in previous installations
> I configured corosync with multicast and now I have configured it with unicast
> (my current network environment doesn't allow multicast), but I don't think
> that is related to this behaviour.

Agreed, I don't think it's multicast vs unicast.

I can't see from this what's going wrong. Possibly node1 is trying to
re-fence node2 when it comes back. Check that the fencing resources are
configured correctly, and check whether node1 sees the first fencing
succeed.

> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
> 
> 
> 
> Can you help me?
> 
> Thanks
> 
> Cesar
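
For reference, the unicast setup mentioned above corresponds to the udpu
transport in corosync 1.x; a minimal corosync.conf sketch, with placeholder
addresses rather than the actual configuration from this thread:

totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        member {
            memberaddr: 10.0.0.1
        }
        member {
            memberaddr: 10.0.0.2
        }
    }
}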

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org