Re: [ClusterLabs] Pacemaker remote node offline after reboot

2017-05-11 Thread Ken Gaillot
On 05/11/2017 03:45 PM, Ignazio Cassano wrote:
> Hello, I installed a pacemaker cluster with 3 nodes and 2 remote nodes
> (pacemaker remote). All nodes are CentOS 7.3. The remote nodes are
> online and pacemaker resources are running on them. When I reboot a
> remote pacemaker node, it does not come back online and its pacemaker
> resources remain blocked. The only solution I found is removing the
> remote node (pcs resource delete node) and creating it again. In that
> case its resources become immediately available.
> Please, any help?
> Ignazio

Did you do "systemctl enable pacemaker_remote"?

By reboot, do you mean gracefully or simulating a failure? A graceful
exit should cause all resources on the remote node to stop. If failure,
the cluster needs to be able to fence the remote node before it can
recover its resources.
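
For example, on the remote node itself (assuming a stock CentOS 7 setup
with systemd, firewalld, and the default pacemaker_remote TCP port 3121 --
adjust to your environment), a quick sanity check would be something like:

    systemctl enable pacemaker_remote
    systemctl start pacemaker_remote
    systemctl status pacemaker_remote

    # the cluster nodes must be able to reach the remote daemon
    firewall-cmd --permanent --add-port=3121/tcp
    firewall-cmd --reload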

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-11 Thread Ken Gaillot
On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> Hi
> I translated the Postgresql multi-state RA
> (https://github.com/dalibo/PAF) into Python
> (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> heavily.
> 
> In parallel I am writing unit tests and functional tests.
> 
> I am having an issue with a functional test that abruptly powers off a
> slave named, say, "test3" (a hot standby PG instance). Later on I power
> the slave back on. Once it is up, I run "pcs cluster start test3". And
> this is where I start having a problem.
> 
> I check every second the output of "pcs status xml" until test3 is said
> to be ready as a slave again. In the following I assume that test3 is
> ready as a slave:
> 
> <nodes>
> <node ... standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="false"
> resources_running="2" type="member" />
> <node ... standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="true"
> resources_running="1" type="member" />
> <node ... standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="false"
> resources_running="1" type="member" />
> </nodes>

The <nodes> section says nothing about the current state of the nodes.
Look at the <node_state> entries for that. in_ccm means the cluster
stack level, and crmd means the pacemaker level -- both need to be up.
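
For example, something like this will show those entries (the grep is only
to make the node_state lines easy to spot):

    pcs cluster cib | grep node_state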

> <clone id="..." multi_state="true" unique="false"
> managed="true" failed="false" failure_ignored="false" >
> <resource id="..." resource_agent="..."
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> ...
> <resource id="..." resource_agent="..."
> role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> ...
> <resource id="..." resource_agent="..."
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> ...
> </clone>
> By "ready" I mean that upon running "pcs cluster start test3", the
> following occurs before test3 appears ready in the XML:
> 
> pcs cluster start test3
> monitor         -> RA returns unknown error (1)
> notify/pre-stop -> RA returns ok (0)
> stop            -> RA returns ok (0)
> start           -> RA returns ok (0)
> 
> The problem I have is that between "pcs cluster start test3" and
> "monitor", it seems that the XML returned by "pcs status xml" says test3
> is ready (the XML extract above is what I get at that moment). Once
> "monitor" occurs, the returned XML shows test3 to be offline, and not
> until the start is finished do I once again have test3 shown as ready.
> 
> Am I getting anything wrong? Is there a simpler or better way to check
> if test3 is fully functional again, i.e. that the OCF start was successful?
> 
> Thanks
> 
> Ludovic

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker remote node offline after reboot

2017-05-11 Thread Ignazio Cassano
Hello, I installed a pacemaker cluster with 3 nodes and 2 remote nodes
(pacemaker remote). All nodes are CentOS 7.3. The remote nodes are online
and pacemaker resources are running on them. When I reboot a remote
pacemaker node, it does not come back online and its pacemaker resources
remain blocked. The only solution I found is removing the remote node (pcs
resource delete node) and creating it again. In that case its resources
become immediately available.
Please, any help?
Ignazio
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-11 Thread Ludovic Vaugeois-Pepin
Hi
I translated the Postgresql multi-state RA (https://github.com/dalibo/PAF)
into Python (https://github.com/ulodciv/deploy_cluster), and I have been
editing it heavily.

In parallel I am writing unit tests and functional tests.

I am having an issue with a functional test that abruptly powers off a
slave named, say, "test3" (a hot standby PG instance). Later on I power the
slave back on. Once it is up, I run "pcs cluster start test3". And this
is where I start having a problem.

I check every second the output of "pcs status xml" until test3 is said to
be ready as a slave again. In the following I assume that test3 is ready as
a slave:

[pcs status xml output lost in the archive -- see the copy quoted in the reply above]

By "ready" I mean that upon running "pcs cluster start test3", the
following occurs before test3 appears ready in the XML:

pcs cluster start test3
monitor         -> RA returns unknown error (1)
notify/pre-stop -> RA returns ok (0)
stop            -> RA returns ok (0)
start           -> RA returns ok (0)

The problem I have is that between "pcs cluster start test3" and "monitor",
it seems that the XML returned by "pcs status xml" says test3 is ready (the
XML extract above is what I get at that moment). Once "monitor" occurs, the
returned XML shows test3 to be offline, and not until the start is finished
do I once again have test3 shown as ready.

Am I getting anything wrong? Is there a simpler or better way to check if
test3 is fully functional again, i.e. that the OCF start was successful?

Thanks

Ludovic
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] SAP HANA resource start problem

2017-05-11 Thread Muhammad Sharfuddin

pacemaker 1.1.15-21.1
libpacemaker3 1.1.15-21.1
DB: SAP HANA SPS 12

Started manually, the HANA DB starts and works perfectly, and the
Master/Primary replicates to the Secondary/Slave perfectly.


But when I start the HANA DB pacemaker resource, crm_mon shows that the
HANA DB resource gets started and both nodes become Slave, and it keeps
showing them as Slave forever. However, the HANA DB is not actually
started by the cluster, which can be verified by running "sapcontrol -nr 00
-function GetProcessList": it shows that the system is not running.
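
For this SID and instance number, the check above is run e.g. as the
<sid>adm user:

    su - tstadm -c "sapcontrol -nr 00 -function GetProcessList"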


HANA Topology and DB resource configuration:

primitive rsc_SAPHanaTopology_TST_HDB00 ocf:suse:SAPHanaTopology \
        operations $id=rsc_sap2_TST_HDB00-operations \
        op monitor interval=10 timeout=600 \
        op start interval=0 timeout=600 \
        op stop interval=0 timeout=300 \
        params SID=TST InstanceNumber=00

primitive rsc_SAPHana_TST_HDB00 ocf:suse:SAPHana \
        operations $id=rsc_sap_TST_HDB00-operations \
        op start interval=0 timeout=3600 \
        op stop interval=0 timeout=3600 \
        op promote interval=0 timeout=1600 \
        op monitor interval=60 role=Master timeout=700 \
        op monitor interval=61 role=Slave timeout=700 \
        params SID=TST InstanceNumber=00 PREFER_SITE_TAKEOVER=true \
        DUPLICATE_PRIMARY_TIMEOUT=600 AUTOMATED_REGISTER=true

ms msl_SAPHana_TST_HDB00 rsc_SAPHana_TST_HDB00 \
        meta is-managed=true notify=true clone-max=2 clone-node-max=1 \
        target-role=Started interleave=true

clone cln_SAPHanaTopology_TST_HDB00 rsc_SAPHanaTopology_TST_HDB00 \
        meta is-managed=true clone-node-max=1 target-role=Started \
        interleave=true
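
For completeness, the usual SAPHanaSR colocation/order constraints (as in
the SUSE best-practice guides) would look roughly like the sketch below;
the names simply mirror the IDs above and the actual constraints in this
cluster may differ:

colocation col_saphana_ip_TST_HDB00 2000: rsc_ip_TST_HDB00:Started \
        msl_SAPHana_TST_HDB00:Master
order ord_SAPHana_TST_HDB00 Optional: cln_SAPHanaTopology_TST_HDB00 \
        msl_SAPHana_TST_HDB00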


The following events are logged when the cluster tries to start the HANA DB
resource:
2017-05-11T15:29:35.775044+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation p_fence_saphdbtst1_monitor_0 locally on 
saphdbtst2
2017-05-11T15:29:35.776600+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation p_fence_saphdbtst1_monitor_0 on saphdbtst1
2017-05-11T15:29:35.777021+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation p_fence_saphdbtst2_monitor_0 locally on 
saphdbtst2
2017-05-11T15:29:35.779302+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation p_fence_saphdbtst2_monitor_0 on saphdbtst1
2017-05-11T15:29:35.779770+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_ip_TST_HDB00_monitor_0 locally on 
saphdbtst2
2017-05-11T15:29:35.843129+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_ip_TST_HDB00_monitor_0 on saphdbtst1
2017-05-11T15:29:35.843567+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_SAPHana_TST_HDB00:0_monitor_0 locally 
on saphdbtst2
2017-05-11T15:29:35.845257+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_SAPHana_TST_HDB00:0_monitor_0 on saphdbtst1
2017-05-11T15:29:35.845682+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_SAPHanaTopology_TST_HDB00:0_monitor_0 
locally on saphdbtst2
2017-05-11T15:29:35.847105+05:00 saphdbtst2 crmd[10195]:   notice: 
Initiating monitor operation rsc_SAPHanaTopology_TST_HDB00:0_monitor_0 
on saphdbtst1
2017-05-11T15:29:35.902758+05:00 saphdbtst2 crmd[10195]:   notice: 
Result of probe operation for p_fence_saphdbtst1 on saphdbtst2: 7 (not 
running)
2017-05-11T15:29:35.903336+05:00 saphdbtst2 crmd[10195]:   notice: 
Result of probe operation for p_fence_saphdbtst2 on saphdbtst2: 7 (not 
running)
2017-05-11T15:29:35.938865+05:00 saphdbtst2 crmd[10195]:   notice: 
Result of probe operation for rsc_ip_TST_HDB00 on saphdbtst2: 7 (not 
running)

2017-05-11T15:29:35.950344+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:35.984612+05:00 saphdbtst2 systemd[1]: Started Session 
c45235 of user tstadm.

2017-05-11T15:29:36.432092+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:36.456621+05:00 saphdbtst2 systemd[1]: Started Session 
c45236 of user tstadm.

2017-05-11T15:29:36.463414+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:36.468627+05:00 saphdbtst2 systemd[1]: Started Session 
c45237 of user tstadm.
2017-05-11T15:29:36.991382+05:00 saphdbtst2 
SAPHana(rsc_SAPHana_TST_HDB00)[10203]: INFO: RA  begin action 
monitor_clone (0.152.17) 

2017-05-11T15:29:37.050443+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:37.072682+05:00 saphdbtst2 systemd[1]: Started Session 
c45238 of user tstadm.

2017-05-11T15:29:40.077857+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:40.100617+05:00 saphdbtst2 systemd[1]: Started Session 
c45239 of user tstadm.
2017-05-11T15:29:40.121749+05:00 saphdbtst2 crmd[10195]:   notice: 
Transition aborted by status-180881403-master-rsc_SAPHana_TST_HDB00 
doing create master-rsc_SAPHana_TST_HDB00=5: Transient attribute change

2017-05-11T15:29:40.614835+05:00 saphdbtst2 su: (to tstadm) root on none
2017-05-11T15:29:40.636628+05:00 saphdbtst2 systemd[1]: Started Session 
c45240 of user tstadm.
2017-05-11T15:29:43.363140+05:00 saphdbtst2 
SAPHa

Re: [ClusterLabs] newbie question

2017-05-11 Thread Ken Gaillot
On 05/05/2017 03:09 PM, Sergei Gerasenko wrote:
> Hi,
> 
> I have a very simple question. 
> 
> Pacemaker uses a dedicated "multicast" interface for the totem protocol.
> I'm using pacemaker with LVS to provide HA load balancing. LVS uses
> multicast interfaces to sync the status of TCP connections if a failover
> occurs.
> 
> I can understand services using the same interface if ports are used.
> That way you can get a socket (ip + port). But there are no ports in this
> case. So how can two services exchange messages without specifying
> ports? I guess that's somehow related to multicast, but I don't get how
> exactly.
> 
> Can somebody point me to a primer on this topic?
> 
> Thanks,
>   S.

Corosync is actually the cluster component that can use multicast, and
it does use a specific port on a specific address. By default, it uses
ports 5404 and 5405 when using multicast. See the corosync.conf(5) man
page for mcastaddr and mcastport. Also see the transport option;
corosync can be configured to use UDP unicast rather than multicast.
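
For example, the relevant pieces of corosync.conf look roughly like this
(a sketch only -- the addresses are placeholders and the defaults vary by
distribution):

totem {
    version: 2
    transport: udp              # or "udpu" for UDP unicast
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0    # network of the cluster interface
        mcastaddr: 239.255.1.1
        mcastport: 5405             # corosync also uses mcastport - 1 (5404)
    }
}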

I don't remember much about LVS, but I would guess it's the same -- it's
probably just using a default port if not specified in the config.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-11 Thread Albert Weng
Hi Ken,

thank you for your comment.

I think this case can be closed. I used your suggested constraint and the
problem is resolved.

Thanks a lot~~

On Thu, May 4, 2017 at 10:28 PM, Ken Gaillot  wrote:

> On 05/03/2017 09:04 PM, Albert Weng wrote:
> > Hi Marek,
> >
> > Thanks your reply.
> >
> > On Tue, May 2, 2017 at 5:15 PM, Marek Grac wrote:
> >
> >
> >
> > On Tue, May 2, 2017 at 11:02 AM, Albert Weng wrote:
> >
> >
> > Hi Marek,
> >
> > thanks for your quick response.
> >
> > Following your suggestion, when I type "pcs status" I see
> > the following result for the fence devices:
> > ipmi-fence-node1(stonith:fence_ipmilan):Started clusterb
> > ipmi-fence-node2(stonith:fence_ipmilan):Started clusterb
> >
> > Does it mean both ipmi stonith devices are working correctly?
> > (The rest of the resources can fail over to another node correctly.)
> >
> >
> > Yes, they are working correctly.
> >
> > When it becomes important to run a fence agent to kill the other
> > node, it will be executed from the other node, so where the fence
> > agent currently resides is not important.
> >
> > Does "started on <node>" mean that node controls the fence behavior?
> > Even if all fence agents and resources are "started" on the same node,
> > does the cluster's fencing behavior still work correctly?
> >
> >
> > Thanks a lot.
> >
> > m,
>
> Correct. Fencing is *executed* independently of where or even whether
> fence devices are running. The node that is "running" a fence device
> performs the recurring monitor on the device; that's the only real effect.
>
> > should i have to use location constraint to avoid stonith device
> > running on same node ?
> > # pcs constraint location ipmi-fence-node1 prefers clustera
> > # pcs constraint location ipmi-fence-node2 prefers clusterb
> >
> > thanks a lot
>
> It's a good idea, so that a node isn't monitoring its own fence device,
> but that's the only reason -- it doesn't affect whether or how the node
> can be fenced. I would configure it as an anti-location, e.g.
>
>pcs constraint location ipmi-fence-node1 avoids node1=100
>
> In a 2-node cluster, there's no real difference, but in a larger
> cluster, it's the simplest config. I wouldn't use INFINITY (there's no
> harm in a node monitoring its own fence device if it's the last node
> standing), but I would use a score high enough to outweigh any stickiness.
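>
> For instance, with the node names used in this thread, that could look
> something like this (the score of 100 is only an example):
>
>    pcs constraint location ipmi-fence-node1 avoids clustera=100
>    pcs constraint location ipmi-fence-node2 avoids clusterb=100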
>
> > On Tue, May 2, 2017 at 4:25 PM, Marek Grac wrote:
> >
> > Hi,
> >
> >
> >
> > On Tue, May 2, 2017 at 3:39 AM, Albert Weng <weng.alb...@gmail.com>
> > wrote:
> >
> > Hi All,
> >
> > I have created an active/passive pacemaker cluster on RHEL 7.
> >
> > here is my environment:
> > clustera : 192.168.11.1
> > clusterb : 192.168.11.2
> > clustera-ilo4 : 192.168.11.10
> > clusterb-ilo4 : 192.168.11.11
> >
> > both nodes are connected SAN storage for shared storage.
> >
> > I used the following commands to create my stonith devices, one
> > for each node:
> > # pcs -f stonith_cfg stonith create ipmi-fence-node1
> > fence_ipmilan parms lanplus="ture"
> > pcmk_host_list="clustera" pcmk_host_check="static-list"
> > action="reboot" ipaddr="192.168.11.10"
> > login=adminsitrator passwd=1234322 op monitor
> interval=60s
> >
> > # pcs -f stonith_cfg stonith create ipmi-fence-node02
> > fence_ipmilan parms lanplus="true"
> > pcmk_host_list="clusterb" pcmk_host_check="static-list"
> > action="reboot" ipaddr="192.168.11.11" login=USERID
> > passwd=password op monitor interval=60s
> >
> > # pcs status
> > ipmi-fence-node1 clustera
> > ipmi-fence-node2 clusterb
> >
> > but when I fail over to the passive node and then run
> > # pcs status
> >
> > ipmi-fence-node1    clusterb
> > ipmi-fence-node2    clusterb
> >
> > why are both fence devices located on the same node?
> >
> >
> > When node 'clustera' is down, is there any place where
> > ipmi-fence-node* can be executed?
> >
> > If you are worried that a node cannot fence itself, you are
> > right. But once 'clustera' becomes available again, an attempt
> > to fence clusterb will work as expected.
> >
> > m,
> >
> > ___
> > Users mailing