[ClusterLabs] both nodes OFFLINE

2017-05-12 Thread 石井 俊直
Hi.

We sometimes have a problem in our two-node cluster on CentOS 7. Let node-2 and node-3
be the names of the nodes. When the problem happens, both nodes are shown as OFFLINE
on node-3, while on node-2 only node-3 is shown as OFFLINE.

When that happens, the following log message is added repeatedly on node-2, and the
log file (/var/log/cluster/corosync.log) grows to hundreds of megabytes in a short
time. The log content on node-3 is different.

The erroneous state is temporarily resolved if the OS of node-2 is restarted. On the
other hand, restarting the OS of node-3 results in the same state.

I’ve searched the mailing list archives and found a post (Mon Oct 1 01:27:39 CEST 2012)
about the "Discarding update with feature set" problem. According to that message, our
problem may be solved by removing /var/lib/pacemaker/crm/cib.* on node-2.

What I want to know is whether removing the above files on just one of the nodes is
safe. If there is another method to solve the problem, I’d like to hear it.
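For reference, the procedure usually suggested for this might look like the sketch below. This is an assumption based on the 2012 post mentioned above, not a verified fix: the paths are parameterized so the sketch can be exercised safely, and on a real node they would be /var/lib/pacemaker/crm plus a backup directory of your choosing. The cluster stack must be stopped on that node first.

```shell
# backup_local_cib: move a node's cached CIB files aside (rather than
# deleting them) so the node rejoins with no local CIB and pulls a
# fresh copy from its peer. Hypothetical helper, not an official tool.
backup_local_cib() {
    cib_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir"
    # On the real node, stop the stack first:
    #   systemctl stop pacemaker corosync
    mv "$cib_dir"/cib* "$backup_dir"/
    # ...then start it again and let the CIB resync from the peer:
    #   systemctl start corosync pacemaker
}
```

Run on node-2 only (e.g. `backup_local_cib /var/lib/pacemaker/crm /root/cib-backup`), so node-3 keeps the copy with the newer feature set.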

Thanks.

—— from corosync.log  
cib:error: cib_perform_op:  Discarding update with feature set '3.0.11' 
greater than our own '3.0.10'
cib:error: cib_process_request: Completed cib_replace operation for 
section 'all': Protocol not supported (rc=-93, origin=node-3/crmd/12708, 
version=0.83.30)
crmd:   error: finalize_sync_callback:  Sync from node-3 failed: Protocol not 
supported
crmd:info: register_fsa_error_adv:  Resetting the current action list
crmd: warning: do_log:  Input I_ELECTION_DC received in state S_FINALIZE_JOIN 
from finalize_sync_callback
crmd:info: do_state_transition: State transition S_FINALIZE_JOIN -> 
S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL 
origin=finalize_sync_callback
crmd:info: crm_update_peer_join:initialize_join: Node node-2[1] - 
join-6329 phase 2 -> 0
crmd:info: crm_update_peer_join:initialize_join: Node node-3[2] - 
join-6329 phase 2 -> 0
crmd:info: update_dc:   Unset DC. Was node-2
crmd:info: join_make_offer: join-6329: Sending offer to node-2
crmd:info: crm_update_peer_join:join_make_offer: Node node-2[1] - 
join-6329 phase 0 -> 1
crmd:info: join_make_offer: join-6329: Sending offer to node-3
crmd:info: crm_update_peer_join:join_make_offer: Node node-3[2] - 
join-6329 phase 0 -> 1
crmd:info: do_dc_join_offer_all:join-6329: Waiting on 2 outstanding 
join acks
crmd:info: update_dc:   Set DC to node-2 (3.0.10)
crmd:info: crm_update_peer_join:do_dc_join_filter_offer: Node node-2[1] 
- join-6329 phase 1 -> 2
crmd:info: crm_update_peer_join:do_dc_join_filter_offer: Node node-3[2] 
- join-6329 phase 1 -> 2
crmd:info: do_state_transition: State transition S_INTEGRATION -> 
S_FINALIZE_JOIN | input=I_INTEGRATED cause=C_FSA_INTERNAL 
origin=check_join_state
crmd:info: crmd_join_phase_log: join-6329: node-2=integrated
crmd:info: crmd_join_phase_log: join-6329: node-3=integrated
crmd:  notice: do_dc_join_finalize: Syncing the Cluster Information Base 
from node-3 to rest of cluster | join-6329
crmd:  notice: do_dc_join_finalize: Requested version   
cib: info: cib_process_request: Forwarding cib_sync operation for 
section 'all' to node-3 (origin=local/crmd/12710)
cib: info: cib_process_replace: Digest matched on replace from node-3: 
85a19c7927c54ccb15794f2720e07ce1
cib: info: cib_process_replace: Replaced 0.83.30 with 0.84.1 from node-3
cib: info: __xml_diff_object:   Moved node_state@crmd (3 -> 2)
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ken Gaillot
Another possibility you might want to look into is alerts. Pacemaker can
call a script of your choosing whenever a resource is started or
stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.
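For illustration, a minimal alert agent might look like the sketch below. The CRM_alert_* environment variables are set by Pacemaker when it invokes the agent; the log path, script name, and the filtering to start/stop tasks are my own assumptions.

```shell
#!/bin/sh
# Hypothetical alert agent: append completed resource start/stop events
# to a log file. Pacemaker exports the CRM_alert_* variables before
# invoking the agent.
log_resource_event() {
    logfile="${ALERT_LOGFILE:-/var/log/cluster_alerts.log}"
    # Only react to resource events, and only to start/stop operations
    [ "${CRM_alert_kind}" = "resource" ] || return 0
    case "${CRM_alert_task}" in
        start|stop)
            msg="${CRM_alert_task} ${CRM_alert_rsc} on ${CRM_alert_node} rc=${CRM_alert_rc}"
            echo "$(date -u +%FT%TZ) $msg" >> "$logfile"
            ;;
    esac
}
log_resource_event
```

It would then be registered with something like `pcs alert create path=/usr/local/bin/alert_log.sh` (the path is just an example).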

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3). in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" and the first "monitor":
> 
>  crm-debug-origin="peer_update_callback" join="member" expected="member">
> 
> 
> 
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> <ludovi...@gmail.com> wrote:
> 
> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or
> attributes in the XML, so after some searching I found that they are in
> the CIB, which can be gotten with "pcs cluster cib foo.xml". I will
> start exploring this as an alternative to crm_mon/"pcs status".
> 
> 
> However I still find what happens to be confusing, so below I try to
> better explain what I see:
> 
> 
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:36 2017  Last change: Fri
> May 12 09:18:13 2017 by root via crm_attribute on test1
> 
> 3 nodes and 4 resources configured
> 
> Online: [ test1 test2 ]
> OFFLINE: [ test3 ]
> 
> Active resources:
> 
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> test1
> 
>  
> crm_mon -X:
> 
> 
>  managed="true" failed="false" failure_ignored="false" >
>  role="Master" active="true" orphaned="false" managed="true" f
> ailed="false" failure_ignored="false" nodes_running_on="1" >
> 
> 
>  role="Slave" active="true" orphaned="false" managed="true" fa
> iled="false" failure_ignored="false" nodes_running_on="1" >
> 
> 
>  role="Stopped" active="false" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="0" />
> 
>  resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> orphaned="false" managed
> ="true" failed="false" failure_ignored="false"
> nodes_running_on="1" >
> 
> 
> 
> 
> 
> 
> At 10:45:39.440, after "pcs cluster start test3", before first
> "monitor" on test3 (this is where I can't seem to know that
> resources on test3 are down):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:39 2017  Last change: Fri
> May 12 10:45:39 2017 by root via crm_attribute on test1
> 
> 3 nodes and 4 resources configured
> 
> Online: [ test1 test2 test3 ]
> 
> Active resources:
> 
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 test3 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> test1
> 
> 
> crm_mon -X:
> 
> 
>  managed="true" failed="false" failure_ignored="false" >
>  role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> 
> 
>  role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> 
> 
>  role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> 
> 
> 
>  resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
> 
> 
> 
> 
> 
> 
> At 10:45:41.606, after first "monitor" on test3 (I can now tell the
> resources on test3 are not ready):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:41 2017  Last change: Fri
> May 12 10:45:39 2017 by root via crm_attribute on test1
> 
> 3 nodes and 4 resources

Re: [ClusterLabs] Ubuntu 16.04 - Only binds on 127.0.0.1 then fails until reinstall

2017-05-12 Thread James

Hey, sorry for the delay in replying.

I've sorted this now as it seemed to be down to IP changes and sleep 
deprivation (The config IPs/subnets didn't match the node addresses).


Really appreciate you reaching out to help and at least 2 good things 
came out of this - I'm on the mailing list and I learned about strace!


Thanks.

On 06/05/17 07:44, Ferenc Wágner wrote:

James Booth  writes:


Sorry for the repeat mails, but I had issues subscribing last time
(looks like it has worked successfully now!).

Anywho, I'm really desperate for some help on my issue in
http://lists.clusterlabs.org/pipermail/users/2017-April/005495.html -
I can recap the info in this thread and provide any configs if needed!

Hi James,

This thread is badly fragmented and confusing now, but let's try to
proceed.  It seems corosync ignores its config file.  Maybe you edit a
stray corosync.conf, not the one corosync actually reads (which should
probably be /etc/corosync/corosync.conf).  Please issue the following
command as a regular user, and show us its output (make sure strace is
installed):

$ strace -f -eopen /usr/sbin/corosync -p -f

It should reveal the name of the config file.  For example, under a
different version a section of the output looks like this:

open("/dev/shm/qb-corosync-16489-blackbox-header", O_RDWR|O_CREAT|O_TRUNC, 
0600) = 3
open("/dev/shm/qb-corosync-16489-blackbox-data", O_RDWR|O_CREAT|O_TRUNC, 0600) 
= 4
open("/etc/corosync/corosync.conf", O_RDONLY) = 3
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
Process 16490 attached
[pid 16489] open("/var/run/corosync.pid", O_WRONLY|O_CREAT, 0640) = -1 EACCES 
(Permission denied)

If you can identify the name of the config file, please also post its
path and its full content.






Re: [ClusterLabs] Resources still retains in Primary Node even though its interface went down

2017-05-12 Thread pillai bs
Thank you for the prompt reply.
I have one more question. Sorry, it might be silly, but I have been wondering
since I noticed this.
I made that interface down, but the public IP address and the
VIP (IP resource) are still on the primary node.
If I made the interface down, the public IP address should also go down, right?

Regards,
pillai.bs

On Wed, May 3, 2017 at 7:34 PM, Ken Gaillot  wrote:

> On 05/03/2017 02:43 AM, pillai bs wrote:
> > Hi Experts!!!
> >
> >   I have a two-node setup for HA (primary/secondary) with
> > separate resources for home/data/logs/virtual IP. The expected
> > behavior should be: if the primary node goes down, the secondary has to
> > take charge (meaning the VIP initially points to the primary node, so the
> > user accesses home/data/logs from the primary node; once the primary node
> > goes down, the VIP/floating IP points to the secondary node so that the
> > user experiences uninterrupted service).
> >  I'm using dual-ring support to avoid split brain. I have
> > two interfaces (public & private). The intention of the private interface
> > is data sync alone.
> >
> > I have tested my setup in two different ways:
> > 1. Made primary Interface down (ifdown eth0), as expected VIP and other
> > resources moved from primary to secondary node.(VIP will not be
> > reachable from primary node)
> > 2. Made Primary Interface down (Physically unplugged the Ethernet
> > Cable). The primary node still retain the resources, VIP/FloatingIP was
> > reachable from primary node.
> >
> > Is my testing correct? How come the VIP is still reachable even though
> > eth0 was down? Please advise.
> >
> > Regards,
> > Madhan.B
>
> Sorry, didn't see this message before replying to the other one :)
>
> The IP resource is successful if the IP is up *on that host*. It doesn't
> check that the IP is reachable from any other site. Similarly,
> filesystem resources just make sure that the filesystem can be mounted
> on the host. So, unplugging the Ethernet won't necessarily make those
> resources fail.
>
> Take a look at the ocf:pacemaker:ping resource for a way to ensure that
> the primary host has connectivity to the outside world. Also, be sure
> you have fencing configured, so that the surviving node can kill a node
> that is completely cut off or unresponsive.
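Ken's ocf:pacemaker:ping suggestion might be configured roughly as in the untested sketch below. The gateway IP and the "vip" resource name are placeholders, not details from this thread.

```shell
# Sketch: a cloned ping resource that records connectivity to a gateway
# in the "pingd" node attribute, plus a location rule keeping the VIP
# off nodes with no connectivity. 192.168.1.1 and "vip" are placeholders.
pcs resource create ping ocf:pacemaker:ping \
    host_list="192.168.1.1" dampen=5s multiplier=1000 \
    op monitor interval=15s --clone
pcs constraint location vip rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```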
>


Re: [ClusterLabs] Antw: Re: SAP HANA resource start problem

2017-05-12 Thread Muhammad Sharfuddin

Hello,

I think there might be a bug, either in the SAP HANA resource or
somewhere else, because SUSE Support is still investigating this issue even
after 4 days.



--
Regards,

Muhammad Sharfuddin


On 05/12/2017 05:04 PM, Ulrich Windl wrote:

Hi!

I have no specific answer to your question, but since SAP has moved from a command to 
start the instances to a command that sends another command to a java-based webserver 
that runs a command to start the instance, the whole mechanism is a joke (maybe that's 
your problem): It frequently reports "instance started successfully" when in 
fact it did not. And finding the reason for that is a real challenge. On one machine we 
have, it fails always after having booted the machine, and works after the SAP hostagent 
is killed (and restarted). The more SAP thinks they are part of the operating system, the 
more unreliable the whole system becomes.

It seems to be difficult to add a reliable layer on top of unreliable 
software...

Regards,
Ulrich


Muhammad Sharfuddin  wrote on 12.05.2017 at 12:30 in

message <9e9fa3a1-b705-7a43-4899-84a1e500e...@nds.com.pk>:

Is there a bug in the SAP HANA resource? crm_mon shows that the cluster started
the resource and keeps the HANA resource in slave state, while in reality the
cluster doesn't start the resource; we found the following events in the logs:


2017-05-12T01:01:55.194469+05:00 saphdbtst1 crmd[26357]:   notice:
Initiating start operation rsc_SAPHana_TST_HDB00_start_0 locally on
saphdbtst1
2017-05-12T01:01:55.195425+05:00 saphdbtst1 lrmd[26354]:   notice:
executing - rsc:rsc_SAPHana_TST_HDB00 action:start call_id:26
2017-05-12T01:01:55.487895+05:00 saphdbtst1 su: (to tstadm) root on none
2017-05-12T01:01:55.496584+05:00 saphdbtst1 systemd[1]: Created slice
User Slice of tstadm.
2017-05-12T01:01:55.519314+05:00 saphdbtst1 systemd[1]: Starting User
Manager for UID 1001...
2017-05-12T01:01:55.524831+05:00 saphdbtst1 systemd[1]: Started Session
c11387 of user tstadm.
2017-05-12T01:01:55.558886+05:00 saphdbtst1 systemd[27728]: Reached
target Paths.
2017-05-12T01:01:55.559365+05:00 saphdbtst1 systemd[27728]: Reached
target Timers.
2017-05-12T01:01:55.559740+05:00 saphdbtst1 systemd[27728]: Reached
target Sockets.
2017-05-12T01:01:55.560092+05:00 saphdbtst1 systemd[27728]: Reached
target Basic System.
2017-05-12T01:01:55.560437+05:00 saphdbtst1 systemd[27728]: Reached
target Default.
2017-05-12T01:01:55.560786+05:00 saphdbtst1 systemd[27728]: Startup
finished in 31ms.
2017-05-12T01:01:55.561336+05:00 saphdbtst1 systemd[1]: Started User
Manager for UID 1001.
2017-05-12T01:01:55.961270+05:00 saphdbtst1 systemd[1]: Stopping User
Manager for UID 1001...
2017-05-12T01:01:55.964003+05:00 saphdbtst1 systemd[27728]: Reached
target Shutdown.
2017-05-12T01:01:55.983723+05:00 saphdbtst1 systemd[27728]: Stopped
target Default.
2017-05-12T01:01:55.984039+05:00 saphdbtst1 systemd[27728]: Starting
Exit the Session...
2017-05-12T01:01:55.984333+05:00 saphdbtst1 systemd[27728]: Stopped
target Basic System.
2017-05-12T01:01:55.984615+05:00 saphdbtst1 systemd[27728]: Stopped
target Timers.
2017-05-12T01:01:55.984895+05:00 saphdbtst1 systemd[27728]: Stopped
target Sockets.
2017-05-12T01:01:55.985169+05:00 saphdbtst1 systemd[27728]: Stopped
target Paths.
2017-05-12T01:01:55.990544+05:00 saphdbtst1
SAPHana(rsc_SAPHana_TST_HDB00)[27645]: INFO: RA  begin action
start_clone (0.152.17) 
2017-05-12T01:01:55.995850+05:00 saphdbtst1 systemd[27728]: Received
SIGRTMIN+24 from PID 27878 (kill).
2017-05-12T01:01:55.999861+05:00 saphdbtst1 systemd[1]: Stopped User
Manager for UID 1001.
2017-05-12T01:01:56.000460+05:00 saphdbtst1 systemd[1]: Removed slice
User Slice of tstadm.
2017-05-12T01:01:56.033425+05:00 saphdbtst1 crmd[26357]:   notice:
Transition aborted by status-180881403-hana_tst_clone_state doing create
hana_tst_clone_state=DEMOTED: Transient attribute change
2017-05-12T01:01:56.044385+05:00 saphdbtst1 su: (to tstadm) root on none
2017-05-12T01:01:56.052758+05:00 saphdbtst1 systemd[1]: Created slice
User Slice of tstadm.
2017-05-12T01:01:56.075366+05:00 saphdbtst1 systemd[1]: Starting User
Manager for UID 1001...
2017-05-12T01:01:56.082157+05:00 saphdbtst1 systemd[1]: Started Session
c11388 of user tstadm.
2017-05-12T01:01:56.111651+05:00 saphdbtst1 systemd[27928]: Reached
target Sockets.
2017-05-12T01:01:56.112031+05:00 saphdbtst1 systemd[27928]: Reached
target Paths.
2017-05-12T01:01:56.112345+05:00 saphdbtst1 systemd[27928]: Reached
target Timers.
2017-05-12T01:01:56.112640+05:00 saphdbtst1 systemd[27928]: Reached
target Basic System.
2017-05-12T01:01:56.112936+05:00 saphdbtst1 systemd[27928]: Reached
target Default.
2017-05-12T01:01:56.113207+05:00 saphdbtst1 systemd[27928]: Startup
finished in 28ms.
2017-05-12T01:01:56.113480+05:00 saphdbtst1 systemd[1]: Started User
Manager for UID 1001.
2017-05-12T01:01:59.625804+05:00 saphdbtst1 systemd[1]: Stopping User
Manager for UID 1001...
2017-05-12T01:01:59.628078+05:00 saphdbtst1 systemd[27928]:

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
Hi Jehan-Guillaume,

I would be glad to discuss my motivations and findings with you, by mail or
in person, even.

Let's just say that I originally wanted to create something that would
allow deploying a PG cluster in a matter of minutes (yes, using Python). From
there I tried to understand how PAF works, and at some point I wanted to
start changing it, but not being too good with Perl, I chose to translate
it. This kinda became a pet project.

Ludovic



On Fri, May 12, 2017 at 2:01 PM, Jehan-Guillaume de Rorthais <
j...@dalibo.com> wrote:

> Hi Ludovic,
>
> On Thu, 11 May 2017 22:00:12 +0200
> Ludovic Vaugeois-Pepin  wrote:
>
> > I translated a PostgreSQL multi-state RA (
> https://github.com/dalibo/PAF)
> > in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> > editing it heavily.
>
> Could you please provide the feedback to the upstream project (or here :))?
>
> * what did you improve in PAF?
> * what did you change in PAF?
> * why did you translate PAF to Python? Any advantages?
>
> A lot of time and research has been dedicated to this project. PAF is a
> pure
> open source project. We would love some feedback and contributors to keep
> improving it. Do not hesitate to open issues on PAF project if you need to
> discuss improvements.
>
> Regards,
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>



-- 
Ludovic Vaugeois-Pepin


[ClusterLabs] Antw: Re: SAP HANA resource start problem

2017-05-12 Thread Ulrich Windl
Hi!

I have no specific answer to your question, but since SAP has moved from a 
command to start the instances to a command that sends another command to a 
java-based webserver that runs a command to start the instance, the whole 
mechanism is a joke (maybe that's your problem): It frequently reports 
"instance started successfully" when in fact it did not. And finding the reason 
for that is a real challenge. On one machine we have, it fails always after 
having booted the machine, and works after the SAP hostagent is killed (and 
restarted). The more SAP thinks they are part of the operating system, the more 
unreliable the whole system becomes.

It seems to be difficult to add a reliable layer on top of unreliable 
software...

Regards,
Ulrich

>>> Muhammad Sharfuddin  wrote on 12.05.2017 at
>>> 12:30 in
message <9e9fa3a1-b705-7a43-4899-84a1e500e...@nds.com.pk>:
> Is there a bug in the SAP HANA resource? crm_mon shows that the cluster started
> the resource and keeps the HANA resource in slave state, while in reality the
> cluster doesn't start the resource; we found the following events in the logs:
> 
> 
> 2017-05-12T01:01:55.194469+05:00 saphdbtst1 crmd[26357]:   notice: 
> Initiating start operation rsc_SAPHana_TST_HDB00_start_0 locally on 
> saphdbtst1
> 2017-05-12T01:01:55.195425+05:00 saphdbtst1 lrmd[26354]:   notice: 
> executing - rsc:rsc_SAPHana_TST_HDB00 action:start call_id:26
> 2017-05-12T01:01:55.487895+05:00 saphdbtst1 su: (to tstadm) root on none
> 2017-05-12T01:01:55.496584+05:00 saphdbtst1 systemd[1]: Created slice 
> User Slice of tstadm.
> 2017-05-12T01:01:55.519314+05:00 saphdbtst1 systemd[1]: Starting User 
> Manager for UID 1001...
> 2017-05-12T01:01:55.524831+05:00 saphdbtst1 systemd[1]: Started Session 
> c11387 of user tstadm.
> 2017-05-12T01:01:55.558886+05:00 saphdbtst1 systemd[27728]: Reached 
> target Paths.
> 2017-05-12T01:01:55.559365+05:00 saphdbtst1 systemd[27728]: Reached 
> target Timers.
> 2017-05-12T01:01:55.559740+05:00 saphdbtst1 systemd[27728]: Reached 
> target Sockets.
> 2017-05-12T01:01:55.560092+05:00 saphdbtst1 systemd[27728]: Reached 
> target Basic System.
> 2017-05-12T01:01:55.560437+05:00 saphdbtst1 systemd[27728]: Reached 
> target Default.
> 2017-05-12T01:01:55.560786+05:00 saphdbtst1 systemd[27728]: Startup 
> finished in 31ms.
> 2017-05-12T01:01:55.561336+05:00 saphdbtst1 systemd[1]: Started User 
> Manager for UID 1001.
> 2017-05-12T01:01:55.961270+05:00 saphdbtst1 systemd[1]: Stopping User 
> Manager for UID 1001...
> 2017-05-12T01:01:55.964003+05:00 saphdbtst1 systemd[27728]: Reached 
> target Shutdown.
> 2017-05-12T01:01:55.983723+05:00 saphdbtst1 systemd[27728]: Stopped 
> target Default.
> 2017-05-12T01:01:55.984039+05:00 saphdbtst1 systemd[27728]: Starting 
> Exit the Session...
> 2017-05-12T01:01:55.984333+05:00 saphdbtst1 systemd[27728]: Stopped 
> target Basic System.
> 2017-05-12T01:01:55.984615+05:00 saphdbtst1 systemd[27728]: Stopped 
> target Timers.
> 2017-05-12T01:01:55.984895+05:00 saphdbtst1 systemd[27728]: Stopped 
> target Sockets.
> 2017-05-12T01:01:55.985169+05:00 saphdbtst1 systemd[27728]: Stopped 
> target Paths.
> 2017-05-12T01:01:55.990544+05:00 saphdbtst1 
> SAPHana(rsc_SAPHana_TST_HDB00)[27645]: INFO: RA  begin action 
> start_clone (0.152.17) 
> 2017-05-12T01:01:55.995850+05:00 saphdbtst1 systemd[27728]: Received 
> SIGRTMIN+24 from PID 27878 (kill).
> 2017-05-12T01:01:55.999861+05:00 saphdbtst1 systemd[1]: Stopped User 
> Manager for UID 1001.
> 2017-05-12T01:01:56.000460+05:00 saphdbtst1 systemd[1]: Removed slice 
> User Slice of tstadm.
> 2017-05-12T01:01:56.033425+05:00 saphdbtst1 crmd[26357]:   notice: 
> Transition aborted by status-180881403-hana_tst_clone_state doing create 
> hana_tst_clone_state=DEMOTED: Transient attribute change
> 2017-05-12T01:01:56.044385+05:00 saphdbtst1 su: (to tstadm) root on none
> 2017-05-12T01:01:56.052758+05:00 saphdbtst1 systemd[1]: Created slice 
> User Slice of tstadm.
> 2017-05-12T01:01:56.075366+05:00 saphdbtst1 systemd[1]: Starting User 
> Manager for UID 1001...
> 2017-05-12T01:01:56.082157+05:00 saphdbtst1 systemd[1]: Started Session 
> c11388 of user tstadm.
> 2017-05-12T01:01:56.111651+05:00 saphdbtst1 systemd[27928]: Reached 
> target Sockets.
> 2017-05-12T01:01:56.112031+05:00 saphdbtst1 systemd[27928]: Reached 
> target Paths.
> 2017-05-12T01:01:56.112345+05:00 saphdbtst1 systemd[27928]: Reached 
> target Timers.
> 2017-05-12T01:01:56.112640+05:00 saphdbtst1 systemd[27928]: Reached 
> target Basic System.
> 2017-05-12T01:01:56.112936+05:00 saphdbtst1 systemd[27928]: Reached 
> target Default.
> 2017-05-12T01:01:56.113207+05:00 saphdbtst1 systemd[27928]: Startup 
> finished in 28ms.
> 2017-05-12T01:01:56.113480+05:00 saphdbtst1 systemd[1]: Started User 
> Manager for UID 1001.
> 2017-05-12T01:01:59.625804+05:00 saphdbtst1 systemd[1]: Stopping User 
> Manager for UID 1001...
> 2017-05-12T01:01:59.628078+05:00 saphdbtst1 systemd[27928]: Reached 
> target Shutdown.
> 2017-05

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Jehan-Guillaume de Rorthais
Hi Ludovic,

On Thu, 11 May 2017 22:00:12 +0200
Ludovic Vaugeois-Pepin  wrote:

> I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
> in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> editing it heavily.

Could you please provide the feedback to the upstream project (or here :))? 

* what did you improve in PAF?
* what did you change in PAF?
* why did you translate PAF to Python? Any advantages?

A lot of time and research has been dedicated to this project. PAF is a pure
open source project. We would love some feedback and contributors to keep
improving it. Do not hesitate to open issues on PAF project if you need to
discuss improvements.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo



Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-12 Thread Ignazio Cassano
Hello, there are no constraints for node compute-1.

The following is the corosync.log on the cluster node:

May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Forwarding cib_delete operation for section
//node_state[@uname='compute-0']//lrm_resource[@id='compute-1'] to all
(origin=local/crmd/2856)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='compute-0']//lrm_resource[@id='compute-1']: OK (rc=0,
origin=tst-controller-01/crmd/2856, version=0.555.6)
May 12 13:14:47 [7286] tst-controller-01   crmd: info:
delete_resource:Removing resource compute-1 for
328a6a8b-e4f1-4b48-9dc5-e418ba0e2850 (root) on tst-controller-01
May 12 13:14:47 [7286] tst-controller-01   crmd: info:
notify_deleted:Notifying 328a6a8b-e4f1-4b48-9dc5-e418ba0e2850 on
tst-controller-01 that compute-1 was deleted
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Forwarding cib_delete operation for section
//node_state[@uname='compute-0']//lrm_resource[@id='compute-1'] to all
(origin=local/crmd/2857)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='compute-0']//lrm_resource[@id='compute-1']: OK (rc=0,
origin=tst-controller-01/crmd/2857, version=0.555.6)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Forwarding cib_delete operation for section
//node_state[@uname='tst-controller-01']//lrm_resource[@id='compute-1'] to
all (origin=local/crmd/2860)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-01']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-01/crmd/2860, version=0.556.0)
May 12 13:14:47 [7286] tst-controller-01   crmd: info:
delete_resource:Removing resource compute-1 for
328a6a8b-e4f1-4b48-9dc5-e418ba0e2850 (root) on tst-controller-01
May 12 13:14:47 [7286] tst-controller-01   crmd: info:
notify_deleted:Notifying 328a6a8b-e4f1-4b48-9dc5-e418ba0e2850 on
tst-controller-01 that compute-1 was deleted
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-03']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-03/crmd/2021, version=0.556.0)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Forwarding cib_delete operation for section
//node_state[@uname='tst-controller-01']//lrm_resource[@id='compute-1'] to
all (origin=local/crmd/2861)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-03']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-03/crmd/2022, version=0.556.0)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-01']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-01/crmd/2861, version=0.556.0)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-02']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-02/crmd/4840, version=0.556.0)
May 12 13:14:47 [7281] tst-controller-01cib: info:
cib_process_request:Completed cib_delete operation for section
//node_state[@uname='tst-controller-02']//lrm_resource[@id='compute-1']: OK
(rc=0, origin=tst-controller-02/crmd/4841, version=0.556.0)



On 05/12/2017 12:32 PM, Ignazio Cassano wrote:
> Hello, some updates.
> Now I am not able to enable compute-1 as I did yesterday by removing and
> re-adding it.
> But if I remove it and add an alias like compute1 in the /etc/hosts of the
> cluster nodes, then remove compute-1 and add compute1, it goes
> online.
>
>
> 2017-05-12 12:08 GMT+02:00 Ignazio Cassano  >:
>
> Hello, I do not know if this is the correct way to answer in this
> mailing list.
> In any case, whether I shut down the remote node or fence it with IPMI,
> it does not return online.
> The pacemaker-remote service is enabled and restarts at reboot.
> But I continue to have the following on my cluster:
>
> Online: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
> RemoteOnline: [ compute-0 ]
> RemoteOFFLINE: [ compute-1 ]
>
> Full list of resources:
>
>  Resource Group: vip
>  vipmanagement(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  vipinternalpi(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  lb-haproxy(systemd:haproxy)

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
I checked the node_state of the node that is killed and brought back
(test3). in_ccm == true and crmd == online for a second or two between "pcs
cluster start test3" and the first "monitor":





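The check described above can be scripted. The sketch below pulls the two attributes out of a node_state element with sed; the sample fragment is hypothetical, shaped like the element discussed in this thread, and on a live cluster it would come from `pcs cluster cib` output.

```shell
# Sketch: given a node_state element from the CIB, report whether the
# node is a corosync member (in_ccm) and whether crmd is online.
# The sample element below is a made-up stand-in for real CIB output.
node_state='<node_state id="3" uname="test3" in_ccm="true" crmd="online" join="member" expected="member"/>'

attr() {  # attr NAME XML -> value of the NAME="..." attribute
    printf '%s\n' "$2" | sed -n "s/.*$1=\"\([^\"]*\)\".*/\1/p"
}

in_ccm=$(attr in_ccm "$node_state")
crmd=$(attr crmd "$node_state")
echo "in_ccm=$in_ccm crmd=$crmd"
```

As the thread notes, both attributes can read true/online for a second or two before the first monitor runs, so this alone does not prove the resources are back.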
On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin <
ludovi...@gmail.com> wrote:

> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or attributes
> in the XML, so after some searching I found that they are in the CIB, which
> can be gotten with "pcs cluster cib foo.xml". I will start exploring this as
> an alternative to crm_mon/"pcs status".
>
>
> However I still find what happens to be confusing, so below I try to
> better explain what I see:
>
>
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:36 2017  Last change: Fri May
> 12 09:18:13 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 ]
> OFFLINE: [ test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> [crm_mon -X resources XML mangled by the list archive: pgsqld Master
> and Slave each with nodes_running_on="1", one instance Stopped, and
> pgsql-master-ip Started]
>
>
>
> At 10:45:39.440, after "pcs cluster start test3", before first "monitor"
> on test3 (this is where I can't seem to know that resources on test3 are
> down):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:39 2017  Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 test3 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> [crm_mon -X resources XML mangled by the list archive: pgsqld Master
> and two Slaves each with nodes_running_on="1", and pgsql-master-ip
> Started]
>
>
>
> At 10:45:41.606, after first "monitor" on test3 (I can now tell the
> resources on test3 are not ready):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:41 2017  Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> [crm_mon -X resources XML mangled by the list archive: pgsqld Master
> and Slave each with nodes_running_on="1", one instance Stopped, and
> pgsql-master-ip Started]
>
> On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot  wrote:
>
>> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
>>

Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-12 Thread Klaus Wenninger
On 05/12/2017 12:32 PM, Ignazio Cassano wrote:
> Hello, some updates.
> Now I am not able to enable compute-1 as I did yesterday by removing
> and re-adding it. It only goes online if I remove compute-1 and re-add
> it as compute1, adding an alias compute1 to /etc/hosts on the cluster
> nodes.
>
>
> 2017-05-12 12:08 GMT+02:00 Ignazio Cassano  >:
>
> Hello, I do not know if this is the correct way to post to this
> mailing list.
> In any case, whether I shut down the remote node or fence it with
> ipmi, it does not return online.
> The pacemaker-remote service is enabled and restarts at reboot.
> But I continue to see the following on my cluster:
>
> Online: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
> RemoteOnline: [ compute-0 ]
> RemoteOFFLINE: [ compute-1 ]
>
> Full list of resources:
>
>  Resource Group: vip
>  vipmanagement(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  vipinternalpi(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  lb-haproxy(systemd:haproxy):Started tst-controller-03
>  Clone Set: httpd-clone [httpd]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-glance-api-clone [openstack-glance-api]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-glance-registry-clone
> [openstack-glance-registry]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-api-clone [openstack-nova-api]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-consoleauth-clone
> [openstack-nova-consoleauth]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-novncproxy-clone
> [openstack-nova-novncproxy]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-server-clone [neutron-server]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-openvswitch-agent-clone
> [neutron-openvswitch-agent]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-cinder-scheduler-clone
> [openstack-cinder-scheduler]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  openstack-cinder-volume(systemd:openstack-cinder-volume):   
> Started tst-controller-02
>  Clone Set: openstack-cinder-backup-clone [openstack-cinder-backup]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: cinder-perm-check-clone [cinder-perm-check]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: lbaasv2-clone [lbaasv2]
>  Started: [ tst-controller-01 tst-controller-02
> tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  lbaas-check(systemd:lbaas-check):Started tst-controller-01
>  Clone Set: clean-haproxy-clone [clean-haproxy]
>  Started: [ tst-controller-01 tst

Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-12 Thread Ignazio Cassano
Hello, some updates.
Now I am not able to enable compute-1 as I did yesterday by removing and
re-adding it. It only goes online if I remove compute-1 and re-add it as
compute1, adding an alias compute1 to /etc/hosts on the cluster nodes.


2017-05-12 12:08 GMT+02:00 Ignazio Cassano :

> Hello, I do not know if this is the correct way to post to this mailing
> list.
> In any case, whether I shut down the remote node or fence it with ipmi, it
> does not return online.
> The pacemaker-remote service is enabled and restarts at reboot.
> But I continue to see the following on my cluster:
>
> Online: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
> RemoteOnline: [ compute-0 ]
> RemoteOFFLINE: [ compute-1 ]
>
> Full list of resources:
>
>  Resource Group: vip
>  vipmanagement(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  vipinternalpi(ocf::heartbeat:IPaddr2):Started
> tst-controller-03
>  lb-haproxy(systemd:haproxy):Started tst-controller-03
>  Clone Set: httpd-clone [httpd]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-glance-api-clone [openstack-glance-api]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-api-clone [openstack-nova-api]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-server-clone [neutron-server]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  openstack-cinder-volume(systemd:openstack-cinder-volume):Started
> tst-controller-02
>  Clone Set: openstack-cinder-backup-clone [openstack-cinder-backup]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: cinder-perm-check-clone [cinder-perm-check]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: lbaasv2-clone [lbaasv2]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  lbaas-check(systemd:lbaas-check):Started tst-controller-01
>  Clone Set: clean-haproxy-clone [clean-haproxy]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-heat-api-clone [openstack-heat-api]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
>  Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
>  Stopped: [ compute-0 compute-1 ]
>  Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
>  Started: [ tst-controller-01 tst-

Re: [ClusterLabs] SAP HANA resource start problem

2017-05-12 Thread Muhammad Sharfuddin
Is there a bug in the SAP HANA resource agent? crm_mon shows that the
cluster started the resource and keeps it in the Slave state, while in fact
the cluster does not start the resource. We found the following events in
the logs:



2017-05-12T01:01:55.194469+05:00 saphdbtst1 crmd[26357]:   notice: 
Initiating start operation rsc_SAPHana_TST_HDB00_start_0 locally on 
saphdbtst1
2017-05-12T01:01:55.195425+05:00 saphdbtst1 lrmd[26354]:   notice: 
executing - rsc:rsc_SAPHana_TST_HDB00 action:start call_id:26

2017-05-12T01:01:55.487895+05:00 saphdbtst1 su: (to tstadm) root on none
2017-05-12T01:01:55.496584+05:00 saphdbtst1 systemd[1]: Created slice 
User Slice of tstadm.
2017-05-12T01:01:55.519314+05:00 saphdbtst1 systemd[1]: Starting User 
Manager for UID 1001...
2017-05-12T01:01:55.524831+05:00 saphdbtst1 systemd[1]: Started Session 
c11387 of user tstadm.
2017-05-12T01:01:55.558886+05:00 saphdbtst1 systemd[27728]: Reached 
target Paths.
2017-05-12T01:01:55.559365+05:00 saphdbtst1 systemd[27728]: Reached 
target Timers.
2017-05-12T01:01:55.559740+05:00 saphdbtst1 systemd[27728]: Reached 
target Sockets.
2017-05-12T01:01:55.560092+05:00 saphdbtst1 systemd[27728]: Reached 
target Basic System.
2017-05-12T01:01:55.560437+05:00 saphdbtst1 systemd[27728]: Reached 
target Default.
2017-05-12T01:01:55.560786+05:00 saphdbtst1 systemd[27728]: Startup 
finished in 31ms.
2017-05-12T01:01:55.561336+05:00 saphdbtst1 systemd[1]: Started User 
Manager for UID 1001.
2017-05-12T01:01:55.961270+05:00 saphdbtst1 systemd[1]: Stopping User 
Manager for UID 1001...
2017-05-12T01:01:55.964003+05:00 saphdbtst1 systemd[27728]: Reached 
target Shutdown.
2017-05-12T01:01:55.983723+05:00 saphdbtst1 systemd[27728]: Stopped 
target Default.
2017-05-12T01:01:55.984039+05:00 saphdbtst1 systemd[27728]: Starting 
Exit the Session...
2017-05-12T01:01:55.984333+05:00 saphdbtst1 systemd[27728]: Stopped 
target Basic System.
2017-05-12T01:01:55.984615+05:00 saphdbtst1 systemd[27728]: Stopped 
target Timers.
2017-05-12T01:01:55.984895+05:00 saphdbtst1 systemd[27728]: Stopped 
target Sockets.
2017-05-12T01:01:55.985169+05:00 saphdbtst1 systemd[27728]: Stopped 
target Paths.
2017-05-12T01:01:55.990544+05:00 saphdbtst1 
SAPHana(rsc_SAPHana_TST_HDB00)[27645]: INFO: RA  begin action 
start_clone (0.152.17) 
2017-05-12T01:01:55.995850+05:00 saphdbtst1 systemd[27728]: Received 
SIGRTMIN+24 from PID 27878 (kill).
2017-05-12T01:01:55.999861+05:00 saphdbtst1 systemd[1]: Stopped User 
Manager for UID 1001.
2017-05-12T01:01:56.000460+05:00 saphdbtst1 systemd[1]: Removed slice 
User Slice of tstadm.
2017-05-12T01:01:56.033425+05:00 saphdbtst1 crmd[26357]:   notice: 
Transition aborted by status-180881403-hana_tst_clone_state doing create 
hana_tst_clone_state=DEMOTED: Transient attribute change

2017-05-12T01:01:56.044385+05:00 saphdbtst1 su: (to tstadm) root on none
2017-05-12T01:01:56.052758+05:00 saphdbtst1 systemd[1]: Created slice 
User Slice of tstadm.
2017-05-12T01:01:56.075366+05:00 saphdbtst1 systemd[1]: Starting User 
Manager for UID 1001...
2017-05-12T01:01:56.082157+05:00 saphdbtst1 systemd[1]: Started Session 
c11388 of user tstadm.
2017-05-12T01:01:56.111651+05:00 saphdbtst1 systemd[27928]: Reached 
target Sockets.
2017-05-12T01:01:56.112031+05:00 saphdbtst1 systemd[27928]: Reached 
target Paths.
2017-05-12T01:01:56.112345+05:00 saphdbtst1 systemd[27928]: Reached 
target Timers.
2017-05-12T01:01:56.112640+05:00 saphdbtst1 systemd[27928]: Reached 
target Basic System.
2017-05-12T01:01:56.112936+05:00 saphdbtst1 systemd[27928]: Reached 
target Default.
2017-05-12T01:01:56.113207+05:00 saphdbtst1 systemd[27928]: Startup 
finished in 28ms.
2017-05-12T01:01:56.113480+05:00 saphdbtst1 systemd[1]: Started User 
Manager for UID 1001.
2017-05-12T01:01:59.625804+05:00 saphdbtst1 systemd[1]: Stopping User 
Manager for UID 1001...
2017-05-12T01:01:59.628078+05:00 saphdbtst1 systemd[27928]: Reached 
target Shutdown.
2017-05-12T01:01:59.628471+05:00 saphdbtst1 systemd[27928]: Stopped 
target Default.
2017-05-12T01:01:59.628958+05:00 saphdbtst1 systemd[27928]: Stopped 
target Basic System.
2017-05-12T01:01:59.643490+05:00 saphdbtst1 systemd[27928]: Stopped 
target Sockets.
2017-05-12T01:01:59.643798+05:00 saphdbtst1 systemd[27928]: Stopped 
target Timers.
2017-05-12T01:01:59.644092+05:00 saphdbtst1 systemd[27928]: Stopped 
target Paths.
2017-05-12T01:01:59.644370+05:00 saphdbtst1 systemd[27928]: Starting 
Exit the Session...
2017-05-12T01:01:59.653728+05:00 saphdbtst1 systemd[27928]: Received 
SIGRTMIN+24 from PID 28054 (kill).
2017-05-12T01:01:59.657669+05:00 saphdbtst1 systemd[1]: Stopped User 
Manager for UID 1001.
2017-05-12T01:01:59.658274+05:00 saphdbtst1 systemd[1]: Removed slice 
User Slice of tstadm.

2017-05-12T01:01:59.777090+05:00 saphdbtst1 su: (to tstadm) root on none
2017-05-12T01:01:59.785074+05:00 saphdbtst1 systemd[1]: Created slice 
User Slice of tstadm.
2017-05-12T01:01:59.815357+05:00 saphdbtst1 systemd[1]: Starting User 
Manager for UI

[ClusterLabs] pacemaker remote node offline after reboot

2017-05-12 Thread Ignazio Cassano
Hello, I do not know if this is the correct way to post to this mailing
list.
In any case, whether I shut down the remote node or fence it with ipmi, it
does not return online.
The pacemaker-remote service is enabled and restarts at reboot.
But I continue to see the following on my cluster:

Online: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
RemoteOnline: [ compute-0 ]
RemoteOFFLINE: [ compute-1 ]

Full list of resources:

 Resource Group: vip
 vipmanagement(ocf::heartbeat:IPaddr2):Started tst-controller-03
 vipinternalpi(ocf::heartbeat:IPaddr2):Started tst-controller-03
 lb-haproxy(systemd:haproxy):Started tst-controller-03
 Clone Set: httpd-clone [httpd]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: neutron-server-clone [neutron-server]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 openstack-cinder-volume(systemd:openstack-cinder-volume):Started
tst-controller-02
 Clone Set: openstack-cinder-backup-clone [openstack-cinder-backup]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: cinder-perm-check-clone [cinder-perm-check]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: lbaasv2-clone [lbaasv2]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 lbaas-check(systemd:lbaas-check):Started tst-controller-01
 Clone Set: clean-haproxy-clone [clean-haproxy]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-ceilometer-notification-clone
[openstack-ceilometer-notification]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Set: openstack-ceilometer-central-clone
[openstack-ceilometer-central]
 Started: [ tst-controller-01 tst-controller-02 tst-controller-03 ]
 Stopped: [ compute-0 compute-1 ]
 Clone Se
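A status listing like the truncated one above can also be checked programmatically; a minimal sketch, assuming crm_mon-style XML with name, online, and type attributes on each node (the fragment and helper are illustrative, and attribute names may differ between Pacemaker versions):

```python
import xml.etree.ElementTree as ET

# Illustrative crm_mon-style XML; real `crm_mon --as-xml` output carries
# many more attributes and sections.
STATUS = """\
<crm_mon>
  <nodes>
    <node name="tst-controller-01" online="true" type="member"/>
    <node name="compute-0" online="true" type="remote"/>
    <node name="compute-1" online="false" type="remote"/>
  </nodes>
</crm_mon>
"""

def offline_remote_nodes(xml_text):
    """Return the names of remote nodes not currently online."""
    return [n.get("name")
            for n in ET.fromstring(xml_text).iter("node")
            if n.get("type") == "remote" and n.get("online") != "true"]
```

Running this against the sample above would flag compute-1, matching the RemoteOFFLINE line in the pcs status output.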

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
Yes I haven't been using the "nodes" element in the XML, only the
"resources" element. I couldn't find "node_state" elements or attributes in
the XML, so after some searching I found that they are in the CIB, which can
be obtained with "pcs cluster cib foo.xml". I will start exploring this as an
alternative to crm_mon/"pcs status".
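Reading the node state out of the CIB can be sketched as follows, assuming the node_state attributes (in_ccm, crmd) discussed in this thread; the XML fragment and the node_is_up helper are illustrative, not a real CIB:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment of what `pcs cluster cib` returns; a real CIB
# carries many more attributes and sections.
CIB = """\
<cib>
  <status>
    <node_state id="1" uname="test1" in_ccm="true" crmd="online"/>
    <node_state id="3" uname="test3" in_ccm="true" crmd="offline"/>
  </status>
</cib>
"""

def node_is_up(cib_xml, uname):
    """A node counts as up only when both the cluster-stack layer (in_ccm)
    and the pacemaker layer (crmd) report it present."""
    for state in ET.fromstring(cib_xml).iter("node_state"):
        if state.get("uname") == uname:
            return (state.get("in_ccm") == "true"
                    and state.get("crmd") == "online")
    return False
```

Both attributes must be checked: during the second or two described above, a node can already be in_ccm="true" while its resources are not yet probed.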


However I still find what happens to be confusing, so below I try to better
explain what I see:


Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
shutdown a minute ago):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:36 2017  Last change: Fri May 12
09:18:13 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 ]
OFFLINE: [ test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

[crm_mon -X output stripped by the list archive]

At 10:45:39.440, after "pcs cluster start test3", before first "monitor" on
test3 (this is where I can't seem to know that resources on test3 are down):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:39 2017  Last change: Fri May 12
10:45:39 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 test3 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

[crm_mon -X output stripped by the list archive]

At 10:45:41.606, after first "monitor" on test3 (I can now tell the
resources on test3 are not ready):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:41 2017  Last change: Fri May 12
10:45:39 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

[crm_mon -X output stripped by the list archive]

On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot  wrote:

> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > Hi
> > I translated a PostgreSQL multi-state RA
> > (https://github.com/dalibo/PAF) in Python
> > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> > heavily.
> >
> > In parallel I am writing unit tests and functional tests.
> >
> > I am having an issue with a functional test that abruptly powers off a
> > slave named, say, "host3" (a hot standby PG instance). Later on I start the
> > slave back. Once it is started, I run "pcs cluster start host3". And
> > this is where I start having a problem.
> >
> > I check every second the output of "pcs status xml" until host3 is said
> > to be ready as a slave again. In the following I assume that test3 is
> > ready as a slave:
> >
> > [CIB nodes-section XML mangled by the list archive: three member
> > nodes, all expected_up="true", one with is_dc="true"]
>
> The <nodes> section says nothing about the current state of the nodes.
> Look at the <node_state> entries for that. in_ccm means the cluster
> stack level, and crmd means the pacemaker level -- both need to be up.
>
> > [CIB resources XML mangled by the list archive: pgsqld instances in
> > Slave, Master, and Slave roles, each with nodes_running_on="1"]
> > By ready to go I mean