Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Ken Gaillot
On Thu, 2019-10-10 at 17:22 +0200, Lentes, Bernd wrote:
> Hi,
> 
> I have a two-node cluster running on SLES 12 SP4.
> I did some testing on it.
> I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a
> few minutes later because I made a mistake.
> ha-idg-2 was DC. ha-idg-1 did a fresh boot and I started
> corosync/pacemaker on it.
> It seems ha-idg-1 didn't find the DC after starting the cluster and some
> seconds later elected itself DC,
> afterwards fencing ha-idg-2.

For some reason, the corosync instances on the two nodes were not able to
communicate with each other.

This type of situation is why corosync's two_node option normally
includes wait_for_all.
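
For reference, a typical two-node quorum section in corosync.conf would look
something like this (illustrative values, not copied from this cluster):

quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implies wait_for_all unless it is explicitly set to 0, so a
    # freshly booted node waits to see its peer before it can gain quorum
    # and start fencing.
    wait_for_all: 1
}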

> 
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network
> interface [192.168.100.10] is now up.
> 
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped
> (2ms)
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_state_transition:  State transition S_PENDING -> S_ELECTION |
> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> election_check:   election-DC won by local node
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input
> I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice:
> do_state_transition:  State transition S_ELECTION ->
> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL
> origin=election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_te_control:Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-
> 71bd17047f82
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> set_graph_functions:  Setting custom graph functions
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_dc_takeover:   Taking over DC status for this partition
> 
> Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning:
> stage6:   Scheduling Node ha-idg-2 for STONITH
> Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice:
> LogNodeActions:* Fence (Off) ha-idg-2 'node is unclean'
> 
> Is my understanding correct ?

Yes

> In the log of ha-idg-2 I don't find anything for this period:
> 
> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info:
> cib_device_update:   Device fence_ilo_ha-idg-2 has been disabled
> on ha-idg-2: score=-1
> Oct 09 17:58:51 [12503] ha-idg-2cib: info:
> cib_process_ping:Reporting our current digest to ha-idg-2:
> 59c4cfb14defeafbeb3417e42cd9 for 2.9506.36 (0x242b110 0)
> 
> Oct 09 18:00:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0001 (was )
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info:
> throttle_check_thresholds:   Moderate CPU load detected:
> 32.220001
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0010 (was 0001)
> Oct 09 18:01:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0001 (was 0010)
> Oct 09 18:02:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode:  (was 0001)
> 
> ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on
> it again:
> 
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
> 
> What is the meaning of the lines with the throttle ?

Those messages could definitely be improved. The particular mode values
indicate no significant CPU load (0000), low load (0001), medium
(0010), high (0100), or extreme (1000).

I wouldn't expect a CPU spike to lock up corosync for very long, but it
could be related somehow.
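
As an aside (not something from this log): the thresholds behind these modes
are derived from the load-threshold cluster option, so if the throttling
itself ever became a problem it could be tuned, e.g. with crmsh -- treat this
as a sketch rather than a recommendation:

# default is 80%; raising it makes the controller throttle later
crm configure property load-threshold="90%"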

> 
> Thanks.
> 
> 
> Bernd
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] change of the configuration of a resource which is part of a clone

2019-10-10 Thread Ken Gaillot
On Wed, 2019-10-09 at 16:53 +0200, Lentes, Bernd wrote:
> Hi,
> 
> I finally managed to find out how I can simulate configuration
> changes and see their results before committing them.
> OMG. That makes life much more relaxed. I need to change the
> configuration of a resource which is part of a group; the group is
> running as a clone on all nodes.
> Unfortunately the resource is a prerequisite for several other
> resources. The other resources will restart when I commit
> the changes, which I definitely want to avoid.
> What can I do?
> I have a two node cluster on SLES 12 SP4, with pacemaker-
> 1.1.19+20181105.ccd6b5b10-3.13.1.x86_64 and corosync-2.3.6-
> 9.13.1.x86_64.
> 
> Bernd

I believe it would work to unmanage the other resources, change the
configuration, wait for the changed resource to restart, then re-manage 
the remaining resources.
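
Roughly, with crmsh on SLES, that sequence would look like this (resource
names are placeholders, so treat it as a sketch):

crm resource unmanage dependent-rsc-A    # repeat for each dependent resource
crm resource unmanage dependent-rsc-B
crm configure edit changed-rsc           # apply the parameter change here
# wait until "crm status" shows changed-rsc running again, then:
crm resource manage dependent-rsc-A
crm resource manage dependent-rsc-B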
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-10 Thread Ken Gaillot
On Wed, 2019-10-09 at 20:10 +0200, Kadlecsik József wrote:
> On Wed, 9 Oct 2019, Ken Gaillot wrote:
> 
> > > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > > CPU#7 stuck for 23s"), which resulted in the node being able to
> > > process traffic on the backend interface but not on the frontend
> > > one. Thus the services became unavailable, but the cluster thought
> > > the node was all right and did not stonith it.
> > > 
> > > How could we protect the cluster against such failures?
> > 
> > See the ocf:heartbeat:ethmonitor agent (to monitor the interface
> > itself) 
> > and/or the ocf:pacemaker:ping agent (to monitor reachability of
> > some IP 
> > such as a gateway)
> 
> This looks really promising, thank you! Does the cluster regard it as
> a failure when an ocf:heartbeat:ethmonitor agent clone on a node does
> not run? :-)

If you configure it typically, so that it runs on all nodes, then a
start failure on any node will be recorded in the cluster status. To
get other resources to move off such a node, you would colocate them
with the ethmonitor resource.
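
A sketch of that in crmsh (interface name, IDs and the colocated resource are
placeholders, not a tested configuration):

primitive p-ethmon ocf:heartbeat:ethmonitor \
    params interface=eth0 \
    op monitor interval=10s
clone cl-ethmon p-ethmon
# keep my-service only on nodes where the ethmonitor clone is running
colocation col-svc-with-eth inf: my-service cl-ethmon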

> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>  H-1525 Budapest 114, POB. 49, Hungary
> __
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Andrei Borzenkov
On 10.10.2019 18:22, Lentes, Bernd wrote:
> Hi,
> 
> I have a two-node cluster running on SLES 12 SP4.
> I did some testing on it.
> I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a few
> minutes later because I made a mistake.
> ha-idg-2 was DC. ha-idg-1 did a fresh boot and I started corosync/pacemaker
> on it.
> It seems ha-idg-1 didn't find the DC after starting the cluster

Which likely was the reason for fencing in the first place.

> and some seconds later elected itself DC,
> afterwards fencing ha-idg-2.
> 
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster 
> Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync built-in 
> features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
> transport (UDP/IP Multicast).
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network 
> interface [192.168.100.10] is now up.
> 
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: crm_timer_popped: 
> Election Trigger (I_DC_TIMEOUT) just popped (2ms)
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input 
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_state_transition:
>   State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT 
> cause=C_TIMER_POPPED origin=crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: election_check:   
> election-DC won by local node
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input 
> I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice: do_state_transition:
>   State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC 
> cause=C_FSA_INTERNAL origin=election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_te_control:
> Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: set_graph_functions:
>   Setting custom graph functions
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_dc_takeover:   
> Taking over DC status for this partition
> 
> Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning: stage6:   Scheduling 
> Node ha-idg-2 for STONITH
> Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice: LogNodeActions:* 
> Fence (Off) ha-idg-2 'node is unclean'
> 
> Is my understanding correct ?
> 
> 
> In the log of ha-idg-2 I don't find anything for this period:
> 
> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update: 
>   Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1
> Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping:  
>   Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 
> for 2.9506.36 (0x242b110 0)
> 
> Oct 09 18:00:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0001 (was )
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: 
> throttle_check_thresholds:   Moderate CPU load detected: 32.220001
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0010 (was 0001)
> Oct 09 18:01:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0001 (was 0010)
> Oct 09 18:02:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode:  (was 0001)
> 
> ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on it
> again:
> 
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync Cluster 
> Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync built-in 
> features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
> transport (UDP/IP Multicast).
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> 
> What is the meaning of the lines with the throttle ?
> 
> Thanks.
> 
> 
> Bernd
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Why is node fenced ?

2019-10-10 Thread Lentes, Bernd
Hi,

I have a two-node cluster running on SLES 12 SP4.
I did some testing on it.
I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a few
minutes later because I made a mistake.
ha-idg-2 was DC. ha-idg-1 did a fresh boot and I started corosync/pacemaker on
it.
It seems ha-idg-1 didn't find the DC after starting the cluster and some
seconds later elected itself DC,
afterwards fencing ha-idg-2.

Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster 
Engine ('2.3.6'): started and ready to provide service.
Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync built-in 
features: debug testagents augeas systemd pie relro bindnow
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network interface 
[192.168.100.10] is now up.

Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: crm_timer_popped: 
Election Trigger (I_DC_TIMEOUT) just popped (2ms)
Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input 
I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_state_transition:  
State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT 
cause=C_TIMER_POPPED origin=crm_timer_popped
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: election_check:   
election-DC won by local node
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input 
I_ELECTION_DC received in state S_ELECTION from election_win_cb
Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice: do_state_transition:  
State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=election_win_cb
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_te_control:
Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: set_graph_functions:  
Setting custom graph functions
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_dc_takeover:   Taking 
over DC status for this partition

Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning: stage6:   Scheduling Node 
ha-idg-2 for STONITH
Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice: LogNodeActions:* 
Fence (Off) ha-idg-2 'node is unclean'

Is my understanding correct ?


In the log of ha-idg-2 I don't find anything for this period:

Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update:   
Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1
Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping:
Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 for 
2.9506.36 (0x242b110 0)

Oct 09 18:00:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0001 (was )
Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: 
throttle_check_thresholds:   Moderate CPU load detected: 32.220001
Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0010 (was 0001)
Oct 09 18:01:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0001 (was 0010)
Oct 09 18:02:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode:  (was 0001)

ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on it again:

Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync Cluster 
Engine ('2.3.6'): started and ready to provide service.
Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync built-in 
features: debug testagents augeas systemd pie relro bindnow
Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: aes256 hash: sha1

What is the meaning of the lines with the throttle ?

Thanks.


Bernd

-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Perfekt ist wer keine Fehler macht 
Also sind Tote perfekt
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

Re: [ClusterLabs] Unable to resource due to nvpair[@name="target-role"]: No such device or address

2019-10-10 Thread S Sathish S
Hi Team,

Can you provide the source code for CMAN, so we can go ahead and use CMAN as
the stack?

Thanks and Regards,

S Sathish S

On Mon, 2019-10-07 at 13:34 +, S Sathish S wrote:
> Hi Team,
>
> I have two below query , we have been using Rhel 6.5 OS Version with
> below clusterlab source code compiled.
>
> corosync-1.4.10
> pacemaker-1.1.10
> pcs-0.9.90
> resource-agents-3.9.2

Ouch, that's really old. It should still work, but not many people here
will have experience with it.

> Query 1 : we have added below resource group as required later we are
> trying to start the resource group , but unable to perform it .
> But while executing RA file with start option ,
> required service is started but pacemaker unable to recognized it
> started .

Are you passing any arguments on the command line when starting the
agent directly? The cluster configuration below doesn't have any, so
that would be the first thing I'd consider.
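
For completeness: the cluster hands resource parameters to an agent as
OCF_RESKEY_* environment variables, so a by-hand test should do the same.
A sketch (the commented parameter is invented; the resource above defines
none):

export OCF_ROOT=/usr/lib/ocf
# export OCF_RESKEY_config=/etc/manager.conf   # only if the RA defines such a parameter
/usr/lib/ocf/resource.d/provider/MANAGER_RA start;   echo "start rc=$?"
/usr/lib/ocf/resource.d/provider/MANAGER_RA monitor; echo "monitor rc=$?"   # 0 = running, 7 = not running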



> # pcs resource show MANAGER
> Resource: MANAGER (class=ocf provider=provider type=MANAGER_RA)
>   Meta Attrs: priority=100 failure-timeout=120s migration-threshold=5
>   Operations: monitor on-fail=restart interval=10s timeout=120s
> (MANAGER-monitor-interval-10s)
>   start on-fail=restart interval=0s timeout=120s
> (MANAGER-start-timeout-120s-on-fail-restart)
>   stop interval=0s timeout=120s (MANAGER-stop-timeout-120s)
>
> Starting the below resource
> #pcs resource enable MANAGER
>
> Below are error we are getting in corosync.log file ,Please suggest
> what will be RCA for below issue.
>
> cib: info: crm_client_new:   Connecting 0x819e00 for uid=0 gid=0
> pid=18508 id=e5fdaf69-390b-447d-b407-6420ac45148f
> cib: info: cib_process_request:  Completed cib_query
> operation for section 'all': OK (rc=0, origin=local/crm_resource/2,
> version=0.89.1)
> cib: info: cib_process_request:  Completed cib_query
> operation for section //cib/configuration/resources//*[@id="MANAGER
> "]/meta_attributes//nvpair[@name="target-role"]: No such device or
> address (rc=-6, origin=local/crm_resource/3, version=0.89.1)
> cib: info: crm_client_destroy:   Destroying 0 events

"info" level messages aren't errors. You might find /var/log/messages
more helpful in most cases.

There will be two nodes of interest. At any given time, one of the
nodes serves as "DC" -- this node's logs will have "pengine:" entries
showing any actions that are needed (such as starting or stopping a
resource). Then the node that actually runs the resource will have any
logs from the resource agent.

Additionally the "pcs status" command will show if there were any
resource failures.

> Query 2 : stack we are using classic openais (with plugin) , In that
> start the pacemaker service by default "update-origin" parameter in
> cib.xml update as hostname which pull from get_node_name function
> (uname -n)  instead we need to configure IPADDRESS of the hostname ,
> Is it possible ? we have requirement to perform the same.
>
>
> Thanks and Regards,
> S Sathish S

I'm not familiar with what classic openais supported. At the very least
you might consider switching from the plugin to CMAN, which was better
supported on RHEL 6.

At least with corosync 2, I believe it is possible to configure IP
addresses as node names when setting up the cluster, but I'm not sure
there's a good reason to do so. "update-origin" is just a comment
indicating which node made the most recent configuration change, and
isn't used for anything.
--
Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Thanks and Regards,
S Sathish S
From: S Sathish S
Sent: Monday, October 7, 2019 7:05 PM
To: 'users@clusterlabs.org' 
Subject: Unable to resource due to nvpair[@name="target-role"]: No such device 
or address

Hi Team,

I have the two queries below; we have been using the RHEL 6.5 OS version with the
below ClusterLabs source code compiled.

corosync-1.4.10
pacemaker-1.1.10
pcs-0.9.90
resource-agents-3.9.2

Query 1: we have added the below resource group as required; later we are trying to
start the resource group, but are unable to do so.
While executing the RA file with the start option, the required
service is started, but pacemaker is unable to recognize that it started.

# pcs resource show MANAGER
Resource: MANAGER (class=ocf provider=provider type=MANAGER_RA)
  Meta Attrs: priority=100 failure-timeout=120s migration-threshold=5
  Operations: monitor on-fail=restart interval=10s timeout=120s 
(MANAGER-monitor-interval-10s)
  start on-fail=restart interval=0s timeout=120s 
(MANAGER-start-timeout-120s-on-fail-restart)
  stop interval=0s timeout=120s (MANAGER-stop-timeout-120s)

Starting the below resource
#pcs resource enable MANAGER

Below are the errors we are getting in the corosync.log file. Please suggest what
the RCA for the below issue would be.

cib: info: crm_client_new:   

[ClusterLabs] Q: "show changed" in crm shell 4.0.0

2019-10-10 Thread Ulrich Windl
Hi!

Adding a parameter to a primitive that is part of a group, I noticed that "show
changed" in "configure" of the crm shell displays not only the primitive but
also the group, even though the group itself was not changed.
Is that a bug?

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Where to find documentation for cluster MD?

2019-10-10 Thread Ulrich Windl
>>> Andrei Borzenkov wrote on 10.10.2019 at 11:05 in message:
> On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl
>  wrote:
>>
>> Hi!
>>
>> In recent SLES there is "cluster MD", like in
>> cluster-md-kmp-default-4.12.14-197.18.1.x86_64
>> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko).
>> However I could not find any manual page for it.
>>
>> Where is the official documentation, meaning: Where is a description of
>> the feature supported by SLES?
>>
>
> E-h-h . . . did you try the High Availability Extension Administration Guide?

Hi!

Yes, I found chapter 22, but I was thinking of some mention in "man 4 md" or
"man -k md"...
Also, the admin guide is an "American manual" (it tells you what to do, but not
how it works).
In the meantime I also found an LWN article that explains some details, but I
would expect such documentation to be part of the OS... or at least some file in
the rpm package:

# rpm -ql cluster-md-kmp-default-4.12.14-197.18.1.x86_64
/lib/modules/4.12.14-197.18-default
/lib/modules/4.12.14-197.18-default/kernel
/lib/modules/4.12.14-197.18-default/kernel/drivers
/lib/modules/4.12.14-197.18-default/kernel/drivers/md
/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko

The description just says:
Description :
Clustering support for MD devices. This enables locking and
synchronization across multiple systems on the cluster, so all
nodes in the cluster can access the MD devices simultaneously.

Regards,
Ulrich


> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] SBD with shared device - loss of both interconnect and shared device?

2019-10-10 Thread Roger Zhou




On 10/9/19 3:28 PM, Andrei Borzenkov wrote:
> What happens if both interconnect and shared device is lost by node? I
> assume node will reboot, correct?
> 

From my understanding of the Pacemaker integration feature in `man sbd`:

Yes, sbd will self-fence upon losing access to the sbd disk when the node
is not in a quorate state.

> Now assuming (two node cluster) second node still can access shared
> device it will fence (via SBD) and continue takeover, right?

Yes, a 2-node cluster is special. The node that loses access to the disk will
self-fence even if it is in the "quorate" state.

> 
> If both nodes lost shared device, both nodes will reboot and if access
> to shared device is not restored, then cluster services will simply
> not come up on both nodes, so it means total outage. Correct?

Yes, without a functioning SBD, pacemaker won't be started at the systemd
level.
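
For anyone checking their own setup, the timeouts sbd actually runs with can
be read straight off the shared device -- a quick sketch, with the device path
being an assumption:

sbd -d /dev/disk/by-id/scsi-SHARED-SBD dump    # shows watchdog/msgwait timeouts
sbd -d /dev/disk/by-id/scsi-SHARED-SBD list    # shows per-node slots and any pending messages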

Cheers,
Roger


> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Roger Zhou
In addition to the admin guide, there are some more advanced articles 
about the internals:

https://lwn.net/Articles/674085/
https://www.kernel.org/doc/Documentation/driver-api/md/md-cluster.rst
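
For a quick start, the setup from the guide boils down to roughly this
(device names are placeholders, and dlm must already be running cluster-wide):

mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
      --level=mirror --raid-devices=2 /dev/sda /dev/sdb
mdadm --detail /dev/md0    # should report the corosync cluster name for a clustered array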

Cheers,
Roger


On 10/10/19 4:27 PM, Gang He wrote:
> Hello Ulrich
> 
> Cluster MD belongs to SLE HA extension product.
> The related doc link is here, e.g. 
> https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md
> 
> Thanks
> Gang
> 
>> -Original Message-
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich
>> Windl
>> Sent: 2019年10月9日 15:13
>> To: users@clusterlabs.org
>> Subject: [ClusterLabs] Where to find documentation for cluster MD?
>>
>> Hi!
>>
>> In recent SLES there is "cluster MD", like in
>> cluster-md-kmp-default-4.12.14-197.18.1.x86_64
>> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko).
>> However I could not find any manual page for it.
>>
>> Where is the official documentation, meaning: Where is a description of the
>> feature supported by SLES?
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Andrei Borzenkov
On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl
 wrote:
>
> Hi!
>
> In recent SLES there is "cluster MD", like in 
> cluster-md-kmp-default-4.12.14-197.18.1.x86_64 
> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). 
> However I could not find any manual page for it.
>
> Where is the official documentation, meaning: Where is a description of the 
> feature supported by SLES?
>

E-h-h . . . did you try High Availability Extension Administration Guide?
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Gang He
Hello Ulrich

Cluster MD belongs to SLE HA extension product.
The related doc link is here, e.g. 
https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md

Thanks
Gang

> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich
> Windl
> Sent: 2019年10月9日 15:13
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Where to find documentation for cluster MD?
> 
> Hi!
> 
> In recent SLES there is "cluster MD", like in
> cluster-md-kmp-default-4.12.14-197.18.1.x86_64
> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko).
> However I could not find any manual page for it.
> 
> Where is the official documentation, meaning: Where is a description of the
> feature supported by SLES?
> 
> Regards,
> Ulrich
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-10 Thread Kadlecsik József
On Wed, 9 Oct 2019, Digimer wrote:

> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), which resulted in the node being able to process
> > traffic on the backend interface but not on the frontend one. Thus the
> > services became unavailable, but the cluster thought the node was all
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> > 
> We use mode=1 (active-passive) bonded network interfaces for each 
> network connection (we also have a back-end, front-end and a storage 
> network). Each bond has a link going to one switch and the other link to 
> a second switch. For fence devices, we use IPMI fencing connected via 
> switch 1 and PDU fencing as the backup method connected on switch 2.
> 
> With this setup, no matter what might fail, one of the fence methods
> will still be available. It's saved us in the field a few times now.

A bonded interface helps, but I suspect that in this case it could not have
saved the situation. It was not an interface failure but a strange kind of
system lockup: some of the already-running processes were fine (corosync),
but sshd, for example, could not accept new connections from the direction
of the seemingly fine backbone interface either.

In the backend direction we have got bonded (LACP) interfaces - the 
frontend uses single interfaces only.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Ulrich Windl
Hi!

In recent SLES there is "cluster MD", like in 
cluster-md-kmp-default-4.12.14-197.18.1.x86_64 
(/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). However 
I could not find any manual page for it.

Where is the official documentation, meaning: Where is a description of the
feature supported by SLES?

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/