Re: [ClusterLabs] Why is node fenced ?
On Thu, 2019-10-10 at 17:22 +0200, Lentes, Bernd wrote: > HI, > > i have a two node cluster running on SLES 12 SP4. > I did some testing on it. > I put one into standby (ha-idg-2), the other (ha-idg-1) got fenced a > few minutes later because i made a mistake. > ha-idg-2 was DC. ha-idg-1 made a fresh boot and i started > corosync/pacemaker on it. > It seems ha-idg-1 didn't find the DC after starting cluster and some > sec later elected itself to the DC, > afterwards fenced ha-idg-2. For some reason, corosync on the two nodes was not able to communicate with each other. This type of situation is why corosync's two_node option normally includes wait_for_all. > > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [MAIN ] Corosync > Cluster Engine ('2.3.6'): started and ready to provide service. > Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN ] Corosync > built-in features: debug testagents augeas systemd pie relro bindnow > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] > Initializing transport (UDP/IP Multicast). > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] > Initializing transmit/receive security (NSS) crypto: aes256 hash: > sha1 > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] The network > interface [192.168.100.10] is now up. > > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped > (2ms) > Oct 09 18:05:06 [9565] ha-idg-1 crmd: warning: do_log: Input > I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > do_state_transition: State transition S_PENDING -> S_ELECTION | > input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > election_check: election-DC won by local node > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_log: Input > I_ELECTION_DC received in state S_ELECTION from election_win_cb > Oct 09 18:05:06 [9565] ha-idg-1 crmd: notice: > do_state_transition: State transition S_ELECTION -> > S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL > origin=election_win_cb > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > do_te_control:Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd- > 71bd17047f82 > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > set_graph_functions: Setting custom graph functions > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: > do_dc_takeover: Taking over DC status for this partition > > Oct 09 18:05:07 [9564] ha-idg-1pengine: warning: > stage6: Scheduling Node ha-idg-2 for STONITH > Oct 09 18:05:07 [9564] ha-idg-1pengine: notice: > LogNodeActions:* Fence (Off) ha-idg-2 'node is unclean' > > Is my understanding correct ? 
Yes > In the log of ha-idg-2 i don't find anything for this period: > > Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: > cib_device_update: Device fence_ilo_ha-idg-2 has been disabled > on ha-idg-2: score=-1 > Oct 09 17:58:51 [12503] ha-idg-2 cib: info: > cib_process_ping: Reporting our current digest to ha-idg-2: > 59c4cfb14defeafbeb3417e42cd9 for 2.9506.36 (0x242b110 0) > > Oct 09 18:00:42 [12508] ha-idg-2 crmd: info: > throttle_send_command: New throttle mode: 0001 (was 0000) > Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: > throttle_check_thresholds: Moderate CPU load detected: > 32.220001 > Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: > throttle_send_command: New throttle mode: 0010 (was 0001) > Oct 09 18:01:42 [12508] ha-idg-2 crmd: info: > throttle_send_command: New throttle mode: 0001 (was 0010) > Oct 09 18:02:42 [12508] ha-idg-2 crmd: info: > throttle_send_command: New throttle mode: 0000 (was 0001) > > ha-idg-2 is fenced and after a reboot i started corosync/pacemaker on > it again: > > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [MAIN ] Corosync > Cluster Engine ('2.3.6'): started and ready to provide service. > Oct 09 18:29:05 [11795] ha-idg-2 corosync info [MAIN ] Corosync > built-in features: debug testagents augeas systemd pie relro bindnow > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] > Initializing transport (UDP/IP Multicast). > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] > Initializing transmit/receive security (NSS) crypto: aes256 hash: > sha1 > > What is the meaning of the lines with the throttle ? Those messages could definitely be improved. The particular mode values indicate no significant CPU load (0000), low load (0001), medium (0010), high (0100), or extreme (1000). I wouldn't expect a CPU spike to lock up corosync for very long, but it could be related somehow. > > Thanks. > > > Bernd -- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
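For reference, the quorum settings Ken refers to live in corosync.conf; a minimal sketch for a two-node cluster might look like the following (the values are illustrative, not taken from Bernd's configuration):

  quorum {
      provider: corosync_votequorum
      # two_node: 1 implicitly enables wait_for_all, so a freshly booted
      # node stays inquorate (and therefore cannot fence) until it has
      # seen its peer at least once
      two_node: 1
      # wait_for_all: 1   # implied by two_node, shown only for clarity
  }

With wait_for_all in effect, a node that boots while cut off from its peer waits instead of electing itself DC and fencing, which is why the corosync communication failure between the two nodes is the interesting part of this incident.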
Re: [ClusterLabs] change of the configuration of a resource which is part of a clone
On Wed, 2019-10-09 at 16:53 +0200, Lentes, Bernd wrote: > Hi, > > i finally managed to find out how i can simulate configuration > changes and see their results before committing them. > OMG. That makes life much more relaxed. I need to change the > configuration of a resource which is part of a group, the group is > running as a clone on all nodes. > Unfortunately the resource is a prerequisite for several other > resources. The other resources will restart when i commit > the changes, which i definitely want to avoid. > What can i do ? > I have a two node cluster on SLES 12 SP4, with pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64 and corosync-2.3.6-9.13.1.x86_64. > > Bernd I believe it would work to unmanage the other resources, change the configuration, wait for the changed resource to restart, then re-manage the remaining resources. -- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
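As a rough illustration of that procedure with the crm shell (the resource names below are made up, not taken from Bernd's cluster):

  # take the dependent resources out of cluster control
  crm resource unmanage rsc_dependent_a
  crm resource unmanage rsc_dependent_b
  # change the parameter on the resource inside the cloned group
  crm configure edit rsc_shared        # edit, save, commit
  # wait until the changed resource has restarted on every node,
  # e.g. by watching "crm_mon -1", then hand the others back
  crm resource manage rsc_dependent_a
  crm resource manage rsc_dependent_b

While unmanaged, the cluster still monitors the resources but will not stop or restart them, so it is worth confirming their state with crm_mon before re-managing.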
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 2019-10-09 at 20:10 +0200, Kadlecsik József wrote: > On Wed, 9 Oct 2019, Ken Gaillot wrote: > > > > One of the nodes has got a failure ("watchdog: BUG: soft lockup > > > - > > > CPU#7 stuck for 23s"), which resulted that the node could > > > process > > > traffic on the backend interface but not on the fronted one. Thus > > > the > > > services became unavailable but the cluster thought the node is > > > all > > > right and did not stonith it. > > > > > > How could we protect the cluster against such failures? > > > > See the ocf:heartbeat:ethmonitor agent (to monitor the interface > > itself) > > and/or the ocf:pacemaker:ping agent (to monitor reachability of > > some IP > > such as a gateway) > > This looks really promising, thank you! Does the cluster regard it as > a > failure when a ocf:heartbeat:ethmonitor agent clone on a node does > not > run? :-) If you configure it typically, so that it runs on all nodes, then a start failure on any node will be recorded in the cluster status. To get other resources to move off such a node, you would colocate them with the ethmonitor resource. > > Best regards, > Jozsef > -- > E-mail : kadlecsik.joz...@wigner.mta.hu > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address: Wigner Research Centre for Physics > H-1525 Budapest 114, POB. 49, Hungary > __ -- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
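A sketch of one common way to wire this up with the crm shell; the interface, resource, and group names are assumptions for illustration only:

  # run ethmonitor on every node for the front-end interface
  primitive p_ethmon ocf:heartbeat:ethmonitor \
      params interface=eth0 \
      op monitor interval=10s timeout=60s
  clone cl_ethmon p_ethmon
  # the agent maintains a node attribute "ethmonitor-eth0" (1 = link ok);
  # keep the service group off nodes where the interface is down
  location loc_need_eth0 g_services \
      rule -inf: ethmonitor-eth0 eq 0

Colocating the services with cl_ethmon, as Ken describes, additionally moves them off a node where the ethmonitor clone instance itself fails to start.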
Re: [ClusterLabs] Why is node fenced ?
10.10.2019 18:22, Lentes, Bernd пишет: > HI, > > i have a two node cluster running on SLES 12 SP4. > I did some testing on it. > I put one into standby (ha-idg-2), the other (ha-idg-1) got fenced a few > minutes later because i made a mistake. > ha-idg-2 was DC. ha-idg-1 made a fresh boot and i started corosync/pacemaker > on it. > It seems ha-idg-1 didn't find the DC after starting cluster Which likely was the reason for fencing in the first place. > and some sec later elected itself to the DC, > afterwards fenced ha-idg-2. > > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [MAIN ] Corosync Cluster > Engine ('2.3.6'): started and ready to provide service. > Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN ] Corosync built-in > features: debug testagents augeas systemd pie relro bindnow > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing > transport (UDP/IP Multicast). > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing > transmit/receive security (NSS) crypto: aes256 hash: sha1 > Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] The network > interface [192.168.100.10] is now up. > > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: crm_timer_popped: > Election Trigger (I_DC_TIMEOUT) just popped (2ms) > Oct 09 18:05:06 [9565] ha-idg-1 crmd: warning: do_log: Input > I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_state_transition: > State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT > cause=C_TIMER_POPPED origin=crm_timer_popped > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: election_check: > election-DC won by local node > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_log: Input > I_ELECTION_DC received in state S_ELECTION from election_win_cb > Oct 09 18:05:06 [9565] ha-idg-1 crmd: notice: do_state_transition: > State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC > cause=C_FSA_INTERNAL origin=election_win_cb > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_te_control: > Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82 > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: set_graph_functions: > Setting custom graph functions > Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_dc_takeover: > Taking over DC status for this partition > > Oct 09 18:05:07 [9564] ha-idg-1pengine: warning: stage6: Scheduling > Node ha-idg-2 for STONITH > Oct 09 18:05:07 [9564] ha-idg-1pengine: notice: LogNodeActions:* > Fence (Off) ha-idg-2 'node is unclean' > > Is my understanding correct ? 
> > > In the log of ha-idg-2 i don't find anything for this period: > > Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update: > Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1 > Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping: > Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 > for 2.9506.36 (0x242b110 0) > > Oct 09 18:00:42 [12508] ha-idg-2 crmd: info: throttle_send_command: > New throttle mode: 0001 (was ) > Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: > throttle_check_thresholds: Moderate CPU load detected: 32.220001 > Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: throttle_send_command: > New throttle mode: 0010 (was 0001) > Oct 09 18:01:42 [12508] ha-idg-2 crmd: info: throttle_send_command: > New throttle mode: 0001 (was 0010) > Oct 09 18:02:42 [12508] ha-idg-2 crmd: info: throttle_send_command: > New throttle mode: (was 0001) > > ha-idg-2 is fenced and after a reboot i started corosync/pacmeaker on it > again: > > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [MAIN ] Corosync Cluster > Engine ('2.3.6'): started and ready to provide service. > Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN ] Corosync built-in > features: debug testagents augeas systemd pie relro bindnow > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing > transport (UDP/IP Multicast). > Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing > transmit/receive security (NSS) crypto: aes256 hash: sha1 > > What is the meaning of the lines with the throttle ? > > Thanks. > > > Bernd > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
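When a node elects itself DC like this, the first thing worth confirming is whether the two corosync instances actually see each other; a few commands along these lines help (output fields vary by version):

  # quorum and membership as corosync sees it, run on each node
  corosync-quorumtool -s
  # current totem membership list
  corosync-cmapctl | grep members
  # pacemaker's view of the cluster
  crm_mon -1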
[ClusterLabs] Why is node fenced ?
HI, i have a two node cluster running on SLES 12 SP4. I did some testing on it. I put one into standby (ha-idg-2), the other (ha-idg-1) got fenced a few minutes later because i made a mistake. ha-idg-2 was DC. ha-idg-1 made a fresh boot and i started corosync/pacemaker on it. It seems ha-idg-1 didn't find the DC after starting cluster and some sec later elected itself to the DC, afterwards fenced ha-idg-2. Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [MAIN ] Corosync Cluster Engine ('2.3.6'): started and ready to provide service. Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast). Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1 Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] The network interface [192.168.100.10] is now up. Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (2ms) Oct 09 18:05:06 [9565] ha-idg-1 crmd: warning: do_log: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_state_transition: State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: election_check: election-DC won by local node Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_log: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb Oct 09 18:05:06 [9565] ha-idg-1 crmd: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=election_win_cb Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_te_control: Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82 Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: set_graph_functions: Setting custom graph functions Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_dc_takeover: Taking over DC status for this partition Oct 09 18:05:07 [9564] ha-idg-1pengine: warning: stage6: Scheduling Node ha-idg-2 for STONITH Oct 09 18:05:07 [9564] ha-idg-1pengine: notice: LogNodeActions:* Fence (Off) ha-idg-2 'node is unclean' Is my understanding correct ? In the log of ha-idg-2 i don't find anything for this period: Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update: Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1 Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping: Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 for 2.9506.36 (0x242b110 0) Oct 09 18:00:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0001 (was ) Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 32.220001 Oct 09 18:01:12 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0001) Oct 09 18:01:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: 0001 (was 0010) Oct 09 18:02:42 [12508] ha-idg-2 crmd: info: throttle_send_command: New throttle mode: (was 0001) ha-idg-2 is fenced and after a reboot i started corosync/pacmeaker on it again: Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [MAIN ] Corosync Cluster Engine ('2.3.6'): started and ready to provide service. 
Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast). Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1 What is the meaning of the lines with the throttle ? Thanks. Bernd -- Bernd Lentes Systemadministration Institut für Entwicklungsgenetik Gebäude 35.34 - Raum 208 HelmholtzZentrum münchen bernd.len...@helmholtz-muenchen.de phone: +49 89 3187 1241 phone: +49 89 3187 3827 fax: +49 89 3187 2294 http://www.helmholtz-muenchen.de/idg Perfekt ist wer keine Fehler macht Also sind Tote perfekt Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671 ___ Manage your subscription:
Re: [ClusterLabs] Unable to resource due to nvpair[@name="target-role"]: No such device or address
Hi Team, Can you provide source code for cman, So we can go-ahead and use CMAN as stack, Thanks and Regards, S Sathish S On Mon, 2019-10-07 at 13:34 +, S Sathish S wrote: > Hi Team, > > I have two below query , we have been using Rhel 6.5 OS Version with > below clusterlab source code compiled. > > corosync-1.4.10 > pacemaker-1.1.10 > pcs-0.9.90 > resource-agents-3.9.2 Ouch, that's really old. It should still work, but not many people here will have experience with it. > Query 1 : we have added below resource group as required later we are > trying to start the resource group , but unable to perform it . >But while executing RA file with start option , > required service is started but pacemaker unable to recognized it > started . Are you passing any arguments on the command line when starting the agent directly? The cluster configuration below doesn't have any, so that would be the first thing I'd consider. > > # pcs resource show MANAGER > Resource: MANAGER (class=ocf provider=provider type=MANAGER_RA) > Meta Attrs: priority=100 failure-timeout=120s migration-threshold=5 > Operations: monitor on-fail=restart interval=10s timeout=120s > (MANAGER-monitor-interval-10s) > start on-fail=restart interval=0s timeout=120s > (MANAGER-start-timeout-120s-on-fail-restart) > stop interval=0s timeout=120s (MANAGER-stop-timeout- > 120s) > > Starting the below resource > #pcs resource enable MANAGER > > Below are error we are getting in corosync.log file ,Please suggest > what will be RCA for below issue. > > cib: info: crm_client_new: Connecting 0x819e00 for uid=0 gid=0 > pid=18508 id=e5fdaf69-390b-447d-b407-6420ac45148f > cib: info: cib_process_request: Completed cib_query > operation for section 'all': OK (rc=0, origin=local/crm_resource/2, > version=0.89.1) > cib: info: cib_process_request: Completed cib_query > operation for section //cib/configuration/resources//*[@id="MANAGER > "]/meta_attributes//nvpair[@name="target-role"]: No such device or > address (rc=-6, origin=local/crm_resource/3, version=0.89.1) > cib: info: crm_client_destroy: Destroying 0 events "info" level messages aren't errors. You might find /var/log/messages more helpful in most cases. There will be two nodes of interest. At any given time, one of the nodes serves as "DC" -- this node's logs will have "pengine:" entries showing any actions that are needed (such as starting or stopping a resource). Then the node that actually runs the resource will have any logs from the resource agent. Additionally the "pcs status" command will show if there were any resource failures. > Query 2 : stack we are using classic openais (with plugin) , In that > start the pacemaker service by default "update-origin" parameter in > cib.xml update as hostname which pull from get_node_name function > (uname -n) instead we need to configure IPADDRESS of the hostname , > Is it possible ? we have requirement to perform the same. > > > Thanks and Regards, > S Sathish S I'm not familiar with what classic openais supported. At the very least you might consider switching from the plugin to CMAN, which was better supported on RHEL 6. At least with corosync 2, I believe it is possible to configure IP addresses as node names when setting up the cluster, but I'm not sure there's a good reason to do so. "update-origin" is just a comment indicating which node made the most recent configuration change, and isn't used for anything. 
-- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
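When checking an agent by hand, it helps to call it the way Pacemaker does, i.e. through OCF actions with OCF_RESKEY_* environment variables rather than ad-hoc command-line arguments. A rough sketch using the provider and type shown in the thread, assuming the usual /usr/lib/ocf install path:

  # any instance parameters would be exported as OCF_RESKEY_<name>
  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESOURCE_INSTANCE=MANAGER
  /usr/lib/ocf/resource.d/provider/MANAGER_RA start;   echo "start rc=$?"
  /usr/lib/ocf/resource.d/provider/MANAGER_RA monitor; echo "monitor rc=$?"
  # monitor must return 0 while the service runs and 7 (not running)
  # after a stop, or Pacemaker will not treat the resource as started
  # resource-agents also ships a helper that exercises the whole action set:
  ocf-tester -n MANAGER /usr/lib/ocf/resource.d/provider/MANAGER_RA

A monitor action that keeps returning 7 after a successful start is the most common reason Pacemaker "does not recognize" a service as running.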
[ClusterLabs] Q: "show changed" in crm shell 4.0.0
Hi! Adding a parameter to a primitive that is part of a group, I noticed that "show changed" in "configure" of crm shell does not only display the primitive, but also the group, even though the group itself was not changed. Is that a bug? Regards, Ulrich ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Antw: Re: Where to find documentation for cluster MD?
>>> Andrei Borzenkov wrote on 10.10.2019 at 11:05: > On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl > wrote: >> >> Hi! >> >> In recent SLES there is "cluster MD", like in > cluster-md-kmp-default-4.12.14-197.18.1.x86_64 > (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). However > I could not find any manual page for it. >> >> Where is the official documentation, meaning: Where is a description of the > feature supported by SLES? >> > > E-h-h . . . did you try High Availability Extension Administration Guide? Hi! Yes, found chapter 22, but I was thinking of some mention in "man 4 md" or "man -k md"... Also the admin guide is an "American manual" (it tells you what to do, but not how it works). In the meantime I also found a LWN article that explains some details, but I would expect such documentation to be part of the OS... or at least some file in the rpm package: # rpm -ql cluster-md-kmp-default-4.12.14-197.18.1.x86_64 /lib/modules/4.12.14-197.18-default /lib/modules/4.12.14-197.18-default/kernel /lib/modules/4.12.14-197.18-default/kernel/drivers /lib/modules/4.12.14-197.18-default/kernel/drivers/md /lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko The description just says: Description : Clustering support for MD devices. This enables locking and synchronization across multiple systems on the cluster, so all nodes in the cluster can access the MD devices simultaneously. Regards, Ulrich > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
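Short of a man page, the closest on-system documentation is the module metadata plus the kernel's own md-cluster design document; for example (the device names in the create command are only examples):

  # module description, author and parameters
  modinfo md-cluster
  # userspace side: a clustered RAID1 uses a clustered write-intent bitmap
  mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
        --level=mirror --raid-devices=2 /dev/sda1 /dev/sdb1

The design notes live in the kernel source tree (Documentation/md/md-cluster.txt in older kernels, Documentation/driver-api/md/md-cluster.rst in newer ones), and the clustered bitmap option is described in the mdadm man page.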
Re: [ClusterLabs] SBD with shared device - loss of both interconnect and shared device?
On 10/9/19 3:28 PM, Andrei Borzenkov wrote: > What happens if both interconnect and shared device is lost by node? I > assume node will reboot, correct? From my understanding of the Pacemaker integration feature in `man sbd`: Yes, sbd will self-fence upon losing access to the sbd disk when the node is not in a quorate state. > Now assuming (two node cluster) second node still can access shared > device it will fence (via SBD) and continue takeover, right? Yes, a 2-node cluster is special. The node that loses access to the disk will self-fence even if it is in the "quorate" state. > > If both nodes lost shared device, both nodes will reboot and if access > to shared device is not restored, then cluster services will simply > not come up on both nodes, so it means total outage. Correct? Yes, without a functioning SBD, pacemaker won't start at the systemd level. Cheers, Roger > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
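For readers following along, the disk-based SBD setup behind this discussion typically amounts to something like the sketch below; the device path is a placeholder, not a recommendation:

  # /etc/sysconfig/sbd
  SBD_DEVICE="/dev/disk/by-id/<shared-disk>"
  SBD_WATCHDOG_DEV="/dev/watchdog"
  SBD_PACEMAKER="yes"   # a quorate, healthy node may survive losing the
                        # disk, except in two-node clusters as Roger notes

  # inspect the device header and the per-node slots
  sbd -d /dev/disk/by-id/<shared-disk> dump
  sbd -d /dev/disk/by-id/<shared-disk> list

  # the fencing resource that lets the survivor write a poison pill
  crm configure primitive stonith-sbd stonith:external/sbd \
      params pcmk_delay_max=30s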
Re: [ClusterLabs] Where to find documentation for cluster MD?
In addition to the admin guide, there are some more advanced articles about the internals: https://lwn.net/Articles/674085/ https://www.kernel.org/doc/Documentation/driver-api/md/md-cluster.rst Cheers, Roger On 10/10/19 4:27 PM, Gang He wrote: > Hello Ulrich > > Cluster MD belongs to SLE HA extension product. > The related doc link is here, e.g. > https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md > > Thanks > Gang > >> -Original Message- >> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich >> Windl >> Sent: 2019年10月9日 15:13 >> To: users@clusterlabs.org >> Subject: [ClusterLabs] Where to find documentation for cluster MD? >> >> Hi! >> >> In recent SLES there is "cluster MD", like in >> cluster-md-kmp-default-4.12.14-197.18.1.x86_64 >> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). >> However I could not find any manual page for it. >> >> Where is the official documentation, meaning: Where is a description of the >> feature supprted by SLES? >> >> Regards, >> Ulrich >> >> >> >> ___ >> Manage your subscription: >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> ClusterLabs home: https://www.clusterlabs.org/ > > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Where to find documentation for cluster MD?
On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl wrote: > > Hi! > > In recent SLES there is "cluster MD", like in > cluster-md-kmp-default-4.12.14-197.18.1.x86_64 > (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). > However I could not find any manual page for it. > > Where is the official documentation, meaning: Where is a description of the > feature supprted by SLES? > E-h-h . . . did you try High Availability Extension Administration Guide? ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Where to find documentation for cluster MD?
Hello Ulrich Cluster MD belongs to SLE HA extension product. The related doc link is here, e.g. https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md Thanks Gang > -Original Message- > From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich > Windl > Sent: 2019年10月9日 15:13 > To: users@clusterlabs.org > Subject: [ClusterLabs] Where to find documentation for cluster MD? > > Hi! > > In recent SLES there is "cluster MD", like in > cluster-md-kmp-default-4.12.14-197.18.1.x86_64 > (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). > However I could not find any manual page for it. > > Where is the official documentation, meaning: Where is a description of the > feature supprted by SLES? > > Regards, > Ulrich > > > > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 9 Oct 2019, Digimer wrote: > > One of the nodes has got a failure ("watchdog: BUG: soft lockup - > > CPU#7 stuck for 23s"), which resulted that the node could process > > traffic on the backend interface but not on the fronted one. Thus the > > services became unavailable but the cluster thought the node is all > > right and did not stonith it. > > > > How could we protect the cluster against such failures? > > > We use mode=1 (active-passive) bonded network interfaces for each > network connection (we also have a back-end, front-end and a storage > network). Each bond has a link going to one switch and the other link to > a second switch. For fence devices, we use IPMI fencing connected via > switch 1 and PDU fencing as the backup method connected on switch 2. > > With this setup, no matter what might fail, one of the fence methods > will still be available. It's saved us in the field a few times now. A bonded interface helps, but I suspect that in this case it could not save the situation. It was not an interface failure but a strange kind of system lockup: some of the already running processes were fine (corosync), but for example sshd could not accept new connections from the direction of the seemingly fine backbone interface either. In the backend direction we have got bonded (LACP) interfaces - the frontend uses single interfaces only. Best regards, Jozsef -- E-mail : kadlecsik.joz...@wigner.mta.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics H-1525 Budapest 114, POB. 49, Hungary ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
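For reference, a minimal active-passive (mode=1) bond of the kind Digimer describes can be built with iproute2 roughly as follows; the interface names are examples, and most distributions would express the same thing in their network configuration files:

  # create the bond and enslave two NICs cabled to different switches
  ip link add bond0 type bond mode active-backup miimon 100
  ip link set eth0 down && ip link set eth0 master bond0
  ip link set eth1 down && ip link set eth1 master bond0
  ip link set bond0 up
  # shows the currently active slave and per-link MII status
  cat /proc/net/bonding/bond0

As Jozsef notes, though, link redundancy would not have helped here, because the interface stayed up while the host itself was partially locked up.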
[ClusterLabs] Where to find documentation for cluster MD?
Hi! In recent SLES there is "cluster MD", like in cluster-md-kmp-default-4.12.14-197.18.1.x86_64 (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). However I could not find any manual page for it. Where is the official documentation, meaning: Where is a description of the feature supprted by SLES? Regards, Ulrich ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/