Re: [ClusterLabs] Re: Fencing errors

2019-05-23 Thread Lopez, Francisco Javier [Global IT]
Hello again Ken et al.

I learned many things while investigating this issue, but I feel I need a bit 
more help from you guys.

It's clear the monitoring process is reporting a timeout. Although I've 
increased this timeout to 30s using pcmk_monitor_timeout, and during the last 
2 hours the process has not failed, I'd like to understand in more detail how 
this process works; if I'm getting a timeout after 20 secs, it looks to me 
like something else could be happening in my systems.
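
For reference, this is roughly how I raised the device-level monitor timeout 
(a sketch using my device names; pcmk_monitor_timeout overrides the timeout 
used for the device's monitor action):

# pcs stonith update fence_ao_pg01 pcmk_monitor_timeout=30s
# pcs stonith update fence_ao_pg02 pcmk_monitor_timeout=30s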

I tried enabling debug again and, as before, the 'debug' option creates the 
file but does not write anything to it unless I also enable 'verbose'. The 
funny thing is that when I enable verbose, I hit a bug and the fencing does 
not start:

https://bugzilla.redhat.com/show_bug.cgi?id=1549366

I enabled debug at the corosync layer and got some more information that 
helped me understand this issue better, but still not enough to narrow down 
where the problem comes from.

That said, I'd like to know whether there is a way to review in more detail 
what the monitoring process is doing (ping, status, etc.) and whether all 
those seconds are spent on one and the same action.
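
One thing I was thinking of trying is to call the fence agent directly and 
time each action on its own (a rough sketch with placeholder credentials, 
since I can't paste the real ones):

# time fence_vmware_soap --ip=<vcenter-host> --ssl-insecure \
      --username=<user> --password=<pass> --action=list
# time fence_vmware_soap --ip=<vcenter-host> --ssl-insecure \
      --username=<user> --password=<pass> \
      --plug=ao-pg01-p.axadmin.net --action=status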

Any ideas will be more than welcome.

As always, appreciate your help.

Regards
Javier



Francisco Javier Lopez
IT System Engineer  |  Global IT
O: +34 619 728 249  |  M: +34 619 728 249
franciscojavier.lo...@solera.com  |  Solera.com
Audatex Datos, S.A.  |  Avda. de Bruselas, 36, Salida 16, A-1 (Diversia), Alcobendas, Madrid, 28108, Spain

On 5/21/2019 6:19 PM, Ken Gaillot wrote:

On Tue, 2019-05-21 at 11:10 +, Lopez, Francisco Javier [Global IT]
wrote:


Hello guys !

Need your help to try to understand and debug what I'm facing in one
of my clusters.

I set up fencing with this detail:

# pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap ipaddr= ssl_insecure=1 login="" passwd="" pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" power_wait=3 op monitor interval=60s
# pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap ipaddr= ssl_insecure=1 login="" passwd="" pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" power_wait=3 op monitor interval=60s

# pcs -f stonith_cfg constraint location fence_ao_pg01 avoids ao-pg01-p.axadmin.net=INFINITY
# pcs -f stonith_cfg constraint location fence_ao_pg02 avoids ao-pg02-p.axadmin.net=INFINITY

# pcs cluster cib-push stonith_cfg

The pcs status shows all ok during some time and then it turns to:

[root@ao-pg01-p ~]# pcs status --full
Cluster name: ao_cl_p_01
Stack: corosync
Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Tue May 21 12:18:46 2019
Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on ao-pg01-p.axadmin.net

2 nodes configured
3 resources configured

Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]

Full list of resources:

 ao-cl-p-01-vip01    (ocf::heartbeat:IPaddr2):       Started ao-pg01-p.axadmin.net
 fence_ao_pg01       (stonith:fence_vmware_soap):    Stopped
 fence_ao_pg02       (stonith:fence_vmware_soap):    Stopped

Node Attributes:
* Node ao-pg01-p.axadmin.net (1):
* Node ao-pg02-p.axadmin.net (2):

Migration Summary:
* Node ao-pg02-p.axadmin.net (2):
   fence_ao_pg01: migration-threshold=100 fail-count=100
last-failure='Sat May 18 00:22:22 2019'
* Node ao-pg01-p.axadmin.net (1):
   fence_ao_pg02: migration-threshold=100 fail-count=100
last-failure='Fri May 17 20:52:53 2019'

Failed Actions:
* fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1):
call=22, status=Timed Out, exitreason='',
last-rc-change='Sat May 18 00:19:49 2019', queued=0ms,
exec=20022ms
* fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1):
call=84, status=Timed Out, exitreason='',
last-rc-change='Fri May 17 20:52:33 2019', queued=0ms,
exec=20032ms

PCSD Status:
  ao-pg02-p.axadmin.net: Online
  ao-pg01-p.axadmin.net: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


From the output I can see there is a 'Timed Out', but I'd like to understand 
whether this is a configuration issue or something else I'm not aware of.



When pacemaker starts a fence device, it issues a monitor command to
the fence agent. That command is what's timing out here.

The first thing I'd try is running the monitor command manually using
the parameters in the device configuration. The fence agent likely has
a debug option you could turn on to get more details.
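
For example, something along these lines (an untested sketch; substitute the 
address and credentials from your device definition, and check the agent's 
help output for the exact option names in your fence-agents version):

# fence_vmware_soap --ip=<vcenter-host> --ssl-insecure \
      --username=<user> --password=<pass> \
      --plug=ao-pg01-p.axadmin.net --action=monitor --verbose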




I'm attaching the part of the log that shows the problem from 17 May.

Regards

Re: [ClusterLabs] drbd could not start by pacemaker. strange limited root privileges?

2019-05-23 Thread Mevo Govo
Hi Ken,
many thanks for the SELinux advice! I forgot to check it. After I disabled
SELinux temporarily, pacemaker started handling the DRBD resources well. (I
had checked the agent scripts earlier, but they were the same as in the
scripts subdir of the drbd9 source.) And I will check my email settings.
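
In case it is useful for the archive: rather than leaving SELinux disabled, I
plan to try building a local policy module from the logged denials (a sketch
based on the standard audit tools, not tested here yet):

# getenforce
# ausearch -m avc -ts recent | grep -i drbd
# ausearch -m avc -ts recent | audit2allow -M local_drbd
# semodule -i local_drbd.pp
# setenforce 1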
Thanks again: lados.

Ken Gaillot  wrote (on 2019 May 23, Thu, 16:21):

> On Thu, 2019-05-23 at 13:21 +0200, László Neduki wrote:
> > Hi,
> >
> > (
> > I sent a similar question from an other acount 3 days ago, but:
> > - I do not see it on the list. Maybe I should not see my own email?
>
> A DRBD message from govom...@gmail.com did make it to the list a week
> ago. You should get your own emails from the list server, though your
> own mail server or client might filter them.
>
> > So I created a new account
> > - I have additional infos (but no solution), so I rewrite the
> > question
> > )
> >
> > pacemaker cannot start drbd9 resources. As I see, root has very
> > limited privileges in the drbd resource agent, when it run by the
> > pacemaker. I downloaded the latest pacemaker this week, and I
> > compiled drbd9 rpms also. I hope, You can help me, I do not find the
> > cause of this behaviour. Please see the below test cases:
>
> I'm not a DRBD expert, but given the symptoms you describe, my first
> thoughts would be that either the ocf:linbit:drbd agent you're using
> isn't the right version for your DRBD version, or something like
> SELinux is restricting access.
>
> > 1. When I create Pacemaker DRBD resource I get errors
> > # pcs resource create DrbdDB ocf:linbit:drbd drbd_resource=drbd_db op
> > monitor interval=60s meta notify=true
> > # pcs resource master DrbdDBClone DrbdDB master-max=1 master-node-
> > max=1 clone-node-max=1 notify=true
> > # pcs constraint location DrbdDBClone prefers node1=INFINITY
> > # pcs cluster stop --all; pcs cluster start --all; pcs status
> >
> > Failed Actions:
> > * DrbdDB_monitor_0 on node1 'not installed' (5): call=6,
> > status=complete, exitreason='DRBD kernel (module) not available?',
> > last-rc-change='Thu May 23 09:54:09 2019', queued=0ms, exec=58ms
> > * DrbdDB_monitor_0 on node2 'not installed' (5): call=6,
> > status=complete, exitreason='DRBD kernel (module) not available?',
> > last-rc-change='Thu May 23 10:00:22 2019', queued=0ms, exec=71ms
> >
> > 2. when I try to start drbd_db by drbdadm directly, it works well:
> > # modprobe drbd #on each node
> > # drbdadm up drbd_db #on each node
> > # drbdadm primary drbd_db
> > # drbdadm status
> > it shows drbd_db is UpToDate on each node
> > I also can promote and mount filesystem well
> >
> > 3. When I use debug-start, it works fine (so the resource syntax
> > sould be correct)
> > # drbdadm status
> > No currently configured DRBD found.
> > # pcs resource debug-start DrbdDBMaster
> > Error: unable to debug-start a master, try the master's resource:
> > DrbdDB
> > # pcs resource debug-start DrbdDB #on each node
> > Operation start for DrbdDB:0 (ocf:linbit:drbd) returned: 'ok' (0)
> > # drbdadm status
> > it shows drbd_db is UpToDate on each node
> >
> > 4. Pacemaker handle other resources well . If I set auto_promote=yes,
> > and I start (but not promote) the drbd_db by drbdadm, then pacemaker
> > can create filesystem on it well, and also the appserver, database
> > resources.
> >
> > 5. The strangest behaviour for me. Root have very limited privileges
> > whitin the drbd resource agent. If I write this line to the
> > srbd_start() method of  /usr/lib/ocf/resource.d/linbit/drbd
> >
> > ocf_log err "lados " $(whoami) $( ls -l /home/opc/tmp/modprobe2.trace
> > ) $( do_cmd touch /home/opc/tmp/modprobe2.trace )
> >
> > I got theese messeges in log, when I start the cluster
> >
> > # tail -f /var/log/cluster/corosync.log | grep -A 8 -B 3 -i lados
> >
> > ...
> > May 21 15:35:12  drbd(DrbdDB)[31649]:ERROR: lados  root
> > May 21 15:35:12 [31309] node1   lrmd:   notice:
> > operation_finished:DrbdDB_start_0:31649:stderr [ ls: cannot
> > access /home/opc/tmp/modprobe2.trace: Permission denied ]
> > May 21 15:35:12 [31309] node1   lrmd:   notice:
> > operation_finished:DrbdFra_start_0:31649:stderr [ touch: cannot
> > touch '/home/opc/tmp/modprobe2.trace': Permission denied ]
> > ...
> > and also, when I try to strace the "modprobe -s drbd `$DRBDADM sh-
> > mod-parms`" in drbd resource agent, I only see 1 line in the
> > /root/modprobe2.trace. This meens for me:
> > - root cannot trace the calls in drbdadm (even if root can strace
> > drbdadm outside of pacemaker well)
> > - root can write into files his own directory
> > (/root/modprobe2.trace)
> >
> > 6. Opposit of previous test
> > root has these privileges outside from pacamaker
> >
> > # sudo su -
> > # touch /home/opc/tmp/modprobe2.trace
> > # ls -l /home/opc/tmp/modprobe2.trace
> > -rw-r--r--. 1 root root 0 May 21 15:44 /home/opc/tmp/modprobe2.trace
> >
> >
> > Thanks: lados.
> >
> >

Re: [ClusterLabs] drbd could not start by pacemaker. strange limited root privileges?

2019-05-23 Thread Ken Gaillot
On Thu, 2019-05-23 at 13:21 +0200, László Neduki wrote:
> Hi,
> 
> (
> I sent a similar question from an other acount 3 days ago, but: 
> - I do not see it on the list. Maybe I should not see my own email? 

A DRBD message from govom...@gmail.com did make it to the list a week
ago. You should get your own emails from the list server, though your
own mail server or client might filter them.

> So I created a new account
> - I have additional infos (but no solution), so I rewrite the
> question
> )
> 
> pacemaker cannot start drbd9 resources. As I see, root has very
> limited privileges in the drbd resource agent, when it run by the
> pacemaker. I downloaded the latest pacemaker this week, and I
> compiled drbd9 rpms also. I hope, You can help me, I do not find the
> cause of this behaviour. Please see the below test cases:

I'm not a DRBD expert, but given the symptoms you describe, my first
thoughts would be that either the ocf:linbit:drbd agent you're using
isn't the right version for your DRBD version, or something like
SELinux is restricting access.
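
A couple of quick checks along those lines (a rough sketch; exact package
names depend on how you built the drbd9 rpms):

Check that the installed agent and DRBD userland/kernel bits match what you
compiled:
# rpm -qa | grep -i drbd
# drbdadm --version
# crm_resource --show-metadata ocf:linbit:drbd | head

Check whether SELinux is enforcing and logging denials for the agent:
# getenforce
# ausearch -m avc -ts recent | grep -i -e drbd -e modprobe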

> 1. When I create Pacemaker DRBD resource I get errors
> # pcs resource create DrbdDB ocf:linbit:drbd drbd_resource=drbd_db op
> monitor interval=60s meta notify=true
> # pcs resource master DrbdDBClone DrbdDB master-max=1 master-node-
> max=1 clone-node-max=1 notify=true
> # pcs constraint location DrbdDBClone prefers node1=INFINITY
> # pcs cluster stop --all; pcs cluster start --all; pcs status
> 
> Failed Actions:
> * DrbdDB_monitor_0 on node1 'not installed' (5): call=6,
> status=complete, exitreason='DRBD kernel (module) not available?',
> last-rc-change='Thu May 23 09:54:09 2019', queued=0ms, exec=58ms
> * DrbdDB_monitor_0 on node2 'not installed' (5): call=6,
> status=complete, exitreason='DRBD kernel (module) not available?',
> last-rc-change='Thu May 23 10:00:22 2019', queued=0ms, exec=71ms
> 
> 2. when I try to start drbd_db by drbdadm directly, it works well:
> # modprobe drbd #on each node
> # drbdadm up drbd_db #on each node
> # drbdadm primary drbd_db
> # drbdadm status 
> it shows drbd_db is UpToDate on each node
> I also can promote and mount filesystem well
> 
> 3. When I use debug-start, it works fine (so the resource syntax
> sould be correct)
> # drbdadm status
> No currently configured DRBD found.
> # pcs resource debug-start DrbdDBMaster
> Error: unable to debug-start a master, try the master's resource:
> DrbdDB
> # pcs resource debug-start DrbdDB #on each node
> Operation start for DrbdDB:0 (ocf:linbit:drbd) returned: 'ok' (0)
> # drbdadm status
> it shows drbd_db is UpToDate on each node
> 
> 4. Pacemaker handle other resources well . If I set auto_promote=yes,
> and I start (but not promote) the drbd_db by drbdadm, then pacemaker
> can create filesystem on it well, and also the appserver, database
> resources. 
> 
> 5. The strangest behaviour for me. Root have very limited privileges
> whitin the drbd resource agent. If I write this line to the
> srbd_start() method of  /usr/lib/ocf/resource.d/linbit/drbd
> 
> ocf_log err "lados " $(whoami) $( ls -l /home/opc/tmp/modprobe2.trace
> ) $( do_cmd touch /home/opc/tmp/modprobe2.trace )
> 
> I got theese messeges in log, when I start the cluster
> 
> # tail -f /var/log/cluster/corosync.log | grep -A 8 -B 3 -i lados
> 
> ...
> May 21 15:35:12  drbd(DrbdDB)[31649]:ERROR: lados  root
> May 21 15:35:12 [31309] node1   lrmd:   notice:
> operation_finished:DrbdDB_start_0:31649:stderr [ ls: cannot
> access /home/opc/tmp/modprobe2.trace: Permission denied ]
> May 21 15:35:12 [31309] node1   lrmd:   notice:
> operation_finished:DrbdFra_start_0:31649:stderr [ touch: cannot
> touch '/home/opc/tmp/modprobe2.trace': Permission denied ]
> ...
> and also, when I try to strace the "modprobe -s drbd `$DRBDADM sh-
> mod-parms`" in drbd resource agent, I only see 1 line in the
> /root/modprobe2.trace. This meens for me:
> - root cannot trace the calls in drbdadm (even if root can strace
> drbdadm outside of pacemaker well)
> - root can write into files his own directory
> (/root/modprobe2.trace) 
> 
> 6. Opposit of previous test
> root has these privileges outside from pacamaker
> 
> # sudo su -
> # touch /home/opc/tmp/modprobe2.trace
> # ls -l /home/opc/tmp/modprobe2.trace
> -rw-r--r--. 1 root root 0 May 21 15:44 /home/opc/tmp/modprobe2.trace
> 
> 
> Thanks: lados.
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Antw: Re: Antw: why is node fenced ?

2019-05-23 Thread Ulrich Windl
>>> "Lentes, Bernd"  schrieb am 23.05.2019
um
15:01 in Nachricht
<1029244418.9784641.1558616472505.javamail.zim...@helmholtz-muenchen.de>:

> 
> ‑ On May 20, 2019, at 8:28 AM, Ulrich Windl
ulrich.wi...@rz.uni‑regensburg.de 
> wrote:
> 
> "Lentes, Bernd"  schrieb am
16.05.2019
>> um
>> 17:10 in Nachricht
>> <1151882511.6631123.1558019430655.JavaMail.zimbra@helmholtz‑muenchen.de>:
>>> Hi,
>>> 
>>> my HA‑Cluster with two nodes fenced one on 14th of may.
>>> ha‑idg‑1 has been the DC, ha‑idg‑2 was fenced.
>>> It happened around 11:30 am.
>>> The log from the fenced one isn't really informative:
>>> 
>>> ==
>>> 2019‑05‑14T11:22:09.948980+02:00 ha‑idg‑2 liblogging‑stdlog: ‑‑ MARK ‑‑
>>> 2019‑05‑14T11:28:21.548898+02:00 ha‑idg‑2 sshd[14269]: Accepted
>>> keyboard‑interactive/pam for root from 10.35.34.70 port 59449 ssh2
>>> 2019‑05‑14T11:28:21.550602+02:00 ha‑idg‑2 sshd[14269]:
>>> pam_unix(sshd:session): session opened for user root by (uid=0)
>>> 2019‑05‑14T11:28:21.554640+02:00 ha‑idg‑2 systemd‑logind[2798]: New
session
>> 
>>> 15385 of user root.
>>> 2019‑05‑14T11:28:21.555067+02:00 ha‑idg‑2 systemd[1]: Started Session
15385
>> 
>>> of user root.
>>> 
>>> 2019‑05‑14T11:44:07.664785+02:00 ha‑idg‑2 systemd[1]: systemd 228 running
in
>> 
>>> system mode. (+PAM ‑AUDIT +SELINUX ‑IMA +APPARMOR ‑SMACK +SYSVINIT +UTMP
>>> +LIBCRYPTSETUP +GC   Restart !!!
>>> RYPT ‑GNUTLS +ACL +XZ ‑LZ4 +SECCOMP +BLKID ‑ELFUTILS +KMOD ‑IDN)
>>> 2019‑05‑14T11:44:07.664902+02:00 ha‑idg‑2 kernel: [0.00] Linux
>>> version 4.12.14‑95.13‑default (geeko@buildhost) (gcc version 4.8.5 (SUSE
>>> Linux) ) #1 SMP Fri Mar
>>> 22 06:04:58 UTC 2019 (c01bf34)
>>> 2019‑05‑14T11:44:07.665492+02:00 ha‑idg‑2 systemd[1]: Detected
architecture
>> 
>>> x86‑64.
>>> 2019‑05‑14T11:44:07.665510+02:00 ha‑idg‑2 kernel: [0.00] Command
>>> line: BOOT_IMAGE=/boot/vmlinuz‑4.12.14‑95.13‑default
>>> root=/dev/mapper/vg_local‑lv_root resume=/
>>> dev/disk/by‑uuid/2849c504‑2e45‑4ec8‑bbf8‑724cf358ee25 splash=verbose
>>> showopts
>>> 2019‑05‑14T11:44:07.665510+02:00 ha‑idg‑2 systemd[1]: Set hostname to
>>>.
>>> =
> 
>>> 
>>> One network interface is gone for a short period. But it's in a bonding
>>> device (round‑robin),
>>> so the connection shouldn't be lost. Both nodes are connected directly,
>>> there is no switch in between.
>> 
>> I think you misunderstood: a round‑robin bonding device is not fault‑safe
>> IMHO, but it depends a lot on your cabling details. Also you did not show 
> the
>> logs on the other nodes.
> 
> /usr/src/linux/Documentation/networking/bonding.txt says:
> " balance‑rr or 0: Round‑robin policy: Transmit packets in sequential
> order from the first available slave through the
> last.  This mode provides load balancing >> and fault
> tolerance. <<

We had that, and when one of the two cables had failed at side A, side B
would still transmit on both, so A would receive only every 2nd packet or so.

> Nevertheless i think i will switch to active/backup.
> 
> I showed /var/log/messages from the fenced node above.
> Or do you mean /var/log/pacemaker.log from the fenced one?
> Isn't that the same as the one from the DC?
> 
>  
>>> I manually (ifconfig eth3 down) stopped the interface afterwards several
>>> times ... nothing happened.
>>> The same with the second Interface (eth2).
> 
> That means I also stopped the second interface several times, also with no 
> fencing of the node.
> 
> Bernd
>  
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz‑muenchen.de 
> Stellv. Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich 
> Bassler, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt‑IdNr: DE 129521671
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: why is node fenced ?

2019-05-23 Thread Lentes, Bernd



- On May 20, 2019, at 8:28 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:

 "Lentes, Bernd"  schrieb am 16.05.2019
> um
> 17:10 in Nachricht
> <1151882511.6631123.1558019430655.javamail.zim...@helmholtz-muenchen.de>:
>> Hi,
>> 
>> my HA-Cluster with two nodes fenced one on 14th of may.
>> ha-idg-1 has been the DC, ha-idg-2 was fenced.
>> It happened around 11:30 am.
>> The log from the fenced one isn't really informative:
>> 
>> ==
>> 2019-05-14T11:22:09.948980+02:00 ha-idg-2 liblogging-stdlog: -- MARK --
>> 2019-05-14T11:28:21.548898+02:00 ha-idg-2 sshd[14269]: Accepted
>> keyboard-interactive/pam for root from 10.35.34.70 port 59449 ssh2
>> 2019-05-14T11:28:21.550602+02:00 ha-idg-2 sshd[14269]:
>> pam_unix(sshd:session): session opened for user root by (uid=0)
>> 2019-05-14T11:28:21.554640+02:00 ha-idg-2 systemd-logind[2798]: New session
> 
>> 15385 of user root.
>> 2019-05-14T11:28:21.555067+02:00 ha-idg-2 systemd[1]: Started Session 15385
> 
>> of user root.
>> 
>> 2019-05-14T11:44:07.664785+02:00 ha-idg-2 systemd[1]: systemd 228 running in
> 
>> system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP
>> +LIBCRYPTSETUP +GC   Restart !!!
>> RYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
>> 2019-05-14T11:44:07.664902+02:00 ha-idg-2 kernel: [0.00] Linux
>> version 4.12.14-95.13-default (geeko@buildhost) (gcc version 4.8.5 (SUSE
>> Linux) ) #1 SMP Fri Mar
>> 22 06:04:58 UTC 2019 (c01bf34)
>> 2019-05-14T11:44:07.665492+02:00 ha-idg-2 systemd[1]: Detected architecture
> 
>> x86-64.
>> 2019-05-14T11:44:07.665510+02:00 ha-idg-2 kernel: [0.00] Command
>> line: BOOT_IMAGE=/boot/vmlinuz-4.12.14-95.13-default
>> root=/dev/mapper/vg_local-lv_root resume=/
>> dev/disk/by-uuid/2849c504-2e45-4ec8-bbf8-724cf358ee25 splash=verbose
>> showopts
>> 2019-05-14T11:44:07.665510+02:00 ha-idg-2 systemd[1]: Set hostname to
>>.
>> =

>> 
>> One network interface is gone for a short period. But it's in a bonding
>> device (round-robin),
>> so the connection shouldn't be lost. Both nodes are connected directly,
>> there is no switch in between.
> 
> I think you misunderstood: a round-robin bonding device is not fault-safe
> IMHO, but it depends a lot on your cabling details. Also you did not show the
> logs on the other nodes.

/usr/src/linux/Documentation/networking/bonding.txt says:
" balance-rr or 0: Round-robin policy: Transmit packets in sequential
order from the first available slave through the
last.  This mode provides load balancing >> and fault
tolerance. <<
Nevertheless I think I will switch to active/backup.
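
If it helps anyone later, this is what I have in mind for the change (a
sketch only, assuming the bond is bond0 and SLES-style ifcfg files; not
applied yet). The current mode can be checked with:

# grep "Bonding Mode" /proc/net/bonding/bond0

and the switch itself should be a change in /etc/sysconfig/network/ifcfg-bond0:

BONDING_MASTER='yes'
BONDING_MODULE_OPTS='mode=active-backup miimon=100'
BONDING_SLAVE_0='eth2'
BONDING_SLAVE_1='eth3'

followed by 'wicked ifreload bond0' (or a network restart).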

I showed /var/log/messages from the fenced node above.
Or do you mean /var/log/pacemaker.log from the fenced one?
Isn't that the same as the one from the DC?

 
>> I manually (ifconfig eth3 down) stopped the interface afterwards several
>> times ... nothing happened.
>> The same with the second Interface (eth2).

That means I also stopped the second interface several times, also with no 
fencing of the node.

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Stellv. Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] action 'monitor_Stopped' not found in Resource Agent meta-data

2019-05-23 Thread Ulrich Windl
Hi!

Reading the release notes for SLES12 HAE, I found the ``op monitor 
role="Stopped" ...`` operation that has been discussed here before, too.
When trying to configure it, I get a warning (from crm shell):
WARNING: prm_ping_gw1-v582: action 'monitor_Stopped' not found in Resource Agent meta-data

Is this a "false alert" or does each RA have to support that action? Or is it a 
bug in crm shell?

My test resource was:
primitive prm_ping_gw1-v582 ocf:pacemaker:ping \
params name=val_network dampen=30s multiplier=1000 host_list=172.20.17.254 \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s \
op monitor interval=60s timeout=60s \
op monitor interval=600s timeout=60s role=Stopped \
utilization utl_cpu=1 utl_ram=1
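
For completeness, this is how I checked which actions the agent advertises
(a quick sketch):

# crm_resource --show-metadata ocf:pacemaker:ping | grep 'action name'

The agent only advertises a plain monitor action there (no role-specific
variant), which is presumably what crm shell is matching the operation
name against.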

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] drbd could not start by pacemaker. strange limited root privileges?

2019-05-23 Thread László Neduki
Hi,

(
I sent a similar question from another account 3 days ago, but:
- I do not see it on the list. Maybe I am not supposed to see my own email?
So I created a new account.
- I have additional info (but no solution), so I am rewriting the question.
)

Pacemaker cannot start drbd9 resources. As far as I can see, root has very
limited privileges in the drbd resource agent when it is run by pacemaker. I
downloaded the latest pacemaker this week, and I also compiled the drbd9 rpms.
I hope you can help me; I cannot find the cause of this behaviour. Please see
the test cases below:

1. When I create Pacemaker DRBD resource I get errors
# pcs resource create DrbdDB ocf:linbit:drbd drbd_resource=drbd_db op monitor interval=60s meta notify=true
# pcs resource master DrbdDBClone DrbdDB master-max=1 master-node-max=1 clone-node-max=1 notify=true
# pcs constraint location DrbdDBClone prefers node1=INFINITY
# pcs cluster stop --all; pcs cluster start --all; pcs status

Failed Actions:
* DrbdDB_monitor_0 on node1 'not installed' (5): call=6, status=complete,
exitreason='DRBD kernel (module) not available?',
last-rc-change='Thu May 23 09:54:09 2019', queued=0ms, exec=58ms
* DrbdDB_monitor_0 on node2 'not installed' (5): call=6, status=complete,
exitreason='DRBD kernel (module) not available?',
last-rc-change='Thu May 23 10:00:22 2019', queued=0ms, exec=71ms

2. when I try to start drbd_db by drbdadm directly, it works well:
# modprobe drbd #on each node
# drbdadm up drbd_db #on each node
# drbdadm primary drbd_db
# drbdadm status
it shows drbd_db is UpToDate on each node
I also can promote and mount filesystem well

3. When I use debug-start, it works fine (so the resource syntax should be
correct)
# drbdadm status
No currently configured DRBD found.
# pcs resource debug-start DrbdDBMaster
Error: unable to debug-start a master, try the master's resource: DrbdDB
# pcs resource debug-start DrbdDB #on each node
Operation start for DrbdDB:0 (ocf:linbit:drbd) returned: 'ok' (0)
# drbdadm status
it shows drbd_db is UpToDate on each node

4. Pacemaker handles other resources well. If I set auto_promote=yes and I
start (but do not promote) drbd_db by drbdadm, then pacemaker can create the
filesystem on it fine, and also the appserver and database resources.

5. The strangest behaviour for me: root has very limited privileges within
the drbd resource agent. If I add this line to the drbd_start() method of
/usr/lib/ocf/resource.d/linbit/drbd

ocf_log err "lados " $(whoami) $( ls -l /home/opc/tmp/modprobe2.trace ) $( do_cmd touch /home/opc/tmp/modprobe2.trace )

I got these messages in the log when I started the cluster:

# tail -f /var/log/cluster/corosync.log | grep -A 8 -B 3 -i lados

...
May 21 15:35:12  drbd(DrbdDB)[31649]:ERROR: lados  root
May 21 15:35:12 [31309] node1   lrmd:   notice: operation_finished:
DrbdDB_start_0:31649:stderr [ ls: cannot access
/home/opc/tmp/modprobe2.trace: Permission denied ]
May 21 15:35:12 [31309] node1   lrmd:   notice: operation_finished:
DrbdFra_start_0:31649:stderr [ touch: cannot touch
'/home/opc/tmp/modprobe2.trace': Permission denied ]
...
Also, when I try to strace the "modprobe -s drbd `$DRBDADM sh-mod-parms`"
call in the drbd resource agent, I only see 1 line in /root/modprobe2.trace.
To me this means:
- root cannot trace the calls in drbdadm (even though root can strace drbdadm
outside of pacemaker just fine)
- root can write files in his own directory (/root/modprobe2.trace)

6. The opposite of the previous test:
root has these privileges outside of pacemaker

# sudo su -
# touch /home/opc/tmp/modprobe2.trace
# ls -l /home/opc/tmp/modprobe2.trace
-rw-r--r--. 1 root root 0 May 21 15:44 /home/opc/tmp/modprobe2.trace


Thanks: lados.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/