On Thu, Feb 17, 2022 at 12:38 PM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>> Klaus Wenninger <kwenn...@redhat.com> wrote on 17.02.2022 at 10:49 in
> message <calrdao0ungyyybnv9xwve9v4suxvjon-y8c8vd51zr5lt1o...@mail.gmail.com>:
> ...
> >> For completeness: Yes, sbd did recover:
> >> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
> >> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
> >> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
> >> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
> >> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
> >> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
> >>
> >
> > Good to see that!
> > Did you try several times?
>
> Well, we only have two fabrics, and the server is productive, so both
> fabrics were interrupted once each (to change the cabling).
> sbd survived.

Yup - sometimes the entities that would have to be failed are just too
large to have them as part of the playground/sandbox :-(

> Second fabric:
> Feb 14 13:03:51 h18 kernel: qla2xxx [0000:01:00.0]-500b:2: LOOP DOWN detected (2 7 0 0).
> Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
> Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 2
>
> Feb 14 13:05:18 h18 kernel: qla2xxx [0000:01:00.0]-500a:2: LOOP UP detected (8 Gbps).
> Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: sdr - tur checker reports path is up
> Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
> Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: sdae - tur checker reports path is up
> Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4
> Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: sdl - tur checker reports path is up
> Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
> Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: sdo - tur checker reports path is up
> Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4
>
> So this time multipath reacted before SBD noticed anything (the way it
> should have been anyway).

That depends on how you would like it to behave. You are free to configure
the io-timeout so that sbd never sees a short outage. Or, if you would
rather have some notice in the sbd logs - plus the added reliability of
kicking off another read attempt instead of waiting for a first, maybe
doomed, one to finish - you give it enough room to retry within your
msgwait-timeout.
Unfortunately it isn't possible to have one-size-fits-all defaults here,
but feedback is welcome so that we can do a little tweaking that makes
them fit a larger audience. I remember a case where devices stalled for
50 s during a firmware update and that wasn't supposed to trigger fencing -
definitely a case that can't be covered by defaults.
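To make the two knobs a bit more concrete, a rough sketch of where they
live - the numbers are only placeholders, and the exact option semantics
should be checked against sbd(8) for your version:

# watchdog-timeout and msgwait-timeout are stored in the on-disk header:
sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 dump

# recreating the header with a larger msgwait leaves room to retry a hung
# read (create wipes the slots - only on a device that is not in use!):
sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 -1 90 -4 180 create

# the io-timeout of the disk servants is a runtime option, e.g. via
# SBD_OPTS="-I 30" in /etc/sysconfig/sbd: well below msgwait it allows
# at least one retry, well above the usual multipath failover time sbd
# simply never notices a short path outage.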
> > I have some memory that when testing with the kernel mentioned before
> > behavior changed after a couple of timeouts and it wasn't able to
> > create the read-request anymore (without the fix mentioned) - assume
> > some kind of resource depletion due to previously hanging attempts not
> > destroyed properly.
>
> That can be a nasty race condition, too, however. (I had my share of
> signal handlers, threads and race conditions.)
> Of course more crude programming errors are possible, too.

It is one single-threaded process, and the misbehavior was gone once the
API was handled properly - I mean the different behavior after a couple
of retries was gone. The basic issue was persistent with that kernel.

> Debugging can be very hard, but dmsetup can create bad disks for testing
> for you ;-)
> DEV=bad_disk
> dmsetup create "$DEV" <<EOF
> 0 8 zero
> 8 1 error
> 9 7 zero
> 16 1 error
> 17 255 zero
> EOF

We need to impose the problem dynamically; otherwise sbd wouldn't come up
in the first place - which is of course a useful test in itself as well.
At the moment regressions.sh is using wipe_table to impose an error
dynamically, but simultaneously on all blocks. The periodic reading is
done on just a single block anyway (more accurately, the header as well),
so we should be fine with that.
I saw that device-mapper offers a possibility to delay here as well. That
looks as if it was useful for a CI test-case that simulates what we have
here - even multiple times in a row without upsetting customers ;-)
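For the record, a minimal sketch of what such a test-case might look like
with the delay target - device names, sizes and the 45 s stall are made up
for illustration, and it assumes the dm-delay module is available:

# back the test device with a loop device
truncate -s 10M /tmp/sbd_backing
LOOP=$(losetup -f --show /tmp/sbd_backing)
SECTORS=$(blockdev --getsz "$LOOP")

# start out healthy so sbd can initialize and watch the device
dmsetup create sbd_test --table "0 $SECTORS linear $LOOP 0"

# later, while sbd is watching, stall all I/O for ~45 s without returning
# errors (dmsetup wipe_table sbd_test would instead fail everything
# immediately, as regressions.sh does today)
dmsetup suspend sbd_test
dmsetup load sbd_test --table "0 $SECTORS delay $LOOP 0 45000"
dmsetup resume sbd_test

# and back to the healthy mapping once the scenario is done
dmsetup suspend sbd_test
dmsetup load sbd_test --table "0 $SECTORS linear $LOOP 0"
dmsetup resume sbd_test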
Regards,
Klaus

> Regards,
> Ulrich
> ...

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/