On Sun, Apr 25, 2021 at 12:34 PM Gal Villaret <gal.villa...@gmail.com> wrote:
>
> Hi All,
> In the past few days, we have been having problems with hosts becoming 
> unmanageable due to multipathd identifying false failed paths and VDSM 
> crashing because of it.
> We were running version 4.4.1, and upgrading to 4.4.5 (engine) and 4.4.6 (nodes) 
> seems to have resolved the VDSM crashes.
> However, we see that the multipath events continue.
> From what we have observed, the events start in correlation with the host 
> reporting on low swap space. The low swap space seems to be related to 
> Commvault backup operation running. By running top on the host while a backup 
> operation is running I can see swap being consumed to 100% although there is 
> plenty of RAM available.
> After the multipath events start happening, the only way to stop them is 
> to reboot the host.
>
> This is the warning in oVirt UI:
> Apr 24, 2021, 9:14:07 PM - Available swap memory of host Ovirt-Node2 [953 MB] 
> is under defined threshold [1024 MB].
>
> This is the first appearance of the multipath event in /var/log/messages:
> Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 FAILED Result: 
> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=15s
> Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 Sense Key : 
> Aborted Command [current]
> Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 
> <<vendor>>ASC=0xc1 ASCQ=0x1
> Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 CDB: Read(16) 
> 88 00 00 00 00 00 79 ac 75 a0 00 00 06 00 00 00
> Apr 24 21:14:58 ovirt-node2 kernel: blk_update_request: I/O error, dev sdgi, 
> sector 2041345440 op 0x0:(READ) flags 0x4200 phys_seg 192 prio class 0
> Apr 24 21:14:58 ovirt-node2 kernel: device-mapper: multipath: 253:20: Failing 
> path 131:224.
> Apr 24 21:14:58 ovirt-node2 multipathd[3044]: sdgi: mark as failed
> Apr 24 21:14:58 ovirt-node2 multipathd[3044]: 
> 3600000e00d2c0000002cb4a8000b0000: remaining active paths: 11
> Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 
> 3600000e00d2c0000002cb4a8000b0000: sdgi - tur checker reports path is up
> Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 131:224: reinstated
> Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 
> 3600000e00d2c0000002cb4a8000b0000: remaining active paths: 12
> Apr 24 21:15:03 ovirt-node2 kernel: device-mapper: multipath: 253:20: 
> Reinstating path 131:224.
> Apr 24 21:15:03 ovirt-node2 kernel: sd 21:0:0:9: alua: port group 8091 state 
> A preferred supports toluSNA
> Apr 24 21:15:03 ovirt-node2 kernel: sd 21:0:0:9: alua: port group 8091 state 
> A preferred supports toluSNA
> Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 FAILED Result: 
> hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=15s
> Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 Sense Key : 
> Aborted Command [current]
> Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 
> <<vendor>>ASC=0xc1 ASCQ=0x1
> Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 CDB: Read(16) 
> 88 00 00 00 00 00 79 ac 75 a0 00 00 06 00 00 00
> Apr 24 21:15:13 ovirt-node2 kernel: blk_update_request: I/O error, dev sdau, 
> sector 2041345440 op 0x0:(READ) flags 0x4200 phys_seg 192 prio class 0
> Apr 24 21:15:13 ovirt-node2 kernel: device-mapper: multipath: 253:20: Failing 
> path 66:224.
> Apr 24 21:15:13 ovirt-node2 multipathd[3044]: sdau: mark as failed
>
> Underlying storage is Fujitsu DX200 S5 with all SSD drives.
> Each host has two 10Gbit network adapters dedicated to iSCSI.
> Any help with this would be highly appreciated.

This looks like an incorrect multipath configuration for your storage.

Looking at the multipathd built-in configuration, we have only:

        device {
                vendor "FUJITSU"
                product "ETERNUS_DX(H|L|M|400|8000)"
                path_grouping_policy "group_by_prio"
                prio "alua"
                failback "immediate"
                no_path_retry 10
        }

This likely does match your storage. You can see the vendor/product
names in:

    $ multipath -ll
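
Since multipath treats the vendor and product values as regular expressions,
you can check a product string against the built-in pattern quoted above.
This is only a rough sketch: grep -E approximates the matching that multipath
does, the function name is made up for illustration, and the product strings
in the usage lines are example inputs, not what your array necessarily reports.

```shell
# Built-in product pattern, copied from the multipathd configuration above.
BUILTIN_PRODUCT='ETERNUS_DX(H|L|M|400|8000)'

# Hypothetical helper: return 0 if the given product string matches the
# built-in pattern, non-zero otherwise.
matches_builtin() {
    printf '%s\n' "$1" | grep -Eq "$BUILTIN_PRODUCT"
}

# Usage with example inputs (replace with the string from "multipath -ll"):
if matches_builtin "ETERNUS_DX200"; then
    echo "built-in device section applies"
else
    echo "no built-in match, local configuration needed"
fi
```

If the function reports no match, the built-in device section is not applied
to the LUNs, and they run with the multipath defaults.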

I found that they have a multipath driver for Linux:
https://sp.ts.fujitsu.com/dmsp/Publications/public/eternusmpd-v2-linux-en.pdf

But it has no information about configuring multipath. Maybe their installer
configures /etc/multipath.conf for you. In that case you need to remove that
configuration from /etc/multipath.conf (since the file is managed by vdsm) and
move it to a local configuration file (see below how).

Check with the storage vendor what is the recommended multipath configuration
for your storage and add a local configuration in:

$ cat /etc/multipath/conf.d/99-local.conf
devices {
    device {
        vendor "FUJITSU"
        # You may need to change this
        product "ETERNUS_DX200"
        ...
    }
}

Reload multipath to activate the configuration:

    systemctl reload multipathd
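
To confirm that the reload picked up the local file, you can dump the running
configuration with "multipathd show config" and look at the FUJITSU device
sections. The filter below is a minimal sketch for that; the function name is
made up for illustration, and it assumes device blocks contain no nested braces.

```shell
# Hypothetical helper: read multipath configuration text on stdin and print
# every "device { ... }" block whose vendor line mentions the given vendor.
show_device_section() {
    awk -v vendor="$1" '
        # Start of a device block ("devices {" does not match this pattern).
        /^[[:space:]]*device[[:space:]]*[{]/ { in_dev = 1; block = ""; hit = 0 }
        in_dev {
            block = block $0 "\n"
            if ($1 == "vendor" && index($0, vendor)) hit = 1
            # A closing brace ends the block; print it only if the vendor matched.
            if ($0 ~ /^[[:space:]]*[}]/) {
                if (hit) printf "%s", block
                in_dev = 0
            }
        }
    '
}

# Usage on a host running multipathd:
#   multipathd show config | show_device_section FUJITSU
```

You should see both the built-in section and the one from 99-local.conf.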

Ben may have more specific recommendations.

Nir
