[ceph-users] CEPH iSCSI issue - ESXi command timeout

Golasowski Martin Thu, 01 Oct 2020 05:46:00 -0700

Dear All,

a week ago we had to reboot our ESXi nodes because our CEPH cluster suddenly
stopped serving all I/O. We identified a VM (the vCenter appliance) that was
swapping heavily and causing heavy load. Since then, however, we have been
experiencing strange issues, as if the cluster cannot handle any spike in I/O
load, such as a migration or a VM reboot.

The main problem is that iSCSI commands issued by ESXi sometimes time out
and ESXi reports the datastore as inaccessible. This disrupts I/O so badly that
we have had to reboot the entire VMware cluster several times. It started
suddenly after approximately 10 months of operation without problems.

I can see a steadily increasing number of dropped Rx packets on the iSCSI
network interfaces of the OSD servers.
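
To put a number on "steadily increasing", the counter can be sampled twice and turned into a rate. The following is a minimal sketch, assuming a Linux host where the kernel exposes per-interface counters under /sys/class/net/<iface>/statistics/rx_dropped. The function names and the interface name in the usage note are hypothetical, not taken from our setup.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Return the cumulative rx_dropped counter the kernel keeps for one interface."""
    counter_file = Path(sysfs_root, iface, "statistics", "rx_dropped")
    return int(counter_file.read_text().strip())


def drop_rate(before, after, seconds):
    """Dropped packets per second between two samples of the counter."""
    return (after - before) / seconds


def sample_drop_rate(iface, interval=10.0, sysfs_root="/sys/class/net"):
    """Sample the counter twice, `interval` seconds apart, and return the rate."""
    first = read_rx_dropped(iface, sysfs_root)
    time.sleep(interval)
    second = read_rx_dropped(iface, sysfs_root)
    return drop_rate(first, second, interval)
```

Calling sample_drop_rate("ens1f0", interval=60) on an OSD server (with "ens1f0" replaced by the real iSCSI interface name) during a migration and again while idle would show whether the drops follow the I/O spikes.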

Our CEPH setup is as follows: 4 OSD servers, each with three 10 TB 7.2k rpm
HDDs. The OSD servers are connected by 25 Gbps Ethernet to the other nodes. For
the RBD pools I have 64 PGs. The OSD servers have 32 GB of RAM each; free
memory is around 1 GB on each, though I have seen it even lower. The OS is
CentOS 7 and the CEPH release is Nautilus 14.2.11, deployed by ceph-ansible.
The MONs are virtualized on the ESXi nodes, on the local SSD drives.

The iSCSI NICs are on a separate VLAN; other traffic is served via a bond with
balance-xor (LACP is unusable due to a VMware limitation when using the SW
iSCSI HBA) in a different VLAN. Our network is Mellanox based: SN2100 switches
and ConnectX-5 NICs.

The iSCSI target serves 2 LUNs in an RBD pool which is erasure coded. Yesterday
I increased the number of PGs for that pool from 64 to 128, without much
effect after the cluster finished rebalancing.

In the kernel log on the OSD servers we see the following:

[299560.618893] iSCSI Login negotiation failed.
[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
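
The ABORT_TASK pairs above carry kernel timestamps, so the time the target needed to complete each abort can be read off directly. A minimal sketch, assuming the two message formats are exactly as they appear in this excerpt; the function name abort_durations is hypothetical.

```python
import re

# Patterns for the two ABORT_TASK messages, as they appear in the excerpt above.
FOUND = re.compile(
    r"\[\s*(\d+\.\d+)\] ABORT_TASK: Found referenced iSCSI task_tag: (\d+)"
)
DONE = re.compile(
    r"\[\s*(\d+\.\d+)\] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: (\d+)"
)


def abort_durations(lines):
    """Map task tag -> seconds between 'Found referenced' and 'TMR_FUNCTION_COMPLETE'."""
    started = {}
    durations = {}
    for line in lines:
        match = FOUND.search(line)
        if match:
            started[match.group(2)] = float(match.group(1))
            continue
        match = DONE.search(line)
        if match and match.group(2) in started:
            tag = match.group(2)
            durations[tag] = float(match.group(1)) - started.pop(tag)
    return durations


SAMPLE = """\
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
"""

for tag, seconds in abort_durations(SAMPLE.splitlines()).items():
    print(f"task_tag {tag}: abort completed after {seconds:.2f} s")
```

For the two aborts logged here this works out to roughly 8.7 s and 5.4 s between the abort being matched and its completion being sent.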

The error in ESXi looks like this:

naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:51.291Z cpu49:2144076)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a5b1b9480, 2097241) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
2020-10-01T05:38:57.098Z cpu44:2099346)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x8a (0x45ba96710ec0, 2107403) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.122Z cpu71:2098965)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x45ba9676aec0, 2146212) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.256Z cpu65:2098959)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a4179d8c0, 2146269) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
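
To see which commands and paths fail most often, the NMP messages can be tallied per opcode, device and path. A minimal sketch, assuming the message layout shown in this excerpt; the names count_failures and OPCODES are hypothetical. The opcode names come from the SCSI block command set: 0x89 is COMPARE AND WRITE and 0x8a is WRITE(16). That the 0x89 commands here are VMFS ATS locking operations is an assumption, not something the log states.

```python
import re
from collections import Counter

# SCSI opcodes seen in the excerpt, named per the SCSI block command set.
OPCODES = {0x89: "COMPARE AND WRITE", 0x8A: "WRITE(16)"}

# \s+ also matches line breaks, so records wrapped across lines still match.
FAILED_CMD = re.compile(
    r'Cmd\s+(0x[0-9a-fA-F]+)\s+\(0x[0-9a-fA-F]+,\s*\d+\)\s+to\s+dev\s+'
    r'"(naa\.[0-9a-fA-F]+)"\s+on\s+path\s+"([^"]+)"\s+Failed'
)


def count_failures(log_text):
    """Count failed commands per (opcode, device, path) in vmkernel log text."""
    counts = Counter()
    for opcode, device, path in FAILED_CMD.findall(log_text):
        counts[(int(opcode, 16), device, path)] += 1
    return counts


SAMPLE = """\
2020-10-01T05:38:51.291Z cpu49:2144076)NMP: nmp_ThrottleLogForDevice:3856: Cmd
0x89 (0x459a5b1b9480, 2097241) to dev "naa.6001405a527d78935724451aa5f53513" on
path "vmhba64:C2:T0:L1" Failed:
2020-10-01T05:38:57.098Z cpu44:2099346)NMP: nmp_ThrottleLogForDevice:3856: Cmd
0x8a (0x45ba96710ec0, 2107403) to dev "naa.60014053b46fc760ff0470dbd7980263" on
path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.122Z cpu71:2098965)NMP: nmp_ThrottleLogForDevice:3856: Cmd
0x89 (0x45ba9676aec0, 2146212) to dev "naa.60014053b46fc760ff0470dbd7980263" on
path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.256Z cpu65:2098959)NMP: nmp_ThrottleLogForDevice:3856: Cmd
0x89 (0x459a4179d8c0, 2146269) to dev "naa.6001405a527d78935724451aa5f53513" on
path "vmhba64:C2:T0:L1" Failed:
"""

for (opcode, device, path), n in count_failures(SAMPLE).most_common():
    name = OPCODES.get(opcode, "unknown opcode")
    print(f"{n}x {opcode:#04x} {name} to {device} via {path}")
```

On the four complete records above, three of the failures are 0x89 and one is 0x8a, spread over both LUNs and both paths.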

We would appreciate any help you can give us.

Thank you very much.

Regards,
Martin Golasowski


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
