Dear All,

A week ago we had to reboot our ESXi nodes because our Ceph cluster suddenly
stopped serving all I/O. We identified a VM (the vCenter appliance) that was
swapping heavily and causing heavy load. Since then, however, we have been
experiencing strange issues, as if the cluster cannot handle any spike in I/O
load, such as a migration or a VM reboot.

The main problem is that the iSCSI commands issued by ESXi sometimes time out
and ESXi reports an inaccessible datastore. This disrupts I/O heavily, and we
had to reboot the entire VMware cluster several times. The problem started
suddenly after approximately 10 months of operation without issues.

I can see a steadily increasing number of dropped Rx packets on the iSCSI
network interfaces of the OSD servers.
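
To put a number on those drops, here is a minimal Python sketch that reads the
kernel's cumulative rx_dropped counter from sysfs and turns two readings into
an average rate. The function names and the 10-second interval are my own
choices for illustration; the interface name has to be supplied by whoever runs
it.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Return the cumulative rx_dropped counter the kernel exposes for iface."""
    path = Path(sysfs_root) / iface / "statistics" / "rx_dropped"
    return int(path.read_text().strip())


def drops_per_second(first, second, interval_s):
    """Convert two cumulative counter readings into an average drop rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (second - first) / interval_s


def sample_drop_rate(iface, interval_s=10.0, sysfs_root="/sys/class/net"):
    """Read the counter twice, interval_s apart, and return drops per second."""
    before = read_rx_dropped(iface, sysfs_root)
    time.sleep(interval_s)
    after = read_rx_dropped(iface, sysfs_root)
    return drops_per_second(before, after, interval_s)
```

Calling sample_drop_rate() with the name of an iSCSI interface during a VM
migration and again while the cluster is idle should show whether the drops
follow the I/O spikes.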

Our Ceph setup is as follows: 4 OSD servers, each with three 10 TB 7.2k rpm
HDDs. The OSD servers are connected by 25 Gbps Ethernet to the other nodes.
For the RBD pools I have 64 PGs. Each OSD server has 32 GB RAM, of which around
1 GB is free; I have seen even lower, though. The OS is CentOS 7 and the Ceph
release is Nautilus 14.2.11, deployed by ceph-ansible. The MONs are virtualized
on the ESXi nodes, on their local SSD drives.

The iSCSI NICs are on a separate VLAN; other traffic is served via a bond with
balance-xor (LACP is unusable due to a VMware limitation when using the SW
iSCSI HBA) in a different VLAN. Our network is Mellanox based: SN2100 switches
and ConnectX-5 NICs.

The iSCSI target serves 2 LUNs in an erasure-coded RBD pool. Yesterday I
increased the number of PGs for that pool from 64 to 128, without much effect
after the cluster finished rebalancing.
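
For reference, a rough Python sketch of the per-OSD arithmetic behind the 64
to 128 change. It assumes one OSD daemon per HDD (12 in total), and the k+m
value is a placeholder since the erasure-code profile is not given here; the
function name is made up for illustration.

```python
def pg_shards_per_osd(pg_num, pool_size, num_osds):
    """Average number of PG shards (or replicas) each OSD holds for one pool."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return pg_num * pool_size / num_osds


NUM_OSDS = 4 * 3       # assumption: one OSD daemon per HDD, 4 servers x 3 HDDs
ASSUMED_K_PLUS_M = 3   # hypothetical EC profile (k=2, m=1), not the real one

before = pg_shards_per_osd(64, ASSUMED_K_PLUS_M, NUM_OSDS)
after = pg_shards_per_osd(128, ASSUMED_K_PLUS_M, NUM_OSDS)
print(f"PG shards per OSD for this pool: {before:.0f} before, {after:.0f} after")
```

Under those assumptions the pool alone stays well below the commonly cited
guideline of roughly 100 PGs per OSD across all pools, so the PG count by
itself does not obviously explain the timeouts.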

In the kernel log on the OSD servers we see the following:

[299560.618893] iSCSI Login negotiation failed.
[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
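
To see how often each kind of event occurs over a longer period, the kernel log
can be tallied with a short Python sketch like the one below. The category
names are my own labels; the match strings are taken from the messages above.
Wrapped continuation lines are ignored because they do not start with a
timestamp.

```python
import re
from collections import Counter

# A kernel log line starts with "[seconds.microseconds]" followed by the message.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# Label -> substring identifying the event (labels are illustrative only).
PATTERNS = {
    "login_failed": "iSCSI Login negotiation failed",
    "nopin_timeout": "Did not receive response to NOPIN",
    "abort_task": "ABORT_TASK: Found referenced iSCSI task_tag",
}


def classify(lines):
    """Count iSCSI target events per category in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation of a wrapped line, or unrelated text
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                counts[label] += 1
                break
    return counts
```

It can be fed from a saved log, for example classify(open(path)) with a file
written by dmesg.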


The error in ESXi looks like this:

naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:51.291Z cpu49:2144076)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a5b1b9480, 2097241) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
2020-10-01T05:38:57.098Z cpu44:2099346)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x8a (0x45ba96710ec0, 2107403) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.122Z cpu71:2098965)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x45ba9676aec0, 2146212) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.256Z cpu65:2098959)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a4179d8c0, 2146269) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
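
The failing commands can be grouped by SCSI opcode and path with a small Python
sketch such as the following. The opcode table covers only a few standard
opcodes (0x89 is COMPARE AND WRITE, which ESXi uses for ATS locking, and 0x8a
is WRITE(16)); the function name is made up, and the regular expression only
assumes the line layout shown above.

```python
import re
from collections import Counter

# A few standard SCSI opcodes; anything else is reported by its hex value.
OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x88: "READ(16)",
    0x89: "COMPARE AND WRITE",
    0x8A: "WRITE(16)",
}

FAIL_RE = re.compile(
    r'Cmd (0x[0-9a-fA-F]+) \([^)]*\) to dev "([^"]+)" on path "([^"]+)" Failed'
)


def summarize_failures(text):
    """Count failed commands per (opcode name, path) in vmkernel log text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    counts = Counter()
    for opcode_hex, _device, path in FAIL_RE.findall(flat):
        opcode = int(opcode_hex, 16)
        counts[(OPCODES.get(opcode, opcode_hex), path)] += 1
    return counts
```

In the excerpt above, three of the four complete entries are COMPARE AND WRITE
commands and one is a WRITE(16).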

We would appreciate any help you can give us.

Thank you very much.

Regards,
Martin Golasowski

