Hello fellow CEPH-users,
We are currently investigating latency spikes in our Ceph (14.2.11) production cluster, which usually occur under heavy load. TL;DR: Do you have an idea where to investigate kv commit latency spikes on a Ceph cluster with an LSI 9300-8i HBA and all-SSD (Intel, Micron) OSDs?
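(For context on how we quantify the spikes: something like the following samples the BlueStore latency counters of a suspect OSD via the admin socket. osd.31 is just an example, and the counter names are taken from our 14.2.11 perf dump, so treat them as an assumption for other releases.)
```
# Sample the BlueStore kv/commit latency counters of one OSD (osd.31 as example).
# Each latency counter exposes avgcount / sum / avgtime, so sampling before and
# after a load test shows how much the average is dragged up by the spikes.
ceph daemon osd.31 perf dump | grep -A 5 -E '"kv_sync_lat"|"kv_commit_lat"|"commit_lat"'

# Optionally reset the counters first so avgtime only covers the test window:
ceph daemon osd.31 perf reset all
```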


The cluster consists of 3 MDS nodes (2 active + 1 standby-replay), 3 MON nodes (each running a MGR + MON daemon) and 4 OSD nodes (each with 8 SSD BlueStore OSDs). All nodes run Ubuntu 18.04 with kernel 5.4 (two OSD servers are still on 4.15, but the spikes are seen on all of the OSD servers).

As the spikes seem to be randomly distributed across time (under load) and OSDs, we traced them and found the following messages on the OSD nodes:
```
bluestore(/var/lib/ceph/osd/ceph-31) log_latency slow operation observed for kv_sync, latency = 5.22298s
bluestore(/var/lib/ceph/osd/ceph-31) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.5732s, txc = 0x55b1a98d9e00
...
bluestore(/var/lib/ceph/osd/ceph-31) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.50842s, txc = 0x55b1aa197800
bluestore(/var/lib/ceph/osd/ceph-31) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.5058s, txc = 0x55b1b7e75c00
```
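As far as we can tell, these warnings are printed whenever an operation exceeds bluestore_log_op_age (5 seconds by default, if we read the option correctly), which lines up with the ~5.5 s values above. The threshold can be checked like this:
```
# Assumption: bluestore_log_op_age is the threshold behind the
# "slow operation observed" messages (default should be 5 seconds on 14.2.x).
ceph config get osd bluestore_log_op_age
# or directly on the OSD via the admin socket:
ceph daemon osd.31 config get bluestore_log_op_age
```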

We found temporally correlated kernel messages, which suggest that it might have something to do with the underlying SSDs:
```
kernel: [3613612.312027] sd 4:0:10:0: attempting task abort!scmd(0x00000000dac86408), outstanding for 31384 ms & timeout 30000 ms
kernel: [3613612.312034] sd 4:0:10:0: [sdg] tag#744 CDB: Write(10) 2a 00 be 11 b8 80 00 00 08 00
kernel: [3613612.312036] scsi target4:0:10: handle(0x0013), sas_address(0x4433221104000000), phy(4)
kernel: [3613612.312038] scsi target4:0:10: enclosure logical id(0x500605b00e70a7b0), slot(7)
kernel: [3613612.312039] scsi target4:0:10: enclosure level(0x0000), connector name( )
kernel: [3613612.312040] sd 4:0:10:0: No reference found at driver, assuming scmd(0x00000000dac86408) might have completed
kernel: [3613612.312042] sd 4:0:10:0: task abort: SUCCESS scmd(0x00000000dac86408)
```
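For anyone who wants to look at the same spot: the affected drive can be inspected right after such an event, e.g. with smartctl (sdg is the device from the messages above; the -d sat switch may be needed because the SATA SSDs sit behind the SAS HBA):
```
# Extended SMART/device statistics for the drive named in the kernel messages.
# /dev/sdg is taken from the log above; -d sat is often required for SATA
# drives attached to a SAS HBA.
smartctl -x /dev/sdg
smartctl -x -d sat /dev/sdg
```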

There are many of these blocks, until the kernel apparently has had enough of them and just resets the device/interface:
```
kernel: [3613612.312267] sd 4:0:10:0: attempting task abort!scmd(0x00000000d7aaff5a), outstanding for 31388 ms & timeout 30000 ms
kernel: [3613612.312269] sd 4:0:10:0: [sdg] tag#520 CDB: Write(10) 2a 00 be 11 b4 e0 00 00 08 00
kernel: [3613612.312269] scsi target4:0:10: handle(0x0013), sas_address(0x4433221104000000), phy(4)
kernel: [3613612.312270] scsi target4:0:10: enclosure logical id(0x500605b00e70a7b0), slot(7)
kernel: [3613612.312271] scsi target4:0:10: enclosure level(0x0000), connector name( )
kernel: [3613612.312272] sd 4:0:10:0: No reference found at driver, assuming scmd(0x00000000d7aaff5a) might have completed
kernel: [3613612.312273] sd 4:0:10:0: task abort: SUCCESS scmd(0x00000000d7aaff5a)
kernel: [3613612.653004] sd 4:0:10:0: Power-on or device reset occurred
kernel: [3613613.254064] sd 4:0:10:0: Power-on or device reset occurred
```
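Something like the following gives a feel for how often this happens and whether it clusters on particular targets (dmesg -T just adds readable timestamps):
```
# Count task aborts and device resets per SCSI target since boot
dmesg -T | grep 'device reset occurred' | grep -oE 'sd [0-9]+:[0-9]+:[0-9]+:[0-9]+' | sort | uniq -c
dmesg -T | grep 'attempting task abort' | grep -oE 'sd [0-9]+:[0-9]+:[0-9]+:[0-9]+' | sort | uniq -c
```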

The OSD nodes are equipped with an "LSI 9300-8i SAS HBA", and we use two types of SSDs: "Intel SSD D3-S4510 Series 1.92 TB" and "Micron 5210 ION 1.92 TB". Since the resets happen on both SSD models, we figured the least common denominator is the HBA, so we upgraded one OSD node to the latest firmware/BIOS. Sadly, this did not solve the issue.
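In case it is relevant, the driver and firmware versions can be cross-checked roughly like this (sas3flash is Broadcom's flash utility for SAS3 HBAs; its presence on the node is an assumption):
```
# Loaded mpt3sas driver version
modinfo mpt3sas | grep -i '^version'
# Firmware/BIOS versions as seen by the kernel at probe time
dmesg | grep -i 'mpt3sas' | grep -iE 'fwversion|bios'
# Firmware/BIOS versions as reported by the flash utility (if installed)
sas3flash -listall
```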


* Does anyone have a similar hardware configuration and issues with it?
* Do you have an idea what could cause this behaviour?
* Or which part should we investigate further?

Thanks for your hints and time reading :)
M