[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Igor. I mentioned earlier that, according to the OSD logs, compaction wasn't an issue. I did run `ceph-kvstore-tool` offline, though; it completed rather quickly without any warnings or errors, but the OSD kept showing excessive latency. I did something rather radical: rebooted the node and r

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Igor Fedotov
Hi Zakhar, you might want to try offline DB compaction using ceph-kvstore-tool for this specific OSD. We periodically observe OSD performance drop due to degraded RocksDB performance, particularly after bulk data removal/migration. Compaction is quite helpful in this case. Thanks, Igor On 2
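For reference, a minimal sketch of the offline compaction Igor describes, assuming a non-containerized OSD with the default data path (the OSD id 11 is a placeholder, and the daemon must be stopped first; on cephadm/containerized deployments the unit name and path differ):

systemctl stop ceph-osd@11
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-11 compact
systemctl start ceph-osd@11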

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Eugen, Thanks again for your suggestions! The cluster is balanced; OSDs on this host and other OSDs in the cluster are almost evenly utilized:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS ...
11 hdd 9.38680 1.0 9.4 TiB 1.2
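For context, per-OSD utilization tables like the one above come from `ceph osd df`; the tree variant groups OSDs by host, which makes cross-host comparison easier:

ceph osd df
ceph osd df tree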

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block
I don't see anything obvious in the pg output; they are relatively small and don't hold many objects. If deep-scrubs impacted performance that much, you would see it in the iostat output as well. Have you watched it for a while, maybe with the -xmt options, to see the %util column as well?
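A hedged example of the iostat invocation Eugen refers to (5-second intervals; extended stats, megabytes, timestamps; the device names are placeholders for the suspect OSD's HDD and its DB/WAL NVMe):

iostat -xmt 5
iostat -xmt 5 sdc nvme0n1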

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Eugen. I very much appreciate your time and replies. It's a hybrid OSD with DB/WAL on NVMe (Micron_7300_MTFDHBE1T6TDG) and block storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x DB/WAL NVMe drives and 9 x HDDs each; each NVMe hosts the DB/WAL for 4-5 OSDs. The cluster was ins
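To verify which NVMe backs a given OSD's DB/WAL, the OSD metadata can be inspected; a sketch, assuming osd.11 is the slow OSD and that the usual BlueStore metadata keys are present:

ceph osd metadata 11 | grep -E '"devices"|"bluefs_db_devices"'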

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block
Those numbers look really high to me; more than 2 seconds for a write is awful. Is this an HDD-only cluster/pool? But even then it would be too high. I just compared with our HDD-backed cluster (although RocksDB is SSD-backed), which also mainly serves RBD to OpenStack. What is the general ut
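For a quick cluster-wide comparison, `ceph osd perf` prints per-OSD commit and apply latencies, which should make a single outlier OSD stand out at a glance:

ceph osd perf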

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Eugen! It's a bunch of entries like this: https://pastebin.com/TGPu6PAT. I'm not really sure what to make of them. I checked adjacent OSDs and they have similar ops, but they aren't showing excessive latency. /Z On Thu, 27 Apr 2023 at 10:42, Eugen Block wrote: > Hi, > > I would monitor the
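For comparison with adjacent OSDs, the same kind of ops can be dumped from their admin sockets; the OSD id here is a placeholder:

ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops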

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block
Hi, I would monitor the historic_ops_by_duration for a while and see if any specific operation takes unusually long.
# this is within the container
[ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
{
    "size": 20,
    "duration": 600,
    "ops": [
        {
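If 20 ops over 600 seconds is too small a window to catch the slow operations, the history is tunable through the standard OSD options; the values below are only examples:

ceph config set osd osd_op_history_size 40
ceph config set osd osd_op_history_duration 1200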

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-26 Thread Zakhar Kirpichenko
As suggested by someone, I tried `dump_historic_slow_ops`. There aren't many, and they're somewhat difficult to interpret:
"description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+writ
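To make such dumps easier to interpret, jq can reduce each op to its duration and description; a sketch, assuming jq is available on the host and osd.11 is the slow OSD:

ceph daemon osd.11 dump_historic_slow_ops | jq '.ops[] | {duration, description}'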