Thanks, Igor. I mentioned earlier that, according to the OSD logs, compaction
wasn't an issue. I did run `ceph-kvstore-tool` offline though; it completed
rather quickly without any warnings or errors, but the OSD kept showing
excessive latency.
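(A rough way to read per-op latency straight from the OSD's admin socket, with
osd.11 standing in for the affected OSD and assuming jq is available:

  ceph daemon osd.11 perf dump | jq '.osd | {op_r_latency, op_w_latency}'

The osd id and the counter selection are just examples.)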
I did something rather radical: rebooted the node and r
Hi Zakhar,
You might want to try offline DB compaction using ceph-kvstore-tool for
this specific OSD.
We periodically observe OSD perf drops due to degraded RocksDB
performance, particularly after bulk data removal/migration. Compaction
is quite helpful in this case.
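A rough sketch of what that looks like on a non-containerized deployment (the
OSD id and data path are only examples, adjust them to your setup, and keep the
OSD stopped while the tool runs):

  ceph osd set noout
  systemctl stop ceph-osd@11
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-11 compact
  systemctl start ceph-osd@11
  ceph osd unset noout

With cephadm the unit name and the path differ, so adjust accordingly.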
Thanks,
Igor
Eugen,
Thanks again for your suggestions! The cluster is balanced, OSDs on this
host and other OSDs in the cluster are almost evenly utilized:
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR  PGS  STATUS
...
11  hdd    9.38680  1.0       9.4 TiB  1.2
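(For reference, those are the columns of:

  ceph osd df

trimmed here to the header and a single OSD; the omitted rows look much the same.)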
I don't see anything obvious in the pg output; they are relatively
small and don't hold many objects. If deep-scrubs impacted
performance that much, you would see it in the iostat output as well.
Have you watched it for a while, maybe with the -xmt options to see the
%util column as well?
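Something along these lines (the 5-second interval is just an example):

  iostat -xmt 5

-x adds the extended statistics including %util, -m reports throughput in
MB/s, and -t prints a timestamp for each interval.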
Thanks, Eugen. I very much appreciate your time and replies.
It's a hybrid OSD with DB/WAL on NVMe (Micron_7300_MTFDHBE1T6TDG) and block
storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x
DB/WAL NVMe drives and 9 x HDDs each; each NVMe hosts DB/WAL for 4-5 OSDs. The
cluster was ins
Those numbers look really high to me; more than 2 seconds for a write
is awful. Is this an HDD-only cluster/pool? But even then it would be
too high. I just compared with our HDD-backed cluster (although its
RocksDB is SSD-backed), which also mainly serves RBD to OpenStack. What
is the general ut
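For a quick side-by-side of per-OSD latencies you could also look at:

  ceph osd perf

(commit/apply latency in milliseconds); if only this one OSD stands out there,
that's another hint the problem is local to it rather than coming from the
clients.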
Thanks, Eugen!
It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
really sure what to make of them. I checked adjacent OSDs and they have
similar ops, but aren't showing excessive latency.
/Z
On Thu, 27 Apr 2023 at 10:42, Eugen Block wrote:
> Hi,
>
> I would monitor the
Hi,
I would monitor the historic_ops_by_duration for a while and see if
any specific operation takes unusually long.
# this is within the container
[ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
{
    "size": 20,
    "duration": 600,
    "ops": [
        {
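The "size" and "duration" fields in that output correspond (as far as I know)
to the osd_op_history_size and osd_op_history_duration settings, so the window
can be widened if 20 ops over 600 seconds turns out to be too little, e.g.:

  ceph config set osd osd_op_history_size 40
  ceph config set osd osd_op_history_duration 1200

The values above are only examples.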
As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
many, and they're somewhat difficult to interpret:
"description": "osd_op(client.250533532.0:56821 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3518464~8192] snapc 0=[] ondisk+writ
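(A compact way to pull just the description and duration out of that output,
assuming jq is available and with osd.11 as a placeholder id:

  ceph daemon osd.11 dump_historic_slow_ops | jq '.ops[] | {description, duration}'

)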