For comparison, a memory usage graph of a freshly deployed host with 
buffered_io=true: https://imgur.com/a/KUC2pio . Note the very rapid increase in 
buffer usage.

OK, so you are using a self-made dashboard definition. I was hoping someone had 
published something ready-made, as I try to avoid starting from scratch.

Best regards and good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Björn Dolkemeier <b.dolkeme...@dbap.de>
Sent: 13 February 2021 09:33:12
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16

I will definitely follow your steps and apply bluefs_buffered_io=true via 
ceph.conf and restart. My first try was to set it dynamically. I'll report back 
when it's done.
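
For reference, I plan to add roughly the following to ceph.conf and then 
restart the OSDs host by host (the [osd] placement and the systemd target are 
just what I assume for our setup):

[osd]
bluefs_buffered_io = true

# on each host, after updating ceph.conf:
systemctl restart ceph-osd.target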

We monitor our clusters via Telegraf (Ceph input plugin) and InfluxDB, with a 
custom Grafana dashboard tailored to our needs.
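
In case it is useful as a starting point, the Telegraf side is essentially just 
the stock Ceph input plus an InfluxDB output, roughly like this (the URL and 
database name are placeholders):

[[inputs.ceph]]
  socket_dir = "/var/run/ceph"
  gather_admin_socket_stats = true
  gather_cluster_stats = true

[[outputs.influxdb]]
  urls = ["http://influxdb.example:8086"]
  database = "ceph"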

Björn

> On 13.02.2021 at 09:23, Frank Schilder <fr...@dtu.dk> wrote:
>
> Ahh, OK. I'm not sure it has that effect. What people observed was that 
> RocksDB access became faster due to system buffer cache hits. This has an 
> indirect influence on data access latency.
>
> The typical case is "high IOPS on the WAL/DB device after upgrade", and setting 
> bluefs_buffered_io=true got this back to normal, also improving client 
> performance as a result.
>
> Your latency graphs actually look suspiciously like it should work for you. 
> Are you sure the OSD is using the value? I had problems setting some 
> parameters; I needed to include them in the ceph.conf file and restart to 
> force them through.
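>
> You can check what the OSD is actually running with, e.g. (osd.0 is just a 
> placeholder; the daemon command has to be run on the host where that OSD lives):
>
> ceph config show osd.0 bluefs_buffered_io
> # or via the admin socket:
> ceph daemon osd.0 config get bluefs_buffered_io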
>
> A sign that bluefs_buffered_io=true is applied is rapidly increasing system 
> buffer usage reported by top or free. If the values reported are similar for 
> all hosts, bluefs_buffered_io is still disabled.
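>
> For example, something like
>
> watch -n 10 free -h
>
> on each OSD host makes this easy to spot; the buff/cache column should climb 
> steadily after a restart on hosts where the option is active.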
>
> If I may ask, what framework are you using to pull these graphs? Is there a 
> Grafana dashboard one can download somewhere, or is it something you 
> implemented yourself? I plan to enable Prometheus on our cluster, but don't 
> know of a good data sink providing a pre-defined dashboard.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Björn Dolkemeier <b.dolkeme...@dbap.de>
> Sent: 13 February 2021 08:51:11
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
>
> Thanks for the quick reply, Frank.
>
> Sorry, the graphs/attachment were filtered. Here is an example of one 
> latency: 
> https://drive.google.com/file/d/1qSWmSmZ6JXVweepcoY13ofhfWXrBi2uZ/view?usp=sharing
>
> I’m aware that the overall performance depends on the slowest OSD.
>
> What I expect is that bluefs_buffered_io=true set on one OSD shows up as 
> reduced latencies for that particular OSD.
>
> Best regards,
> Björn
>
> On 13.02.2021 at 07:39, Frank Schilder <fr...@dtu.dk> wrote:
>
> The graphs were forgotten or filtered out.
>
> Changing the buffered_io value on one host will not change client IO 
> performance, as it's always the slowest OSD that's decisive. However, it should 
> have an effect on the IOPS load reported by iostat on the disks of that host.
>
> Does setting bluefs_buffered_io=true on all hosts have an effect on client 
> IO? Note that it might need a restart even if the documentation says 
> otherwise.
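>
> A quick sketch of what I mean (the 5-second interval is just an example, and 
> ceph config set is only one way to set the option cluster-wide):
>
> ceph config set osd bluefs_buffered_io true
> # restart the OSDs, then compare iostat on the DB/WAL devices before/after:
> iostat -x 5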
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Björn Dolkemeier <b.dolkeme...@dbap.de>
> Sent: 13 February 2021 07:16:06
> To: ceph-users@ceph.io
> Subject: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
>
> Hi,
>
> after upgrading Ceph from 14.2.8 to 14.2.16 we experienced increased 
> latencies. There were no changes in hardware, configuration, workload or 
> networking, just a rolling update via ceph-ansible on a running production 
> cluster. The cluster consists of 16 OSDs (all SSD) across 4 nodes. The VMs 
> served via RBD from this cluster currently suffer from high I/O-wait CPU.
>
> These are some latencies that are increased after the update:
> - op_r_latency
> - op_w_latency
> - kv_final_lat
> - state_kv_commiting_lat
> - submit_lat
> - subop_w_latency
>
> Do these latencies point to KV/RocksDB?
>
> These are some latencies which are NOT increased after the update (raw 
> counters can be checked as sketched below):
> - kv_sync_lat
> - kv_flush_lat
> - kv_commit_lat
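>
> For reference, I read all of these from the OSD perf counters; on a single OSD 
> they can also be dumped directly, e.g. (osd.0 and the jq filter are just 
> illustrative):
>
> ceph daemon osd.0 perf dump | jq '.osd.op_w_latency, .bluestore.submit_lat'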
>
> I attached one graph showing the massive increase after the update.
>
> I tried setting bluefs_buffered_io=true (as its default value was changed 
> and it was mentioned as performance-relevant) for all OSDs on one host, but 
> this did not make a difference.
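>
> For completeness, I set it dynamically, roughly like this (N stands for the 
> OSD ids on that host):
>
> ceph tell osd.N injectargs '--bluefs_buffered_io=true'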
>
> The ceph.conf is fairly simple:
>
> [global]
> cluster network = xxx
> fsid = xxx
> mon host = xxx
> public network = xxx
>
> [osd]
> osd memory target = 10141014425
>
> Any ideas what to try? Help appreciated.
>
> Björn
>
> --
>
> dbap GmbH
> phone +49 251 609979-0 / fax +49 251 609979-99
> Heinr.-von-Kleist-Str. 47, 48161 Muenster, Germany
> http://www.dbap.de
>
> dbap GmbH, Sitz: Muenster
> HRB 5891, Amtsgericht Muenster
> Geschaeftsfuehrer: Bjoern Dolkemeier
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
