[ceph-users] Re: Latency increase after upgrade 14.2.8 to 14.2.16

2021-02-14 Thread Björn Dolkemeier
Setting bluefs_buffered_io=true via restart of (all) OSDs didn’t change 
anything.

But I made another observation: Once a week a lot of number of objects (and 
space) is reclaimed because of fstrim running inside the VMs. After this the 
latency is fine for about 12 hours or so and is then gradually getting worse.

Here is the visualisation (it shows a magnitude latency drop for kv_final_lat):

https://drive.google.com/file/d/1j4S4KXyZigRGKX-kng9KDU2QNv9qN3YF/view?usp=sharing
 

https://drive.google.com/file/d/1RTnQp8qeqiF04hBBjAZ5tw07_KBPElnq/view?usp=sharing

Before the upgrade to 14.2.16 this did not happen:

https://drive.google.com/file/d/1cUm2SaQ7XBmLwDPnUM4esrx-sbbnO-hi/view?usp=sharing

My first impulse  (maybe the new version has higher cache/space requirements) 
was to raise „osd memory target“ for one OSD to see if this has an effect. But 
this didn’t work.

IMHO the increased latencies point to RocksDB. Maybe the default settings 
regarding RocksDB-caches do not fit anymore? Is anyone aware of changes between 
14.2.8 and 14.2.16 regarding to RocksDB that could lead to this behaviour? 

Thanks for reading and support,
Björn


> Am 13.02.2021 um 09:42 schrieb Frank Schilder :
> 
> For comparison, a memory usage graph of a freshly deployed host with 
> buffered_io=true: https://imgur.com/a/KUC2pio . Note the very rapid increase 
> of buffer usage.
> 
> OK, so you are using a self-made dashboard definition. I was hoping that 
> people published something, I try to avoid starting from scratch.
> 
> Best regards and good luck,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Björn Dolkemeier 
> Sent: 13 February 2021 09:33:12
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
> 
> I will definitely follow your steps and apply  bluefs_buffered_io=true via 
> ceph.conf and restart. My first try was to update these dynamically. I’ll 
> report when it’s done.
> 
> We monitor our clusters via Telegraf (Ceph input Plugin) and InfluxDB and a 
> custom Grafana dashboard fitted for our needs.
> 
> Björn
> 
>> Am 13.02.2021 um 09:23 schrieb Frank Schilder :
>> 
>> Ahh, OK. I'm not sure if it has that effect. What people observed was, that 
>> rocks-DB access became faster due to system buffer cache hits. This has an 
>> indirect influence on data access latency.
>> 
>> The typical case is "high IOPs on WAL/DB device after upgrade" and setting 
>> bluefs_buffered_io=true got this back to normal also improving client 
>> performance as a result.
>> 
>> Your latency graphs look actually suspiciously like it should work for you. 
>> Are you sure the OSD is using the value? I had problems with setting some 
>> parameters, I needed to include them in the ceph.conf file and restart to 
>> force them through.
>> 
>> A sign that bluefs_buffered_io=true is applied is rapidly increasing system 
>> buffer usage reported by top or free. If the values reported are similar for 
>> all hosts, bluefs_buffered_io is still disabled.
>> 
>> If I may ask, what framework are you using to pull these graphs? Is there a 
>> graphana dashboard one can download somewhere or is it something you 
>> implemented yourself? I plan to enable prometheus on our cluster, but don't 
>> know about a good data sink providing a pre-defined dashboard.
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Björn Dolkemeier 
>> Sent: 13 February 2021 08:51:11
>> To: Frank Schilder
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
>> 
>> Thanks for the quick reply, Frank.
>> 
>> Sorry, the graphs/attachment where filtered. Here is an example of one 
>> latency: 
>> https://drive.google.com/file/d/1qSWmSmZ6JXVweepcoY13ofhfWXrBi2uZ/view?usp=sharing
>> 
>> I’m aware that the overall performance depends on the slowest OSD.
>> 
>> What I expect is that  bluefs_buffered_io=true set on one OSD reflects in 
>> dropped latencies for that particular OSD.
>> 
>> Best regards,
>> Björn
>> 
>> Am 13.02.2021 um 07:39 schrieb Frank Schilder 
>> mailto:fr...@dtu.dk>>:
>> 
>> The graphs were forgotten or filtered out.
>> 
>> Changing the buffered_io value on one host will not change client IO 
>> performance as its always the slowest OSD thats decisive. However, it should 
>> have an effect on the IOP/s load reported by iostat on the disks on the host.
>> 
>> Does setting bluefs_buffered_io=true on all hosts have an effect on client 
>> IO? Note that it might need a restart even if the documentation says 
>> otherwise.
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: 

[ceph-users] Re: share haproxy config for radosgw [EXT]

2021-02-14 Thread Tony Liu
You can have BGP-ECMP to multiple HAProxy instances to support
active-active mode, instead of using keepalived for active-backup mode,
if the traffic amount does required multiple HAProxy instances.

Tony

From: Graham Allan 
Sent: February 14, 2021 01:31 PM
To: Matthew Vernon
Cc: ceph-users
Subject: [ceph-users] Re: share haproxy config for radosgw [EXT]

On Tue, Feb 9, 2021 at 11:00 AM Matthew Vernon  wrote:

> On 07/02/2021 22:19, Marc wrote:
> >
> > I was wondering if someone could post a config for haproxy. Is there
> something specific to configure? Like binding clients to a specific backend
> server, client timeouts, security specific to rgw etc.
>
> Ours is templated out by ceph-ansible; to try and condense out just the
> interesting bits:
>
> (snipped the config...)
>
> The aim is to use all available CPU on the RGWs at peak load, but to
> also try and prevent one user overwhelming the service for everyone else
> - hence the dropping of idle connections and soft (and then hard) limits
> on per-IP connections.
>

Can I ask a followup question to this: how many haproxy instances do you
then run - one on each of your gateways, with keepalived to manage which is
active?

I ask because, since before I was involved with our ceph object store, it
has been load-balanced between multiple rgw servers directly using
bgp-ecmp. It doesn't sound like this is common practise in the ceph
community, and I'm wondering what the pros and cons are.

The bgp-ecmp load balancing has the flaw that it's not truly fault
tolerant, at least without additional checks to shut down the local quagga
instance if rgw isn't responding - it's only fault tolerant in the case of
an entire server going down, which meets our original goals of rolling
maintenance/updates, but not a radosgw process going unresponsive. In
addition I think we have always seen some background level of clients being
sent "connection reset by peer" errors, which I have never tracked down
within radosgw; I wonder if these might be masked by an haproxy frontend?

The converse is that all client gateway traffic must generally pass through
a single haproxy instance, while bgp-ecmp distributes the connections
across all nodes. Perhaps haproxy is lightweight and efficient enough that
this makes little difference to performance?

Graham
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is replacing OSD whose data is on HDD and DB is on SSD supported?

2021-02-14 Thread Tony Liu
​Hi,

I've been trying with v15.2 and v15.2.8, no luck.
Wondering if this is actually supported or ever worked for anyone?

Here is what I've done.
1) Create a cluster with 1 controller (mon and mgr) and 3 OSD nodes,
   each of which is with 1 SSD for DB and 8 HDDs for data.
2) OSD service spec.
service_type: osd
service_id: osd-spec
placement:
 hosts:
 - ceph-osd-1
 - ceph-osd-2
 - ceph-osd-3
spec:
  block_db_size: 92341796864
  data_devices:
model: ST16000NM010G
  db_devices:
model: KPM5XRUG960G
3) Add OSD hosts and apply OSD service spec. 8 OSDs (data on HDD and
   DB on SSD) are created on each host properly.
4) Run "orch osd rm 1 --replace --force". OSD is marked "destroyed" and
   reweight is set to 0 in "osd tree". "pg dump" shows no PG on that OSD.
   "orch ps" shows no daemon running for that OSD.
5) Run "orch device zap  ". VG and LV for HDD are removed.
   LV for DB stays. "orch device ls" shows HDD device is available.
6) Cephadm finds OSD claims and applies OSD spec on the host.
   Here is the message.
   
   cephadm [INF] Found osd claims -> {'ceph-osd-1': ['1']}
   cephadm [INF] Found osd claims for drivegroup osd-spec -> {'ceph-osd-1': 
['1']}
   cephadm [INF] Applying osd-spec on host ceph-osd-1...
   cephadm [INF] Applying osd-spec on host ceph-osd-2...
   cephadm [INF] Applying osd-spec on host ceph-osd-3...
   cephadm [INF] ceph-osd-1: lvm batch --no-auto /dev/sdc /dev/sdd
 /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
 --db-devices /dev/sdb --block-db-size 92341796864
 --osd-ids 1 --yes --no-systemd
   code: 0
   out: ['']
   err: ['/bin/docker:stderr --> passed data devices: 8 physical, 0 LVM',
   '/bin/docker:stderr --> relative data size: 1.0',
   '/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM',
   '/bin/docker:stderr --> 1 fast devices were passed, but none are available']
   

Q1. Is DB LV on SSD supposed to be deleted or not, when replacing an OSD
whose data is on HDD and DB is on SSD?
Q2. If yes from Q1, is a new DB LV supposed to be created on SSD as long as
there is sufficient free space, when building the new OSD?
Q3. If no from Q1, since it's replacing, is the old DB LV going to be reused
for the new OSD?

Again, is this actually supposed to work? Am I missing anything or just trying
on some unsupported feature?


Thanks!
Tony

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io