Hi Anthony and Frank,

Thanks for your responses!

I think you have answered my question: the impact of one sudden, complete reweight to zero is bigger because of the increased amount of peering it triggers.

By "impact" I meant: OSDs being marked down by the cluster (and automatically coming back online). Client performance seemed basically unaffected, and no OSDs crashed.

And yes: the cluster recovers quickly from the OSDs that are (temporarily) down. Besides, I had also set noout, so the impact was limited anyway.

Next time I will also set nodown, thanks for that suggestion.

I had already adjusted osd_op_queue_cut_off and set the recovery/backfill settings to 1.

Thank you both for your answers! We'll continue with the gradual weight decreases. :-)

MJ

On 1/9/21 12:28 PM, Frank Schilder wrote:
One reason for such observations is swap usage. If you have swap configured, 
you should probably disable it. Swap can be useful with ceph, but you really 
need to know what you are doing and how swap actually works (it is not for 
providing more RAM as most people tend to believe).
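If you do disable it, a minimal sketch on an OSD host (standard Linux tools, nothing ceph-specific) would be:

    swapon --show   # see what is currently used as swap
    swapoff -a      # disable all active swap areas for this boot
    # also remove/comment out the swap entries in /etc/fstab so it
    # does not come back after a reboot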

In my case, I have substantial amounts of swap configured. Then one needs to be aware of its impact on certain ceph operations. Code and data that are rarely used, as well as leaked memory, will end up on swap. During normal operations,
that is not a problem. However, during exceptional operations, you are likely 
in a situation where all OSDs try to swap the same code/data in/out at the same 
time, which can temporarily lead to very large response latencies.
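A quick, hedged way to see how much of each OSD process currently lives on swap (using the VmSwap field the kernel reports in /proc):

    # per-OSD swap usage on one host
    for pid in $(pgrep -x ceph-osd); do
        printf 'pid %s: ' "$pid"
        grep VmSwap /proc/"$pid"/status
    done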

One such exceptional operation is a large peering event. The code/data for peering is rarely used, so it will be on swap. The increased latency can be bad enough for the MONs to mark OSDs as down for a short while; I have seen that happen. Usually, the cluster recovers very quickly, and this is not a real issue if an actual OSD fails.

If you add/remove disks, however, it can be irritating. The workaround is to set nodown in addition to noout when doing admin work. This not only speeds up peering dramatically, it also makes the cluster ignore the increased heartbeat ping times during the admin operation. I still see the warnings, but no detrimental effects.
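A sketch of the flag handling around such an admin operation (plain ceph CLI, flag names as above):

    ceph osd set noout    # don't mark OSDs out / start re-replication
    ceph osd set nodown   # ignore missed heartbeats while peering
    # ... add/remove the disks, wait for peering to settle ...
    ceph osd unset nodown
    ceph osd unset noout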

In general, deploying swap in a ceph cluster is more the exception than the rule. The most common use is to allow a cluster to recover during a period of increased RAM requirements. There are cases on this list, for both MDS and OSD recoveries, where having more address space was the only way forward. If
deployed during normal operation, swap really needs to be fast and be able to 
handle simultaneous requests from many processes in parallel. Usually, only RAM 
is fast enough, so don't buy NVMe drives, just buy more RAM. Having some fast 
drives in stock for emergency swap deployment is a good idea though.
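For completeness, a hedged sketch of such an emergency swap deployment on a spare fast drive (the device name is just a placeholder):

    mkswap /dev/nvme1n1                 # /dev/nvme1n1 is hypothetical
    swapon --priority 10 /dev/nvme1n1   # higher priority than existing swap
    # swapoff /dev/nvme1n1 again once the emergency is over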

I deployed swap to cope with a memory leak that was present in mimic 13.2.8. 
Seems to be fixed in 13.2.10. If swap is fast enough, the impact is there but 
harmless. Swap on a crappy disk is dangerous.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: 08 January 2021 23:58:43
To: ceph-users@ceph.io
Subject: [ceph-users] Re: osd gradual reweight question


Hi,

We are replacing HDDs with SSDs, and we first (gradually) drain (reweight) the HDDs in 0.5 steps until 0 = empty.

Works perfectly.
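For reference, the per-step commands look roughly like this (osd.12 is just a placeholder for the HDD being drained):

    ceph osd crush reweight osd.12 3.1   # from 3.6, first 0.5 step
    # wait for backfill to finish, then continue in 0.5 steps:
    ceph osd crush reweight osd.12 2.6
    ceph osd crush reweight osd.12 2.1
    # ... and so on, down to:
    ceph osd crush reweight osd.12 0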

Then (just for kicks) I tried reducing HDD weight from 3.6 to 0 in one large 
step. That seemed to have more impact on the cluster, and we even noticed some OSDs temporarily go down after a few minutes. It all worked out, but the
impact seemed much larger.

Please clarify “impact”.  Do you mean that client performance was decreased, or 
something else?

We never had OSDs go down when gradually reducing the weight step by step. This 
surprised us.

Please also clarify what you mean by going down — do you mean being marked 
“down” by the mons, or the daemons actually crashing?  I’m not being critical — 
I want to fully understand your situation.

Is it expected that the impact of a sudden reweight from 3.6 to 0 is bigger 
than a gradual step-by-step decrease?

There are a lot of variables there, so It Depends.

For sure going in one step means that more PGs will peer, which can be 
expensive.  I’ll speculate, with incomplete information, that this is most of what you’re seeing.

I would assume the impact to be similar, with only the time it takes to reach HEALTH_OK being longer.

The end result, yes — the concern is how we get there.

The strategy of incremental downweighting has some advantages:

* If something goes wrong, you can stop without having a huge delta of data to 
move before health is restored
* Peering is spread out
* Impact on the network and drives *may* be less at a given time

A disadvantage is that you end up moving some data more than once.  This was 
worse with older releases and CRUSH details than with recent deployments.

The impact due to data movement can be limited by lowering the usual recovery/backfill settings to 1 from their defaults and, depending on the release, by adjusting osd_op_queue_cut_off.
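A hedged sketch of those settings (option names as in recent releases; on older releases the same values can be injected with "ceph tell osd.* injectargs"):

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # availability/default of this one depends on the release:
    ceph config set osd osd_op_queue_cut_off high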

The impact due to peering can be limited by spreading out peering, either 
through an incremental process like yours, or by letting the balancer module do 
the work.
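If you let the balancer do it, enabling it in upmap mode is roughly (a sketch; upmap mode needs all clients to be luminous or newer):

    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status   # check progress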

There are other strategies as well, e.g. disabling rebalancing and downweighting OSDs in sequence or a little at a time, then enabling rebalancing again once the weight reaches 0.
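A sketch of the norebalance variant (osd.12 again a placeholder):

    ceph osd set norebalance           # hold back backfill of misplaced PGs
    ceph osd crush reweight osd.12 0   # peering happens, data movement is deferred
    # once peering has settled:
    ceph osd unset norebalance         # let backfill proceed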


Thanks,
MJ
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io