One reason for such observations is swap usage. If you have swap configured,
you should probably disable it. Swap can be useful with ceph, but you really
need to know what you are doing and how swap actually works (its purpose is not
to provide more RAM, as most people tend to believe).
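Checking and disabling swap is straightforward; a minimal sketch (the fstab
edit is what makes it permanent across reboots):

  swapon --show     # list active swap devices/files
  free -h           # shows total and used swap
  swapoff -a        # disable all swap at runtime
  # also remove or comment out the swap entries in /etc/fstab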
In my case, I have a substantial amount of swap configured. One then needs to be
aware of its impact on certain ceph operations. Code and data that are rarely
used, as well as leaked memory, will end up in swap. During normal operation
that is not a problem. During exceptional operations, however, you will likely
end up in a situation where all OSDs try to swap the same code/data in and out
at the same time, which can temporarily lead to very large response latencies.
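If you want to see how much of each OSD process currently sits in swap, a quick
sketch using /proc:

  for pid in $(pgrep ceph-osd); do
      printf 'ceph-osd pid %s: ' "$pid"
      grep VmSwap "/proc/$pid/status"
  done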
One such exceptional operation is a large peering operation. The code/data for
peering is rarely used, so it will be in swap. The increased latency can be bad
enough for the MONs to mark OSDs down for a short while; I have seen that
happen. Usually the cluster recovers very quickly, and this is not a real issue
when an OSD actually fails.
When you add/remove disks, however, it can be irritating. The workaround is to
set nodown in addition to noout while doing the admin work. This will not only
speed up peering dramatically, it will also make the cluster ignore the
increased heartbeat ping times during the admin operation. I still see the
warnings, but no detrimental effects.
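Concretely, something along these lines around the maintenance window:

  ceph osd set noout
  ceph osd set nodown
  # ... add/remove the disks ...
  ceph osd unset nodown
  ceph osd unset noout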
In general, deploying swap in a ceph cluster is the exception rather than the
rule. The most common use is to allow a cluster to recover during a period of
increased RAM requirements. There are cases on this list, for both MDS and OSD
recoveries, where having more address space was the only way forward. If
deployed during normal operation, swap really needs to be fast and able to
handle simultaneous requests from many processes in parallel. Usually only RAM
is fast enough, so don't buy NVMe drives for swap, just buy more RAM. Having
some fast drives in stock for emergency swap deployment is a good idea, though.
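If you do have to deploy emergency swap on a spare fast drive, a rough sketch
(the device name below is just a placeholder):

  mkswap /dev/nvme0n1p1
  swapon --priority 10 /dev/nvme0n1p1
  sysctl vm.swappiness=10   # discourage swapping during normal operation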
I deployed swap to cope with a memory leak that was present in mimic 13.2.8; it
seems to be fixed in 13.2.10. If the swap device is fast enough, the impact is
noticeable but harmless. Swap on a crappy disk is dangerous.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: 08 January 2021 23:58:43
To: ceph-users@ceph.io
Subject: [ceph-users] Re: osd gradual reweight question
Hi,
> We are replacing HDDs with SSDs, and we first (gradually) drain (reweight) the
> HDDs in 0.5 steps until 0 = empty.
> Works perfectly.
> Then (just for kicks) I tried reducing an HDD's weight from 3.6 to 0 in one
> large step. That seemed to have more impact on the cluster, and we even
> noticed some OSDs temporarily go down after a few minutes. It all worked out,
> but the impact seemed much larger.
Please clarify “impact”. Do you mean that client performance was decreased, or
something else?
> We never had OSDs go down when gradually reducing the weight step by step.
> This surprised us.
Please also clarify what you mean by going down — do you mean being marked
“down” by the mons, or the daemons actually crashing? I’m not being critical —
I want to fully understand your situation.
> Is it expected that the impact of a sudden reweight from 3.6 to 0 is bigger
> than a gradual step-by-step decrease?
There are a lot of variables there, so It Depends.
For sure going in one step means that more PGs will peer, which can be
expensive. I’ll speculate, with incomplete information, that this is most of
what you’re seeing.
> I would assume the impact to be similar, only the time it takes to reach
> HEALTH_OK to be longer.
The end result, yes — the concern is how we get there.
The strategy of incremental downweighting has some advantages:
* If something goes wrong, you can stop without having a huge delta of data to
move before health is restored
* Peering is spread out
* Impact on the network and drives *may* be less at a given time
A disadvantage is that you end up moving some data more than once. This was
worse with older releases and older CRUSH tunables than with recent deployments.
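For reference, the incremental drain boils down to something like this,
assuming CRUSH weights and a placeholder osd.12:

  ceph osd crush reweight osd.12 3.1   # 3.6 -> 3.1
  # wait for the cluster to settle / return to HEALTH_OK, then repeat
  ceph osd crush reweight osd.12 2.6
  # ... and so on, down to
  ceph osd crush reweight osd.12 0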
The impact due to data movement can be limited by lowering the usual
recovery/backfill settings from their defaults to 1 and, depending on the
release, by adjusting osd_op_queue_cut_off.
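For example, assuming a release with the centralized config store (on older
releases the same options can be injected with 'ceph tell osd.* injectargs'):

  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_op_queue_cut_off high   # favor client ops over recovery, where supported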
The impact due to peering can be limited by spreading out peering, either
through an incremental process like yours, or by letting the balancer module do
the work.
There are other strategies as well, e.g. disabling rebalancing, downweighting
OSDs in sequence or a little at a time, then re-enabling rebalancing once the
weight reaches 0.
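A sketch of that last variant, again with a placeholder OSD id:

  ceph osd set norebalance
  ceph osd crush reweight osd.12 0   # peering happens now, but no data moves yet
  ceph osd unset norebalance         # start the actual data movement when ready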
> Thanks,
> MJ
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io