I appreciate all the tips! And thanks for the observation on weights. I
don't know how the weight got set to 1.0 for all OSDs. The cluster has a
mixture of 8 and 10T drives. Is there a way to automatically readjust
them, or is this done manually in the CRUSH map (decompile/edit/compile)?
I ran ceph osd crush reweight osd.75 1.0 and it started recovering right
away at 3-4 Gbit/s sustained throughput. I know this is a band-aid; I'm
waiting on your guidance on how to adjust the weights above.
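In case it helps frame the question, this is the sort of scripted pass I
had in mind as the alternative to hand-editing the map -- just a sketch,
and the "id"/"kb" fields I'm reading out of the ceph osd df JSON would
need verifying before running anything:

ceph osd df -f json | jq -r '.nodes[] | "\(.id) \(.kb)"' |
while read id kb; do
  tib=$(echo "scale=5; $kb / 1073741824" | bc)    # KiB -> TiB
  echo ceph osd crush reweight osd.$id $tib       # echo first; run after review
done
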
Here is the requested additional output:
# ceph -v
ceph version 18.2.4 (..) reef (stable)
NB: Once the cluster is stable and back to HEALTH_OK, I plan to upgrade to
19.2.0 via ceph orch.
# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "fs01_data-ec",
        "type": 3,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -2,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "central.rgw.buckets.data",
        "type": 3,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -2,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000350",
    "last_optimize_started": "Wed Feb 26 14:01:03 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Some objects (0.003469) are degraded; try again later",
    "plans": []
}
On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <[email protected]> wrote:
>
> On Feb 26, 2025, at 7:47 AM, Deep Dish <[email protected]> wrote:
>
> Your parents had quite the sense of humor.
>
> Hello,
>
> I have an 80 OSD cluster (across 8 nodes). The average utilization across
> my OSDs is ~ 32%.
>
>
> Average isn’t what factors in here ...
>
> Recently the cluster had a bad drive, and it was replaced (same
> capacity).
>
>
> 1TB HDDs? How old is this gear?
> Oh, looks like your CRUSH weights don’t align with OSD TBs. Tricky. I
> suspect your drives are …. 8TB?
>
> So the one thing that sticks out straight away is OSD.75 and it having a
> different weight to all the other devices.
>
>
> That sure doesn’t help. I suspect that for some reason the CRUSH weights
> of all OSDs in the cluster were set to 1.0000 in the past. Which in and of
> itself is … okay, as operationally CRUSH weights are *relative* to each
> other. The replaced drive wasn’t brought up with that custom CRUSH weight,
> so it has the default TiB CRUSH weight.
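>
> (Back-of-the-envelope, since the default CRUSH weight is simply the
> device's usable size in TiB:
>
>   8 TB  ≈ 8 x 10^12 / 2^40  ≈ 7.3 TiB
>   10 TB ≈ 10 x 10^12 / 2^40 ≈ 9.1 TiB
>
> which matches the 7.3 / 9.1 TiB SIZE column below, and is why osd.75 came
> back at ~7.15 instead of the 1.00000 everything else is pinned at.)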
>
> As Frédéric suggests, do this NOW:
>
> ceph osd crush reweight osd.75 1.0000
>
> This will back off your immediate problem.
>
>
> >ceph osd reweight 75 1
>
> Without `crush` in there this would actually be a no-op ;)
>
> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have
> that 1.000 CRUSH weight, but that would bite you if you do legitimately add
> larger drives down the road.
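>
> e.g. (sketch):
>
> ceph config set osd osd_crush_initial_weight 1.0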
>
> I suggest reweighting all of your drives to 7.15359 at the same time by
> decompiling and editing the CRUSH map to avoid future problems.
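>
> The usual cycle is something along these lines (sketch; review the edited
> map before injecting it):
>
> ceph osd getcrushmap -o crushmap.bin
> crushtool -d crushmap.bin -o crushmap.txt
> (edit the osd item weights in crushmap.txt)
> crushtool -c crushmap.txt -o crushmap.new.bin
> ceph osd setcrushmap -i crushmap.new.bin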
>
> For the past week or so the cluster has been
> recovering, slowly,
>
>
> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
> drive, and `storcli64 /c0 show termlog`.
>
> See if there are any indications of one or more bad drives: lots of
> reallocated sectors, SATA downshifts, etc.
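>
> A quick first pass per host might look like (sketch; adjust the device
> glob and attribute names for your drives / HBA):
>
> for d in /dev/sd?; do
>   echo "== $d"
>   smartctl -a $d | egrep -i 'realloc|pending|uncorrect|crc'
> done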
>
> and reporting backfill_toofull. I can't figure out what's causing the
> issue given there's ample available capacity.
>
>
> Capacity and available capacity are different.
>
> Are you using EC? As wide as 8+2?
>
> usage: 197 TiB used, 413 TiB / 610 TiB avail
>
>
> > recovery: 16 MiB/s, 4 objects/s
>
> Small clusters recover more slowly, but that’s pretty slow for an 80 OSD
> cluster. Is this Reef or Squid with mclock?
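>
> If mclock is in play it's worth checking which profile you're on, e.g.
> (sketch):
>
> ceph config get osd osd_mclock_profile
> # temporarily favour recovery while the backfill drains, then revert:
> ceph config set osd osd_mclock_profile high_recovery_ops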
>
>
>
>
> # ceph osd df
>
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>
>
> Please set your MUA to not wrap
>
>
>  1  hdd  1.00000  1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108  up
>  9  hdd  1.00000  1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103  up
> 16  hdd  1.00000  1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109  up
> 27  hdd  1.00000  1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108  up
> 75  hdd  7.15359  1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356  up
> ...                                                                      TiB  32.01  0.99  105  up
>
>                             TOTAL   610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
>
> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>
>
> You don’t have a balancer enabled, or it isn’t working. Your available
> space is a function not only of the *full ratios but of your replication
> strategies and is relative to the *most full* OSD.
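>
> To illustrate "relative to the *most full* OSD": the backfill_toofull
> check is evaluated against the individual target OSD and
> backfillfull_ratio (0.90 by default), not against cluster-wide free
> space. You can confirm the ratios in effect with:
>
> ceph osd dump | grep ratio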
>
> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
>
>
>
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]