On Fri, 19 Sep 2025 at 14:11, Niklas Hambüchen <[email protected]> wrote:
>
> Hi Boris,
>
> > If you have misplaced objects, the OSDs were marked out and ceph started to 
> > move the PGs to other node
> Does this really affect the question though?
> Even if Ceph started to move objects to other nodes for 5 minutes, why would 
> undoing *that* after the 5 minutes take 3 hours?
>
> Shouldn't moving stuff back into a node take roughly as long as moving it out?
>
> Separate:
> As far as I can tell, my nodes were down too short to be marked as `out`.

Well, if they were away long enough to get "out", then a recovery that
takes hours is somewhat reasonable even for ~5m downtimes.

In that case, I think the scenario is like this: you have an OSD.1
with x PGs on it, let's say one of them is PG 3.34a.
It normally lives on this OSD.1 and also on OSD.12 and OSD.23.

Then OSD.1 is gone so long that the cluster starts "repairing" the hole
it left, so it grabs the "next OSD to hold 3.34a", which is OSD.34. So
OSD.34 starts creating an empty PG 3.34a and tries to fill it with data
for a few minutes, while OSD.12 takes on writes which go to OSD.23 as
well, and would end up on OSD.34 too, except it's busy backfilling from
scratch. This means the PG 3.34a version or date or whatever keeps
track of history moves forward a lot while OSD.1 is gone.

Then 5m later, OSD.1 comes back, but its version number is too old, so
neither OSD.12 nor OSD.23 has enough history to just replay what
happened in the last few minutes (for recovery). In that case the temp
copy on OSD.34 gets erased and OSD.1 starts a full backfill from OSD.12
or OSD.23 in order to become a fully updated copy of PG 3.34a.
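
To make the "replay vs. full backfill" decision above concrete, here is
a rough toy model in Python. It is not Ceph's actual code or data
structures: the class ToyPG, the max_log_entries cap and the method
plan_catch_up are all made up for illustration. It only shows the idea
that a returning copy can be caught up from retained history if its
version is still covered by that history, and otherwise needs every
object copied again.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyPG:
    """Toy stand-in for one PG as seen by the up-to-date replicas."""

    max_log_entries: int  # hypothetical cap on retained history
    version: int = 0
    log: deque = field(default_factory=deque)  # (version, object name)
    objects: dict = field(default_factory=dict)

    def write(self, name, data):
        # Every write bumps the version and appends one history entry.
        self.version += 1
        self.objects[name] = data
        self.log.append((self.version, name))
        # Old history is trimmed once the cap is exceeded.
        while len(self.log) > self.max_log_entries:
            self.log.popleft()

    def plan_catch_up(self, replica_version):
        """Return (mode, object names to copy) for a returning replica."""
        if replica_version == self.version:
            return "clean", []
        oldest = self.log[0][0] if self.log else self.version + 1
        if replica_version >= oldest - 1:
            # The retained history covers everything the replica missed.
            names = sorted({n for v, n in self.log if v > replica_version})
            return "recovery", names
        # History was trimmed past the replica's version: copy everything.
        return "backfill", sorted(self.objects)


pg = ToyPG(max_log_entries=3)
for i in range(10):
    pg.write(f"obj{i}", b"x")

print(pg.plan_catch_up(8))  # replica missed only the last two writes
print(pg.plan_catch_up(2))  # replica is older than the retained history
```

In this sketch the replica at version 8 only needs the two objects
written after it left, while the replica at version 2 falls outside the
retained history and has to receive every object in the PG.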

And this backfill is the one that takes hours, not just "the small
diff of what occurred in 5m".
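
As a back-of-the-envelope illustration of that difference, the Python
sketch below compares copying only the writes from the downtime with
copying whole PGs. Every number in it (PG count, PG size, write rate,
copy rate) is a made-up assumption and not a measurement from any
cluster, and it ignores parallelism and throttling details.

```python
# All values below are hypothetical, chosen only to show the ratio.
pgs_on_osd = 100                      # PGs that lived on the returning OSD
pg_bytes = 10 * 1024**3               # data per PG
write_rate_per_pg = 0.1 * 1024**2     # client writes per PG, bytes/second
downtime_seconds = 5 * 60             # the ~5m outage
copy_rate = 100 * 1024**2             # rate at which the OSD can be refilled


def transfer_seconds(bytes_to_copy, bytes_per_second):
    """Time to move a given amount of data at a constant rate."""
    return bytes_to_copy / bytes_per_second


# Catching up from history: only what changed during the outage.
delta_bytes = pgs_on_osd * write_rate_per_pg * downtime_seconds
recovery_seconds = transfer_seconds(delta_bytes, copy_rate)

# Full backfill: every PG is copied in its entirety.
backfill_seconds = transfer_seconds(pgs_on_osd * pg_bytes, copy_rate)

print(f"recovery: {recovery_seconds:.0f} s")
print(f"backfill: {backfill_seconds / 3600:.1f} h")
```

The point is only the ratio: the cost of a backfill depends on how much
data the PGs hold, not on how long the OSD was away.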

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
