> Furthermore, my expectation for a pool that has size=3 and min_size=2 is that
> for any given write, the primary OSD on nodes 1 or 2 will select a secondary
> OSD in node 1/2 respectively and one in node 3, which would then lead me to
> believe that any client writing into the cluster from node 1 will only ever
> have the latency between node 1 and node 2 as an actual performance penalty,
> because:
>
> * the client selects a primary OSD on node 1 or node 2
> * the primary OSD selects the secondary OSDs and starts the transfers in parallel
> * the write to the OSD with the lower latency will finish much sooner than
>   the one to the other OSD, leading to the write acknowledgement being sent
>   to the client, because min_size=2
>
> But that appears not to be the case. primary affinity has a very slight
> impact, but the overall performance
That is indeed not the case: the write is only reported as completed to the client once all three replicas have ACKed it. (min_size does not change when the client gets its ACK; it only controls how many replicas must be available for the PG to accept I/O at all.)

With the behaviour you describe, a client would rather quickly fill up any buffers on the fast OSD hosts to the point where the slow OSD host can't keep up. You would then be in a situation where the cluster "promises" to keep 3 copies but in reality only has two current ones. If the client keeps writing or changing data on the fast OSDs, you are guaranteed a really bad situation should one of the two fast nodes restart: you now have a lot of PGs for which the slow OSD host holds bad/old/stale data, and the one remaining fast OSD host gives you repl=1 "safety" for those PGs.

So Ceph will wait until it has 3 ACKs before the client gets its ACK.

--
May the most significant bit of your life be positive.
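
To make the latency consequence concrete, here is a minimal Python sketch (not Ceph code; the OSD names and latency figures are invented for illustration) that fans a write out to three simulated replicas in parallel and only acknowledges the client once every replica has ACKed. The client-visible latency tracks the slowest replica, which is why primary affinity alone cannot hide a slow third node.

```python
import concurrent.futures
import time

# Hypothetical per-replica write latencies (seconds); names and numbers
# are made up for illustration, not taken from any real cluster.
REPLICA_LATENCY = {
    "osd.node1": 0.001,   # local / fast link
    "osd.node2": 0.002,   # fast link
    "osd.node3": 0.050,   # slow link to the third node
}

def replicate(osd: str) -> float:
    """Simulate shipping the write to one replica and waiting for its ACK."""
    time.sleep(REPLICA_LATENCY[osd])
    return REPLICA_LATENCY[osd]

def client_write() -> float:
    """Fan the write out to all replicas in parallel, but only return
    (i.e. ACK the client) once *every* replica has ACKed -- the behaviour
    described above, not 'ack after min_size ACKs'."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(replicate, osd) for osd in REPLICA_LATENCY]
        concurrent.futures.wait(futures)   # wait for all replicas, not just 2 of 3
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = client_write()
    # The client-visible latency tracks the slowest replica (~50 ms here).
    print(f"client saw the ACK after {elapsed * 1000:.1f} ms")
```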
