"leading to the write acknowledgement being sent to the client, because min_size=2"
This is a misunderstanding of min_size.

Ceph requires all replicas (or shards in EC) to ack the primary before the
client is acknowledged. The primary OSD, and therefore the client, is subject
to the latency of the hop between the primary and all replicas.

min_size is how many up-to-date replicas are required for the PG to be active
(that is, to accept I/O).

Respectfully,

*Wes Dillingham*

On Fri, Jan 9, 2026 at 4:36 PM Martin Gerhard Loschwitz via ceph-users <[email protected]> wrote:

> Folks,
>
> I have a knowledge question concerning the selection of secondary
> OSDs in Ceph.
>
> I have a cluster here that consists of three nodes. For the sake of
> argument, I have simulated latency between the third node and the other two
> using tc and netem. I have set the primary-affinity of the OSDs on the
> third node to 0, and indeed, RADOS is not using any of these OSDs as
> primary OSD, so this part works as expected.
>
> Furthermore, my expectation for a pool that has size=3 and min_size=2 is
> that for any given write, the primary OSD on node 1 or 2 will select a
> secondary OSD in node 1/2 respectively and one in node 3. Which would then
> lead me to believe that any client writing into the cluster from node 1
> will only ever have the latency between node 1 and node 2 as an actual
> performance penalty because
>
> * client selects primary OSD on node 1 or node 2
> * primary OSD selects secondary OSDs and starts transfer in parallel
> * Write to OSD with lower latency will finish much sooner than the one to
>   the other OSD, leading to the write acknowledgement being sent to the
>   client, because min_size=2
>
> But that appears not to be the case. primary-affinity has a very slight
> impact, but the overall performance when writing into the cluster with
> queue depth 1 and request size of 4k still very much resembles a scenario
> in which every single write appears to be latency-penalized with the
> latency between node 1/2 and node 3.
>
> Where is my understanding incorrect?
> Or are there any configuration settings for this? I tried to search for
> this, but the only results I can find refer to primary-affinity. I am
> looking for something like „secondary affinity“ I guess, but I do not
> think that such a thing exists in Ceph. Which leads me to believe that my
> understanding of this is seriously wrong somehow.
>
> Any hint will be greatly appreciated. Thank you very much in advance.
>
> Best regards
> Martin
>
> --
> Martin Gerhard Loschwitz
> Geschäftsführer / CEO, True West IT Services GmbH
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
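
To make the point of the reply at the top of this thread concrete, here is a toy latency model in Python. It is only a sketch and not Ceph code: the function names and the millisecond figures are made up for illustration, and the client-to-primary hop is ignored. It contrasts the model the question assumed (the ack goes out once min_size copies are written) with the behaviour described in the reply (the ack goes out only after every replica has acked the primary).

```python
def actual_write_latency_ms(primary_commit_ms, replica_ack_ms):
    """Behaviour described in the reply: the primary acks the client only
    after its own commit and the acks from ALL replicas, so the slowest
    replica sets the latency."""
    return max([primary_commit_ms, *replica_ack_ms])


def assumed_quorum_latency_ms(primary_commit_ms, replica_ack_ms, min_size):
    """The (incorrect) model from the question: the client is acked as soon
    as min_size copies are durable, i.e. the min_size-th fastest copy."""
    all_copies = sorted([primary_commit_ms, *replica_ack_ms])
    return all_copies[min_size - 1]


def pg_is_active(up_to_date_replicas, min_size):
    """What min_size governs per the reply: whether the PG accepts I/O
    at all, not when an individual write is acknowledged."""
    return up_to_date_replicas >= min_size


if __name__ == "__main__":
    # Hypothetical numbers: primary on node 1, a fast replica on node 2,
    # and a replica on node 3 behind the netem-delayed link.
    primary_ms = 0.2
    replicas_ms = [0.5, 20.0]

    print("assumed:", assumed_quorum_latency_ms(primary_ms, replicas_ms, 2))
    print("actual: ", actual_write_latency_ms(primary_ms, replicas_ms))
    print("PG active with 2 of 3 up to date:", pg_is_active(2, 2))
```

Under this model, every write with a replica on node 3 pays the node 3 round trip, which matches the queue-depth-1 results described in the question.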
