> Furthermore, my expectation for a pool that has size=3 and min_size=2 is that
> for any given write, the primary OSD on nodes 1 or 2 will select a secondary
> OSD in node 1/2 respectively and one in node 3, which would then lead me to
> believe that any client writing into the cluster from node 1 will only ever
> have the latency between node 1 and node 2 as an actual performance penalty,
> because:
>
> * the client selects a primary OSD on node 1 or node 2
> * the primary OSD selects the secondary OSDs and starts the transfers in parallel
> * the write to the OSD with the lower latency will finish much sooner than
>   the one to the other OSD, leading to the write acknowledgement being sent
>   to the client, because min_size=2
>
> But that appears not to be the case. primary affinity has a very slight
> impact, but the overall performance
That is indeed not the case: the write is only reported as completed to the client once all three replicas have ACKed it. (min_size does not change when the client gets its ACK; it only controls how many replicas must be available for the PG to accept I/O at all.)

With the behaviour you describe, a client would rather quickly fill up any buffers on the fast OSD hosts to the point where the slow OSD host can't keep up. You would then be in a situation where the cluster "promises" to keep 3 copies but in reality only has two current ones. If the client keeps writing or changing data on the fast OSDs, you are guaranteed a really bad situation should one of the two fast nodes restart: you now have a lot of PGs for which the slow OSD host holds bad/old/stale data, and the one remaining fast OSD host gives you repl=1 "safety" for those PGs.

So Ceph will wait until it has 3 ACKs before the client gets its ACK.

--
May the most significant bit of your life be positive.
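
To make the latency consequence concrete, here is a minimal Python sketch (not Ceph code; the OSD names and latency figures are invented for illustration) that fans a write out to three simulated replicas in parallel and only acknowledges the client once every replica has ACKed. The client-visible latency tracks the slowest replica, which is why primary affinity alone cannot hide a slow third node.

```python
import concurrent.futures
import time

# Hypothetical per-replica write latencies (seconds); names and numbers
# are made up for illustration, not taken from any real cluster.
REPLICA_LATENCY = {
    "osd.node1": 0.001,   # local / fast link
    "osd.node2": 0.002,   # fast link
    "osd.node3": 0.050,   # slow link to the third node
}

def replicate(osd: str) -> float:
    """Simulate shipping the write to one replica and waiting for its ACK."""
    time.sleep(REPLICA_LATENCY[osd])
    return REPLICA_LATENCY[osd]

def client_write() -> float:
    """Fan the write out to all replicas in parallel, but only return
    (i.e. ACK the client) once *every* replica has ACKed -- the behaviour
    described above, not 'ack after min_size ACKs'."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(replicate, osd) for osd in REPLICA_LATENCY]
        concurrent.futures.wait(futures)   # wait for all replicas, not just 2 of 3
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = client_write()
    # The client-visible latency tracks the slowest replica (~50 ms here).
    print(f"client saw the ACK after {elapsed * 1000:.1f} ms")
```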
