Hello, I'd like to understand how replication works. The paper [1] describes several replication strategies, and according to a (somewhat old) mailing list post [2] Ceph uses primary-copy: the primary OSD waits until the object is persisted locally and then updates all replicas in parallel.
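To make sure I've understood the write path correctly, here is how I picture primary-copy replication as a rough sketch (all class and method names are my own illustration, not Ceph code; the point is only that the primary persists first, fans out to the replicas in parallel, and acks the client after all replicas ack):

```python
# Illustrative sketch of primary-copy replication as described in [1]/[2].
# Names are hypothetical; dicts stand in for the on-disk object store.
from concurrent.futures import ThreadPoolExecutor

class ReplicaOSD:
    def __init__(self):
        self.store = {}

    def replicate(self, name, data):
        self.store[name] = data   # replica persists its copy
        return True               # ack back to the primary

class PrimaryOSD:
    def __init__(self, replicas):
        self.store = {}           # local object store
        self.replicas = replicas  # replica OSDs for this placement group

    def write(self, name, data):
        self.store[name] = data   # 1. persist locally first
        # 2. update all replicas in parallel and wait for every ack
        with ThreadPoolExecutor() as pool:
            acks = list(pool.map(lambda r: r.replicate(name, data),
                                 self.replicas))
        return all(acks)          # 3. ack the client only after all replicas

replicas = [ReplicaOSD(), ReplicaOSD()]   # replication 3 = primary + 2
primary = PrimaryOSD(replicas)
ok = primary.write("obj.0", b"payload")   # True once fully replicated
```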
Current cluster setup:

- Ceph Jewel 10.2.3
- 6 storage nodes, 24 HDDs each, journal on the same disk [3]
- Frontend network: 10 Gbit/s
- Backend network: 2 x 10 Gbit/s, bonded with layer3+4 hashing [4]
- CephFS with striping: 1M stripe unit, stripe count 10, 10M object size

My assumption was that there should be no difference between writing with replication 2 and replication 3, because each storage node can accept 10 Gbit/s of traffic on the frontend network and simultaneously send 10 Gbit/s to each of two other storage nodes on the backend. Disk write capacity shouldn't be a problem either:

200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas = 4800 MB/s

Results with 7 clients:

Replication 1: 5695.33 MB/s
Replication 2: 3337.09 MB/s
Replication 3: 1898.17 MB/s

Replication 2 is about 1/2 of replication 1, and replication 3 is almost exactly 1/3 of replication 1. Any hints on what the bottleneck might be in this case?

[1] http://ceph.com/papers/weil-rados-pdsw07.pdf
[2] http://www.spinics.net/lists/ceph-devel/msg02420.html
[3] Tested with
    fio --name=job --ioengine=libaio --rw=write --blocksize=1M --size=30G \
        --direct=1 --sync=1 --iodepth=128 --filename=/dev/sdw
    which gives about 200 MB/s (a test for journal writes)
[4] Tested with iperf3: one storage node connecting to the backend IPs of
    two other nodes gives 10 Gbit/s throughput for each connection

Thanks,
Andreas

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
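P.S. For reference, a quick sanity check of my disk-throughput arithmetic (a rough sketch; the 2x journal penalty and the 200 MB/s per-disk figure are the ones from my fio test above, and may not match other setups):

```python
# Back-of-the-envelope cluster write budget: with the journal on the same
# disk, each logical write hits a disk twice (journal + data), and each
# object is written `replicas` times cluster-wide.
def cluster_write_budget(mb_per_disk, nodes, disks_per_node, replicas,
                         journal_penalty=2):
    raw = mb_per_disk * nodes * disks_per_node
    return raw / journal_penalty / replicas

print(cluster_write_budget(200, 6, 24, 3))  # 4800.0 MB/s, as above

# Measured throughput relative to replication 1 (the 1/N pattern):
for reps, mbps in [(1, 5695.33), (2, 3337.09), (3, 1898.17)]:
    print(reps, round(mbps / 5695.33, 3))
```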