I've set up a Ceph cluster to test things before going into production, but I've
run into some performance issues that I cannot resolve or explain.

Hardware in use in each storage machine (x3):
- dual 10Gbit Solarflare Communications SFC9020 (Linux bond, mtu 9000)
- dual 10Gbit EdgeSwitch 16-Port XG
- LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 HBA
- 3x Intel S4500 480GB SSD as OSDs
- 2x SSD raid-1 boot/OS disks
- 2x Intel(R) Xeon(R) CPU E5-2630
- 128GB memory

Software-wise I'm running Ceph 12.2.7-pve1, set up from Proxmox VE 5.2, on all
nodes.

Running rados bench results in somewhat lower than expected performance 
unless Ceph enters the 'near-full' state. When the cluster is mostly empty, 
rados bench (180 write -b 4M -t 16) gives about 330MB/s with 0.18ms latency, 
but when the cluster hits the near-full state this goes up to a more expected 
550MB/s and 0.11ms latency.
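
For completeness, the exact bench invocation is along these lines ('testpool' 
is just a placeholder for the name of my test pool):

    # 180 second write test, 4MB objects, 16 concurrent ops
    rados bench -p testpool 180 write -b 4M -t 16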

iostat on the storage machines shows the disks are hardly utilized unless the 
cluster hits near-full; CPU and network aren't maxed out either. I've also 
tried NIC bonding with just one switch, and running without jumbo frames, but 
nothing seems to matter in this case.
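
For reference, the checks I've been doing on the storage nodes are roughly the 
following (nothing exotic):

    # per-device utilization and queue stats, 5 second interval
    iostat -x 5
    # per-OSD commit/apply latency as reported by Ceph
    ceph osd perf
    # overall cluster state and client I/O
    ceph -s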

Is this expected behavior, or what can I try in order to pinpoint the bottleneck?

My performance expectations are based on the Ceph benchmark results Proxmox 
released this year: with 3 nodes, 4 OSDs per server, and 10Gbit networking 
they hit almost 800MB/s with 0.08ms latency. Given that they have more OSDs 
and somewhat different hardware, I understand I won't hit the 800MB/s mark, 
but the difference between an empty and an almost-full cluster makes no sense 
to me; I'd expect it to be the other way around.

Thanks,
Menno