Christian,

> Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> Ceph journal is inline (journal file).
Quick question.
Is there any reason you selected Ext4?

Cheers,
Shinobu

----- Original Message -----
From: "Christian Balzer" <ch...@gol.com>
To: ceph-users@lists.ceph.com
Sent: Thursday, February 25, 2016 12:10:41 PM
Subject: [ceph-users] Observations with a SSD based pool under Hammer

Hello,

For posterity and of course to ask some questions, here are my experiences
with a pure SSD pool.

SW: Debian Jessie, Ceph Hammer 0.94.5.

HW: 2 nodes (thus replication of 2), each with:
2x E5-2623 CPUs
64GB RAM
4x DC S3610 800GB SSDs
Infiniband (IPoIB) network

Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).

Performance:
A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives me
about 620MB/s; the storage nodes are I/O bound (all SSDs are 100% busy
according to atop), and this meshes nicely with the speeds I saw when testing
the individual SSDs with fio before involving Ceph.
To elaborate on that: an individual SSD of that type can do about 500MB/s of
sequential writes, so ideally you would see 1GB/s of writes with Ceph
(500 * 8 / 2 (replication) / 2 (journal on same disk)).
However, my experience tells me that other activities (FS journals, leveldb
PG updates, etc.) impact things as well.

A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB blocks)
gives me about 7200 IOPS, with the SSDs about 40% busy.
All OSD processes are using about 2 cores and the OS another 2, but that
leaves about 6 cores unused (the MHz on all cores scales to max during the
test run).
Closer inspection with all CPUs displayed in atop shows that no single core
is fully used; they all average around 40%, and even the busiest ones
(handling IRQs) still have ample capacity available.
I'm wondering if this is an indication of insufficient parallelism or if
it's latency of sorts.
I'm aware of the many tuning settings for SSD based OSDs, however I was
expecting to run into a CPU wall first and foremost.

Write amplification:
10 second rados bench with 4MB blocks, 6348MB written in total.
NAND writes per SSD: 118 * 32MB = 3776MB; 30208MB total written to all SSDs.
Amplification: 4.75

Very close to what you would expect with a replication of 2 and the journal
on the same disk.

10 second rados bench with 4KB blocks, 219MB written in total.
NAND writes per SSD: 41 * 32MB = 1312MB; 10496MB total written to all SSDs.
Amplification: 48!!!

Le ouch.
In my use case, with rbd cache on all VMs, I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
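The back-of-the-envelope throughput expectation quoted in the mail (1GB/s
ideal versus ~620MB/s measured) can be reproduced with a small sketch. The
per-SSD speed, SSD count, replication factor and journal penalty are the
figures from the mail; the function name and parameters are purely
illustrative, not part of any Ceph or benchmarking tool.

# Sketch of the ideal aggregate write bandwidth described in the mail.
# All input figures come from the post; the helper itself is illustrative.
def expected_cluster_write_mb_s(per_ssd_mb_s, num_ssds, replication, journal_factor):
    """Aggregate client-visible write bandwidth after replication and journaling."""
    return per_ssd_mb_s * num_ssds / replication / journal_factor

# 500 MB/s * 8 SSDs / 2 (replication) / 2 (inline journal) = 1000 MB/s ideal,
# versus the ~620 MB/s measured with "rados -p cache bench 30 write -t 32".
print(expected_cluster_write_mb_s(500, 8, 2, 2))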
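The write-amplification figures follow the same arithmetic: total NAND writes
across all SSDs divided by the client data written. The 32MB counter unit,
the per-SSD counter increments (118 for the 4MB run, 41 for the 4KB run) and
the client totals are taken from the mail; the function is only an
illustrative sketch of that calculation.

# Write-amplification arithmetic from the mail; the helper is illustrative.
def write_amplification(counter_increments_per_ssd, num_ssds, client_mb_written,
                        counter_unit_mb=32):
    """Total NAND writes across all SSDs divided by the client data written."""
    total_nand_mb = counter_increments_per_ssd * counter_unit_mb * num_ssds
    return total_nand_mb / client_mb_written

# 4MB block run: prints ~4.76 (rounded to 4.75 in the mail).
print(write_amplification(118, 8, 6348))
# 4KB block run: prints ~48.
print(write_amplification(41, 8, 219))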