A couple of suggestions:

1) The number of PGs per OSD should be 100-200.

2) When dealing with SSD or flash, the performance of these devices hinges on how you partition them and how you tune Linux:
   a) If using partitions, did you align them on a 4K boundary? I start at sector 2048 using either fdisk or sfdisk.
   b) There are quite a few Linux settings that benefit SSD/flash: the deadline I/O scheduler (only when also using the deadline-associated settings), raising the queue depth to 512 or 1024, rq_affinity=2 if the OS allows it, setting read-ahead if doing a majority of reads, and others. See the sketch after this list for one way to apply them.

3) Mount options: noatime, delaylog, inode64, noquota, etc.
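In case it is useful, here is a minimal, untested sketch of applying those block-device settings through sysfs (Python only for illustration; the device name "sdb" is hypothetical, the values are examples, and I am reading "QDepth" as the queue/nr_requests depth - verify each knob against your own kernel and distro before relying on it):

#!/usr/bin/env python3
"""Apply common SSD/flash block-device tunings via sysfs (run as root)."""
from pathlib import Path

DEVICE = "sdb"  # hypothetical device name - substitute your SSD

# sysfs attribute (relative to /sys/block/<dev>) -> suggested value
TUNINGS = {
    "queue/scheduler": "deadline",    # deadline I/O scheduler
    "queue/nr_requests": "1024",      # deeper request queue (512 or 1024)
    "queue/rq_affinity": "2",         # complete I/O on the issuing CPU, where supported
    "queue/read_ahead_kb": "2048",    # larger read-ahead for read-heavy workloads
}

def apply_tunings(dev: str, tunings: dict) -> None:
    base = Path("/sys/block") / dev
    for rel, value in tunings.items():
        node = base / rel
        try:
            node.write_text(value)
            print(f"{node}: wrote {value!r}, now reads {node.read_text().strip()!r}")
        except OSError as exc:
            print(f"{node}: could not set ({exc})")

if __name__ == "__main__":
    apply_tunings(DEVICE, TUNINGS)

Note these settings do not survive a reboot, so they need a udev rule or boot script to be made persistent.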
I have written some papers/blogs on this subject if you are interested in seeing them.

Rick

> On Mar 3, 2016, at 2:41 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:
>
> Hi Ceph-users,
>
> TL;DR - I can't seem to pin down why an unloaded system with flash-based OSD journals has higher than desired write latencies for RBD devices. Any ideas?
>
> I am developing a storage system based on Ceph and an SCST+pacemaker cluster. Our initial testing showed promising results even with mixed available hardware, and we proceeded to order a purpose-designed platform for developing into production. The hardware is:
>
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24-bay 2.5" SSD) - currently with 4x 2TB Samsung Evo SSDs each
> 3x 4RU OSD SATA servers (36-bay) - currently with 6x 8TB Seagate drives each
>
> As part of the research and planning we opted to put a pair of Intel DC P3700 400G NVMe cards in each OSD server. These are configured mirrored and set up as the journals for the OSD disks, the aim being to improve write latencies. All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of switches. All machines are running CentOS 7, with the frontends using the 4.4.1 ELRepo kernel-ml to get a later RBD kernel module.
>
> On the Ceph side, each disk in the OSD servers is set up as an individual OSD, with a 12G journal created on the flash mirror. I set up the SSD servers into one root and the SATA servers into another, and created pools using hosts as fault boundaries, with the pools set for 2 copies. I created the pools with pg_num and pgp_num set to 32x the number of OSDs in the pool. On the frontends we create RBD devices and present them as iSCSI LUNs using SCST to clients - in this test case a Solaris host.
>
> The problem I have is that even with a lightly loaded system the service times for the LUNs for writes are just not getting down to where we want them, and they are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the service times sit at around 3-4ms, but regularly (every 20-30 seconds) jump above 12-15ms, which puts the average at 6ms over 5 minutes. I fully expected we would have some latencies due to the distributed and networked nature of Ceph, but in this instance I just cannot find where these latencies are coming from, especially with the SSD-based pool and flash-based journaling.
>
> - The RBD devices show relatively low service times, but high queue times. These are in line with what Solaris sees, so I don't think SCST/iSCSI is adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine with any bursts.
> - The SSDs do show similar latency variations with writes - bursting up to 12ms or more whenever there are heavy write workloads.
> - I have tried applying what tuning I can to the SSD block devices (noop scheduler etc.) - no difference.
> - I have removed any sort of smarts around IO grouping in SCST - no major impact.
> - I have tried tuning up filestore queue and wbthrottle values but could not find much difference from that.
> - Read performance is excellent: the RBD devices show little to no rwait and I can do benchmarks up over 1GB/s in some tests. Write throughput can also be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI client block sizes (i.e. 32K or 128K instead of 4M), but it seemed to make things worse. I would have thought better alignment would reduce latency, but is that offset by the extra overhead in object work?
>
> What I am looking for is: what other areas do I need to look at, or what diagnostics do I need, to work this out? We would really like to use Ceph across a mixed workload that includes some DB systems that are fairly latency sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet, unloaded system seems to struggle, even with all this hardware behind it. I get the impression that the SSD write latencies might be coming into play, as they are similar to the numbers I see, but really for writes I would expect them to be "hidden" behind the journaling.
>
> I also would have thought that, being not under load and with the flash journals, the only latency would be coming from mapping calculations on the client or otherwise some contention within the RBD module itself. Any ideas how I can break out what the times are for what the RBD module is doing?
>
> Any help appreciated.
>
> As an aside - I think Ceph as a concept is exactly what a storage system should be about, hence why we are using it this way. It's been awesome to get stuck into it and learn how it works and what it can do.
>
> Adrian Saul | Infrastructure Projects Team Lead
> TPG Telecom (ASX: TPM)
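To put suggestion 1 in concrete terms against the numbers quoted above (pg_num/pgp_num at 32x the OSD count, pools set for 2 copies): the PGs-per-OSD figure works out well under the 100-200 range. A quick back-of-the-envelope check - Python just for the arithmetic, and the 12-OSD count assumes the SSD root described above (3 servers x 4 SSDs):

# Rule of thumb: PGs per OSD = pg_num * pool_size / number_of_OSDs (single pool on that root)
def pgs_per_osd(num_osds: int, pg_multiplier: int, pool_size: int = 2) -> float:
    pg_num = pg_multiplier * num_osds
    return pg_num * pool_size / num_osds      # simplifies to pg_multiplier * pool_size

print(pgs_per_osd(12, 32))   # 64.0  -> the setup described above, under the 100-200 guideline
print(pgs_per_osd(12, 64))   # 128.0 -> a 64x multiplier lands inside the range for 2 copies

That by itself will not explain the latency spikes, but it is easier to adjust while the pools are still young.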