A couple of suggestions:
1)  The number of PGs per OSD should be in the 100-200 range (a quick check 
follows this list).
2)  When dealing with SSD or flash, performance of these devices hinges on how 
you partition them and how you tune Linux:
        a)  If using partitions, did you align them on a 4k boundary? I start 
at sector 2048 using either fdisk or sfdisk.
        b)  There are quite a few Linux settings that benefit SSD/flash: the 
deadline I/O scheduler (only when you also use the deadline-specific tunables), 
queue depth raised to 512 or 1024, rq_affinity=2 if the OS allows it, read-ahead 
if the workload is mostly reads, and others -- see the sketch after this list.
3)  Mount options: noatime, delaylog, inode64, noquota, etc.
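
For point 1, here is a rough check in plain Python, using the numbers I think I 
see in your mail below (3 SSD hosts x 4 OSDs each, pg_num set to 32x the OSD 
count, 2 copies) -- treat those figures as my assumptions:

    # PG copies per OSD for the SSD pool, with numbers assumed from the
    # description below: 3 SSD hosts x 4 OSDs each, pg_num = 32 * OSDs, size = 2.
    osds = 3 * 4                      # 12 OSDs under the SSD root
    pg_num = 32 * osds                # 384, as described
    replicas = 2                      # pool size (copies)
    print(pg_num * replicas / osds)   # -> 64 PG copies per OSD, under the 100-200 target
    print(1024 * replicas / osds)     # -> ~171 with pg_num = 1024, inside the target band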

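For 2a and 2b, a minimal sketch of what checking the alignment and applying 
those block-layer settings could look like from a script (Python; the device 
and partition names are placeholders, writes to /sys need root, and the exact 
values are examples rather than gospel):

    # A sketch, not a drop-in tool: DEV and PART are placeholder names, writes
    # to /sys need root, and the values are just the ones discussed above.
    DEV = "sda"      # placeholder: the SSD block device
    PART = "sda1"    # placeholder: a partition on it

    def read_sysfs(path):
        with open(path) as f:
            return f.read().strip()

    def write_sysfs(path, value):
        with open(path, "w") as f:
            f.write(str(value))

    # 2a) alignment: the partition start sector should be a multiple of 8
    #     (8 x 512-byte sectors = 4 KiB); starting at sector 2048 satisfies this.
    start = int(read_sysfs("/sys/block/%s/%s/start" % (DEV, PART)))
    print("%s starts at sector %d, 4k aligned: %s" % (PART, start, start % 8 == 0))

    # 2b) block-layer settings from point 2b above
    q = "/sys/block/%s/queue" % DEV
    write_sysfs(q + "/scheduler", "deadline")   # "mq-deadline" on blk-mq kernels
    write_sysfs(q + "/nr_requests", 1024)       # queue depth: 512 or 1024
    write_sysfs(q + "/rq_affinity", 2)          # if the kernel allows it
    write_sysfs(q + "/read_ahead_kb", 2048)     # only if mostly reads; value is an example
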
I have written some papers/blogs on this subject if you are interested in 
seeing them.
Rick
> On Mar 3, 2016, at 2:41 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:
> 
> Hi Ceph-users,
> 
> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
> journals has higher than desired write latencies for RBD devices.  Any ideas?
> 
> 
>  I am developing a storage system based on Ceph and an SCST+pacemaker 
> cluster.   Our initial testing showed promising results even with the mixed 
> hardware we had available, and we went on to order a purpose-designed platform 
> to develop into production.   The hardware is:
> 
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
> RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
> SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
> 
> As part of the research and planning we opted to put a pair of Intel DC P3700 
> 400G NVMe cards in each OSD server.  These are configured mirrored and set up 
> as the journals for the OSD disks, the aim being to improve write latencies.  
> All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 
> 10G NICs back to a common pair of switches.   All machines are running CentOS 
> 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD 
> kernel module.
> 
> On the ceph side each disk in the OSD servers is set up as an individual OSD, 
> with a 12G journal created on the flash mirror.   I set up the SSD servers 
> into one root, and the SATA servers into another and created pools using 
> hosts as fault boundaries, with the pools set for 2 copies.   I created the 
> pools with the pg_num and pgp_num set to 32x the number of OSDs in the pool.  
>  On the frontends we create RBD devices and present them as iSCSI LUNs using 
> SCST to clients - in this test case a Solaris host.
> 
> The problem I have is that even with a lightly loaded system the write 
> service times for the LUNs are just not getting down to where we want them, 
> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
> consistently the service times sit at around 3-4ms, but regularly (every 
> 20-30 seconds) spike to above 12-15ms, which puts the average at 6ms over 5 
> minutes.  I fully expected we would have some latency due to the 
> distributed and networked nature of Ceph, but in this instance I just cannot 
> find where these latencies are coming from, especially with the SSD based 
> pool and having flash based journaling.
> 
> - The RBD devices show relatively low service times, but high queue times.  
> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
> adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine with 
> any bursts
> - The SSDs do show similar latency variations with writes - bursting up to 
> 12ms or more whenever there is a high write workload.
> - I have tried applying what tuning I can to the SSD block devices (noop 
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no major 
> impact
> - I have tried tuning up filestore queue and wbthrottle values but could 
> not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait and 
> I can do benchmarks up over 1GB/s in some tests.  Write throughput can also 
> be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI client 
> block sizes (i.e. 32K, 128K instead of 4M) but it seemed to make things worse. 
>  I would have thought better alignment would reduce latency but is that 
> offset by the extra overhead in object work?
> 
> What I am looking for is: what other areas do I need to look at, or what 
> diagnostics do I need, to work this out?  We would really like to use Ceph 
> across a mixed workload that includes some DB systems that are fairly latency 
> sensitive, but as it stands it's hard to be confident in the performance when 
> a fairly quiet, unloaded system seems to struggle, even with all this hardware 
> behind it.   I 
> get the impression that the SSD write latencies might be coming into play as 
> they are similar to the numbers I see, but really for writes I would expect 
> them to be "hidden" behind the journaling.
> 
> I also would have thought that, not being under load and with the flash 
> journals, the only latency would be coming from mapping calculations on the 
> client or otherwise some contention within the RBD module itself.   Any ideas 
> how I can break out what the times are for what the RBD module is doing?
> 
> Any help appreciated.
> 
> As an aside - I think Ceph as a concept is exactly what a storage system 
> should be about, which is why we are using it this way.  It's been awesome to get 
> stuck into it and learn how it works and what it can do.
> 
> 
> 
> 
> Adrian Saul | Infrastructure Projects Team Lead
> TPG Telecom (ASX: TPM)
> 

