> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> William Josefsson
> Sent: 17 October 2016 10:39
> To: n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> 
> Hi Nick, I earlier ran cpupower frequency-set --governor performance on all
> my hosts, which bumped all CPUs up to roughly max speed or above.

Did you also set/check the C-states? This can have a large impact as well.
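
For example, something along these lines will show what the cores are actually
doing (cpupower is the same tool as the frequency-set you already used; the
exact states listed vary per CPU, and the intel_idle path assumes that driver
is loaded):

cpupower idle-info     # which C-states the driver exposes and whether any are disabled
cpupower monitor       # per-core residency; ideally nearly all time in C0/C1
cat /sys/module/intel_idle/parameters/max_cstate

If the cores are spending a lot of time in deep C-states, pinning them via the
intel_idle.max_cstate kernel parameter mentioned further down is the usual fix.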

> 
> It didn't really help much, and I still experience 5-10ms latency in my fio 
> benchmarks in VMs with this job description.
> 
> Is there anything else I can do to force the SSDs to be used more? 

Not really. For small IOs you are limited by the end-to-end latency of the
whole system: each request has to be actioned before the next can be sent. You
probably get 100-200us of latency per network hop, Ceph itself introduces
latency as it processes each request, somewhere in the region of 500us to 1.5ms
depending on CPU speed, and finally your SSDs probably take 50-100us per write.

So

Client -> Net -> OSD1 -> SSD Journal -> Net -> OSD2+3 -> SSD Journal -> and then
ACK back to client

It all adds up and so you will never get the same speed as testing to a locally 
attached SSD.
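
As a very rough tally for one 4k sync write, using the numbers above (the
primary's own journal write happens roughly in parallel with the replica round
trip, so the replica path is what you end up waiting on):

~150us   client -> primary OSD (network)
~500us+  primary OSD processing
~150us   primary -> replica OSDs (network)
~500us+  replica OSD processing, plus ~100us journal write
~150us   replica -> primary ack
~150us   primary -> client ack

That puts a floor of somewhere around 1.5-2ms on a single synchronous write,
before the cluster is even under load.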

It might be worth running a single-threaded test to get an idea of best-case
latency; at the least this will give you an idea of the best you will ever be
able to achieve. I would expect you to be able to get around ~1.5ms, or 600-700
IOPS, for a single-threaded test with your hardware.
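
Something like this, which is just your job file cut down to a single job,
should show that best case (filename being whatever device you were testing
against in the VM):

[global]
bs=4k
rw=write
sync=1
direct=1
iodepth=1
filename=/dev/vdb1
runtime=30
group_reporting

[single-write]
numjobs=1

The average clat from that run is effectively your per-write round trip through
the whole stack, with no queueing on top.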



> I know DIRECT SYNCED WRITE may not be the most common application case,
> however I need help to improve a worst case. Benchmarking these SSDs locally
> with fio and direct sync writes, they can do 40-50k IOPS. I'm not sure
> exactly what, but something is holding back the max performance. I know from
> collectd graphs that the journals are sparsely used. Appreciate any advice.
> thx will
> 
> >> [global]
> >> bs=4k
> >> rw=write
> >> sync=1
> >> direct=1
> >> iodepth=1
> >> filename=/dev/vdb1
> >> runtime=30
> >> stonewall=1
> >> group_reporting
> 
> 
> grep "cpu MHz" /proc/cpuinfo
> cpu MHz         : 2945.250
> cpu MHz         : 2617.500
> cpu MHz         : 3065.062
> cpu MHz         : 2574.281
> cpu MHz         : 2739.468
> cpu MHz         : 2857.593
> cpu MHz         : 2602.125
> cpu MHz         : 2581.687
> cpu MHz         : 2958.656
> cpu MHz         : 2793.093
> cpu MHz         : 2682.750
> cpu MHz         : 2699.718
> cpu MHz         : 2620.125
> cpu MHz         : 2926.875
> cpu MHz         : 2740.031
> cpu MHz         : 2559.656
> cpu MHz         : 2758.875
> cpu MHz         : 2656.593
> cpu MHz         : 1476.187
> cpu MHz         : 2545.125
> cpu MHz         : 2792.718
> cpu MHz         : 2630.156
> cpu MHz         : 3090.750
> cpu MHz         : 2951.906
> cpu MHz         : 2845.875
> cpu MHz         : 2553.281
> cpu MHz         : 2602.125
> cpu MHz         : 2600.906
> cpu MHz         : 2737.031
> cpu MHz         : 2552.156
> cpu MHz         : 2624.625
> cpu MHz         : 2614.125
> 
> 
> 
> 
> On Mon, Oct 17, 2016 at 5:17 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of William Josefsson
> >> Sent: 17 October 2016 09:31
> >> To: Christian Balzer <ch...@gol.com>
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> >>
> >> Thx Christian for helping troubleshoot the latency issues. I have
> >> attached my fio job template below.
> >>
> >> To eliminate the possibility that the VM is the bottleneck, I've created
> >> a 128GB, 32 vCPU flavor. Here's the latest fio benchmark:
> >> http://pastebin.ca/raw/3729693   I'm trying to benchmark the cluster's
> >> performance for SYNCED WRITEs and how well suited it would be for
> >> disk-intensive workloads or DBs.
> >>
> >>
> >> > The size (45GB) of these journals is only going to be used by a
> >> > little fraction, unlikely to be more than 1GB in normal operations
> >> > and with default filestore/journal parameters.
> >>
> >> To consume more of the SSDs in the hope of achieving lower latency, can
> >> you please advise what parameters I should be looking at? I have already
> >> tried what's mentioned in RaySun's ceph blog, which eventually lowered my
> >> overall sync write IOPS performance by 1-2k.
> >
> > Your biggest gains will probably come from forcing the CPUs to max
> > frequency and forcing the C-state to 1.
> >
> > Set intel_idle.max_cstate=0 on the kernel command line and echo 100 >
> > /sys/devices/system/cpu/intel_pstate/min_perf_pct (I think this is the
> > same as the performance governor).
> >
> > Use something like powertop to check that all cores are running at max
> > freq and are staying in cstate1
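
To make those settings stick across reboots on CentOS 7, the usual route is
roughly this (a sketch, assuming a stock BIOS-boot CentOS 7 install; adjust the
grub path for EFI):

# add the C-state setting from above to GRUB_CMDLINE_LINUX in /etc/default/grub,
# e.g. intel_idle.max_cstate=0, then regenerate the config:
grub2-mkconfig -o /boot/grub2/grub.cfg
# and re-apply the p-state floor at boot, e.g. from rc.local or a tuned profile:
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct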
> >
> > I have managed to get the latency on my cluster down to about 600us, but
> > with your hardware I suspect you won't be able to get it below ~1-1.5ms
> > best case.
> >
> >>
> >> # These are from RaySun's  write up, and worsen my total IOPs.
> >> #
> >> http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimizati
> >> on-summary/
> >>
> >> filestore xattr use omap = true
> >> filestore min sync interval = 10
> >> filestore max sync interval = 15
> >> filestore queue max ops = 25000
> >> filestore queue max bytes = 10485760
> >> filestore queue committing max ops = 5000
> >> filestore queue committing max bytes = 10485760000
> >> journal max write bytes = 1073714824
> >> journal max write entries = 10000
> >> journal queue max ops = 50000
> >> journal queue max bytes = 10485760000
> >>
> >> My journals are Intel S3610 200GB, split into 4-5 partitions each. When
> >> I ran fio on the disks locally with direct=1 and sync=1, the WRITE
> >> performance was 50k IOPS across 7 threads.
> >>
> >> My hardware specs:
> >>
> >> - 3 Controllers, The mons run here
> >> Dell PE R630, 64GB, Intel SSD s3610
> >> - 9 Storage nodes
> >> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
> >> OSD: 18x1.8TB Hitachi 10krpm SAS
> >>
> >> RAID Controller is PERC 730
> >>
> >> All servers have 2x10GbE bonds, Intel ixgbe X540 copper, connecting to
> >> Arista 7050X 10Gbit switches with VARP and LACP interfaces. From my VM I
> >> have pinged all hosts and the RTT is 0.3ms on the LAN. I did iperf, and I
> >> can do 10Gbps from the VM to the storage nodes.
> >>
> >> I've already been tuning: the CPU scaling governor is set to 'performance'
> >> on all hosts for all cores. My Ceph release is the latest hammer on CentOS 7.
> >>
> >> The best write currently happens at 62 threads it seems; the IOPS is
> >> 8.3k for the direct synced writes. The latency and stddev are still
> >> concerning.. :(
> >>
> >> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
> >>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
> >>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>     clat percentiles (usec):
> >>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
> >>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
> >>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
> >>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
> >>      | 99.99th=[17792]
> >>     bw (KB  /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
> >>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
> >>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
> >>
> >>
> >> From the above we can tell that the latency for clients doing synced
> >> writes is somewhere around 5-10ms, which seems very high, especially with
> >> quite high-performing hardware, network, and SSD journals. I'm not sure
> >> whether it may be the syncing from journal to OSD that causes these
> >> fluctuations or high latencies.
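
One way to narrow that down is to watch the OSD-side counters while the fio run
is going; something along these lines (osd.0 just as an example id):

ceph osd perf    # per-OSD fs_commit_latency / fs_apply_latency in ms
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops    # slowest recent ops with per-stage timings

If the journal commit latencies stay low while the clients see 5-10ms, the time
is going into the network hops and OSD op processing rather than the journal
devices themselves.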
> >>
> >> Any help or advice would be much appreciated. thx will
> >>
> >>
> >> [global]
> >> bs=4k
> >> rw=write
> >> sync=1
> >> direct=1
> >> iodepth=1
> >> filename=${FILE}
> >> runtime=30
> >> stonewall=1
> >> group_reporting
> >>
> >> [simple-write-6]
> >> numjobs=6
> >> [simple-write-10]
> >> numjobs=10
> >> [simple-write-14]
> >> numjobs=14
> >> [simple-write-18]
> >> numjobs=18
> >> [simple-write-22]
> >> numjobs=22
> >> [simple-write-26]
> >> numjobs=26
> >> [simple-write-30]
> >> numjobs=30
> >> [simple-write-34]
> >> numjobs=34
> >> [simple-write-38]
> >> numjobs=38
> >> [simple-write-42]
> >> numjobs=42
> >> [simple-write-46]
> >> numjobs=46
> >> [simple-write-50]
> >> numjobs=50
> >> [simple-write-54]
> >> numjobs=54
> >> [simple-write-58]
> >> numjobs=58
> >> [simple-write-62]
> >> numjobs=62
> >> [simple-write-66]
> >> numjobs=66
> >> [simple-write-70]
> >> numjobs=70
> >>
> >> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <ch...@gol.com> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >> >
> >> >> Ok thanks for sharing. yes my journals are Intel S3610 200GB,
> >> >> which I partition in 4 partitions each ~45GB. When I ceph-deploy I
> >> >> declare these as the journals of the OSDs.
> >> >>
> >> > The size (45GB) of these journals is only going to be used by a
> >> > little fraction, unlikely to be more than 1GB in normal operations
> >> > and with default filestore/journal parameters.
> >> >
> >> > Because those defaults start flushing things (from RAM, the journal
> >> > never gets read unless there is a crash) to the filestore (OSD HDD)
> >> > pretty much immediately.
> >> >
> >> > Again, use google to search the ML archives.
> >> >
> >> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> >> affected my performance. I have a total of 9 hosts, 158 OSDs each
> >> >> 1.8TB. The Servers are connected through copper 10Gbit LACP bonds.
> >> >> My failure domain is by type RACK. The CRUSH rule set is by rack.
> >> >> 3 hosts in each rack. Pool size is =3. I'm running hammer on centos7.
> >> >>
> >> >
> >> > Which begs the question to fully detail your HW (CPUs, RAM),
> >> > network (topology, what switches, inter-rack/switch links), etc.
> >> > The reason for this will become obvious below.
> >> >
> >> >> I did a simple fio test from one of my xl instances, and got the
> >> >> results below. The Latency 7.21ms is worrying, is this expected
> >> >> results? Or is there any way I can further tune my cluster to
> >> >> achieve better results? thx will
> >> >>
> >> >
> >> >> FIO: sync=1, direct=1, bs=4k
> >> >>
> >> > Full command line, please.
> >> >
> >> > Small, sync I/Os are by far the hardest thing for Ceph.
> >> >
> >> > I can guess what some of the rest was, but it's better to know for sure.
> >> > Alternatively, additionally, try this please:
> >> >
> >> > "fio --size=1G --ioengine=libaio --invalidate=1  --direct=1
> >> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >> >
> >> >>
> >> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >
> >> > These numbers suggest you did randwrite and aren't all that surprising.
> >> > If you were to run atop on your OSD nodes while doing that fio run,
> >> > you'll likely see that both CPUs and individual disk (HDDs) get very 
> >> > busy.
> >> >
> >> > There are several things conspiring against Ceph here: the latency
> >> > of its own code, the network latency of getting all the individual
> >> > writes to each replica, the fact that 1000 of these 4K blocks will
> >> > hit one typical RBD object (4MB) and thus one PG, making 3 OSDs very
> >> > busy, etc.
> >> >
> >> > If you absolutely need low latencies with Ceph, consider dedicated
> >> > SSD-only pools for special-need applications (DBs) or a cache tier
> >> > if it fits the profile and active working set.
> >> > Lower Ceph latency in general by having fast CPUs with power saving
> >> > (frequency throttling) disabled or set to "performance" instead of
> >> > "ondemand".
> >> >
> >> > Christian
> >> >
> >> >>     clat percentiles (msec):
> >> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >> >>      | 99.99th=[  253]
> >> >>     bw (KB  /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
> >> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >> >>
> >> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <ch...@gol.com> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >> >
> >> >> Hi list, while I know that writes in the RADOS backend are sync()'d,
> >> >> can anyone please explain when the cluster will return on a write
> >> >> call for RBD from VMs? Will data be considered synced once written to
> >> >> the journal, or only once written all the way to the OSD drive?
> >> >> >>
> >> >> > This has been answered countless times (really) here; the Ceph
> >> >> > Architecture documentation should really be more detailed about
> >> >> > this, as well as about how the data is sent to the secondary OSDs
> >> >> > in parallel.
> >> >> >
> >> >> > It is of course ack'ed to the client once all journals have
> >> >> > successfully written the data, otherwise journal SSDs would make a 
> >> >> > LOT less sense.
> >> >> >
> >> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 
> >> >> >> 10krpm SAS.
> >> >> >>
> >> >> > The size of your SSDs (you didn't mention) will determine the
> >> >> > speed, for journal purposes the sequential write speed is basically 
> >> >> > it.
> >> >> >
> >> >> > A 5:18 ratio implies that some of your SSDs hold more journals than 
> >> >> > others.
> >> >> >
> >> >> > You emphatically do NOT want that, because eventually the busier
> >> >> > ones will run out of endurance while the other ones still have plenty 
> >> >> > left.
> >> >> >
> >> >> > If possible change this to a 5:20 or 6:18 ratio (depending on
> >> >> > your SSDs and expected write volume).
> >> >> >
> >> >> > Christian
> >> >> >> I have size=3 for my pool. Will Ceph return once the data is
> >> >> >> written to at least 3 designated journals, or will it in fact
> >> >> >> wait until the data is written to the OSD drives? thx will
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Christian Balzer        Network/Systems Engineer
> >> >> > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> >> >> > http://www.gol.com/
> >> >>
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
