Thanks, Christian, for elaborating on this, I appreciate it. I will rerun some of my benchmarks and take your advice into consideration. I have also found maximum-performance recommendations for the Dell 730xd BIOS settings, I hope these make sense: http://pasteboard.co/guHVMQVly.jpg
I will apply all of these settings, plus intel_idle.max_cstate=0 as suggested by Nick, and rerun the fio benchmarks.

thx will
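For comparing the reruns against the earlier numbers, a small helper along these lines might save some squinting at fio output. It is only a sketch: `parse_lat_line` and `expected_iops` are hypothetical names of my own, the regexes assume the classic fio summary format quoted further down in this thread (`clat (usec): min=..., avg=...`), and the IOPS estimate assumes each job keeps exactly one I/O in flight (iodepth=1), so it is simply concurrency divided by average latency.

```python
import re

# Matches fio summary lines such as:
#   clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
# It deliberately does not match "clat percentiles (usec):".
_LAT_RE = re.compile(r"^(slat|clat|lat)\s*\((usec|msec)\)\s*:\s*(.*)$")
_STAT_RE = re.compile(r"(min|max|avg|stdev)=\s*([0-9.]+)")
_TO_MS = {"usec": 0.001, "msec": 1.0}


def parse_lat_line(line):
    """Return (kind, stats_in_ms) for one fio latency line, else None.

    Leading mail quote markers ("> ", ">> ") are stripped first.
    The latency distribution line ("lat (msec) : 4=1.99%, ...") has no
    min/max/avg/stdev fields and therefore yields None.
    """
    m = _LAT_RE.match(line.lstrip("> \t"))
    if m is None:
        return None
    kind, unit, rest = m.groups()
    stats = {k: float(v) * _TO_MS[unit] for k, v in _STAT_RE.findall(rest)}
    if "avg" not in stats:
        return None
    return kind, stats


def expected_iops(in_flight, avg_latency_ms):
    """Estimate IOPS when `in_flight` I/Os are always outstanding."""
    return in_flight / (avg_latency_ms / 1000.0)


if __name__ == "__main__":
    # Numbers taken from the 62-job run quoted below in this thread.
    kind, stats = parse_lat_line(
        ">> clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50")
    print(kind, stats["avg"], "ms avg")
    print("estimated IOPS at 62 jobs:", round(expected_iops(62, stats["avg"])))
```

With the 62-job result (avg 7.42 ms) the estimate comes out at roughly 8.36k IOPS, close to the 8349 that fio reported, which would fit Christian's point that per-write latency rather than bandwidth is what caps these runs.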
On Tue, Oct 18, 2016 at 9:44 AM, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> As I had this written mostly already and since it covers some points Nick
> raised in more detail, here we go.
>
> On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:
>
>> Thx Christian for helping to troubleshoot the latency issues. I have
>> attached my fio job template below.
>>
> There's no trouble here per se, just facts of life (Ceph).
>
> You'll be well advised to search the ML, especially for what Nick Fisk
> had to write about these things (several times).
>
>> To eliminate the VM as a possible bottleneck, I've
>> created a 128GB 32 vCPU flavor.
> Nope, the client is not the issue.
>
>> Here's the latest fio benchmark.
>> http://pastebin.ca/raw/3729693 I'm trying to benchmark the cluster's
>> performance for SYNCED WRITEs and how well suited it would be for disk
>> intensive workloads or DBs
>>
>
> A single IOPS of that type and size will only hit the journal and be
> ACK'ed quickly (well quicker than what you see now), but FIO is creating
> a constant stream of requests, eventually hitting the actual OSD as well.
>
> Aside from CPU load, of course.
>
>>
>> > Only a small fraction of the size (45GB) of these journals is going to
>> > be used, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>>
>> To consume more of the SSDs in the hope of achieving lower latency, can
>> you please advise what parameters I should be looking at?
>
> Not going to help with your prolonged FIO runs, and once the flushing to
> OSDs commences, stalls will ensue.
> The moment the journal is full or the timers kick in, things will go down
> to OSD (HDD) speed.
> The journal is there to help with small, short bursts.
>
>> I have already
>> tried what's mentioned in RaySun's ceph blog, which eventually
>> lowered my overall sync write IOPs performance by 1-2k.
>>
> Unsurprisingly, the default values are there for a reason.
>
>> # These are from RaySun's write up, and worsen my total IOPs.
>> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>>
>> filestore xattr use omap = true
>> filestore min sync interval = 10
> Way too high, 0.5 is probably already excessive, I run with 0.1.
>
>> filestore max sync interval = 15
>
>> filestore queue max ops = 25000
>> filestore queue max bytes = 10485760
>> filestore queue committing max ops = 5000
>> filestore queue committing max bytes = 10485760000
> Your HDDs will choke on those 4. With a 10k SAS HDD a small increase of
> the defaults may help.
>
>> journal max write bytes = 1073714824
>> journal max write entries = 10000
>> journal queue max ops = 50000
>> journal queue max bytes = 10485760000
>>
>> My Journals are Intel s3610 200GB, split in 4-5 partitions each.
> Again, you want to even that out.
>
>> When
>> I did FIO on the disks locally with direct=1 and sync=1 the WRITE
>> performance was 50k iops for 7 threads.
>>
> Yes, but as I wrote that's not how journals work, think more of 7
> sequential writes, not rand-writes.
>
> And as I tried to explain before, the SSDs are not the bottleneck, your
> CPUs may be and your OSD HDDs eventually will be.
> Run atop on all your nodes when doing those tests and see how much things
> get pushed (CPUs, disks, the OSD processes).
>
>> My hardware specs:
>>
>> - 3 Controllers, the mons run here
>>   Dell PE R630, 64GB, Intel SSD s3610
>> - 9 Storage nodes
>>   Dell 730xd, 2x2630v4 2.2GHz, 512GB, Journal: 5x200GB Intel 3610 SSD,
>>   OSD: 18x1.8TB Hitachi 10krpm SAS
>>
> I can't really fault you for the choice of CPU, but smaller nodes with
> higher speed and fewer cores may help with this extreme test case (in
> normal production you're fine).
>
>> RAID Controller is PERC 730
>>
>> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
>> Arista 7050X 10Gbit Switches with VARP, and LACP interfaces. I have
>> pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did
>> iperf, and I can do 10Gbps from the VM to the storage nodes.
>>
> Bandwidth is irrelevant in this case, the RTT of 0.3ms feels a bit high.
> If you look again at the flow in
> http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
>
> those round trips will add up to a significant part of your Ceph latency.
>
> To elaborate and demonstrate:
>
> I have a test cluster, consisting of 4 nodes, 2 of them HDD backed OSDs with
> SSD journals and 2 of them SSD based (4x DC S3610 400GB each) as a
> cache-tier for the "normal" ones. All replication 2.
> So for the purpose of this test, this is all 100% against the SSDs in the
> cache-pool only.
>
> The network is IPoIB (QDR, 40Gb/s InfiniBand) with 0.1ms latency between
> nodes, CPU is a single E5-2620 v3.
>
> If I run this from a VM:
> ---
> fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
> ---
>
> We wind up with:
> ---
>   write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
>     slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
>     clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
>      lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> ---
> During this run the CPU is the bottleneck, idle is around 60% (of 1200),
> all 4 OSD processes eat up nearly 3 CPU "cores".
> As I said, small random IOPS is the most stressful thing for Ceph.
> CPU performance settings influence this little/not at all, as everything
> goes to full speed in less than a second and stays there.
>
>
> If we change the FIO invocation to plain sequential "--rw=write" the CPU
> usage is less than 250% (out of 1200), things are pretty relaxed.
> At that point we're basically pushing the edge of latency in all
> components involved:
> ---
>   write: io=1024.0MB, bw=37819KB/s, iops=9454, runt= 27726msec
>     slat (usec): min=1, max=3834, avg= 3.77, stdev= 8.42
>     clat (usec): min=943, max=38129, avg=6764.11, stdev=3262.91
>      lat (usec): min=954, max=38135, avg=6768.04, stdev=3263.55
> ---
>
> If we then lower this to just one outstanding I/O with "--iodepth=1" to see
> how fast things could potentially be if we don't saturate everything:
> ---
>     slat (usec): min=12, max=100, avg=21.43, stdev= 7.96
>     clat (usec): min=1725, max=5873, avg=2485.46, stdev=256.97
>      lat (usec): min=1744, max=5894, avg=2507.35, stdev=257.11
> ---
>
> So 2.5ms instead of 7ms. Not too shabby.
>
>
> Now if we do the same run but with CPU governors set to performance we get:
> ---
>     slat (usec): min=6, max=291, avg=17.34, stdev= 8.00
>     clat (usec): min=957, max=13754, avg=1425.83, stdev=262.85
>      lat (usec): min=968, max=13766, avg=1443.56, stdev=264.54
> ---
>
> So that's where the CPU tuning comes in.
> And this is, in real life where you hopefully don't have thousands of
> small sync I/Os at the same time, a pretty decent result.
>
>
>> I've already set the CPU scaling governor to 'performance' on all
>> hosts for all cores. My Ceph release is the latest Hammer on CentOS7.
>>
> Jewel is also supposed to have many improvements in this area, but frankly
> I haven't been brave (convinced) enough to upgrade from Hammer yet.
>
> Christian
>
>> The best write currently happens at 62 threads it seems, the IOPS is
>> 8.3k for the direct synced writes. The latency and stddev are still
>> concerning..
>> :(
>>
>> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
>>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>     clat percentiles (usec):
>>      | 1.00th=[ 3888], 5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>>      | 99.99th=[17792]
>>     bw (KB /s): min= 315, max= 761, per=1.61%, avg=537.06, stdev=77.13
>>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>>   cpu : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>>
>>
>> From the above we can tell that the latency for clients doing synced
>> writes is somewhere between 5 and 10ms, which seems very high, especially
>> with quite high performing hardware, network, and SSD journals. I'm not
>> sure whether it may be the syncing from Journal to OSD that causes
>> these fluctuations or high latencies.
>>
>> Any help or advice would be much appreciated.
>> thx will
>>
>>
>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=${FILE}
>> runtime=30
>> stonewall=1
>> group_reporting
>>
>> [simple-write-6]
>> numjobs=6
>> [simple-write-10]
>> numjobs=10
>> [simple-write-14]
>> numjobs=14
>> [simple-write-18]
>> numjobs=18
>> [simple-write-22]
>> numjobs=22
>> [simple-write-26]
>> numjobs=26
>> [simple-write-30]
>> numjobs=30
>> [simple-write-34]
>> numjobs=34
>> [simple-write-38]
>> numjobs=38
>> [simple-write-42]
>> numjobs=42
>> [simple-write-46]
>> numjobs=46
>> [simple-write-50]
>> numjobs=50
>> [simple-write-54]
>> numjobs=54
>> [simple-write-58]
>> numjobs=58
>> [simple-write-62]
>> numjobs=62
>> [simple-write-66]
>> numjobs=66
>> [simple-write-70]
>> numjobs=70
>>
>> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <ch...@gol.com> wrote:
>> >
>> > Hello,
>> >
>> >
>> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>> >
>> >> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
>> >> partition in 4 partitions each ~45GB. When I ceph-deploy I declare
>> >> these as the journals of the OSDs.
>> >>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>> >
>> > Because those defaults start flushing things (from RAM, the journal never
>> > gets read unless there is a crash) to the filestore (OSD HDD) pretty much
>> > immediately.
>> >
>> > Again, use google to search the ML archives.
>> >
>> >> I was trying to understand the blocking, and how much my SAS OSDs
>> >> affected my performance. I have a total of 9 hosts, 158 OSDs each
>> >> 1.8TB. The Servers are connected through copper 10Gbit LACP bonds.
>> >> My failure domain is by type RACK. The CRUSH rule set is by rack. 3
>> >> hosts in each rack. Pool size is =3. I'm running hammer on centos7.
>> >>
>> >
>> > Which calls for fully detailing your HW (CPUs, RAM), network
>> > (topology, what switches, inter-rack/switch links), etc.
>> > The reason for this will become obvious below.
>> >
>> >> I did a simple fio test from one of my xl instances, and got the
>> >> results below. The latency of 7.21ms is worrying, is this the expected
>> >> result? Or is there any way I can further tune my cluster to achieve
>> >> better results? thx will
>> >>
>> >
>> >> FIO: sync=1, direct=1, bs=4k
>> >>
>> > Full command line, please.
>> >
>> > Small, sync I/Os are by far the hardest thing for Ceph.
>> >
>> > I can guess what some of the rest was, but it's better to know for sure.
>> > Alternatively, additionally, try this please:
>> >
>> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
>> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>> >
>> >>
>> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >
>> > These numbers suggest you did randwrite and aren't all that surprising.
>> > If you were to run atop on your OSD nodes while doing that fio run, you'd
>> > likely see that both CPUs and individual disks (HDDs) get very busy.
>> >
>> > There are several things conspiring against Ceph here: the latency of its
>> > own code, the network latency of getting all the individual writes to each
>> > replica, the fact that 1000 of these 4K blocks will hit one typical RBD
>> > object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>> >
>> > If you absolutely need low latencies with Ceph, consider dedicated SSD
>> > only pools for special need applications (DB) or a cache tier if it fits
>> > the profile and active working set.
>> > Lower Ceph latency in general by having fast CPUs which have
>> > powersaving (frequency throttling) disabled or set to "performance"
>> > instead of "ondemand".
>> >
>> > Christian
>> >
>> >>     clat percentiles (msec):
>> >>      | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 5],
>> >>      | 30.00th=[ 5], 40.00th=[ 6], 50.00th=[ 7], 60.00th=[ 8],
>> >>      | 70.00th=[ 9], 80.00th=[ 10], 90.00th=[ 12], 95.00th=[ 14],
>> >>      | 99.00th=[ 17], 99.50th=[ 19], 99.90th=[ 21], 99.95th=[ 23],
>> >>      | 99.99th=[ 253]
>> >>     bw (KB /s): min= 341, max= 870, per=2.01%, avg=556.60, stdev=136.98
>> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>> >>   cpu : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>> >>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> >>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      issued : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>> >>
>> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <ch...@gol.com> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >> >
>> >> >> Hi list, while I know that writes in the RADOS backend are sync() can
>> >> >> anyone please explain when the cluster will return on a write call for
>> >> >> RBD from VMs? Will data be considered synced once written to the
>> >> >> journal or all the way to the OSD drive?
>> >> >>
>> >> > This has been answered countless times (really) here, the Ceph Architecture
>> >> > documentation should really be more detailed about this, as well as how
>> >> > parallel the data is being sent to the secondary OSDs.
>> >> >
>> >> > It is of course ack'ed to the client once all journals have successfully
>> >> > written the data, otherwise journal SSDs would make a LOT less sense.
>> >> >
>> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi
>> >> >> 10krpm SAS.
>> >> >>
>> >> > The size of your SSDs (you didn't mention) will determine the speed, for
>> >> > journal purposes the sequential write speed is basically it.
>> >> >
>> >> > A 5:18 ratio implies that some of your SSDs hold more journals than
>> >> > others.
>> >> >
>> >> > You emphatically do NOT want that, because eventually the busier ones
>> >> > will run out of endurance while the other ones still have plenty left.
>> >> >
>> >> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
>> >> > and expected write volume).
>> >> >
>> >> > Christian
>> >> >> I have size=3 for my pool. Will Ceph return once the data is written
>> >> >> to at least 3 designated journals, or will it in fact wait until the
>> >> >> data is written to the OSD drives? thx will
>> >> >> _______________________________________________
>> >> >> ceph-users mailing list
>> >> >> ceph-users@lists.ceph.com
>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Christian Balzer        Network/Systems Engineer
>> >> > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>> >>
>> >
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com