Benchmarking is hard.

It's expected that random IO will be faster than sequential IO with this
pattern. The reason is that random IO goes to different objects/PGs/OSDs
for every 4k block, while sequential IO is stuck on the same 4 MB object
for more IOs than your queue is deep.
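To make that concrete, here's a rough sketch (assuming the default 4 MiB
RBD object size; the queue depth and image size are just illustrative
numbers) of how sequential vs. random 4k offsets map onto objects:

```python
# Sketch: map IO offsets to 4 MiB RBD objects (the default object size),
# illustrating why queued sequential 4k writes pile onto one object while
# random writes spread across many objects (and thus many PGs/OSDs).
import random

OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size
BLOCK = 4 * 1024               # 4k IO size
QD = 32                        # queue depth (illustrative)

# Sequential: the next QD offsets all land in the same object,
# so the whole queue serializes on one PG/OSD set.
seq_objects = {off // OBJECT_SIZE for off in range(0, QD * BLOCK, BLOCK)}
print(len(seq_objects))  # -> 1

# Random over a 5 GiB image: offsets scatter across many objects.
image_size = 5 * 1024 ** 3
rand_offsets = [random.randrange(image_size // BLOCK) * BLOCK for _ in range(QD)]
rand_objects = {off // OBJECT_SIZE for off in rand_offsets}
print(len(rand_objects))  # almost always close to QD distinct objects
```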

The question is: is writing 4k blocks sequentially in any way
representative of your workload? Probably not, so don't test that.

You mentioned VMware, so writing 64k blocks at queue depth 64 might be
relevant for you (e.g., vMotion). In that case, try configuring
striping with a 64 KB stripe unit for the image and test
sequential 64 KB writes.
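Something along these lines, for example (a sketch only; the image name,
size, and stripe count are placeholders to adapt to your setup, and the
pool name is taken from your fio config):

```shell
# Create an image striped with a 64 KB stripe unit. stripe_unit * stripe_count
# must fit within the (default 4 MB) object size; 16 is a placeholder count.
rbd create rbd_af1/stripetest --size 100G \
    --stripe-unit 64K --stripe-count 16

# Sequential 64k writes at queue depth 64, mimicking a vMotion-like stream.
fio --name=seq-write-64k --ioengine=rbd --direct=1 \
    --pool=rbd_af1 --rbdname=stripetest \
    --bs=64k --iodepth=64 --rw=write --runtime=60 --time_based
```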


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 18, 2020 at 5:14 AM Christian Balzer <ch...@gol.com> wrote:
>
>
> Hello,
>
> I had very odd results in the past with the fio rbd engine and would
> suggest testing things in the environment you're going to deploy in, end
> to end.
>
> That said, without any caching and coalescing of writes, sequential 4k
> writes will hit the same set of OSDs for 4MB worth of data, thus limiting
> things to whatever the overall latency (network, 3x write) is here.
> With random writes you will engage more or less all OSDs that hold your
> fio file, thus spreading things out.
> This becomes more and more visible with increasing number of OSDs and
> nodes.
>
> Regards,
>
> Christian
> On Fri, 17 Jan 2020 23:01:09 +0000 Anthony Brandelli (abrandel) wrote:
>
> > Not been able to make any headway on this after some significant effort.
> >
> > -Tested all 48 SSDs with fio directly; all tested within 10% of each other
> > for 4k IOPS in rand|seq read|write.
> > -Disabled all CPU power save.
> > -Tested with both rbd cache enabled and disabled on the client.
> > -Tested with drive caches enabled and disabled (hdparm)
> > -Minimal TCP retransmissions under load (<10 for a 2 minute duration).
> > -No drops/pause frames noted on upstream switches.
> > -CPU load on OSD nodes peaks at ~6.
> > -iostat shows a peak of 15ms under read/write workloads, %util peaks at 
> > about 10%.
> > -Swapped out the RBD client for a bigger box, since the load was peaking at 
> > 16. Now a 24 core box, load still peaks at 16.
> > -Disabled cephx signatures
> > -Verified hardware health (nothing in dmesg, nothing in CIMC fault logs, 
> > storage controller logs)
> > -Tested multiple SSDs at once to find the controller's IOPS limit, which is
> > apparently 650k @ 4k.
> >
> > Nothing has made a noticeable difference here. I'm pretty baffled as to 
> > what would be causing the awful sequential read and write performance, but 
> > allowing good random r/w speeds.
> >
> > I switched up fio testing methodologies to use more threads, but this 
> > didn't seem to help either:
> >
> > [global]
> > bs=4k
> > ioengine=rbd
> > iodepth=32
> > size=5g
> > runtime=120
> > numjobs=4
> > group_reporting=1
> > pool=rbd_af1
> > rbdname=image1
> >
> > [seq-read]
> > rw=read
> > stonewall
> >
> > [rand-read]
> > rw=randread
> > stonewall
> >
> > [seq-write]
> > rw=write
> > stonewall
> >
> > [rand-write]
> > rw=randwrite
> > stonewall
> >
> > Any pointers are appreciated at this point. I've been following other 
> > threads on the mailing list, and looked at the archives, related to RBD 
> > performance but none of the solutions that worked for others seem to have 
> > helped this setup.
> >
> > Thanks,
> > Anthony
> >
> > ________________________________
> > From: Anthony Brandelli (abrandel) <abran...@cisco.com>
> > Sent: Tuesday, January 14, 2020 12:43 AM
> > To: ceph-users@lists.ceph.com <ceph-users@lists.ceph.com>
> > Subject: Slow Performance - Sequential IO
> >
> >
> > I have a newly setup test cluster that is giving some surprising numbers 
> > when running fio against an RBD. The end goal here is to see how viable a 
> > Ceph based iSCSI SAN of sorts is for VMware clusters, which require a bunch 
> > of random IO.
> >
> >
> >
> > Hardware:
> >
> > 2x E5-2630L v2 (2.4GHz, 6 core)
> >
> > 256GB RAM
> >
> > 2x 10gbps bonded network, Intel X520
> >
> > LSI 9271-8i, SSDs used for OSDs in JBOD mode
> >
> > Mons: 2x 1.2TB 10K SAS in RAID1
> >
> > OSDs: 12x Samsung MZ6ER800HAGL-00003 800GB SAS SSDs, super cap/power loss 
> > protection
> >
> >
> >
> > Cluster setup:
> >
> > Three mon nodes, four OSD nodes
> >
> > Two OSDs per SSD
> >
> > Replica 3 pool
> >
> > Ceph 14.2.5
> >
> >
> >
> > Ceph status:
> >
> >   cluster:
> >
> >     id:     e3d93b4a-520c-4d82-a135-97d0bda3e69d
> >
> >     health: HEALTH_WARN
> >
> >             application not enabled on 1 pool(s)
> >
> >   services:
> >
> >     mon: 3 daemons, quorum mon1,mon2,mon3 (age 6d)
> >
> >     mgr: mon2(active, since 6d), standbys: mon3, mon1
> >
> >     osd: 96 osds: 96 up (since 3d), 96 in (since 3d)
> >
> >   data:
> >
> >     pools:   1 pools, 3072 pgs
> >
> >     objects: 857.00k objects, 1.8 TiB
> >
> >     usage:   432 GiB used, 34 TiB / 35 TiB avail
> >
> >     pgs:     3072 active+clean
> >
> >
> >
> > Network between nodes tests at 9.88gbps. Direct testing of the SSDs using a 
> > 4K block in fio shows 127k seq read, 86k random read, 107k seq write, 52k
> > random write. No high CPU load/interface saturation is noted when running 
> > tests against the rbd.
> >
> >
> >
> > When testing with a 4K block size against an RBD on a dedicated metal test 
> > host (same specs as other cluster nodes noted above) I get the following 
> > (command similar to fio -ioengine=rbd -direct=1 -name=test -bs=4k 
> > -iodepth=32 -rw=XXXX -pool=scbench -runtime=60 -rbdname=datatest):
> >
> >
> >
> > 10k sequential read iops
> >
> > 69k random read iops
> >
> > 13k sequential write iops
> >
> > 22k random write iops
> >
> >
> >
> > I'm not clear on why the random ops, especially reads, would be so much
> > quicker than the sequential ops.
> >
> >
> >
> > Any pointers appreciated.
> >
> >
> >
> > Thanks,
> >
> > Anthony
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Rakuten Mobile Inc.
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
