On 11/09/2014 08:20, Alexandre DERUMIER wrote:
> Hi Sebastien,
>
> here are my first results with the Crucial M550 (I'll send results with the
> Intel S3500 later):
>
> - 3 nodes
> - Dell R620 without expander backplane
> - SAS controller: LSI 9207 (no hardware RAID or cache)
> - 2 x E5-2603v2 1.8GHz (4 cores)
> - 32GB RAM
> - network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
>
> - OS: Debian wheezy, with kernel 3.10
>
> OS + ceph mon: 2x Intel S3500 100GB, Linux soft RAID
> OSD: Crucial M550 (1TB)
>
>
> 3 mons in the ceph cluster,
> and 1 OSD (journal and data on the same disk)
>
>
> ceph.conf
> ---------
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> osd_op_threads = 5
> filestore_op_threads = 4
>
> ms_nocrc = true
> cephx sign messages = false
> cephx require signatures = false
>
> ms_dispatch_throttle_bytes = 0
>
> # 0.85
> throttler_perf_counter = false
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> osd_enable_op_tracker = true
>
>
> Fio disk 4K benchmark
> ---------------------
> rand read 4k: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=271755KB/s, iops=67938
>
> rand write 4k: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=228293KB/s, iops=57073
>
>
> fio OSD benchmark (through librbd)
> ----------------------------------
> [global]
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=test
> invalidate=0 # mandatory
> rw=randwrite
> rw=randread
> bs=4k
> direct=1
> numjobs=4
> group_reporting=1
>
> [rbd_iodepth32]
> iodepth=32
>
>
> FIREFLY RESULTS
> ---------------
> fio randwrite: bw=5009.6KB/s, iops=1252
> fio randread:  bw=37820KB/s, iops=9455
>
>
> 0.85 RESULTS
> ------------
> fio randwrite: bw=11658KB/s, iops=2914
> fio randread:  bw=38642KB/s, iops=9660
>
>
> 0.85 + osd_enable_op_tracker=false
> ----------------------------------
> fio randwrite: bw=11630KB/s, iops=2907
> fio randread:  bw=80606KB/s, iops=20151 (CPU 100% - GREAT!)
>
>
> So, for reads, it seems that osd_enable_op_tracker is the bottleneck.
>
> Now for writes, I really don't understand why they are so low.
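
(Side note for anyone reproducing the op tracker result above: injectargs should
let you flip it without editing ceph.conf — a sketch, assuming an admin keyring
on the node; if the OSD does not honour the change at runtime, setting it in
ceph.conf and restarting the OSD is the fallback:

    # runtime toggle of the op tracker on osd.0
    ceph tell osd.0 injectargs '--osd_enable_op_tracker=false'

)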
>
> I have done some iostat runs:
>
> FIO directly on /dev/sdb
> bw=228293KB/s, iops=57073
>
> Device: rrqm/s wrqm/s    r/s      w/s  rkB/s     wkB/s avgrq-sz avgqu-sz await r_await w_await svctm  %util
> sdb       0.00   0.00   0.00 63613.00   0.00 254452.00     8.00    31.24  0.49    0.00    0.49  0.02 100.00
>
> FIO on the OSD through librbd
> bw=11658KB/s, iops=2914
>
> Device: rrqm/s wrqm/s    r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb       0.00 355.00   0.00 5225.00   0.00 29678.00    11.36    57.63 11.03    0.00   11.03  0.19 99.70
>
> (I don't understand what exactly %util is: 100% in both cases, yet 10x slower
> with ceph.)

It would be interesting if you could catch the size of the writes on the SSD
during the bench through librbd (I know nmon can do that). For what it's worth,
%util only measures the fraction of time the device had at least one request in
flight; an SSD completes many requests in parallel, so 100% does not necessarily
mean the drive is saturated, which is why both runs can show it at very
different throughputs.

> It could be a dsync problem; the results seem pretty poor:
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
> ^C17228+0 records in
> 17228+0 records out
> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
>
> I'll do tests with the Intel S3500 tomorrow to compare.
>
> ----- Original Message -----
>
> From: "Sebastien Han" <sebastien....@enovance.com>
> To: "Warren Wang" <warren_w...@cable.comcast.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, September 8, 2014 22:58:25
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> They definitely are, Warren!
>
> Thanks for bringing this up here :).
>
> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>
>> +1 to what Cedric said.
>>
>> Anything more than a few minutes of heavy sustained writes tended to get our
>> solid state devices into a state where garbage collection could not keep up.
>> Originally we used small SSDs and did not overprovision the journals by
>> much. Manufacturers publish their SSD stats, and then, in very small font,
>> state that the attained IOPS are with empty drives and that the tests are
>> only run for very short amounts of time. Even if the drives are new, it's a
>> good idea to perform an hdparm secure erase on them (so that the SSD knows
>> that the blocks are truly unused), and then overprovision them. You'll know
>> if you have a problem by watching the utilization and wait data on the
>> journals.
>>
>> One of the other interesting performance issues is that the Intel 10GbE NICs
>> + default kernel that we typically use max out around 1 million packets/sec.
>> It's worth tracking this metric to see if you are close.
>>
>> I know these aren't necessarily relevant to the test parameters you gave
>> below, but they're worth keeping in mind.
>>
>> --
>> Warren Wang
>> Comcast Cloud (OpenStack)
>>
>>
>> From: Cedric Lemarchand <ced...@yipikai.org>
>> Date: Wednesday, September 3, 2014 at 5:14 PM
>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>> On 03/09/2014 22:11, Sebastien Han wrote:
>>> Hi Warren,
>>>
>>> What do you mean exactly by secure erase? At the firmware level, with
>>> manufacturer tools?
>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I
>>> believe that only aged SSDs have this behaviour, but I might be wrong.
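
(For reference, the usual ATA secure erase procedure with hdparm looks like the
sketch below. It destroys everything on the drive, the drive must not be in the
"frozen" state (check with hdparm -I), and the password is a throwaway;
/dev/sdX is a placeholder:

    # set a temporary security password, then issue the erase
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX

Leaving 20-50% of the device unpartitioned afterwards is a simple way to get
the overprovisioning Warren mentions.)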
>>>
>> Sorry, I forgot to reply to the real question ;-)
>> So yes, it only comes into play after some time. In your case, if the SSD
>> still delivers the write IOPS specified by the manufacturer, it won't help
>> in any way.
>>
>> But it seems this practice is increasingly common nowadays.
>>
>> Cheers
>>> On 02 Sep 2014, at 18:23, Wang, Warren <warren_w...@cable.comcast.com>
>>> wrote:
>>>
>>>> Hi Sebastien,
>>>>
>>>> Something I didn't see in the thread so far: did you secure erase the SSDs
>>>> before they got used? I assume these were probably repurposed for this
>>>> test. We have seen some pretty significant garbage collection issues on
>>>> various SSDs and other forms of solid state storage, to the point where we
>>>> now overprovision pretty much every solid state device, by as much as 50%,
>>>> to handle sustained write operations. Especially important for the
>>>> journals, as we've found.
>>>>
>>>> Maybe not an issue on the short fio run below, but certainly evident on
>>>> longer runs or with lots of historical data on the drives. The max
>>>> transaction time looks pretty good for your test. Something to consider
>>>> though.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Sebastien Han
>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>> To: ceph-users
>>>> Cc: Mark Nelson
>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>
>>>> Hey all,
>>>>
>>>> It has been a while since the last performance-related thread on the ML :p
>>>> I've been running some experiments to see how much I can get from an SSD
>>>> on a Ceph cluster.
>>>> To achieve that I did something pretty simple:
>>>>
>>>> * Debian wheezy 7.6
>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real
>>>>   deployment I'll use 3)
>>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>>> * replica count of 1
>>>> * partitions are perfectly aligned
>>>> * io scheduler is set to noop, but deadline was showing the same results
>>>> * no updatedb running
>>>>
>>>> About the box:
>>>>
>>>> * 32GB of RAM
>>>> * 12 cores with HT @ 2.4 GHz
>>>> * WB cache is enabled on the controller
>>>> * 10Gbps network (doesn't help here)
>>>>
>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K
>>>> IOPS with random 4k writes (my fio results). As a benchmark tool I used
>>>> fio with the rbd engine (thanks Deutsche Telekom guys!).
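
(Two of the setup bullets above — scheduler and alignment — are easy to get
silently wrong. They can be verified with standard tools; the device name here
is an assumption:

    cat /sys/block/sdo/queue/scheduler        # active scheduler is shown in brackets
    parted /dev/sdo align-check optimal 1     # checks alignment of partition 1

)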
>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>
>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>
>>>> # du -sh rand.file
>>>> 256M rand.file
>>>>
>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>>
>>>> See my ceph.conf:
>>>>
>>>> [global]
>>>> auth cluster required = cephx
>>>> auth service required = cephx
>>>> auth client required = cephx
>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>> osd pool default pg num = 4096
>>>> osd pool default pgp num = 4096
>>>> osd pool default size = 2
>>>> osd crush chooseleaf type = 0
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> [mon]
>>>> mon osd down out interval = 600
>>>> mon osd min down reporters = 13
>>>> [mon.ceph-01]
>>>> host = ceph-01
>>>> mon addr = 172.20.20.171
>>>> [mon.ceph-02]
>>>> host = ceph-02
>>>> mon addr = 172.20.20.172
>>>> [mon.ceph-03]
>>>> host = ceph-03
>>>> mon addr = 172.20.20.173
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> [osd]
>>>> osd mkfs type = xfs
>>>> osd mkfs options xfs = -f -i size=2048
>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>> osd journal size = 20480
>>>> cluster_network = 172.20.20.0/24
>>>> public_network = 172.20.20.0/24
>>>> osd mon heartbeat interval = 30
>>>> # Performance tuning
>>>> filestore merge threshold = 40
>>>> filestore split multiple = 8
>>>> osd op threads = 8
>>>> # Recovery tuning
>>>> osd recovery max active = 1
>>>> osd max backfills = 1
>>>> osd recovery op priority = 1
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> Disabling all debugging made me gain 200/300 more IOPS.
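
(To run the job file below against a fresh cluster, the pool and image it
references have to exist first. Names and image size are taken from the
template; the PG count just reuses the defaults from the ceph.conf above:

    ceph osd pool create test 4096 4096
    rbd create test/fio --size 5120      # size in MB, i.e. 5G

)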
>>>> See my fio template:
>>>>
>>>> [global]
>>>> #logging
>>>> #write_iops_log=write_iops_log
>>>> #write_bw_log=write_bw_log
>>>> #write_lat_log=write_lat_log
>>>>
>>>> time_based
>>>> runtime=60
>>>>
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=test
>>>> rbdname=fio
>>>> invalidate=0 # mandatory
>>>> #rw=randwrite
>>>> rw=write
>>>> bs=4k
>>>> #bs=32m
>>>> size=5G
>>>> group_reporting
>>>>
>>>> [rbd_iodepth32]
>>>> iodepth=32
>>>> direct=1
>>>>
>>>> See my fio output:
>>>>
>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>> fio-2.1.11-14-gb74e
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.8
>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>      | 99.99th=[28032]
>>>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>      issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>      latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s,
>>>>   mint=60010msec, maxt=60010msec
>>>>
>>>> Disk stats (read/write):
>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%,
>>>>   aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>>
>>>> I tried to tweak several parameters like:
>>>>
>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>> filestore queue max ops = 2000
>>>>
>>>> But I didn't see any improvement.
>>>>
>>>> Then I tried other things:
>>>>
>>>> * Increasing the iodepth up to 256 or 512 gave me between 50 and 100 more
>>>>   IOPS, but it's not a realistic workload anymore and not that significant.
>>>> * Adding another SSD for the journal: still getting 3.2K IOPS.
>>>> * I tried with rbd bench and I also got 3K IOPS.
>>>> * I ran the test on a client machine and then locally on the server: still
>>>>   getting 3.2K IOPS.
>>>> * Put the journal in memory: still getting 3.2K IOPS.
>>>> * With 2 clients running the test in parallel I got a total of 3.6K IOPS,
>>>>   but I don't seem to be able to go over that.
>>>> * I tried to add another OSD on that SSD, so I had 2 OSDs and 2 journals
>>>>   on 1 SSD: got 4.5K IOPS, YAY!
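
(For anyone replicating that last data point, splitting the SSD for 2 OSDs can
be as simple as a GPT layout like this — the device name and split ratios are
assumptions, one data and one journal partition per OSD, followed by one OSD
prepare/create per data+journal pair:

    parted -s /dev/sdo mklabel gpt
    parted -s /dev/sdo mkpart osd0-data 1MiB 40%
    parted -s /dev/sdo mkpart osd1-data 40% 80%
    parted -s /dev/sdo mkpart osd0-journal 80% 90%
    parted -s /dev/sdo mkpart osd1-journal 90% 100%

)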
>>>> Given the results of the last test, it seems that something is limiting
>>>> the number of IOPS per OSD process.
>>>>
>>>> Running the test on a client or locally didn't show any difference,
>>>> so it looks to me like there is some contention within Ceph that might
>>>> cause this.
>>>>
>>>> I also ran perf and looked at the output; everything looks decent, but
>>>> someone might want to have a look at it :).
>>>>
>>>> We have been able to reproduce this on 3 distinct platforms, with some
>>>> deviations (because of the hardware), but the behaviour is the same.
>>>> Any thoughts will be highly appreciated; only getting 3.2K out of a 29K
>>>> IOPS SSD is a bit frustrating :).
>>>>
>>>> Cheers.
>>>> ----
>>>> Sébastien Han
>>>> Cloud Architect
>>>>
>>>> "Always give 100%. Unless you're giving blood."
>>>>
>>>> Phone: +33 (0)1 49 70 99 72
>>>> Mail: sebastien....@enovance.com
>>>> Address: 11 bis, rue Roquépine - 75008 Paris
>>>> Web: www.enovance.com - Twitter: @enovance
>>>>
>>> Cheers.
>>> ––––
>>> Sébastien Han
>>>
>> --
>> Cédric
>
> Cheers.
> ––––
> Sébastien Han
>
--
Cédric

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com