On 11/09/2014 08:20, Alexandre DERUMIER wrote:
> Hi Sebastien,
>
> here are my first results with the Crucial M550 (I'll send results with the
> Intel S3500 later):
>
> - 3 nodes
> - Dell R620 without expander backplane
> - SAS controller: LSI 9207 (no hardware RAID or cache)
> - 2 x E5-2603v2 1.8GHz (4 cores)
> - 32GB RAM
> - network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
>
> - OS: Debian wheezy, with kernel 3.10
>
> OS + ceph mon: 2x Intel S3500 100GB, Linux soft RAID
> OSD: Crucial M550 (1TB)
>
>
> 3 mons in the ceph cluster,
> and 1 OSD (journal and data on the same disk)
>
>
> ceph.conf
> ---------
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> osd_op_threads = 5
> filestore_op_threads = 4
>
> ms_nocrc = true
> cephx sign messages = false
> cephx require signatures = false
>
> ms_dispatch_throttle_bytes = 0
>
> # 0.85
> throttler_perf_counter = false
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> osd_enable_op_tracker = true
>
>
> Fio disk 4K benchmark
> ---------------------
> rand read 4k: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=271755KB/s, iops=67938
>
> rand write 4k: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=228293KB/s, iops=57073
>
>
> fio OSD benchmark (through librbd)
> ----------------------------------
> [global]
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=test
> invalidate=0 # mandatory
> rw=randwrite
> rw=randread
> bs=4k
> direct=1
> numjobs=4
> group_reporting=1
>
> [rbd_iodepth32]
> iodepth=32
>
>
> FIREFLY RESULTS
> ---------------
> fio randwrite: bw=5009.6KB/s, iops=1252
> fio randread:  bw=37820KB/s, iops=9455
>
>
> 0.85 RESULTS
> ------------
> fio randwrite: bw=11658KB/s, iops=2914
> fio randread:  bw=38642KB/s, iops=9660
>
>
> 0.85 + osd_enable_op_tracker=false
> ----------------------------------
> fio randwrite: bw=11630KB/s, iops=2907
> fio randread:  bw=80606KB/s, iops=20151 (CPU 100% - GREAT!)
>
>
> So, for reads, it seems that osd_enable_op_tracker is the bottleneck.
>
> Now for writes, I really don't understand why they are so low.
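
(Side note for anyone reproducing the op tracker result above: injectargs should
let you flip it without editing ceph.conf — a sketch, assuming an admin keyring
on the node; if the OSD does not honour the change at runtime, setting it in
ceph.conf and restarting the OSD is the fallback:

    # runtime toggle of the op tracker on osd.0
    ceph tell osd.0 injectargs '--osd_enable_op_tracker=false'

)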
>
> I have done some iostat runs:
>
> FIO directly on /dev/sdb
> bw=228293KB/s, iops=57073
>
> Device: rrqm/s wrqm/s    r/s      w/s  rkB/s     wkB/s avgrq-sz avgqu-sz await r_await w_await svctm  %util
> sdb       0.00   0.00   0.00 63613.00   0.00 254452.00     8.00    31.24  0.49    0.00    0.49  0.02 100.00
>
> FIO on the OSD through librbd
> bw=11658KB/s, iops=2914
>
> Device: rrqm/s wrqm/s    r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb       0.00 355.00   0.00 5225.00   0.00 29678.00    11.36    57.63 11.03    0.00   11.03  0.19 99.70
>
> (I don't understand what exactly %util is: 100% in both cases, yet 10x slower
> with ceph.)

It would be interesting if you could catch the size of the writes on the SSD
during the bench through librbd (I know nmon can do that). For what it's worth,
%util only measures the fraction of time the device had at least one request in
flight; an SSD completes many requests in parallel, so 100% does not necessarily
mean the drive is saturated, which is why both runs can show it at very
different throughputs.

> It could be a dsync problem; the results seem pretty poor:
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
> ^C17228+0 records in
> 17228+0 records out
> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
>
> I'll do tests with the Intel S3500 tomorrow to compare.
>
> ----- Original Message -----
>
> From: "Sebastien Han" <sebastien....@enovance.com>
> To: "Warren Wang" <warren_w...@cable.comcast.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, September 8, 2014 22:58:25
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> They definitely are, Warren!
>
> Thanks for bringing this up here :).
>
> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>
>> +1 to what Cedric said.
>>
>> Anything more than a few minutes of heavy sustained writes tended to get our
>> solid state devices into a state where garbage collection could not keep up.
>> Originally we used small SSDs and did not overprovision the journals by
>> much. Manufacturers publish their SSD stats, and then, in very small font,
>> state that the attained IOPS are with empty drives and that the tests are
>> only run for very short amounts of time. Even if the drives are new, it's a
>> good idea to perform an hdparm secure erase on them (so that the SSD knows
>> that the blocks are truly unused), and then overprovision them. You'll know
>> if you have a problem by watching the utilization and wait data on the
>> journals.
>>
>> One of the other interesting performance issues is that the Intel 10GbE NICs
>> + default kernel that we typically use max out around 1 million packets/sec.
>> It's worth tracking this metric to see if you are close.
>>
>> I know these aren't necessarily relevant to the test parameters you gave
>> below, but they're worth keeping in mind.
>>
>> --
>> Warren Wang
>> Comcast Cloud (OpenStack)
>>
>>
>> From: Cedric Lemarchand <ced...@yipikai.org>
>> Date: Wednesday, September 3, 2014 at 5:14 PM
>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>> On 03/09/2014 22:11, Sebastien Han wrote:
>>> Hi Warren,
>>>
>>> What do you mean exactly by secure erase? At the firmware level, with
>>> manufacturer tools?
>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I
>>> believe that only aged SSDs have this behaviour, but I might be wrong.
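
(For reference, the usual ATA secure erase procedure with hdparm looks like the
sketch below. It destroys everything on the drive, the drive must not be in the
"frozen" state (check with hdparm -I), and the password is a throwaway;
/dev/sdX is a placeholder:

    # set a temporary security password, then issue the erase
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX

Leaving 20-50% of the device unpartitioned afterwards is a simple way to get
the overprovisioning Warren mentions.)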
>>>
>> Sorry, I forgot to reply to the real question ;-)
>> So yes, it only comes into play after some time. In your case, if the SSD
>> still delivers the write IOPS specified by the manufacturer, it won't help
>> in any way.
>>
>> But it seems this practice is increasingly common nowadays.
>>
>> Cheers
>>> On 02 Sep 2014, at 18:23, Wang, Warren <warren_w...@cable.comcast.com>
>>> wrote:
>>>
>>>> Hi Sebastien,
>>>>
>>>> Something I didn't see in the thread so far: did you secure erase the SSDs
>>>> before they got used? I assume these were probably repurposed for this
>>>> test. We have seen some pretty significant garbage collection issues on
>>>> various SSDs and other forms of solid state storage, to the point where we
>>>> now overprovision pretty much every solid state device, by as much as 50%,
>>>> to handle sustained write operations. Especially important for the
>>>> journals, as we've found.
>>>>
>>>> Maybe not an issue on the short fio run below, but certainly evident on
>>>> longer runs or with lots of historical data on the drives. The max
>>>> transaction time looks pretty good for your test. Something to consider
>>>> though.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Sebastien Han
>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>> To: ceph-users
>>>> Cc: Mark Nelson
>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>
>>>> Hey all,
>>>>
>>>> It has been a while since the last performance-related thread on the ML :p
>>>> I've been running some experiments to see how much I can get from an SSD
>>>> on a Ceph cluster.
>>>> To achieve that I did something pretty simple:
>>>>
>>>> * Debian wheezy 7.6
>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real
>>>>   deployment I'll use 3)
>>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>>> * replica count of 1
>>>> * partitions are perfectly aligned
>>>> * io scheduler is set to noop, but deadline was showing the same results
>>>> * no updatedb running
>>>>
>>>> About the box:
>>>>
>>>> * 32GB of RAM
>>>> * 12 cores with HT @ 2.4 GHz
>>>> * WB cache is enabled on the controller
>>>> * 10Gbps network (doesn't help here)
>>>>
>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K
>>>> IOPS with random 4k writes (my fio results). As a benchmark tool I used
>>>> fio with the rbd engine (thanks Deutsche Telekom guys!).
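
(Two of the setup bullets above — scheduler and alignment — are easy to get
silently wrong. They can be verified with standard tools; the device name here
is an assumption:

    cat /sys/block/sdo/queue/scheduler        # active scheduler is shown in brackets
    parted /dev/sdo align-check optimal 1     # checks alignment of partition 1

)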
>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>
>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>
>>>> # du -sh rand.file
>>>> 256M rand.file
>>>>
>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>>
>>>> See my ceph.conf:
>>>>
>>>> [global]
>>>> auth cluster required = cephx
>>>> auth service required = cephx
>>>> auth client required = cephx
>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>> osd pool default pg num = 4096
>>>> osd pool default pgp num = 4096
>>>> osd pool default size = 2
>>>> osd crush chooseleaf type = 0
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> [mon]
>>>> mon osd down out interval = 600
>>>> mon osd min down reporters = 13
>>>> [mon.ceph-01]
>>>> host = ceph-01
>>>> mon addr = 172.20.20.171
>>>> [mon.ceph-02]
>>>> host = ceph-02
>>>> mon addr = 172.20.20.172
>>>> [mon.ceph-03]
>>>> host = ceph-03
>>>> mon addr = 172.20.20.173
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> [osd]
>>>> osd mkfs type = xfs
>>>> osd mkfs options xfs = -f -i size=2048
>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>> osd journal size = 20480
>>>> cluster_network = 172.20.20.0/24
>>>> public_network = 172.20.20.0/24
>>>> osd mon heartbeat interval = 30
>>>> # Performance tuning
>>>> filestore merge threshold = 40
>>>> filestore split multiple = 8
>>>> osd op threads = 8
>>>> # Recovery tuning
>>>> osd recovery max active = 1
>>>> osd max backfills = 1
>>>> osd recovery op priority = 1
>>>>
>>>> debug lockdep = 0/0
>>>> debug context = 0/0
>>>> debug crush = 0/0
>>>> debug buffer = 0/0
>>>> debug timer = 0/0
>>>> debug journaler = 0/0
>>>> debug osd = 0/0
>>>> debug optracker = 0/0
>>>> debug objclass = 0/0
>>>> debug filestore = 0/0
>>>> debug journal = 0/0
>>>> debug ms = 0/0
>>>> debug monc = 0/0
>>>> debug tp = 0/0
>>>> debug auth = 0/0
>>>> debug finisher = 0/0
>>>> debug heartbeatmap = 0/0
>>>> debug perfcounter = 0/0
>>>> debug asok = 0/0
>>>> debug throttle = 0/0
>>>>
>>>> Disabling all debugging made me gain 200/300 more IOPS.
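
(To run the job file below against a fresh cluster, the pool and image it
references have to exist first. Names and image size are taken from the
template; the PG count just reuses the defaults from the ceph.conf above:

    ceph osd pool create test 4096 4096
    rbd create test/fio --size 5120      # size in MB, i.e. 5G

)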
>>>> See my fio template:
>>>>
>>>> [global]
>>>> #logging
>>>> #write_iops_log=write_iops_log
>>>> #write_bw_log=write_bw_log
>>>> #write_lat_log=write_lat_log
>>>>
>>>> time_based
>>>> runtime=60
>>>>
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=test
>>>> rbdname=fio
>>>> invalidate=0 # mandatory
>>>> #rw=randwrite
>>>> rw=write
>>>> bs=4k
>>>> #bs=32m
>>>> size=5G
>>>> group_reporting
>>>>
>>>> [rbd_iodepth32]
>>>> iodepth=32
>>>> direct=1
>>>>
>>>> See my fio output:
>>>>
>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>> fio-2.1.11-14-gb74e
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.8
>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>      | 99.99th=[28032]
>>>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>      issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>      latency : target=0, window=0, percentile=100.00%, depth=32
>>>>
>>>> Run status group 0 (all jobs):
>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s,
>>>>   mint=60010msec, maxt=60010msec
>>>>
>>>> Disk stats (read/write):
>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%,
>>>>   aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>>
>>>> I tried to tweak several parameters like:
>>>>
>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>> filestore queue max ops = 2000
>>>>
>>>> But I didn't see any improvement.
>>>>
>>>> Then I tried other things:
>>>>
>>>> * Increasing the iodepth up to 256 or 512 gave me between 50 and 100 more
>>>>   IOPS, but it's not a realistic workload anymore and not that significant.
>>>> * Adding another SSD for the journal: still getting 3.2K IOPS.
>>>> * I tried with rbd bench and I also got 3K IOPS.
>>>> * I ran the test on a client machine and then locally on the server: still
>>>>   getting 3.2K IOPS.
>>>> * Put the journal in memory: still getting 3.2K IOPS.
>>>> * With 2 clients running the test in parallel I got a total of 3.6K IOPS,
>>>>   but I don't seem to be able to go over that.
>>>> * I tried to add another OSD on that SSD, so I had 2 OSDs and 2 journals
>>>>   on 1 SSD: got 4.5K IOPS, YAY!
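
(For anyone replicating that last data point, splitting the SSD for 2 OSDs can
be as simple as a GPT layout like this — the device name and split ratios are
assumptions, one data and one journal partition per OSD, followed by one OSD
prepare/create per data+journal pair:

    parted -s /dev/sdo mklabel gpt
    parted -s /dev/sdo mkpart osd0-data 1MiB 40%
    parted -s /dev/sdo mkpart osd1-data 40% 80%
    parted -s /dev/sdo mkpart osd0-journal 80% 90%
    parted -s /dev/sdo mkpart osd1-journal 90% 100%

)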
>>>> Given the results of the last test, it seems that something is limiting
>>>> the number of IOPS per OSD process.
>>>>
>>>> Running the test on a client or locally didn't show any difference,
>>>> so it looks to me like there is some contention within Ceph that might
>>>> cause this.
>>>>
>>>> I also ran perf and looked at the output; everything looks decent, but
>>>> someone might want to have a look at it :).
>>>>
>>>> We have been able to reproduce this on 3 distinct platforms, with some
>>>> deviations (because of the hardware), but the behaviour is the same.
>>>> Any thoughts will be highly appreciated; only getting 3.2K out of a 29K
>>>> IOPS SSD is a bit frustrating :).
>>>>
>>>> Cheers.
>>>> ----
>>>> Sébastien Han
>>>> Cloud Architect
>>>>
>>>> "Always give 100%. Unless you're giving blood."
>>>>
>>>> Phone: +33 (0)1 49 70 99 72
>>>> Mail: sebastien....@enovance.com
>>>> Address: 11 bis, rue Roquépine - 75008 Paris
>>>> Web: www.enovance.com - Twitter: @enovance
>>>>
>>> Cheers.
>>> ––––
>>> Sébastien Han
>>>
>> --
>> Cédric
>
> Cheers.
> ––––
> Sébastien Han
>
--
Cédric

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com