*Update:*

*Hardware:*
Upgraded the RAID controller to an LSI MegaRAID 9341 (12 Gb/s).
3 x Samsung 840 EVO - showing 45K IOPS in an fio test with 7 threads and 4k block size, in *JBOD* mode.
CPU - 16 cores @ 2.27 GHz
RAM - 24 GB
NIC - 10 Gbit/s with *under 1 ms latency*; iperf shows 9.18 Gbps between host and client.
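The exact fio invocation behind that 45K IOPS number isn't shown; a minimal sketch of the kind of raw-SSD 4k random-write run it implies (numjobs=7 matches the 7 threads, while the iodepth, runtime and /dev/sdX device name are assumptions - and note it writes straight to the raw disk, destroying any data on it):

# caution: overwrites /dev/sdX directly
fio --name=ssd-baseline --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --numjobs=7 --iodepth=32 \
    --runtime=60 --time_based --group_reporting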
*Software*
Ubuntu 14.04 with the stock 3.13 kernel. Upgraded from Firefly to Giant [*ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)*]. Changed the file system to btrfs and the I/O scheduler to noop.

*Ceph Setup*
Replication set to 1, using 2 SSD OSDs and 1 SSD for the journal. All are Samsung 840 EVO in *JBOD* mode on a single server.

*Configuration:*
[global]
fsid = 979f32fc-6f31-43b0-832f-29fcc4c5a648
mon_initial_members = ceph1
mon_host = 10.99.10.118
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 250
osd_pool_default_pgp_num = 250
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[client]
rbd_cache = true

*Client*
Ubuntu 14.04 with 16 cores @ 2.53 GHz and 24 GB RAM

*Results*

rados bench -p rbd -b 4096 -t 16 10 write

Maintaining 16 concurrent writes of 4096 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ubuntucompute_3931
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
    0       0         0         0         0         0         -           0
    1      16      6370      6354   24.8124   24.8203   0.00221  0.00251512
    2      16     11618     11602   22.6536      20.5  0.001025  0.00275493
    3      16     16889     16873   21.9637   20.5898  0.001288  0.00281797
    4      16     17310     17294    16.884   1.64453  0.054066  0.00365805
    5      16     17695     17679    13.808   1.50391  0.001451  0.00444409
    6      16     18127     18111   11.7868    1.6875  0.001463  0.00527521
    7      16     21647     21631   12.0669     13.75  0.001601   0.0051773
    8      16     28056     28040   13.6872   25.0352  0.005268  0.00456353
    9      16     28947     28931    12.553   3.48047   0.06647  0.00494762
   10      16     29346     29330   11.4536   1.55859  0.001341  0.00542312
Total time run:         10.077931
Total writes made:      29347
Write size:             4096
Bandwidth (MB/sec):     11.375
Stddev Bandwidth:       10.5124
Max bandwidth (MB/sec): 25.0352
Min bandwidth (MB/sec): 0
Average Latency:        0.00548729
Stddev Latency:         0.0169545
Max latency:            0.249019
Min latency:            0.000748

*ceph -s*
    cluster 979f32fc-6f31-43b0-832f-29fcc4c5a648
     health HEALTH_OK
     monmap e1: 1 mons at {ceph1=10.99.10.118:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e30: 2 osds: 2 up, 2 in
      pgmap v255: 250 pgs, 1 pools, 92136 kB data, 23035 objects
            77068 kB used, 929 GB / 931 GB avail
                 250 active+clean
  client io 11347 kB/s wr, 2836 op/s

*iostat*
Device:      tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda         6.00         0.00       112.00          0        448
sdb      3985.50         0.00     21048.00          0      84192
sdd       414.50         0.00     14083.00          0      56332
sdc       415.00         0.00     10944.00          0      43776

where sdb is the journal and sdc, sdd are the OSDs.

*dd output*
dd if=/dev/zero of=/dev/rbd0 bs=4k count=25000 oflag=direct
25000+0 records in
25000+0 records out
102400000 bytes (102 MB) copied, 23.0863 s, 4.4 MB/s

Here performance has increased from 1 MB/s to 4.4 MB/s, but it is still not what I was expecting.
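For reference, dd with oflag=direct and bs=4k issues one 4k write at a time, so 4.4 MB/s works out to roughly 1100 IOPS, i.e. close to 1 ms per operation; a single stream like this mostly measures per-request round-trip latency rather than what the cluster can sustain in parallel. A rough sketch of a more parallel run against the mapped image (the job name, numjobs, iodepth and runtime here are just assumptions, and like the dd test above it writes directly to /dev/rbd0):

# caution: writes directly to the mapped RBD device, like the dd test above
fio --name=rbd-4k-randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --numjobs=4 --iodepth=32 \
    --runtime=60 --time_based --group_reporting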
*fio with 4k writes with 2 threads*

journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 2 processes
Jobs: 2 (f=2): [WW] [1.4% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:27m:25s]
journal-test: (groupid=0, jobs=2): err= 0: pid=4077: Sat Mar 7 02:50:45 2015
  write: io=292936KB, bw=3946.1KB/s, iops=986, runt= 74236msec
    clat (usec): min=645, max=16855K, avg=2023.56, stdev=88071.07
     lat (usec): min=645, max=16855K, avg=2023.97, stdev=88071.07
    clat percentiles (usec):
     |  1.00th=[  884],  5.00th=[ 1192], 10.00th=[ 1304], 20.00th=[ 1448],
     | 30.00th=[ 1512], 40.00th=[ 1560], 50.00th=[ 1592], 60.00th=[ 1624],
     | 70.00th=[ 1656], 80.00th=[ 1704], 90.00th=[ 1752], 95.00th=[ 1816],
     | 99.00th=[ 1928], 99.50th=[ 1992], 99.90th=[ 2160], 99.95th=[ 2288],
     | 99.99th=[39168]
    bw (KB /s): min=   54, max= 3568, per=64.10%, avg=2529.43, stdev=315.56
    lat (usec) : 750=0.07%, 1000=2.53%
    lat (msec) : 2=96.96%, 4=0.43%, 50=0.01%, >=2000=0.01%
  cpu          : usr=0.51%, sys=2.04%, ctx=73550, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=73234/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=292936KB, aggrb=3946KB/s, minb=3946KB/s, maxb=3946KB/s, mint=74236msec, maxt=74236msec

Disk stats (read/write):
  rbd0: ios=186/73232, merge=0/0, ticks=120/109676, in_queue=143448, util=100.00%

How can I improve 4k write performance? Will adding more nodes improve this?

Thanks for any help.

On Sun, Mar 1, 2015 at 3:07 AM, Somnath Roy <somnath....@sandisk.com> wrote:

> Sorry, I saw you have already tried with 'rados bench'. So, some points here.
>
> 1. If you are considering a write workload, I think with a total of 2 copies and a 4K workload you should be able to get ~4K IOPS (considering it hitting the disk, not with memstore).
>
> 2. You have 9 OSDs, and if you created only one pool with only 450 PGs, you should try to increase that and see whether you get any improvement or not.
>
> 3. Also, the rados bench you ran used a very low QD; try increasing that, maybe to 32/64.
>
> 4. If you are running Firefly, other optimizations won't work here. But you can add the following to your ceph.conf file and it should give you some boost.
>
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
>
> 5. Give us the ceph -s output and the iostat output while IO is going on.
>
> Thanks & Regards
> Somnath
>
> *From:* Somnath Roy
> *Sent:* Saturday, February 28, 2015 12:59 PM
> *To:* 'mad Engineer'; Alexandre DERUMIER
> *Cc:* ceph-users
> *Subject:* RE: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
>
> I would say check with a rados tool like ceph_smalliobench/rados bench first to see how much performance these tools are reporting. This will help you to isolate any upstream issues.
>
> Also, check with 'iostat -xk 1' for the resource utilization. Hope you are running with a powerful enough CPU complex, since you are saying the network is not a bottleneck.
>
> Thanks & Regards
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of* mad Engineer
> *Sent:* Saturday, February 28, 2015 12:29 PM
> *To:* Alexandre DERUMIER
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
>
> Reinstalled the ceph packages, and now with the memstore backend [osd objectstore = memstore] it's giving 400 KBps. No idea where the problem is.
>
> On Sun, Mar 1, 2015 at 12:30 AM, mad Engineer <themadengin...@gmail.com> wrote:
>
> Tried changing the scheduler from deadline to noop, also upgraded to Giant and the btrfs filesystem, and downgraded the kernel to 3.16 from 3.16-3 - not much difference.
>
> dd if=/dev/zero of=hi bs=4k count=25000 oflag=direct
> 25000+0 records in
> 25000+0 records out
> 102400000 bytes (102 MB) copied, 94.691 s, 1.1 MB/s
>
> Earlier on a VMware setup I was getting ~850 KBps, and now even on a physical server with SSD drives it's just over 1 MBps. I suspect some serious configuration issue.
>
> Tried iperf between the 3 servers; all are showing 9 Gbps. Tried ICMP with different packet sizes - no fragmentation.
>
> I also noticed that out of the 9 OSDs, 5 are 850 EVO and 4 are 840 EVO. I believe this should not cause this much of a drop in performance.
>
> Thanks for any help
>
> On Sat, Feb 28, 2015 at 6:49 PM, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>
> As an optimisation,
> try to set the I/O scheduler to noop,
> and also enable rbd_cache=true. (It really helps for sequential writes.)
>
> But your results seem quite low: 926 kB/s with 4k is only ~200 IO/s.
>
> Check that you don't have any big network latencies or MTU fragmentation problems.
>
> Maybe also try to bench with fio, with more parallel jobs.
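[A quick note on the two points above. For the MTU/fragmentation check, one way to test it (assuming a 9000-byte jumbo MTU; 8972 = 9000 minus the 28 bytes of IP+ICMP headers, and 10.99.10.118 is just the mon host from the config above) is:

ping -M do -s 8972 -c 3 10.99.10.118

Also, as far as I know rbd_cache only applies to librbd clients (e.g. QEMU); a kernel-mapped /dev/rbd0 does not use it, and oflag=direct bypasses caching anyway.]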
>
> ----- Original message -----
> From: "mad Engineer" <themadengin...@gmail.com>
> To: "Philippe Schwarz" <p...@schwarz-fr.net>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Saturday, 28 February 2015 13:06:59
> Subject: Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
>
> Thanks for the reply Philippe. We were using these disks in our NAS; now it looks like I am in big trouble :-(
>
> On Sat, Feb 28, 2015 at 5:02 PM, Philippe Schwarz <p...@schwarz-fr.net> wrote:
> >
> > On 28/02/2015 12:19, mad Engineer wrote:
> >> Hello All,
> >>
> >> I am trying ceph-firefly 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSDs, all Samsung SSD 850 EVO, on 3 servers with 24 GB RAM and 16 cores @ 2.27 GHz, running Ubuntu 14.04 LTS with the 3.16-3 kernel. All are connected to 10G ports with maximum MTU. There are no extra disks for journaling, and there is no separate network for replication and data transfer. All 3 nodes also host the monitor process. The operating system runs on a SATA disk.
> >>
> >> When doing a sequential benchmark using "dd" on RBD, mounted on the client as ext4, it takes 110 s to write 100 MB of data at an average speed of 926 kB/s.
> >>
> >> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> >> 25000+0 records in
> >> 25000+0 records out
> >> 102400000 bytes (102 MB) copied, 110.582 s, 926 kB/s
> >>
> >> real 1m50.585s
> >> user 0m0.106s
> >> sys 0m2.233s
> >>
> >> While doing this directly on the SSD mount point shows:
> >>
> >> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> >> 25000+0 records in
> >> 25000+0 records out
> >> 102400000 bytes (102 MB) copied, 1.38567 s, 73.9 MB/s
> >>
> >> OSDs are on XFS with these extra arguments:
> >>
> >> rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
> >>
> >> ceph.conf
> >>
> >> [global]
> >> fsid = 7d889081-7826-439c-9fe5-d4e57480d9be
> >> mon_initial_members = ceph1, ceph2, ceph3
> >> mon_host = 10.99.10.118,10.99.10.119,10.99.10.120
> >> auth_cluster_required = cephx
> >> auth_service_required = cephx
> >> auth_client_required = cephx
> >> filestore_xattr_use_omap = true
> >> osd_pool_default_size = 2
> >> osd_pool_default_min_size = 2
> >> osd_pool_default_pg_num = 450
> >> osd_pool_default_pgp_num = 450
> >> max_open_files = 131072
> >>
> >> [osd]
> >> osd_mkfs_type = xfs
> >> osd_op_threads = 8
> >> osd_disk_threads = 4
> >> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> >>
> >> On our traditional storage with full SAS disks, the same "dd" completes in 16 s with an average write speed of 6 MB/s.
> >>
> >> Rados bench:
> >>
> >> rados bench -p rbd 10 write
> >> Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
> >> Object prefix: benchmark_data_ceph1_2977
> >>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>     0       0         0         0         0         0         -         0
> >>     1      16        94        78   311.821       312  0.041228  0.140132
> >>     2      16       192       176   351.866       392  0.106294  0.175055
> >>     3      16       275       259   345.216       332  0.076795  0.166036
> >>     4      16       302       286   285.912       108  0.043888  0.196419
> >>     5      16       395       379    303.11       372  0.126033  0.207488
> >>     6      16       501       485   323.242       424  0.125972  0.194559
> >>     7      16       621       605   345.621       480  0.194155  0.183123
> >>     8      16       730       714   356.903       436  0.086678  0.176099
> >>     9      16       814       798   354.572       336  0.081567  0.174786
> >>    10      16       832       816   326.313        72  0.037431  0.182355
> >>    11      16       833       817   297.013         4  0.533326  0.182784
> >> Total time run:         11.489068
> >> Total writes made:      833
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     290.015
> >> Stddev Bandwidth:       175.723
> >> Max bandwidth (MB/sec): 480
> >> Min bandwidth (MB/sec): 0
> >> Average Latency:        0.220582
> >> Stddev Latency:         0.343697
> >> Max latency:            2.85104
> >> Min latency:            0.035381
> >>
> >> Our ultimate aim is to replace the existing SAN with ceph, but for that it should meet a minimum of 8000 IOPS. Can anyone help me with this? The OSDs are SSD, the CPU has good clock speed, and the backend network is good, but still we are not able to extract the full capability of the SSD disks.
> >>
> >> Thanks,
> >
> > Hi, I'm new to ceph, so don't consider my words as holy truth.
> >
> > It seems that the Samsung 840 (so I assume the 850 too) are crappy for ceph:
> >
> > MTBF:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
> > Bandwidth:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045247.html
> >
> > And according to a confirmed user of Ceph/Proxmox, Samsung SSDs should be avoided if possible in ceph storage.
> >
> > Apart from that, it seems there was a limitation in ceph on the use of the complete bandwidth available in SSDs; but I think with less than 1 MB/s you haven't hit this limit.
> >
> > I remind you that I'm not a ceph guru (far from that, indeed), so feel free to disagree; I'm on the way to improve my knowledge.
> >
> > Best regards.
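Given the 840/850 EVO concerns in the threads Philippe linked, one check worth doing is the usual 4k sync-write test directly against one of the SSDs (not through RBD), to see whether the drives themselves are the limit for journal-style writes. A sketch - /dev/sdX is a placeholder for the journal SSD, and the run is destructive to any data on that device:

# caution: writes directly to the raw SSD
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60

If that number comes out very low, the journal SSD rather than Ceph is likely the bottleneck for small sync writes.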