*Update:*

*Hardware:*
Upgraded the RAID controller to an LSI MegaRAID 9341 (12 Gb/s)
3x Samsung 840 EVO - showing 45K IOPS in an fio test with 7 threads and 4k block size in *JBOD* mode
CPU - 16 cores @ 2.27 GHz
RAM - 24 GB
NIC - 10 Gbit with *under 1 ms latency*; iperf shows 9.18 Gbps between host and client

*Software*
Ubuntu 14.04 with stock kernel 3.13
Upgraded from Firefly to Giant [*ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)*]
Changed the file system to btrfs and the I/O scheduler to noop.
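
For anyone reproducing this, the noop scheduler can be set per device along these lines (sdb/sdc/sdd as in the iostat output further down; the setting does not survive a reboot unless reapplied):

# check the current scheduler, then switch to noop (repeat for sdc and sdd)
cat /sys/block/sdb/queue/scheduler
echo noop > /sys/block/sdb/queue/scheduler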

*Ceph Setup*
Replication set to 1, using 2 SSD OSDs and 1 SSD for the journal. All are Samsung 840 EVO in *JBOD* mode on a single server.
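
For completeness, a layout like this can be created roughly as follows with ceph-deploy; this is only a sketch and assumes ceph-deploy is used, with host ceph1, data disks sdc/sdd and journal partitions sdb1/sdb2 (device names as in the iostat section below):

# one OSD per data SSD, journal on a partition of the shared journal SSD
ceph-deploy osd create ceph1:sdc:/dev/sdb1
ceph-deploy osd create ceph1:sdd:/dev/sdb2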

*Configuration:*
[global]
fsid = 979f32fc-6f31-43b0-832f-29fcc4c5a648
mon_initial_members = ceph1
mon_host = 10.99.10.118
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 250
osd_pool_default_pgp_num = 250
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[client]
rbd_cache = true
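
As a sanity check that the debug settings above are in effect at runtime, the OSD admin socket can be queried; the socket path below is the default location and an assumption for this setup:

# dump the running config from osd.0's admin socket and spot-check a few debug values
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'debug_osd|debug_ms|debug_filestore'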

*Client*
Ubuntu 14.04 with 16 cores @ 2.53 GHz and 24 GB RAM

*Results*
rados bench -p rbd -b 4096 -t 16 10 write
 Maintaining 16 concurrent writes of 4096 bytes for up to 10 seconds or 0
objects
 Object prefix: benchmark_data_ubuntucompute_3931
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16      6370      6354   24.8124   24.8203  0.00221   0.00251512
     2      16     11618     11602   22.6536      20.5  0.001025  0.00275493
     3      16     16889     16873   21.9637   20.5898  0.001288  0.00281797
     4      16     17310     17294    16.884   1.64453  0.054066  0.00365805
     5      16     17695     17679    13.808   1.50391  0.001451  0.00444409
     6      16     18127     18111   11.7868    1.6875  0.001463  0.00527521
     7      16     21647     21631   12.0669     13.75  0.001601  0.0051773
     8      16     28056     28040   13.6872   25.0352  0.005268  0.00456353
     9      16     28947     28931    12.553   3.48047  0.06647   0.00494762
    10      16     29346     29330   11.4536   1.55859  0.001341  0.00542312
 Total time run:         10.077931
Total writes made:      29347
Write size:             4096
Bandwidth (MB/sec):     11.375

Stddev Bandwidth:       10.5124
Max bandwidth (MB/sec): 25.0352
Min bandwidth (MB/sec): 0
Average Latency:        0.00548729
Stddev Latency:         0.0169545
Max latency:            0.249019
Min latency:            0.000748
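
Following the earlier suggestion in this thread to raise the queue depth, a deeper run may be worth comparing against the -t 16 numbers above; a sketch (64 concurrent ops and the 60 s runtime are arbitrary choices):

# same 4k write benchmark, but with 64 concurrent ops and a longer run
rados bench -p rbd -b 4096 -t 64 60 write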

*ceph -s*
    cluster 979f32fc-6f31-43b0-832f-29fcc4c5a648
     health HEALTH_OK
     monmap e1: 1 mons at {ceph1=10.99.10.118:6789/0}, election epoch 1,
quorum 0 ceph1
     osdmap e30: 2 osds: 2 up, 2 in
      pgmap v255: 250 pgs, 1 pools, 92136 kB data, 23035 objects
            77068 kB used, 929 GB / 931 GB avail
                 250 active+clean
  client io 11347 kB/s wr, 2836 op/s

*iostat*
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00         0.00       112.00          0        448
sdb            3985.50         0.00     21048.00          0      84192
sdd             414.50         0.00     14083.00          0      56332
sdc             415.00         0.00     10944.00          0      43776

where:

sdb - journal
sdc, sdd - OSDs
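
Extended per-device stats (await, %util), as requested earlier in the thread, can be captured for the same disks with something like:

# extended statistics in kB, refreshed every second, for the journal and OSD SSDs
iostat -xk 1 sdb sdc sdd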

*dd output*
dd if=/dev/zero of=/dev/rbd0 bs=4k count=25000 oflag=direct
25000+0 records in
25000+0 records out
102400000 bytes (102 MB) copied, 23.0863 s, 4.4 MB/s

Here performance has increased from 1 MB/s to 4.4 MB/s, but it is still not what I was expecting.
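
Note that dd with bs=4k and oflag=direct keeps only one 4k write in flight, so it mostly measures per-op latency. A minimal fio sketch against the same device with a deeper queue (the job name, 60 s runtime and iodepth=32 are arbitrary; this writes directly to /dev/rbd0):

# 4k direct sequential writes with 32 IOs in flight via libaio
fio --name=rbd0-4k-qd32 --filename=/dev/rbd0 --rw=write --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting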

*fio with 4k writes and 2 threads*
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
iodepth=1
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
iodepth=1
fio-2.1.3
Starting 2 processes
Jobs: 2 (f=2): [WW] [1.4% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
01h:27m:25s]
journal-test: (groupid=0, jobs=2): err= 0: pid=4077: Sat Mar  7 02:50:45
2015
  write: io=292936KB, bw=3946.1KB/s, iops=986, runt= 74236msec
    clat (usec): min=645, max=16855K, avg=2023.56, stdev=88071.07
     lat (usec): min=645, max=16855K, avg=2023.97, stdev=88071.07
    clat percentiles (usec):
     |  1.00th=[  884],  5.00th=[ 1192], 10.00th=[ 1304], 20.00th=[ 1448],
     | 30.00th=[ 1512], 40.00th=[ 1560], 50.00th=[ 1592], 60.00th=[ 1624],
     | 70.00th=[ 1656], 80.00th=[ 1704], 90.00th=[ 1752], 95.00th=[ 1816],
     | 99.00th=[ 1928], 99.50th=[ 1992], 99.90th=[ 2160], 99.95th=[ 2288],
     | 99.99th=[39168]
    bw (KB  /s): min=   54, max= 3568, per=64.10%, avg=2529.43, stdev=315.56
    lat (usec) : 750=0.07%, 1000=2.53%
    lat (msec) : 2=96.96%, 4=0.43%, 50=0.01%, >=2000=0.01%
  cpu          : usr=0.51%, sys=2.04%, ctx=73550, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     issued    : total=r=0/w=73234/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=292936KB, aggrb=3946KB/s, minb=3946KB/s, maxb=3946KB/s,
mint=74236msec, maxt=74236msec

Disk stats (read/write):
  rbd0: ios=186/73232, merge=0/0, ticks=120/109676, in_queue=143448,
util=100.00%

How can I improve 4k write performance? Will adding more nodes improve this?

Thanks for any help

On Sun, Mar 1, 2015 at 3:07 AM, Somnath Roy <somnath....@sandisk.com> wrote:

>  Sorry, I saw you have already tried with ‘rados bench’. So, some points
> here.
>
>
>
> 1. If you are considering write workload, I think with total of 2 copies
> and with 4K workload , you should be able to get ~4K iops (considering it
> hitting the disk, not with memstore).
>
>
>
> 2. You are having 9 OSDs and if you created only one pool with only 450
> PGS, you should try to increase that and see if getting any improvement or
> not.
>
>
>
> 3. Also, the rados bench script you ran with very low QD, try increasing
> that, may be 32/64.
>
>
>
> 4. If you are running firefly, other optimization won’t work here..But,
> you can add the following in your ceph.conf file and it should give you
> some boost.
>
>
>
> debug_lockdep = 0/0
>
> debug_context = 0/0
>
> debug_crush = 0/0
>
> debug_buffer = 0/0
>
> debug_timer = 0/0
>
> debug_filer = 0/0
>
> debug_objecter = 0/0
>
> debug_rados = 0/0
>
> debug_rbd = 0/0
>
> debug_journaler = 0/0
>
> debug_objectcatcher = 0/0
>
> debug_client = 0/0
>
> debug_osd = 0/0
>
> debug_optracker = 0/0
>
> debug_objclass = 0/0
>
> debug_filestore = 0/0
>
> debug_journal = 0/0
>
> debug_ms = 0/0
>
> debug_monc = 0/0
>
> debug_tp = 0/0
>
> debug_auth = 0/0
>
> debug_finisher = 0/0
>
> debug_heartbeatmap = 0/0
>
> debug_perfcounter = 0/0
>
> debug_asok = 0/0
>
> debug_throttle = 0/0
>
> debug_mon = 0/0
>
> debug_paxos = 0/0
>
> debug_rgw = 0/0
>
>
>
> 5. Give us the ceph –s output and the iostat output while io is going on.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
>
>
>
>
> *From:* Somnath Roy
> *Sent:* Saturday, February 28, 2015 12:59 PM
> *To:* 'mad Engineer'; Alexandre DERUMIER
> *Cc:* ceph-users
> *Subject:* RE: [ceph-users] Extreme slowness in SSD cluster with 3 nodes
> and 9 OSD with 3.16-3 kernel
>
>
>
> I would say check with rados tool like ceph_smalliobench/rados bench first
> to see how much performance these tools are reporting. This will help you
> to isolate any upstream issues.
>
> Also, check with 'iostat -xk 1' for the resource utilization. Hope you are
> running with powerful enough cpu complex since you are saying network is
> not a bottleneck.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *mad Engineer
> *Sent:* Saturday, February 28, 2015 12:29 PM
> *To:* Alexandre DERUMIER
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes
> and 9 OSD with 3.16-3 kernel
>
>
>
> reinstalled ceph packages and now with memstore backend [osd objectstore
> =memstore] its giving 400Kbps .No idea where the problem is.
>
>
>
> On Sun, Mar 1, 2015 at 12:30 AM, mad Engineer <themadengin...@gmail.com>
> wrote:
>
> tried changing scheduler from deadline to noop also upgraded to Gaint and
> btrfs filesystem,downgraded kernel to 3.16 from 3.16-3 not much difference
>
>
>
> dd if=/dev/zero of=hi bs=4k count=25000 oflag=direct
>
> 25000+0 records in
>
> 25000+0 records out
>
> 102400000 bytes (102 MB) copied, 94.691 s, 1.1 MB/s
>
>
>
> Earlier on a vmware setup i was getting ~850 KBps and now even on physical
> server with SSD drives its just over 1MBps.I doubt some serious
> configuration issues.
>
>
>
> Tried iperf between 3 servers all are showing 9 Gbps,tried icmp with
> different packet size ,no fragmentation.
>
>
>
> i also noticed that out of 9 osd 5 are 850 EVO and 4 are 840 EVO.I believe
> this will not cause this much drop in performance.
>
>
>
> Thanks for any help
>
>
>
>
>
> On Sat, Feb 28, 2015 at 6:49 PM, Alexandre DERUMIER <aderum...@odiso.com>
> wrote:
>
> As optimisation,
>
> try to set ioscheduler to noop,
>
> and also enable rbd_cache=true. (It's really helping for for sequential
> writes)
>
> but your results seem quite low, 926kb/s with 4k, it's only 200io/s.
>
> check if you don't have any big network latencies, or mtu fragementation
> problem.
>
> Maybe also try to bench with fio, with more parallel jobs.
>
>
>
>
> ----- Original Message -----
> From: "mad Engineer" <themadengin...@gmail.com>
> To: "Philippe Schwarz" <p...@schwarz-fr.net>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Saturday, 28 February 2015 13:06:59
> Subject: Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9
> OSD with 3.16-3 kernel
>
> Thanks for the reply Philippe,we were using these disks in our NAS,now
> it looks like i am in big trouble :-(
>
> On Sat, Feb 28, 2015 at 5:02 PM, Philippe Schwarz <p...@schwarz-fr.net>
> wrote:
> >
> > On 28/02/2015 12:19, mad Engineer wrote:
> >> Hello All,
> >>
> >> I am trying ceph-firefly 0.80.8
> >> (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSD ,all Samsung
> >> SSD 850 EVO on 3 servers with 24 G RAM,16 cores @2.27 Ghz Ubuntu
> >> 14.04 LTS with 3.16-3 kernel.All are connected to 10G ports with
> >> maximum MTU.There are no extra disks for journaling and also there
> >> are no separate network for replication and data transfer.All 3
> >> nodes are also hosting monitoring process.Operating system runs on
> >> SATA disk.
> >>
> >> When doing a sequential benchmark using "dd" on RBD, mounted on
> >> client as ext4 its taking 110s to write 100Mb data at an average
> >> speed of 926Kbps.
> >>
> >> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> >> 25000+0 records in 25000+0 records out 102400000 bytes (102 MB)
> >> copied, 110.582 s, 926 kB/s
> >>
> >> real 1m50.585s user 0m0.106s sys 0m2.233s
> >>
> >> While doing this directly on ssd mount point shows:
> >>
> >> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> >> 25000+0 records in 25000+0 records out 102400000 bytes (102 MB)
> >> copied, 1.38567 s, 73.9 MB/s
> >>
> >> OSDs are in XFS with these extra arguments :
> >>
> >> rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
> >>
> >> ceph.conf
> >>
> >> [global]
> >> fsid = 7d889081-7826-439c-9fe5-d4e57480d9be
> >> mon_initial_members = ceph1, ceph2, ceph3
> >> mon_host = 10.99.10.118,10.99.10.119,10.99.10.120
> >> auth_cluster_required = cephx
> >> auth_service_required = cephx
> >> auth_client_required = cephx
> >> filestore_xattr_use_omap = true
> >> osd_pool_default_size = 2
> >> osd_pool_default_min_size = 2
> >> osd_pool_default_pg_num = 450
> >> osd_pool_default_pgp_num = 450
> >> max_open_files = 131072
> >>
> >> [osd]
> >> osd_mkfs_type = xfs
> >> osd_op_threads = 8
> >> osd_disk_threads = 4
> >> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> >>
> >>
> >> on our traditional storage with Full SAS disk, same "dd" completes
> >> in 16s with an average write speed of 6Mbps.
> >>
> >> Rados bench:
> >>
> >> rados bench -p rbd 10 write
> >>  Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
> >>  Object prefix: benchmark_data_ceph1_2977
> >>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>      0       0         0         0         0         0         -         0
> >>      1      16        94        78   311.821       312  0.041228  0.140132
> >>      2      16       192       176   351.866       392  0.106294  0.175055
> >>      3      16       275       259   345.216       332  0.076795  0.166036
> >>      4      16       302       286   285.912       108  0.043888  0.196419
> >>      5      16       395       379    303.11       372  0.126033  0.207488
> >>      6      16       501       485   323.242       424  0.125972  0.194559
> >>      7      16       621       605   345.621       480  0.194155  0.183123
> >>      8      16       730       714   356.903       436  0.086678  0.176099
> >>      9      16       814       798   354.572       336  0.081567  0.174786
> >>     10      16       832       816   326.313        72  0.037431  0.182355
> >>     11      16       833       817   297.013         4  0.533326  0.182784
> >>  Total time run:         11.489068
> >> Total writes made:      833
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     290.015
> >>
> >> Stddev Bandwidth:       175.723
> >> Max bandwidth (MB/sec): 480
> >> Min bandwidth (MB/sec): 0
> >> Average Latency:        0.220582
> >> Stddev Latency:         0.343697
> >> Max latency:            2.85104
> >> Min latency:            0.035381
> >>
> >> Our ultimate aim is to replace existing SAN with ceph,but for that
> >> it should meet minimum 8000 iops.Can any one help me with this,OSD
> >> are SSD,CPU has good clock speed,backend network is good but still
> >> we are not able to extract full capability of SSD disks.
> >>
> >>
> >>
> >> Thanks,
> >
> > Hi, i'm new to ceph so, don't consider my words as holy truth.
> >
> > It seems that Samsung 840 (so i assume 850) are crappy for ceph :
> >
> > MTBF :
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
> > Bandwidth
> > :
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045247.html
> >
> > And according to a confirmed user of Ceph/ProxmoX, Samsung SSDs should
> > be avoided if possible in ceph storage.
> >
> > Apart from that, it seems there was an limitation in ceph for the use
> > of the complete bandwidth available in SSDs; but i think with less
> > than 1Mb/s you haven't hit this limit.
> >
> > I remind you that i'm not a ceph-guru (far from that, indeed), so feel
> > free to disagree; i'm on the way to improve my knowledge.
> >
> > Best regards.
> >
> >
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
