Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread Florian Haas
On Thu, Apr 7, 2016 at 10:09 PM, German Anders wrote:
> Also, isn't Jewel supposed to get more 'performance', since it uses
> bluestore to store metadata? Or do I need to specify during install
> to use bluestore?

Do the words "enable experimental unrecoverable data corrupting
features" strike terror in your heart? :)

Cheers,
Florian


Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread Mark Nelson

On 04/07/2016 02:43 PM, German Anders wrote:

Hi Cephers,

I've set up a production Ceph cluster with the Jewel release
(10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)), consisting of 3 MON
Servers and 6 OSD Servers:

3x MON Servers:
2x Intel Xeon E5-2630v3 @ 2.40GHz
384GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
1x InfiniBand ConnectX-3 ADPT DP

6x OSD Servers:
2x Intel Xeon E5-2650v2 @ 2.60GHz
128GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
12x 800G Intel DC3510 (osd & journal) on same device
1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other
on the CLUS network)

ceph.conf file is:

[global]
fsid = xxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20

[mon]

[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789

[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789

[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789

[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1

## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

...



They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions
on each disk, since all the OSDs are on SSDs (Intel DC3510 800G).
For example:

sdc  8:32   0 745.2G  0 disk
|-sdc1   8:33   0 740.2G  0 part  /var/lib/ceph/osd/ceph-0
`-sdc2   8:34   0 5G  0 part
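
(For reference, a sketch of the bits that produce that layout, assuming
the OSDs were prepared with ceph-disk and a 5 GB journal size:)

[osd]
# journal size in MB; this is what creates the 5G sdc2 partition above
osd journal size = 5120

# with no separate journal device given, ceph-disk puts the data and
# journal partitions on the same SSD:
#   ceph-disk prepare /dev/sdc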

The purpose of this cluster is to serve as backend storage for
Cinder volumes (RBD) and Glance images in an OpenStack cloud; most of
the clusters running on OpenStack will be non-relational databases like
Cassandra, with many instances each.

All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
Mellanox Technologies MT27500 Family [ConnectX-3] adapters.


So I assumed that performance would be really nice, right? ...but I'm
getting some numbers that I think could be a lot better.

# rados --pool rbd bench 10 write -t 16

Total writes made:  1964
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 755.435

Stddev Bandwidth:   90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS:   188
Stddev IOPS:22
Max IOPS:   221
Min IOPS:   153
Average Latency(s): 0.0836802
Stddev Latency(s):  0.147561
Max latency(s): 1.50925
Min latency(s): 0.0192736
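
(A sketch of the matching read test, for completeness: keep the bench
objects with --no-cleanup, read them back, then clean up:)

# write phase, keeping the objects so they can be read back
rados --pool rbd bench 10 write -t 16 --no-cleanup
# sequential read phase against the objects written above
rados --pool rbd bench 10 seq -t 16
# remove the benchmark objects afterwards
rados --pool rbd cleanup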


Then I connect to another server (this one is running on QDR, so I
would expect something between 2-3Gb/s), map an RBD on the host,
create an ext4 fs and mount it, and finally run an fio test:

# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
--time_based --size=10G --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1

fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [ww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0
iops] [eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
   write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
 slat (msec): min=1, max=480, avg=46.15, stdev=63.68
 clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
  lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
 clat percentiles (msec):
  |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
  | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
  | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
  | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
  | 99.99th=[ 8979]
 bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
stdev=35226.98
 lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
 lat (msec) : 2000=45.04%, >=2000=13.69%
   cpu  : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
   IO depths: 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%,
 >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
 >=64=0.0%
  complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%,
 >=64=0.0%
  issued: total=r=0/w=3821/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s,
maxb=676411KB/s, mint=23138msec, maxt=23138msec

Disk stats (read/write):
   rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
util=99.08%


Does it look acceptable? I mean, for an InfiniBand network I would guess
that throughput needs to be better. How much more can I expect to

Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread German Anders
Also, isn't Jewel supposed to get more 'performance', since it uses
bluestore to store metadata? Or do I need to specify during
install to use bluestore?

Thanks,


German


Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread Robert LeBlanc

Ceph is not able to use native InfiniBand protocols yet, so it is
only leveraging IPoIB at the moment. The most likely reason you are
only getting ~10 Gb performance is that IPoIB heavily leverages
multicast in InfiniBand (if you do some research in this area you will
understand why unicast IP still uses multicast on an InfiniBand
network). To be compatible with all adapters, the subnet manager will
set the multicast speed to 10 Gb/s so that SDR adapters can be used
without dropping packets. If you know that you will never have
adapters under a certain speed, you can configure the subnet manager
to use a higher speed. This does not change IPoIB networks that are
already configured (I had to down all the IPoIB adapters at the same
time and bring them back up to upgrade the speed). Even after that,
performance still wasn't on par with native InfiniBand, but I got at
least a 2x improvement (along with setting the MTU to 64K) on the FDR
adapters. There is still a ton of overhead in IPoIB, so it is not an
ideal transport for getting performance out of InfiniBand; I think of
it as a compatibility feature. Hopefully that gives you enough
information to do further research. If you search the OFED mailing
list, you will see some posts from me 2-3 years ago on this very
topic.
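
As a rough sketch (double-check the rate and MTU codes against your
opensm docs, and only raise the rate if every adapter on the fabric can
keep up), the multicast rate lives in opensm's partition config, and the
~64K MTU is IPoIB connected mode on the hosts:

# /etc/opensm/partitions.conf -- raise the default partition's multicast
# rate from 10 Gb/s (rate=3) to e.g. 40 Gb/s (rate=7), with a 4K IB MTU (mtu=5)
Default=0x7fff, ipoib, mtu=5, rate=7 : ALL=full;

# then restart/HUP opensm, bounce the IPoIB interfaces (see above), and
# switch them to connected mode so the large interface MTU is allowed:
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520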

Good luck and keep holding out for Ceph with XIO.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

