From the benchmarks I have seen and done myself, I'm not sure why you are using 
an I/O scheduler at all with NVMe.  While there are a few cases where it may 
provide a slight benefit, simply having blk-mq enabled with no scheduler ("none") 
seems to provide the best performance for an all-flash, especially all-NVMe, environment.
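
For example, to check which scheduler is active and switch to plain blk-mq with 
no scheduler (device names below are just examples):

# the active scheduler is shown in brackets
cat /sys/block/nvme0n1/queue/scheduler

# switch every NVMe device to "none" (blk-mq with no scheduler)
for q in /sys/block/nvme*n1/queue; do
    echo none > "$q/scheduler"
done

A udev rule can make this persistent across reboots.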

David Byte
Sr. Technology Strategist
SCE Enterprise Linux
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of German Anders 
<gand...@despegar.com>
Date: Monday, November 27, 2017 at 8:44 AM
To: Maged Mokhtar <mmokh...@petasan.org>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We tried different numbers of threads and are 
getting almost the same difference between the storage types. We're going to try 
different rbd stripe sizes and object size values to see if we get more 
competitive numbers, and will report back with further tests and parameter 
changes. :)
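
Something along these lines is what we have in mind for the striping tests 
(pool/image name and sizes are just placeholders, and the exact flag syntax may 
vary by Ceph release):

# image with a 16 KiB stripe unit spread over 16 objects, default 4M object size
# pool "db" already exists in our cluster; image name and size are placeholders
rbd create db/mysql-test --size 500G \
    --object-size 4M --stripe-unit 16384 --stripe-count 16

(If the volume is mapped with the kernel RBD client, it's worth checking first 
that the kernel supports non-default striping.)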

Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar 
<mmokh...@petasan.org>:

On 2017-11-27 15:02, German Anders wrote:
Hi All,

I have a performance question. We recently installed a brand new Ceph cluster 
with all-NVMe disks, using ceph version 12.2.0 with BlueStore configured. The 
back-end (cluster) network uses an IPoIB bond (active/passive), and the 
front-end uses an active/active bond (20GbE) to communicate with the clients.

The cluster configuration is the following:

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)


Here's the tree:

ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
-7       48.00000 root root
-5       24.00000     rack rack1
-1       12.00000         node cpn01
 0  nvme  1.00000             osd.0      up  1.00000 1.00000
 1  nvme  1.00000             osd.1      up  1.00000 1.00000
 2  nvme  1.00000             osd.2      up  1.00000 1.00000
 3  nvme  1.00000             osd.3      up  1.00000 1.00000
 4  nvme  1.00000             osd.4      up  1.00000 1.00000
 5  nvme  1.00000             osd.5      up  1.00000 1.00000
 6  nvme  1.00000             osd.6      up  1.00000 1.00000
 7  nvme  1.00000             osd.7      up  1.00000 1.00000
 8  nvme  1.00000             osd.8      up  1.00000 1.00000
 9  nvme  1.00000             osd.9      up  1.00000 1.00000
10  nvme  1.00000             osd.10     up  1.00000 1.00000
11  nvme  1.00000             osd.11     up  1.00000 1.00000
-3       12.00000         node cpn03
24  nvme  1.00000             osd.24     up  1.00000 1.00000
25  nvme  1.00000             osd.25     up  1.00000 1.00000
26  nvme  1.00000             osd.26     up  1.00000 1.00000
27  nvme  1.00000             osd.27     up  1.00000 1.00000
28  nvme  1.00000             osd.28     up  1.00000 1.00000
29  nvme  1.00000             osd.29     up  1.00000 1.00000
30  nvme  1.00000             osd.30     up  1.00000 1.00000
31  nvme  1.00000             osd.31     up  1.00000 1.00000
32  nvme  1.00000             osd.32     up  1.00000 1.00000
33  nvme  1.00000             osd.33     up  1.00000 1.00000
34  nvme  1.00000             osd.34     up  1.00000 1.00000
35  nvme  1.00000             osd.35     up  1.00000 1.00000
-6       24.00000     rack rack2
-2       12.00000         node cpn02
12  nvme  1.00000             osd.12     up  1.00000 1.00000
13  nvme  1.00000             osd.13     up  1.00000 1.00000
14  nvme  1.00000             osd.14     up  1.00000 1.00000
15  nvme  1.00000             osd.15     up  1.00000 1.00000
16  nvme  1.00000             osd.16     up  1.00000 1.00000
17  nvme  1.00000             osd.17     up  1.00000 1.00000
18  nvme  1.00000             osd.18     up  1.00000 1.00000
19  nvme  1.00000             osd.19     up  1.00000 1.00000
20  nvme  1.00000             osd.20     up  1.00000 1.00000
21  nvme  1.00000             osd.21     up  1.00000 1.00000
22  nvme  1.00000             osd.22     up  1.00000 1.00000
23  nvme  1.00000             osd.23     up  1.00000 1.00000
-4       12.00000         node cpn04
36  nvme  1.00000             osd.36     up  1.00000 1.00000
37  nvme  1.00000             osd.37     up  1.00000 1.00000
38  nvme  1.00000             osd.38     up  1.00000 1.00000
39  nvme  1.00000             osd.39     up  1.00000 1.00000
40  nvme  1.00000             osd.40     up  1.00000 1.00000
41  nvme  1.00000             osd.41     up  1.00000 1.00000
42  nvme  1.00000             osd.42     up  1.00000 1.00000
43  nvme  1.00000             osd.43     up  1.00000 1.00000
44  nvme  1.00000             osd.44     up  1.00000 1.00000
45  nvme  1.00000             osd.45     up  1.00000 1.00000
46  nvme  1.00000             osd.46     up  1.00000 1.00000
47  nvme  1.00000             osd.47     up  1.00000 1.00000

The disk partition layout on one of the OSD nodes:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1                259:1    0   1.1T  0 disk
├─nvme6n1p2            259:15   0   1.1T  0 part
└─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1                259:0    0   1.1T  0 disk
├─nvme9n1p2            259:8    0   1.1T  0 part
└─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                      8:16   0 139.8G  0 disk
└─sdb1                   8:17   0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme11n1               259:2    0   1.1T  0 disk
├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
└─nvme11n1p2           259:14   0   1.1T  0 part
nvme2n1                259:6    0   1.1T  0 disk
├─nvme2n1p2            259:21   0   1.1T  0 part
└─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
nvme5n1                259:3    0   1.1T  0 disk
├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
└─nvme5n1p2            259:10   0   1.1T  0 part
nvme8n1                259:24   0   1.1T  0 disk
├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
└─nvme8n1p2            259:28   0   1.1T  0 part
nvme10n1               259:11   0   1.1T  0 disk
├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
└─nvme10n1p2           259:23   0   1.1T  0 part
nvme1n1                259:33   0   1.1T  0 disk
├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
└─nvme1n1p2            259:35   0   1.1T  0 part
nvme4n1                259:5    0   1.1T  0 disk
├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
└─nvme4n1p2            259:19   0   1.1T  0 part
nvme7n1                259:25   0   1.1T  0 disk
├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
└─nvme7n1p2            259:29   0   1.1T  0 part
sda                      8:0    0 139.8G  0 disk
└─sda1                   8:1    0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
    └─md0p1            259:30   0 486.3M  0 md    /boot
nvme0n1                259:36   0   1.1T  0 disk
├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
└─nvme0n1p2            259:38   0   1.1T  0 part
nvme3n1                259:4    0   1.1T  0 disk
├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
└─nvme3n1p2            259:17   0   1.1T  0 part


For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different 
values (0, 128 and 2048); rq_affinity is set to 2, and the rotational parameter 
to 0.
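
For reference, we apply these per device through sysfs, roughly like this (a 
udev rule makes it persistent; the values shown are the ones mentioned above):

for q in /sys/block/nvme*n1/queue; do
    echo kyber > "$q/scheduler"       # I/O scheduler
    echo 128   > "$q/read_ahead_kb"   # also tested 0 and 2048
    echo 2     > "$q/rq_affinity"     # complete I/O on the submitting CPU
    echo 0     > "$q/rotational"      # treat as non-rotational
done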
We've also set the CPU governor to performance on all cores, and tuned some 
sysctl parameters as well:

# for Ceph
net.ipv4.ip_forward=0
net.ipv4.conf.default.rp_filter=1
kernel.sysrq=0
kernel.core_uses_pid=1
net.ipv4.tcp_syncookies=0
#net.netfilter.nf_conntrack_max=2621440
#net.netfilter.nf_conntrack_tcp_timeout_established = 1800
# disable netfilter on bridges
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
vm.min_free_kbytes=1000000

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum total amount of shared memory, in pages
kernel.shmall = 4294967296


The ceph.conf file is:

...
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 1600
osd_pool_default_pgp_num = 1600

debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_throttle = 1/1

debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_journal = 0/0
debug_filestore = 0/0
debug_mon = 0/0
debug_paxos = 0/0

osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true

rbd_cache = true
mon_compact_on_trim = false

[osd]
osd_crush_update_on_start = false

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_default_features = 1
admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log_file = /var/log/ceph/


The cluster has two production pools: one for OpenStack (volumes) with a 
replication factor of 3, and another pool for databases (db) with a replication 
factor of 2. The DBA team has performed several tests with a volume mounted on 
the DB server (via RBD). The DB server has the following configuration:

OS: CentOS 6.9 | kernel 4.14.1
DB: MySQL
ProLiant BL685c G7
4x AMD Opteron Processor 6376 (total of 64 cores)
128G RAM
1x OneConnect 10Gb NIC (quad-port), in a bond configuration (active/active) 
with 3 VLANs
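
For completeness, the pools and the test volume described above were created 
roughly along these lines (PG counts follow our pool defaults above; image name 
and size are placeholders):

# OpenStack volumes pool, replication factor 3
ceph osd pool create volumes 1600 1600 replicated
ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2
ceph osd pool application enable volumes rbd

# database pool, replication factor 2
ceph osd pool create db 1600 1600 replicated
ceph osd pool set db size 2
ceph osd pool application enable db rbd

# volume used for the MySQL tests (name/size are placeholders)
rbd create db/mysql-data --size 1T
rbd map db/mysql-data        # run on the DB server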



We also did some tests with sysbench on different storage types:

disk            tps       qps        latency (ms, 95th percentile)
Local SSD       261.28    5225.61     5.18
Ceph NVMe        95.18    1903.53    12.3
Pure Storage    196.49    3929.71     6.32
NetApp FAS      189.83    3796.59     6.67
EMC VMAX        196.14    3922.82     6.32



Is there any specific tuning that I can apply to the ceph cluster, in order to 
improve those numbers? Or are those numbers ok for the type and size of the 
cluster that we have? Any advice would be really appreciated.

Thanks,



German



Hi,

What is the value of --num-threads (the default is 1)? Ceph will do better with 
more threads: 32 or 64.
What are the values of --file-block-size (default 16k) and --file-test-mode? If 
you are using the sequential modes (seqwr/seqrd) you will be hitting the same 
OSD, so maybe try random (rndrd/rndwr), or better, use an rbd stripe size of 
16 KB (the default rbd stripe is 4 MB). rbd striping is ideal for the 
small-block sequential I/O pattern typical of databases.
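
For example, something like this would exercise random 16k I/O with more threads 
(file size and runtime are arbitrary; option names follow the classic sysbench 
fileio syntax):

# prepare test files on the RBD-backed filesystem, run random 16k writes, clean up
sysbench --test=fileio --file-total-size=8G prepare
sysbench --test=fileio --file-total-size=8G \
    --file-test-mode=rndwr --file-block-size=16384 \
    --num-threads=64 --max-time=120 --max-requests=0 run
sysbench --test=fileio --file-total-size=8G cleanup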

/Maged

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
