Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-12-04 Thread Gerhard W. Recher
I got an error on this:
sysbench \
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua \
--mysql-host=127.0.0.1 --mysql-port=33033 --mysql-user=sysbench \
--mysql-password=password --mysql-db=sysbench \
--mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 \
--oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20 \
--threads=10 --rand-type=uniform --rand-init=on cleanup
Unknown option: --oltp_tables_count.
Usage:
  sysbench [general-options]... --test= [test-options]... command

General options:
  --num-threads=N    number of threads to use [1]
  --max-requests=N   limit for total number of requests [1]
  --max-time=N   limit for total execution time in seconds [0]
  --forced-shutdown=STRING   amount of time to wait after --max-time
before forcing shutdown [off]
  --thread-stack-size=SIZE   size of stack per thread [32K]
  --init-rng=[on|off]    initialize random number generator [off]
  --test=STRING  test to run
  --debug=[on|off]   print more debugging info [off]
  --validate=[on|off]    perform validation checks where possible [off]
  --help=[on|off]    print help and exit
  --version=[on|off] print version and exit

Compiled-in tests:
  fileio - File I/O test
  cpu - CPU performance test
  memory - Memory functions speed test
  threads - Threads subsystem performance test
  mutex - Mutex performance test
  oltp - OLTP test

Commands: prepare run cleanup help version

See 'sysbench --test= help' for a list of options for each test.
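
The mismatch above comes from the sysbench version: the Lua-based
parallel_prepare.lua invocation assumes sysbench 1.0.x, while the help output
shown is from sysbench 0.4.12, which only knows its built-in tests (and, as far
as I recall, a single test table). A rough, unverified sketch of an equivalent
prepare step with the legacy syntax, reusing the options from the working
command further down:

sysbench --version

sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 \
--mysql-host=127.0.0.1 --mysql-port=33033 \
--mysql-db=sysbench --mysql-user=sysbench --mysql-password=password \
prepare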



but this is what I have:
echo "Performing test SQ-${thread}T-${run}"
sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 \
--mysql-db=sysbench --mysql-user=sysbench --mysql-password=password \
--max-time=60 --max-requests=0 --num-threads=${thread} run > \
/root/SQ-${thread}T-${run}


[client]
port    = 3306
socket  = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket  = /var/run/mysqld/mysqld.sock
nice    = 0
[mysqld]
user    = mysql
pid-file    = /var/run/mysqld/mysqld.pid
socket  = /var/run/mysqld/mysqld.sock
port    = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir  = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address    = 127.0.0.1
key_buffer  = 16M
max_allowed_packet  = 16M
thread_stack    = 192K
thread_cache_size   = 8
myisam-recover = BACKUP
query_cache_limit   = 1M
query_cache_size    = 16M
log_error = /var/log/mysql/error.log
expire_logs_days    = 10
max_binlog_size = 100M
[mysqldump]
quick
quote-names
max_allowed_packet  = 16M
[mysql]
[isamchk]
key_buffer  = 16M
!includedir /etc/mysql/conf.d/



sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned
in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
Done.

OLTP test statistics:
    queries performed:
    read:    84126
    write:   30045
    other:   12018
    total:   126189
    transactions:    6009   (100.14 per sec.)
    deadlocks:   0  (0.00 per sec.)
    read/write requests: 114171 (1902.71 per sec.)
    other operations:    12018  (200.28 per sec.)

Test execution summary:
    total time:  60.0045s
    total number of events:  6009
    total time taken by event execution: 59.9812
    per-request statistics:
 min:  4.47ms
 avg:  9.98ms
 max: 91.38ms
 approx.  95 percentile:  19.44ms

Threads fairness:
    events (avg/stddev):   6009./0.00
    execution time (avg/stddev):   59.9812/0.00
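
As a quick sanity check (my own arithmetic, not part of the original output),
the throughput and latency figures follow directly from the totals above:

awk 'BEGIN {
  printf "tps = %.2f/s\n", 6009 / 60.0045;          # -> 100.14 per sec.
  printf "avg = %.2f ms\n", 59.9812 / 6009 * 1000   # -> 9.98 ms per request
}'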

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned
in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
(last message repeated 3 times)
Done.

OLTP test statistics:
    queries performed:
    read:    372036
    write:   132870
    other:   53148
    total:   558054
    transactions:    26574  (442.84 per sec.)
    deadlocks:   0  (0.00 per sec.)
    read/write 

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-12-04 Thread German Anders
Could anyone run the tests and share some results?

Thanks in advance,

Best,


*German*

2017-11-30 14:25 GMT-03:00 German Anders :

> That's correct, IPoIB for the backend (already configured the irq
> affinity),  and 10GbE on the frontend. I would love to try rdma but like
> you said is not stable for production, so I think I'll have to wait for
> that. Yeah, the thing is that it's not my decision to go for 50GbE or
> 100GbE... :( so.. 10GbE for the front-end will be...
>
> Would be really helpful if someone could run the following sysbench test
> on a mysql db so I could make some compares:
>
> *my.cnf *configuration file:
>
> [mysqld_safe]
> nice= 0
> pid-file= /home/test_db/mysql/mysql.pid
>
> [client]
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
>
> [mysqld]
> user= test_db
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
> pid-file= /home/test_db/mysql/mysql.pid
> log-error   = /home/test_db/mysql/mysql.err
> datadir = /home/test_db/mysql/data
> tmpdir  = /tmp
> server-id   = 1
>
> # ** Binlogging **
> #log-bin= /home/test_db/mysql/binlog/
> mysql-bin
> #log_bin_index  = /home/test_db/mysql/binlog/
> mysql-bin.index
> expire_logs_days= 1
> max_binlog_size = 512MB
>
> thread_handling = pool-of-threads
> thread_pool_max_threads = 300
>
>
> # ** Slow query log **
> slow_query_log  = 1
> slow_query_log_file = /home/test_db/mysql/mysql-
> slow.log
> long_query_time = 10
> log_output  = FILE
> log_slow_slave_statements   = 1
> log_slow_verbosity  = query_plan,innodb,explain
>
> # ** INNODB Specific options **
> transaction_isolation   = READ-COMMITTED
> innodb_buffer_pool_size = 12G
> innodb_data_file_path   = ibdata1:256M:autoextend
> innodb_thread_concurrency   = 16
> innodb_log_file_size= 256M
> innodb_log_files_in_group   = 3
> innodb_file_per_table
> innodb_log_buffer_size  = 16M
> innodb_stats_on_metadata= 0
> innodb_lock_wait_timeout= 30
> # innodb_flush_method   = O_DSYNC
> innodb_flush_method = O_DIRECT
> max_connections = 1
> max_connect_errors  = 99
> max_allowed_packet  = 128M
> skip-host-cache
> skip-name-resolve
> explicit_defaults_for_timestamp = 1
> performance_schema  = OFF
> log_warnings= 2
> event_scheduler = ON
>
> # ** Specific Galera Cluster Settings **
> binlog_format   = ROW
> default-storage-engine  = innodb
> query_cache_size= 0
> query_cache_type= 0
>
>
> Volume is just an RBD (on a RF=3 pool) with the default 22 bit order
> mounted on */home/test_db/mysql/data*
>
> commands for the test:
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null
>
> sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=20
> --rand-type=uniform --rand-init=on --time=120 run >
> result_sysbench_perf_test.out 2>/dev/null
>
> Im looking for tps, qps and 95th perc, could anyone with a all-nvme
> cluster run the test and share the results? I would really appreciate the
> help :)
>

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-30 Thread German Anders
That's correct, IPoIB for the backend (the IRQ affinity is already
configured), and 10GbE on the frontend. I would love to try RDMA, but like
you said it is not stable for production, so I think I'll have to wait for
that. Yeah, the thing is that it's not my decision to go for 50GbE or
100GbE... :( so 10GbE for the front-end it will be...

It would be really helpful if someone could run the following sysbench test on
a MySQL DB so I could make some comparisons:

*my.cnf *configuration file:

[mysqld_safe]
nice= 0
pid-file= /home/test_db/mysql/mysql.pid

[client]
port= 33033
socket  = /home/test_db/mysql/mysql.sock

[mysqld]
user= test_db
port= 33033
socket  = /home/test_db/mysql/mysql.sock
pid-file= /home/test_db/mysql/mysql.pid
log-error   = /home/test_db/mysql/mysql.err
datadir = /home/test_db/mysql/data
tmpdir  = /tmp
server-id   = 1

# ** Binlogging **
#log-bin= /home/test_db
/mysql/binlog/mysql-bin
#log_bin_index  = /home/test_db
/mysql/binlog/mysql-bin.index
expire_logs_days= 1
max_binlog_size = 512MB

thread_handling = pool-of-threads
thread_pool_max_threads = 300


# ** Slow query log **
slow_query_log  = 1
slow_query_log_file = /home/test_db/mysql/mysql-slow.log
long_query_time = 10
log_output  = FILE
log_slow_slave_statements   = 1
log_slow_verbosity  = query_plan,innodb,explain

# ** INNODB Specific options **
transaction_isolation   = READ-COMMITTED
innodb_buffer_pool_size = 12G
innodb_data_file_path   = ibdata1:256M:autoextend
innodb_thread_concurrency   = 16
innodb_log_file_size= 256M
innodb_log_files_in_group   = 3
innodb_file_per_table
innodb_log_buffer_size  = 16M
innodb_stats_on_metadata= 0
innodb_lock_wait_timeout= 30
# innodb_flush_method   = O_DSYNC
innodb_flush_method = O_DIRECT
max_connections = 1
max_connect_errors  = 99
max_allowed_packet  = 128M
skip-host-cache
skip-name-resolve
explicit_defaults_for_timestamp = 1
performance_schema  = OFF
log_warnings= 2
event_scheduler = ON

# ** Specific Galera Cluster Settings **
binlog_format   = ROW
default-storage-engine  = innodb
query_cache_size= 0
query_cache_type= 0


Volume is just an RBD (on a RF=3 pool) with the default 22 bit order
mounted on */home/test_db/mysql/data*
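
For reference, the default 22-bit order means 2^22-byte (4 MiB) RADOS objects.
A hedged sketch of creating test volumes with a different object size or
striping layout for comparison (pool/image names and sizes are placeholders,
not from this thread):

rbd create --size 200G --object-size 1M rbd_bench/mysql-vol-1m
rbd create --size 200G --object-size 4M --stripe-unit 64K --stripe-count 16 \
    rbd_bench/mysql-vol-striped
rbd info rbd_bench/mysql-vol-striped    # shows order, stripe unit and count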

commands for the test:

sysbench \
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua \
--mysql-host= --mysql-port=33033 --mysql-user=sysbench \
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb \
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex \
--oltp-read-only=off --oltp-table-size=20 --threads=10 \
--rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null

sysbench \
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua \
--mysql-host= --mysql-port=33033 --mysql-user=sysbench \
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb \
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex \
--oltp-read-only=off --oltp-table-size=20 --threads=10 \
--rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua \
--mysql-host= --mysql-port=33033 --mysql-user=sysbench \
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb \
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex \
--oltp-read-only=off --oltp-table-size=20 --threads=20 \
--rand-type=uniform --rand-init=on --time=120 run > \
result_sysbench_perf_test.out 2>/dev/null

I'm looking for tps, qps and the 95th percentile; could anyone with an
all-NVMe cluster run the test and share the results? I would really
appreciate the help :)

Thanks in advance,

Best,


*German *

2017-11-29 19:14 GMT-03:00 Zoltan Arnold Nagy :

> On 2017-11-27 14:02, German Anders wrote:
>
>> 4x 2U servers:
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
> so I assume you are using IPoIB as the cluster network for the

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-29 Thread Zoltan Arnold Nagy

On 2017-11-27 14:02, German Anders wrote:

4x 2U servers:
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
so I assume you are using IPoIB as the cluster network for the 
replication...



1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
(active/active) with 3 vlans

... and the 10GbE network for the front-end network?

At 4k writes your network latency will be very high (see the flame 
graphs at the Intel NVMe presentation from the Boston OpenStack Summit - 
not sure if there is a newer deck that somebody could link ;)) and the 
time will be spent in the kernel. You could give RDMAMessenger a try but 
it's not stable at the current LTS release.
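
For completeness, enabling the RDMA messenger around Luminous was a ceph.conf
change along these lines; this is only a sketch (option names and stability
should be checked against your release), and the daemons also need an
unlimited memlock limit:

[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0   # adjust to the ConnectX-3 device name

# plus a systemd drop-in for the daemons:
# [Service]
# LimitMEMLOCK=infinity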


If I were you I'd be looking at 100GbE - we've recently pulled in a 
bunch of 100GbE links and it's been wonderful to see 100+GB/s going over 
the network for just storage.


Some people suggested mounting multiple RBD volumes - but unless I'm 
mistaken, and unless you're using very recent qemu/libvirt combinations with 
the proper libvirt disk settings, all IO will still be single-threaded 
towards librbd and thus won't give any speedup.
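
The "proper libvirt disk settings" here are presumably something like
virtio-scsi with multiple queues and a dedicated iothread, so guest I/O is not
funnelled through a single thread. A minimal sketch (attribute support depends
on the qemu/libvirt versions in use; pool, image and host names are
placeholders):

<domain>
  <iothreads>2</iothreads>
  <devices>
    <controller type='scsi' model='virtio-scsi'>
      <driver queues='4' iothread='1'/>
    </controller>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='rbd' name='rbd_bench/mysql-vol'>
        <host name='mon-host' port='6789'/>
      </source>
      <target dev='sda' bus='scsi'/>
    </disk>
  </devices>
</domain>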


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-29 Thread Maged Mokhtar
Hi German, 

I would personally prefer to use rados bench/fio, which are more commonly
used, to benchmark the cluster first, and only later do MySQL-specific tests
using sysbench. Another thing is to run the client test simultaneously on more
than one machine and aggregate/add the performance numbers of each; the
limitation can be caused by client-side resources, which could be stressed
differently by the different storage backends you tried. 
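
A hedged sketch of that baseline (pool and image names are placeholders):
rados bench exercises the cluster directly, and fio's rbd engine exercises
librbd from one or more client machines whose numbers can then be added up.

rados bench -p bench_pool 60 write -b 4096 -t 32 --no-cleanup
rados bench -p bench_pool 60 rand -t 32
rados -p bench_pool cleanup

fio --name=rbd-randwrite --ioengine=rbd --pool=bench_pool --rbdname=bench-vol \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting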

Maged 

On 2017-11-28 21:20, German Anders wrote:

> Don't know if there's any statistics available really, but Im running some 
> sysbench tests with mysql before the changes and the idea is to run those 
> tests again after the 'tuning' and see if numbers get better in any way, also 
> I'm gathering numbers from some collectd and statsd collectors running on the 
> osd nodes so, I hope to get some info about that :) 
> 
> GERMAN 
> 2017-11-28 16:12 GMT-03:00 Marc Roos :
> 
>> I was wondering if there are any statistics available that show the
>> performance increase of doing such things?
>> 
>> -Original Message-
>> From: German Anders [mailto:gand...@despegar.com]
>> Sent: Tuesday, 28 November 2017 19:34
>> To: Luis Periquito
>> Cc: ceph-users
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>> 
>> Thanks a lot Luis, I agree with you regarding the CPUs, but
>> unfortunately those were the best CPU model that we can afford :S
>> 
>> For the NUMA part, I manage to pinned the OSDs by changing the
>> /usr/lib/systemd/system/ceph-osd@.service file and adding the
>> CPUAffinity list to it. But, this is for ALL the OSDs to specific nodes
>> or specific CPU list. But I can't find the way to specify a list for
>> only a specific number of OSDs.
>> 
>> Also, I notice that the NVMe disks are all on the same node (since I'm
>> using half of the shelf - so the other half will be pinned to the other
>> node), so the lanes of the NVMe disks are all on the same CPU (in this
>> case 0). Also, I find that the IB adapter that is mapped to the OSD
>> network (osd replication) is pinned to CPU 1, so this will cross the QPI
>> path.
>> 
>> And for the memory, from the other email, we are already using the
>> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
>> 134217728
>> 
>> In this case I can pinned all the actual OSDs to CPU 0, but in the near
>> future when I add more nvme disks to the OSD nodes, I'll definitely need
>> to pinned the other half OSDs to CPU 1, someone already did this?
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-28 6:36 GMT-03:00 Luis Periquito :
>> 
>> There are a few things I don't like about your machines... If you
>> want latency/IOPS (as you seemingly do) you really want the highest
>> frequency CPUs, even over number of cores. These are not too bad, but
>> not great either.
>> 
>> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA
>> nodes? Ideally OSD is pinned to same NUMA node the NVMe device is
>> connected to. Each NVMe device will be running on PCIe lanes generated
>> by one of the CPUs...
>> 
>> What versions of TCMalloc (or jemalloc) are you running? Have you
>> tuned them to have a bigger cache?
>> 
>> These are from what I've learned using filestore - I've yet to run
>> full tests on bluestore - but they should still apply...
>> 
>> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
>>  wrote:
>> 
>> Hi Nick,
>> 
>> yeah, we are using the same nvme disk with an additional
>> partition to use as journal/wal. We double check the c-state and it was
>> not configure to use c1, so we change that on all the osd nodes and mon
>> nodes and we're going to make some new tests, and see how it goes. I'll
>> get back as soon as get got those tests running.
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>> 
>> From: ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German Anders
>> Sent: 27 November 2017 14:44
>> To: Maged Mokhtar 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance
>> tuning
>> 
>> Hi Maged,
>> 
>> Thanks a lot for the response. We try with different
>> number of threads and we're getting almost the same kind of difference
>> between the storage types. Going to try with different rbd stripe size,
>> obje

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread German Anders
I don't know if there are any statistics available really, but I'm running
some sysbench tests with MySQL before the changes, and the idea is to run
those tests again after the 'tuning' and see if the numbers get better in any
way. I'm also gathering numbers from some collectd and statsd collectors
running on the OSD nodes, so I hope to get some info about that :)


*German*

2017-11-28 16:12 GMT-03:00 Marc Roos :

>
> I was wondering if there are any statistics available that show the
> performance increase of doing such things?
>
>
>
>
>
>
> -Original Message-
> From: German Anders [mailto:gand...@despegar.com]
> Sent: Tuesday, 28 November 2017 19:34
> To: Luis Periquito
> Cc: ceph-users
> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>
> Thanks a lot Luis, I agree with you regarding the CPUs, but
> unfortunately those were the best CPU model that we can afford :S
>
> For the NUMA part, I manage to pinned the OSDs by changing the
> /usr/lib/systemd/system/ceph-osd@.service file and adding the
> CPUAffinity list to it. But, this is for ALL the OSDs to specific nodes
> or specific CPU list. But I can't find the way to specify a list for
> only a specific number of OSDs.
>
> Also, I notice that the NVMe disks are all on the same node (since I'm
> using half of the shelf - so the other half will be pinned to the other
> node), so the lanes of the NVMe disks are all on the same CPU (in this
> case 0). Also, I find that the IB adapter that is mapped to the OSD
> network (osd replication) is pinned to CPU 1, so this will cross the QPI
> path.
>
> And for the memory, from the other email, we are already using the
> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
> 134217728
>
> In this case I can pinned all the actual OSDs to CPU 0, but in the near
> future when I add more nvme disks to the OSD nodes, I'll definitely need
> to pinned the other half OSDs to CPU 1, someone already did this?
>
> Thanks a lot,
>
> Best,
>
>
>
> German
>
> 2017-11-28 6:36 GMT-03:00 Luis Periquito :
>
>
> There are a few things I don't like about your machines... If you
> want latency/IOPS (as you seemingly do) you really want the highest
> frequency CPUs, even over number of cores. These are not too bad, but
> not great either.
>
> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA
> nodes? Ideally OSD is pinned to same NUMA node the NVMe device is
> connected to. Each NVMe device will be running on PCIe lanes generated
> by one of the CPUs...
>
> What versions of TCMalloc (or jemalloc) are you running? Have you
> tuned them to have a bigger cache?
>
> These are from what I've learned using filestore - I've yet to run
> full tests on bluestore - but they should still apply...
>
> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
>  wrote:
>
>
> Hi Nick,
>
> yeah, we are using the same nvme disk with an additional
> partition to use as journal/wal. We double check the c-state and it was
> not configure to use c1, so we change that on all the osd nodes and mon
> nodes and we're going to make some new tests, and see how it goes. I'll
> get back as soon as get got those tests running.
>
> Thanks a lot,
>
> Best,
>
>
>
>
>
>
> German
>
> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>
>
>         From: ceph-users
> [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German Anders
> Sent: 27 November 2017 14:44
> To: Maged Mokhtar 
> Cc: ceph-users 
> Subject: Re: [ceph-users] ceph all-nvme mysql
> performance
> tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with
> different
> number of threads and we're getting almost the same kind of difference
> between the storage types. Going to try with different rbd stripe size,
> object size values and see if we get more competitive numbers. Will get
> back with more tests and param changes to see if we get better :)
>
>
>
>
>
> Just to echo a couple of comments. Ceph will always
> struggle to match the performance of a traditional array for mainly 2
> reasons.
>
>
>
> 1.  You are replacing some sort of dual ported
> SAS or
> internally RDMA connected device with a network for Ceph replicat

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread Marc Roos
 
I was wondering if there are any statistics available that show the 
performance increase of doing such things?






-Original Message-
From: German Anders [mailto:gand...@despegar.com] 
Sent: Tuesday, 28 November 2017 19:34
To: Luis Periquito
Cc: ceph-users
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Thanks a lot Luis, I agree with you regarding the CPUs, but 
unfortunately those were the best CPU model that we can afford :S

For the NUMA part, I manage to pinned the OSDs by changing the 
/usr/lib/systemd/system/ceph-osd@.service file and adding the 
CPUAffinity list to it. But, this is for ALL the OSDs to specific nodes 
or specific CPU list. But I can't find the way to specify a list for 
only a specific number of OSDs. 

Also, I notice that the NVMe disks are all on the same node (since I'm 
using half of the shelf - so the other half will be pinned to the other 
node), so the lanes of the NVMe disks are all on the same CPU (in this 
case 0). Also, I find that the IB adapter that is mapped to the OSD 
network (osd replication) is pinned to CPU 1, so this will cross the QPI 
path.

And for the memory, from the other email, we are already using the 
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 
134217728

In this case I can pinned all the actual OSDs to CPU 0, but in the near 
future when I add more nvme disks to the OSD nodes, I'll definitely need 
to pinned the other half OSDs to CPU 1, someone already did this?

Thanks a lot,

Best,



German

2017-11-28 6:36 GMT-03:00 Luis Periquito :


There are a few things I don't like about your machines... If you 
want latency/IOPS (as you seemingly do) you really want the highest 
frequency CPUs, even over number of cores. These are not too bad, but 
not great either.

Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA 
nodes? Ideally OSD is pinned to same NUMA node the NVMe device is 
connected to. Each NVMe device will be running on PCIe lanes generated 
by one of the CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you 
tuned them to have a bigger cache?

These are from what I've learned using filestore - I've yet to run 
full tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders 
 wrote:


Hi Nick, 

yeah, we are using the same nvme disk with an additional 
partition to use as journal/wal. We double check the c-state and it was 
not configure to use c1, so we change that on all the osd nodes and mon 
nodes and we're going to make some new tests, and see how it goes. I'll 
get back as soon as get got those tests running.

Thanks a lot,

Best,






German

2017-11-27 12:16 GMT-03:00 Nick Fisk :


From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar 
Cc: ceph-users 
            Subject: Re: [ceph-users] ceph all-nvme mysql 
performance 
tuning

 

Hi Maged,

 

Thanks a lot for the response. We try with different 
number of threads and we're getting almost the same kind of difference 
between the storage types. Going to try with different rbd stripe size, 
object size values and see if we get more competitive numbers. Will get 
back with more tests and param changes to see if we get better :)

 

 

Just to echo a couple of comments. Ceph will always 
struggle to match the performance of a traditional array for mainly 2 
reasons.

 

1.  You are replacing some sort of dual ported SAS 
or 
internally RDMA connected device with a network for Ceph replication 
traffic. This will instantly have a large impact on write latency
2.  Ceph locks at the PG level and a PG will most 
likely cover at least one 4MB object, so lots of small accesses to the 
same blocks (on a block device) will wait on each other and go 
effectively at a single threaded rate.

 

The best thing you can do to mitigate these, is to run 
the fastest journal/WAL devices you can, fastest network connections (ie 
25Gb/s) and run your CPU’s at max C and P states.

 

You stated that you are running the performance profile 
on the CPU’s. Could you also just double check that the C-states are 
being held at C1(e)? There 

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread German Anders
Thanks a lot Luis, I agree with you regarding the CPUs, but unfortunately
that was the best CPU model that we could afford :S

For the NUMA part, I managed to pin the OSDs by changing the
/usr/lib/systemd/system/ceph-osd@.service file and adding the CPUAffinity
list to it. But that applies to ALL the OSDs on a node, pinning them to one
specific CPU list; I can't find a way to specify a list for only a specific
subset of OSDs.
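
One way to get per-OSD affinity instead of template-wide affinity is an
instance-specific systemd drop-in; this is a sketch based on standard systemd
behaviour, not something verified in this thread (core lists are placeholders):

# creates /etc/systemd/system/ceph-osd@3.service.d/override.conf
systemctl edit ceph-osd@3
#   [Service]
#   CPUAffinity=0-9,20-29     # cores on the NUMA node holding osd.3's NVMe

systemctl daemon-reload
systemctl restart ceph-osd@3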

Also, I notice that the NVMe disks are all on the same NUMA node (since I'm
using half of the shelf - the other half will be attached to the other
node), so the PCIe lanes of the NVMe disks are all on the same CPU (in this
case 0). I also find that the IB adapter that is mapped to the OSD network
(osd replication) is pinned to CPU 1, so that traffic will cross the QPI path.

And for the memory, from the other email, we are already using the
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 134217728
(128 MiB).
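
For reference, on Ubuntu that variable is usually set in /etc/default/ceph,
which the ceph systemd units read as an environment file - a sketch, adjust to
your packaging:

grep TCMALLOC /etc/default/ceph
# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MiB
systemctl restart ceph-osd.target                   # restart OSDs to pick it up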

In this case I can pin all the current OSDs to CPU 0, but in the near
future, when I add more NVMe disks to the OSD nodes, I'll definitely need to
pin the other half of the OSDs to CPU 1 - has someone already done this?

Thanks a lot,

Best,


*German*

2017-11-28 6:36 GMT-03:00 Luis Periquito :

> There are a few things I don't like about your machines... If you want
> latency/IOPS (as you seemingly do) you really want the highest frequency
> CPUs, even over number of cores. These are not too bad, but not great
> either.
>
> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA nodes?
> Ideally OSD is pinned to same NUMA node the NVMe device is connected to.
> Each NVMe device will be running on PCIe lanes generated by one of the
> CPUs...
>
> What versions of TCMalloc (or jemalloc) are you running? Have you tuned
> them to have a bigger cache?
>
> These are from what I've learned using filestore - I've yet to run full
> tests on bluestore - but they should still apply...
>
> On Mon, Nov 27, 2017 at 5:10 PM, German Anders 
> wrote:
>
>> Hi Nick,
>>
>> yeah, we are using the same nvme disk with an additional partition to use
>> as journal/wal. We double check the c-state and it was not configure to use
>> c1, so we change that on all the osd nodes and mon nodes and we're going to
>> make some new tests, and see how it goes. I'll get back as soon as get got
>> those tests running.
>>
>> Thanks a lot,
>>
>> Best,
>>
>>
>> *German*
>>
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>>
>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
>>> Behalf Of *German Anders
>>> *Sent:* 27 November 2017 14:44
>>> *To:* Maged Mokhtar 
>>> *Cc:* ceph-users 
>>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>>
>>>
>>>
>>> Hi Maged,
>>>
>>>
>>>
>>> Thanks a lot for the response. We try with different number of threads
>>> and we're getting almost the same kind of difference between the storage
>>> types. Going to try with different rbd stripe size, object size values and
>>> see if we get more competitive numbers. Will get back with more tests and
>>> param changes to see if we get better :)
>>>
>>>
>>>
>>>
>>>
>>> Just to echo a couple of comments. Ceph will always struggle to match
>>> the performance of a traditional array for mainly 2 reasons.
>>>
>>>
>>>
>>>1. You are replacing some sort of dual ported SAS or internally RDMA
>>>connected device with a network for Ceph replication traffic. This will
>>>instantly have a large impact on write latency
>>>2. Ceph locks at the PG level and a PG will most likely cover at
>>>least one 4MB object, so lots of small accesses to the same blocks (on a
>>>block device) will wait on each other and go effectively at a single
>>>threaded rate.
>>>
>>>
>>>
>>> The best thing you can do to mitigate these, is to run the fastest
>>> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
>>> run your CPU’s at max C and P states.
>>>
>>>
>>>
>>> You stated that you are running the performance profile on the CPU’s.
>>> Could you also just double check that the C-states are being held at C1(e)?
>>> There are a few utilities that can show this in realtime.
>>>
>>>
>>>
>>> Other than that, although there could be some minor tweaks, you are
>>> probably nearing the limit of what you can hope to achieve.
>>>
>>>
>>>

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread Luis Periquito
There are a few things I don't like about your machines... If you want
latency/IOPS (as you seemingly do) you really want the highest frequency
CPUs, even over number of cores. These are not too bad, but not great
either.

Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA nodes?
Ideally OSD is pinned to same NUMA node the NVMe device is connected to.
Each NVMe device will be running on PCIe lanes generated by one of the
CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you tuned
them to have a bigger cache?
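
A quick, hedged way to answer that on an OSD node is to check what the daemon
is actually linked against (package names vary by distro):

ldd /usr/bin/ceph-osd | egrep 'tcmalloc|jemalloc'
dpkg -l | egrep 'libtcmalloc|libgoogle-perftools|libjemalloc'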

These are from what I've learned using filestore - I've yet to run full
tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders  wrote:

> Hi Nick,
>
> yeah, we are using the same nvme disk with an additional partition to use
> as journal/wal. We double check the c-state and it was not configure to use
> c1, so we change that on all the osd nodes and mon nodes and we're going to
> make some new tests, and see how it goes. I'll get back as soon as get got
> those tests running.
>
> Thanks a lot,
>
> Best,
>
>
> *German*
>
> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *German Anders
>> *Sent:* 27 November 2017 14:44
>> *To:* Maged Mokhtar 
>> *Cc:* ceph-users 
>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>
>>
>>
>> Hi Maged,
>>
>>
>>
>> Thanks a lot for the response. We try with different number of threads
>> and we're getting almost the same kind of difference between the storage
>> types. Going to try with different rbd stripe size, object size values and
>> see if we get more competitive numbers. Will get back with more tests and
>> param changes to see if we get better :)
>>
>>
>>
>>
>>
>> Just to echo a couple of comments. Ceph will always struggle to match the
>> performance of a traditional array for mainly 2 reasons.
>>
>>
>>
>>1. You are replacing some sort of dual ported SAS or internally RDMA
>>connected device with a network for Ceph replication traffic. This will
>>instantly have a large impact on write latency
>>2. Ceph locks at the PG level and a PG will most likely cover at
>>least one 4MB object, so lots of small accesses to the same blocks (on a
>>block device) will wait on each other and go effectively at a single
>>threaded rate.
>>
>>
>>
>> The best thing you can do to mitigate these, is to run the fastest
>> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
>> run your CPU’s at max C and P states.
>>
>>
>>
>> You stated that you are running the performance profile on the CPU’s.
>> Could you also just double check that the C-states are being held at C1(e)?
>> There are a few utilities that can show this in realtime.
>>
>>
>>
>> Other than that, although there could be some minor tweaks, you are
>> probably nearing the limit of what you can hope to achieve.
>>
>>
>>
>> Nick
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Best,
>>
>>
>> *German*
>>
>>
>>
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>>
>> On 2017-11-27 15:02, German Anders wrote:
>>
>> Hi All,
>>
>>
>>
>> I've a performance question, we recently install a brand new Ceph cluster
>> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>> The back-end of the cluster is using a bond IPoIB (active/passive) , and
>> for the front-end we are using a bonding config with active/active (20GbE)
>> to communicate with the clients.
>>
>>
>>
>> The cluster configuration is the following:
>>
>>
>>
>> *MON Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 3x 1U servers:
>>
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>
>>
>> *OSD Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 4x 2U servers:
>>
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   1x Ethernet Controller 10G X550T
>>
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>   12x Intel SSD DC P3520

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Nick,

Yeah, we are using the same NVMe disk with an additional partition to use
as journal/WAL. We double-checked the C-states and they were not configured
to use C1, so we changed that on all the OSD nodes and MON nodes, and we're
going to run some new tests and see how it goes. I'll get back as soon as
we've got those tests running.
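
A few hedged ways to verify and hold the C-state on the nodes (tool
availability varies by distro; the kernel parameters are the usual ones, not
verified in this thread):

cpupower idle-info        # lists the C-states and whether deeper ones are enabled
turbostat                 # live per-core C-state residency (CPU%c1, CPU%c6, ...)
cat /sys/module/intel_idle/parameters/max_cstate

# to hold the cores at C1, common kernel cmdline options are:
#   intel_idle.max_cstate=1 processor.max_cstate=1
# or use a latency-oriented tuned profile (see the tuned suggestion later in
# the thread).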

Thanks a lot,

Best,


*German*

2017-11-27 12:16 GMT-03:00 Nick Fisk :

> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 27 November 2017 14:44
> *To:* Maged Mokhtar 
> *Cc:* ceph-users 
> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with different number of threads and
> we're getting almost the same kind of difference between the storage types.
> Going to try with different rbd stripe size, object size values and see if
> we get more competitive numbers. Will get back with more tests and param
> changes to see if we get better :)
>
>
>
>
>
> Just to echo a couple of comments. Ceph will always struggle to match the
> performance of a traditional array for mainly 2 reasons.
>
>
>
>1. You are replacing some sort of dual ported SAS or internally RDMA
>connected device with a network for Ceph replication traffic. This will
>instantly have a large impact on write latency
>2. Ceph locks at the PG level and a PG will most likely cover at least
>one 4MB object, so lots of small accesses to the same blocks (on a block
>device) will wait on each other and go effectively at a single threaded
>rate.
>
>
>
> The best thing you can do to mitigate these, is to run the fastest
> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
> run your CPU’s at max C and P states.
>
>
>
> You stated that you are running the performance profile on the CPU’s.
> Could you also just double check that the C-states are being held at C1(e)?
> There are a few utilities that can show this in realtime.
>
>
>
> Other than that, although there could be some minor tweaks, you are
> probably nearing the limit of what you can hope to achieve.
>
>
>
> Nick
>
>
>
>
>
> Thanks,
>
>
>
> Best,
>
>
> *German*
>
>
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
>
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
>
>
> The cluster configuration is the following:
>
>
>
> *MON Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 3x 1U servers:
>
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>
>
> *OSD Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 4x 2U servers:
>
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   1x Ethernet Controller 10G X550T
>
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
>
>
>
> Here's the tree:
>
>
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -7   48.0 root root
>
> -5   24.0 rack rack1
>
> -1   12.0 node cpn01
>
>  0  nvme  1.0 osd.0  up  1.0 1.0
>
>  1  nvme  1.0 osd.1  up  1.0 1.0
>
>  2  nvme  1.0 osd.2  up  1.0 1.0
>
>  3  nvme  1.0 osd.3  up  1.0 1.0
>
>  4  nvme  1.0 osd.4  up  1.0 1.0
>
>  5  nvme  1.0 osd.5  up  1.0 1.0
>
>  6  nvme  1.0 osd.6  up  1.0 1.0
>
>  7  nvme  1.0 osd.7  up  1.0 1.0
>
>  8  nvme  1.0 osd.8  up  1.0 1.0
>
>  9  nvme  1.0 osd.9  up  1.0 1.0
>
> 10  nvme  1.0 osd.10 up  1.0 1.0
>
> 11  nvme  1.0 osd.11 up  1.0 1.0
>
> -3   12.0 node cpn03
>
> 24  nvme  1.0 osd.24 u

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi David,
Thanks a lot for the response. In fact, we first tried not using any
scheduler at all, but then we tried the kyber I/O scheduler and noticed a
slight improvement in terms of performance; that's why we actually kept it.
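
For reference, with blk-mq the scheduler can be inspected and switched per
device at runtime; a sketch with placeholder device names:

cat /sys/block/nvme0n1/queue/scheduler              # e.g. [none] mq-deadline kyber bfq
echo kyber > /sys/block/nvme0n1/queue/scheduler     # what was tried above
echo none  > /sys/block/nvme0n1/queue/scheduler     # no scheduler, as suggested below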


*German*

2017-11-27 13:48 GMT-03:00 David Byte :

> From the benchmarks I have seen and done myself, I’m not sure why you are
> using an i/o scheduler at all with NVMe.  While there are a few cases where
> it may provide a slight benefit, simply having mq enabled with no scheduler
> seems to provide the best performance for an all flash, especially all
> NVMe, environment.
>
>
>
> David Byte
>
> Sr. Technology Strategist
>
> *SCE Enterprise Linux*
>
> *SCE Enterprise Storage*
>
> Alliances and SUSE Embedded
>
> db...@suse.com
>
> 918.528.4422
>
>
>
> *From: *ceph-users  on behalf of
> German Anders 
> *Date: *Monday, November 27, 2017 at 8:44 AM
> *To: *Maged Mokhtar 
> *Cc: *ceph-users 
> *Subject: *Re: [ceph-users] ceph all-nvme mysql performance tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with different number of threads and
> we're getting almost the same kind of difference between the storage types.
> Going to try with different rbd stripe size, object size values and see if
> we get more competitive numbers. Will get back with more tests and param
> changes to see if we get better :)
>
>
>
> Thanks,
>
>
>
> Best,
>
>
> *German*
>
>
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
>
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
>
>
> The cluster configuration is the following:
>
>
>
> *MON Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 3x 1U servers:
>
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>
>
> *OSD Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 4x 2U servers:
>
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   1x Ethernet Controller 10G X550T
>
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
>
>
>
> Here's the tree:
>
>
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -7   48.0 root root
>
> -5   24.0 rack rack1
>
> -1   12.0 node cpn01
>
>  0  nvme  1.0 osd.0  up  1.0 1.0
>
>  1  nvme  1.0 osd.1  up  1.0 1.0
>
>  2  nvme  1.0 osd.2  up  1.0 1.0
>
>  3  nvme  1.0 osd.3  up  1.0 1.0
>
>  4  nvme  1.0 osd.4  up  1.0 1.0
>
>  5  nvme  1.0 osd.5  up  1.0 1.0
>
>  6  nvme  1.0 osd.6  up  1.0 1.0
>
>  7  nvme  1.0 osd.7  up  1.0 1.0
>
>  8  nvme  1.0 osd.8  up  1.0 1.0
>
>  9  nvme  1.0 osd.9  up  1.0 1.0
>
> 10  nvme  1.0 osd.10 up  1.0 1.0
>
> 11  nvme  1.0 osd.11 up  1.0 1.0
>
> -3   12.0 node cpn03
>
> 24  nvme  1.0 osd.24 up  1.0 1.0
>
> 25  nvme  1.0 osd.25 up  1.0 1.0
>
> 26  nvme  1.0 osd.26 up  1.0 1.0
>
> 27  nvme  1.0 osd.27 up  1.0 1.0
>
> 28  nvme  1.0 osd.28 up  1.0 1.0
>
> 29  nvme  1.0 osd.29 up  1.0 1.0
>
> 30  nvme  1.0 osd.30 up  1.0 1.0
>
> 31  nvme  1.0 osd.31 up  1.0 1.0
>
> 32  nvme  1.0 osd.32 up  1.0 1.0
>
> 33  nvme  1.0 osd.33 up  1.0 1.0
>
> 34  nvme  1.0 osd.34 up  1.0 1.0
>
> 35  nvme  1.0 osd.35 up  1.0 1.0
>
> -6   24.0 rack rack2
>
> -2   12.000

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread David Byte
From the benchmarks I have seen and done myself, I’m not sure why you are using 
an i/o scheduler at all with NVMe.  While there are a few cases where it may 
provide a slight benefit, simply having mq enabled with no scheduler seems to 
provide the best performance for an all flash, especially all NVMe, environment.

David Byte
Sr. Technology Strategist
SCE Enterprise Linux
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422

From: ceph-users  on behalf of German Anders 

Date: Monday, November 27, 2017 at 8:44 AM
To: Maged Mokhtar 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We try with different number of threads and 
we're getting almost the same kind of difference between the storage types. 
Going to try with different rbd stripe size, object size values and see if we 
get more competitive numbers. Will get back with more tests and param changes 
to see if we get better :)

Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar :

On 2017-11-27 15:02, German Anders wrote:
Hi All,

I've a performance question, we recently install a brand new Ceph cluster with 
all-nvme disks, using ceph version 12.2.0 with bluestore configured. The 
back-end of the cluster is using a bond IPoIB (active/passive) , and for the 
front-end we are using a bonding config with active/active (20GbE) to 
communicate with the clients.

The cluster configuration is the following:

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)


Here's the tree:

ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-7   48.0 root root
-5   24.0 rack rack1
-1   12.0 node cpn01
 0  nvme  1.0 osd.0  up  1.0 1.0
 1  nvme  1.0 osd.1  up  1.0 1.0
 2  nvme  1.0 osd.2  up  1.0 1.0
 3  nvme  1.0 osd.3  up  1.0 1.0
 4  nvme  1.0 osd.4  up  1.0 1.0
 5  nvme  1.0 osd.5  up  1.0 1.0
 6  nvme  1.0 osd.6  up  1.0 1.0
 7  nvme  1.0 osd.7  up  1.0 1.0
 8  nvme  1.0 osd.8  up  1.0 1.0
 9  nvme  1.0 osd.9  up  1.0 1.0
10  nvme  1.0 osd.10 up  1.0 1.0
11  nvme  1.0 osd.11 up  1.0 1.0
-3   12.0 node cpn03
24  nvme  1.0 osd.24 up  1.0 1.0
25  nvme  1.0 osd.25 up  1.0 1.0
26  nvme  1.0 osd.26 up  1.0 1.0
27  nvme  1.0 osd.27 up  1.0 1.0
28  nvme  1.0 osd.28 up  1.0 1.0
29  nvme  1.0 osd.29 up  1.0 1.0
30  nvme  1.0 osd.30 up  1.0 1.0
31  nvme  1.0 osd.31 up  1.0 1.0
32  nvme  1.0 osd.32 up  1.0 1.0
33  nvme  1.0 osd.33 up  1.0 1.0
34  nvme  1.0 osd.34 up  1.0 1.0
35  nvme  1.0 osd.35 up  1.0 1.0
-6   24.0 rack rack2
-2   12.0 node cpn02
12  nvme  1.0 osd.12 up  1.0 1.0
13  nvme  1.0 osd.13 up  1.0 1.0
14  nvme  1.0 osd.14 up  1.0 1.0
15  nvme  1.0 osd.15 up  1.0 1.0
16  nvme  1.0 osd.16 up  1.0 1.0
17  nvme  1.0 osd.17 up  1.0 1.0
18  nvme  1.0 osd.18 up  1.0 1.0
19  nvme  1.0 osd.19 up  1.0 1.0
20  nvme  1.0 osd.20 up  1.0 1.0
21  nvme  1.0 osd.21 up  1.0 1.0
22  nvme  1.0 osd.22 up  1.0 1.0
23  nvme  1.0 osd.23 up  1.0 1.0
-4   12.0 node cpn04
36  nvme  1.0 osd.36 up  1.0 1.0
37  nvme  1.0 osd.37 up  1.0 1.0
38  nvme  1.0 osd.38 up  1.0 1.0
39  nvme  1.0 osd.39 up  1.0 1.0
40  nvme  1.0 osd.40 up  1.0 1.0
41  nvme  1.0 osd.41 up  1.0

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Donny Davis
Also what tuned profile are you using? There is something to be gained by
using a matching tuned profile for your workload.
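
A hedged sketch of checking and switching the profile on the OSD nodes (these
profile names ship with tuned; pick whichever matches the workload):

tuned-adm active
tuned-adm list
tuned-adm profile latency-performance     # or network-latency for low-latency networking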

On Mon, Nov 27, 2017 at 11:16 AM, Donny Davis  wrote:

> Why not ask Red Hat? All the rest of the storage vendors you are looking
> at are not free.
>
> Full disclosure, I am an employee at Red Hat.
>
> On Mon, Nov 27, 2017 at 10:16 AM, Nick Fisk  wrote:
>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *German Anders
>> *Sent:* 27 November 2017 14:44
>> *To:* Maged Mokhtar 
>> *Cc:* ceph-users 
>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>
>>
>>
>> Hi Maged,
>>
>>
>>
>> Thanks a lot for the response. We try with different number of threads
>> and we're getting almost the same kind of difference between the storage
>> types. Going to try with different rbd stripe size, object size values and
>> see if we get more competitive numbers. Will get back with more tests and
>> param changes to see if we get better :)
>>
>>
>>
>>
>>
>> Just to echo a couple of comments. Ceph will always struggle to match the
>> performance of a traditional array for mainly 2 reasons.
>>
>>
>>
>>1. You are replacing some sort of dual ported SAS or internally RDMA
>>connected device with a network for Ceph replication traffic. This will
>>instantly have a large impact on write latency
>>2. Ceph locks at the PG level and a PG will most likely cover at
>>least one 4MB object, so lots of small accesses to the same blocks (on a
>>block device) will wait on each other and go effectively at a single
>>threaded rate.
>>
>>
>>
>> The best thing you can do to mitigate these, is to run the fastest
>> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
>> run your CPU’s at max C and P states.
>>
>>
>>
>> You stated that you are running the performance profile on the CPU’s.
>> Could you also just double check that the C-states are being held at C1(e)?
>> There are a few utilities that can show this in realtime.
>>
>>
>>
>> Other than that, although there could be some minor tweaks, you are
>> probably nearing the limit of what you can hope to achieve.
>>
>>
>>
>> Nick
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Best,
>>
>>
>> *German*
>>
>>
>>
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>>
>> On 2017-11-27 15:02, German Anders wrote:
>>
>> Hi All,
>>
>>
>>
>> I've a performance question, we recently install a brand new Ceph cluster
>> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>> The back-end of the cluster is using a bond IPoIB (active/passive) , and
>> for the front-end we are using a bonding config with active/active (20GbE)
>> to communicate with the clients.
>>
>>
>>
>> The cluster configuration is the following:
>>
>>
>>
>> *MON Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 3x 1U servers:
>>
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>
>>
>> *OSD Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 4x 2U servers:
>>
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   1x Ethernet Controller 10G X550T
>>
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>>
>>
>>
>>
>> Here's the tree:
>>
>>
>>
>> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>>
>> -7   48.0 root root
>>
>> -5   24.0 rack rack1
>>
>> -1   12.0 node cpn01
>>
>>  0  nvme  1.0 osd.0  up  1.0 1.0
>>
>>  1  nvme  1.0 osd.1  up  1.0 1.0
>>
>>  2  nvme  1.0 osd.2  up  1.0 1.0
>>
>>  3  nvme  1.0 osd.3  up  1.0 1.0
>>
>>  4  nvme  1.0 osd.4  up  1.0 1.0

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Donny Davis
Why not ask Red Hat? All the rest of the storage vendors you are looking at
are not free.

Full disclosure, I am an employee at Red Hat.

On Mon, Nov 27, 2017 at 10:16 AM, Nick Fisk  wrote:

> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 27 November 2017 14:44
> *To:* Maged Mokhtar 
> *Cc:* ceph-users 
> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with different number of threads and
> we're getting almost the same kind of difference between the storage types.
> Going to try with different rbd stripe size, object size values and see if
> we get more competitive numbers. Will get back with more tests and param
> changes to see if we get better :)
>
>
>
>
>
> Just to echo a couple of comments. Ceph will always struggle to match the
> performance of a traditional array for mainly 2 reasons.
>
>
>
>1. You are replacing some sort of dual ported SAS or internally RDMA
>connected device with a network for Ceph replication traffic. This will
>instantly have a large impact on write latency
>2. Ceph locks at the PG level and a PG will most likely cover at least
>one 4MB object, so lots of small accesses to the same blocks (on a block
>device) will wait on each other and go effectively at a single threaded
>rate.
>
>
>
> The best thing you can do to mitigate these, is to run the fastest
> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
> run your CPU’s at max C and P states.
>
>
>
> You stated that you are running the performance profile on the CPU’s.
> Could you also just double check that the C-states are being held at C1(e)?
> There are a few utilities that can show this in realtime.
>
>
>
> Other than that, although there could be some minor tweaks, you are
> probably nearing the limit of what you can hope to achieve.
>
>
>
> Nick
>
>
>
>
>
> Thanks,
>
>
>
> Best,
>
>
> *German*
>
>
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar :
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
>
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
>
>
> The cluster configuration is the following:
>
>
>
> *MON Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 3x 1U servers:
>
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>
>
> *OSD Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 4x 2U servers:
>
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   1x Ethernet Controller 10G X550T
>
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
>
>
>
> Here's the tree:
>
>
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -7   48.0 root root
>
> -5   24.0 rack rack1
>
> -1   12.0 node cpn01
>
>  0  nvme  1.0 osd.0  up  1.0 1.0
>
>  1  nvme  1.0 osd.1  up  1.0 1.0
>
>  2  nvme  1.0 osd.2  up  1.0 1.0
>
>  3  nvme  1.0 osd.3  up  1.0 1.0
>
>  4  nvme  1.0 osd.4  up  1.0 1.0
>
>  5  nvme  1.0 osd.5  up  1.0 1.0
>
>  6  nvme  1.0 osd.6  up  1.0 1.0
>
>  7  nvme  1.0 osd.7  up  1.0 1.0
>
>  8  nvme  1.0 osd.8  up  1.0 1.0
>
>  9  nvme  1.0 osd.9  up  1.0 1.0
>
> 10  nvme  1.0 osd.10 up  1.0 1.0
>
> 11  nvme  1.0 osd.11 up  1.0 1.0
>
> -3   12.0 node cpn03
>
> 24  nvme  1.0 osd.24 up  1.0 1.0
>
> 25  nvme  1.0 osd.25 up  1.0 1.0
>
> 26  nvme  1.0 osd.26 up  1.0 1.0
>
> 27  nvme  1.0 osd.27 up  1.0 1.0

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Gerhard W. Recher
Hi German,

We have a similar config:

proxmox-ve: 5.1-27 (running kernel: 4.13.8-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.8-1-pve: 4.13.8-27
ceph: 12.2.1-pve3

system(4 nodes): Supermicro 2028U-TN24R4T+

2 port Mellanox connect x3pro 56Gbit
4 port intel 10GigE
memory: 768 GBytes
CPU DUAL  Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

ceph: 28 osds
24  Intel Nvme 2000GB Intel SSD DC P3520, 2,5", PCIe 3.0 x4,
 4  Intel Nvme 1,6TB Intel SSD DC P3700, 2,5", U.2 PCIe 3.0


Sysbench on container:

#!/bin/bash
sysbench --test=fileio --file-total-size=4G --file-num=64 prepare

for run in 1 2 3 ;do
for thread in 1 4 8 16 32 ;do

echo "Performing test RW-${thread}T-${run}"
sysbench --test=fileio --file-total-size=4G --file-test-mode=rndwr \
--max-time=60 --max-requests=0 --file-block-size=4K --file-num=64 \
--num-threads=${thread} run > /root/RW-${thread}T-${run}

echo "Performing test RR-${thread}T-${run}"
sysbench --test=fileio --file-total-size=4G --file-test-mode=rndrd \
--max-time=60 --max-requests=0 --file-block-size=4K --file-num=64 \
--num-threads=${thread} run > /root/RR-${thread}T-${run}

echo "Performing test SQ-${thread}T-${run}"
sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 \
--mysql-db=sysbench --mysql-user=sysbench --mysql-password=password \
--max-time=60 --max-requests=0 --num-threads=${thread} run > \
/root/SQ-${thread}T-${run}

done
done
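
Note that the oltp runs above assume the sysbench database, user and test
table already exist; a hedged sketch of the one-time preparation (credentials
mirror the script, GRANT syntax is MySQL 5.x-era):

mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS sysbench;
  GRANT ALL ON sysbench.* TO 'sysbench'@'localhost' IDENTIFIED BY 'password';"

sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 \
  --mysql-db=sysbench --mysql-user=sysbench --mysql-password=password prepare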




 grep transactions: S*

SQ-1T-1:    transactions:    6009   (100.14 per sec.)
SQ-1T-2:    transactions:    9458   (157.62 per sec.)
SQ-1T-3:    transactions:    9479   (157.97 per sec.)
SQ-4T-1:    transactions:    26574  (442.84 per sec.)
SQ-4T-2:    transactions:    28275  (471.20 per sec.)
SQ-4T-3:    transactions:    28067  (467.69 per sec.)
SQ-8T-1:    transactions:    44450  (740.78 per sec.)
SQ-8T-2:    transactions:    44410  (740.09 per sec.)
SQ-8T-3:    transactions:    44459  (740.93 per sec.)
SQ-16T-1:    transactions:    59866  (997.59 per sec.)
SQ-16T-2:    transactions:    59539  (991.99 per sec.)
SQ-16T-3:    transactions:    59615  (993.50 per sec.)
SQ-32T-1:    transactions:    71070  (1184.18 per sec.)
SQ-32T-2:    transactions:    71007  (1183.14 per sec.)
SQ-32T-3:    transactions:    71320  (1188.51 per sec.)



 grep Requests/sec R*
RR-16T-1:1464550.51 Requests/sec executed
RR-16T-2:1473440.63 Requests/sec executed
RR-16T-3:1515853.86 Requests/sec executed
RR-1T-1:741333.28 Requests/sec executed
RR-1T-2:693246.00 Requests/sec executed
RR-1T-3:691166.38 Requests/sec executed
RR-32T-1:1432609.74 Requests/sec executed
RR-32T-2:1479191.78 Requests/sec executed
RR-32T-3:1476780.11 Requests/sec executed
RR-4T-1:1411168.95 Requests/sec executed
RR-4T-2:1373557.99 Requests/sec executed
RR-4T-3:1306820.18 Requests/sec executed
RR-8T-1:1549924.57 Requests/sec executed
RR-8T-2:1580304.14 Requests/sec executed
RR-8T-3:1603842.56 Requests/sec executed
RW-16T-1:12753.82 Requests/sec executed
RW-16T-2:12394.93 Requests/sec executed
RW-16T-3:12560.11 Requests/sec executed
RW-1T-1: 1344.99 Requests/sec executed
RW-1T-2: 1324.98 Requests/sec executed
RW-1T-3: 1306.64 Requests/sec executed
RW-32T-1:16565.37 Requests/sec executed
RW-32T-2:16497.67 Requests/sec executed
RW-32T-3:16542.54 Requests/sec executed
RW-4T-1: 5099.07 Requests/sec executed
RW-4T-2: 4970.28 Requests/sec executed
RW-4T-3: 5121.44 Requests/sec executed
RW-8T-1: 8487.91 Requests/sec executed
RW-8T-2: 8632.96 Requests/sec executed
RW-8T-3: 8393.91 Requests/sec executed






Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 171 4802507
On 27.11.2017 at 14:02, German Anders wrote:
> Hi All,
>
> I've a performance question, we recently install a brand new Ceph
> cluster with all-nvme disks, using ceph version 12.2.0 with bluestore
> configured. The back-end of the cluster is using a bond IPoIB
> (active/passive) , and for the front-end we are using a bonding config
> with active/active (20GbE) to communicate with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 
> 3x 1U servers:
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   1x Ethernet Controller 10G X550T
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
> Here's the tree:
>
> ID CLASS WEIGHT   

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Nick Fisk
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

 

Hi Maged,

 

Thanks a lot for the response. We tried with different numbers of threads and 
we're getting almost the same kind of difference between the storage types. 
We're going to try different rbd stripe size and object size values and see if 
we get more competitive numbers. Will get back with more tests and param changes 
to see if we get better results :)

 

 

Just to echo a couple of comments: Ceph will always struggle to match the 
performance of a traditional array, for two main reasons.

 

1.  You are replacing some sort of dual-ported SAS or internally RDMA-connected 
device with a network for Ceph replication traffic. This instantly has a large 
impact on write latency.
2.  Ceph locks at the PG level, and a PG will most likely cover at least one 
4MB object, so lots of small accesses to the same blocks (on a block device) 
will wait on each other and effectively proceed at a single-threaded rate.

 

The best things you can do to mitigate this are to run the fastest journal/WAL 
devices you can, the fastest network connections (i.e. 25Gb/s), and to keep your 
CPUs at their highest-performance C- and P-states.

 

You stated that you are running the performance profile on the CPUs. Could you 
also just double-check that the C-states are being held at C1(e)? There are a 
few utilities that can show this in real time.
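
For reference, here's a rough sketch of how I'd check that from the shell 
(sysfs paths and turbostat options vary by distro, kernel and driver, so treat 
these as examples rather than exact commands):

# Deepest C-state the intel_idle driver is allowed to use (if that driver is loaded)
cat /sys/module/intel_idle/parameters/max_cstate

# Watch per-core C-state residency in real time (turbostat ships with linux-tools)
turbostat --quiet --interval 5 --show Core,Busy%,CPU%c1,CPU%c6

# Pinning cores to C1 is usually done with kernel boot parameters such as:
#   intel_idle.max_cstate=1 processor.max_cstate=1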

 

Other than that, although there could be some minor tweaks, you are probably 
nearing the limit of what you can hope to achieve.

 

Nick

 

 

Thanks,

 

Best,




German

 

2017-11-27 11:36 GMT-03:00 Maged Mokhtar :

On 2017-11-27 15:02, German Anders wrote:

Hi All,

 

I've a performance question, we recently install a brand new Ceph cluster with 
all-nvme disks, using ceph version 12.2.0 with bluestore configured. The 
back-end of the cluster is using a bond IPoIB (active/passive) , and for the 
front-end we are using a bonding config with active/active (20GbE) to 
communicate with the clients.

 

The cluster configuration is the following:

 

MON Nodes:

OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 

3x 1U servers:

  2x Intel Xeon E5-2630v4 @2.2Ghz

  128G RAM

  2x Intel SSD DC S3520 150G (in RAID-1 for OS)

  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

 

OSD Nodes:

OS: Ubuntu 16.04.3 LTS | kernel 4.12.14

4x 2U servers:

  2x Intel Xeon E5-2640v4 @2.4Ghz

  128G RAM

  2x Intel SSD DC S3520 150G (in RAID-1 for OS)

  1x Ethernet Controller 10G X550T

  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection

  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons

  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

 

 

Here's the tree:

 

ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF

-7   48.0 root root

-5   24.0 rack rack1

-1   12.0 node cpn01

 0  nvme  1.0 osd.0  up  1.0 1.0

 1  nvme  1.0 osd.1  up  1.0 1.0

 2  nvme  1.0 osd.2  up  1.0 1.0

 3  nvme  1.0 osd.3  up  1.0 1.0

 4  nvme  1.0 osd.4  up  1.0 1.0

 5  nvme  1.0 osd.5  up  1.0 1.0

 6  nvme  1.0 osd.6  up  1.0 1.0

 7  nvme  1.0 osd.7  up  1.0 1.0

 8  nvme  1.0 osd.8  up  1.0 1.0

 9  nvme  1.0 osd.9  up  1.0 1.0

10  nvme  1.0 osd.10 up  1.0 1.0

11  nvme  1.0 osd.11 up  1.0 1.0

-3   12.0 node cpn03

24  nvme  1.0 osd.24 up  1.0 1.0

25  nvme  1.0 osd.25 up  1.0 1.0

26  nvme  1.0 osd.26 up  1.0 1.0

27  nvme  1.0 osd.27 up  1.0 1.0

28  nvme  1.0 osd.28 up  1.0 1.0

29  nvme  1.0 osd.29 up  1.0 1.0

30  nvme  1.0 osd.30 up  1.0 1.0

31  nvme  1.0 osd.31 up  1.0 1.0

32  nvme  1.0 osd.32 up  1.0 1.0

33  nvme  1.0 osd.33 up  1.0 1.0

34  nvme  1.0 osd.34 up  1.0 1.0

35  nvme  1.0 osd.35 up  1.0 1.0

-6   24.0 rack rack2

-2   12.0 node cpn02

12  nvme  1.0 osd.12 up  1.0 1.0

13  nvme  1.0 osd.13 up  1.0 1.0

14  nvme  1.0 osd.14 up  1.0 1.0

15  nvme  1.0 osd.15 up  1.0 1.0

16  nvme  1.0 osd.16 up  1.0 1.0

17  nvme  1.0 osd.17 up  1.00

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Maged,

Thanks a lot for the response. We tried with different numbers of threads and
we're getting almost the same kind of difference between the storage types.
We're going to try different rbd stripe size and object size values and see if
we get more competitive numbers. Will get back with more tests and param
changes to see if we get better results :)
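
As a rough sketch of what we'd try (image name, pool and sizes below are just
illustrative values, and this assumes a reasonably recent rbd CLI that accepts
the striping flags):

# create a test image with explicit object size and striping, then verify it
rbd create rbd/mysql-bench --size 100G \
    --object-size 4M --stripe-unit 64K --stripe-count 16
rbd info rbd/mysql-bench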

Thanks,

Best,

*German*

2017-11-27 11:36 GMT-03:00 Maged Mokhtar :

> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   1x Ethernet Controller 10G X550T
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
> Here's the tree:
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -7   48.0 root root
> -5   24.0 rack rack1
> -1   12.0 node cpn01
>  0  nvme  1.0 osd.0  up  1.0 1.0
>  1  nvme  1.0 osd.1  up  1.0 1.0
>  2  nvme  1.0 osd.2  up  1.0 1.0
>  3  nvme  1.0 osd.3  up  1.0 1.0
>  4  nvme  1.0 osd.4  up  1.0 1.0
>  5  nvme  1.0 osd.5  up  1.0 1.0
>  6  nvme  1.0 osd.6  up  1.0 1.0
>  7  nvme  1.0 osd.7  up  1.0 1.0
>  8  nvme  1.0 osd.8  up  1.0 1.0
>  9  nvme  1.0 osd.9  up  1.0 1.0
> 10  nvme  1.0 osd.10 up  1.0 1.0
> 11  nvme  1.0 osd.11 up  1.0 1.0
> -3   12.0 node cpn03
> 24  nvme  1.0 osd.24 up  1.0 1.0
> 25  nvme  1.0 osd.25 up  1.0 1.0
> 26  nvme  1.0 osd.26 up  1.0 1.0
> 27  nvme  1.0 osd.27 up  1.0 1.0
> 28  nvme  1.0 osd.28 up  1.0 1.0
> 29  nvme  1.0 osd.29 up  1.0 1.0
> 30  nvme  1.0 osd.30 up  1.0 1.0
> 31  nvme  1.0 osd.31 up  1.0 1.0
> 32  nvme  1.0 osd.32 up  1.0 1.0
> 33  nvme  1.0 osd.33 up  1.0 1.0
> 34  nvme  1.0 osd.34 up  1.0 1.0
> 35  nvme  1.0 osd.35 up  1.0 1.0
> -6   24.0 rack rack2
> -2   12.0 node cpn02
> 12  nvme  1.0 osd.12 up  1.0 1.0
> 13  nvme  1.0 osd.13 up  1.0 1.0
> 14  nvme  1.0 osd.14 up  1.0 1.0
> 15  nvme  1.0 osd.15 up  1.0 1.0
> 16  nvme  1.0 osd.16 up  1.0 1.0
> 17  nvme  1.0 osd.17 up  1.0 1.0
> 18  nvme  1.0 osd.18 up  1.0 1.0
> 19  nvme  1.0 osd.19 up  1.0 1.0
> 20  nvme  1.0 osd.20 up  1.0 1.0
> 21  nvme  1.0 osd.21 up  1.0 1.0
> 22  nvme  1.0 osd.22 up  1.0 1.0
> 23  nvme  1.0 osd.23 up  1.0 1.0
> -4   12.0 node cpn04
> 36  nvme  1.0 osd.36 up  1.0 1.0
> 37  nvme  1.0 osd.37 up  1.0 1.0
> 38  nvme  1.0 osd.38 up  1.0 1.0
> 39  nvme  1.0 osd.39 up  1.0 1.0
> 40  nvme  1.0 osd.40 up  1.0 1.0
> 41  nvme  1.0 osd.41 up  1.0 1.0
> 42  nvme  1.0 osd.42 up  1.0 1.0
> 43  nvme  1.0 osd.43 up  1.0 1.0
> 44  nvme  1.0 osd.44 up  1.0 1.0
> 45  nvme  1.0 osd.45 up  1.0 1.0
> 46  nvme  1.0 osd.46 up  1.0 1.0
> 47  nvme  1.0 osd.47 up  1.0 1.0
>
> The disk partition of one of the OSD nodes:
>
> NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> nvme6n1259:10   

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Maged Mokhtar
On 2017-11-27 15:02, German Anders wrote:

> Hi All, 
> 
> I've a performance question, we recently install a brand new Ceph cluster 
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured. The 
> back-end of the cluster is using a bond IPoIB (active/passive) , and for the 
> front-end we are using a bonding config with active/active (20GbE) to 
> communicate with the clients. 
> 
> The cluster configuration is the following: 
> 
> MON NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14  
> 3x 1U servers: 
> 2x Intel Xeon E5-2630v4 @2.2Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 
> OSD NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 
> 4x 2U servers: 
> 2x Intel Xeon E5-2640v4 @2.4Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 1x Ethernet Controller 10G X550T 
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons 
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port) 
> 
> Here's the tree: 
> 
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF 
> -7   48.0 root root 
> -5   24.0 rack rack1 
> -1   12.0 node cpn01 
> 0  nvme  1.0 osd.0  up  1.0 1.0 
> 1  nvme  1.0 osd.1  up  1.0 1.0 
> 2  nvme  1.0 osd.2  up  1.0 1.0 
> 3  nvme  1.0 osd.3  up  1.0 1.0 
> 4  nvme  1.0 osd.4  up  1.0 1.0 
> 5  nvme  1.0 osd.5  up  1.0 1.0 
> 6  nvme  1.0 osd.6  up  1.0 1.0 
> 7  nvme  1.0 osd.7  up  1.0 1.0 
> 8  nvme  1.0 osd.8  up  1.0 1.0 
> 9  nvme  1.0 osd.9  up  1.0 1.0 
> 10  nvme  1.0 osd.10 up  1.0 1.0 
> 11  nvme  1.0 osd.11 up  1.0 1.0 
> -3   12.0 node cpn03 
> 24  nvme  1.0 osd.24 up  1.0 1.0 
> 25  nvme  1.0 osd.25 up  1.0 1.0 
> 26  nvme  1.0 osd.26 up  1.0 1.0 
> 27  nvme  1.0 osd.27 up  1.0 1.0 
> 28  nvme  1.0 osd.28 up  1.0 1.0 
> 
> 29  nvme  1.0 osd.29 up  1.0 1.0 
> 30  nvme  1.0 osd.30 up  1.0 1.0 
> 31  nvme  1.0 osd.31 up  1.0 1.0 
> 32  nvme  1.0 osd.32 up  1.0 1.0 
> 33  nvme  1.0 osd.33 up  1.0 1.0 
> 34  nvme  1.0 osd.34 up  1.0 1.0 
> 35  nvme  1.0 osd.35 up  1.0 1.0 
> -6   24.0 rack rack2 
> -2   12.0 node cpn02 
> 12  nvme  1.0 osd.12 up  1.0 1.0 
> 13  nvme  1.0 osd.13 up  1.0 1.0 
> 14  nvme  1.0 osd.14 up  1.0 1.0 
> 15  nvme  1.0 osd.15 up  1.0 1.0 
> 16  nvme  1.0 osd.16 up  1.0 1.0 
> 17  nvme  1.0 osd.17 up  1.0 1.0 
> 18  nvme  1.0 osd.18 up  1.0 1.0 
> 19  nvme  1.0 osd.19 up  1.0 1.0 
> 20  nvme  1.0 osd.20 up  1.0 1.0 
> 21  nvme  1.0 osd.21 up  1.0 1.0 
> 22  nvme  1.0 osd.22 up  1.0 1.0 
> 23  nvme  1.0 osd.23 up  1.0 1.0 
> -4   12.0 node cpn04 
> 36  nvme  1.0 osd.36 up  1.0 1.0 
> 37  nvme  1.0 osd.37 up  1.0 1.0 
> 38  nvme  1.0 osd.38 up  1.0 1.0 
> 39  nvme  1.0 osd.39 up  1.0 1.0 
> 40  nvme  1.0 osd.40 up  1.0 1.0 
> 41  nvme  1.0 osd.41 up  1.0 1.0 
> 42  nvme  1.0 osd.42 up  1.0 1.0 
> 43  nvme  1.0 osd.43 up  1.0 1.0 
> 44  nvme  1.0 osd.44 up  1.0 1.0 
> 45  nvme  1.0 osd.45 up  1.0 1.0 
> 46  nvme  1.0 osd.46 up  1.0 1.0 
> 47  nvme  1.0 osd.47 up  1.0 1.0 
> 
> The disk partition of one of the OSD nodes: 
> 
> NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT 
> nvme6n1259:10   1.1T  0 disk 
> ├─nvme6n1p2259:15   0   1.1T  0 part 
> └─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6 
> nvme9n1259:00   1.1T  0 disk 
> ├─nvme9n1p2259:80   1.1T  0 part 
> └─nvme9n1p1259:70   100M  0 part  /var/lib/ceph/osd/ceph-9 
> sdb  8:16   0 

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Wido, thanks a lot for the quick response. Regarding the questions:

Have you tried to attach multiple RBD volumes:

- Root for OS (the root partition is on local SSDs)
- MySQL data dir (the idea is to run all the storage tests with the same
scheme; the first test uses one volume holding the data dir, the InnoDB log
and the binlog)
- MySQL InnoDB Logfile
- MySQL Binary Logging

So 4 disks in total to spread the I/O over. (The following tests are going to
be spread across 3 disks, and we'll make a new comparison between the arrays.)
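
As a rough sketch of that split layout in my.cnf (the mount points are made-up
examples, each one meant to be a separately mapped RBD volume):

[mysqld]
# each directory below sits on its own mounted RBD volume (example paths)
datadir                   = /mnt/rbd-data/mysql
innodb_log_group_home_dir = /mnt/rbd-innodb-log
log-bin                   = /mnt/rbd-binlog/mysql-bin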

Regarding the version of librbd, it's not a typo; we also use this server with
an old ceph cluster. We are going to upgrade the version and see if the tests
get better.

Thanks


*German*

2017-11-27 10:16 GMT-03:00 Wido den Hollander :

>
> > On 27 November 2017 at 14:02, German Anders wrote:
> >
> >
> > Hi All,
> >
> > I've a performance question, we recently install a brand new Ceph cluster
> > with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> > The back-end of the cluster is using a bond IPoIB (active/passive) , and
> > for the front-end we are using a bonding config with active/active
> (20GbE)
> > to communicate with the clients.
> >
> > The cluster configuration is the following:
> >
> > *MON Nodes:*
> > OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> > 3x 1U servers:
> >   2x Intel Xeon E5-2630v4 @2.2Ghz
> >   128G RAM
> >   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >
> > *OSD Nodes:*
> > OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> > 4x 2U servers:
> >   2x Intel Xeon E5-2640v4 @2.4Ghz
> >   128G RAM
> >   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >   1x Ethernet Controller 10G X550T
> >   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> >   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
> >
> >
> > Here's the tree:
> >
> > ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> > -7   48.0 root root
> > -5   24.0 rack rack1
> > -1   12.0 node cpn01
> >  0  nvme  1.0 osd.0  up  1.0 1.0
> >  1  nvme  1.0 osd.1  up  1.0 1.0
> >  2  nvme  1.0 osd.2  up  1.0 1.0
> >  3  nvme  1.0 osd.3  up  1.0 1.0
> >  4  nvme  1.0 osd.4  up  1.0 1.0
> >  5  nvme  1.0 osd.5  up  1.0 1.0
> >  6  nvme  1.0 osd.6  up  1.0 1.0
> >  7  nvme  1.0 osd.7  up  1.0 1.0
> >  8  nvme  1.0 osd.8  up  1.0 1.0
> >  9  nvme  1.0 osd.9  up  1.0 1.0
> > 10  nvme  1.0 osd.10 up  1.0 1.0
> > 11  nvme  1.0 osd.11 up  1.0 1.0
> > -3   12.0 node cpn03
> > 24  nvme  1.0 osd.24 up  1.0 1.0
> > 25  nvme  1.0 osd.25 up  1.0 1.0
> > 26  nvme  1.0 osd.26 up  1.0 1.0
> > 27  nvme  1.0 osd.27 up  1.0 1.0
> > 28  nvme  1.0 osd.28 up  1.0 1.0
> > 29  nvme  1.0 osd.29 up  1.0 1.0
> > 30  nvme  1.0 osd.30 up  1.0 1.0
> > 31  nvme  1.0 osd.31 up  1.0 1.0
> > 32  nvme  1.0 osd.32 up  1.0 1.0
> > 33  nvme  1.0 osd.33 up  1.0 1.0
> > 34  nvme  1.0 osd.34 up  1.0 1.0
> > 35  nvme  1.0 osd.35 up  1.0 1.0
> > -6   24.0 rack rack2
> > -2   12.0 node cpn02
> > 12  nvme  1.0 osd.12 up  1.0 1.0
> > 13  nvme  1.0 osd.13 up  1.0 1.0
> > 14  nvme  1.0 osd.14 up  1.0 1.0
> > 15  nvme  1.0 osd.15 up  1.0 1.0
> > 16  nvme  1.0 osd.16 up  1.0 1.0
> > 17  nvme  1.0 osd.17 up  1.0 1.0
> > 18  nvme  1.0 osd.18 up  1.0 1.0
> > 19  nvme  1.0 osd.19 up  1.0 1.0
> > 20  nvme  1.0 osd.20 up  1.0 1.0
> > 21  nvme  1.0 osd.21 up  1.0 1.0
> > 22  nvme  1.0 osd.22 up  1.0 1.0
> > 23  nvme  1.0 osd.23 up  1.0 1.0
> > -4   12.0 node cpn04
> > 36  nvme  1.0 osd.36 up  1.0 1.0
> > 37  nvme  1.0 osd.37 up  1.0 1.0
> > 38  nvme  1.0 osd.38 up  1.0 1.0
> > 39  nvme  1.0 osd.39 up  1.0 1.0
> > 40  nvme  1.0 osd.40 up  1.0 1.0
> > 41  nvme  1.0 osd.41 up  1

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Wido den Hollander

> On 27 November 2017 at 14:14, German Anders wrote:
> 
> 
> Hi Jason,
> 
> We are using librbd (librbd1-0.80.5-9.el6.x86_64), ok I will change those
> parameters and see if that changes something
> 

0.80? Is that a typo? You should really use 12.2.1 on the client.
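
A quick way to confirm what the client actually has installed (package names
are distro-dependent examples):

rpm -q librbd1      # RHEL/CentOS
dpkg -l librbd1     # Debian/Ubuntu
ceph --version      # if the ceph CLI is present on the client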

Wido

> thanks a lot
> 
> best,
> 
> 
> *German*
> 
> 2017-11-27 10:09 GMT-03:00 Jason Dillaman :
> 
> > Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> > as well since per-message log gathering takes a large hit on small IO
> > performance.
> >
> > On Mon, Nov 27, 2017 at 8:02 AM, German Anders 
> > wrote:
> >
> >> Hi All,
> >>
> >> I've a performance question, we recently install a brand new Ceph cluster
> >> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> >> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> >> for the front-end we are using a bonding config with active/active (20GbE)
> >> to communicate with the clients.
> >>
> >> The cluster configuration is the following:
> >>
> >> *MON Nodes:*
> >> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> >> 3x 1U servers:
> >>   2x Intel Xeon E5-2630v4 @2.2Ghz
> >>   128G RAM
> >>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >>
> >> *OSD Nodes:*
> >> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> >> 4x 2U servers:
> >>   2x Intel Xeon E5-2640v4 @2.4Ghz
> >>   128G RAM
> >>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >>   1x Ethernet Controller 10G X550T
> >>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> >>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
> >>
> >>
> >> Here's the tree:
> >>
> >> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> >> -7   48.0 root root
> >> -5   24.0 rack rack1
> >> -1   12.0 node cpn01
> >>  0  nvme  1.0 osd.0  up  1.0 1.0
> >>  1  nvme  1.0 osd.1  up  1.0 1.0
> >>  2  nvme  1.0 osd.2  up  1.0 1.0
> >>  3  nvme  1.0 osd.3  up  1.0 1.0
> >>  4  nvme  1.0 osd.4  up  1.0 1.0
> >>  5  nvme  1.0 osd.5  up  1.0 1.0
> >>  6  nvme  1.0 osd.6  up  1.0 1.0
> >>  7  nvme  1.0 osd.7  up  1.0 1.0
> >>  8  nvme  1.0 osd.8  up  1.0 1.0
> >>  9  nvme  1.0 osd.9  up  1.0 1.0
> >> 10  nvme  1.0 osd.10 up  1.0 1.0
> >> 11  nvme  1.0 osd.11 up  1.0 1.0
> >> -3   12.0 node cpn03
> >> 24  nvme  1.0 osd.24 up  1.0 1.0
> >> 25  nvme  1.0 osd.25 up  1.0 1.0
> >> 26  nvme  1.0 osd.26 up  1.0 1.0
> >> 27  nvme  1.0 osd.27 up  1.0 1.0
> >> 28  nvme  1.0 osd.28 up  1.0 1.0
> >> 29  nvme  1.0 osd.29 up  1.0 1.0
> >> 30  nvme  1.0 osd.30 up  1.0 1.0
> >> 31  nvme  1.0 osd.31 up  1.0 1.0
> >> 32  nvme  1.0 osd.32 up  1.0 1.0
> >> 33  nvme  1.0 osd.33 up  1.0 1.0
> >> 34  nvme  1.0 osd.34 up  1.0 1.0
> >> 35  nvme  1.0 osd.35 up  1.0 1.0
> >> -6   24.0 rack rack2
> >> -2   12.0 node cpn02
> >> 12  nvme  1.0 osd.12 up  1.0 1.0
> >> 13  nvme  1.0 osd.13 up  1.0 1.0
> >> 14  nvme  1.0 osd.14 up  1.0 1.0
> >> 15  nvme  1.0 osd.15 up  1.0 1.0
> >> 16  nvme  1.0 osd.16 up  1.0 1.0
> >> 17  nvme  1.0 osd.17 up  1.0 1.0
> >> 18  nvme  1.0 osd.18 up  1.0 1.0
> >> 19  nvme  1.0 osd.19 up  1.0 1.0
> >> 20  nvme  1.0 osd.20 up  1.0 1.0
> >> 21  nvme  1.0 osd.21 up  1.0 1.0
> >> 22  nvme  1.0 osd.22 up  1.0 1.0
> >> 23  nvme  1.0 osd.23 up  1.0 1.0
> >> -4   12.0 node cpn04
> >> 36  nvme  1.0 osd.36 up  1.0 1.0
> >> 37  nvme  1.0 osd.37 up  1.0 1.0
> >> 38  nvme  1.0 osd.38 up  1.0 1.0
> >> 39  nvme  1.0 osd.39 up  1.0 1.0
> >> 40  nvme  1.0 osd.40 up  1.0 1.0
> >> 41  nvme  1.0 osd.41 up  1.0 1.0
> >> 42  nvme  1.0 osd.42 up  1.0 1.0
> >> 43  nvme  1.0 osd.43 up  1.0 1.0
>

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Wido den Hollander

> On 27 November 2017 at 14:02, German Anders wrote:
> 
> 
> Hi All,
> 
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
> 
> The cluster configuration is the following:
> 
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> 
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   1x Ethernet Controller 10G X550T
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
> 
> 
> Here's the tree:
> 
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -7   48.0 root root
> -5   24.0 rack rack1
> -1   12.0 node cpn01
>  0  nvme  1.0 osd.0  up  1.0 1.0
>  1  nvme  1.0 osd.1  up  1.0 1.0
>  2  nvme  1.0 osd.2  up  1.0 1.0
>  3  nvme  1.0 osd.3  up  1.0 1.0
>  4  nvme  1.0 osd.4  up  1.0 1.0
>  5  nvme  1.0 osd.5  up  1.0 1.0
>  6  nvme  1.0 osd.6  up  1.0 1.0
>  7  nvme  1.0 osd.7  up  1.0 1.0
>  8  nvme  1.0 osd.8  up  1.0 1.0
>  9  nvme  1.0 osd.9  up  1.0 1.0
> 10  nvme  1.0 osd.10 up  1.0 1.0
> 11  nvme  1.0 osd.11 up  1.0 1.0
> -3   12.0 node cpn03
> 24  nvme  1.0 osd.24 up  1.0 1.0
> 25  nvme  1.0 osd.25 up  1.0 1.0
> 26  nvme  1.0 osd.26 up  1.0 1.0
> 27  nvme  1.0 osd.27 up  1.0 1.0
> 28  nvme  1.0 osd.28 up  1.0 1.0
> 29  nvme  1.0 osd.29 up  1.0 1.0
> 30  nvme  1.0 osd.30 up  1.0 1.0
> 31  nvme  1.0 osd.31 up  1.0 1.0
> 32  nvme  1.0 osd.32 up  1.0 1.0
> 33  nvme  1.0 osd.33 up  1.0 1.0
> 34  nvme  1.0 osd.34 up  1.0 1.0
> 35  nvme  1.0 osd.35 up  1.0 1.0
> -6   24.0 rack rack2
> -2   12.0 node cpn02
> 12  nvme  1.0 osd.12 up  1.0 1.0
> 13  nvme  1.0 osd.13 up  1.0 1.0
> 14  nvme  1.0 osd.14 up  1.0 1.0
> 15  nvme  1.0 osd.15 up  1.0 1.0
> 16  nvme  1.0 osd.16 up  1.0 1.0
> 17  nvme  1.0 osd.17 up  1.0 1.0
> 18  nvme  1.0 osd.18 up  1.0 1.0
> 19  nvme  1.0 osd.19 up  1.0 1.0
> 20  nvme  1.0 osd.20 up  1.0 1.0
> 21  nvme  1.0 osd.21 up  1.0 1.0
> 22  nvme  1.0 osd.22 up  1.0 1.0
> 23  nvme  1.0 osd.23 up  1.0 1.0
> -4   12.0 node cpn04
> 36  nvme  1.0 osd.36 up  1.0 1.0
> 37  nvme  1.0 osd.37 up  1.0 1.0
> 38  nvme  1.0 osd.38 up  1.0 1.0
> 39  nvme  1.0 osd.39 up  1.0 1.0
> 40  nvme  1.0 osd.40 up  1.0 1.0
> 41  nvme  1.0 osd.41 up  1.0 1.0
> 42  nvme  1.0 osd.42 up  1.0 1.0
> 43  nvme  1.0 osd.43 up  1.0 1.0
> 44  nvme  1.0 osd.44 up  1.0 1.0
> 45  nvme  1.0 osd.45 up  1.0 1.0
> 46  nvme  1.0 osd.46 up  1.0 1.0
> 47  nvme  1.0 osd.47 up  1.0 1.0
> 
> The disk partition of one of the OSD nodes:
> 
> NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> nvme6n1259:10   1.1T  0 disk
> ├─nvme6n1p2259:15   0   1.1T  0 part
> └─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
> nvme9n1259:00   1.1T  0 disk
> ├─nvme9n1p2259:80   1.1T  0 part
> └─nvme9n1p1259:70   100M  0 part  /var/lib/ceph/osd/ceph-9
> sdb  8:16   0 139.8G  0 disk
> └─sdb1

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Jason,

We are using librbd (librbd1-0.80.5-9.el6.x86_64). OK, I will change those
parameters and see if that changes anything.

thanks a lot

best,


*German*

2017-11-27 10:09 GMT-03:00 Jason Dillaman :

> Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> as well since per-message log gathering takes a large hit on small IO
> performance.
>
> On Mon, Nov 27, 2017 at 8:02 AM, German Anders 
> wrote:
>
>> Hi All,
>>
>> I've a performance question, we recently install a brand new Ceph cluster
>> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>> The back-end of the cluster is using a bond IPoIB (active/passive) , and
>> for the front-end we are using a bonding config with active/active (20GbE)
>> to communicate with the clients.
>>
>> The cluster configuration is the following:
>>
>> *MON Nodes:*
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 3x 1U servers:
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>> *OSD Nodes:*
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 4x 2U servers:
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   1x Ethernet Controller 10G X550T
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>>
>> Here's the tree:
>>
>> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>> -7   48.0 root root
>> -5   24.0 rack rack1
>> -1   12.0 node cpn01
>>  0  nvme  1.0 osd.0  up  1.0 1.0
>>  1  nvme  1.0 osd.1  up  1.0 1.0
>>  2  nvme  1.0 osd.2  up  1.0 1.0
>>  3  nvme  1.0 osd.3  up  1.0 1.0
>>  4  nvme  1.0 osd.4  up  1.0 1.0
>>  5  nvme  1.0 osd.5  up  1.0 1.0
>>  6  nvme  1.0 osd.6  up  1.0 1.0
>>  7  nvme  1.0 osd.7  up  1.0 1.0
>>  8  nvme  1.0 osd.8  up  1.0 1.0
>>  9  nvme  1.0 osd.9  up  1.0 1.0
>> 10  nvme  1.0 osd.10 up  1.0 1.0
>> 11  nvme  1.0 osd.11 up  1.0 1.0
>> -3   12.0 node cpn03
>> 24  nvme  1.0 osd.24 up  1.0 1.0
>> 25  nvme  1.0 osd.25 up  1.0 1.0
>> 26  nvme  1.0 osd.26 up  1.0 1.0
>> 27  nvme  1.0 osd.27 up  1.0 1.0
>> 28  nvme  1.0 osd.28 up  1.0 1.0
>> 29  nvme  1.0 osd.29 up  1.0 1.0
>> 30  nvme  1.0 osd.30 up  1.0 1.0
>> 31  nvme  1.0 osd.31 up  1.0 1.0
>> 32  nvme  1.0 osd.32 up  1.0 1.0
>> 33  nvme  1.0 osd.33 up  1.0 1.0
>> 34  nvme  1.0 osd.34 up  1.0 1.0
>> 35  nvme  1.0 osd.35 up  1.0 1.0
>> -6   24.0 rack rack2
>> -2   12.0 node cpn02
>> 12  nvme  1.0 osd.12 up  1.0 1.0
>> 13  nvme  1.0 osd.13 up  1.0 1.0
>> 14  nvme  1.0 osd.14 up  1.0 1.0
>> 15  nvme  1.0 osd.15 up  1.0 1.0
>> 16  nvme  1.0 osd.16 up  1.0 1.0
>> 17  nvme  1.0 osd.17 up  1.0 1.0
>> 18  nvme  1.0 osd.18 up  1.0 1.0
>> 19  nvme  1.0 osd.19 up  1.0 1.0
>> 20  nvme  1.0 osd.20 up  1.0 1.0
>> 21  nvme  1.0 osd.21 up  1.0 1.0
>> 22  nvme  1.0 osd.22 up  1.0 1.0
>> 23  nvme  1.0 osd.23 up  1.0 1.0
>> -4   12.0 node cpn04
>> 36  nvme  1.0 osd.36 up  1.0 1.0
>> 37  nvme  1.0 osd.37 up  1.0 1.0
>> 38  nvme  1.0 osd.38 up  1.0 1.0
>> 39  nvme  1.0 osd.39 up  1.0 1.0
>> 40  nvme  1.0 osd.40 up  1.0 1.0
>> 41  nvme  1.0 osd.41 up  1.0 1.0
>> 42  nvme  1.0 osd.42 up  1.0 1.0
>> 43  nvme  1.0 osd.43 up  1.0 1.0
>> 44  nvme  1.0 osd.44 up  1.0 1.0
>> 45  nvme  1.0 osd.45 up  1.0 1.0
>> 46  nvme  1.0 osd.46 up  1.0 1.0
>> 47  nvme  1.0 osd.47 up  1.0 1.0
>>
>> The disk partition of one of the OSD nodes:
>>
>> NAME   MAJ:MIN RM  

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread Jason Dillaman
Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
as well, since per-message log gathering takes a large toll on small-IO
performance.
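
A minimal client-side ceph.conf sketch for that (the extra debug_* lines are
optional settings I'd also consider, not something you must set):

[client]
debug_ms = 0/0
# optionally quiet other chatty client-side subsystems too
debug_rbd = 0/0
debug_objecter = 0/0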

On Mon, Nov 27, 2017 at 8:02 AM, German Anders  wrote:

> Hi All,
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   1x Ethernet Controller 10G X550T
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
> Here's the tree:
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -7   48.0 root root
> -5   24.0 rack rack1
> -1   12.0 node cpn01
>  0  nvme  1.0 osd.0  up  1.0 1.0
>  1  nvme  1.0 osd.1  up  1.0 1.0
>  2  nvme  1.0 osd.2  up  1.0 1.0
>  3  nvme  1.0 osd.3  up  1.0 1.0
>  4  nvme  1.0 osd.4  up  1.0 1.0
>  5  nvme  1.0 osd.5  up  1.0 1.0
>  6  nvme  1.0 osd.6  up  1.0 1.0
>  7  nvme  1.0 osd.7  up  1.0 1.0
>  8  nvme  1.0 osd.8  up  1.0 1.0
>  9  nvme  1.0 osd.9  up  1.0 1.0
> 10  nvme  1.0 osd.10 up  1.0 1.0
> 11  nvme  1.0 osd.11 up  1.0 1.0
> -3   12.0 node cpn03
> 24  nvme  1.0 osd.24 up  1.0 1.0
> 25  nvme  1.0 osd.25 up  1.0 1.0
> 26  nvme  1.0 osd.26 up  1.0 1.0
> 27  nvme  1.0 osd.27 up  1.0 1.0
> 28  nvme  1.0 osd.28 up  1.0 1.0
> 29  nvme  1.0 osd.29 up  1.0 1.0
> 30  nvme  1.0 osd.30 up  1.0 1.0
> 31  nvme  1.0 osd.31 up  1.0 1.0
> 32  nvme  1.0 osd.32 up  1.0 1.0
> 33  nvme  1.0 osd.33 up  1.0 1.0
> 34  nvme  1.0 osd.34 up  1.0 1.0
> 35  nvme  1.0 osd.35 up  1.0 1.0
> -6   24.0 rack rack2
> -2   12.0 node cpn02
> 12  nvme  1.0 osd.12 up  1.0 1.0
> 13  nvme  1.0 osd.13 up  1.0 1.0
> 14  nvme  1.0 osd.14 up  1.0 1.0
> 15  nvme  1.0 osd.15 up  1.0 1.0
> 16  nvme  1.0 osd.16 up  1.0 1.0
> 17  nvme  1.0 osd.17 up  1.0 1.0
> 18  nvme  1.0 osd.18 up  1.0 1.0
> 19  nvme  1.0 osd.19 up  1.0 1.0
> 20  nvme  1.0 osd.20 up  1.0 1.0
> 21  nvme  1.0 osd.21 up  1.0 1.0
> 22  nvme  1.0 osd.22 up  1.0 1.0
> 23  nvme  1.0 osd.23 up  1.0 1.0
> -4   12.0 node cpn04
> 36  nvme  1.0 osd.36 up  1.0 1.0
> 37  nvme  1.0 osd.37 up  1.0 1.0
> 38  nvme  1.0 osd.38 up  1.0 1.0
> 39  nvme  1.0 osd.39 up  1.0 1.0
> 40  nvme  1.0 osd.40 up  1.0 1.0
> 41  nvme  1.0 osd.41 up  1.0 1.0
> 42  nvme  1.0 osd.42 up  1.0 1.0
> 43  nvme  1.0 osd.43 up  1.0 1.0
> 44  nvme  1.0 osd.44 up  1.0 1.0
> 45  nvme  1.0 osd.45 up  1.0 1.0
> 46  nvme  1.0 osd.46 up  1.0 1.0
> 47  nvme  1.0 osd.47 up  1.0 1.0
>
> The disk partition of one of the OSD nodes:
>
> NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> nvme6n1259:10   1.1T  0 disk
> ├─nvme6n1p2259:15   0   1.1T  0 part
> └─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
> nvme9n1259:00   1.1T  0 disk
> ├─nvme9n1p2259:80   1.1T  0 part
> └─n

[ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi All,

I have a performance question. We recently installed a brand new Ceph cluster
with all-NVMe disks, using ceph version 12.2.0 with BlueStore configured.
The back-end of the cluster is using a bonded IPoIB link (active/passive), and
for the front-end we are using a bonding config with active/active (20GbE)
to communicate with the clients.

The cluster configuration is the following:

*MON Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

*OSD Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)


Here's the tree:

ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-7   48.0 root root
-5   24.0 rack rack1
-1   12.0 node cpn01
 0  nvme  1.0 osd.0  up  1.0 1.0
 1  nvme  1.0 osd.1  up  1.0 1.0
 2  nvme  1.0 osd.2  up  1.0 1.0
 3  nvme  1.0 osd.3  up  1.0 1.0
 4  nvme  1.0 osd.4  up  1.0 1.0
 5  nvme  1.0 osd.5  up  1.0 1.0
 6  nvme  1.0 osd.6  up  1.0 1.0
 7  nvme  1.0 osd.7  up  1.0 1.0
 8  nvme  1.0 osd.8  up  1.0 1.0
 9  nvme  1.0 osd.9  up  1.0 1.0
10  nvme  1.0 osd.10 up  1.0 1.0
11  nvme  1.0 osd.11 up  1.0 1.0
-3   12.0 node cpn03
24  nvme  1.0 osd.24 up  1.0 1.0
25  nvme  1.0 osd.25 up  1.0 1.0
26  nvme  1.0 osd.26 up  1.0 1.0
27  nvme  1.0 osd.27 up  1.0 1.0
28  nvme  1.0 osd.28 up  1.0 1.0
29  nvme  1.0 osd.29 up  1.0 1.0
30  nvme  1.0 osd.30 up  1.0 1.0
31  nvme  1.0 osd.31 up  1.0 1.0
32  nvme  1.0 osd.32 up  1.0 1.0
33  nvme  1.0 osd.33 up  1.0 1.0
34  nvme  1.0 osd.34 up  1.0 1.0
35  nvme  1.0 osd.35 up  1.0 1.0
-6   24.0 rack rack2
-2   12.0 node cpn02
12  nvme  1.0 osd.12 up  1.0 1.0
13  nvme  1.0 osd.13 up  1.0 1.0
14  nvme  1.0 osd.14 up  1.0 1.0
15  nvme  1.0 osd.15 up  1.0 1.0
16  nvme  1.0 osd.16 up  1.0 1.0
17  nvme  1.0 osd.17 up  1.0 1.0
18  nvme  1.0 osd.18 up  1.0 1.0
19  nvme  1.0 osd.19 up  1.0 1.0
20  nvme  1.0 osd.20 up  1.0 1.0
21  nvme  1.0 osd.21 up  1.0 1.0
22  nvme  1.0 osd.22 up  1.0 1.0
23  nvme  1.0 osd.23 up  1.0 1.0
-4   12.0 node cpn04
36  nvme  1.0 osd.36 up  1.0 1.0
37  nvme  1.0 osd.37 up  1.0 1.0
38  nvme  1.0 osd.38 up  1.0 1.0
39  nvme  1.0 osd.39 up  1.0 1.0
40  nvme  1.0 osd.40 up  1.0 1.0
41  nvme  1.0 osd.41 up  1.0 1.0
42  nvme  1.0 osd.42 up  1.0 1.0
43  nvme  1.0 osd.43 up  1.0 1.0
44  nvme  1.0 osd.44 up  1.0 1.0
45  nvme  1.0 osd.45 up  1.0 1.0
46  nvme  1.0 osd.46 up  1.0 1.0
47  nvme  1.0 osd.47 up  1.0 1.0

The disk partition of one of the OSD nodes:

NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1259:10   1.1T  0 disk
├─nvme6n1p2259:15   0   1.1T  0 part
└─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1259:00   1.1T  0 disk
├─nvme9n1p2259:80   1.1T  0 part
└─nvme9n1p1259:70   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb  8:16   0 139.8G  0 disk
└─sdb1   8:17   0 139.8G  0 part
  └─md0  9:00 139.6G  0 raid1
├─md0p2259:31   0 1K  0 md
├─md0p5259:32   0 139.1G  0 md
│ ├─cpn01--vg-swap 253:10  27.4G  0 lvm   [SWAP]
│ └─cpn01--vg-root 253:0