Re: [ceph-users] ceph all-nvme mysql performance tuning
I got an error on this:

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-host=127.0.0.1 --mysql-port=33033 --mysql-user=sysbench --mysql-password=password --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20 --threads=10 --rand-type=uniform --rand-init=on cleanup

Unknown option: --oltp_tables_count.

Usage:
  sysbench [general-options]... --test= [test-options]... command

General options:
  --num-threads=N             number of threads to use [1]
  --max-requests=N            limit for total number of requests [1]
  --max-time=N                limit for total execution time in seconds [0]
  --forced-shutdown=STRING    amount of time to wait after --max-time before forcing shutdown [off]
  --thread-stack-size=SIZE    size of stack per thread [32K]
  --init-rng=[on|off]         initialize random number generator [off]
  --test=STRING               test to run
  --debug=[on|off]            print more debugging info [off]
  --validate=[on|off]         perform validation checks where possible [off]
  --help=[on|off]             print help and exit
  --version=[on|off]          print version and exit

Compiled-in tests:
  fileio  - File I/O test
  cpu     - CPU performance test
  memory  - Memory functions speed test
  threads - Threads subsystem performance test
  mutex   - Mutex performance test
  oltp    - OLTP test

Commands: prepare run cleanup help version

See 'sysbench --test= help' for a list of options for each test.

but I have these:

echo "Performing test SQ-${thread}T-${run}"
sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 --mysql-db=sysbench --mysql-user=sysbench --mysql-password=password --max-time=60 --max-requests=0 --num-threads=${thread} run > /root/SQ-${thread}T-${run}

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 127.0.0.1
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[mysql]

[isamchk]
key_buffer = 16M

!includedir /etc/mysql/conf.d/

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
Done.

OLTP test statistics:
    queries performed:
        read:                            84126
        write:                           30045
        other:                           12018
        total:                           126189
    transactions:                        6009   (100.14 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 114171 (1902.71 per sec.)
    other operations:                    12018  (200.28 per sec.)

Test execution summary:
    total time:                          60.0045s
    total number of events:              6009
    total time taken by event execution: 59.9812
    per-request statistics:
        min:                             4.47ms
        avg:                             9.98ms
        max:                             91.38ms
        approx. 95 percentile:           19.44ms

Threads fairness:
    events (avg/stddev):                 6009./0.00
    execution time (avg/stddev):         59.9812/0.00

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting... (last message repeated 3 times)
Done.

OLTP test statistics:
    queries performed:
        read:                            372036
        write:                           132870
        other:                           53148
        total:                           558054
    transactions:                        26574  (442.84 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write
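(A note on the error above: sysbench 0.4.12 only knows its compiled-in tests, so it rejects the lua scripts and options such as --oltp_tables_count and --threads, which need sysbench 0.5/1.0 or newer. A minimal sketch of German's three phases against 0.4.12's built-in oltp test — single table only, since 0.4.12 cannot create several; host, port and sizes are copied from the commands above:

# prepare / run / cleanup on sysbench 0.4.12 (no lua scripts available)
sysbench --test=oltp --db-driver=mysql --mysql-host=127.0.0.1 --mysql-port=33033 \
    --mysql-user=sysbench --mysql-password=password --mysql-db=sysbench \
    --mysql-table-engine=innodb --oltp-table-size=4000 prepare
sysbench --test=oltp --db-driver=mysql --mysql-host=127.0.0.1 --mysql-port=33033 \
    --mysql-user=sysbench --mysql-password=password --mysql-db=sysbench \
    --oltp-test-mode=complex --max-time=120 --max-requests=0 --num-threads=20 run
sysbench --test=oltp --db-driver=mysql --mysql-host=127.0.0.1 --mysql-port=33033 \
    --mysql-user=sysbench --mysql-password=password --mysql-db=sysbench cleanup

Note the older --num-threads/--max-time spellings in place of the newer --threads/--time.)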
Re: [ceph-users] ceph all-nvme mysql performance tuning
Could anyone run the tests and share some results? Thanks in advance,

Best,

*German*

2017-11-30 14:25 GMT-03:00 German Anders :

> That's correct, IPoIB for the backend (already configured the irq
> affinity), and 10GbE on the frontend. [snip]
>
> Would be really helpful if someone could run the following sysbench test
> on a mysql db so I could make some comparisons: [snip]
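(For pulling out the three numbers German asks for, the oltp.lua run in his message below writes result_sysbench_perf_test.out; with the sysbench 1.0 output format something like this should do — a sketch:

grep -E 'transactions:|queries:' result_sysbench_perf_test.out   # tps and qps, the "(... per sec.)" values
grep '95th' result_sysbench_perf_test.out                        # 95th percentile latency
)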
Re: [ceph-users] ceph all-nvme mysql performance tuning
That's correct, IPoIB for the backend (already configured the irq affinity), and 10GbE on the frontend. I would love to try RDMA but, like you said, it's not stable for production, so I think I'll have to wait for that. Yeah, the thing is that it's not my decision to go for 50GbE or 100GbE... :( so.. 10GbE for the front-end it will be...

Would be really helpful if someone could run the following sysbench test on a mysql db so I could make some comparisons:

*my.cnf* configuration file:

[mysqld_safe]
nice = 0
pid-file = /home/test_db/mysql/mysql.pid

[client]
port = 33033
socket = /home/test_db/mysql/mysql.sock

[mysqld]
user = test_db
port = 33033
socket = /home/test_db/mysql/mysql.sock
pid-file = /home/test_db/mysql/mysql.pid
log-error = /home/test_db/mysql/mysql.err
datadir = /home/test_db/mysql/data
tmpdir = /tmp
server-id = 1

# ** Binlogging **
#log-bin = /home/test_db/mysql/binlog/mysql-bin
#log_bin_index = /home/test_db/mysql/binlog/mysql-bin.index
expire_logs_days = 1
max_binlog_size = 512MB

thread_handling = pool-of-threads
thread_pool_max_threads = 300

# ** Slow query log **
slow_query_log = 1
slow_query_log_file = /home/test_db/mysql/mysql-slow.log
long_query_time = 10
log_output = FILE
log_slow_slave_statements = 1
log_slow_verbosity = query_plan,innodb,explain

# ** INNODB Specific options **
transaction_isolation = READ-COMMITTED
innodb_buffer_pool_size = 12G
innodb_data_file_path = ibdata1:256M:autoextend
innodb_thread_concurrency = 16
innodb_log_file_size = 256M
innodb_log_files_in_group = 3
innodb_file_per_table
innodb_log_buffer_size = 16M
innodb_stats_on_metadata = 0
innodb_lock_wait_timeout = 30
# innodb_flush_method = O_DSYNC
innodb_flush_method = O_DIRECT
max_connections = 1
max_connect_errors = 99
max_allowed_packet = 128M
skip-host-cache
skip-name-resolve
explicit_defaults_for_timestamp = 1
performance_schema = OFF
log_warnings = 2
event_scheduler = ON

# ** Specific Galera Cluster Settings **
binlog_format = ROW
default-storage-engine = innodb
query_cache_size = 0
query_cache_type = 0

Volume is just an RBD (on a RF=3 pool) with the default 22 bit order, mounted on */home/test_db/mysql/data*

commands for the test:

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-host= --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20 --threads=10 --rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-host= --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20 --threads=10 --rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host= --mysql-port=33033 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex --oltp-read-only=off --oltp-table-size=20 --threads=20 --rand-type=uniform --rand-init=on --time=120 run > result_sysbench_perf_test.out 2>/dev/null

I'm looking for tps, qps and the 95th percentile; could anyone with an all-nvme cluster run the test and share the results? I would really appreciate the help :)

Thanks in advance,

Best,

*German*

2017-11-29 19:14 GMT-03:00 Zoltan Arnold Nagy :

> On 2017-11-27 14:02, German Anders wrote:
>
>> 4x 2U servers:
>> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
> so I assume you are using IPoIB as the cluster network for the
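(On the "default 22 bit order" above: order 22 means 4 MiB RBD objects. One knob worth testing for a database workload is a smaller object size at image creation — a sketch, assuming a luminous-era rbd CLI; pool and image names are placeholders:

rbd create dbpool/mysql-data --size 200G --object-size 1M   # order 20 instead of the default 22
rbd info dbpool/mysql-data

On ceph 12.2 the older --order 20 spelling should also still be accepted.)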
Re: [ceph-users] ceph all-nvme mysql performance tuning
On 2017-11-27 14:02, German Anders wrote:

> 4x 2U servers:
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

so I assume you are using IPoIB as the cluster network for the replication...

> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans

... and the 10GbE network for the front-end network?

At 4k writes your network latency will be very high (see the flame graphs in the Intel NVMe presentation from the Boston OpenStack Summit - not sure if there is a newer deck that somebody could link ;)) and the time will be spent in the kernel. You could give RDMAMessenger a try, but it's not stable at the current LTS release.

If I were you I'd be looking at 100GbE - we've recently pulled in a bunch of 100GbE links and it's been wonderful to see 100+GB/s going over the network for just storage.

Some people suggested mounting multiple RBD volumes - unless I'm mistaken, and you're using very recent qemu/libvirt combinations with the proper libvirt disk settings, all IO will still be single threaded towards librbd and thus not give any speedup.
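(For reference, "giving RDMAMessenger a try" on luminous is a ceph.conf switch — a sketch only, since as Zoltan says it isn't considered production-stable; the option names follow the luminous-era RDMA guides and the device name is an assumption for a ConnectX-3 port:

[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0
)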
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi German,

I would personally prefer to use rados bench or fio, which are more common for benchmarking the cluster first, and then later do the mysql-specific tests using sysbench. Another thing is to run the client test simultaneously on more than 1 machine and aggregate/add the performance numbers of each; the limitation can be caused by client-side resources, which could be stressed differently based on the different storage backends you tried.

Maged

On 2017-11-28 21:20, German Anders wrote:

> Don't know if there are any statistics available really, but I'm running some
> sysbench tests with mysql before the changes, and the idea is to run those
> tests again after the 'tuning' and see if the numbers get better in any way
>
> [snip]
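(For the cluster-level baseline Maged suggests, these are the usual starting points — a sketch; pool and image names are placeholders:

# raw RADOS 4K write latency/IOPS from one client
rados bench -p testpool 60 write -b 4096 -t 16

# 4K random writes through librbd with fio's rbd engine
fio --name=rbd-randwrite --ioengine=rbd --clientname=admin --pool=testpool \
    --rbdname=testimg --rw=randwrite --bs=4k --iodepth=32 --direct=1 \
    --runtime=60 --time_based

Running the same job from two or three client machines at once and adding the numbers, as Maged describes, shows whether a single client is the bottleneck.)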
Re: [ceph-users] ceph all-nvme mysql performance tuning
Don't know if there are any statistics available really, but I'm running some sysbench tests with mysql before the changes, and the idea is to run those tests again after the 'tuning' and see if the numbers get better in any way. I'm also gathering numbers from some collectd and statsd collectors running on the osd nodes, so I hope to get some info about that :)

*German*

2017-11-28 16:12 GMT-03:00 Marc Roos :

> I was wondering if there are any statistics available that show the
> performance increase of doing such things?
>
> [snip]
Re: [ceph-users] ceph all-nvme mysql performance tuning
I was wondering if there are any statistics available that show the performance increase of doing such things?

-----Original Message-----
From: German Anders [mailto:gand...@despegar.com]
Sent: dinsdag 28 november 2017 19:34
To: Luis Periquito
Cc: ceph-users
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Thanks a lot Luis. I agree with you regarding the CPUs, but unfortunately those were the best CPU model that we could afford :S

[snip]
Re: [ceph-users] ceph all-nvme mysql performance tuning
Thanks a lot Luis. I agree with you regarding the CPUs, but unfortunately those were the best CPU model that we could afford :S

For the NUMA part, I managed to pin the OSDs by changing the /usr/lib/systemd/system/ceph-osd@.service file and adding the CPUAffinity list to it. But this pins ALL the OSDs to specific nodes or a specific CPU list; I can't find a way to specify a list for only a specific subset of OSDs.

Also, I noticed that the NVMe disks are all on the same node (since I'm using half of the shelf - the other half will be pinned to the other node), so the lanes of the NVMe disks are all on the same CPU (in this case 0). I also found that the IB adapter that is mapped to the OSD network (osd replication) is pinned to CPU 1, so this traffic will cross the QPI path.

And for the memory, from the other email, we are already using the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 134217728.

For now I can pin all the current OSDs to CPU 0, but in the near future, when I add more nvme disks to the OSD nodes, I'll definitely need to pin the other half of the OSDs to CPU 1. Has someone already done this?

Thanks a lot,

Best,

*German*

2017-11-28 6:36 GMT-03:00 Luis Periquito :

> There are a few things I don't like about your machines... If you want
> latency/IOPS (as you seemingly do) you really want the highest-frequency
> CPUs, even over number of cores.
>
> [snip]
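(On German's question about pinning only a subset of OSDs: systemd lets you override a single instance of a templated unit with a per-instance drop-in, so each OSD can get its own CPUAffinity. A sketch — the core ranges are placeholders for whatever numactl --hardware reports for each socket:

# pin osd.0 to NUMA node 0's cores; repeat per OSD id with the right range
mkdir -p /etc/systemd/system/ceph-osd@0.service.d
cat > /etc/systemd/system/ceph-osd@0.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0-9 20-29
EOF
systemctl daemon-reload
systemctl restart ceph-osd@0
)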
Re: [ceph-users] ceph all-nvme mysql performance tuning
There are a few things I don't like about your machines... If you want latency/IOPS (as you seemingly do) you really want the highest-frequency CPUs, even over number of cores. These are not too bad, but not great either.

Also you have 2x CPUs, meaning NUMA. Have you pinned OSDs to NUMA nodes? Ideally each OSD is pinned to the same NUMA node the NVMe device is connected to. Each NVMe device will be running on PCIe lanes provided by one of the CPUs...

What versions of TCMalloc (or jemalloc) are you running? Have you tuned them to have a bigger cache?

These are from what I've learned using filestore - I've yet to run full tests on bluestore - but they should still apply...

On Mon, Nov 27, 2017 at 5:10 PM, German Anders wrote:

> Hi Nick,
>
> yeah, we are using the same nvme disk with an additional partition to use
> as journal/wal.
>
> [snip]
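(On Luis's TCMalloc question: on Ubuntu the thread-cache size is typically raised through the environment file the ceph systemd units read, which is where the value German mentions above would live — a sketch:

# /etc/default/ceph - environment read by the ceph units on Debian/Ubuntu
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MiB thread cache

followed by systemctl restart ceph-osd.target to pick it up.)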
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi Nick,

yeah, we are using the same nvme disk with an additional partition to use as journal/wal. We double-checked the c-states and they were not configured to use C1, so we changed that on all the osd nodes and mon nodes, and we're going to run some new tests and see how it goes. I'll get back as soon as we've got those tests running.

Thanks a lot,

Best,

*German*

2017-11-27 12:16 GMT-03:00 Nick Fisk :

> Just to echo a couple of comments. Ceph will always struggle to match the
> performance of a traditional array, for mainly 2 reasons.
>
> [snip]
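(For verifying the C-state change German describes, a couple of stock tools — a sketch; turbostat and cpupower come from the linux-tools packages on Ubuntu:

cpupower idle-info    # which C-states the idle driver may enter
turbostat             # actual per-core C-state residency while the test runs

# to hold cores at shallow C-states at boot, one common option is the
# kernel command line: intel_idle.max_cstate=1 processor.max_cstate=1
)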
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi David,

Thanks a lot for the response. In fact, we first tried not using any scheduler at all, but then we tried the kyber iosched and noticed a slight improvement in performance; that's why we actually keep it.

*German*

2017-11-27 13:48 GMT-03:00 David Byte :

> From the benchmarks I have seen and done myself, I'm not sure why you are
> using an i/o scheduler at all with NVMe.
>
> [snip]
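(For reference, switching or clearing the scheduler on an NVMe namespace under blk-mq is a one-liner per device — a sketch; nvme0n1 is a placeholder, and kyber needs kernel 4.12+ with the module available, which matches the 4.12.14 kernels in this thread:

cat /sys/block/nvme0n1/queue/scheduler            # e.g. [none] mq-deadline kyber
echo kyber > /sys/block/nvme0n1/queue/scheduler   # what German tested
echo none  > /sys/block/nvme0n1/queue/scheduler   # what David suggests
)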
Re: [ceph-users] ceph all-nvme mysql performance tuning
From the benchmarks I have seen and done myself, I'm not sure why you are using an i/o scheduler at all with NVMe. While there are a few cases where it may provide a slight benefit, simply having mq enabled with no scheduler seems to provide the best performance for an all-flash, especially all-NVMe, environment.

David Byte
Sr. Technology Strategist
SCE Enterprise Linux
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422

From: ceph-users on behalf of German Anders
Date: Monday, November 27, 2017 at 8:44 AM
To: Maged Mokhtar
Cc: ceph-users
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We tried with different numbers of threads and we're getting almost the same kind of difference between the storage types. Going to try different rbd stripe size and object size values and see if we get more competitive numbers. Will get back with more tests and param changes to see if we get better :)

Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar :

On 2017-11-27 15:02, German Anders wrote:

Hi All,

I've a performance question: we recently installed a brand new Ceph cluster with all-nvme disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster is using a bonded IPoIB (active/passive), and for the front-end we are using a bonding config with active/active (20GbE) to communicate with the clients.

The cluster configuration is the following:

MON Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
2x Intel Xeon E5-2630v4 @2.2Ghz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

OSD Nodes:
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
2x Intel Xeon E5-2640v4 @2.4Ghz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
1x Ethernet Controller 10G X550T
1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

Here's the tree:

ID CLASS WEIGHT TYPE NAME       STATUS REWEIGHT PRI-AFF
-7       48.0   root root
-5       24.0       rack rack1
-1       12.0           node cpn01
 0  nvme  1.0               osd.0   up  1.0  1.0
 1  nvme  1.0               osd.1   up  1.0  1.0
 2  nvme  1.0               osd.2   up  1.0  1.0
 3  nvme  1.0               osd.3   up  1.0  1.0
 4  nvme  1.0               osd.4   up  1.0  1.0
 5  nvme  1.0               osd.5   up  1.0  1.0
 6  nvme  1.0               osd.6   up  1.0  1.0
 7  nvme  1.0               osd.7   up  1.0  1.0
 8  nvme  1.0               osd.8   up  1.0  1.0
 9  nvme  1.0               osd.9   up  1.0  1.0
10  nvme  1.0               osd.10  up  1.0  1.0
11  nvme  1.0               osd.11  up  1.0  1.0
-3       12.0           node cpn03
24  nvme  1.0               osd.24  up  1.0  1.0
25  nvme  1.0               osd.25  up  1.0  1.0
26  nvme  1.0               osd.26  up  1.0  1.0
27  nvme  1.0               osd.27  up  1.0  1.0
28  nvme  1.0               osd.28  up  1.0  1.0
29  nvme  1.0               osd.29  up  1.0  1.0
30  nvme  1.0               osd.30  up  1.0  1.0
31  nvme  1.0               osd.31  up  1.0  1.0
32  nvme  1.0               osd.32  up  1.0  1.0
33  nvme  1.0               osd.33  up  1.0  1.0
34  nvme  1.0               osd.34  up  1.0  1.0
35  nvme  1.0               osd.35  up  1.0  1.0
-6       24.0       rack rack2
-2       12.0           node cpn02
12  nvme  1.0               osd.12  up  1.0  1.0
13  nvme  1.0               osd.13  up  1.0  1.0
14  nvme  1.0               osd.14  up  1.0  1.0
15  nvme  1.0               osd.15  up  1.0  1.0
16  nvme  1.0               osd.16  up  1.0  1.0
17  nvme  1.0               osd.17  up  1.0  1.0
18  nvme  1.0               osd.18  up  1.0  1.0
19  nvme  1.0               osd.19  up  1.0  1.0
20  nvme  1.0               osd.20  up  1.0  1.0
21  nvme  1.0               osd.21  up  1.0  1.0
22  nvme  1.0               osd.22  up  1.0  1.0
23  nvme  1.0               osd.23  up  1.0  1.0
-4       12.0           node cpn04
36  nvme  1.0               osd.36  up  1.0  1.0
37  nvme  1.0               osd.37  up  1.0  1.0
38  nvme  1.0               osd.38  up  1.0  1.0
39  nvme  1.0               osd.39  up  1.0  1.0
40  nvme  1.0               osd.40  up  1.0  1.0
41  nvme  1.0               osd.41  up  1.0
Re: [ceph-users] ceph all-nvme mysql performance tuning
Also, what tuned profile are you using? There is something to be gained by using a tuned profile that matches your workload.

On Mon, Nov 27, 2017 at 11:16 AM, Donny Davis wrote:

> Why not ask Red Hat? All the rest of the storage vendors you are looking
> at are not free.
>
> [snip]
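(For reference on tuned: checking and switching profiles is two commands, assuming the tuned package is installed; latency-performance or network-latency are the usual candidates for this kind of workload:

tuned-adm list                          # profiles shipped on the box
tuned-adm profile latency-performance   # switch profiles
tuned-adm active                        # confirm the active profile
)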
Re: [ceph-users] ceph all-nvme mysql performance tuning
Why not ask Red Hat? All the rest of the storage vendors you are looking at are not free.

Full disclosure, I am an employee at Red Hat.

On Mon, Nov 27, 2017 at 10:16 AM, Nick Fisk wrote:

> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
>
> [snip]
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi German,

We have similar config:

proxmox-ve: 5.1-27 (running kernel: 4.13.8-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.8-1-pve: 4.13.8-27
ceph: 12.2.1-pve3

system (4 nodes): Supermicro 2028U-TN24R4T+
2 port Mellanox connect x3pro 56Gbit
4 port intel 10GigE
memory: 768 GBytes
CPU: dual Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

ceph: 28 osds
24x Intel Nvme 2000GB Intel SSD DC P3520, 2,5", PCIe 3.0 x4
4x Intel Nvme 1,6TB Intel SSD DC P3700, 2,5", U.2 PCIe 3.0

Sysbench on container:

#!/bin/bash
sysbench --test=fileio --file-total-size=4G --file-num=64 prepare
for run in 1 2 3; do
  for thread in 1 4 8 16 32; do
    echo "Performing test RW-${thread}T-${run}"
    sysbench --test=fileio --file-total-size=4G --file-test-mode=rndwr --max-time=60 --max-requests=0 --file-block-size=4K --file-num=64 --num-threads=${thread} run > /root/RW-${thread}T-${run}
    echo "Performing test RR-${thread}T-${run}"
    sysbench --test=fileio --file-total-size=4G --file-test-mode=rndrd --max-time=60 --max-requests=0 --file-block-size=4K --file-num=64 --num-threads=${thread} run > /root/RR-${thread}T-${run}
    echo "Performing test SQ-${thread}T-${run}"
    sysbench --test=oltp --db-driver=mysql --oltp-table-size=4000 --mysql-db=sysbench --mysql-user=sysbench --mysql-password=password --max-time=60 --max-requests=0 --num-threads=${thread} run > /root/SQ-${thread}T-${run}
  done
done

grep transactions: S*

SQ-1T-1:  transactions: 6009  (100.14 per sec.)
SQ-1T-2:  transactions: 9458  (157.62 per sec.)
SQ-1T-3:  transactions: 9479  (157.97 per sec.)
SQ-4T-1:  transactions: 26574 (442.84 per sec.)
SQ-4T-2:  transactions: 28275 (471.20 per sec.)
SQ-4T-3:  transactions: 28067 (467.69 per sec.)
SQ-8T-1:  transactions: 44450 (740.78 per sec.)
SQ-8T-2:  transactions: 44410 (740.09 per sec.)
SQ-8T-3:  transactions: 44459 (740.93 per sec.)
SQ-16T-1: transactions: 59866 (997.59 per sec.)
SQ-16T-2: transactions: 59539 (991.99 per sec.)
SQ-16T-3: transactions: 59615 (993.50 per sec.)
SQ-32T-1: transactions: 71070 (1184.18 per sec.)
SQ-32T-2: transactions: 71007 (1183.14 per sec.)
SQ-32T-3: transactions: 71320 (1188.51 per sec.)

grep Requests/sec R*

RR-16T-1: 1464550.51 Requests/sec executed
RR-16T-2: 1473440.63 Requests/sec executed
RR-16T-3: 1515853.86 Requests/sec executed
RR-1T-1:  741333.28 Requests/sec executed
RR-1T-2:  693246.00 Requests/sec executed
RR-1T-3:  691166.38 Requests/sec executed
RR-32T-1: 1432609.74 Requests/sec executed
RR-32T-2: 1479191.78 Requests/sec executed
RR-32T-3: 1476780.11 Requests/sec executed
RR-4T-1:  1411168.95 Requests/sec executed
RR-4T-2:  1373557.99 Requests/sec executed
RR-4T-3:  1306820.18 Requests/sec executed
RR-8T-1:  1549924.57 Requests/sec executed
RR-8T-2:  1580304.14 Requests/sec executed
RR-8T-3:  1603842.56 Requests/sec executed
RW-16T-1: 12753.82 Requests/sec executed
RW-16T-2: 12394.93 Requests/sec executed
RW-16T-3: 12560.11 Requests/sec executed
RW-1T-1:  1344.99 Requests/sec executed
RW-1T-2:  1324.98 Requests/sec executed
RW-1T-3:  1306.64 Requests/sec executed
RW-32T-1: 16565.37 Requests/sec executed
RW-32T-2: 16497.67 Requests/sec executed
RW-32T-3: 16542.54 Requests/sec executed
RW-4T-1:  5099.07 Requests/sec executed
RW-4T-2:  4970.28 Requests/sec executed
RW-4T-3:  5121.44 Requests/sec executed
RW-8T-1:  8487.91 Requests/sec executed
RW-8T-2:  8632.96 Requests/sec executed
RW-8T-3:  8393.91 Requests/sec executed

Gerhard W. Recher
net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing
+49 171 4802507

Am 27.11.2017 um 14:02 schrieb German Anders:

> Hi All,
>
> I've a performance question: we recently installed a brand new Ceph
> cluster with all-nvme disks, using ceph version 12.2.0 with bluestore
> configured.
>
> [snip]
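(To line Gerhard's numbers up against other clusters, it can help to average the three runs per thread count; a small sketch over his SQ result files, assuming the sysbench 0.4.12 output format shown above:

for t in 1 4 8 16 32; do
  printf 'SQ %sT avg tps: ' "$t"
  grep -h 'transactions:' /root/SQ-${t}T-* \
    | sed 's/.*(\(.*\) per sec.*/\1/' \
    | awk '{s+=$1} END {printf "%.2f\n", s/NR}'
done
)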
Re: [ceph-users] ceph all-nvme mysql performance tuning
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German Anders
Sent: 27 November 2017 14:44
To: Maged Mokhtar
Cc: ceph-users
Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning

Hi Maged,

Thanks a lot for the response. We tried with different numbers of threads and we're getting almost the same kind of difference between the storage types. We're going to try different rbd stripe size and object size values and see if we get more competitive numbers. Will get back with more tests and param changes to see if we get better :)

Just to echo a couple of comments: Ceph will always struggle to match the performance of a traditional array, for mainly two reasons.

1. You are replacing some sort of dual-ported SAS or internally RDMA-connected device with a network for Ceph replication traffic. This instantly has a large impact on write latency.

2. Ceph locks at the PG level, and a PG will most likely cover at least one 4MB object, so lots of small accesses to the same blocks (on a block device) will wait on each other and effectively run at a single-threaded rate.

The best things you can do to mitigate these are to run the fastest journal/WAL devices you can, the fastest network connections (i.e. 25Gb/s), and to run your CPUs at max C- and P-states. You stated that you are running the performance profile on the CPUs. Could you also just double-check that the C-states are being held at C1(e)? There are a few utilities that can show this in real time (a couple are sketched below this message).

Other than that, although there could be some minor tweaks, you are probably nearing the limit of what you can hope to achieve.

Nick

Thanks,

Best,

German

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:

On 2017-11-27 15:02, German Anders wrote:

Hi All,

I've a performance question: we recently installed a brand new Ceph cluster with all-nvme disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster is using a bond IPoIB (active/passive), and for the front-end we are using a bonding config with active/active (20GbE) to communicate with the clients.
[...]
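As a concrete illustration of Nick's C-state check, here is a minimal sketch; turbostat (from the kernel tools) and cpupower (from linux-tools) are common choices, but any idle-state monitor will do:

# Watch per-core C-state residency live; time spent in deep states
# (C3/C6) instead of C1 means cores are being parked.
turbostat -i 5

# Alternative: cpupower's idle-state monitor
cpupower monitor -m Idle_Stats

# One common way to hold CPUs at shallow C-states is via kernel boot
# parameters (requires a reboot; an assumption, not from the thread):
#   intel_idle.max_cstate=1 processor.max_cstate=1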
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi Maged,

Thanks a lot for the response. We tried with different numbers of threads and we're getting almost the same kind of difference between the storage types. We're going to try different rbd stripe size and object size values and see if we get more competitive numbers. Will get back with more tests and param changes to see if we get better :)

Thanks,

Best,

*German*

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:

> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
> I've a performance question: we recently installed a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>
> [...]
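For readers following along, a minimal sketch of the kind of striping experiment German describes; the pool and image names are hypothetical, and exact flag support should be verified against your rbd CLI version:

# Create an image with 1M objects, striped 64K-wide across 4 objects
# (the defaults are 4M objects with stripe-count 1).
rbd create --size 100G \
    --object-size 1M \
    --stripe-unit 64K \
    --stripe-count 4 \
    mysql-pool/mysql-data01

# Confirm the resulting layout before benchmarking against it
rbd info mysql-pool/mysql-data01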
Re: [ceph-users] ceph all-nvme mysql performance tuning
On 2017-11-27 15:02, German Anders wrote:
> Hi All,
>
> I've a performance question: we recently installed a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>
> [...]
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi Wido,

Thanks a lot for the quick response. Regarding the questions:

Have you tried to attach multiple RBD volumes:
- Root for OS (the root partition has local SSDs)
- MySQL data dir (the idea is to have all the storage tests with the same scheme; the first test uses one volume holding the data dir, InnoDB logs and binlog)
- MySQL InnoDB logfile
- MySQL binary logging
So 4 disks in total where you spread out the I/O over. (The following tests are going to be spread over 3 disks, and we'll make a new comparison between the arrays.)

Regarding the version of librbd, it's not a typo; we also use this server with an old Ceph cluster. We are going to upgrade the version and see if the tests get better.

Thanks

*German*

2017-11-27 10:16 GMT-03:00 Wido den Hollander:

> > On 27 November 2017 at 14:02, German Anders wrote:
> >
> > Hi All,
> >
> > I've a performance question: we recently installed a brand new Ceph cluster
> > with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> >
> > [...]
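A minimal sketch of the volume split Wido suggests, assuming three extra RBD images are already mapped as /dev/rbd1-3; the device names, mount points and config file path are placeholders:

# One filesystem per MySQL I/O stream
mkdir -p /var/lib/mysql-redo /var/lib/mysql-binlog
mkfs.xfs /dev/rbd1 && mount /dev/rbd1 /var/lib/mysql          # data dir
mkfs.xfs /dev/rbd2 && mount /dev/rbd2 /var/lib/mysql-redo     # InnoDB redo logs
mkfs.xfs /dev/rbd3 && mount /dev/rbd3 /var/lib/mysql-binlog   # binary logs

# Point MySQL at the separate mounts
cat >> /etc/mysql/conf.d/split-io.cnf <<'EOF'
[mysqld]
datadir                   = /var/lib/mysql
innodb_log_group_home_dir = /var/lib/mysql-redo
log_bin                   = /var/lib/mysql-binlog/mysql-bin
EOF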
Re: [ceph-users] ceph all-nvme mysql performance tuning
> On 27 November 2017 at 14:14, German Anders wrote:
>
> Hi Jason,
>
> We are using librbd (librbd1-0.80.5-9.el6.x86_64), OK, I will change those
> parameters and see if that changes something

0.80? Is that a typo? You should really use 12.2.1 on the client.

Wido

> thanks a lot
>
> best,
>
> *German*
>
> 2017-11-27 10:09 GMT-03:00 Jason Dillaman:
>
> > Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> > as well since per-message log gathering takes a large hit on small IO
> > performance.
> >
> > [...]
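Before upgrading, it may be worth confirming which librbd the client is actually loading; a quick sketch (the package query assumes an RPM-based host, as the el6 suffix suggests):

# Installed librbd package, and the version the ceph CLI reports
rpm -q librbd1
ceph --version

# Check which librbd a running process has mapped, to catch a stale copy
lsof -p "$(pidof mysqld)" 2>/dev/null | grep librbd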
Re: [ceph-users] ceph all-nvme mysql performance tuning
> On 27 November 2017 at 14:02, German Anders wrote:
>
> Hi All,
>
> I've a performance question: we recently installed a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>
> [...]
Re: [ceph-users] ceph all-nvme mysql performance tuning
Hi Jason,

We are using librbd (librbd1-0.80.5-9.el6.x86_64), OK, I will change those parameters and see if that changes something

thanks a lot

best,

*German*

2017-11-27 10:09 GMT-03:00 Jason Dillaman:

> Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> as well since per-message log gathering takes a large hit on small IO
> performance.
>
> [...]
Re: [ceph-users] ceph all-nvme mysql performance tuning
Are you using krbd or librbd? You might want to consider "debug_ms = 0/0" as well since per-message log gathering takes a large hit on small IO performance.

On Mon, Nov 27, 2017 at 8:02 AM, German Anders wrote:

> Hi All,
>
> I've a performance question: we recently installed a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>
> [...]
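For concreteness, the setting Jason mentions would normally go in the [client] section of ceph.conf on the host running librbd. A minimal fragment (only debug_ms itself comes from the thread; the section placement is standard practice):

[client]
debug_ms = 0/0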
[ceph-users] ceph all-nvme mysql performance tuning
Hi All,

I've a performance question: we recently installed a brand new Ceph cluster with all-nvme disks, using ceph version 12.2.0 with bluestore configured. The back-end of the cluster is using a bond IPoIB (active/passive), and for the front-end we are using a bonding config with active/active (20GbE) to communicate with the clients.

The cluster configuration is the following:

*MON Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
2x Intel Xeon E5-2630v4 @2.2Ghz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

*OSD Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
2x Intel Xeon E5-2640v4 @2.4Ghz
128G RAM
2x Intel SSD DC S3520 150G (in RAID-1 for OS)
1x Ethernet Controller 10G X550T
1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)

Here's the tree:

ID CLASS WEIGHT TYPE NAME      STATUS REWEIGHT PRI-AFF
-7        48.0  root root
-5        24.0    rack rack1
-1        12.0      node cpn01
 0  nvme   1.0        osd.0    up     1.0      1.0
 1  nvme   1.0        osd.1    up     1.0      1.0
 2  nvme   1.0        osd.2    up     1.0      1.0
 3  nvme   1.0        osd.3    up     1.0      1.0
 4  nvme   1.0        osd.4    up     1.0      1.0
 5  nvme   1.0        osd.5    up     1.0      1.0
 6  nvme   1.0        osd.6    up     1.0      1.0
 7  nvme   1.0        osd.7    up     1.0      1.0
 8  nvme   1.0        osd.8    up     1.0      1.0
 9  nvme   1.0        osd.9    up     1.0      1.0
10  nvme   1.0        osd.10   up     1.0      1.0
11  nvme   1.0        osd.11   up     1.0      1.0
-3        12.0      node cpn03
24  nvme   1.0        osd.24   up     1.0      1.0
25  nvme   1.0        osd.25   up     1.0      1.0
26  nvme   1.0        osd.26   up     1.0      1.0
27  nvme   1.0        osd.27   up     1.0      1.0
28  nvme   1.0        osd.28   up     1.0      1.0
29  nvme   1.0        osd.29   up     1.0      1.0
30  nvme   1.0        osd.30   up     1.0      1.0
31  nvme   1.0        osd.31   up     1.0      1.0
32  nvme   1.0        osd.32   up     1.0      1.0
33  nvme   1.0        osd.33   up     1.0      1.0
34  nvme   1.0        osd.34   up     1.0      1.0
35  nvme   1.0        osd.35   up     1.0      1.0
-6        24.0    rack rack2
-2        12.0      node cpn02
12  nvme   1.0        osd.12   up     1.0      1.0
13  nvme   1.0        osd.13   up     1.0      1.0
14  nvme   1.0        osd.14   up     1.0      1.0
15  nvme   1.0        osd.15   up     1.0      1.0
16  nvme   1.0        osd.16   up     1.0      1.0
17  nvme   1.0        osd.17   up     1.0      1.0
18  nvme   1.0        osd.18   up     1.0      1.0
19  nvme   1.0        osd.19   up     1.0      1.0
20  nvme   1.0        osd.20   up     1.0      1.0
21  nvme   1.0        osd.21   up     1.0      1.0
22  nvme   1.0        osd.22   up     1.0      1.0
23  nvme   1.0        osd.23   up     1.0      1.0
-4        12.0      node cpn04
36  nvme   1.0        osd.36   up     1.0      1.0
37  nvme   1.0        osd.37   up     1.0      1.0
38  nvme   1.0        osd.38   up     1.0      1.0
39  nvme   1.0        osd.39   up     1.0      1.0
40  nvme   1.0        osd.40   up     1.0      1.0
41  nvme   1.0        osd.41   up     1.0      1.0
42  nvme   1.0        osd.42   up     1.0      1.0
43  nvme   1.0        osd.43   up     1.0      1.0
44  nvme   1.0        osd.44   up     1.0      1.0
45  nvme   1.0        osd.45   up     1.0      1.0
46  nvme   1.0        osd.46   up     1.0      1.0
47  nvme   1.0        osd.47   up     1.0      1.0

The disk partition of one of the OSD nodes:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1                259:1    0   1.1T  0 disk
├─nvme6n1p2            259:15   0   1.1T  0 part
└─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1                259:0    0   1.1T  0 disk
├─nvme9n1p2            259:8    0   1.1T  0 part
└─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb                      8:16   0 139.8G  0 disk
└─sdb1                   8:17   0 139.8G  0 part
  └─md0                  9:0    0 139.6G  0 raid1
    ├─md0p2            259:31   0     1K  0 md
    ├─md0p5            259:32   0 139.1G  0 md
    │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
    │ └─cpn01--vg-root 253:0