Test result update:

Number of Hosts   Maximum single volume IOPS   Maximum aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
7                 14k                          45k                       9800+           90%
8                 21k                          50k                       9800+           90%
9                 30k                          56k                       9800+           90%
10                40k                          54k                       8200+           70%



Note: the average request size on the disks is about 20 sectors (~10KB), which
is not the same as the client-side request size (4k).


I have two questions about the result:


1. No matter how many nodes the cluster has, the backend write throughput is
always almost 8 times the client-side throughput. Is this normal behavior in
Ceph, or is it caused by some wrong configuration in my setup?


The following data was captured in the 9-host test. Roughly, the aggregated
backend write throughput is 9800 (w/s per disk) * 22 (sectors) * 512 (bytes)
* 2 (disks per host) * 9 (hosts) ≈ 1980 MB/s.

The client side is 56k * 4k = 224 MB/s.
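
For clarity, the same arithmetic as a small Python sketch (the per-disk w/s,
request size and client IOPS are the figures from the table and iostat above;
treat it as a back-of-the-envelope check, not a measurement):

# Back-of-the-envelope check of backend vs. client write throughput
# for the 9-host test, using the figures reported above.
SECTOR = 512                       # bytes per sector (iostat reports sectors)

disk_wps       = 9800              # w/s per SSD, from iostat
avg_req_secs   = 22                # avgrq-sz in sectors
disks_per_host = 2
hosts          = 9
backend_mb = disk_wps * avg_req_secs * SECTOR * disks_per_host * hosts / 1e6

client_iops = 56000                # aggregated 4k randwrite IOPS from fio
client_mb   = client_iops * 4096 / 1e6

print("backend ~%.0f MB/s, client ~%.0f MB/s, ratio ~%.1fx"
      % (backend_mb, client_mb, backend_mb / client_mb))
# prints roughly: backend ~1987 MB/s, client ~229 MB/s, ratio ~8.7x
# With 2 replicas and journal + data on the same disk I would only
# expect ~4x (2 copies * 2 writes each), so ~8x is the surprise.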


Filesystem:       rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s     ops/s    rops/s    wops/s

Device:         rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.33    0.00      1.33     0.00      10.67     8.00     0.00    0.00   0.00   0.00
sdb               0.00     6.00    0.00  10219.67     0.00  223561.67    21.88     4.08    0.40   0.09  89.43
sdc               0.00     6.00    0.00   9750.67     0.00  220286.67    22.59     2.47    0.25   0.09  89.83
dm-0              0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00      1.33     0.00      10.67     8.00     0.00    0.00   0.00   0.00

Filesystem:       rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s     ops/s    rops/s    wops/s

Device:         rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00      1.00     0.00      26.67    26.67     0.00    0.00   0.00   0.00
sdb               0.00     6.33    0.00  10389.00     0.00  224668.67    21.63     3.78    0.36   0.09  89.23
sdc               0.00     4.33    0.00  10106.67     0.00  217986.00    21.57     3.83    0.38   0.09  91.10
dm-0              0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00      1.00     0.00      26.67    26.67     0.00    0.00   0.00   0.00
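
In case it is useful for reproducing the aggregate number, a small Python
sketch that does the same bookkeeping from iostat output is below. It assumes
the iostat -x column layout shown above, sdb/sdc as the OSD data disks, and a
single iostat interval fed on stdin; device names and host count are
setup-specific.

import sys

def osd_write_mb_per_s(iostat_text, devices=("sdb", "sdc")):
    """Sum the wsec/s column for the given OSD disks in 'iostat -x'
    output and convert 512-byte sectors/s to MB/s."""
    total_sectors = 0.0
    for line in iostat_text.splitlines():
        fields = line.split()
        # A device row looks like:
        # sdb  0.00  6.00  0.00  10219.67  0.00  223561.67  21.88 ...
        if fields and fields[0] in devices and len(fields) >= 7:
            total_sectors += float(fields[6])   # wsec/s
    return total_sectors * 512 / 1e6

if __name__ == "__main__":
    per_host = osd_write_mb_per_s(sys.stdin.read())
    print("backend writes on this host: %.0f MB/s (x9 hosts = %.0f MB/s)"
          % (per_host, per_host * 9))

On the interval above it reports roughly 227 MB/s per host, i.e. about 2 GB/s
across 9 hosts, consistent with the estimate in question 1.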


2. For the scalability issue (10 hosts perform worse than 9 hosts), are there
any tuning suggestions to improve it?

Thanks!







2014-10-17 16:52 GMT+08:00 Mark Wu <wud...@gmail.com>:

>
> >> I assume you added more clients and checked that it didn't scale past
> >> that?
> Yes, correct.
> >> You might look through the list archives; there are a number of
> discussions about how and how far you can scale SSD-backed cluster
> performance.
> I have looked at those discussions before, particularly the one initiated
> by Sebastien:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12486.html
> From that thread I found that Giant can provide better utilization of an
> SSD backend. It does improve 4k random write a lot compared with Firefly.
> In the previous tests with Firefly and 16 osds, I found that the iops of
> 4k random write on a single volume is 14k, which almost reaches the peak
> of the whole cluster. And the iops on the SSD disks is less than 1000,
> which is far below the hardware limit. It looks like ceph doesn't dispatch
> requests fast enough.
>
> With 0.86, the following options, together with disabling debugging, give
> an obvious improvement:
>  throttler perf counter = false
>  osd enable op tracker = false
>
> >> Just scanning through the config options you set, you might want to
> >> bump up all the filestore and journal queue values a lot farther.
>
> Tried the following options. They don't change the result.
>
> journal_queue_max_ops=3000
> objecter_inflight_ops=10240
> journal_max_write_bytes=1048576000
> journal_queue_max_bytes=1048576000
>
> ms_dispatch_throttle_bytes=1048576000
> objecter_inflight_op_bytes=1048576000
> filestore_max_sync_interval=10
>
> I have a question about the relationship between the write I/O numbers
> performed by the ceph client and by the osd disks. In the iostat pasted in
> the first message, the writes per second are about 5000 per disk and the
> average request size is 17~22 sectors. Roughly, the write throughput over
> all osd nodes is 20 * 512 * 5000 * 30 = 1500MB/s.
> The replica setting is 2 and the journal and osd data are on the same
> disk, so can we assume the write rate on the ssd disks should be 40k (fio
> client result) * 4k * 2 * 2 = 640MB/s in theory?
> I don't understand why the actual write rate is so much higher than the
> theoretical value. The average request size is also more than twice the
> client request size.
> I ran blktrace to check whether requests are merged by the OS I/O
> scheduler. From the result, it looks like ceph will merge the requests
> from the client side into bigger ones if possible.
> It also shows that the write rate on the osds (36,141KiB/s * 30 =
> 1084MB/s) is much higher than the theoretical value (129641KB/s * 4 =
> 518MB/s).
>
>
> fio test config and result
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_log
> ioengine=rbd
> clientname=admin
> pool=volumes
> rbdname=image2
> invalidate=0    # mandatory
> rw=randwrite
> bs=4k
>
> [rbd_iodepth128]
> iodepth=128
> numjobs=3
>
> Run status group 0 (all jobs):
>   WRITE: io=3723.5MB, aggrb=129641KB/s, minb=42961KB/s, maxb=43452KB/s,
> mint=29404msec, maxt=29410msec
>
> Blktrace result:
> ==================== Device Overhead ====================
>
>        DEV |       Q2G       G2I       Q2M       I2D       D2C
> ---------- | --------- --------- --------- --------- ---------
>  (  8, 16) |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%
> ---------- | --------- --------- --------- --------- ---------
>    Overall |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%
>
> ==================== Device Merge Information ====================
>
>        DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
> ---------- | -------- -------- ------- | -------- -------- -------- --------
>  (  8, 16) |   108683   106834     1.0 |        1       18      560  1924765
>
> Total (sdb):
>  Reads Queued:           0,        0KiB  Writes Queued:     108,683,  962,312KiB
>  Read Dispatches:        0,        0KiB  Write Dispatches:  106,834,  962,313KiB
>  Reads Requeued:         0               Writes Requeued:         0
>  Reads Completed:        0,        0KiB  Writes Completed:  106,834,  962,313KiB
>  Read Merges:            0,        0KiB  Write Merges:        1,849,    8,176KiB
>  IO unplugs:        73,163               Timer unplugs:           0
>
> Throughput (R/W): 0KiB/s / 36,141KiB/s
> Events (sdb): 792,897 entries
>
> sdb.btt_qhist.dat: (collected on queuing, before merging)
> req-size     num
>        8   59403
>       16   40522
>       32    6057
>       48    1102
>       64     243
>       80      60
>       96      37
>      112      18
>      128       8
>
>
>
>
>
>
>
>>
>>
>>
>> On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu <wud...@gmail.com> wrote:
>> > Thanks for the reply. I am not using a single client. Writing to 5 rbd
>> > volumes on 3 hosts can reach the peak. The client is fio, also running on
>> > the osd nodes, but there are no bottlenecks on cpu or network. I also
>> > tried running the clients on two non-osd servers, with the same result.
>> >
>> > On 2014-10-17 at 12:29 AM, "Gregory Farnum" <g...@inktank.com> wrote:
>> >
>> >> If you're running a single client to drive these tests, that's your
>> >> bottleneck. Try running multiple clients and aggregating their numbers.
>> >> -Greg
>> >>
>> >> On Thursday, October 16, 2014, Mark Wu <wud...@gmail.com> wrote:
>> >>>
>> >>> Hi list,
>> >>>
>> >>> During my test, I found ceph doesn't scale as I expected on a 30 osds
>> >>> cluster.
>> >>> The following is the information of my setup:
>> >>> HW configuration:
>> >>>    15 Dell R720 servers, and each server has:
>> >>>       Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores and
>> >>> hyper-thread enabled.
>> >>>       128GB memory
>> >>>       two Intel 3500 SSD disks, connected with MegaRAID SAS 2208
>> >>> controller, each disk is configured as raid0 separately.
>> >>>       bonding with two 10GbE nics, used for both the public network
>> and
>> >>> cluster network.
>> >>>
>> >>> SW configuration:
>> >>>    OS CentOS 6.5, Kernel 3.17,  Ceph 0.86
>> >>>    XFS as file system for data.
>> >>>    each SSD disk has two partitions, one is osd data and the other is
>> osd
>> >>> journal.
>> >>>    the pool has 2048 pgs. 2 replicas.
>> >>>    5 monitors running on 5 of the 15 servers.
>> >>>    Ceph configuration (in memory debugging options are disabled)
>> >>>
>> >>> [osd]
>> >>> osd data = /var/lib/ceph/osd/$cluster-$id
>> >>> osd journal = /var/lib/ceph/osd/$cluster-$id/journal
>> >>> osd mkfs type = xfs
>> >>> osd mkfs options xfs = -f -i size=2048
>> >>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>> >>> osd journal size = 20480
>> >>> osd mon heartbeat interval = 30 # Performance tuning filestore
>> >>> osd_max_backfills = 10
>> >>> osd_recovery_max_active = 15
>> >>> merge threshold = 40
>> >>> filestore split multiple = 8
>> >>> filestore fd cache size = 1024
>> >>> osd op threads = 64 # Recovery tuning osd recovery max active = 1 osd
>> max
>> >>> backfills = 1
>> >>> osd recovery op priority = 1
>> >>> throttler perf counter = false
>> >>> osd enable op tracker = false
>> >>> filestore_queue_max_ops = 5000
>> >>> filestore_queue_committing_max_ops = 5000
>> >>> journal_max_write_entries = 1000
>> >>> journal_queue_max_ops = 5000
>> >>> objecter_inflight_ops = 8192
>> >>>
>> >>>
>> >>>   When I test with 7 servers (14 osds), the maximum iops of 4k random
>> >>> write I saw is 17k on a single volume and 44k on the whole cluster.
>> >>> I expected the number for a 30 osds cluster could approach 90k. But
>> >>> unfortunately, I found that with 30 osds it provides almost the same
>> >>> performance as 14 osds, sometimes even worse. I checked the iostat
>> >>> output on all the nodes, which have similar numbers. The load is well
>> >>> distributed but disk utilization is low.
>> >>> In the test with 14 osds, I can see higher disk utilization (80%~90%).
>> >>> So do you have any tuning suggestions to improve the performance with
>> >>> 30 osds?
>> >>> Any feedback is appreciated.
>> >>>
>> >>>
>> >>> iostat output:
>> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> >>> avgrq-sz avgqu-sz   await  svctm  %util
>> >>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>> >>> 0.00     0.00    0.00   0.00   0.00
>> >>> sdb               0.00    88.50    0.00 5188.00     0.00 93397.00
>> >>> 18.00     0.90    0.17   0.09  47.85
>> >>> sdc               0.00   443.50    0.00 5561.50     0.00 97324.00
>> >>> 17.50     4.06    0.73   0.09  47.90
>> >>> dm-0              0.00     0.00    0.00    0.00     0.00     0.00
>> >>> 0.00     0.00    0.00   0.00   0.00
>> >>> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
>> >>> 0.00     0.00    0.00   0.00   0.00
>> >>>
>> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> >>> avgrq-sz avgqu-sz   await  svctm  %util
>> >>> sda               0.00    17.50    0.00   28.00     0.00  3948.00
>> >>> 141.00     0.01    0.29   0.05   0.15
>> >>> sdb               0.00    69.50    0.00 4932.00     0.00 87067.50
>> >>> 17.65     2.27    0.46   0.09  43.45
>> >>> sdc               0.00    69.00    0.00 4855.50     0.00 105771.50
>> >>> 21.78     0.95    0.20   0.10  46.40
>> >>> dm-0              0.00     0.00    0.00    0.00     0.00     0.00
>> >>> 0.00     0.00    0.00   0.00   0.00
>> >>> dm-1              0.00     0.00    0.00   42.50     0.00  3948.00
>> >>> 92.89     0.01    0.19   0.04   0.15
>> >>>
>> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> >>> avgrq-sz avgqu-sz   await  svctm  %util
>> >>> sda               0.00    12.00    0.00    8.00     0.00   568.00
>> >>> 71.00     0.00    0.12   0.12   0.10
>> >>> sdb               0.00    72.50    0.00 5046.50     0.00 113198.50
>> >>> 22.43     1.09    0.22   0.10  51.40
>> >>> sdc               0.00    72.50    0.00 4912.00     0.00 91204.50
>> >>> 18.57     2.25    0.46   0.09  43.60
>> >>> dm-0              0.00     0.00    0.00    0.00     0.00     0.00
>> >>> 0.00     0.00    0.00   0.00   0.00
>> >>> dm-1              0.00     0.00    0.00   18.00     0.00   568.00
>> >>> 31.56     0.00    0.17   0.06   0.10
>> >>>
>> >>>
>> >>>
>> >>> Regards,
>> >>> Mark Wu
>> >>>
>> >>
>> >>
>> >> --
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
