Hello,

Two parallel fio jobs, with one job simulating the journal (sequential
writes, ioengine=libaio, direct=1, sync=1, iodepth=128, bs=1MB) and the
other job simulating the datastore (random writes of 1MB)?

To test against a single HDD?
Yes, something like that. The first fio job would need to go against a raw
partition, and the iodepth isn't anywhere near that high with a journal; in
theory it's actually 1 (some Ceph developer please pipe up here).


I took that number (iodepth=128) from
https://github.com/ceph/ceph/blob/master/src/os/filestore/FileJournal.cc#L111
From the io_setup manpage: "The io_setup() system call creates an asynchronous I/O 
context suitable for concurrently processing nr_events operations."

The 2nd fio job needs to run against an actual FS, and the bs for both should
match your stripe unit size for sequential tests.
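
Something along these lines, as a rough sketch rather than a tuned benchmark:
/dev/sdX2 would be a spare raw partition (it gets overwritten!) and /mnt/osd-fs
a scratch filesystem; both paths, the size and the runtime are placeholders.

# options before the first --name are global; each --name starts a job
# journal-like job: sequential, O_DIRECT + O_SYNC 1MB writes to a raw partition
# datastore-like job: 1MB random writes to a file on a real filesystem
fio --runtime=60 --time_based --ioengine=libaio --direct=1 --bs=1M \
    --name=journal   --filename=/dev/sdX2    --rw=write     --iodepth=1 --sync=1 \
    --name=datastore --directory=/mnt/osd-fs --rw=randwrite --size=10G

Bump --iodepth on the journal job if you want to mimic the aio context size of
128 instead of the in-theory depth of 1.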

What this setup misses, especially in the 2nd part, is that Ceph operates on
individual files which it has to create on the fly, may create or delete
subdirectories and trees, updates a LevelDB[*] on the same FS, and so on.


[*] see /var/lib/ceph/osd/ceph-nn/current/omap/


Good point, thanks.

Last time I checked, the disks were well utilized (i.e. they were busy
almost 100% of the time), but that doesn't equate to "can't accept more
I/O operations".
Well, if it really is 100% busy and the next journal write has to wait
until all the seeking and syncing is done, then Ceph will block at this point,
of course.

The throughput (as seen by iostat -xz 1) was way
below the maximum.
Around 40 MB/s, by any chance?

I repeated the test with 7 clients x 1 thread, replication 1,
CephFS stripe unit=4MB, stripe count=1, object size=4MB.
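
For reference, that layout can be set on a fresh CephFS directory before the
benchmark files are created; /mnt/cephfs/bench is a hypothetical mount
point/test directory, and the values are in bytes:

# stripe unit and object size of 4 MB, stripe count of 1
setfattr -n ceph.dir.layout.stripe_unit  -v 4194304 /mnt/cephfs/bench
setfattr -n ceph.dir.layout.stripe_count -v 1       /mnt/cephfs/bench
setfattr -n ceph.dir.layout.object_size  -v 4194304 /mnt/cephfs/bench
getfattr -n ceph.dir.layout /mnt/cephfs/bench   # verify the layout new files will inherit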

Output from iostat -xzm 5:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,33    0,00   20,33   24,38    0,00   42,96

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,60    0,00  143,60     0,00    62,74   894,74    10,03   60,23    0,00   60,23   5,69  81,72
sdg               0,00    74,80    0,00  270,00     0,00   102,43   776,97    83,25  308,53    0,00  308,53   3,63  97,96
sdk               0,00     2,00    0,00  181,40     0,00    79,05   892,52    16,36   89,97    0,00   89,97   5,07  91,96
sdd               0,00     4,00    0,00  244,20     0,00   114,64   961,40   165,66  662,56    0,00  662,56   4,09  99,84
sdl               0,00     0,60    0,00  185,60     0,00    84,05   927,42    61,15  441,91    0,00  441,91   4,31  79,94
sde               0,00     0,80    0,00  183,60     0,00    82,76   923,14    54,01  520,53    0,00  520,53   4,64  85,28
sdj               0,00     1,80    0,00  242,20     0,00   111,45   942,42   119,56  493,59    0,00  493,59   4,00  96,98
sdi               0,00     4,00    0,00  192,60     0,00    90,05   957,57   109,19  450,13    0,00  450,13   4,69  90,42
sdp               0,00     2,80    0,00  170,80     0,00    74,78   896,67    10,72   58,13    0,00   58,13   5,48  93,68
sds               0,00     2,00    0,00  178,00     0,00    80,92   931,05    48,59  273,00    0,00  273,00   5,08  90,44
sdn               0,00     0,40    0,00  178,60     0,00    77,97   894,09    10,04   66,59    0,00   66,59   4,98  89,02
sdr               0,00    64,20    0,00  205,60     0,00    83,93   835,99    49,16  218,47    0,00  218,47   4,65  95,64
sdu               0,00     1,80    0,00  194,20     0,00    87,82   926,11    53,98  177,14    0,00  177,14   5,11  99,32
sdx               0,00     1,20    0,00  175,00     0,00    78,73   921,42    33,68  131,99    0,00  131,99   5,47  95,78
sda               0,00     2,20    0,00  218,40     0,00    97,16   911,07    39,80  182,23    0,00  182,23   4,51  98,48
sdm               0,00    74,00    1,00  244,60     0,01    86,52   721,50    49,40  180,41   54,80  180,93   3,94  96,84
sdq               0,00     0,60    0,00  163,80     0,00    73,04   913,18    17,03   62,75    0,00   62,75   5,77  94,52
sdh               0,00    97,00    1,00  211,40     0,01    71,28   687,35    67,05  238,17   53,20  239,05   4,43  94,14
sdf               0,00     1,00    0,00  162,80     0,00    73,02   918,55    27,24  167,31    0,00  167,31   5,40  87,96
sdo               0,00     2,00    0,00  244,40     0,00   111,54   934,68    91,99  522,59    0,00  522,59   3,91  95,58
sdc               0,00     0,80    0,40  203,20     0,00    90,49   910,26    25,71  126,16   46,00  126,32   4,75  96,80
sdt               0,00     1,00    0,00  165,60     0,00    74,71   923,98    31,09  188,86    0,00  188,86   4,94  81,86
sdw               0,00     3,40    0,00  223,40     0,00   104,10   954,37   144,31  532,19    0,00  532,19   4,46  99,64
sdv               0,00     2,20    0,00  242,00     0,00   109,82   929,40    71,02  293,66    0,00  293,66   4,03  97,60

On average, each disk was writing 87 MB/s; the average queue length per disk
was 57.
Similar output on the other 5 servers.
Aggregated client throughput is 5682 MB/s (no change in throughput compared to 
the other striping configuration).

When I divide that number by 2 (because of the journal), I get 43.5 MB/s.
So this is my effective write speed per disk? Guess I should try BlueStore
asap (to avoid the double writes).
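
That per-disk figure can be recomputed straight from one of the iostat reports
above; a rough sketch, assuming the report is saved to a file (iostat-sample.txt
is a placeholder) and that wMB/s is the 7th column as in the output shown:

# average wMB/s over all sdX devices, then halve it to account for the
# FileStore journal writing every byte a second time on the same disk
awk '/^sd/ { gsub(",", ".", $7); sum += $7; n++ }
     END   { printf "avg %.1f MB/s per disk, ~%.1f MB/s effective\n", sum/n, sum/(2*n) }' iostat-sample.txt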


Same benchmark test with replication 3 (and same striping settings as above):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,91    0,00   26,38   28,04    0,00   32,67

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,60    0,00  193,60     0,00    90,83   960,82    91,92  491,66    0,00  491,66   4,48  86,78
sdg               0,00    57,20    0,20  213,40     0,00    83,04   796,23    58,43  278,98   67,00  279,18   4,46  95,22
sdk               0,00     6,00    0,00  223,20     0,00   108,08   991,69   103,20  423,60    0,00  423,60   4,45  99,26
sdd               0,00    44,60    0,00  237,20     0,00    94,67   817,39    97,37  395,30    0,00  395,30   4,19  99,38
sdl               0,00     1,80    0,00  286,40     0,00   137,32   981,94   177,65  635,85    0,00  635,85   3,49 100,00
sde               0,00     1,20    0,00  183,60     0,00    82,56   920,90    24,12  151,46    0,00  151,46   5,21  95,60
sdj               0,00    10,00    0,00  232,20     0,00   115,36  1017,51   145,47  593,01    0,00  593,01   4,31 100,00
sdi               0,00     5,00    0,00  210,00     0,00   101,46   989,49   117,92  501,59    0,00  501,59   4,71  98,88
sdp               0,00     5,00    0,00  257,60     0,00   124,37   988,76   186,43  651,64    0,00  651,64   3,88  99,96
sds               0,00     1,20    0,00  249,00     0,00   118,78   976,91   107,23  529,87    0,00  529,87   3,99  99,42
sdn               0,00     3,40    0,00  282,60     0,00   135,38   981,12   212,53  768,44    0,00  768,44   3,54  99,98
sdr               0,00    41,60    0,00  265,20     0,00   112,67   870,09    96,91  430,14    0,00  430,14   3,75  99,56
sdu               0,00    55,60    0,00  187,20     0,00    72,29   790,89    23,08  173,73    0,00  173,73   5,09  95,26
sdx               0,40     2,20    0,80  222,20     0,15   109,18  1004,08   194,31  757,00  573,50  757,66   4,48 100,00
sda               0,00     1,40    0,00  266,60     0,00   129,20   992,49   145,07  567,22    0,00  567,22   3,75 100,00
sdm               0,00    58,60    0,00  225,20     0,00    89,67   815,43    75,26  333,63    0,00  333,63   4,41  99,28
sdq               0,00     2,20    0,00  239,60     0,00   117,24  1002,14   142,16  627,56    0,00  627,56   4,17  99,98
sdh               0,00    70,20    0,00  234,80     0,00    93,04   811,55    66,73  228,26    0,00  228,26   4,23  99,26
sdf               0,00     2,40    0,00  271,00     0,00   131,08   990,63   221,43  968,74    0,00  968,74   3,69 100,00
sdo               0,00     2,00    0,00  256,00     0,00   124,71   997,67   193,33  770,51    0,00  770,51   3,91 100,00
sdc               0,00     2,60    0,00  267,40     0,00   126,37   967,88   141,07  640,00    0,00  640,00   3,73  99,86
sdt               0,00     1,60    0,00  203,20     0,00    95,73   964,83    74,25  456,10    0,00  456,10   4,48  91,10
sdw               0,00     2,80    0,00  267,20     0,00   128,19   982,56   127,02  482,74    0,00  482,74   3,74  99,98
sdv               0,00     8,60    0,00  200,80     0,00    94,03   959,08    57,78  291,25    0,00  291,25   4,89  98,28

Average throughput: 108 MB/s, average queue length: 119.
Aggregated client throughput is 1903 MB/s (also no measurable change).
The average throughput divided by 2 (because of the journal) would be 54 MB/s,
and therefore 144 OSDs * 54 MB/s = 7776 MB/s should be the baseline?
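
As a back-of-the-envelope check against the client side (assuming the journal
doubles every write and replication 3 triples them, and ignoring any other
overhead):

# 144 OSDs * 108 MB/s raw, halved for the journal, divided by 3 for replication
echo '144 * 108 / 2 / 3' | bc -l   # ~2592 MB/s client-visible vs the 1903 MB/s observed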

I noticed that during benchmarks with replication 3 I get lots of blocked ops
(I don't get nearly as many with replication 1), which disappear after the
benchmark has finished:
15155 requests are blocked > 32 sec; 143 osds have slow requests;

Logs of a random OSD tell me:

2016-11-09 20:35:57.322967 7f5684c65700  0 log_channel(cluster) log [WRN] : 195 
slow requests, 5 included below; oldest blocked for > 70.047596 secs
2016-11-09 20:35:57.322979 7f5684c65700  0 log_channel(cluster) log [WRN] : 
slow request 60.405079 seconds old, received at 2016-11-09 20:34:56.917801: 
osd_repop(client.65784.1:22895980 5.49e 5:7939450d:::1000000a2e2.00000429:head 
v 1191'192668) currently started
2016-11-09 20:35:57.322985 7f5684c65700  0 log_channel(cluster) log [WRN] : 
slow request 30.160712 seconds old, received at 2016-11-09 20:35:27.162168: 
osd_repop(client.65781.1:23303524 5.614 5:286a4a86:::1000000996a.00000fab:head 
v 1191'191953) currently started

All slow requests have to do with "osd_repop".
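
For digging into those while they are happening, something along these lines
should work (osd.42 is just a placeholder for one of the OSDs reported as
having slow requests; the daemon commands have to run on the host carrying
that OSD):

ceph health detail | grep -i blocked      # which OSDs currently have blocked/slow requests
ceph daemon osd.42 dump_ops_in_flight     # ops the OSD is working on right now
ceph daemon osd.42 dump_historic_ops      # recently completed slow ops with per-step timings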

Looking at a single server:
About 1800 network segments get retransmitted per second, of which about 1400
are TCPFastRetrans.

If I take a look at netstat -s:
229373047157 segments send out
802376626 segments retransmited
Only 0.35% of the segments get retransmitted.
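
(The 0.35% comes straight from those two counters; a one-liner that recomputes
it, matching the Linux netstat -s lines quoted above:

netstat -s | awk '/segments send out/     { out = $1 }
                  /segments retransmited/ { re  = $1 }
                  END { printf "%.2f%% of segments retransmitted\n", 100 * re / out }'
)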


Can I deduce that the disks are saturated, hence the blocked ops, and hence
the network traffic pattern described above?


Make that "more distributed I/O".
As in, you keep 4 times as many OSDs busy as with the 4MB default stripe
size.
That would be a good thing for small writes in an overall not-very-busy
cluster, since they hit different disks.
For sequential writes at full speed, not so much.

Isn't more distributed I/O always favorable? Or is the problem the 4x overhead 
(1MB vs 4MB)?


Thanks for your helpful advice!
Andreas