Re: [ceph-users] when an osd is started up, IO will be blocked

2015-10-27 Thread Jevon Qiao

Hi Cephers,

We're in the middle of triaging the issue that Songbo reported on a Ceph 
cluster running 0.80.9, and we would appreciate advice from the experts 
on this list.


In our testing, stopping a working OSD and starting it again leads to a 
huge performance drop. In other words, this issue can be reproduced 
quite easily, and we cannot lessen the impact of the OSD state change by 
tuning settings such as osd_max_backfills, osd_recovery_max_chunk and 
osd_recovery_max_active. Looking into the source code, we noticed that 
requests issued by clients are queued first when the corresponding PGs 
are in certain states (such as recovering and backfill) and are only 
processed afterwards. During this period, the IOPS reported by fio drops 
significantly (from 2000 to 60). Our guess is that this is done to 
guarantee data consistency; is that correct? If that is the design, how 
can Ceph support performance-sensitive applications? Are there any other 
parameters we can tune to reduce the impact?
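
To make the behaviour described above concrete, here is a toy model in Python. It is only a sketch of the queue-then-process pattern as we understand it, not Ceph's actual code: the class ToyPG, its methods and the state names are all made up for illustration.

```python
from collections import deque

# States in which this toy model parks client ops. The names follow the
# wording of this post, not Ceph's real PG state machine (assumption).
BLOCKING_STATES = {"recovering", "backfill"}


class ToyPG:
    """Hypothetical placement group that queues ops while in a blocking state."""

    def __init__(self, state="active"):
        self.state = state
        self.waiting = deque()  # ops parked until the PG leaves a blocking state
        self.completed = []     # ops that have been processed, in order

    def submit(self, op):
        """Process the op immediately, or park it if the PG is blocking."""
        if self.state in BLOCKING_STATES:
            self.waiting.append(op)
        else:
            self.completed.append(op)

    def set_state(self, state):
        """Change state and flush parked ops once the PG is no longer blocking."""
        self.state = state
        if state not in BLOCKING_STATES:
            while self.waiting:
                self.completed.append(self.waiting.popleft())


pg = ToyPG(state="recovering")
pg.submit("write-1")
pg.submit("write-2")
print(pg.completed)      # nothing completes while the PG is recovering
pg.set_state("active")
print(pg.completed)      # parked ops complete in arrival order
```

In this model a client sees no completions at all for as long as the PG stays in a blocking state, which is consistent with the latency spikes we measure, although the real code path is certainly more fine-grained than this.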


Thanks,
Jevon
On 26/10/15 13:27, wangsongbo wrote:

Hi all,

When an OSD is started, I get a lot of slow requests in the 
corresponding OSD log, as follows:


2015-10-26 03:42:51.593961 osd.4 [WRN] slow request 3.967808 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 
9744'347845) currently commit_sent
2015-10-26 03:42:51.593964 osd.4 [WRN] slow request 3.964537 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:52.594166 osd.4 [WRN] 40 slow requests, 17 included 
below; oldest blocked for > 53.692556 secs
2015-10-26 03:42:52.594172 osd.4 [WRN] slow request 2.272928 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 
9744'63645) currently commit_sent
2015-10-26 03:42:52.594175 osd.4 [WRN] slow request 2.270618 seconds 
old, received at 2015-10-26 03:42:50.323461: 
osd_op(client.3684690.0:191911 
rbd_data.1fc5ca7429fc17.0209 [write 2633728~4096] 
43.72b9f039 ack+ondisk+write e9744) currently commit_sent
2015-10-26 03:42:52.594264 osd.4 [WRN] slow request 4.968252 seconds 
old, received at 2015-10-26 03:42:47.625828: 
osd_repop(client.2682003.0:2686047 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193028) currently commit_sent
2015-10-26 03:42:52.594266 osd.4 [WRN] slow request 4.968111 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 
9744'347845) currently commit_sent
2015-10-26 03:42:52.594318 osd.4 [WRN] slow request 4.964841 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:53.594527 osd.4 [WRN] 40 slow requests, 16 included 
below; oldest blocked for > 54.692945 secs
2015-10-26 03:42:53.594533 osd.4 [WRN] slow request 16.004669 seconds 
old, received at 2015-10-26 03:42:37.589800: 
osd_repop(client.2682003.0:2686041 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193024) currently commit_sent
2015-10-26 03:42:53.594536 osd.4 [WRN] slow request 16.003889 seconds 
old, received at 2015-10-26 03:42:37.590580: 
osd_repop(client.2682003.0:2686040 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 
9744'347842) currently commit_sent
2015-10-26 03:42:53.594538 osd.4 [WRN] slow request 16.000954 seconds 
old, received at 2015-10-26 03:42:37.593515: 
osd_repop(client.2682003.0:2686042 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193025) currently commit_sent
2015-10-26 03:42:53.594541 osd.4 [WRN] slow request 29.138828 seconds 
old, received at 2015-10-26 03:42:24.455641: 
osd_repop(client.4764855.0:65121 43.dbe 
169a9dbe/rbd_data.49a7a4633ac0b1.0021/head//43 v 
9744'12509) currently commit_sent
2015-10-26 03:42:53.594543 osd.4 [WRN] slow request 15.998814 seconds 
old, received at 2015-10-26 03:42:37.595656: 
osd_repop(client.1800547.0:1205399 43.cc5 
9285ecc5/rbd_data.1b794560c6e2ea.00d0/head//43 v 
9744'36732) currently commit_sent
2015-10-26 03:42:54.594892 osd.4 [WRN] 39 slow requests, 17 included 
below; oldest blocked for > 55.693227 secs
2015-10-26 03:42:54.594908 osd.4 [WRN] slow request 4.273600 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 
9744'63645) currently commit_sent
2015-10-26 03:42:54.594911 osd.4 [WRN] slow request 4.271290 seconds 
old, received at 2015-10

Re: [ceph-users] when an osd is started up, IO will be blocked

2015-10-25 Thread wangsongbo

Hi all,

When an OSD is started, I get a lot of slow requests in the 
corresponding OSD log, as follows:


2015-10-26 03:42:51.593961 osd.4 [WRN] slow request 3.967808 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) 
currently commit_sent
2015-10-26 03:42:51.593964 osd.4 [WRN] slow request 3.964537 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:52.594166 osd.4 [WRN] 40 slow requests, 17 included 
below; oldest blocked for > 53.692556 secs
2015-10-26 03:42:52.594172 osd.4 [WRN] slow request 2.272928 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) 
currently commit_sent
2015-10-26 03:42:52.594175 osd.4 [WRN] slow request 2.270618 seconds 
old, received at 2015-10-26 03:42:50.323461: 
osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 
[write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently 
commit_sent
2015-10-26 03:42:52.594264 osd.4 [WRN] slow request 4.968252 seconds 
old, received at 2015-10-26 03:42:47.625828: 
osd_repop(client.2682003.0:2686047 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193028) currently commit_sent
2015-10-26 03:42:52.594266 osd.4 [WRN] slow request 4.968111 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) 
currently commit_sent
2015-10-26 03:42:52.594318 osd.4 [WRN] slow request 4.964841 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:53.594527 osd.4 [WRN] 40 slow requests, 16 included 
below; oldest blocked for > 54.692945 secs
2015-10-26 03:42:53.594533 osd.4 [WRN] slow request 16.004669 seconds 
old, received at 2015-10-26 03:42:37.589800: 
osd_repop(client.2682003.0:2686041 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193024) currently commit_sent
2015-10-26 03:42:53.594536 osd.4 [WRN] slow request 16.003889 seconds 
old, received at 2015-10-26 03:42:37.590580: 
osd_repop(client.2682003.0:2686040 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347842) 
currently commit_sent
2015-10-26 03:42:53.594538 osd.4 [WRN] slow request 16.000954 seconds 
old, received at 2015-10-26 03:42:37.593515: 
osd_repop(client.2682003.0:2686042 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193025) currently commit_sent
2015-10-26 03:42:53.594541 osd.4 [WRN] slow request 29.138828 seconds 
old, received at 2015-10-26 03:42:24.455641: 
osd_repop(client.4764855.0:65121 43.dbe 
169a9dbe/rbd_data.49a7a4633ac0b1.0021/head//43 v 9744'12509) 
currently commit_sent
2015-10-26 03:42:53.594543 osd.4 [WRN] slow request 15.998814 seconds 
old, received at 2015-10-26 03:42:37.595656: 
osd_repop(client.1800547.0:1205399 43.cc5 
9285ecc5/rbd_data.1b794560c6e2ea.00d0/head//43 v 9744'36732) 
currently commit_sent
2015-10-26 03:42:54.594892 osd.4 [WRN] 39 slow requests, 17 included 
below; oldest blocked for > 55.693227 secs
2015-10-26 03:42:54.594908 osd.4 [WRN] slow request 4.273600 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) 
currently commit_sent
2015-10-26 03:42:54.594911 osd.4 [WRN] slow request 4.271290 seconds 
old, received at 2015-10-26 03:42:50.323461: 
osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 
[write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently 
commit_sent
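
To get an overview of a log like the one above, the entries can be summarized with a small script. The following Python is a rough sketch that assumes the entry layout shown in this mail (including the hard line wraps added by the mail client); the names SLOW_RE, parse_slow_requests and summarize are made up here and are not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches one "slow request" entry after line wraps have been collapsed.
# The layout is inferred from the log excerpt in this mail (assumption).
SLOW_RE = re.compile(
    r"(?P<osd>osd\.\d+) \[WRN\] slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"(?P<op>osd_\w+)\((?P<detail>.*?)\) currently (?P<state>\w+)"
)


def parse_slow_requests(text):
    """Return one dict per slow-request entry found in mail-wrapped log text."""
    flat = " ".join(text.split())  # undo the hard line wraps
    entries = []
    for m in SLOW_RE.finditer(flat):
        entry = m.groupdict()
        entry["age"] = float(entry["age"])
        entries.append(entry)
    return entries


def summarize(entries):
    """Count entries per (op type, state) and return the oldest request age."""
    counts = Counter((e["op"], e["state"]) for e in entries)
    oldest = max((e["age"] for e in entries), default=0.0)
    return counts, oldest


# Two entries copied from the log above, wrapped as in the mail.
SAMPLE = """
2015-10-26 03:42:52.594172 osd.4 [WRN] slow request 2.272928 seconds
old, received at 2015-10-26 03:42:50.321151:
osd_repop(client.3684690.0:191908 43.540
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645)
currently commit_sent
2015-10-26 03:42:52.594175 osd.4 [WRN] slow request 2.270618 seconds
old, received at 2015-10-26 03:42:50.323461:
osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209
[write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently
commit_sent
"""

entries = parse_slow_requests(SAMPLE)
counts, oldest = summarize(entries)
for (op, state), n in counts.items():
    print(op, state, n)
print("oldest age (s):", oldest)
```

Run against the full log, this shows at a glance whether the slow requests are mostly replica operations (osd_repop) or client operations (osd_op), and how old the oldest one is.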


Meanwhile, I was running a fio process with the rbd ioengine.
The read and write IOPS dropped so low that fio got almost no IO 
serviced.
In other words, when an OSD is started, the IO of the whole cluster is 
blocked.

Is there some parameter to adjust?
How can this problem be explained?
The results of the fio run were as follows:

ebs: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=rbd, iodepth=64
fio-2.2.9-20-g1520
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [m(1)] [0.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
05h:10m:14s]

ebs: (groupid=0, jobs=1): err= 0: pid=40323: Mon Oct 26 04:02:00 2015
  read : io=10904KB, bw=175183B/s, *iops=21*, runt= 63737msec
slat (usec): min=0, max=61, avg= 1.11, stdev= 3.16
clat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
 lat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[   
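
As a quick sanity check on the read line of the fio output above, the reported figures can be cross-checked against each other. The short Python sketch below uses only numbers from that output (io=10904KB, bw=175183B/s, runt=63737msec, bs=8K) and assumes that fio's KB means 1024 bytes here; the function names are made up for illustration.

```python
def bandwidth_bytes_per_s(io_kib, runtime_ms):
    """Average bandwidth from total IO (KiB) and runtime (milliseconds)."""
    return io_kib * 1024 / (runtime_ms / 1000.0)


def iops_from_bandwidth(bw_bytes_per_s, block_bytes):
    """Average IOPS implied by a bandwidth at a fixed block size."""
    return bw_bytes_per_s / block_bytes


# Values taken from the "read" line of the fio output.
bw = bandwidth_bytes_per_s(io_kib=10904, runtime_ms=63737)
read_iops = iops_from_bandwidth(175183, 8 * 1024)

print(f"bandwidth: {bw:.0f} B/s")
print(f"read IOPS: {read_iops:.1f}")  # 21.4
```

Both values agree with what fio printed (bw=175183B/s and iops=21), so the low IOPS figure is an average over the whole 63.7 second run and not a reporting glitch.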
