Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Wei/Pavan for the responses; it seems I need to debug the OSDs to find out 
what is causing the slowdown.
I will update the community if I find anything conclusive.

Regards
Somnath

-Original Message-
From: Wei Jin [mailto:wjin...@gmail.com] 
Sent: Monday, October 17, 2016 2:13 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy <somnath@sandisk.com> wrote:
> Hi Sage et al.,
>
> I know this issue has been reported a number of times in the community and
> attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
> stressed with large block sizes and a very high queue depth (QD). With a lower
> QD it works just fine.
> We see a lossy connection message like the one below, followed by the OSD
> being marked down by the monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size 
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890 
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I see the OSD being reported down by its peers and the
> monitor subsequently marking it down.
> The OSD rejoins the cluster after detecting that it was marked down wrongly,
> and rebalancing starts. This is hurting performance very badly.

I think you need to tune the threads' timeout values: heartbeat messages will be
dropped while a thread is timed out or hits its suicide timeout (the internal
health check will fail). That is why you observe the 'wrongly marked me down'
message even though the osd process is still alive. See the function
OSD::handle_osd_ping().

Also, you could backport this PR (https://github.com/ceph/ceph/pull/8808) to
speed up the handling of heartbeat messages.

After that, you may consider tuning the grace time.
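
For reference, a minimal ceph.conf sketch of the kind of tuning meant here. The
values are illustrative assumptions only, not recommendations, and should be
validated against your own workload; the option names are the stock Jewel ones.

    [osd]
    # Give busy worker threads more headroom before the internal heartbeat
    # check considers them hung (defaults: 15s timeout, 150s suicide).
    osd op thread timeout = 30
    osd op thread suicide timeout = 300

    # FileStore worker threads (defaults: 60s timeout, 180s suicide).
    filestore op thread timeout = 120
    filestore op thread suicide timeout = 360

    # Only after the above, consider a slightly larger heartbeat grace
    # (default 20s); keep it modest so real failures are still caught quickly.
    osd heartbeat grace = 30

Restarting the OSDs after changing these is the safest way to make sure they
all take effect.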


>
> My questions are the following.
>
> 1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
> errors are reported. So why are these lossy connection messages appearing?
> What could go wrong here? Is it a network prioritization issue with the
> smaller ping packets? I watched ping round-trip times during this period and
> nothing seemed abnormal.
>
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
> headroom is left. So I doubt my OSDs are unresponsive, though they are really
> busy on the IO path. Heartbeats go through a separate messenger and threads
> as well, so busy op threads should not be delaying heartbeats. Increasing the
> osd heartbeat grace only postpones this phenomenon; it eventually happens
> after several hours. Is there anything else we can tune here?
>
> 3. What could be the side effects of a big grace period? I understand that
> detecting a faulty OSD will be delayed; anything else?
>
> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost
> instantaneously and does not wait for this grace period. How does it
> distinguish between unresponsive and crashed OSDs? In which scenario does
> this heartbeat grace come into the picture?
>
> Any help in clarifying this would be much appreciated.
>
> Thanks & Regards
> Somnath


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Wei Jin
On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy  wrote:
> Hi Sage et al.,
>
> I know this issue has been reported a number of times in the community and
> attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
> stressed with large block sizes and a very high queue depth (QD). With a lower
> QD it works just fine.
> We see a lossy connection message like the one below, followed by the OSD
> being marked down by the monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
> v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
> message
>
> In the monitor log, I see the OSD being reported down by its peers and the
> monitor subsequently marking it down.
> The OSD rejoins the cluster after detecting that it was marked down wrongly,
> and rebalancing starts. This is hurting performance very badly.

I think you need to tune the threads' timeout values: heartbeat messages will be
dropped while a thread is timed out or hits its suicide timeout (the internal
health check will fail). That is why you observe the 'wrongly marked me down'
message even though the osd process is still alive. See the function
OSD::handle_osd_ping().

Also, you could backport this PR (https://github.com/ceph/ceph/pull/8808) to
speed up the handling of heartbeat messages.

After that, you may consider tuning the grace time.
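
If restarting OSDs is inconvenient, the same knobs can usually be adjusted at
runtime. A sketch (the values are illustrative assumptions; some options may
only fully apply after a restart, and the grace is consulted on the monitor
side as well, so it is typically set there too):

    # Inject a larger heartbeat grace and thread timeouts into all running OSDs.
    ceph tell osd.* injectargs '--osd_heartbeat_grace 30 --osd_op_thread_timeout 30 --osd_op_thread_suicide_timeout 300'

    # Keep the monitors' view of the grace consistent with the OSDs'.
    ceph tell mon.* injectargs '--osd_heartbeat_grace 30'

    # Verify what a given OSD is actually running with (via its admin socket).
    ceph daemon osd.0 config get osd_heartbeat_grace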


>
> My questions are the following.
>
> 1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
> errors are reported. So why are these lossy connection messages appearing?
> What could go wrong here? Is it a network prioritization issue with the
> smaller ping packets? I watched ping round-trip times during this period and
> nothing seemed abnormal.
>
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
> headroom is left. So I doubt my OSDs are unresponsive, though they are really
> busy on the IO path. Heartbeats go through a separate messenger and threads
> as well, so busy op threads should not be delaying heartbeats. Increasing the
> osd heartbeat grace only postpones this phenomenon; it eventually happens
> after several hours. Is there anything else we can tune here?
>
> 3. What could be the side effects of a big grace period? I understand that
> detecting a faulty OSD will be delayed; anything else?
>
> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost
> instantaneously and does not wait for this grace period. How does it
> distinguish between unresponsive and crashed OSDs? In which scenario does
> this heartbeat grace come into the picture?
>
> Any help in clarifying this would be much appreciated.
>
> Thanks & Regards
> Somnath


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Pavan Rallabhandi
Regarding mon_osd_min_down_reports: I was looking at it recently, and this
commit could provide some insight:
https://github.com/ceph/ceph/commit/0269a0c17723fd3e22738f7495fe017225b924a4
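
As a quick way to see which of these down-report options a given build still
recognizes, the monitor can be asked directly. A sketch, assuming the default
admin socket location and a monitor named after the short hostname:

    # List every matching option on a running monitor.
    ceph daemon mon.$(hostname -s) config show | grep mon_osd_min_down

    # Or query a single option explicitly.
    ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters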

Thanks!

On 10/17/16, 1:36 PM, "ceph-users on behalf of Somnath Roy" wrote:

Thanks Piotr and Wido for the quick responses.

@Wido, yes, I thought of trying those values, but in the log messages I see at
least 7 OSDs reporting the failure, so I didn't try. BTW, I found that the
default mon_osd_min_down_reporters is 2, not 1, and the latest master no longer
has mon_osd_min_down_reports. Not sure what it was replaced with.

@Piotr, yes, your PR really helps, thanks! The point about each messenger
needing to respond to heartbeats is confusing; I know each thread has a
heartbeat timeout value beyond which it will crash with the suicide timeout.
Are you talking about that?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et al.,
>
> I know this issue has been reported a number of times in the community and
> attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
> stressed with large block sizes and a very high queue depth (QD). With a lower
> QD it works just fine.
> We see a lossy connection message like the one below, followed by the OSD
> being marked down by the monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I see the OSD being reported down by its peers and the
> monitor subsequently marking it down.
> The OSD rejoins the cluster after detecting that it was marked down wrongly,
> and rebalancing starts. This is hurting performance very badly.
>
> My questions are the following.
>
> 1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
> errors are reported. So why are these lossy connection messages appearing?
> What could go wrong here? Is it a network prioritization issue with the
> smaller ping packets? I watched ping round-trip times during this period and
> nothing seemed abnormal.
>
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
> headroom is left. So I doubt my OSDs are unresponsive, though they are really
> busy on the IO path. Heartbeats go through a separate messenger and threads
> as well, so busy op threads should not be delaying heartbeats. Increasing the
> osd heartbeat grace only postpones this phenomenon; it eventually happens
> after several hours. Is there anything else we can tune here?

There is a bunch of messengers in the OSD code; if ANY of them doesn't respond
to heartbeat messages in a reasonable time, the OSD is marked down. Since
packets are processed in a FIFO/synchronous manner, overloading an OSD with
large I/O will cause it to time out on at least one messenger.
There was an idea to have heartbeat messages go over the OOB TCP/IP stream and
be processed asynchronously, but I don't know if that went beyond the idea
stage.
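
One way to narrow this down is to look at which addresses the heartbeat
messengers are bound to and test them independently of the data path. A sketch,
assuming osd.12 is one of the flapping OSDs; in the 'ceph osd dump' output the
per-OSD line lists the public, cluster, heartbeat-back and heartbeat-front
endpoints:

    # Show the addresses registered for osd.12.
    ceph osd dump | grep -w osd.12

    # From a peer OSD host, check latency/loss towards the heartbeat endpoints
    # (substitute the IPs found above).
    ping -c 20 <hb_back_ip>
    ping -c 20 <hb_front_ip>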

> 3. What could be the side effects of a big grace period? I understand that
> detecting a faulty OSD will be delayed; anything else?

Yes - stalled ops. Assume that a primary OSD goes down while its replicas are
still alive. A big grace period will cause all ops going to that OSD to stall
until that particular OSD is marked down or resumes normal operation.
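
A sketch of how such stalls typically show up while the grace period runs out
(standard commands; osd.12 is just a placeholder):

    # Cluster-wide view: slow/blocked requests are listed in health detail.
    ceph health detail

    # Per-OSD view: ops currently stuck in flight on a suspect OSD
    # (needs access to that OSD's admin socket).
    ceph daemon osd.12 dump_ops_in_flight
    ceph daemon osd.12 dump_historic_ops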

> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost
> instantaneously and does not wait for this grace period. How does it
> distinguish between unresponsive and crashed OSDs? In which scenario does
> this heartbeat grace come into the picture?

This is the effect of my PR #8558 (https://github.com/ceph/ceph/pull/8558),
which causes any OSD that crashes to be immediately marked down, preventing
stalled I/Os in the most common cases. The grace period is only applied to
unresponsive OSDs (i.e. temporary packet loss, bad cases of network lag, routing
issues - in other words, everything that is known to be at least potentially
resolvable by itself in a finite amount of time). OSDs that crash and burn won't
respond at all; instead, the OS will respond with ECONNREFUSED, indicating that
the OSD is not listening, and in that case the OSD is immediately marked down.
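
If memory serves, the behaviour added there is gated by an option named
osd_fast_fail_on_connection_refused; treat that name as an assumption and
verify it against your build. A sketch for checking it on a running OSD:

    # Confirm the fast-fail-on-ECONNREFUSED behaviour is present and enabled
    # (the option name is assumed here, not confirmed against the PR).
    ceph daemon osd.0 config get osd_fast_fail_on_connection_refused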

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl
   

Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Piotr and Wido for the quick responses.

@Wido, yes, I thought of trying those values, but in the log messages I see at
least 7 OSDs reporting the failure, so I didn't try. BTW, I found that the
default mon_osd_min_down_reporters is 2, not 1, and the latest master no longer
has mon_osd_min_down_reports. Not sure what it was replaced with.

@Piotr, yes, your PR really helps, thanks! The point about each messenger
needing to respond to heartbeats is confusing; I know each thread has a
heartbeat timeout value beyond which it will crash with the suicide timeout.
Are you talking about that?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et al.,
>
> I know this issue has been reported a number of times in the community and
> attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
> stressed with large block sizes and a very high queue depth (QD). With a lower
> QD it works just fine.
> We see a lossy connection message like the one below, followed by the OSD
> being marked down by the monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I see the OSD being reported down by its peers and the
> monitor subsequently marking it down.
> The OSD rejoins the cluster after detecting that it was marked down wrongly,
> and rebalancing starts. This is hurting performance very badly.
>
> My questions are the following.
>
> 1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
> errors are reported. So why are these lossy connection messages appearing?
> What could go wrong here? Is it a network prioritization issue with the
> smaller ping packets? I watched ping round-trip times during this period and
> nothing seemed abnormal.
>
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
> headroom is left. So I doubt my OSDs are unresponsive, though they are really
> busy on the IO path. Heartbeats go through a separate messenger and threads
> as well, so busy op threads should not be delaying heartbeats. Increasing the
> osd heartbeat grace only postpones this phenomenon; it eventually happens
> after several hours. Is there anything else we can tune here?

There is a bunch of messengers in the OSD code; if ANY of them doesn't respond
to heartbeat messages in a reasonable time, the OSD is marked down. Since
packets are processed in a FIFO/synchronous manner, overloading an OSD with
large I/O will cause it to time out on at least one messenger.
There was an idea to have heartbeat messages go over the OOB TCP/IP stream and
be processed asynchronously, but I don't know if that went beyond the idea
stage.

> 3. What could be the side effects of a big grace period? I understand that
> detecting a faulty OSD will be delayed; anything else?

Yes - stalled ops. Assume that a primary OSD goes down while its replicas are
still alive. A big grace period will cause all ops going to that OSD to stall
until that particular OSD is marked down or resumes normal operation.

> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost
> instantaneously and does not wait for this grace period. How does it
> distinguish between unresponsive and crashed OSDs? In which scenario does
> this heartbeat grace come into the picture?

This is the effect of my PR #8558 (https://github.com/ceph/ceph/pull/8558),
which causes any OSD that crashes to be immediately marked down, preventing
stalled I/Os in the most common cases. The grace period is only applied to
unresponsive OSDs (i.e. temporary packet loss, bad cases of network lag, routing
issues - in other words, everything that is known to be at least potentially
resolvable by itself in a finite amount of time). OSDs that crash and burn won't
respond at all; instead, the OS will respond with ECONNREFUSED, indicating that
the OSD is not listening, and in that case the OSD is immediately marked down.

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl

Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Wido den Hollander

> On 17 October 2016 at 9:16, Somnath Roy wrote:
> 
> 
> Hi Sage et al.,
>
> I know this issue has been reported a number of times in the community and
> attributed to either network issues or unresponsive OSDs.
> Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
> stressed with large block sizes and a very high queue depth (QD). With a lower
> QD it works just fine.
> We see a lossy connection message like the one below, followed by the OSD
> being marked down by the monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144]
> v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping
> message
>
> In the monitor log, I see the OSD being reported down by its peers and the
> monitor subsequently marking it down.
> The OSD rejoins the cluster after detecting that it was marked down wrongly,
> and rebalancing starts. This is hurting performance very badly.
> 
> My questions are the following.
>
> 1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
> errors are reported. So why are these lossy connection messages appearing?
> What could go wrong here? Is it a network prioritization issue with the
> smaller ping packets? I watched ping round-trip times during this period and
> nothing seemed abnormal.
>
> 2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
> headroom is left. So I doubt my OSDs are unresponsive, though they are really
> busy on the IO path. Heartbeats go through a separate messenger and threads
> as well, so busy op threads should not be delaying heartbeats. Increasing the
> osd heartbeat grace only postpones this phenomenon; it eventually happens
> after several hours. Is there anything else we can tune here?
>
> 3. What could be the side effects of a big grace period? I understand that
> detecting a faulty OSD will be delayed; anything else?
> 

You might want to look at:

OPTION(mon_osd_min_down_reporters, OPT_INT, 1)  // number of OSDs who need to report a down OSD for it to count
OPTION(mon_osd_min_down_reports, OPT_INT, 3)    // number of times a down OSD must be reported for it to count

Setting 'mon_osd_min_down_reporters' to 3 means that 3 individual OSDs have to
mark an OSD as down. You could also increase the number of reports.

In larger environments I always set reporters to 3 or 5, just to prevent such
flapping.
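
A minimal sketch of that change; the value is illustrative and should match the
size and failure domains of your cluster:

    [mon]
    # Require reports from at least 3 distinct OSDs before marking a peer down.
    mon osd min down reporters = 3

    # Runtime equivalent on a live cluster:
    #   ceph tell mon.* injectargs '--mon_osd_min_down_reporters 3'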

> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost
> instantaneously and does not wait for this grace period. How does it
> distinguish between unresponsive and crashed OSDs? In which scenario does
> this heartbeat grace come into the picture?
> 

A crashed OSD will not be detected by the MON itself. It is the other OSDs that
inform the monitor about the OSD crashing, but you will have to wait for the
heartbeats to time out.

Only when an OSD shuts down gracefully will it mark itself down instantly.

Wido

> Any help in clarifying this would be much appreciated.
> 
> Thanks & Regards
> Somnath


[ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Hi Sage et al.,

I know this issue has been reported a number of times in the community and
attributed to either network issues or unresponsive OSDs.
Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
stressed with large block sizes and a very high queue depth (QD). With a lower
QD it works just fine.
We see a lossy connection message like the one below, followed by the OSD being
marked down by the monitor.

2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
[set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
message

In the monitor log, I see the OSD being reported down by its peers and the
monitor subsequently marking it down.
The OSD rejoins the cluster after detecting that it was marked down wrongly, and
rebalancing starts. This is hurting performance very badly.

My questions are the following.

1. I have a 40Gb network and it is not utilized beyond 10-12Gb/s; no network
errors are reported. So why are these lossy connection messages appearing? What
could go wrong here? Is it a network prioritization issue with the smaller ping
packets? I watched ping round-trip times during this period and nothing seemed
abnormal.

2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk
headroom is left. So I doubt my OSDs are unresponsive, though they are really
busy on the IO path. Heartbeats go through a separate messenger and threads as
well, so busy op threads should not be delaying heartbeats. Increasing the osd
heartbeat grace only postpones this phenomenon; it eventually happens after
several hours. Is there anything else we can tune here?

3. What could be the side effects of a big grace period? I understand that
detecting a faulty OSD will be delayed; anything else?

4. I saw that if an OSD crashes, the monitor detects the down OSD almost
instantaneously and does not wait for this grace period. How does it distinguish
between unresponsive and crashed OSDs? In which scenario does this heartbeat
grace come into the picture?

Any help in clarifying this would be much appreciated.

Thanks & Regards
Somnath