Re: [ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"

2017-09-20 Thread nokia ceph
Hello Henrik,

Thanks for the update. I applied the two changes below and will monitor the
cluster.

~~~

bluestore_deferred_throttle_bytes = 0

bluestore_throttle_bytes = 0

~~~
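
For reference, a minimal sketch of how these could be persisted in /etc/ceph/ceph.conf (placing them under [osd] is my assumption; the OSDs need a restart to pick up ceph.conf changes):

~~~
# Sketch only: carry the two throttle overrides across OSD restarts.
# The [osd] section is an assumption; [global] would also work.
[osd]
bluestore_deferred_throttle_bytes = 0
bluestore_throttle_bytes = 0
~~~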


Attached is the gdb trace captured after installing the debug RPMs.
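
For completeness, a rough sketch of how such a trace can be captured on RHEL 7 (the exact package names and the use of debuginfo-install are assumptions about this setup):

~~~
# Sketch: install debug symbols, then dump all thread backtraces of the stuck ceph-osd.
debuginfo-install -y ceph-osd          # or: yum install ceph-debuginfo
gdb -p $(pidof -s ceph-osd) -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt' > osd_bt.txt
# pidof -s returns a single PID; with several OSDs per host, use the exact PID instead.
~~~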


Thanks




On Wed, Sep 20, 2017 at 12:18 PM, Henrik Korkuc wrote:

> On 17-09-20 08:06, nokia ceph wrote:
>
> Hello,
>
> Env: RHEL 7.2, kernel 3.10.0-327.el7.x86_64, EC 4+1, BlueStore
>
> We are writing to Ceph via the librados C API. Testing with rados directly
> shows no issues.
>
> We tested the same workload with Jewel/Kraken without any issue. Could you
> advise on how to debug this?
>
> Maybe similar to http://tracker.ceph.com/issues/21180? It seems it was
> resolved for me with the mentioned fix. You could apply the mentioned config
> options and see if they help (and build a newer version, if able).
>
> [quoted OSD log and gdb backtrace trimmed; see the original message below]

Re: [ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"

2017-09-20 Thread Henrik Korkuc

On 17-09-20 08:06, nokia ceph wrote:

Hello,

Env: RHEL 7.2, kernel 3.10.0-327.el7.x86_64, EC 4+1, BlueStore

We are writing to Ceph via the librados C API. Testing with rados directly
shows no issues.

We tested the same workload with Jewel/Kraken without any issue. Could you
advise on how to debug this?
Maybe similar to http://tracker.ceph.com/issues/21180? It seems it was
resolved for me with the mentioned fix. You could apply the mentioned config
options and see if they help (and build a newer version, if able).
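
For anyone wanting to try those options on running OSDs first, a hedged sketch of injecting them at runtime (I have not verified whether they take full effect without an OSD restart):

~~~
# Sketch: push the throttle overrides from issue 21180 to all running OSDs.
ceph tell osd.* injectargs '--bluestore_deferred_throttle_bytes 0 --bluestore_throttle_bytes 0'
# Verify on one daemon via its admin socket (run on that OSD's host):
ceph daemon osd.181 config get bluestore_throttle_bytes
~~~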


[quoted OSD log and gdb backtrace trimmed; see the original message below]

[ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"

2017-09-19 Thread nokia ceph
Hello,

Env: RHEL 7.2, kernel 3.10.0-327.el7.x86_64, EC 4+1, BlueStore
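
As background, an EC 4+1 pool like ours might be created roughly as follows (profile name, pool name and PG counts are made-up examples, not our actual settings):

~~~
# Sketch only: a k=4, m=1 erasure-code profile and a pool using it.
ceph osd erasure-code-profile set ec-4-1 k=4 m=1 crush-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec-4-1
~~~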

We are writing to Ceph via the librados C API. Testing with rados directly
shows no issues.

We tested the same workload with Jewel/Kraken without any issue. Could you
advise on how to debug this?
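
To make the write path concrete, here is a minimal sketch of the kind of librados C write we are doing (pool name, object name and error handling are illustrative, not our actual application code):

~~~
/* Minimal librados C write sketch; build with: gcc write_obj.c -o write_obj -lrados */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    const char *pool = "testpool";            /* hypothetical pool name */
    const char *data = "hello bluestore";
    int ret;

    /* Create the cluster handle as client.admin and load the local ceph.conf. */
    ret = rados_create(&cluster, "admin");
    if (ret < 0) { fprintf(stderr, "rados_create: %d\n", ret); return 1; }
    ret = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    if (ret < 0) { fprintf(stderr, "rados_conf_read_file: %d\n", ret); return 1; }
    ret = rados_connect(cluster);
    if (ret < 0) { fprintf(stderr, "rados_connect: %d\n", ret); return 1; }

    /* Open an I/O context on the pool and write one full object. */
    ret = rados_ioctx_create(cluster, pool, &ioctx);
    if (ret < 0) { fprintf(stderr, "rados_ioctx_create: %d\n", ret); goto out; }
    ret = rados_write_full(ioctx, "obj1", data, strlen(data));
    if (ret < 0)
        fprintf(stderr, "rados_write_full: %d\n", ret);

    rados_ioctx_destroy(ioctx);
out:
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}
~~~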

>>

OSD.log
==

~~~

2017-09-18 14:51:59.895746 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.068824 seconds old, received at 2017-09-18 14:50:59.826849:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.895749 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.068737 seconds old, received at 2017-09-18 14:50:59.826936:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-09-18 14:51:59.895754 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.067539 seconds old, received at 2017-09-18 14:50:59.828134:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.923825 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
2017-09-18 14:51:59.923835 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
2017-09-18 14:51:59.923837 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
2017-09-18 14:51:59.923840 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
2017-09-18 14:51:59.923842 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
2017-09-18 14:51:59.940780 7f1e77ca5700 20 osd.181 1350 share_map_peer
0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:51:59.940855 7f1e78ca7700 20 osd.181 1350 share_map_peer
0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:52:00.081390 7f1e6f572700 20 osd.181 1350 OSD::ms_dispatch:
ping magic: 0 v1
2017-09-18 14:52:00.081393 7f1e6f572700 10 osd.181 1350 do_waiters -- start
2017-09-18 14:52:00.081394 7f1e6f572700 10 osd.181 1350 do_waiters -- finish
2017-09-18 14:52:00.081395 7f1e6f572700 20 osd.181 1350 _dispatch
0x7f1e90923a40 ping magic: 0 v1
2017-09-18 14:52:00.081397 7f1e6f572700 10 osd.181 1350 ping from
client.414556
2017-09-18 14:52:00.123908 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
2017-09-18 14:52:00.123926 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
2017-09-18 14:52:00.123932 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
2017-09-18 14:52:00.123937 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
2017-09-18 14:52:00.123942 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
2017-09-18 14:52:00.145445 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.145450 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
2017-09-18 14:52:00.145496 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e63cbf700' had timed out after 60
2017-09-18 14:52:00.145534 7f1e784a6700 10 osd.181 1350 internal heartbeat
not healthy, dropping ping request
2017-09-18 14:52:00.146224 7f1e78ca7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.146226 7f1e78ca7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60

~~~

 thread apply all bt

Thread 54 (LWP 479360):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 53 (LWP 484888):
#0  0x7f1e7a644b7d in ?? ()
#1  0x in ?? ()

Thread 52 (LWP 484177):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x000a in ?? ()
#2  0x7f1e88d8df98 in ?? ()
#3  0x7f1e88d8df48 in ?? ()
#4  0x000a in ?? ()
#5  0x7f1e5ccaf7f8 in ?? ()
#6  0x7f1e7e45b9ee in ?? ()
#7  0x7f1e88d8d860 in ?? ()
#8  0x7f1e8e6e5500 in ?? ()
#9  0x7f1e889881c0 in ?? ()
#10 0x7f1e7e3e9ea0 in ?? ()
#11 0x in ?? ()

Thread 51 (LWP 484176):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 50 (LWP 484175):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 49 (LWP 484174):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

~~~

Thanks