Re: [ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"
Hello Henrik,

Thanks for the update. I applied the two changes below and will monitor the cluster.

~~~
bluestore_deferred_throttle_bytes = 0
bluestore_throttle_bytes = 0
~~~

Attached is the gdb trace, taken after installing the debug rpms.

Thanks

On Wed, Sep 20, 2017 at 12:18 PM, Henrik Korkuc wrote:
> On 17-09-20 08:06, nokia ceph wrote:
> > Hello,
> >
> > Env:- RHEL 7.2, 3.10.0-327.el7.x86_64, EC 4+1, bluestore
> >
> > We are writing to ceph via the librados C API. Testing with plain rados shows no issues.
> >
> > The same setup we tested with Jewel/Kraken without any issue. Need your view on how to debug this issue?
>
> Maybe similar to http://tracker.ceph.com/issues/21180? It seems it was
> resolved for me with the mentioned fix. You could apply the mentioned config
> options and see if it helps (and build a newer version, if able).
>
> [quoted OSD log and gdb backtrace snipped; see the original message below]
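For readers reproducing this, a sketch of how a symbol-resolved backtrace like the attached one can be captured on RHEL; the exact package names, pid, and output path below are illustrative, not the poster's actual commands:

~~~
# Install debug symbols so gdb can resolve ceph-osd frames (RHEL; needs yum-utils).
debuginfo-install ceph-osd

# Capture a full backtrace from the hung OSD process; substitute the real pid
# (pidof ceph-osd returns several pids when a host runs multiple OSDs).
gdb -p <osd-pid> --batch -ex 'thread apply all bt' > /tmp/osd.181.bt
~~~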
Re: [ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"
On 17-09-20 08:06, nokia ceph wrote:
> Hello,
>
> Env:- RHEL 7.2, 3.10.0-327.el7.x86_64, EC 4+1, bluestore
>
> We are writing to ceph via the librados C API. Testing with plain rados shows no issues.
>
> The same setup we tested with Jewel/Kraken without any issue. Need your view on how to debug this issue?

Maybe similar to http://tracker.ceph.com/issues/21180? It seems it was resolved for me with the mentioned fix. You could apply the mentioned config options and see if it helps (and build a newer version, if able).

> [quoted OSD log and gdb backtrace snipped; see the original message below]
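A sketch of one way those options could be applied on a Luminous cluster; whether runtime injection takes effect for these options without an OSD restart is not confirmed here:

~~~
# Runtime injection across all OSDs (may report "not observed" and need a restart):
ceph tell osd.* injectargs '--bluestore_deferred_throttle_bytes=0 --bluestore_throttle_bytes=0'

# Persistent equivalent in /etc/ceph/ceph.conf, followed by an OSD restart:
[osd]
bluestore_deferred_throttle_bytes = 0
bluestore_throttle_bytes = 0
~~~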
[ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"
Hello,

Env:- RHEL 7.2, 3.10.0-327.el7.x86_64, EC 4+1, bluestore

We are writing to ceph via the librados C API. Testing with plain rados shows no issues.

The same setup we tested with Jewel/Kraken without any issue. Need your view on how to debug this issue?
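For context, a minimal sketch of the kind of librados C write path described above; the pool name, object name, and error handling are illustrative, not the actual application code:

~~~
/* Minimal librados C write sketch (illustrative only).
 * Build: cc write_obj.c -o write_obj -lrados
 */
#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    const char data[] = "payload";
    int ret;

    ret = rados_create(&cluster, NULL);        /* connect as client.admin */
    if (ret < 0) { fprintf(stderr, "rados_create: %d\n", ret); return 1; }
    rados_conf_read_file(cluster, NULL);       /* default ceph.conf search path */
    ret = rados_connect(cluster);
    if (ret < 0) { fprintf(stderr, "rados_connect: %d\n", ret); return 1; }

    /* "ecpool" is a placeholder for the EC 4+1 pool name */
    ret = rados_ioctx_create(cluster, "ecpool", &io);
    if (ret < 0) {
        fprintf(stderr, "rados_ioctx_create: %d\n", ret);
        rados_shutdown(cluster);
        return 1;
    }

    /* full-object write; EC pools accept full/append writes */
    ret = rados_write_full(io, "obj1", data, sizeof(data) - 1);
    if (ret < 0)
        fprintf(stderr, "rados_write_full: %d\n", ret);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return ret < 0;
}
~~~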
OSD.log
==
~~~
2017-09-18 14:51:59.895746 7f1e744e0700  0 log_channel(cluster) log [WRN] : slow request 60.068824 seconds old, received at 2017-09-18 14:50:59.826849: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.895749 7f1e744e0700  0 log_channel(cluster) log [WRN] : slow request 60.068737 seconds old, received at 2017-09-18 14:50:59.826936: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-09-18 14:51:59.895754 7f1e744e0700  0 log_channel(cluster) log [WRN] : slow request 60.067539 seconds old, received at 2017-09-18 14:50:59.828134: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.923825 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1359 k (1083 k + 276 k)
2017-09-18 14:51:59.923835 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1066 k (1066 k + 0 )
2017-09-18 14:51:59.923837 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 643 k (643 k + 0 )
2017-09-18 14:51:59.923840 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1049 k (1049 k + 0 )
2017-09-18 14:51:59.923842 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 896 k (896 k + 0 )
2017-09-18 14:51:59.940780 7f1e77ca5700 20 osd.181 1350 share_map_peer 0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:51:59.940855 7f1e78ca7700 20 osd.181 1350 share_map_peer 0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:52:00.081390 7f1e6f572700 20 osd.181 1350 OSD::ms_dispatch: ping magic: 0 v1
2017-09-18 14:52:00.081393 7f1e6f572700 10 osd.181 1350 do_waiters -- start
2017-09-18 14:52:00.081394 7f1e6f572700 10 osd.181 1350 do_waiters -- finish
2017-09-18 14:52:00.081395 7f1e6f572700 20 osd.181 1350 _dispatch 0x7f1e90923a40 ping magic: 0 v1
2017-09-18 14:52:00.081397 7f1e6f572700 10 osd.181 1350 ping from client.414556
2017-09-18 14:52:00.123908 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1359 k (1083 k + 276 k)
2017-09-18 14:52:00.123926 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1066 k (1066 k + 0 )
2017-09-18 14:52:00.123932 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 643 k (643 k + 0 )
2017-09-18 14:52:00.123937 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1049 k (1049 k + 0 )
2017-09-18 14:52:00.123942 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 896 k (896 k + 0 )
2017-09-18 14:52:00.145445 7f1e784a6700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.145450 7f1e784a6700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
2017-09-18 14:52:00.145496 7f1e784a6700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e63cbf700' had timed out after 60
2017-09-18 14:52:00.145534 7f1e784a6700 10 osd.181 1350 internal heartbeat not healthy, dropping ping request
2017-09-18 14:52:00.146224 7f1e78ca7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.146226 7f1e78ca7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
~~~

thread apply all bt
~~~
Thread 54 (LWP 479360):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 53 (LWP 484888):
#0  0x7f1e7a644b7d in ?? ()
#1  0x in ?? ()

Thread 52 (LWP 484177):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x000a in ?? ()
#2  0x7f1e88d8df98 in ?? ()
#3  0x7f1e88d8df48 in ?? ()
#4  0x000a in ?? ()
#5  0x7f1e5ccaf7f8 in ?? ()
#6  0x7f1e7e45b9ee in ?? ()
#7  0x7f1e88d8d860 in ?? ()
#8  0x7f1e8e6e5500 in ?? ()
#9  0x7f1e889881c0 in ?? ()
#10 0x7f1e7e3e9ea0 in ?? ()
#11 0x in ?? ()

Thread 51 (LWP 484176):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 50 (LWP 484175):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 49 (LWP 484174):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()
~~~

Thanks

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com