Hi all,

I'm experiencing some stability issues with my ceph-mgr daemons following the 
Nautilus upgrade (13.2.6 -> 14.2.2).
The cluster runs Ubuntu 18.04; the mgrs are colocated with the mons, and 2 of 
the 3 mon/mgr nodes also host an MDS, running a single rank in 
active:standby-replay.

The main issue I am seeing is that mgrs silently go unresponsive, to where 
ceph -s will report "no active mgr", even though none of the mgr processes 
have actually crashed or exited. They typically come back up/in after some 
amount of time, but in the meantime metrics stall and mgr commands hang.
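
When it's in that state, these are the quick checks I run to confirm the 
processes really are still alive (assuming the standard systemd unit names, 
and jq on the box):

$ ceph -s | grep -A1 mgr                        # shows "no active mgr" during a stall
$ ceph mgr dump | jq -r '.active_name, .available'
$ systemctl is-active ceph-mgr@$(hostname -s)   # systemd still reports "active"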

> $ ceph mgr module ls
> {
>     "enabled_modules": [
>         "diskprediction_local",
>         "influx",
>         "iostat",
>         "prometheus",
>         "rbd_support",
>         "restful",
>         "telemetry"
>     ],
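
If it would help narrow things down, I'm happy to bisect by disabling the 
enabled modules one at a time, e.g.:

$ ceph mgr module disable influx
$ ceph mgr module disable diskprediction_local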


I ran the mgr with debug=20 and was unable to find anything terribly useful 
there, but I can try to reproduce it if that seems helpful.
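
For reference, this is roughly how I raised the log levels (via the 
centralized config; the admin socket should work just as well):

$ ceph config set mgr debug_mgr 20/20
$ ceph config set mgr debug_ms 1/1
# or, per daemon via the admin socket:
$ sudo ceph daemon mgr.$(hostname -s) config set debug_mgr 20/20
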
The only real non-standard bit of my mgr instances is that I cherry-picked 
this change 
<https://github.com/ceph/ceph/pull/28427/files#diff-5add36422933f238d3aaf5291e106489>
 to module.py to cure an issue with the influx plugin.
These stalls typically happen while the mgr is servicing a command, and 
lately that command has been the balancer.

For example:
> $ time ceph balancer eval $POOL --debug-ms 1
> 2019-08-14 11:36:21.370 7f4990601700  1  Processor -- start
> 2019-08-14 11:36:21.370 7f4990601700  1 --  start start
> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.250:3300/0 
> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 
> tx=0).connect
> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.249:3300/0 
> conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 
> tx=0).connect
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.249:3300/0 -- 
> mon_getmap magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.250:3300/0 -- 
> mon_getmap magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 
> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 
> l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 
> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 
> l=0 rx=0 tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am 
> v2:10.0.1.250:44952/0 (socket says 10.0.1.250:44952)
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 
> learned_addr learned my addr 10.0.1.250:0/2021657657 (peer_addr_for_me 
> v2:10.0.1.250:0/0)
> 2019-08-14 11:36:21.370 7f498ebc8700  1 --2- 10.0.1.250:0/2021657657 >> 
> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 
> s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload 
> supported=0 required=0
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> 
> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1 
> s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> 
> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 
> s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> 
> v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1 
> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 --> 
> v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 
> 0x7f498808d1b0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
> 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready 
> entity=mon.1 client_cookie=10837f5a102e309b server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
> v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
> 0x7f497c0020d0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
> v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 
> 0x7f497c002410 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
> v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
> 0x7f497c002b40 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
> msgr2=0x7f4988169ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
> 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 
> shutdown_connections
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED 
> pgs=0 cs=0 l=0 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
> 0x7f4988169ae0 unknown :-1 s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 
> shutdown_connections
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 wait 
> complete.
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 
> 10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 
> s=STATE_NONE l=0).mark_down
> 2019-08-14 11:36:21.374 7f4990601700  1  Processor -- start
> 2019-08-14 11:36:21.374 7f4990601700  1 --  start start
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.249:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.251:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0
> 2019-08-14 11:36:21.374 7f498ebc8700  1 --1-  >> v1:10.0.1.250:6789/0 
> conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY 
> pgs=0 cs=0 l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 
> says I am v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466)
> 2019-08-14 11:36:21.374 7f498ebc8700  1 -- 10.0.1.250:0/40211647 learned_addr 
> learned my addr 10.0.1.250:0/40211647 (peer_addr_for_me v1:10.0.1.250:0/0)
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
> v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 
> (unknown 3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
> v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 
> con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
> v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 935+0+0 
> (unknown 1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
> v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 
> con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
> v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 
> 1022418838 0 0) 0x7f497c0024e0 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
> v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1 
> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1 
> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
> v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1 
> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
> 0x7f496c0049c0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> 0x7f496c005360 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
> 0x7f496c00bfe0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 
> 0x7f496c00c530 con 0x7f496c009b80
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 
> 0x7f496c00c6e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 
> 0x7f496c00c890 con 0x7f496c003ed0
> 2019-08-14 11:36:21.374 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> 0x7f496c005360 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.374 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
> 0x7f496c0049c0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.374 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
> 0x7f496c00bfe0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
> msgr2=0x7f496c0049c0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
> 0x7f496c0049c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
> msgr2=0x7f496c00bfe0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
> 0x7f496c00bfe0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- 
> mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f497c008180 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 
> -- 0x7f49880b47e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
> 0x7f496c00bfe0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 
> tx=0).handle_auth_reply_more state changed!
> 2019-08-14 11:36:21.378 7f498ebc8700  1 -- 10.0.1.250:0/40211647 reap_dead 
> start
> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 
> -- 0x7f49880a8360 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready 
> entity=mon.1 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
> 0x7f49880b47e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 
> 0x7f497c0010a0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
> 0x7f497c001490 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 
> 0x7f497c001690 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> 0x7f496c02ace0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) v4 
> ==== 106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> 0x7f496c02ace0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready 
> entity=mgr.1731601147 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.382 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix": 
> "get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.390 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
> v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix": 
> "get_command_descriptions"}]=0  v0) v1 ==== 72+0+101614 (crc 0 0 0) 
> 0x7f497c000e90 con 0x7f496c004f10
> 2019-08-14 11:36:21.586 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: 
> {"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}) v1 -- 
> 0x7f4968002150 con 0x7f496c028790
> 2019-08-14 11:37:40.274 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== 
> mgr.1731601147 v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 
> ==== 8+0+47 (crc 0 0 0) 0x7f4968002150 con 0x7f496c028790
> pool "$POOL" score 0.009701 (lower is better)
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> msgr2=0x7f496c005360 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 
> shutdown_connections
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
> 0x7f496c02ace0 unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
> 0x7f496c005360 unknown :-1 s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 
> shutdown_connections
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 wait 
> complete.
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
> 10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 
> s=STATE_NONE l=0).mark_down
> 
> real    1m19.023s
> user    0m0.288s
> sys     0m0.050s

While the mgr didn't die this time, it did take ~80 seconds to eval a pool, 
which typically takes roughly 2-5 seconds to run. And while the command was 
running, the ceph-mgr process was sitting at around 100% CPU in top.
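
Next time it spins I can try to catch which thread is hot and grab 
backtraces, something along these lines (debug symbols would be needed for 
the gdb output to be useful):

$ top -b -H -n1 -p $(pidof ceph-mgr) | head -20                        # per-thread CPU
$ sudo gdb -p $(pidof ceph-mgr) --batch -ex 'thread apply all bt' > mgr-bt.txt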

> 2019-08-14 11:36:20.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:36:20.734 7fbd8a95a700  1 mgr finish mon failed to return 
> metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:36:21.586 7fbd77570700  0 log_channel(audit) log [DBG] : 
> from='client.1731656635 -' entity='client.admin' cmd=[{"prefix": "balancer 
> eval", "option": "$POOL", "target": ["mgr", ""]}]: dispatch
> 2019-08-14 11:36:21.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:36:21.734 7fbd8a95a700  1 mgr finish mon failed to return 
> metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:36:22.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> [snip-repeating]

> 2019-08-14 11:37:38.826 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:38.826 7fbd8a95a700  1 mgr finish mon failed to return 
> metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:37:39.574 7fbd77d71700  1 mgr[balancer] Handling command: 
> '{'prefix': 'balancer eval', 'option': '$POOL', 'target': ['mgr', '']}'
> 2019-08-14 11:37:40.186 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:40.186 7fbd8a95a700  1 mgr finish mon failed to return 
> metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:37:41.054 7fbd77570700  0 ms_deliver_dispatch: unhandled 
> message 0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
> mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:41.054 7fbd8a95a700  1 mgr finish mon failed to return 
> metadata for mds.db: (2) No such file or directory


I filtered out a bunch of messages from the mgr servicing `osd pool stats` 
requests, which I have running in `watch -n1 ceph osd pool stats` from 
another terminal. I also snipped a long run of repeating messages about 
mds.db, the standby-replay MDS; I'm not sure whether those are worth caring 
about.
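
For cross-checking the mds.db metadata errors above (consistent with the 
ENOENT the mon returns), this is what I'd poke at; commands from memory, so 
treat it as a sketch:

$ ceph fs status
$ ceph mds metadata        # with no argument, dumps metadata for every MDS the mon knows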

I'm hoping someone has ideas on where to look to improve the stability of 
the mgrs.

Thanks,

Reed
