Thanks for that insight.

My main metrics source is the influx plugin, but I enabled the prometheus 
plugin to get access to the per-RBD-image metrics.
I may disable prometheus and see if that yields better stability, at least 
until the influx plugin gets updated to export those metrics.
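
In case it's useful, the toggles involved are roughly the following (a rough 
sketch; mgr/prometheus/rbd_stats_pools is the knob I'm assuming drives the 
per-image stats, and the pool name is a placeholder):

$ ceph mgr module disable prometheus
$ ceph mgr module ls
# if/when re-enabling, the per-rbd image stats would come from something like:
$ ceph config set mgr mgr/prometheus/rbd_stats_pools "my-rbd-pool"
$ ceph mgr module enable prometheus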

Reed

> On Aug 14, 2019, at 12:08 PM, shubjero <shubj...@gmail.com> wrote:
> 
> I'm having a similar issue with ceph-mgr stability problems since
> upgrading from 13.2.5 to 13.2.6. I have isolated the crashing to the
> prometheus module being enabled, and notice much better stability when
> the prometheus module is NOT enabled. No more failovers; however, even
> with prometheus disabled, my haproxy keepalive checks for the
> ceph-dashboard occasionally fail their Layer-4 checks, though not often
> enough to make the ceph-dashboard unusable. I had no perceived issues
> on 13.2.5. I have opened a ceph bug for 13.2.6, which they have
> acknowledged as a duplicate of another ticket. Thankfully we also use
> Zabbix for monitoring, so until this is fixed I am relying on Zabbix
> metrics instead of metrics in Prometheus/Grafana.
> 
> Ceph bug here: https://tracker.ceph.com/issues/41171
> 
> # ceph mgr module ls
> {
>    "enabled_modules": [
>        "balancer",
>        "crash",
>        "dashboard",
>        "restful",
>        "status",
>        "zabbix"
>    ],
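> 
> For anyone trying to reproduce, the toggling and crash checks were roughly
> along these lines (a rough sketch; I'm assuming the crash module's ls/info
> commands here, and the crash id is a placeholder):
> 
> # ceph mgr module disable prometheus
> # ceph mgr module ls
> # ceph crash ls
> # ceph crash info <crash-id>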
> 
> On Wed, Aug 14, 2019 at 12:09 PM Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> Hi all,
>> 
>> I'm experiencing some stability issues with my ceph-mgr daemons since the 
>> Nautilus upgrade (13.2.6->14.2.2).
>> On Ubuntu 18.04, the mgrs are colocated with the mons, and 2 of the 
>> 3 mon/mgr nodes also colocate an MDS, running a single rank 
>> active:standby-replay.
>> 
>> The main issue I am seeing is that the mgrs will silently go inactive, to 
>> the point where ceph -s reports "no active mgr", even though none of the 
>> mgr daemons have actually crashed or exited. Typically they end up coming 
>> back up/in after some amount of time, but in the meantime metrics stall 
>> and mgr commands stall/hang.
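>> 
>> For reference, the sort of commands I poke at when it is stuck in that 
>> state are roughly the following (a rough sketch; the mgr and host names 
>> are placeholders):
>> 
>> $ ceph mgr dump | grep -E 'active_name|available'
>> $ ceph mgr fail <active-mgr-name>
>> $ systemctl restart ceph-mgr@<hostname>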
>> 
>> $ ceph mgr module ls
>> {
>>    "enabled_modules": [
>>        "diskprediction_local",
>>        "influx",
>>        "iostat",
>>        "prometheus",
>>        "rbd_support",
>>        "restful",
>>        "telemetry"
>>    ],
>> 
>> 
>> I ran the mgr with debug=20 and was unable to find anything terribly useful 
>> there, but can try to reproduce things if that seems helpful.
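>> 
>> For what it's worth, the log level was bumped with roughly the following (a 
>> rough sketch; the mgr id is a placeholder, and either the centralized 
>> config or the admin socket route should work):
>> 
>> $ ceph config set mgr debug_mgr 20/20
>> $ ceph daemon mgr.<id> config set debug_mgr 20/20
>> 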
>> The only real non-standard bit of my mgr instances is that I cherry-picked 
>> this change to module.py to cure an issue with the influx plugin.
>> Typically these stalls happen when the mgr is servicing a command, and 
>> lately that command has been a balancer run.
>> 
>> For example:
>> 
>> $ time ceph balancer eval $POOL --debug-ms 1
>> 2019-08-14 11:36:21.370 7f4990601700  1  Processor -- start
>> 2019-08-14 11:36:21.370 7f4990601700  1 --  start start
>> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.250:3300/0 
>> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 
>> tx=0).connect
>> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.249:3300/0 
>> conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 
>> tx=0).connect
>> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- 
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0
>> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.249:3300/0 -- 
>> mon_getmap magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0
>> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.250:3300/0 -- 
>> mon_getmap magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 
>> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 
>> cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 
>> conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 
>> l=0 rx=0 tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am 
>> v2:10.0.1.250:44952/0 (socket says 10.0.1.250:44952)
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 
>> learned_addr learned my addr 10.0.1.250:0/2021657657 (peer_addr_for_me 
>> v2:10.0.1.250:0/0)
>> 2019-08-14 11:36:21.370 7f498ebc8700  1 --2- 10.0.1.250:0/2021657657 >> 
>> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 
>> s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload 
>> supported=0 required=0
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> 
>> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1 
>> s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> 
>> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 
>> s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> 
>> v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1 
>> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 --> 
>> v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 
>> 0x7f498808d1b0 con 0x7f4988168a80
>> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
>> 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready 
>> entity=mon.1 client_cookie=10837f5a102e309b server_cookie=0 in_seq=0 
>> out_seq=0
>> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
>> v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
>> 0x7f497c0020d0 con 0x7f4988168a80
>> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
>> v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 
>> 0x7f497c002410 con 0x7f4988168a80
>> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 
>> v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
>> 0x7f497c002b40 con 0x7f4988168a80
>> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
>> msgr2=0x7f4988169ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
>> 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 
>> shutdown_connections
>> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
>> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED 
>> pgs=0 cs=0 l=0 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 
>> 0x7f4988169ae0 unknown :-1 s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 
>> shutdown_connections
>> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 wait 
>> complete.
>> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 
>> 10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 
>> s=STATE_NONE l=0).mark_down
>> 2019-08-14 11:36:21.374 7f4990601700  1  Processor -- start
>> 2019-08-14 11:36:21.374 7f4990601700  1 --  start start
>> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.249:6789/0 -- 
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- 
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030
>> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.251:6789/0 -- 
>> auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0
>> 2019-08-14 11:36:21.374 7f498ebc8700  1 --1-  >> v1:10.0.1.250:6789/0 
>> conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY 
>> pgs=0 cs=0 l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 
>> says I am v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466)
>> 2019-08-14 11:36:21.374 7f498ebc8700  1 -- 10.0.1.250:0/40211647 
>> learned_addr learned my addr 10.0.1.250:0/40211647 (peer_addr_for_me 
>> v1:10.0.1.250:0/0)
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
>> v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 
>> (unknown 3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
>> v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 
>> con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
>> v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 
>> 935+0+0 (unknown 1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
>> v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 
>> con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 
>> v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 
>> 1022418838 0 0) 0x7f497c0024e0 con 0x7f49880a8360
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
>> v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1 
>> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
>> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1 
>> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> 
>> v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1 
>> s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
>> 0x7f496c0049c0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> 0x7f496c005360 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
>> 0x7f496c00bfe0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 
>> 0x7f496c00c530 con 0x7f496c009b80
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 
>> 0x7f496c00c6e0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 
>> 0x7f496c00c890 con 0x7f496c003ed0
>> 2019-08-14 11:36:21.374 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> 0x7f496c005360 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
>> tx=0)._handle_peer_banner_payload supported=0 required=0
>> 2019-08-14 11:36:21.374 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
>> 0x7f496c0049c0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
>> tx=0)._handle_peer_banner_payload supported=0 required=0
>> 2019-08-14 11:36:21.374 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
>> 0x7f496c00bfe0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
>> tx=0)._handle_peer_banner_payload supported=0 required=0
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
>> msgr2=0x7f496c0049c0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED 
>> l=1).mark_down
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 
>> 0x7f496c0049c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
>> msgr2=0x7f496c00bfe0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED 
>> l=1).mark_down
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
>> 0x7f496c00bfe0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- 
>> mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f497c008180 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 
>> -- 0x7f49880b47e0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 
>> 0x7f496c00bfe0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 
>> tx=0).handle_auth_reply_more state changed!
>> 2019-08-14 11:36:21.378 7f498ebc8700  1 -- 10.0.1.250:0/40211647 reap_dead 
>> start
>> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 
>> -- 0x7f49880a8360 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready 
>> entity=mon.1 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
>> 0x7f49880b47e0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 
>> 0x7f497c0010a0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 
>> 0x7f497c001490 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 
>> 0x7f497c001690 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> 0x7f496c02ace0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
>> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) 
>> v4 ==== 106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> 0x7f496c02ace0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 
>> tx=0)._handle_peer_banner_payload supported=0 required=0
>> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready 
>> entity=mgr.1731601147 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
>> 2019-08-14 11:36:21.382 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix": 
>> "get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10
>> 2019-08-14 11:36:21.390 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 
>> v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix": 
>> "get_command_descriptions"}]=0  v0) v1 ==== 72+0+101614 (crc 0 0 0) 
>> 0x7f497c000e90 con 0x7f496c004f10
>> 2019-08-14 11:36:21.586 7f4990601700  1 -- 10.0.1.250:0/40211647 --> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: 
>> {"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}) v1 -- 
>> 0x7f4968002150 con 0x7f496c028790
>> 2019-08-14 11:37:40.274 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== 
>> mgr.1731601147 v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 
>> ==== 8+0+47 (crc 0 0 0) 0x7f4968002150 con 0x7f496c028790
>> pool "$POOL" score 0.009701 (lower is better)
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> msgr2=0x7f496c005360 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
>> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 
>> shutdown_connections
>> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
>> 0x7f496c02ace0 unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> 
>> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 
>> 0x7f496c005360 unknown :-1 s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 
>> shutdown_connections
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 wait 
>> complete.
>> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 
>> 10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 
>> s=STATE_NONE l=0).mark_down
>> 
>> real    1m19.023s
>> user    0m0.288s
>> sys     0m0.050s
>> 
>> 
>> While the mgr didn't die, it did take 80 seconds to eval a pool, something 
>> that typically takes roughly 2-5 seconds to run.
>> And while this command was running, the ceph-mgr process was sitting at 
>> around 100% CPU in top.
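>> 
>> If it helps narrow things down, the per-thread CPU and the mgr's own 
>> counters can be looked at with something like (a rough sketch; the mgr id 
>> is a placeholder):
>> 
>> $ top -H -p $(pidof ceph-mgr)
>> $ ceph daemon mgr.<id> perf dump
>> 
>> The mgr's log from the same window is below: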
>> 
>> 2019-08-14 11:36:20.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> 2019-08-14 11:36:20.734 7fbd8a95a700  1 mgr finish mon failed to return 
>> metadata for mds.db: (2) No such file or directory
>> 2019-08-14 11:36:21.586 7fbd77570700  0 log_channel(audit) log [DBG] : 
>> from='client.1731656635 -' entity='client.admin' cmd=[{"prefix": "balancer 
>> eval", "option": "$POOL", "target": ["mgr", ""]}]: dispatch
>> 2019-08-14 11:36:21.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> 2019-08-14 11:36:21.734 7fbd8a95a700  1 mgr finish mon failed to return 
>> metadata for mds.db: (2) No such file or directory
>> 2019-08-14 11:36:22.734 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> [snip-repeating]
>> 
>> 2019-08-14 11:37:38.826 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> 2019-08-14 11:37:38.826 7fbd8a95a700  1 mgr finish mon failed to return 
>> metadata for mds.db: (2) No such file or directory
>> 2019-08-14 11:37:39.574 7fbd77d71700  1 mgr[balancer] Handling command: 
>> '{'prefix': 'balancer eval', 'option': '$POOL', 'target': ['mgr', '']}'
>> 
>> 2019-08-14 11:37:40.186 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> 2019-08-14 11:37:40.186 7fbd8a95a700  1 mgr finish mon failed to return 
>> metadata for mds.db: (2) No such file or directory
>> 2019-08-14 11:37:41.054 7fbd77570700  0 ms_deliver_dispatch: unhandled 
>> message 0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from 
>> mds.1717053370 v2:10.0.1.251:6800/140458953
>> 2019-08-14 11:37:41.054 7fbd8a95a700  1 mgr finish mon failed to return 
>> metadata for mds.db: (2) No such file or directory
>> 
>> 
>> I filtered out a bunch of messages from the mgr servicing `osd pool stats` 
>> requests, which I have running in a watch -n1.
>> I also snipped a bunch of repeating messages about mds.db, the 
>> standby-replay MDS; I'm not sure whether those are worth caring about.
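>> 
>> For completeness, the background poller is simply `watch -n1 ceph osd pool 
>> stats`, and I assume the repeating mds.db complaints could be cross-checked 
>> with something like the following (mds name taken from the log above):
>> 
>> $ ceph mds metadata db
>> $ ceph fs status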
>> 
>> Hoping someone may have some ideas of where to look to improve stability 
>> with the mgrs.
>> 
>> Thanks,
>> 
>> Reed

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
