Greetings all...

I'm busy testing out Ceph and have hit this troublesome bug while following
the steps outlined here:

https://docs.ceph.com/en/octopus/cephadm/monitoring/#configuring-ssl-tls-for-grafana
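
For reference, the steps on that page boil down to roughly the following
(paraphrased from memory, so treat this as approximate; the key/cert file
names here are just mine), with the last command being the one that triggers
the problem:

$ ceph config-key set mgr/cephadm/grafana_key -i ./grafana.key
$ ceph config-key set mgr/cephadm/grafana_crt -i ./grafana.crt
$ ceph orch restart mgr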

When I issue the "ceph orch restart mgr" command, it appears the command is
never cleared from a message queue somewhere (I'm still very unclear on many
Ceph internals), so each time the mgr process comes back up from the restart
it picks up the same command again and restarts itself, over and over (so far
it has been stuck in this loop for 45 minutes).
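
I haven't been able to find where that pending action actually lives. The
only persistent cephadm state I'm aware of is the mon config-key store, so my
(possibly naive) first guess would be to look under the mgr/cephadm/ prefix,
though I don't know whether scheduled actions are stored there at all:

$ ceph config-key ls | grep mgr/cephadm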

Watching the logs, we see this going on:

root@ceph-poc-000:~# ceph log last cephadm -w
  cluster:
    id:     d23bc326-543a-11eb-bfe0-b324db228b6c
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum
ceph-poc-000,ceph-poc-003,ceph-poc-004,ceph-poc-002,ceph-poc-001 (age 2h)
    mgr: ceph-poc-000.himivo(active, since 4s), standbys:
ceph-poc-001.unjulx
    osd: 10 osds: 10 up (since 2h), 10 in (since 2h)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   10 GiB used, 5.4 TiB / 5.5 TiB avail
    pgs:     1 active+clean


2021-01-11T20:46:32.976606+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:32.980749+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:33.061519+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:39.156420+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:39.160618+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:39.242603+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:45.299953+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:45.304006+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:45.733495+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:51.871903+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:51.877107+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:51.976190+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:58.000720+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:58.006843+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:58.097163+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:47:04.188630+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:04.193501+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:47:04.285509+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:47:10.348099+0000 mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:10.352340+0000 mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:47:10.752243+0000 mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available

And in the logs for the mgr instance itself, we see it replay the command
over and over:

$ docker logs -f
ceph-d23bc326-543a-11eb-bfe0-b324db228b6c-mgr.ceph-poc-000.himivo
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 set uid:gid to 167:167
(ceph:ceph)
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 ceph version 15.2.8
(bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process
ceph-mgr, pid 1
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 pidfile_write: ignore
empty --pid-file
debug 2021-01-11T20:47:31.414+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'alerts'
debug 2021-01-11T20:47:31.486+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'balancer'
debug 2021-01-11T20:47:31.542+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'cephadm'
debug 2021-01-11T20:47:31.742+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'crash'
debug 2021-01-11T20:47:31.798+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'dashboard'
debug 2021-01-11T20:47:32.258+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'devicehealth'
debug 2021-01-11T20:47:32.306+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'diskprediction_local'
debug 2021-01-11T20:47:32.498+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'influx'
debug 2021-01-11T20:47:32.550+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'insights'
debug 2021-01-11T20:47:32.598+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'iostat'
debug 2021-01-11T20:47:32.642+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'k8sevents'
debug 2021-01-11T20:47:33.034+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'localpool'
debug 2021-01-11T20:47:33.082+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'orchestrator'
debug 2021-01-11T20:47:33.266+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'osd_support'
debug 2021-01-11T20:47:33.310+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'pg_autoscaler'
debug 2021-01-11T20:47:33.382+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'progress'
debug 2021-01-11T20:47:33.442+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'prometheus'
debug 2021-01-11T20:47:33.794+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'rbd_support'
debug 2021-01-11T20:47:33.870+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'restful'
debug 2021-01-11T20:47:34.086+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'rook'
debug 2021-01-11T20:47:34.606+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'selftest'
debug 2021-01-11T20:47:34.654+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'status'
debug 2021-01-11T20:47:34.710+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'telegraf'
debug 2021-01-11T20:47:34.762+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'telemetry'
debug 2021-01-11T20:47:35.034+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'test_orchestrator'
debug 2021-01-11T20:47:35.146+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'volumes'
debug 2021-01-11T20:47:35.294+0000 7f48b0d0d200  1 mgr[py] Loading python
module 'zabbix'
debug 2021-01-11T20:47:35.366+0000 7f489df86700  0 ms_deliver_dispatch:
unhandled message 0x55d57b71a420 mon_map magic: 0 v1 from mon.0 v2:
10.128.18.16:3300/0
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map
Activating!
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map I am
now activating
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  0 [balancer DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  1 mgr load Constructed
class from module: balancer
debug 2021-01-11T20:47:35.774+0000 7f484ca14700  0 [cephadm DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed
class from module: cephadm
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  0 [crash DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed
class from module: crash
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [dashboard DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed
class from module: dashboard
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [devicehealth DEBUG
root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed
class from module: devicehealth
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [iostat DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed
class from module: iostat
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  0 [orchestrator DEBUG
root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  1 mgr load Constructed
class from module: orchestrator
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [pg_autoscaler DEBUG
root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed
class from module: pg_autoscaler
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [progress DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed
class from module: progress
debug 2021-01-11T20:47:35.898+0000 7f484ca14700  0 [prometheus DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.910+0000 7f484ca14700  1 mgr load Constructed
class from module: prometheus
debug 2021-01-11T20:47:35.914+0000 7f484ca14700  0 [rbd_support DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
[11/Jan/2021:20:47:35] ENGINE Bus STARTING
Warning: Permanently added 'ceph-poc-000,10.128.18.16' (ECDSA) to the list
of known hosts.
Warning: Permanently added 'ceph-poc-002,10.128.19.248' (ECDSA) to the list
of known hosts.
debug 2021-01-11T20:47:35.974+0000 7f484ca14700  1 mgr load Constructed
class from module: rbd_support
Warning: Permanently added 'ceph-poc-001,10.128.28.223' (ECDSA) to the list
of known hosts.
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [restful DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed
class from module: restful
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [status DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed
class from module: status
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [telemetry DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed
class from module: telemetry
debug 2021-01-11T20:47:35.982+0000 7f486f3cc700  0 [restful WARNING root]
server not running: no certificate configured
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [volumes DEBUG root]
setting log level based on debug_mgr: WARNING (1/5)
Warning: Permanently added 'ceph-poc-004,10.128.31.60' (ECDSA) to the list
of known hosts.
debug 2021-01-11T20:47:35.994+0000 7f484ca14700  1 mgr load Constructed
class from module: volumes
Warning: Permanently added 'ceph-poc-003,10.128.30.127' (ECDSA) to the list
of known hosts.
CherryPy Checker:
The Application mounted at '' has an empty config.

[11/Jan/2021:20:47:36] ENGINE Serving on http://:::9283
[11/Jan/2021:20:47:36] ENGINE Bus STARTED
debug 2021-01-11T20:47:36.778+0000 7f484d215700  0 log_channel(audit) log
[DBG] : from='client.19140 -' entity='client.admin' cmd=[{"prefix": "orch",
"action": "restart", "service_name": "mgr", "target": ["mon-mgr", ""]}]:
dispatch
debug 2021-01-11T20:47:36.778+0000 7f484ca14700  0 log_channel(cephadm) log
[INF] : Restart service mgr
debug 2021-01-11T20:47:36.806+0000 7f484d215700  0 log_channel(cluster) log
[DBG] : pgmap v3: 1 pgs: 1 active+clean; 0 B data, 208 MiB used, 5.4 TiB /
5.5 TiB avail
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 received  signal:
Terminated from Kernel ( Could be generated by pthread_kill(), raise(),
abort(), alarm() ) UID: 0
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 mgr handle_mgr_signal
 *** Got signal Terminated ***


Is there any way to manually clear this stuck command? Will it eventually
time out by itself?
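
The only things I can think of trying are pausing the orchestrator's
background work or failing the active mgr over to the standby, but I'm
hesitant to experiment without knowing whether either would actually clear
the pending restart or just move the loop elsewhere:

$ ceph orch pause
$ ceph mgr fail ceph-poc-000.himivo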

Either way, it's a pretty unpleasant surprise to run into while following the
docs so early in the process.

Thanks,

Chris