
Hello, we have a recurring, rather funky problem with managers on Nautilus (and 
probably also earlier versions): the active manager displays stale, incorrect 
information. This also breaks the Prometheus graphs, because the reported I/O 
rates are wildly wrong - e.g. "recovery: 43 TiB/s, 3.62k keys/s, 11.40M 
objects/s" - which blows up the scale of every related graph and makes it 
unusable. The latest example from today shows slow ops for an OSD that has been 
down for 17h:
--------------------------------------------------------------------------------
[09:50:31] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client: 522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
--------------------------------------------------------------------------------
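For context, this is roughly how we cross-check that such a warning is stale 
rather than a real problem (a minimal sketch; osd.53 is the OSD from the 
warning above, and the exact output differs between releases):
--------------------------------------------------------------------------------
# repeat the health warning with details
ceph health detail
# confirm that osd.53 is (still) the only OSD that is down
ceph osd tree down
# locate the down OSD (host, rack) to rule out a real outage
ceph osd find 53
--------------------------------------------------------------------------------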
Killing the manager on server2 changes the status to another temporarily 
incorrect one: it now shows a rebalance that actually finished hours ago, 
paired with the bogus rebalance speed that we see from time to time:
--------------------------------------------------------------------------------
[09:51:59] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client:   214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
    recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s

  progress:
    Rebalancing after osd.53 marked out
      [========================......]
--------------------------------------------------------------------------------
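As an aside on the Prometheus angle: the bogus rates can be seen directly in 
what the active mgr exports, so the graphs are broken at the source and not in 
the dashboards. A minimal sketch, assuming the prometheus mgr module is enabled 
on its default port 9283 (the exact metric names depend on the release):
--------------------------------------------------------------------------------
# scrape the active mgr (server8 above) and look at the recovery-related rates
curl -s http://server8:9283/metrics | grep -i recover
--------------------------------------------------------------------------------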
 Then a bit later, the status on the newly started manager is correct: 
--------------------------------------------------------------------------------
[09:52:18] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client: 422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
--------------------------------------------------------------------------------
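Since bouncing the manager is our current workaround, for reference the 
failover can also be forced from the CLI instead of killing the ceph-mgr 
process (a minimal sketch; same effect as restarting the daemon):
--------------------------------------------------------------------------------
# mark the active mgr on server2 as failed; one of the standbys takes over
ceph mgr fail server2
# check which mgr is active now
ceph mgr dump | grep active_name
--------------------------------------------------------------------------------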
Question: is this a known bug, is anyone else seeing it, or are we doing 
something wrong?

Best regards,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
