On 05.01.21 20:29, Uwe Sauter wrote:
Frank,
On 05.01.21 20:24, Frank Thommen wrote:
Hi Uwe,
Did you look into the logs of the MON and the OSDs?
I can't see any specific MON or OSD logs. However, the log available
in the UI (Ceph -> Log) has lots of messages about scrubbing, but
none about issues with starting the monitor.
On each host the logs should be in /var/log/ceph. These should be
rotated (see /etc/logrotate.d/ceph-common for details).
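For example, something like the following should show the relevant files
(this assumes the default Ceph log naming, ceph-mon.<hostname>.log; adjust
the name if your setup differs):
-----------------------
# Monitor log of this node (default Ceph naming: ceph-mon.<hostname>.log)
less /var/log/ceph/ceph-mon.$(hostname -s).log

# Or follow it live while trying to start the monitor
tail -f /var/log/ceph/ceph-mon.$(hostname -s).log

# Cluster log and audit log (written on the MON hosts)
less /var/log/ceph/ceph.log
less /var/log/ceph/ceph.audit.log
-----------------------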
ok. I see lots of
-----------------------
2021-01-05 20:38:05.900 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:07.208 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.688 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.744 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:09.092 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.268 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.964 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:15.752 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:17.440 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.388 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.712 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.828 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
-----------------------
in the mon log on the problematic host.
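If it is useful, I could also query the probing monitor over its admin socket
and compare that with what a healthy node reports, roughly like this
(odcf-pve02 is the affected monitor):
-----------------------
# On the problematic host: ask the probing monitor for its own view
# (uses the local admin socket, so the mon process must be running)
ceph daemon mon.odcf-pve02 mon_status

# On a healthy node: the monmap and quorum as the cluster sees them
ceph mon dump
ceph quorum_status --format json-pretty
-----------------------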
When (unsuccessfully) starting the monitor through the UI, the following
entries appear in ceph.audit.log:
-----------------------
2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG] from='client.? 192.168.255.2:0/784579843' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
-----------------------
192.168.255.2 is the IP address of the problematic host in the Ceph mesh
network. odcf-pve01 and odcf-pve03 are the "good" nodes.
However, I am not sure what kind of information I should look for in the
logs.
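So far I have only looked at the Ceph log files themselves. I could also
collect the systemd status and journal of the monitor unit if that helps,
something like:
-----------------------
# systemd view of the monitor service on the affected host
systemctl status ceph-mon@odcf-pve02.service

# Recent journal entries for that unit since the last boot
journalctl -u ceph-mon@odcf-pve02.service -b --no-pager | tail -n 100
-----------------------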
Frank
Regards,
Uwe
Can you provide the list of installed packages on the affected host
and on the rest of the cluster?
Let me compile the lists and post them somewhere; they are quite long.
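Unless you need a different format, I would simply capture them on each node
with something like:
-----------------------
# PVE / Ceph related package versions in one overview
pveversion -v

# Full list of installed Ceph packages with their versions
dpkg -l | grep -i ceph
-----------------------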
Is the output of "ceph status" the same for all hosts?
yes
Frank
Regards,
Uwe
On 05.01.21 20:01, Frank Thommen wrote:
On 04.01.21 12:44, Frank Thommen wrote:
Dear all,
one of our three PVE hypervisors in the cluster crashed (it was
fenced successfully) and rebooted automatically. I took the chance
to do a complete dist-upgrade and rebooted again.
The PVE Ceph dashboard now reports that
* the monitor on the host is down (out of quorum), and
* "A newer version was installed but old version still running,
please restart"
The Ceph UI reports monitor version 14.2.11 while in fact 14.2.16
is installed. The hypervisor has been rebooted twice since the
upgrade, so it should be basically impossible that the old version
is still running.
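If it matters, I believe the version each running daemon actually reports can
be checked with something like the following (odcf-pve02 is the affected
host); I can post the output:
-----------------------
# Versions reported by the running daemons, grouped by daemon type
ceph versions

# Version of one specific monitor (only if it is reachable)
ceph tell mon.odcf-pve02 version
-----------------------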
`systemctl restart ceph.target` and restarting the monitor through
the PVE Ceph UI didn't help. The hypervisor is running PVE 6.3-3
(the other two nodes are running 6.3-2 with monitor 14.2.15).
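For completeness, the per-daemon restart I would try next looks roughly like
this (unit names assumed from the standard ceph-mon@<hostname> /
ceph-osd@<id> scheme; <ID> is a placeholder):
-----------------------
# Restart only the monitor on this node (unit name = ceph-mon@<hostname>)
systemctl restart ceph-mon@odcf-pve02.service
systemctl status ceph-mon@odcf-pve02.service

# OSD units follow the same scheme, one unit per OSD id
systemctl list-units 'ceph-osd@*'
systemctl restart 'ceph-osd@<ID>.service'   # <ID> = OSD id from the list above
-----------------------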
What to do in this situation?
I am happy with either UI or command-line instructions, but I have
no Ceph experience besides setting it up following the PVE
instructions.
Any help or hint is appreciated.
Cheers, Frank
In an attempt to fix the issue I destroyed the monitor through the
UI and recreated it. Unfortunately, it still cannot be started: a
popup tells me that the monitor has been started, but the overview
still shows "stopped" and there is no version number anymore.
Then I stopped and started Ceph on the node (`pveceph stop; pveceph
start`), which resulted in a degraded cluster (1 host down, 7 of 21
OSDs down). The OSDs cannot be started through the UI either.
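If the exact state is useful, I can collect it from one of the good nodes
with something like:
-----------------------
# Overall health, and which hosts / OSDs are currently down (run on a good node)
ceph -s
ceph osd tree
ceph health detail
-----------------------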
I feel extremely uncomfortable with this situation and would
appreciate any hint as to how I should proceed with the problem.
Cheers, Frank
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user