Hi Paul,

we might have found the reason for MONs going silly on our cluster. There is a 
message size parameter whose default seems way too large. We reduced it today 
from 10M (the default) to 1M and haven't observed any silly MONs since:

ceph config set global osd_map_message_max_bytes $((1*1024*1024))
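
To double-check that the override is actually in place, something like the 
following should work (just a sketch; the output format may differ between 
releases):

ceph config dump | grep osd_map_message_max_bytes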

I cannot guarantee that this is the fix. However, after setting the above I 
observed one window of high packet-out load on a MON, and it remained 
responsive and did not go to 100% CPU. Maybe worth a try? I will keep observing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: 10 February 2021 17:32:07
To: Paul Mezzanini; ceph-users@ceph.io
Subject: [ceph-users] Re: OSDs cannot join, MON leader at 100%

It has become a lot more severe after adding a large number of disks. I opened 
a tracker issue:

https://tracker.ceph.com/issues/49231

In case you have additional information, feel free to add.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Mezzanini <pfm...@rit.edu>
Sent: 29 January 2021 20:04:12
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%

We are currently running 3 MONs.  When one goes into silly town the others get 
wedged and won't respond well.  I don't think more MONs would solve that... but 
I'm not sure.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu


________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: Friday, January 29, 2021 12:58 PM
To: Paul Mezzanini; ceph-users@ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%

Hi Paul,

thanks for sharing. I have the MONs on 2x10G bonded active-active. They don't 
manage to saturate 10G, but the CPU core is overloaded.

How many MONs do you have? I believe I have never seen more than 2 in this 
state for an extended period of time. My plan is to go from 3 to 5, which would 
still leave a working sub-cluster of 3, and I would be less hesitant to restart 
an affected MON right away.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Mezzanini <pfm...@rit.edu>
Sent: 29 January 2021 17:44:42
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%

We've been watching our MONs go unresponsive with a saturated 10GbE NIC.  The 
problem seems to be aggravated by peering.  We were shrinking the PG count on 
one of our large pools and it was happening a bunch.  Once that finished it 
seemed to calm down.  Yesterday I had an OSD go down and as it was rebalancing 
we had another MON go into silly mode.  We recover from this situation by just 
restarting the MON process on the hung node.
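
For reference, on a systemd-based deployment that restart is roughly the 
following (a sketch; the exact unit name depends on how the MONs were deployed):

systemctl restart ceph-mon@<mon-id>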


We are running 14.2.15.

I wish I could tell you what the problem actually is and how to fix it.  At 
least we aren't alone in this failure mode.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu


________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: Friday, January 29, 2021 5:22 AM
To: ceph-users@ceph.io
Subject: [ceph-users] OSDs cannot join, MON leader at 100%

Dear cephers,

I was doing some maintenance yesterday involving shutdown/power-up cycles of 
ceph servers. With the last server I ran into a problem. The server runs an MDS 
and a couple of OSDs. After reboot, the MDS joined the MDS cluster without 
problems, but the OSDs didn't come up. This was 1 out of 12 servers and I had 
no such problems with the other 11. I also observed that "ceph status" was 
responding very slowly.

Upon further inspection, I found that 2 of my 3 MONs (the leader and a peon) 
were running at 100% CPU. Client I/O was continuing, probably because the last 
cluster map remained valid. In our node performance monitoring I could see that 
the 2 busy MONs showed extraordinary network activity.

This state lasted for over one hour. After the MONs settled down, the OSDs 
finally joined as well and everything went back to normal.

The other instance where I have seen similar behaviour was when I restarted a 
MON on an empty disk and the re-sync was extremely slow due to an overly large 
value of mon_sync_max_payload_size. This time, I'm pretty sure it was 
MON-client communication; see below.

Are there any settings, similar to mon_sync_max_payload_size, that could 
influence the responsiveness of the MONs in this way?

Why do I suspect it is MON-client communication? In our monitoring, I do not 
see the huge number of packets sent by the MONs arriving at any other ceph 
daemon. They seem to be distributed over client nodes, but since we have a 
large number of client nodes (>550), the extra traffic disappears into the 
background network traffic. A second clue is that I have had such extended 
lock-ups before and, whenever I checked, I only observed them when the leader 
had a large share of the client sessions.

For example, yesterday the client session count per MON was:

ceph-01: 1339 (leader)
ceph-02:  189 (peon)
ceph-03:  839 (peon)
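
In case it is useful, such counts can be pulled from the MON admin sockets with 
something like the following (a sketch; the grep pattern may need adjusting to 
your release):

ceph daemon mon.<mon-id> sessions | grep -c 'client\.'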

I usually restart the leader when such a critical distribution occurs. As long 
as the leader has the fewest client sessions, I never observe this problem.

Ceph version is 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic 
(stable).

Thanks for any clues!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
