Hi guys,

I am running Red Hat Ceph (essentially Luminous - ceph version
12.2.12-48.el7cp (26388d73d88602005946d4381cc5796d42904858)) and am seeing
something similar on our test cluster.

One of the mons is running at around 300% CPU non-stop. It doesn't seem to
be the lead mon or any one mon in particular; the CPU load just shifts to
another mon if the high-load mon is restarted.
I thought it might be related to this thread, since it seems to have started
when we were removing and adding a lot of OSDs. In fact I have removed and
re-added all the OSDs in the cluster several times, and the mons have been
restarted several times, but the load persists.
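
For reference, this is roughly what I'm using to confirm which mon is busy
and what its clients are doing (a sketch: it assumes one ceph-mon per host
and that the mon id matches the short hostname; adjust to suit):

# confirm which ceph-mon is pinning a core
top -b -n 1 -p "$(pidof ceph-mon)" | tail -n 2

# 'sessions' lists connected clients; 'ops' (if available in this build)
# shows in-flight ops on the mon's admin socket
ceph daemon mon.$(hostname -s) sessions
ceph daemon mon.$(hostname -s) ops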

At debug_mon 20/5 I see endless lines like the following, which seem to be
related to the osdmap:

2019-12-17 11:59:47.916098 7f27dfba1700 10 mon.mon1@1(peon) e4
handle_get_version mon_get_version(what=osdmap handle=2874836684) v1
2019-12-17 11:59:47.916139 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch
existing session 0x55ab61fb6300 for client.27824428 10.0.0.2:0/461841538
2019-12-17 11:59:47.916146 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
2019-12-17 11:59:47.916149 7f27dfba1700 20 is_capable service=mon command=
read on cap allow *
2019-12-17 11:59:47.916151 7f27dfba1700 20  allow so far , doing grant
allow *
2019-12-17 11:59:47.916152 7f27dfba1700 20  allow all
2019-12-17 11:59:47.916153 7f27dfba1700 10 mon.mon1@1(peon) e4
handle_get_version mon_get_version(what=osdmap handle=2871621985) v1
2019-12-17 11:59:47.916203 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch
existing session 0x55ab61d7c780 for client.27824430 10.0.0.2:0/898487246
2019-12-17 11:59:47.916210 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
2019-12-17 11:59:47.916213 7f27dfba1700 20 is_capable service=mon command=
read on cap allow *
2019-12-17 11:59:47.916215 7f27dfba1700 20  allow so far , doing grant
allow *
2019-12-17 11:59:47.916216 7f27dfba1700 20  allow all
2019-12-17 11:59:47.916217 7f27dfba1700 10 mon.mon1@1(peon) e4
handle_get_version mon_get_version(what=osdmap handle=2882637609) v1
2019-12-17 11:59:47.916254 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch
existing session 0x55ab62649c80 for client.27824431 10.0.0.2:0/972633098
2019-12-17 11:59:47.916262 7f27dfba1700 20 mon.mon1@1(peon) e4  caps allow *
2019-12-17 11:59:47.916266 7f27dfba1700 20 is_capable service=mon command=
read on cap allow *
2019-12-17 11:59:47.916268 7f27dfba1700 20  allow so far , doing grant
allow *
2019-12-17 11:59:47.916269 7f27dfba1700 20  allow all
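
To see which clients are responsible for the flood of mon_get_version
requests, I'm counting the session lines from the debug log, roughly like
this (assumes the default log path for this mon):

# the client.* entity is the second-to-last field on the _ms_dispatch lines
grep 'existing session' /var/log/ceph/ceph-mon.mon1.log \
  | awk '{print $(NF-1)}' | sort | uniq -c | sort -rn | head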

Continuing to investigate.

Raf

On Tue, 17 Dec 2019 at 11:53, Sasha Litvak <alexander.v.lit...@gmail.com>
wrote:

> Bryan, thank you.  We are about to start testing the 14.2.2 -> 14.2.5
> upgrade, so folks here are a bit cautious :-)  We don't need to convert,
> but we may have to rebuild a few disks after the upgrade.
>
> On Mon, Dec 16, 2019 at 3:57 PM Bryan Stillwell <bstillw...@godaddy.com>
> wrote:
>
>> Sasha,
>>
>> I was able to get past it by restarting the ceph-mon processes every time
>> it got stuck, but that's not a very good solution for a production cluster.
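
In our case restarting the busy mon only moves the load to another one, but
for reference the restart itself is just the usual (unit name assumes the
mon id matches the short hostname):

systemctl restart ceph-mon@$(hostname -s)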
>>
>> Right now I'm trying to narrow down what is causing the problem.
>> Rebuilding the OSDs with BlueStore doesn't seem to be enough.  I believe it
>> could be related to us using the extra space on the journal device as an
>> SSD-based OSD.  During the conversion process I'm removing this SSD-based
>> OSD (since with BlueStore the omap data ends up on the SSD anyway), and I
>> suspect it might be causing this problem.
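
For what it's worth, our remove/re-add cycle here looks roughly like the
sequence below, in case the pattern matters (the OSD id and device names are
placeholders, and this is a sketch of the general steps rather than our
exact commands):

# drain and remove the OSD
ceph osd out 12
# ...wait for backfill to finish, then stop the daemon...
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it

# wipe the old devices and redeploy; with a mixed HDD+SSD batch, ceph-volume
# normally places block.db on the SSD
ceph-volume lvm zap --destroy /dev/sdb
ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/nvme0n1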
>>
>> Bryan
>>
>> On Dec 14, 2019, at 10:27 AM, Sasha Litvak <alexander.v.lit...@gmail.com>
>> wrote:
>>
>>
>> Bryan,
>>
>> Were you able to resolve this?  If yes, can you please share with the
>> list?
>>
>> On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell <bstillw...@godaddy.com>
>> wrote:
>>
>>> Adding the dev list since it seems like a bug in 14.2.5.
>>>
>>> I was able to capture the output from perf top:
>>>
>>>   21.58%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>   20.90%  libstdc++.so.6.0.19   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>>>   13.25%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>   10.11%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry
>>>    8.94%  libstdc++.so.6.0.19   [.] std::basic_ios<char, std::char_traits<char> >::clear
>>>    3.24%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>>>    1.69%  libceph-common.so.0   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>>>    1.63%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry@plt
>>>    1.21%  [kernel]              [k] __do_softirq
>>>    0.77%  libpython2.7.so.1.0   [.] PyEval_EvalFrameEx
>>>    0.55%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>>
>>> I increased mon debugging to 20 and nothing stuck out to me.
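
I'll try to grab the same profile from our busy mon. For reference, the
capture itself is just something like this (assumes one ceph-mon per host):

perf top -g -p "$(pidof ceph-mon)"
# or record ~30s and inspect offline
perf record -g -p "$(pidof ceph-mon)" -- sleep 30
perf report --stdio | head -n 40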
>>>
>>> Bryan
>>>
>>> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillw...@godaddy.com>
>>> wrote:
>>> >
>>> > On our test cluster after upgrading to 14.2.5 I'm having problems with
>>> the mons pegging a CPU core while moving data around.  I'm currently
>>> converting the OSDs from FileStore to BlueStore by marking the OSDs out on
>>> multiple nodes, destroying the OSDs, and then recreating them with
>>> ceph-volume lvm batch.  This seems to get the ceph-mon process into a
>>> state where it pegs a CPU core on one of the mons:
>>> >
>>> > 1764450 ceph      20   0 4802412   2.1g  16980 S 100.0 28.1   4:54.72
>>> ceph-mon
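
It might also be worth pausing rebalancing/backfill while the OSDs are being
destroyed and recreated, just to limit how quickly the osdmap churns while
chasing this. These are the standard cluster flags, nothing specific to this
bug:

ceph osd set norebalance
ceph osd set nobackfill
# ...destroy / recreate the OSDs...
ceph osd unset nobackfill
ceph osd unset norebalance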
>>> >
>>> > Has anyone else run into this with 14.2.5 yet?  I didn't see this
>>> problem while the cluster was running 14.2.4.
>>> >
>>> > Thanks,
>>> > Bryan


-- 
*Rafael Lopez*
Research Devops Engineer
Monash University eResearch Centre

E: rafael.lo...@monash.edu
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
