Hi guys, I am running Red Hat Ceph (basically Luminous - ceph version 12.2.12-48.el7cp (26388d73d88602005946d4381cc5796d42904858)) and am seeing something similar on our test cluster.
One of the mons is running at around 300% CPU non-stop. It doesn't seem to be the lead mon or any one mon in particular; the CPU load shifts to another mon if the high-load mon is restarted. I thought it might be related to this thread since it seems to have started when removing and adding a lot of OSDs. In fact, I have removed and re-added all of the OSDs in the cluster several times, and the mons have been restarted several times, but the load persists.

With debug_mon at 20/5, I see endless lines like the following, which appear to relate to the osdmap:

2019-12-17 11:59:47.916098 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2874836684) v1
2019-12-17 11:59:47.916139 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61fb6300 for client.27824428 10.0.0.2:0/461841538
2019-12-17 11:59:47.916146 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
2019-12-17 11:59:47.916149 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
2019-12-17 11:59:47.916151 7f27dfba1700 20 allow so far , doing grant allow *
2019-12-17 11:59:47.916152 7f27dfba1700 20 allow all
2019-12-17 11:59:47.916153 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2871621985) v1
2019-12-17 11:59:47.916203 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab61d7c780 for client.27824430 10.0.0.2:0/898487246
2019-12-17 11:59:47.916210 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
2019-12-17 11:59:47.916213 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
2019-12-17 11:59:47.916215 7f27dfba1700 20 allow so far , doing grant allow *
2019-12-17 11:59:47.916216 7f27dfba1700 20 allow all
2019-12-17 11:59:47.916217 7f27dfba1700 10 mon.mon1@1(peon) e4 handle_get_version mon_get_version(what=osdmap handle=2882637609) v1
2019-12-17 11:59:47.916254 7f27dfba1700 20 mon.mon1@1(peon) e4 _ms_dispatch existing session 0x55ab62649c80 for client.27824431 10.0.0.2:0/972633098
2019-12-17 11:59:47.916262 7f27dfba1700 20 mon.mon1@1(peon) e4 caps allow *
2019-12-17 11:59:47.916266 7f27dfba1700 20 is_capable service=mon command= read on cap allow *
2019-12-17 11:59:47.916268 7f27dfba1700 20 allow so far , doing grant allow *
2019-12-17 11:59:47.916269 7f27dfba1700 20 allow all
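One way to narrow down which clients are behind the get_version flood, as a minimal sketch: it assumes the default mon log location, the mon name "mon1" from the log above, and that the client address is the last field of the _ms_dispatch lines (as in the excerpt):

    # raise mon debugging on the busy mon via the admin socket (20/5, as above)
    ceph daemon mon.mon1 config set debug_mon 20/5

    # count dispatches per client address to see who is hammering the mon
    grep '_ms_dispatch existing session' /var/log/ceph/ceph-mon.mon1.log \
        | awk '{print $NF}' | sort | uniq -c | sort -rn | head

    # list the sessions currently attached to this mon
    ceph daemon mon.mon1 sessions

    # drop debugging back down when done
    ceph daemon mon.mon1 config set debug_mon 1/5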
Continuing to investigate.

Raf

On Tue, 17 Dec 2019 at 11:53, Sasha Litvak <alexander.v.lit...@gmail.com> wrote:

> Bryan, thank you. We are about to start testing a 14.2.2 -> 14.2.5 upgrade,
> so folks here are a bit cautious :-) We don't need to convert, but we may
> have to rebuild a few disks after the upgrade.
>
> On Mon, Dec 16, 2019 at 3:57 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>
>> Sasha,
>>
>> I was able to get past it by restarting the ceph-mon processes every time
>> it got stuck, but that's not a very good solution for a production cluster.
>>
>> Right now I'm trying to narrow down what is causing the problem.
>> Rebuilding the OSDs with BlueStore doesn't seem to be enough. I believe it
>> could be related to us using the extra space on the journal device as an
>> SSD-based OSD. During the conversion process I'm removing this SSD-based
>> OSD (since with BlueStore the omap data ends up on the SSD anyway), and
>> I suspect it might be causing this problem.
>>
>> Bryan
>>
>> On Dec 14, 2019, at 10:27 AM, Sasha Litvak <alexander.v.lit...@gmail.com> wrote:
>>
>> Bryan,
>>
>> Were you able to resolve this? If yes, can you please share with the list?
>>
>> On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>
>>> Adding the dev list since it seems like a bug in 14.2.5.
>>>
>>> I was able to capture the output from perf top:
>>>
>>>   21.58%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>   20.90%  libstdc++.so.6.0.19   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>>>   13.25%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>>   10.11%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry
>>>    8.94%  libstdc++.so.6.0.19   [.] std::basic_ios<char, std::char_traits<char> >::clear
>>>    3.24%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>>>    1.69%  libceph-common.so.0   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>>>    1.63%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry@plt
>>>    1.21%  [kernel]              [k] __do_softirq
>>>    0.77%  libpython2.7.so.1.0   [.] PyEval_EvalFrameEx
>>>    0.55%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>>
>>> I increased mon debugging to 20 and nothing stuck out to me.
>>>
>>> Bryan
>>>
>>> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
>>> >
>>> > On our test cluster, after upgrading to 14.2.5 I'm having problems with
>>> > the mons pegging a CPU core while moving data around. I'm currently
>>> > converting the OSDs from FileStore to BlueStore by marking the OSDs out
>>> > on multiple nodes, destroying them, and then recreating them with
>>> > ceph-volume lvm batch. This seems to get the ceph-mon process into a
>>> > state where it pegs a CPU core on one of the mons:
>>> >
>>> >   1764450 ceph  20  0  4802412  2.1g  16980 S 100.0 28.1  4:54.72 ceph-mon
>>> >
>>> > Has anyone else run into this with 14.2.5 yet? I didn't see this problem
>>> > while the cluster was running 14.2.4.
>>> >
>>> > Thanks,
>>> > Bryan

--
Rafael Lopez
Research Devops Engineer
Monash University eResearch Centre
E: rafael.lo...@monash.edu
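For reference, a minimal sketch of the FileStore-to-BlueStore rebuild loop Bryan describes above. The OSD IDs and device names are placeholders (sdb-sdd as HDDs, nvme0n1 as the former journal SSD), and ceph-volume lvm batch places block.db on the solid-state device when it is given mixed device types:

    # mark the node's OSDs out and wait for the PGs to go active+clean elsewhere
    ceph osd out 12 13 14

    # stop and destroy each OSD (destroy keeps the ID reusable, unlike purge)
    systemctl stop ceph-osd@12
    ceph osd destroy 12 --yes-i-really-mean-it

    # wipe the old FileStore data device
    ceph-volume lvm zap /dev/sdb --destroy

    # recreate the OSDs as BlueStore; the SSD in the list is used for block.db
    ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd /dev/nvme0n1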
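And to reproduce the perf top capture shown in Bryan's message against a busy mon, a sketch, assuming perf is installed and that the mon id matches the short hostname (as is common):

    # find the busy mon process and sample it live
    pid=$(pidof ceph-mon)
    perf top -p "$pid"

    # or record ~30 seconds and inspect the profile offline
    perf record -g -p "$pid" -- sleep 30
    perf report --stdio | head -n 40

    # the stopgap mentioned in the thread: bounce the mon when it wedges
    systemctl restart ceph-mon@$(hostname -s)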