Thanks Sam, I will go ahead and open a tracker for this.

Thanks,
Guang
----------------------------------------
> Date: Tue, 18 Aug 2015 08:42:04 -0700
> Subject: Re: OSD::do_mon_report - do we need holding osd_lock
> From: sj...@redhat.com
> To: yguan...@outlook.com
> CC: ceph-devel@vger.kernel.org
>
> Probably! A quick glance at do_mon_report doesn't seem to turn up
> anything I'd expect to be really hard to refactor. You do need to
> break out the required data (into OSDService, I'd think) so that the
> lock is not necessary.
> -Sam
>
> On Mon, Aug 17, 2015 at 6:10 PM, GuangYang <yguan...@outlook.com> wrote:
>> Hi Sam,
>> Today I noticed a scenario in which the monitor marked an OSD down because
>> it did not receive the PG stats from the OSD. Further investigation showed
>> that the OSD didn't report stats because it failed to acquire the
>> osd_lock. What happened was:
>> 1. One PG is undergoing long-running peering (searching for missing objects).
>> 2. An op holds the osd_lock and tries to acquire the PG lock, which is
>> held by 1).
>> 3. The OSD tick thread failed to acquire osd_lock and was stuck for 10
>> minutes, so it failed to send its stats to the monitor.
>> 4. The monitor marked it down.
>>
>> After looking at the code, we found several assertions (that osd_lock should
>> be held) around OSD::do_mon_report. Is that required? Is there any chance to
>> overcome the problem described above by refactoring the locking there?
>>
>> Thanks,
>> Guang
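
For illustration, the refactoring Sam suggests (breaking the data the report needs out from under osd_lock) might look roughly like the sketch below. This is not Ceph code: StatSnapshot, ReportData, stat_lock_ and build_mon_report are hypothetical names, and the real do_mon_report is not reproduced here. The only idea shown is that the reporting path copies its data under a small dedicated mutex and never takes osd_lock, so an op that sits on osd_lock while waiting for a PG lock cannot stall it.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical subset of the data a monitor report would need.
struct StatSnapshot {
    std::uint64_t epoch = 0;
    std::uint64_t num_pgs = 0;
};

// Hypothetical stand-in for the service object: it owns the report data
// behind its own narrow mutex, separate from the coarse osd_lock.
class ReportData {
public:
    // Called by whoever changes the state (possibly while holding osd_lock).
    void publish(const StatSnapshot& s) {
        std::lock_guard<std::mutex> g(stat_lock_);
        snap_ = s;
    }

    // Copies the data out under the narrow lock only.
    StatSnapshot snapshot() const {
        std::lock_guard<std::mutex> g(stat_lock_);
        return snap_;
    }

private:
    mutable std::mutex stat_lock_;  // held only for the duration of a copy
    StatSnapshot snap_;
};

// Hypothetical reporting path: it has no access to osd_lock at all, so it
// cannot block on it no matter who holds it or for how long.
StatSnapshot build_mon_report(const ReportData& data) {
    return data.snapshot();
}
```

A caller that holds the coarse lock would publish fresh values with data.publish(s), and the tick thread would call build_mon_report(data) without asserting or acquiring osd_lock. The narrow mutex is only ever held for a struct copy, which is what keeps the reporting path from inheriting the long waits described in steps 1 to 3.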