Thanks Sage. We will provide a patch based on this. Thanks, Guang
On Aug 20, 2014, at 11:19 PM, Sage Weil <sw...@redhat.com> wrote:

> On Wed, 20 Aug 2014, Guang Yang wrote:
>> Thanks Greg.
>> On Aug 20, 2014, at 6:09 AM, Gregory Farnum <g...@inktank.com> wrote:
>>
>>> On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang <yguan...@outlook.com> wrote:
>>>> Hi ceph-devel,
>>>> David (cc'ed) reported a bug (http://tracker.ceph.com/issues/9128) which
>>>> we came across in our test cluster during our failure testing. Basically,
>>>> the way to reproduce it was to leave one OSD daemon down and in for a day
>>>> while keeping write traffic going. When the OSD daemon was started again,
>>>> it hit the suicide timeout and killed itself.
>>>>
>>>> After some analysis (details in the bug), David found that the op thread
>>>> was busy searching for missing objects, and as the volume of objects to
>>>> search grows, the thread is expected to be busy for that long; please
>>>> refer to the bug for detailed logs.
>>>
>>> Can you talk a little more about what's going on here? At a quick
>>> naive glance, I'm not seeing why leaving an OSD down and in should
>>> require work based on the amount of write traffic. Perhaps if the rest
>>> of the cluster was changing mappings?
>>
>> We increased the down-to-out time interval from 5 minutes to 2 days to
>> avoid migrating data back and forth, which could increase latency, so we
>> plan to mark OSDs out manually. To verify this, we are testing some
>> boundary cases, such as leaving an OSD down and in for about a day;
>> however, when we try to bring it up again, it always fails because it
>> hits the suicide timeout.
>
> Looking at the log snippet I see the PG had log range
>
>   5481'28667,5646'34066
>
> which is ~5500 log events. The default max is 10k. search_for_missing is
> basically going to iterate over this list and check if the object is
> present locally.
>
> If that's slow enough to trigger a suicide (which it seems to be), the fix
> is simple: as Greg says, we just need to make it probe the internal
> heartbeat code to indicate progress. In most contexts this is done by
> passing a ThreadPool::TPHandle &handle into each method and then calling
> handle.reset_tp_timeout() on each iteration. The same needs to be done for
> search_for_missing...
>
> sage
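
For anyone following along: the interval Guang mentions should be the
mon osd down out interval option. The snippet below is only meant to show
what a 2-day setting would look like in ceph.conf, not a recommendation:

    [mon]
        # default is 300 seconds (5 minutes); 172800 seconds = 2 days
        mon osd down out interval = 172800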
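
To make the fix concrete, here is a minimal, self-contained sketch of the
pattern Sage describes: the long-running loop pokes a heartbeat handle on
every iteration so the watchdog keeps seeing progress instead of declaring
the thread hung. HeartbeatHandle and reset_timeout() below are stand-ins
for Ceph's ThreadPool::TPHandle and reset_tp_timeout(), and the loop is a
stand-in for search_for_missing walking the PG log; this is not the actual
patch, just an illustration of the technique.

    #include <chrono>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for ThreadPool::TPHandle: records the last time the worker
    // reported progress so a watchdog can decide whether it is stuck.
    class HeartbeatHandle {
      std::chrono::steady_clock::time_point last_progress_;
    public:
      HeartbeatHandle() : last_progress_(std::chrono::steady_clock::now()) {}

      // Called by the worker on each unit of work (cf. reset_tp_timeout()).
      void reset_timeout() {
        last_progress_ = std::chrono::steady_clock::now();
      }

      // Called by the watchdog: has the worker been silent past the grace period?
      bool timed_out(std::chrono::seconds grace) const {
        return std::chrono::steady_clock::now() - last_progress_ > grace;
      }
    };

    // Stand-in for search_for_missing: iterate the (possibly ~5500-entry)
    // missing list, poking the heartbeat once per object checked.
    void search_for_missing(const std::vector<std::string> &missing,
                            HeartbeatHandle &handle) {
      for (const auto &oid : missing) {
        handle.reset_timeout();  // the essence of the fix: mark progress per iteration
        (void)oid;               // ... check whether the object is present locally (elided) ...
      }
    }

    int main() {
      HeartbeatHandle handle;
      std::vector<std::string> missing(5500, "obj");  // roughly the log range from the bug
      search_for_missing(missing, handle);
      std::cout << "timed out? " << std::boolalpha
                << handle.timed_out(std::chrono::seconds(150)) << "\n";
    }

The real change would be the analogous one inside the OSD: thread the
existing ThreadPool::TPHandle through search_for_missing and call
handle.reset_tp_timeout() in its per-object loop, as Sage outlines above.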