Unfortunately, decreasing the osd_scrub_max_interval to 6 days isn’t going to 
fix it.

There is a quirk in the way a deep scrub is initiated.  A deep scrub isn’t 
triggered until a regular scrub is about to start.  So with 
osd_scrub_max_interval set to 1 week and a high load, the next possible scrub 
or deep-scrub is 1 week after the last REGULAR scrub, even if the last deep 
scrub was more than 7 days ago.

The longest possible gap between two deep scrubs is osd_scrub_max_interval + 
osd_deep_scrub_interval.
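
As a rough check of that bound with the default values, here is a small 
Python sketch.  It is illustrative arithmetic only: the variable names mirror 
the config options, and the helper function is made up for this example.

```python
# Worst-case gap between two deep scrubs under sustained high load,
# using the default values quoted in this thread. Illustrative only.
DAY = 60 * 60 * 24

osd_scrub_max_interval = 7 * DAY    # default: 1 week (604800 s)
osd_deep_scrub_interval = 7 * DAY   # default: 1 week (604800 s)


def worst_case_deep_scrub_gap(scrub_max_interval, deep_scrub_interval):
    """Upper bound on the time between deep scrubs (hypothetical helper).

    A regular scrub can run just before the deep scrub comes due; if the
    load then stays high, nothing starts until scrub_max_interval later.
    """
    return scrub_max_interval + deep_scrub_interval


gap = worst_case_deep_scrub_gap(osd_scrub_max_interval, osd_deep_scrub_interval)
print(gap, "seconds =", gap // DAY, "days")
```

With the defaults this gives 14 days.  Passing 6 * DAY as the first argument 
only lowers the bound to 13 days, which is why the 6-day setting doesn’t 
help.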

For example, a deep scrub happens on Jan 1.  On each of the next six days a 
regular scrub happens under low load.  After those 6 regular scrubs, the last 
one on Jan 7, the load goes high.  With the load high, no scrub can start 
until Jan 14, because osd_scrub_max_interval must elapse since the last 
regular scrub on Jan 7.  At that point it will be a deep scrub, because more 
than 7 days have passed since the last deep scrub on Jan 1.
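
The same timeline can be replayed with a short Python sketch.  This is a 
simplified model of the behavior described here, not the actual OSD code: the 
function name is hypothetical, the year is arbitrary, and the intervals are 
the defaults.

```python
from datetime import date, timedelta

SCRUB_MAX_INTERVAL = timedelta(days=7)   # osd_scrub_max_interval (default)
DEEP_SCRUB_INTERVAL = timedelta(days=7)  # osd_deep_scrub_interval (default)


def next_scrub_under_high_load(last_regular_scrub, last_deep_scrub):
    """Return (date, kind) of the next scrub if the load stays high.

    Simplified model (hypothetical helper): under high load a scrub is only
    forced once SCRUB_MAX_INTERVAL has passed since the last regular scrub,
    and it becomes a deep scrub if the last deep scrub is at least
    DEEP_SCRUB_INTERVAL old at that point.
    """
    when = last_regular_scrub + SCRUB_MAX_INTERVAL
    is_deep = when - last_deep_scrub >= DEEP_SCRUB_INTERVAL
    return when, "deep" if is_deep else "regular"


# The year is an arbitrary assumption; the example only gives month and day.
last_deep = date(2014, 1, 1)     # deep scrub on Jan 1
last_regular = date(2014, 1, 7)  # last low-load regular scrub on Jan 7

when, kind = next_scrub_under_high_load(last_regular, last_deep)
days_since_deep = (when - last_deep).days
print(when, kind, days_since_deep)
```

This reports a deep scrub on Jan 14, 13 days after the previous deep scrub.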

See also http://tracker.ceph.com/issues/6735

This area may need clearer documentation or a change to the behavior.

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 23, 2014, at 11:10 PM, Christian Balzer <ch...@gol.com> wrote:

> 
> 
> Hello,
> 
> On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:
> 
>> 
>> By default osd_scrub_max_interval and osd_deep_scrub_interval are both
>> 1 week, 604800 seconds (60*60*24*7), and osd_scrub_min_interval is 1 day,
>> 86400 seconds (60*60*24).  As long as osd_scrub_max_interval <=
>> osd_deep_scrub_interval, the load won’t impact when a deep scrub
>> occurs.   I suggest keeping osd_scrub_min_interval <=
>> osd_scrub_max_interval <= osd_deep_scrub_interval.
>> 
>> I’d like to know how you have those 3 values set, so I can confirm that
>> this explains the issue.
>> 
> They are and were unsurprisingly set to the default values.
> 
> Now to provide some more information, shortly after the inception of this
> cluster I did initiate a deep scrub on all OSDs at 00:30 on a Sunday
> morning (the things we do for Ceph, a scheduler with a variety of rules
> would be nice, but I digress). 
> This took until 05:30 despite the cluster being idle and with close to no
> data in it. In retrospect it seems clear to me that this already was
> influenced by the load threshold (a scrub I initiated with the new
> threshold value of 1.5 finished in just 30 minutes last night).
> Consequently all the normal scrubs happened in the same time frame until
> this weekend on the 21st (normal scrub).
> The deep scrub on the 22nd clearly ran into the load threshold.
> 
> So if I understand you correctly, setting osd_scrub_max_interval to 6 days
> should make deep scrubs ignore the load threshold, as per the documentation?
> 
> Regards,
> 
> Christian
> 
>> 
>> David Zafman
>> Senior Developer
>> http://www.inktank.com
>> http://www.redhat.com
>> 
>> On Jun 23, 2014, at 7:01 PM, Christian Balzer <ch...@gol.com> wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote:
>>> 
>>>> Looks like it's a doc error (at least on master), but it might have
>>>> changed over time. If you're running Dumpling we should change the
>>>> docs.
>>> 
>>> Nope, I'm running 0.80.1 currently.
>>> 
>>> Christian
>>> 
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> 
>>>> 
>>>> On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer <ch...@gol.com>
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> This weekend I noticed that the deep scrubbing took a lot longer than
>>>>> usual (long periods without a scrub running/finishing), even though
>>>>> the cluster wasn't all that busy.
>>>>> It was however busier than in the past and the load average was above
>>>>> 0.5 frequently.
>>>>> 
>>>>> Now according to the documentation "osd scrub load threshold" is
>>>>> ignored when it comes to deep scrubs.
>>>>> 
>>>>> However after setting it to 1.5 and restarting the OSDs the
>>>>> floodgates opened and all those deep scrubs are now running at full
>>>>> speed.
>>>>> 
>>>>> Documentation error or did I "unstick" something by the OSD restart?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Christian
>>>>> --
>>>>> Christian Balzer        Network/Systems Engineer
>>>>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>>>>> http://www.gol.com/
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
>>> 
>>> 
>>> -- 
>>> Christian Balzer        Network/Systems Engineer                
>>> ch...@gol.com       Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> ch...@gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
