
I've struggled with the same issue for quite a while. If your I/O is similar to mine, I believe you are on the right track. For the past month or so, I have been running this cron job:

* * * * * for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg; done

That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7 days. Your script may be a bit better, but this quick and dirty method has helped my cluster maintain more consistency.
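
As a rough check on "roughly handles", the arithmetic can be sketched in shell. This assumes every run of the cron job requests two distinct PGs and that each request is actually honored; the variable names are only for illustration.

```shell
#!/bin/sh
# Rough throughput check for the cron job above (illustrative names).
pgs_per_run=2                      # "head -2" in the cron job
runs_per_week=$((60 * 24 * 7))     # one run per minute for 7 days
scrubs_per_week=$((pgs_per_run * runs_per_week))
total_pgs=20672                    # PG count quoted above

echo "deep-scrubs requested per week: $scrubs_per_week"
echo "shortfall against $total_pgs PGs: $((total_pgs - scrubs_per_week))"
```

That works out to 20160 requests per week against 20672 PGs, a shortfall of 512, so the job keeps pace only approximately.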

The real key for me is avoiding the "clumpiness" I observed without that hack: concurrent deep-scrubs sit at zero for a long period (despite PGs that are months overdue for a deep-scrub), then suddenly spike and stay in the teens for hours, killing client writes/second.

The scrubbing behavior table[0] indicates that a periodic tick initiates scrubs on a per-PG basis. Perhaps the timing of the ticks isn't sufficiently randomized when you restart lots of OSDs concurrently (for instance via pdsh).

On my cluster I suffer a significant drag on client writes/second when I exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent deep-scrubs get into the teens, I get a massive drop in client writes/second.
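
One way to act on that threshold is to count the PGs currently in deep-scrub before requesting another. The following is only a sketch: `count_deep_scrubs`, `below_limit` and `MAX_DEEP_SCRUBS` are hypothetical names, the limit of 4 is just the "four or five" figure above, and it assumes the PG state in `ceph pg dump` output contains the string `scrubbing+deep`, as the `grep` in the quoted script below does.

```shell
#!/bin/sh
# Hypothetical helper: count PG lines whose state includes "scrubbing+deep".
# Reads "ceph pg dump"-style text on stdin, so it can be tried on saved output.
count_deep_scrubs() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && /scrubbing\+deep/ { n++ }
         END { print n + 0 }'
}

# Hypothetical helper: succeeds only while the count ($1) is below the limit ($2).
below_limit() {
    [ "$1" -lt "$2" ]
}

MAX_DEEP_SCRUBS=4   # assumed threshold, from "four or five" above

# Only talk to the cluster when the ceph CLI is actually present.
if command -v ceph >/dev/null 2>&1; then
    current=$(ceph pg dump 2>/dev/null | count_deep_scrubs)
    if below_limit "$current" "$MAX_DEEP_SCRUBS"; then
        echo "$current deep-scrubs running; room to request another"
    else
        echo "$current deep-scrubs running; skipping this round"
    fi
fi
```

Putting a check like this in front of the `ceph pg deep-scrub` call would skip a round whenever the cluster is already busy, rather than piling more deep-scrubs on top.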

Greg, is there locking involved when a PG enters deep-scrub? If so, is the entire PG locked for the duration or is each individual object inside the PG locked as it is processed? Some of my PGs will be in deep-scrub for minutes at a time.


On 6/9/2014 6:22 PM, Craig Lewis wrote:
I've correlated a large deep scrubbing operation to cluster stability.

My primary cluster does a small amount of deep scrubs all the time,
spread out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and
tries to do all of the deep scrubs over the weekend.  The secondary
starts losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while read -r date time pg < <(ceph pg dump 2>/dev/null |
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $20, $21, $1}' | sort | head -1)
do
   ceph pg deep-scrub "${pg}"
   # Wait until no PG is reported as deep-scrubbing any more
   while ceph status | grep -q 'scrubbing+deep'
   do
     sleep 5
   done
   sleep 30
done

Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary
finishes replicating from the primary.  Once it's caught up, the write
load should drop enough that opportunistic deep scrubs should have a
chance to run.  It should only take another week or two to catch up.

