I've correlated large deep scrubbing operations with cluster stability
problems.

My primary cluster does a small number of deep scrubs all the time, spread
out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and tries
to do all of the deep scrubs over the weekend.  The secondary starts
losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously
deep-scrubs the oldest outstanding PG.  In pseudo-bash:
#!/bin/bash
# Deep-scrub one PG at a time, oldest deep-scrub timestamp first.
while :; do
  # Sort by the deep-scrub timestamp, taking the single oldest PG.
  # The field numbers ($20, $21) match my version's `ceph pg dump`
  # columns; check yours before running.  Reading via process
  # substitution (not a pipe) keeps $pg visible in this shell.
  read -r date time pg < <(ceph pg dump 2>/dev/null \
    | awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $20, $21, $1}' \
    | sort | head -n 1)
  [ -n "${pg}" ] || break
  ceph pg deep-scrub "${pg}"
  # Wait for the deep scrub to finish before queueing the next one.
  while ceph status | grep -q scrubbing+deep; do
    sleep 5
  done
  sleep 30
done
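
One caveat with the above: the $20/$21 field numbers are tied to my
version's `ceph pg dump` column layout.  If jq is available, parsing the
JSON dump avoids counting columns; a rough sketch, assuming the field
names (pg_stats, pgid, last_deep_scrub_stamp) match your release:

# Print "<timestamp> <pgid>" for the PG with the oldest deep scrub.
ceph pg dump --format=json 2>/dev/null \
  | jq -r '.pg_stats[] | "\(.last_deep_scrub_stamp) \(.pgid)"' \
  | sort | head -n 1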


Does anybody think this will solve my problem?

I'm also considering disabling deep scrubbing until the secondary finishes
replicating from the primary.  Once it's caught up, the write load should
drop enough that opportunistic deep scrubs have a chance to run.  It
should only take another week or two to catch up.
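
If I go that route, the toggle would be something like this, assuming a
release new enough to support the cluster-wide nodeep-scrub flag:

# Pause all deep scrubs cluster-wide while replication catches up.
ceph osd set nodeep-scrub
# ... once the secondary is caught up, allow deep scrubs again:
ceph osd unset nodeep-scrub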