I've correlated large batches of deep scrubs with cluster stability problems.
My primary cluster does a small number of deep scrubs all the time, spread out over the whole week, and it has no stability problems. My secondary cluster doesn't spread them out: it saves them up and tries to do all of its deep scrubs over the weekend. The secondary starts losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs the oldest outstanding PG. In bash:

# Sort by the deep-scrub timestamp, taking the single oldest PG.
# Process substitution keeps $pg visible in the loop body; piping
# into "read" would set it in a subshell and discard it.
while read date time pg < <(ceph pg dump |
        awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $20, $21, $1}' |
        sort | head -1)
do
    ceph pg deep-scrub "${pg}"
    # Wait for the deep scrub to finish before queueing the next one.
    while ceph status | grep -q scrubbing+deep
    do
        sleep 5
    done
    sleep 30
done

Does anybody think this will solve my problem?

I'm also considering disabling deep scrubbing until the secondary finishes replicating from the primary. Once it's caught up, the write load should drop enough that opportunistic deep scrubs should have a chance to run. It should only take another week or two to catch up.
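If I go the disable route, I'd use the cluster-wide nodeep-scrub flag, something like the sketch below. (My understanding is that nodeep-scrub only stops new deep scrubs from being scheduled; it doesn't abort ones already in flight.)

# Pause all new deep scrubs cluster-wide while replication catches up.
ceph osd set nodeep-scrub

# ... wait for the secondary to finish syncing from the primary ...

# Re-enable deep scrubs once the write load has dropped.
ceph osd unset nodeep-scrub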