Hi all,

a little gem for Christmas. After going through the OSD code, scratching my head 
and doing a bit of maths, I seem to have found a way to tune the built-in scrub 
machine to work perfectly. It's only a few knobs to turn, but they are difficult 
to find, because the documentation ranges from misleading to incorrect, or is 
missing entirely. I plan to add a bit more documentation to this script 
(https://github.com/frans42/ceph-goodies/blob/5e2016f0b00f8dbc3e51c7e9904a7386b037fd82/scripts/pool-scrub-report),
 so here is only the executive summary; example commands for all settings follow 
the list.

- global, set osd_max_scrubs=1; higher values have no effect other than making 
your users angry
- global, set osd_deep_scrub_randomize_ratio=0; this parameter is unnecessary 
for distributing deep-scrubs, and its only effect in the current implementation 
is to trigger a large number of premature deep-scrubs, significantly increasing 
the overall deep-scrub load without any useful effect
- on the pools, set deep_scrub_interval according to needs and performance; 
scrubs will turn into deep-scrubs for every PG with a deep-scrub stamp older 
than deep_scrub_interval. Here you will need to do some calculations on what 
your hardware can do and how much average load you can tolerate
- on the pools, also set scrub_min_interval and scrub_max_interval, so that 
there is only one place to look for these settings
- per OSD device class, set osd_scrub_interval_randomize_ratio such that scrubs 
start within a reasonable window after scrub_min_interval. This parameter is 
very important for distributing scrubs as evenly as possible over time. The 
default of 0.5 is good for most cases. For the HDD pool used below I reduced it 
a bit, because scrub_min_interval is set to 66h, where 0.5 leads to a slightly 
too large start window.
- per OSD device class, set osd_scrub_backoff_ratio to a value close to but not 
higher than 1-1/(largest replication factor [=size] of the pools on this device 
class). This parameter is labelled dev, but it is really important for effective 
scrub scheduling. OSDs need to allocate scrub reservations, and this process is 
extremely racy, specifically for EC pools with a high replication factor. The 
default of 0.66 probably has 3x-replicated pools in mind, but it triggers way 
too many attempts to allocate scrub reservations for pools with a larger 
replication factor, causing deadlocks and blocking scrubs from being executed 
even if plenty of OSDs are idle. I found that 1-0.75/max_size_on_device_class 
works very well.
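
For reference, here is a sketch of how the settings above translate into 
commands for the pool from the report below. The values are the ones from my 
cluster; adjust the numbers to your own hardware, and note that the pool-level 
intervals are given in seconds:

# global settings
ceph config set global osd_max_scrubs 1
ceph config set global osd_deep_scrub_randomize_ratio 0.0

# pool-level intervals, in seconds (66h, 7d, 14d)
ceph osd pool set con-fs2-data2 scrub_min_interval  237600
ceph osd pool set con-fs2-data2 scrub_max_interval  604800
ceph osd pool set con-fs2-data2 deep_scrub_interval 1209600

# per device class, via the class:hdd config mask;
# start window 66h..90h => randomize ratio = 24h/66h ~ 0.363636
ceph config set osd/class:hdd osd_scrub_interval_randomize_ratio 0.363636
# backoff = 1 - 0.75/size; largest size on hdd here is 11 => 1-0.75/11 ~ 0.9319
ceph config set osd/class:hdd osd_scrub_backoff_ratio 0.9319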

After finding out what these parameters really do and adjusting them for my 
pools, I passed through a valley of tears and have now arrived at the beautiful 
distribution of (deep-)scrub stamps shown at the end for a pool on 16TB HDDs. 
Everything gets scrubbed every 3-4 days and deep-scrubs start no earlier than 
14 days after the last deep-scrub. The overall (deep-)scrub load is now half of 
what it was before the changes and I no longer get the dreaded "PGs not 
(deep-)scrubbed in time" warnings.

I calculated the (deep-)scrub time window configs such that about 30% of the 
OSDs will be continuously busy once the disks reach 70% utilization (currently 
ca. 45%). No user will complain about that, and there is enough spare 
performance left to catch up after high-load or recovery episodes without me 
having to do anything.
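
In case you want to redo that estimate for your own pools, here is a minimal 
sketch of the arithmetic for the deep-scrub part (plain scrubs add on top). The 
PG count and the 14d interval are the actual values from the report below; the 
per-PG deep-scrub duration and the OSD count are placeholders I chose for 
illustration only:

# back-of-the-envelope: fraction of OSDs kept busy by deep-scrubs
pgs=8192        # PGs in the pool
interval_h=336  # deep_scrub_interval (14d)
dur_h=1         # assumed avg deep-scrub duration per PG (placeholder)
osds=900        # assumed number of OSDs backing the pool (placeholder)
size=11         # each deep-scrub reserves 'size' OSDs at once

rate=$(echo "$pgs / $interval_h" | bc -l)  # ~24.4 PGs must start per hour
conc=$(echo "$rate * $dur_h" | bc -l)      # avg PGs deep-scrubbing concurrently
echo "busy OSD fraction: $(echo "$conc * $size / $osds" | bc -l)"  # ~0.30

If the result comes out higher than your users tolerate, increase 
deep_scrub_interval until it doesn't.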

Here is the scrub report generated for the pool I have been watching for weeks 
now. It is exactly as I wanted it, and I don't have to run any cron jobs; it 
just works:

# pool-scrub-report con-fs2-data2
Scrub info for pool con-fs2-data2 (id=19): dumped pgs

Scrub report:
   6%     566 PGs not scrubbed since  1 intervals (  6h)
  13%     528 PGs not scrubbed since  2 intervals ( 12h)
  21%     640 PGs not scrubbed since  3 intervals ( 18h)
  29%     668 PGs not scrubbed since  4 intervals ( 24h)
  37%     677 PGs not scrubbed since  5 intervals ( 30h)
  46%     729 PGs not scrubbed since  6 intervals ( 36h)
  54%     631 PGs not scrubbed since  7 intervals ( 42h)
  62%     662 PGs not scrubbed since  8 intervals ( 48h)
  70%     663 PGs not scrubbed since  9 intervals ( 54h)
  78%     660 PGs not scrubbed since 10 intervals ( 60h)
  85%     571 PGs not scrubbed since 11 intervals ( 66h)
  92%     585 PGs not scrubbed since 12 intervals ( 72h) [74 idle] 1 scrubbing
  96%     358 PGs not scrubbed since 13 intervals ( 78h) [34 idle] [3 scrubbing+deep] 2 scrubbing
  99%     181 PGs not scrubbed since 14 intervals ( 84h) [23 idle] [3 scrubbing+deep]
  99%      70 PGs not scrubbed since 15 intervals ( 90h) [9 idle] 1 scrubbing
 100%       3 PGs not scrubbed since 16 intervals ( 96h) [1 scrubbing+deep]
         8192 PGs out of 8192 reported, 0 missing, 4 scrubbing, 140 idle, 0 unclean.

Deep-scrub report:
   3%     295 PGs not deep-scrubbed since  1 intervals ( 24h)
   9%     461 PGs not deep-scrubbed since  2 intervals ( 48h)
  16%     558 PGs not deep-scrubbed since  3 intervals ( 72h)
  23%     613 PGs not deep-scrubbed since  4 intervals ( 96h) [1 scrubbing]
  31%     619 PGs not deep-scrubbed since  5 intervals (120h)
  39%     660 PGs not deep-scrubbed since  6 intervals (144h)
  47%     726 PGs not deep-scrubbed since  7 intervals (168h) [1 scrubbing]
  57%     743 PGs not deep-scrubbed since  8 intervals (192h)
  65%     727 PGs not deep-scrubbed since  9 intervals (216h) [1 scrubbing]
  73%     656 PGs not deep-scrubbed since 10 intervals (240h)
  75%     107 PGs not deep-scrubbed since 11 intervals (264h)
  82%     626 PGs not deep-scrubbed since 12 intervals (288h)
  90%     588 PGs not deep-scrubbed since 13 intervals (312h)
  94%     388 PGs not deep-scrubbed since 14 intervals (336h) [1 scrubbing]
  96%     129 PGs not deep-scrubbed since 15 intervals (360h) 2 scrubbing+deep
  98%     207 PGs not deep-scrubbed since 16 intervals (384h) 2 scrubbing+deep
  99%      79 PGs not deep-scrubbed since 17 intervals (408h) 1 scrubbing+deep
 100%      10 PGs not deep-scrubbed since 18 intervals (432h) 2 scrubbing+deep
         8192 PGs out of 8192 reported, 0 missing, 7 scrubbing+deep, 0 unclean.

con-fs2-data2  scrub_min_interval=66h  (11i/84%/625PGs÷i)
con-fs2-data2  scrub_max_interval=168h  (7d)
con-fs2-data2  deep_scrub_interval=336h  (14d/~89%/~520PGs÷d)
osd.338  osd_scrub_interval_randomize_ratio=0.363636  scrubs start after: 66h..90h
osd.338  osd_deep_scrub_randomize_ratio=0.000000
osd.338  osd_max_scrubs=1
osd.338  osd_scrub_backoff_ratio=0.931900  rec. this pool: .9319 (class hdd, size 11)
mon.ceph-01  mon_warn_pg_not_scrubbed_ratio=0.500000  warn: 10.5d (42.0i)
mon.ceph-01  mon_warn_pg_not_deep_scrubbed_ratio=0.750000  warn: 24.5d

Best regards, merry Christmas and a happy new year to everyone!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io