Thanks for your response, David.

What you've described matches what I've been thinking too. We currently have 1401 
OSDs in the cluster, and the output I posted is from the tail end of the 
backfill for a +64 PG increase on the biggest pool.

The problem is that we see this cluster do at most 20 backfills at the same time, 
and as the queue of PGs waiting to backfill shrinks, fewer and fewer are actively 
backfilling, which I don't quite understand.

Of the PGs currently backfilling, all have completely changed their sets (the 
difference between acting and up sets is 11 OSDs), which makes some sense since 
what moves around are the newly spawned PGs. That's 5 PGs currently in 
backfilling states, which blocks 110 OSDs (5 PGs x 22 distinct OSDs each, since 
the up and acting sets don't overlap). What happened to the other 1300? That's 
what's strange to me. There are another 7 PGs waiting to backfill.

Comparing the up and acting sets of the PGs currently backfilling against those 
of the PGs waiting to backfill, there are 13 OSDs in common, so I guess that 
partly answers it. I haven't checked, but I suspect each backfilling PG shares 
at least one OSD, in one of its sets, with a set of one of the waiting PGs; the 
sketch below is roughly how I pulled those sets out.
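
For reference, something along these lines will pull those sets out (a rough 
sketch only; the column layout of pgs_brief varies a bit between Ceph versions, 
and <pgid> is a placeholder):

    # PG id, state, up set and acting set for every PG in a backfill state
    ceph pg dump pgs_brief 2>/dev/null | grep backfill
    # full detail for a single PG, including its up and acting sets
    ceph pg <pgid> query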

So I guess we can't do much about the tail end taking so long: there's no way 
for more of the PGs to actually be backfilling at the same time.

I think we'll have to try bumping osd_max_backfills. Has anyone tried bumping 
the relative priority of recovery vs. other work (e.g. osd_recovery_op_priority)? 
What about setting noscrub?
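
To be concrete, something like this is what I have in mind (untested here; the 
values are purely illustrative, not recommendations):

    # bump backfill concurrency at runtime on all OSDs
    ceph tell osd.* injectargs '--osd-max-backfills 3'
    # and/or pause scrubbing while the backfill finishes
    ceph osd set noscrub
    ceph osd set nodeep-scrub

Both can be reverted at runtime afterwards (re-inject the old value, and 
ceph osd unset noscrub / nodeep-scrub).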

Best regards,

George

________________________________
From: David Turner [drakonst...@gmail.com]
Sent: 06 July 2017 16:08
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
adding OSDs

Just a quick place to start is osd_max_backfills.  You have this set to 1.  
Each PG is on 11 OSDs (EC 8+3).  When a PG is moving, it spans the original 11 
OSDs plus the X new OSDs it is going to.  With osd_max_backfills set to 1, each 
OSD can take part in only one backfill at a time, and each moving PG ties up 
11 + X OSDs.
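
As a quick sanity check, you can read the live value straight off one of the 
OSD daemons via the admin socket (osd.0 is just an example id; run it on the 
host where that daemon lives):

    ceph daemon osd.0 config get osd_max_backfills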

Now, with your cluster: I don't see how many OSDs you have, but you have 25 PGs 
moving around and 8 of them actively backfilling.  Assuming you were only 
changing 1 OSD per backfill operation, that would mean you had at least 96 OSDs 
((11 + 1) * 8), and that would require a perfect distribution of OSDs across the 
backfilling PGs.  Now say you're averaging closer to 3 OSDs changing per PG, and 
that the remaining 17 PGs waiting to backfill are each blocked by a few OSDs 
(because those OSDs are already involved in the 8 actively backfilling PGs).  
That would point to something closer to 200+ OSDs.
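
Spelling that arithmetic out (the numbers are just the ones from above):

    # 8 PGs actively backfilling, each pinning its 11 source OSDs plus 1 target,
    # assuming no OSD is shared between the backfilling PGs
    echo $(( (11 + 1) * 8 ))    # => 96 OSDs as a rough lower bound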

Every time I'm backfilling and want to speed things up, I watch iostat on some 
of my OSDs and increase osd_max_backfills until the disks are consistently at 
about 70% utilization, leaving headroom for customer I/O.  You can always figure 
out what's best for your use case, though.  Generally I've been OK running with 
osd_max_backfills=5 without much problem, and bringing that up some when I know 
that client IO will be minimal, but again it depends on your use case and 
cluster.
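
For example, on an OSD host I'll leave something like this running while I nudge 
osd_max_backfills up (the 5-second interval and watching %util are just habit, 
not a rule):

    # extended per-device stats every 5 seconds; watch the %util column
    iostat -x 5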

On Thu, Jul 6, 2017 at 10:08 AM 
<george.vasilaka...@stfc.ac.uk> wrote:
Hey folks,

We have a cluster that's currently backfilling after an increase in PG counts. We 
have tuned recovery and backfill way down as a "precaution" and would now like to 
tune them back up towards a good balance between recovery and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving 
production workloads. Said pools are EC 8+3.

It looks like we have a very low number of PGs backfilling at any one time, as in:

            2567 TB used, 5062 TB / 7630 TB avail
            145588/849529410 objects degraded (0.017%)
            5177689/849529410 objects misplaced (0.609%)
                7309 active+clean
                  23 active+clean+scrubbing
                  18 active+clean+scrubbing+deep
                  13 active+remapped+backfill_wait
                   5 active+undersized+degraded+remapped+backfilling
                   4 active+undersized+degraded+remapped+backfill_wait
                   3 active+remapped+backfilling
                   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, 
even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

    "osd_max_backfills": "1",
    "osd_min_recovery_priority": "0",
    "osd_backfill_full_ratio": "0.85",
    "osd_backfill_retry_interval": "10",
    "osd_allow_recovery_below_min_size": "true",
    "osd_recovery_threads": "1",
    "osd_backfill_scan_min": "16",
    "osd_backfill_scan_max": "64",
    "osd_recovery_thread_timeout": "30",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_sleep": "0",
    "osd_recovery_delay_start": "0",
    "osd_recovery_max_active": "5",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "64000",
    "osd_recovery_forget_lost_objects": "false",
    "osd_scrub_during_recovery": "false",
    "osd_kill_backfill_at": "0",
    "osd_debug_skip_full_check_in_backfill_reservation": "false",
    "osd_debug_reject_backfill_probability": "0",
    "osd_recovery_op_priority": "5",
    "osd_recovery_priority": "5",
    "osd_recovery_cost": "20971520",
    "osd_recovery_op_warn_multiple": "16",

What I'm looking for, first of all, is a better understanding of the mechanism 
that schedules the backfill/recovery work; the end goal is to learn how to tune 
this safely, to get as close as possible to an optimal balance between the rates 
at which recovery and client work are performed.

I'm thinking things like osd_max_backfills, 
osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for 
tuning.
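
If we do end up experimenting, something along these lines (injectable at 
runtime and revertible; the values below are purely illustrative) is what I'd 
expect to try:

    ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-backfill-scan-min 32 --osd-backfill-scan-max 128'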

Any thoughts/insights from the Ceph community will be greatly appreciated,

George
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
