Hi,

On 04/02/15 19:31, Stillwell, Bryan wrote:
> All,
>
> Whenever we're doing some kind of recovery operation on our ceph
> clusters (cluster expansion or dealing with a drive failure), there
> seems to be a fairly noticeable performance drop while it does the
> backfills (last time I measured it, the performance during recovery was
> something like 20% of a healthy cluster). I'm wondering if there are
> any settings that we might be missing which would improve this
> situation?
>
> Before doing any kind of expansion operation I make sure both 'noscrub'
> and 'nodeep-scrub' are set, so that scrubbing isn't making things
> worse.
>
> Also we have the following options set in our ceph.conf:
>
> [osd]
> osd_journal_size = 16384
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 1
> osd_recovery_max_single_start = 1
> osd_op_threads = 12
> osd_crush_initial_weight = 0
>
> I'm wondering if there might be a way to use ionice with the CFQ scheduler
> to put the recovery traffic in the Idle class so customer traffic has a
> higher priority?
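[For reference: throttles like these can also be tightened at runtime without restarting OSDs, via injectargs. A minimal sketch using the standard ceph CLI; the values shown are illustrative, not recommendations, and on older releases some options only take full effect after an OSD restart:]

```shell
# Tighten backfill/recovery throttles on all OSDs at runtime:
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# Disable scrubbing cluster-wide before an expansion,
# and re-enable it once the cluster is healthy again:
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... expansion / backfill happens here ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```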
Recovery creates I/O performance drops in our VMs too, but it's manageable. What really hurts us are deep scrubs.

Our current situation is Firefly 0.80.9 with a total of 24 identical OSDs evenly distributed over 4 servers, with the following relevant configuration:

osd recovery max active = 2
osd scrub load threshold = 3
osd deep scrub interval = 1209600 # 14 days
osd max backfills = 4
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7

We managed to add several OSDs at once while deep scrubs were in practice disabled: we simply increased the deep scrub interval from 1 to 2 weeks, which, if I understand correctly, had the effect of disabling them for 1 week (and indeed there were none while the backfilling went on for several hours).

With these settings and no deep scrubs, the load increased a bit in the VMs doing non-negligible I/O, but this was manageable. Even the disk thread ioprio settings (which are what you want in order to get the ionice behaviour for deep scrubs) didn't seem to make much of a difference.

Note: I don't believe Ceph will try to scatter the scrubs over the whole period you set with deep scrub interval. Its algorithm seems much simpler than that and may lead to temporary salvos of successive deep scrubs, which can generate transient I/O load that is hard to diagnose (by default scrubs and deep scrubs are logged by the OSDs, so you can correlate them with whatever you use to supervise your cluster).

I actually considered monitoring Ceph for backfills and automatically running 'ceph osd set nodeep-scrub' while there are some, unsetting it when they disappear.

Best regards,

Lionel Bouton
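[That monitoring idea could be sketched roughly as below. This is a minimal sketch, not a tested tool: it assumes the JSON shape produced by `ceph status --format json` (a `pgmap.pgs_by_state` list of `{state_name, count}` entries), and the decision logic is kept separate from the cluster calls so it can be checked on its own:]

```python
import subprocess

def backfill_in_progress(status):
    """Return True if any PG state in a 'ceph status' JSON dict mentions
    backfill (e.g. 'active+remapped+wait_backfill' or '...+backfilling')."""
    pg_states = status.get("pgmap", {}).get("pgs_by_state", [])
    return any("backfill" in entry["state_name"] for entry in pg_states)

def sync_nodeep_scrub_flag(status):
    """Set nodeep-scrub while backfills are running, unset it afterwards.
    Running this for real requires a reachable cluster; the status dict
    would come from e.g.:
        json.loads(subprocess.check_output(
            ["ceph", "status", "--format", "json"]))"""
    action = "set" if backfill_in_progress(status) else "unset"
    subprocess.check_call(["ceph", "osd", action, "nodeep-scrub"])

# Decision logic on a hand-written status fragment:
sample = {"pgmap": {"pgs_by_state": [
    {"state_name": "active+clean", "count": 310},
    {"state_name": "active+remapped+wait_backfill", "count": 2},
]}}
print(backfill_in_progress(sample))  # True: backfills pending
```

[A cron job or loop calling `sync_nodeep_scrub_flag` every minute or so would approximate the behaviour described above.]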
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com