We have run into this same scenario, with the long tail of recovery taking
much longer than the initial phase.

This happens either when we are adding OSDs or when an OSD gets taken down.
At first we have osd_max_backfills set to 1 so recovery doesn't kill the
cluster with IO. As time passes, the backfill ends up being performed by a
single OSD, so we gradually increase osd_max_backfills up to 10 to reduce
the time it takes to recover fully. I know there are a few other factors at
play here, but we tend to follow this procedure every time; a rough sketch
is below.
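
For what it's worth, a sketch of that ramp-up in Python, assuming the ceph
CLI is available on an admin node and using the Jewel-era injectargs
mechanism; the step values and the pause between steps are made-up numbers
you would tune for your own cluster:

    import subprocess
    import time

    # Raise osd_max_backfills in stages so recovery speeds up without
    # hitting the cluster with the full backfill IO load at once.
    for limit in (1, 2, 4, 8, 10):
        # Inject the new limit into all running OSDs at runtime.
        subprocess.check_call([
            "ceph", "tell", "osd.*", "injectargs",
            "--osd-max-backfills %d" % limit,
        ])
        print("osd_max_backfills is now %d" % limit)
        time.sleep(30 * 60)  # let recovery settle before the next bump

Note that injectargs only changes the running daemons; the value in
ceph.conf is untouched, so OSDs fall back to the configured 1 on restart.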

On Wed, May 11, 2016 at 6:29 PM Christian Balzer <ch...@gol.com> wrote:

> On Wed, 11 May 2016 16:10:06 +0000 Somnath Roy wrote:
>
> > I bumped up the backfill/recovery settings to match Hammer. It is
> > probably unlikely that the long-tail latency is a parallelism issue; if
> > it were, the entire recovery would suffer, not just the tail. It's
> > probably a prioritization issue. I will start looking and update my
> > findings. I can't add ceph-devel because of the table, but needed to
> > include the community, which is why ceph-users :-).. Also, I wanted to
> > know from Ceph users whether they are seeing similar issues..
> >
>
> What I meant by the lack of parallelism is that at the start of a
> rebuild there are likely to be many candidate PGs for recovery and
> backfilling, so many things happen at the same time, up to the
> configured limits (max backfills etc.).
>
> From looking at my test cluster, it starts with 8-10 backfills and
> recoveries (out of 140 affected PGs), but later in the game there are
> fewer and fewer PGs (and OSDs/nodes) to choose from, so by around 60
> remaining PGs things slow down to just 3-4 backfills.
> And around 20 PGs it's down to 1-2 backfills, so the parallelism is
> clearly gone at that point and recovery speed is down to what a single
> PG/OSD can handle.
>
> Christian
>
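
That matches what we see. A toy model of the effect, in Python (pure
simulation, nothing Ceph-specific; the PG count, OSD count, and per-OSD
limit are made-up numbers):

    import random

    # Toy model: a set of PGs to backfill, each pinned to one target OSD,
    # with at most max_backfills concurrent backfills per OSD. Concurrency
    # collapses as the remaining PGs concentrate on fewer OSDs.
    random.seed(1)
    num_osds, max_backfills = 12, 1
    pgs = [random.randrange(num_osds) for _ in range(140)]  # target OSD per PG

    step = 0
    while pgs:
        busy = {}    # backfills admitted per OSD this step
        active = []  # PGs (by target OSD) backfilling this step
        for osd in pgs:
            if busy.get(osd, 0) < max_backfills:
                busy[osd] = busy.get(osd, 0) + 1
                active.append(osd)
        for osd in active:
            pgs.remove(osd)
        step += 1
        print("step %2d: %2d concurrent backfills, %3d PGs left"
              % (step, len(active), len(pgs)))

The first steps run at the full width of the cluster, while the last few
crawl along at 1-2 backfills: the same tail shape as above.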
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Christian Balzer [mailto:ch...@gol.com]
> > Sent: Wednesday, May 11, 2016 12:31 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; Nick Fisk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Weighted Priority Queue testing
> >
> >
> >
> > Hello,
> >
> > not sure if the Cc: to the users ML was intentional or not, but either
> > way.
> >
> > The issue in the tracker (http://tracker.ceph.com/issues/15763) and
> > what you have seen (and I as well) both feel a lot like a lack of
> > parallelism towards the end of rebuilds.
> >
> > This becomes even more obvious when backfills and recovery settings are
> > lowered.
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
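
To watch that tail on a live cluster, something like the following,
assuming the ceph CLI and Jewel-era state names (wait_backfill; newer
releases renamed it backfill_wait):

    import subprocess
    import time

    # Poll PG states and count how many are actively backfilling versus
    # waiting; the active count dropping to 1-2 while PGs are still
    # waiting is the loss of parallelism described above.
    while True:
        out = subprocess.check_output(["ceph", "pg", "dump", "pgs_brief"])
        text = out.decode("utf-8", "replace")
        backfilling = text.count("backfilling")
        waiting = text.count("wait_backfill")
        print("backfilling: %d  wait_backfill: %d" % (backfilling, waiting))
        if backfilling == 0 and waiting == 0:
            break
        time.sleep(60)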
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
