Thanks for the clarification Christian.  Good to know about the potential
increase in OSD usage. As you said, given how much available capacity we
have, we're betting on the distribution not getting much worse. But we'll
look at re-weighting if things go sideways.

Cheers,
Robin

On Mon, Jul 11, 2016 at 11:07 PM Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Tue, 12 Jul 2016 03:43:41 +0000 Robin Percy wrote:
>
> > First off, thanks for the great response David.
> >
> Yes, that was a very good writeup.
>
> > If I understand correctly, you're saying there are two distinct costs to
> > consider: peering, and backfilling. The backfilling cost is a function of
> > the amount of data in our pool, and therefore won't benefit from
> > incremental steps. But the peering cost is a function of pg_num, and
> > should be incremented in steps of at most ~200 (depending on hardware)
> > until we reach a power of 2.
> >
> Peering is all about RAM (more links, states, permanently so), CPU and
> network (when setting up the links).
> And this happens instantaneously, with no parameters in Ceph to slow this
> down.
>
> So yes, you want to increase the pg_num and pgp_num somewhat slowly, at
> least at first until you have a feel for what your HW can handle.
>
> > Assuming I've got that right, one follow up question is: should we expect
> > blocked/delayed requests during both the peering and backfilling
> > processes, or is it more common in one than the other? I couldn't quite
> > get a definitive answer from the docs on peering.
> >
> Peering is a sharp shock; it should be quick to resolve (again, depending
> on HW, etc.) and not lead to noticeable interruptions.
> But YMMV, thus again initial baby steps.
>
> Backfilling is that inevitable avalanche, but if you start with
> osd_max_backfills=1 and then creep it up as you get a feel for what your
> cluster can handle, you should be able to both keep slow requests at bay
> AND hopefully finish within a reasonably sized maintenance window.
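>
> As a rough sketch (a runtime-only change, not persisted across OSD
> restarts; the values are just examples of how to creep it up):
>
>   # start conservative, then raise gradually while watching for slow requests
>   ceph tell osd.* injectargs '--osd-max-backfills 1'
>   ceph -w                                    # watch recovery progress and slow request warnings
>   ceph tell osd.* injectargs '--osd-max-backfills 2'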
>
> Since you're still on Firefly, you won't be getting the queue improvements
> of Jewel, which should help keep backfilling from stomping on client
> traffic.
>
> OTOH, you're currently only using a fraction of your cluster's capabilities
> (64 PGs with 126 OSDs), so there should be quite some capacity for this
> reshuffle available.
>
> > At this point we're planning to hedge our bets by increasing pg_num to
> 256
> > before backfilling so we can at least buy some headroom on our full OSDs
> > and evaluate the impact before deciding whether we can safely make the
> > jumps to 2048 without an outage. If that doesn't make sense, I may be
> > overestimating the cost of peering.
> >
> As David said, freeze your cluster (norecover, nobackfill, nodown and
> noout), slowly up your PGs and PGPs then let the good times roll and
> unleash the dogs of backfill.
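>
> In command form, the freeze/unfreeze would look roughly like this:
>
>   ceph osd set norecover
>   ceph osd set nobackfill
>   ceph osd set nodown
>   ceph osd set noout
>   # ...increase pg_num/pgp_num in steps, let peering settle...
>   ceph osd unset nodown
>   ceph osd unset nobackfill
>   ceph osd unset norecover
>   ceph osd unset noout      # only after backfilling has finished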
>
>
> The thing that worries me the most in your scenario is the already
> near-full OSDs.
>
> As many people found out the hard way, Ceph may initially go and put MORE
> data on OSDs before later distributing things more evenly.
> See for example this mail from me and the image URL in it:
> http://www.spinics.net/lists/ceph-users/msg27794.html
>
> Normally my advice would be to re-weight the full (or nearly empty) OSDs so
> that things get a bit more evenly distributed and below near-full levels
> before starting the PG increase.
> But in your case with so few PGs to begin with, it's going to be tricky to
> get it right and not make things worse.
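>
> For reference, re-weighting would look something like this (the OSD ID and
> values below are only examples, not a recommendation for your cluster):
>
>   ceph osd reweight 17 0.90                # lower the override weight of a near-full OSD
>   ceph osd reweight-by-utilization 120     # or let Ceph pick OSDs above 120% of average usage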
>
> Hopefully the plentiful PG/OSD choices Ceph has after the PG increase in
> your case will make it do the right thing from the get-go.
>
> Christian
>
>
> > Thanks again for your help,
> > Robin
> >
> >
> > On Mon, Jul 11, 2016 at 2:40 PM David Turner
> > <david.tur...@storagecraft.com> wrote:
> >
> > > When you increase your PGs you're already going to be moving around all
> > > of your data.  Doing a full doubling of your PGs from 64 -> 128 -> 256
> > > -> ... -> 2048 over and over and letting it backfill to healthy every
> > > time is a lot of extra data movement that isn't needed.
> > >
> > > I would recommend setting osd_max_backfills to something that won't
> > > cripple your cluster (5 works decently for us), setting the norecover,
> > > nobackfill, nodown, and noout flags, and then increasing your pg_num and
> > > pgp_num slowly until you reach your target.  How much extra RAM you have
> > > in each of your storage nodes determines how much you can safely
> > > increase pg_num by at a time.  We don't do more than ~200 at a time.
> > > When you reach your target and there is no more peering happening, then
> > > unset norecover, nobackfill, and nodown.  After you finish all of the
> > > backfilling, then unset noout.
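> > >
> > > A minimal sketch of that pg_num/pgp_num stepping (run with the flags
> > > above already set; the pool name, target, and step size are just
> > > placeholders, adjust them to your cluster):
> > >
> > >   pool=rbd; target=2048; step=200        # placeholders
> > >   cur=$(ceph osd pool get "$pool" pg_num | awk '{print $2}')
> > >   while [ "$cur" -lt "$target" ]; do
> > >     cur=$(( cur + step )); [ "$cur" -gt "$target" ] && cur=$target
> > >     ceph osd pool set "$pool" pg_num "$cur"
> > >     ceph osd pool set "$pool" pgp_num "$cur"
> > >     sleep 60                             # let peering settle before the next bump
> > >   done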
> > >
> > > You are likely to see slow/blocked requests in your cluster throughout
> > > this process, but the best thing is to get to the other side of
> > > increasing your pgs.  The official recommendation for increasing pgs is
> > > to plan ahead for the size of your cluster and start with that many pgs
> > > because this process is painful and will slow down your cluster until
> > > it's done.
> > >
> > > Note, if you're increasing pgs from 2048 to 4096, then doing it in
> > > smaller chunks of 512 at a time could make sense because of how ceph
> > > treats pools with a non-power-of-2 number of pgs.  If you have 8 pgs
> > > that are 4GB each and increase the number to 10 (not a power of 2), then
> > > you will have 6 pgs that are 4GB and 4 pgs that are 2GB.  It splits pgs
> > > in half to fill up the count beyond the previous power of 2.  If you
> > > went to 14 pgs, then you would have 2 pgs that are 4GB and 12 pgs that
> > > are 2GB.  Finally, when you set it to 16 pgs you would have 16 pgs that
> > > are all 2GB.
> > >
> > > So if you increase your PGs by less than a power of 2, it will only
> > > work on that number of pgs and leave the rest of them alone.  However,
> > > in your scenario of going from 64 pgs to 2048, you are going to be
> > > affecting all of the PGs every time you split, and buy yourself nothing
> > > by doing it in smaller chunks.  The reason not to just jump pg_num
> > > straight to 2048 is that ceph has to peer each PG as it creates it; you
> > > can peer your osds into oblivion and lose access to all of your data
> > > for a while.  That's why the recommendation is to add them bit by bit
> > > with nodown, noout, nobackfill, and norecover set, so that you get to
> > > the number you want and only then tell your cluster to start moving
> > > data.
> > > ------------------------------
> > > *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> > > Robin Percy [rpe...@gmail.com]
> > > *Sent:* Monday, July 11, 2016 2:53 PM
> > > *To:* ceph-us...@ceph.com
> > > *Subject:* [ceph-users] Advice on increasing pgs
> > >
> > > Hello,
> > >
> > > I'm looking for some advice on how to most safely increase the pgs in
> > > our primary ceph pool.
> > >
> > > A bit of background: We're running ceph 0.80.9 and have a cluster of
> > > 126 OSDs with only 64 pgs allocated to the pool. As a result, 2 OSDs
> > > are now 88% full, while the pool is only showing as 6% used.
> > >
> > > Based on my understanding, this is clearly a placement problem, so the
> > > plan is to increase to 2048 pgs. In order to avoid significant
> > > performance degradation, we'll be incrementing pg_num and pgp_num one
> > > power of two at a time and waiting for the cluster to rebalance before
> > > making the next increment.
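> > >
> > > Concretely, something like this is what we have in mind (the pool name
> > > below is just a placeholder), waiting for HEALTH_OK between steps:
> > >
> > >   pool=rbd    # placeholder for our pool name
> > >   for pgs in 128 256 512 1024 2048; do
> > >     ceph osd pool set "$pool" pg_num "$pgs"
> > >     ceph osd pool set "$pool" pgp_num "$pgs"
> > >     while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
> > >   done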
> > >
> > > My question is: are there any other steps we can take to minimize
> > > potential performance impact? And/or is there a way to model or predict
> > > the level of impact, based on cluster configuration, data placement,
> > > etc?
> > >
> > > Thanks in advance for any answers,
> > > Robin
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>