+1

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kai Wagner
Sent: Thursday, May 17, 2018 4:20 PM
To: David Turner <drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Increasing number of PGs by not a factor of two?


Great summary, David. Wouldn't this be worth a blog post?

On 17.05.2018 20:36, David Turner wrote:
By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of your 
PGs will be the same size and easier to balance and manage.  What happens when 
you have a non base 2 number is something like this.  Say you have 4 PGs that 
are all 2GB in size.  If you increase pg(p)_num to 6, then you will have 2 PGs 
that are 2GB and 4 PGs that are 1GB as you've split 2 of the PGs into 4 to get 
to the 6 total.  If you increase the pg(p)_num to 8, then all 8 PGs will be 
1GB.  Depending on how you manage your cluster, that doesn't really matter, but 
for some methods of balancing your cluster, that will greatly imbalance things.
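
As a quick back-of-the-envelope illustration of that example (plain shell 
arithmetic, not a Ceph command; the starting point of 4 PGs at 2GB each is 
taken straight from the paragraph above):

# Starting from 4 PGs of 2GB, only (target - 4) of them get split in half
# when pg_num is raised, so a non-power-of-two target leaves mixed sizes.
start_pgs=4
pg_size_gb=2
for target in 6 8; do
  split=$(( target - start_pgs ))    # original PGs that are split in two
  unsplit=$(( start_pgs - split ))   # original PGs left at full size
  echo "pg_num=$target: $unsplit PGs of ${pg_size_gb}GB and $(( 2 * split )) PGs of $(( pg_size_gb / 2 ))GB"
done
# pg_num=6: 2 PGs of 2GB and 4 PGs of 1GB
# pg_num=8: 0 PGs of 2GB and 8 PGs of 1GB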

This would be a good time to go to a power of two.  I think you're thinking of 
Gluster, where if you have 4 bricks and want to increase capacity, going to 
anything other than a multiple of 4 (8, 12, 16) kills performance (beyond the 
hit you already take when expanding storage) and takes longer, because the 
data has to be redistributed awkwardly instead of simply splitting individual 
bricks across multiple new bricks.

As you increase your PGs, do it slowly and in a loop.  I like to increase my 
PGs by 256, wait for all PGs to create, activate, and peer, then rinse/repeat 
until I reach my target.  [1] This is an example of a script that should 
accomplish this with no interference.  Notice the use of flags while 
increasing the PGs: an OSD that OOMs or dies for any reason mid-split would 
otherwise add to the peering that has to happen and make everything take much 
longer.  It would also be wasted IO to start backfilling while you're still 
making changes; it's best to wait until you've finished increasing your PGs 
and everything has peered before you let data start moving.
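
If you want to keep an eye on things between steps, outside of the script's 
own check, these standard commands are handy (nothing here is required by the 
script; adjust to taste):

# Quick ways to eyeball peering state while the loop below runs
ceph pg stat           # one-line summary, e.g. "8192 pgs: 8192 active+clean; ..."
watch -n 10 ceph -s    # live view of peering/activating/creating activity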

Another thing to keep in mind is how long your cluster will be moving data 
around.  Increasing the PG count on a pool full of data is one of the most 
intensive operations you can ask a cluster to do.  The last time I had to do 
this, I increased pg(p)_num by 4k PGs at a time, from 16k to 32k, let it 
backfill, and rinsed/repeated until the desired PG count was reached.  For me, 
each 4k increase took 3-5 days depending on other cluster load and how full 
the cluster was.  If you decide to increase your PGs by 4k at a time instead 
of doing the full increase at once, change the 16384 in the script to whatever 
number you're going to next, let it backfill, and continue.
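
To gauge how far along each of those 4k steps is, a couple of generic commands 
give a rough progress picture (the pool name here is just the one used in the 
script below):

# Rough backfill progress indicators
ceph -s                   # objects degraded/misplaced plus recovery throughput
ceph osd pool stats rbd   # per-pool client and recovery I/O rates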


[1]
# Make sure to set the pool variable, as well as the number range below, to
# the appropriate values.
flags="nodown nobackfill norecover"
for flag in $flags; do
  ceph osd set $flag
done
pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
# The range runs from just above the pool's current PG count up to the target
# PG count, in steps of 256 (how much to increase by each time through the
# loop). Pick a start that reaches the target in even steps, otherwise the
# loop stops short of it (e.g. 7700 with a step of 256 never lands on 16384).
for num in {7680..16384..256}; do
  ceph osd pool set $pool pg_num $num
  # Wait until nothing is left peering/creating before touching pgp_num.
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
  ceph osd pool set $pool pgp_num $num
  # Same wait after raising pgp_num, which is what actually remaps placements.
  while sleep 10; do
    ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
done
# Now that everything has peered, unset the flags and let backfill/recovery
# start moving data.
for flag in $flags; do
  ceph osd unset $flag
done

On Thu, May 17, 2018 at 9:27 AM Kai Wagner <kwag...@suse.com> wrote:
Hi Oliver,

a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.
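
For reference, the arithmetic behind those numbers (the ~200 OSD figure comes 
from Oliver's mail below; whether the per-OSD target counts every replica 
separately depends on which version of the rule of thumb you follow, so treat 
this as a sketch rather than a hard answer):

# Rule-of-thumb arithmetic: ~200 OSDs at 100-150 PGs per OSD
osds=200
echo "$(( osds * 100 ))-$(( osds * 150 ))"   # 20000-30000
# The commonly cited pg calc formula also divides by the replica count,
# e.g. with size=3: (200 * 100) / 3 is roughly 6700 PGs across all pools.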

You can increase your PGs, but keep in mind that this will keep the
cluster quite busy for a while. That said, I would rather increase in
smaller steps than in one large move.

Kai


On 17.05.2018 01:29, Oliver Schulz wrote:
> Dear all,
>
> we have a Ceph cluster that has slowly evolved over several
> years and Ceph versions (started with 18 OSDs and 54 TB
> in 2013, now about 200 OSDs and 1.5 PB, still the same
> cluster, with data continuity). So there are some
> "early sins" in the cluster configuration, left over from
> the early days.
>
> One of these sins is the number of PGs in our CephFS "data"
> pool, which is 7200 and therefore not (as recommended)
> a power of two. Pretty much all of our data is in the
> "data" pool; the only other pools are "rbd" and "metadata",
> both of which contain little data (and already have way too
> many PGs, another early sin).
>
> Is it possible - and safe - to change the number of "data"
> pool PGs from 7200 to 8192 or 16384? As we recently added
> more OSDs, I guess it would be time to increase the number
> of PGs anyhow. Or would we have to go to 14400 instead of
> 16384?
>
>
> Thanks for any advice,
>
> Oliver