Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 25/05/18 20:21, David Turner wrote:
> If you start your pool with 12 PGs, 4 of them will have double the size
> of the other 8.  It is 100% based on a power of 2 and has absolutely
> nothing to do with the number you start with vs the number you increase
> to.  If your PG count is not a power of 2 then you will have 2 different
> sizes of PGs with some being double the size of the others.

Thanks for correcting my wild speculation in a friendly way and with
facts. I have spent quite a few hours digging into this and now I
understand the issue far better. I gave some explanations in another email.

> Once upon a time I started a 2 rack cluster with 12,000 PGs.  All data
> was in 1 pool and I attempted to balance the cluster by making sure that
> every OSD in the cluster was within 2 PGs of each other.

I have spent some time thinking about the importance of PGs being equal
in size and realized it depends a lot on the workload, the existence of
several pools sharing the cluster, etc. In my particular situation (most
data under CephFS using a quite wide 8+2 erasure code; low activity;
immutable, write-once read-many data), it seems to be a non-issue.

I need to think more about it.

Having a wild spread of OSD capacities (120 GB - 1 TB) seems to be a far
worse idea. I spend my days reweighting, and it is quite difficult to
fully utilize the capacity of the cluster.

Thanks.

-- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea _/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
OK, I am writing this so you don't waste your time correcting me. I beg
your pardon.


On 25/05/18 18:28, Jesus Cea wrote:
> So, if I understand correctly, Ceph tries to do the minimum number of
> splits. If you increase PGs from 8 to 12, it will split 4 PGs and leave
> the other 4 PGs alone, creating an imbalance.
> 
> According to that, it would be far more advisable to create the pool
> with 12 PGs from the very beginning.
> 
> If I understand correctly, then, the advice of "power of two" is an
> oversimplification. The real advice would be: you had better double your
> PG count whenever you increase it. That is: 12->24->48->96... No real
> need for a power of two.

Instead of trying to be smart, I just spent a few hours building a Ceph
experiment myself, testing different scenarios, PG resizing and such.

The "power of two" rule is law.

If you don't follow it, some PGs will contain double the number of
objects of others.

The rule is something like this:

Let's say your PG_num satisfies:

2^(n-1) < PG_num <= 2^n

The object name you created is hashed, and "n" bits of the hash are
considered. Let's call that number x.

If x < PG_num, your object will be stored in PG number x.

If x >= PG_num, then drop a bit (use only "n-1" bits) and the result is
the PG that will store your object.

This algorithm means that if your PG_num is not a power of two, some of
your PGs will be double-sized.

For instance, suppose PG_num = 13 (the first number is the "x" of your
object, the second number is the PG used to store it):

 0 -> 0    1 -> 1    2 -> 2    3 -> 3
 4 -> 4    5 -> 5    6 -> 6    7 -> 7
 8 -> 8    9 -> 9   10 -> 10  11 -> 11
12 -> 12

Good so far. But now:

13 -> 5  14 -> 6   15 -> 7

So, PGs 0-4 and 8-12 will each store "COUNT" objects, but PGs 5, 6 and 7
will store "2*COUNT" objects. PGs 5, 6 and 7 have twice the probability
of storing your object.

Interestingly, the maximum object count difference between the biggest
PG and the smallest PG will be a factor of TWO, statistically.
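
The rule can be sketched in a few lines of Python. This is only a toy
model of Ceph's stable-mod placement, not the real code (the real code
also hashes the object name with rjenkins before taking the low bits);
the function name `stable_mod` is my own:

```python
from collections import Counter

def stable_mod(x, pg_num, n):
    """Map a hashed value x onto one of pg_num PGs.

    n is the number of bits such that 2^(n-1) < pg_num <= 2^n.
    """
    if x % (2 ** n) < pg_num:
        return x % (2 ** n)        # x fits: use all n bits
    return x % (2 ** (n - 1))      # x too big: drop a bit

# PG_num = 13, so n = 4 and the hash suffixes run from 0 to 15.
counts = Counter(stable_mod(x, 13, 4) for x in range(16))
print(sorted(counts.items()))
# PGs 5, 6 and 7 each receive two of the sixteen suffixes; the rest get one.
```

Feeding a uniform stream of hash values through this mapping reproduces
exactly the 2x imbalance described above.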

How important it is that all PGs are the same size is something I am not
sure I fully understand yet.

> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about CephFS).
> 
> Does what I am saying make sense?

I answer to myself.

No, fool, it doesn't make sense. Ceph doesn't work that way. The PG
allocation is far simpler and more scalable, but also dumber. The
imbalance depends only on the number of PGs (which should be a power of
two), not on the process used to get there.

The described idea doesn't work because if the PG count is not a power
of two, some PGs simply get twice the lottery tickets and will receive
double the number of objects. Copying, moving or replacing objects will
not change that.

> How does Ceph decide which PG to split? By per-PG object count or by PG
> byte size?

Following the algorithm described at the top of this post, Ceph simply
splits PGs in increasing order. If my PG_num is 13 and I increase it to
14, Ceph will split PG 5. Fully deterministic, and unrelated to the size
of that PG or how many objects it stores.
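
That determinism can be checked against a toy model of the placement
function (a simplified sketch, not Ceph's actual code; `stable_mod` is a
name I made up):

```python
def stable_mod(x, pg_num):
    # Smallest n such that pg_num <= 2^n.
    n = (pg_num - 1).bit_length()
    if x % (2 ** n) < pg_num:
        return x % (2 ** n)
    return x % (2 ** (n - 1))

# Which hash suffixes land in a different PG when pg_num goes 13 -> 14?
moved = {x: (stable_mod(x, 13), stable_mod(x, 14))
         for x in range(16)
         if stable_mod(x, 13) != stable_mod(x, 14)}
print(moved)  # {13: (5, 13)} -- only PG 5 splits, into PGs 5 and 13
```

Only objects whose hash suffix is 13 move, so PG 5 is the one that
splits, regardless of how big it is.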

Since the association of an object with a PG is based on the hash of the
object name, we would expect every PG to have (statistically) the same
number of objects. Object size plays no part here, so a huge object will
create a huge PG. This is a well-known problem in Ceph (a few large
objects will imbalance your cluster).

> Thanks for your post. It deserves to be a blog!

The original post was great. My reply was lame. I was just too smart for
my own good :).

Sorry for wasting your time.



Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread David Turner
If you start your pool with 12 PGs, 4 of them will have double the size of
the other 8.  It is 100% based on a power of 2 and has absolutely nothing
to do with the number you start with vs the number you increase to.  If
your PG count is not a power of 2 then you will have 2 different sizes of
PGs with some being double the size of the others.

When increasing your PG count, Ceph chooses which PGs to split in half
based on the PG name, not on how big the PG is or how many objects it
has.  The PG names are based on how many PGs you have in your pool, and they
split perfectly evenly if and only if your PG count is a power of 2.

Once upon a time I started a 2 rack cluster with 12,000 PGs.  All data was
in 1 pool and I attempted to balance the cluster by making sure that every
OSD in the cluster was within 2 PGs of each other.  That is to say that if
the average PGs per OSD was 100, then no OSD had more than 101 PGs for that
pool and no OSD had less than 99 PGs.  My tooling made this possible and is
how we balanced our other clusters.  The resulting balance in this cluster
was AWFUL!!!  Digging in I found that some of the PGs were twice as big as
the other PGs.  It was actually very mathematical in how many.  Of the
12,000 PGs, 4,384 were twice as big as the remaining 7,616.  We
increased the PG count in the pool to 16,384, and all of the PGs were the
same size when the backfilling finished.
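
Those counts are consistent with the usual stable-mod model of PG
placement (a quick sanity check added here, not David's original
tooling): with pg_num = 12,000, hash values from 12,000 up to 16,383
fold back by dropping their top bit, doubling exactly the PGs in the
folded range.

```python
pg_num = 12000
n = (pg_num - 1).bit_length()      # 14, since 2^13 < 12000 <= 2^14
half = 2 ** (n - 1)                # 8192
# Hash values in [pg_num, 2^n) fold back by dropping the top bit,
# so the PGs they land on carry double weight.
doubled = {x % half for x in range(pg_num, 2 ** n)}
print(len(doubled), pg_num - len(doubled))  # 4384 7616
```

The folded range is PGs 3,808 through 8,191: 4,384 double-weight PGs,
leaving 7,616 single-weight ones, matching the observed split.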

On Fri, May 25, 2018 at 12:48 PM Jesus Cea  wrote:

> On 17/05/18 20:36, David Turner wrote:
> > By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
> > your PGs will be the same size and easier to balance and manage.  What
> > happens when you have a non base 2 number is something like this.  Say
> > you have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to
> > 6, then you will have 2 PGs that are 2GB and 4 PGs that are 1GB as
> > you've split 2 of the PGs into 4 to get to the 6 total.  If you increase
> > the pg(p)_num to 8, then all 8 PGs will be 1GB.  Depending on how you
> > manage your cluster, that doesn't really matter, but for some methods of
> > balancing your cluster, that will greatly imbalance things.
>
> So, if I understand correctly, Ceph tries to do the minimum number of
> splits. If you increase PGs from 8 to 12, it will split 4 PGs and leave
> the other 4 PGs alone, creating an imbalance.
>
> According to that, it would be far more advisable to create the pool
> with 12 PGs from the very beginning.
>
> If I understand correctly, then, the advice of "power of two" is an
> oversimplification. The real advice would be: you had better double your
> PG count whenever you increase it. That is: 12->24->48->96... No real
> need for a power of two.
>
> Also, a bad split is not important if the pool creates/destroys objects
> constantly, because new objects will be spread evenly. This could be an
> approach to rebalance a badly expanded pool: just copy & rename your
> objects (I am thinking about CephFS).
>
> Does what I am saying make sense?
>
> How does Ceph decide which PG to split? By per-PG object count or by PG
> byte size?
>
> Thanks for your post. It deserves to be a blog!
>


Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-25 Thread Jesus Cea
On 17/05/18 20:36, David Turner wrote:
> By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
> your PGs will be the same size and easier to balance and manage.  What
> happens when you have a non base 2 number is something like this.  Say
> you have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to
> 6, then you will have 2 PGs that are 2GB and 4 PGs that are 1GB as
> you've split 2 of the PGs into 4 to get to the 6 total.  If you increase
> the pg(p)_num to 8, then all 8 PGs will be 1GB.  Depending on how you
> manage your cluster, that doesn't really matter, but for some methods of
> balancing your cluster, that will greatly imbalance things.

So, if I understand correctly, Ceph tries to do the minimum number of
splits. If you increase PGs from 8 to 12, it will split 4 PGs and leave
the other 4 PGs alone, creating an imbalance.

According to that, it would be far more advisable to create the pool
with 12 PGs from the very beginning.

If I understand correctly, then, the advice of "power of two" is an
oversimplification. The real advice would be: you had better double your
PG count whenever you increase it. That is: 12->24->48->96... No real
need for a power of two.

Also, a bad split is not important if the pool creates/destroys objects
constantly, because new objects will be spread evenly. This could be an
approach to rebalance a badly expanded pool: just copy & rename your
objects (I am thinking about CephFS).

Does what I am saying make sense?

How does Ceph decide which PG to split? By per-PG object count or by PG
byte size?

Thanks for your post. It deserves to be a blog!



Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-18 Thread Bryan Banister
+1

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kai Wagner
Sent: Thursday, May 17, 2018 4:20 PM
To: David Turner 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Increasing number of PGs by not a factor of two?


Great summary David. Wouldn't this be worth a blog post?

On 17.05.2018 20:36, David Turner wrote:
By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of your 
PGs will be the same size and easier to balance and manage.  What happens when 
you have a non base 2 number is something like this.  Say you have 4 PGs that 
are all 2GB in size.  If you increase pg(p)_num to 6, then you will have 2 PGs 
that are 2GB and 4 PGs that are 1GB as you've split 2 of the PGs into 4 to get 
to the 6 total.  If you increase the pg(p)_num to 8, then all 8 PGs will be 
1GB.  Depending on how you manage your cluster, that doesn't really matter, but 
for some methods of balancing your cluster, that will greatly imbalance things.

This would be a good time to go to a base 2 number.  I think you're thinking 
about Gluster where if you have 4 bricks and you want to increase your 
capacity, going to anything other than a multiple of 4 (8, 12, 16) kills 
performance (worse than increasing storage already does) and takes longer as it 
has to weirdly divide the data instead of splitting a single brick up to 
multiple bricks.

As you increase your PGs, do this slowly and in a loop.  I like to increase my 
PGs by 256, wait for all PGs to create, activate, and peer, rinse/repeat until 
I get to my target.  [1] This is an example of a script that should accomplish 
this with no interference.  Notice the use of flags while increasing the PGs.  
It will make things take much longer if you have an OSD OOM itself or die for 
any reason by adding to the peering needing to happen.  It will also be wasted 
IO to start backfilling while you're still making changes; it's best to wait 
until you finish increasing your PGs and everything peers before you let data 
start moving.

Another thing to keep in mind is how long your cluster will be moving data 
around.  Increasing your PG count on a pool full of data is one of the most 
intensive operations you can tell a cluster to do.  The last time I had to do 
this, I increased pg(p)_num by 4k PGs from 16k to 32k, let it backfill, 
rinse/repeat until the desired PG count was achieved.  For me, that 4k PGs 
would take 3-5 days depending on other cluster load and how full the cluster 
was.  If you do decide to increase your PGs by 4k instead of the full increase, 
change the 16384 to the number you decide to go to, backfill, continue.


[1]
# Make sure to set pool variable as well as the number ranges to the
# appropriate values.
flags="nodown nobackfill norecover"
for flag in $flags; do
  ceph osd set $flag
done
pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
# The first number is your current PG count for the pool, the second number
# is the target PG count, and the third number is how many to increase it by
# each time through the loop.
for num in {7700..16384..256}; do
  ceph osd pool set $pool pg_num $num
  while sleep 10; do
    ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
  ceph osd pool set $pool pgp_num $num
  while sleep 10; do
    ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
done
for flag in $flags; do
  ceph osd unset $flag
done

On Thu, May 17, 2018 at 9:27 AM Kai Wagner wrote:
Hi Oliver,

a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.

You can increase your PGs, but keep in mind that this will keep the
cluster quite busy for some while. That said I would rather increase in
smaller steps than in one large move.

Kai


On 17.05.2018 01:29, Oliver Schulz wrote:
> Dear all,
>
> we have a Ceph cluster that has slowly evolved over several
> years and Ceph versions (started with 18 OSDs and 54 TB
> in 2013, now about 200 OSDs and 1.5 PB, still the same
> cluster, with data continuity). So there are some
> "early sins" in the cluster configuration, left over from
> the early days.
>
> One of these sins is the number of PGs in our CephFS "data"
> pool, which is 7200 and therefore not (as recommended)
> a power of two. Pretty much all of our data is in the
> "data" pool, the only other pools are "rbd" and "metadata",
> both contain little data (and they have way too many PGs
> already, another early sin).
>
> Is it possible - and safe - to change the number of "data"
> pool PGs from 7200 to 8192 or 16384? As we recently added
> more OSDs, I guess it would be time to increase the number
> of PGs anyhow. Or would we have to go to 14400 instead of
> 16384?
>
>
> Thanks for any advice,
>
> Oliver

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-17 Thread Kai Wagner
Great summary David. Wouldn't this be worth a blog post?


On 17.05.2018 20:36, David Turner wrote:
> By sticking with PG numbers as a base 2 number (1024, 16384, etc) all
> of your PGs will be the same size and easier to balance and manage. 
> What happens when you have a non base 2 number is something like
> this.  Say you have 4 PGs that are all 2GB in size.  If you increase
> pg(p)_num to 6, then you will have 2 PGs that are 2GB and 4 PGs that
> are 1GB as you've split 2 of the PGs into 4 to get to the 6 total.  If
> you increase the pg(p)_num to 8, then all 8 PGs will be 1GB. 
> Depending on how you manage your cluster, that doesn't really matter,
> but for some methods of balancing your cluster, that will greatly
> imbalance things.
>
> This would be a good time to go to a base 2 number.  I think you're
> thinking about Gluster where if you have 4 bricks and you want to
> increase your capacity, going to anything other than a multiple of 4
> (8, 12, 16) kills performance (worse than increasing storage already
> does) and takes longer as it has to weirdly divide the data instead of
> splitting a single brick up to multiple bricks.
>
> As you increase your PGs, do this slowly and in a loop.  I like to
> increase my PGs by 256, wait for all PGs to create, activate, and
> peer, rinse/repeat until I get to my target.  [1] This is an example
> of a script that should accomplish this with no interference.  Notice
> the use of flags while increasing the PGs.  It will make things take
> much longer if you have an OSD OOM itself or die for any reason by
> adding to the peering needing to happen.  It will also be wasted IO to
> start backfilling while you're still making changes; it's best to wait
> until you finish increasing your PGs and everything peers before you
> let data start moving.
>
> Another thing to keep in mind is how long your cluster will be moving
> data around.  Increasing your PG count on a pool full of data is one
> of the most intensive operations you can tell a cluster to do.  The
> last time I had to do this, I increased pg(p)_num by 4k PGs from 16k
> to 32k, let it backfill, rinse/repeat until the desired PG count was
> achieved.  For me, that 4k PGs would take 3-5 days depending on other
> cluster load and how full the cluster was.  If you do decide to
> increase your PGs by 4k instead of the full increase, change the 16384
> to the number you decide to go to, backfill, continue. 
>
>
> [1]
> # Make sure to set pool variable as well as the number ranges to the
> # appropriate values.
> flags="nodown nobackfill norecover"
> for flag in $flags; do
>   ceph osd set $flag
> done
> pool=rbd
> echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
> # The first number is your current PG count for the pool, the second
> # number is the target PG count, and the third number is how many to
> # increase it by each time through the loop.
> for num in {7700..16384..256}; do
>   ceph osd pool set $pool pg_num $num
>   while sleep 10; do
>     ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
>   ceph osd pool set $pool pgp_num $num
>   while sleep 10; do
>     ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
> done
> for flag in $flags; do
>   ceph osd unset $flag
> done
>
> On Thu, May 17, 2018 at 9:27 AM Kai Wagner wrote:
>
> Hi Oliver,
>
> a good value is 100-150 PGs per OSD. So in your case between 20k
> and 30k.
>
> You can increase your PGs, but keep in mind that this will keep the
> cluster quite busy for some while. That said I would rather
> increase in
> smaller steps than in one large move.
>
> Kai
>
>
> On 17.05.2018 01:29, Oliver Schulz wrote:
> > Dear all,
> >
> > we have a Ceph cluster that has slowly evolved over several
> > years and Ceph versions (started with 18 OSDs and 54 TB
> > in 2013, now about 200 OSDs and 1.5 PB, still the same
> > cluster, with data continuity). So there are some
> > "early sins" in the cluster configuration, left over from
> > the early days.
> >
> > One of these sins is the number of PGs in our CephFS "data"
> > pool, which is 7200 and therefore not (as recommended)
> > a power of two. Pretty much all of our data is in the
> > "data" pool, the only other pools are "rbd" and "metadata",
> > both contain little data (and they have way too many PGs
> > already, another early sin).
> >
> > Is it possible - and safe - to change the number of "data"
> > pool PGs from 7200 to 8192 or 16384? As we recently added
> > more OSDs, I guess it would be time to increase the number
> > of PGs anyhow. Or would we have to go to 14400 instead of
> > 16384?
> >
> >
> > Thanks for any advice,
> >
> > Oliver

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-17 Thread David Turner
You would actually need to go through one last time to get to your target
PGs, but anyway, like all commands you come across online, test them and
make sure they do what you intend.

On Thu, May 17, 2018 at 2:36 PM David Turner  wrote:

> By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
> your PGs will be the same size and easier to balance and manage.  What
> happens when you have a non base 2 number is something like this.  Say you
> have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to 6, then
> you will have 2 PGs that are 2GB and 4 PGs that are 1GB as you've split 2
> of the PGs into 4 to get to the 6 total.  If you increase the pg(p)_num to
> 8, then all 8 PGs will be 1GB.  Depending on how you manage your cluster,
> that doesn't really matter, but for some methods of balancing your cluster,
> that will greatly imbalance things.
>
> This would be a good time to go to a base 2 number.  I think you're
> thinking about Gluster where if you have 4 bricks and you want to increase
> your capacity, going to anything other than a multiple of 4 (8, 12, 16)
> kills performance (worse than increasing storage already does) and takes
> longer as it has to weirdly divide the data instead of splitting a single
> brick up to multiple bricks.
>
> As you increase your PGs, do this slowly and in a loop.  I like to
> increase my PGs by 256, wait for all PGs to create, activate, and peer,
> rinse/repeat until I get to my target.  [1] This is an example of a script
> that should accomplish this with no interference.  Notice the use of flags
> while increasing the PGs.  It will make things take much longer if you have
> an OSD OOM itself or die for any reason by adding to the peering needing to
> happen.  It will also be wasted IO to start backfilling while you're still
> making changes; it's best to wait until you finish increasing your PGs and
> everything peers before you let data start moving.
>
> Another thing to keep in mind is how long your cluster will be moving data
> around.  Increasing your PG count on a pool full of data is one of the most
> intensive operations you can tell a cluster to do.  The last time I had to
> do this, I increased pg(p)_num by 4k PGs from 16k to 32k, let it backfill,
> rinse/repeat until the desired PG count was achieved.  For me, that 4k PGs
> would take 3-5 days depending on other cluster load and how full the
> cluster was.  If you do decide to increase your PGs by 4k instead of the
> full increase, change the 16384 to the number you decide to go to,
> backfill, continue.
>
>
> [1]
> # Make sure to set pool variable as well as the number ranges to the
> # appropriate values.
> flags="nodown nobackfill norecover"
> for flag in $flags; do
>   ceph osd set $flag
> done
> pool=rbd
> echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
> # The first number is your current PG count for the pool, the second
> # number is the target PG count, and the third number is how many to
> # increase it by each time through the loop.
> for num in {7700..16384..256}; do
>   ceph osd pool set $pool pg_num $num
>   while sleep 10; do
>     ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
>   ceph osd pool set $pool pgp_num $num
>   while sleep 10; do
>     ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
> done
> for flag in $flags; do
>   ceph osd unset $flag
> done
>
> On Thu, May 17, 2018 at 9:27 AM Kai Wagner  wrote:
>
>> Hi Oliver,
>>
>> a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.
>>
>> You can increase your PGs, but keep in mind that this will keep the
>> cluster quite busy for some while. That said I would rather increase in
>> smaller steps than in one large move.
>>
>> Kai
>>
>>
>> On 17.05.2018 01:29, Oliver Schulz wrote:
>> > Dear all,
>> >
>> > we have a Ceph cluster that has slowly evolved over several
>> > years and Ceph versions (started with 18 OSDs and 54 TB
>> > in 2013, now about 200 OSDs and 1.5 PB, still the same
>> > cluster, with data continuity). So there are some
>> > "early sins" in the cluster configuration, left over from
>> > the early days.
>> >
>> > One of these sins is the number of PGs in our CephFS "data"
>> > pool, which is 7200 and therefore not (as recommended)
>> > a power of two. Pretty much all of our data is in the
>> > "data" pool, the only other pools are "rbd" and "metadata",
>> > both contain little data (and they have way too many PGs
>> > already, another early sin).
>> >
>> > Is it possible - and safe - to change the number of "data"
>> > pool PGs from 7200 to 8192 or 16384? As we recently added
>> > more OSDs, I guess it would be time to increase the number
>> > of PGs anyhow. Or would we have to go to 14400 instead of
>> > 16384?
>> >
>> >
>> > Thanks for any advice,
>> >
>> > Oliver

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-17 Thread David Turner
By sticking with PG numbers as a base 2 number (1024, 16384, etc) all of
your PGs will be the same size and easier to balance and manage.  What
happens when you have a non base 2 number is something like this.  Say you
have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to 6, then
you will have 2 PGs that are 2GB and 4 PGs that are 1GB as you've split 2
of the PGs into 4 to get to the 6 total.  If you increase the pg(p)_num to
8, then all 8 PGs will be 1GB.  Depending on how you manage your cluster,
that doesn't really matter, but for some methods of balancing your cluster,
that will greatly imbalance things.

This would be a good time to go to a base 2 number.  I think you're
thinking about Gluster where if you have 4 bricks and you want to increase
your capacity, going to anything other than a multiple of 4 (8, 12, 16)
kills performance (worse than increasing storage already does) and takes
longer as it has to weirdly divide the data instead of splitting a single
brick up to multiple bricks.

As you increase your PGs, do this slowly and in a loop.  I like to increase
my PGs by 256, wait for all PGs to create, activate, and peer, rinse/repeat
until I get to my target.  [1] This is an example of a script that should
accomplish this with no interference.  Notice the use of flags while
increasing the PGs.  It will make things take much longer if you have an
OSD OOM itself or die for any reason by adding to the peering needing to
happen.  It will also be wasted IO to start backfilling while you're still
making changes; it's best to wait until you finish increasing your PGs and
everything peers before you let data start moving.

Another thing to keep in mind is how long your cluster will be moving data
around.  Increasing your PG count on a pool full of data is one of the most
intensive operations you can tell a cluster to do.  The last time I had to
do this, I increased pg(p)_num by 4k PGs from 16k to 32k, let it backfill,
rinse/repeat until the desired PG count was achieved.  For me, that 4k PGs
would take 3-5 days depending on other cluster load and how full the
cluster was.  If you do decide to increase your PGs by 4k instead of the
full increase, change the 16384 to the number you decide to go to,
backfill, continue.


[1]
# Make sure to set pool variable as well as the number ranges to the
# appropriate values.
flags="nodown nobackfill norecover"
for flag in $flags; do
  ceph osd set $flag
done
pool=rbd
echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
# The first number is your current PG count for the pool, the second number
# is the target PG count, and the third number is how many to increase it by
# each time through the loop.
for num in {7700..16384..256}; do
  ceph osd pool set $pool pg_num $num
  while sleep 10; do
    ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
  ceph osd pool set $pool pgp_num $num
  while sleep 10; do
    ceph osd health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
  done
done
for flag in $flags; do
  ceph osd unset $flag
done

On Thu, May 17, 2018 at 9:27 AM Kai Wagner  wrote:

> Hi Oliver,
>
> a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.
>
> You can increase your PGs, but keep in mind that this will keep the
> cluster quite busy for some while. That said I would rather increase in
> smaller steps than in one large move.
>
> Kai
>
>
> On 17.05.2018 01:29, Oliver Schulz wrote:
> > Dear all,
> >
> > we have a Ceph cluster that has slowly evolved over several
> > years and Ceph versions (started with 18 OSDs and 54 TB
> > in 2013, now about 200 OSDs and 1.5 PB, still the same
> > cluster, with data continuity). So there are some
> > "early sins" in the cluster configuration, left over from
> > the early days.
> >
> > One of these sins is the number of PGs in our CephFS "data"
> > pool, which is 7200 and therefore not (as recommended)
> > a power of two. Pretty much all of our data is in the
> > "data" pool, the only other pools are "rbd" and "metadata",
> > both contain little data (and they have way too many PGs
> > already, another early sin).
> >
> > Is it possible - and safe - to change the number of "data"
> > pool PGs from 7200 to 8192 or 16384? As we recently added
> > more OSDs, I guess it would be time to increase the number
> > of PGs anyhow. Or would we have to go to 14400 instead of
> > 16384?
> >
> >
> > Thanks for any advice,
> >
> > Oliver
> >
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nürnberg)
>
>

Re: [ceph-users] Increasing number of PGs by not a factor of two?

2018-05-17 Thread Kai Wagner
Hi Oliver,

a good value is 100-150 PGs per OSD. So in your case between 20k and 30k.

You can increase your PGs, but keep in mind that this will keep the
cluster quite busy for some while. That said I would rather increase in
smaller steps than in one large move.

Kai


On 17.05.2018 01:29, Oliver Schulz wrote:
> Dear all,
>
> we have a Ceph cluster that has slowly evolved over several
> years and Ceph versions (started with 18 OSDs and 54 TB
> in 2013, now about 200 OSDs and 1.5 PB, still the same
> cluster, with data continuity). So there are some
> "early sins" in the cluster configuration, left over from
> the early days.
>
> One of these sins is the number of PGs in our CephFS "data"
> pool, which is 7200 and therefore not (as recommended)
> a power of two. Pretty much all of our data is in the
> "data" pool, the only other pools are "rbd" and "metadata",
> both contain little data (and they have way too many PGs
> already, another early sin).
>
> Is it possible - and safe - to change the number of "data"
> pool PGs from 7200 to 8192 or 16384? As we recently added
> more OSDs, I guess it would be time to increase the number
> of PGs anyhow. Or would we have to go to 14400 instead of
> 16384?
>
>
> Thanks for any advice,
>
> Oliver
>

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)






[ceph-users] Increasing number of PGs by not a factor of two?

2018-05-16 Thread Oliver Schulz

Dear all,

we have a Ceph cluster that has slowly evolved over several
years and Ceph versions (started with 18 OSDs and 54 TB
in 2013, now about 200 OSDs and 1.5 PB, still the same
cluster, with data continuity). So there are some
"early sins" in the cluster configuration, left over from
the early days.

One of these sins is the number of PGs in our CephFS "data"
pool, which is 7200 and therefore not (as recommended)
a power of two. Pretty much all of our data is in the
"data" pool, the only other pools are "rbd" and "metadata",
both contain little data (and they have way too many PGs
already, another early sin).

Is it possible - and safe - to change the number of "data"
pool PGs from 7200 to 8192 or 16384? As we recently added
more OSDs, I guess it would be time to increase the number
of PGs anyhow. Or would we have to go to 14400 instead of
16384?


Thanks for any advice,

Oliver