[ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Xu (Simon) Chen
Hi folks,

I was running a Ceph cluster with 33 OSDs. Recently, 198 new OSDs (6 on each of
33 new servers) were added; I have finished rebalancing the data and then
marked the 33 old OSDs out.

As I have 6x as many OSDs, I am thinking of increasing pg_num of my largest
pool from 1k to at least 8k. What worries me is that this cluster has
around 10M objects and is supporting many production VMs with RBD.

I am wondering if there is a good way to estimate the amount of data that
will be shuffled after I increase pg_num. I want to make sure this can
be done within a reasonable amount of time, so that I can declare a
proper maintenance window (either overnight or over a weekend).

Thanks!
-Simon


Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Dan van der Ster
Hi,
I don't know the general calculation, but last week we split a pool with 20
million tiny objects from 512 to 1024 pgs, on a cluster with 80 OSDs. IIRC
around 7 million objects needed to move, and it took around 13 hours to
finish. The bottleneck in our case was objects per second (limited to
around 1000/s), not network throughput (which never exceeded ~50MB/s).

It wasn't completely transparent... the time to write a 4kB object
increased from 5ms to around 30ms during this splitting process.

I would guess that if you split from 1k to 8k pgs, around 80% of your data
will move. Basically, 7 out of 8 objects will be moved to a new primary PG,
but any objects that end up with 2nd or 3rd copies on the first 1k PGs
should not need to be moved.
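
To put very rough numbers on that for your pool, here is a back-of-envelope
sketch. The 10M object count is from your mail, the 4MB average object size is
only an assumption (plug in your real average), and the duration is simply
scaled from our 7M-objects-in-13-hours run:

# Back-of-envelope estimate of a pg_num split; not an exact model of CRUSH.
# Assumption: with a power-of-two split from OLD to NEW PGs, roughly
# (1 - OLD/NEW) of the objects end up in a PG that gets re-placed.

old_pg, new_pg = 1024, 8192            # the 1k -> 8k split being considered
n_objects = 10 * 10**6                 # ~10M objects, from your description
avg_obj_bytes = 4 * 1024**2            # assumed 4MB average object size (use your own)

moved_fraction = 1 - float(old_pg) / new_pg        # 7/8 = 0.875
moved_objects = moved_fraction * n_objects         # ~8.75M objects
moved_tb = moved_objects * avg_obj_bytes / 1e12    # ~37 TB at 4MB/object

# Scale the duration from what we observed: ~7M objects took ~13 hours.
est_hours = 13.0 * moved_objects / (7 * 10**6)

print("~%.1fM objects, ~%.0f TB, very roughly %.0f hours"
      % (moved_objects / 1e6, moved_tb, est_hours))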

I'd also be interested to hear of similar splitting experiences. We've been
planning a similar intervention on our larger cluster to move from 4k PGs
to 16k. I have been considering making the change gradually (10-100 PGs at
a time) instead of all at once. This approach would certainly lower the
performance impact, but would take much much longer to complete. I wrote a
short script to perform this gentle splitting here:
https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split

Be sure to understand what it's doing before trying it.
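
The core idea is just the loop below (a minimal sketch rather than the actual
script: the pool name and step size are illustrative, and it waits on a plain
HEALTH_OK where the real script is more careful about what it checks):

#!/usr/bin/env python
# Minimal sketch of "gentle" PG splitting: raise pg_num (then pgp_num) by a
# small step and wait for the cluster to settle before taking the next step.
import subprocess
import time

POOL, TARGET, STEP = "volumes", 8192, 64   # illustrative values, not defaults

def ceph(*args):
    return subprocess.check_output(("ceph",) + args).decode().strip()

def pg_num():
    # "ceph osd pool get <pool> pg_num" prints something like "pg_num: 1024"
    return int(ceph("osd", "pool", "get", POOL, "pg_num").split(":")[1])

while pg_num() < TARGET:
    step_to = min(pg_num() + STEP, TARGET)
    ceph("osd", "pool", "set", POOL, "pg_num", str(step_to))
    ceph("osd", "pool", "set", POOL, "pgp_num", str(step_to))  # this triggers the data movement
    while "HEALTH_OK" not in ceph("health"):                   # wait out creating/backfilling PGs
        time.sleep(30)
    print("pg_num/pgp_num now at %d" % step_to)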

Cheers,
Dan


Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Dan van der Ster
Hi,
When do you see thousands of slow requests during recovery? Does that
happen even with single OSD failures? You should be able to recover disks
without slow requests.

I always run with recovery op priority at its minimum of 1. Tweaking the
number of max backfills did not change much during that recent splitting
exercise.
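
For reference, this is the kind of throttling I mean, sketched as a runtime
injectargs call. The option names are the ones I know from firefly-era
releases, so verify them against your version (e.g. with "ceph daemon osd.N
config show"), and put the same values into ceph.conf under [osd] if you want
them to persist across restarts:

# Rough sketch: throttle recovery/backfill on all OSDs at runtime.
# Option names are assumptions based on firefly-era releases; check them
# on your own version before relying on this.
import subprocess

subprocess.check_call([
    "ceph", "tell", "osd.*", "injectargs",
    "--osd-recovery-op-priority 1 "   # lowest priority for recovery ops
    "--osd-max-backfills 1 "          # at most one concurrent backfill per OSD
    "--osd-recovery-max-active 1",    # one active recovery op per OSD
])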

Which Ceph version are you running? There have been snap-trim-related
recovery problems that were only recently fixed in the production
releases. 0.80.8 is OK, but I don't know about Giant...

Cheers, Dan

On 1 Feb 2015 21:39, Xu (Simon) Chen xche...@gmail.com wrote:

 In my case, each object is 8MB (the Glance default for storing images on the
RBD backend). RBD doesn't work extremely well when Ceph is recovering - it is
common to see hundreds or a few thousand blocked requests (taking 30s or more
to finish). This translates to high IO wait inside the VMs, and many
applications don't deal with this well.

 I am not convinced that increasing pg_num gradually is the right way to go.
Have you tried giving backfill traffic a very low priority?

 Thanks.
 -Simon



Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Xu (Simon) Chen
In my case, each object is 8MB (the Glance default for storing images on the
RBD backend). RBD doesn't work extremely well when Ceph is recovering - it is
common to see hundreds or a few thousand blocked requests (taking 30s or more
to finish). This translates to high IO wait inside the VMs, and many
applications don't deal with this well.

I am not convinced that increasing pg_num gradually is the right way to go.
Have you tried giving backfill traffic a very low priority?

Thanks.
-Simon



Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Udo Lembke
Hi Xu,

On 01.02.2015 21:39, Xu (Simon) Chen wrote:
 RBD doesn't work extremely well when Ceph is recovering - it is common
 to see hundreds or a few thousand blocked requests (taking 30s or more
 to finish). This translates to high IO wait inside the VMs, and many
 applications don't deal with this well.
This sounds like you don't have settings like
osd max backfills = 1
osd recovery max active = 1


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com