In my case, each object is 8MB (the glance default for storing images on the
rbd backend). RBD doesn't work very well while ceph is recovering - it is
common to see hundreds or even a few thousand blocked requests (>30s to
finish). This translates into high IO wait inside the VMs, and many
applications don't deal with that well.

I am not convinced that increasing pg_num gradually is the right way to go.
Have you tried giving backfill traffic a very low priority?
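
For example, something along these lines - just a sketch of the knobs I have
in mind (defaults and exact behaviour vary by ceph version, so treat the
values as illustrative rather than a tested recipe):

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

That caps concurrent backfills per OSD and deprioritizes recovery ops
relative to client IO, at the cost of a longer recovery window.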

Thanks.
-Simon

On Sun, Feb 1, 2015 at 2:39 PM, Dan van der Ster <d...@vanderster.com> wrote:

> Hi,
> I don't know the general calculation, but last week we split a pool with
> 20 million tiny objects from 512 to 1024 pgs, on a cluster with 80 OSDs.
> IIRC around 7 million objects needed to move, and it took around 13 hours
> to finish. The bottleneck in our case was objects per second (limited to
> around 1000/s), not network throughput (which never exceeded ~50MB/s).
>
> It wasn't completely transparent... the time to write a 4kB object
> increased from 5ms to around 30ms during this splitting process.
>
> I would guess that if you split from 1k to 8k pgs, around 80% of your data
> will move. Basically, 7 out of 8 objects will be moved to a new primary PG,
> but any objects that end up with 2nd or 3rd copies on the first 1k PGs
> should not need to be moved.
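>
> (Back-of-the-envelope: with power-of-two pg counts, an object keeps its old
> PG id only when its hash maps back into the first 1k of the 8k PGs, i.e. 1
> time in 8, so 1 - 1/8 = 87.5% of objects get a new primary PG; replicas that
> happen to stay put pull the net movement down toward that ~80% figure.)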
>
> I'd also be interested to hear of similar splitting experiences. We've
> been planning a similar intervention on our larger cluster to move from 4k
> PGs to 16k. I have been considering making the change gradually (10-100 PGs
> at a time) instead of all at once. This approach would certainly lower the
> performance impact, but would take much much longer to complete. I wrote a
> short script to perform this gentle splitting here:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split
>
> Be sure to understand what it's doing before trying it.
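>
> If you just want the flavour of it, the loop is roughly the following (a
> simplified sketch, not the actual script - the real one does more sanity
> checking, and the pool name, step size and target here are made up):
>
>     POOL=rbd            # hypothetical pool name
>     TARGET=16384        # hypothetical final pg_num
>     STEP=64
>     CUR=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
>     while [ "$CUR" -lt "$TARGET" ]; do
>         # only take the next step once the cluster has settled
>         if ceph health | grep -q HEALTH_OK; then
>             CUR=$((CUR + STEP))
>             [ "$CUR" -gt "$TARGET" ] && CUR=$TARGET
>             ceph osd pool set "$POOL" pg_num "$CUR"
>             ceph osd pool set "$POOL" pgp_num "$CUR"
>         fi
>         sleep 60
>     done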
>
> Cheers,
> Dan
> On 1 Feb 2015 18:21, "Xu (Simon) Chen" <xche...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I was running a ceph cluster with 33 OSDs. Recently, 33x6 = 198 new OSDs
>> hosted on 33 new servers were added; I finished rebalancing the data onto
>> them and then marked the 33 old OSDs out.
>>
>> As I have 6x as many OSDs, I am thinking of increasing pg_num of my
>> largest pool from 1k to at least 8k. What worries me is that this cluster
>> has around 10M objects and is supporting many production VMs with RBD.
>>
>> I am wondering if there is a good way to estimate the amount of data that
>> will be shuffled after I increase pg_num. I want to make sure this can be
>> done within a reasonable amount of time, so that I can declare a proper
>> maintenance window (either overnight or over a weekend).
>>
>> Thanks!
>> -Simon
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com