Hello,

Thank you for the feedback, Jan, much appreciated! I won't post the whole tree as it is rather long, but here is an example of one of our hosts. All of the OSDs and hosts are weighted the same, with the exception of one host that is missing an OSD due to a broken backplane. We only use hosts as buckets, so there is no rack/DC level. We have not manually adjusted the CRUSH map at all for this cluster.

 -1 302.26959 root default
-24  14.47998     host osd23
192   1.81000         osd.192      up  1.00000          1.00000
193   1.81000         osd.193      up  1.00000          1.00000
194   1.81000         osd.194      up  1.00000          1.00000
195   1.81000         osd.195      up  1.00000          1.00000
199   1.81000         osd.199      up  1.00000          1.00000
200   1.81000         osd.200      up  1.00000          1.00000
201   1.81000         osd.201      up  1.00000          1.00000
202   1.81000         osd.202      up  1.00000          1.00000

I appreciate your input and will likely follow the same path you did, slowly increasing the PGs and adjusting the weights as necessary. If anyone else has any further suggestions, I'd love to hear them as well!
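For the weight adjustments I expect to use something along these lines - the OSD id and the 0.90 value are just placeholders, so treat this as an untested sketch:

# temporarily lower the reweight of a nearly-full OSD (its CRUSH weight stays the same)
ceph osd reweight 192 0.90
# or let Ceph pick the over-utilized OSDs automatically
ceph osd reweight-by-utilization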

- Daniel


On 06/02/2015 01:33 PM, Jan Schermer wrote:
Post the output from your “ceph osd tree”.
We were in a similar situation: some of the OSDs were quite full while others had 
>50% free. This is exactly why we increased the number of PGs, and it helped to 
some degree.
Are all your hosts the same size? Does your CRUSH map select a host in the end? 
If you only have a few hosts with differing numbers of OSDs, the distribution 
will be poor (IMHO).
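For reference, a rule that replicates across hosts normally ends with a chooseleaf step over type host, roughly like the decompiled default below (the names here are just the stock defaults, yours may differ):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}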

Anyway, when we started increasing the PG numbers we first created the PGs 
themselves (pg_num) in small increments, since that put a lot of load on the 
OSDs and we were seeing slow requests with large increases.
So something like this:
for i in `seq 4096 64 8192` ; do ceph osd pool set poolname pg_num $i ; done
This ate a few gigs from the drives (1-2GB if I remember correctly).
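If slow requests are still a problem, a variant of the loop that waits for the new PGs to finish creating before the next bump might help - just a sketch, we didn't run it exactly like this:

for i in `seq 4096 64 8192` ; do
  ceph osd pool set poolname pg_num $i
  # wait until no PGs are left in a creating state before the next increment
  while ceph -s | grep -q creating ; do sleep 10 ; done
done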

Once that was finished we increased pgp_num in larger and larger increments 
 - at first 64 at a time, then 512 at a time as we approached the 
target (16384 in our case). This does allocate more space temporarily, and it 
seems to just randomly move data around - one minute an OSD is fine, the next 
it is nearing full. One of us basically had to watch the process the whole 
time, reweighting the devices that were almost full.
As the number of PGs increased it became much simpler: the overhead was 
smaller, each unit of work was smaller, and all the management operations ran a lot 
smoother.
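Roughly, the ramp-up looked like the sketch below - the step size, sleep and the nearfull check are arbitrary, and in practice we mostly watched and reweighted by hand:

for i in `seq 4096 512 16384` ; do
  ceph osd pool set poolname pgp_num $i
  # let the backfill settle and keep an eye on nearfull OSDs before the next step
  while ceph health | grep -Eqi 'backfill|recover' ; do
    ceph health detail | grep -i 'near full'
    sleep 60
  done
done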

YMMV - our data distribution was poor from the start, hosts had differing 
weights due to differing numbers of OSDs, and there were some historical remnants 
from when we tried to load-balance the data by hand. We ended up in a much better 
state, but not a perfect one - some OSDs still have much more free space than others.
We haven’t touched the CRUSH map at all during this process; once we do, and set 
newer tunables, the data distribution should become much more even.
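For completeness, switching the tunables should just be the command below (assuming all clients are recent enough to support them - and expect a fair amount of data movement afterwards):

ceph osd crush tunables optimal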

I’d love to hear others’ input, since we are not sure why exactly this 
problem is present at all - I’d expect Ceph to fill all the OSDs to the same or a 
close-enough level, but in reality we have OSDs with weight 1.0 that are 
almost empty and others with weight 0.5 that are nearly full… When adding data 
it seems to (subjectively) distribute it evenly...
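For anyone who wants to see this on their own cluster, newer releases (Hammer onwards, I believe) can show weight and utilization side by side:

# compare CRUSH weight, reweight and %USE per OSD in one view
ceph osd df tree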

Jan

On 02 Jun 2015, at 18:52, Daniel Maraio <dmar...@choopa.com> wrote:

Hello,

  I have some questions about the size of my placement groups and how I can get 
a more even distribution. We currently have 160 2TB OSDs across 20 chassis. We 
have 133TB used in our radosgw pool with a replica size of 2. We want to move 
to 3 replicas but are concerned we may fill up some of our OSDs. Some OSDs have 
~1.1TB free while others only have ~600GB free. The radosgw pool has 4096 PGs; 
looking at the documentation I probably want to increase this to 8192, but 
we have decided to hold off on that for now.
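If I'm reading the formula right, the 8192 figure comes from the usual ~100 PGs per OSD rule of thumb:

# target pg_num ~= (OSDs * 100) / replica_size, rounded up to a power of two
# with 3 replicas: 160 * 100 / 3 ~= 5333, so the next power of two is 8192
echo $(( 160 * 100 / 3 ))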

  So, now for the PG usage. I dumped out the PG stats and noticed that there 
are two groups of PG sizes in my cluster. About 1024 PGs are 
each around 17-18GB in size, while the rest are all around 34-36GB in 
size. Any idea why there are two distinct groups? We only have the one pool 
with data in it, though there are several different buckets in the radosgw 
pool. The data in the pool ranges from small images to 4-6MB audio files. Will 
increasing the number of PGs on this pool provide a more even distribution?
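For reference, this is roughly how I pulled the per-PG sizes - adjust the column index if BYTES is not the 7th field of 'ceph pg dump' in your version:

# print PG id and size in GB, sorted by size
ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\./ { printf "%s %.1f\n", $1, $7/1024/1024/1024 }' | sort -k2 -n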

  Another thing to note is that the initial cluster was built lopsided, with 
some 4TB OSDs and some 2TB; we have since removed all the 4TB disks and are only 
using 2TB disks across the entire cluster. Not sure if this would have had any 
impact.

  Thank you for your time and I would appreciate any insight the community can 
offer.

- Daniel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
