Hello,
Thank you for the feedback Jan, much appreciated! I won't post the
whole tree as it is rather long, but here is an example of one of our
hosts. All of the OSDs and hosts are weighted the same, with the
exception of a host that is missing an OSD due to a broken backplane. We
are only using hosts for buckets so no rack/DC. We have not manually
adjusted the crush map at all for this cluster.
 -1  302.26959  root default
-24   14.47998      host osd23
192    1.81000          osd.192   up   1.00000   1.00000
193    1.81000          osd.193   up   1.00000   1.00000
194    1.81000          osd.194   up   1.00000   1.00000
195    1.81000          osd.195   up   1.00000   1.00000
199    1.81000          osd.199   up   1.00000   1.00000
200    1.81000          osd.200   up   1.00000   1.00000
201    1.81000          osd.201   up   1.00000   1.00000
202    1.81000          osd.202   up   1.00000   1.00000
I appreciate your input and will likely follow the same path you
have, slowly increasing the PGs and adjusting the weights as necessary.
If anyone else has any further suggestions I'd love to hear them as well!
- Daniel
On 06/02/2015 01:33 PM, Jan Schermer wrote:
Post the output from your “ceph osd tree”.
We were in a similar situation: some of the OSDs were quite full while others
had >50% free. This is exactly why we increased the number of PGs, and it
helped to some degree.
Are all your hosts the same size? Does your CRUSH map select a host in the end?
If you have only a few hosts with differing numbers of OSDs, the distribution
will be poor (IMHO).
Anyway, when we started increasing the PG count we first grew the PGs
themselves (pg_num) in small increments, since large jumps put a lot of load
on the OSDs and we were seeing slow requests.
So something like this:
for i in $(seq 4096 64 8192); do ceph osd pool set poolname pg_num $i; done
This ate a few gigs from the drives (1-2GB if I remember correctly).
Once that was finished we increased the pgp_num in larger and larger increments
- at first 64 at a time and then 512 at a time when we were reaching the
target (16384 in our case). This does allocate more space temporarily, and it
seems to move data around at random - one minute an OSD is fine, the next it
is nearing full. One of us basically had to watch the process the whole time,
reweighting the devices that were almost full.
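The ramp described above can be sketched as a pair of loops that only echo the
commands, so they can be reviewed before being piped to sh. The pool name
"poolname" and the exact step boundaries are illustrative, not the values from
our actual cluster:

```shell
# Sketch of the pgp_num ramp: 64-PG steps up to 8192, then 512-PG
# steps approaching the 16384 target. Only prints the commands;
# pipe the output to sh on a real cluster to actually run them.
schedule=$(seq 4096 64 8192; seq 8704 512 16384)
for i in $schedule; do
  echo "ceph osd pool set poolname pgp_num $i"
done
```

Between steps you would wait for the cluster to settle (and keep an eye on
nearfull OSDs) before issuing the next increment.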
As the number of PGs grew, the process became much simpler: the overhead was
smaller, every chunk of work was smaller, and all the management operations
ran a lot more smoothly.
YMMV - our data distribution was poor from the start: hosts had differing
weights due to differing numbers of OSDs, and there were some historical
remnants from when we tried to load-balance the data by hand. We ended up in
a much better state, though not a perfect one - some OSDs still have much more
free space than others.
We haven’t touched the CRUSH map at all during this process; once we do, and
set newer tunables, the data distribution should become much more even.
I’d love to hear others’ input, since we are not sure why this problem is
present at all - I’d expect CRUSH to fill all the OSDs to the same or a
close-enough level, but in reality we have OSDs with weight 1.0 that are
almost empty and others with weight 0.5 that are nearly full… When adding
data, it does seem (subjectively) to be distributed evenly...
Jan
On 02 Jun 2015, at 18:52, Daniel Maraio <dmar...@choopa.com> wrote:
Hello,
I have some questions about the size of my placement groups and how I can get
a more even distribution. We currently have 160 2TB OSDs across 20 chassis,
with 133TB used in our radosgw pool at a replica size of 2. We want to move to
3 replicas but are concerned we may fill up some of our OSDs: some have ~1.1TB
free while others have only ~600GB free. The radosgw pool has 4096 PGs;
looking at the documentation, I probably want to increase this to 8192, but
we have decided to hold off on that for now.
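For what it's worth, the usual rule of thumb from the Ceph docs (roughly 100
PGs per OSD, divided by the replica count, rounded up to a power of two) can
be computed directly. The 100-per-OSD target is the documented guideline, not
something we measured; the other numbers are our cluster's (160 OSDs, size 3):

```shell
# Rule-of-thumb PG count: (OSDs * 100) / replicas, rounded up
# to the next power of two.
osds=160; replicas=3; per_osd=100
raw=$(( osds * per_osd / replicas ))   # ~5333 raw PGs
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do pg_num=$(( pg_num * 2 )); done
echo "suggested pg_num: $pg_num"
```

Which lands on the same 8192 figure the documentation pointed us at.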
So, now for the PG usage. I dumped the PG stats and noticed that there are two
groups of PG sizes in my cluster: about 1024 PGs are each around 17-18GB,
while the rest are all around 34-36GB. Any idea why there are two distinct
groups? We only have the one pool with data in it, though there are several
different buckets in the radosgw pool. The data ranges from small images to
4-6MB audio files. Will increasing the number of PGs on this pool provide a
more even distribution?
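One plausible cause (an assumption on my part, not something I've confirmed
for this cluster): when a pool's pg_num is, or at some point was, not a power
of two, the PGs created by the last partial split each cover half the hash
space of the others, which shows up as exactly two size classes. Bucketing the
sizes from a pg dump makes the split obvious; the sizes below are synthetic
stand-ins for the real bytes column of "ceph pg dump":

```shell
# Synthetic PG sizes in GB, standing in for the bytes column of
# "ceph pg dump". The 25GB threshold sits between the two observed
# clusters (17-18GB vs 34-36GB) and just counts each group.
sizes="17 18 34 35 36 17 18 35"
small=0; large=0
for s in $sizes; do
  if [ "$s" -lt 25 ]; then small=$((small+1)); else large=$((large+1)); fi
done
echo "PGs around 17-18GB: $small, PGs around 34-36GB: $large"
```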
Another thing to note is that the initial cluster was built lopsided, with a
mix of 4TB and 2TB OSDs. We have since removed all the 4TB disks and are using
only 2TB disks across the entire cluster. I'm not sure whether this would have
had any impact.
Thank you for your time and I would appreciate any insight the community can
offer.
- Daniel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com