> 
>>> forth), so this is why "ceph df" will tell you a pool has X free
>>> space, where X is "smallest free space on the OSDs on which this pool
>>> lies, times the number of OSDs".

To be even more precise, this depends on the failure domain.  With the typical 
"rack" failure domain, say you use 3x replication and have 3 racks, you'll be 
limited to the capacity of the smallest rack.  If you have more racks (failure 
domains) than replicas, though, you are less affected by racks that vary 
somewhat in CRUSH weight.
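
A quick back-of-the-envelope sketch of that constraint (the capacities below 
are invented for illustration):

# Illustrative only: with a "rack" failure domain and 3x replication,
# CRUSH puts one replica in each rack, so every rack has to hold a full
# copy of the data.  The smallest rack therefore caps usable capacity.
rack_capacity_tb = {"rack1": 400, "rack2": 400, "rack3": 300}  # hypothetical
replicas = 3
assert replicas == len(rack_capacity_tb)  # one replica per rack

usable_tb = min(rack_capacity_tb.values())
print(f"~{usable_tb} TB usable (before full ratios), "
      f"out of {sum(rack_capacity_tb.values())} TB raw")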

With respect to OSDs, the above is still true, which is one reason we have the 
balancer module.  Say your OSDs are on average 50% full but you have one that 
is 70% full.  The most-full outlier will limit the reported available space.

The available space for each pool is also a function of the data protection 
strategy -- replication vs. EC -- as well as the prevailing full-ratio setting.
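
Roughly sketched (this is an approximation of the idea, not the mons' actual 
MAX AVAIL code path, and the numbers are made up):

# How the most-full OSD, the full ratio, and the replica count together
# bound the space a pool reports as available.
osds_tb = [  # (size, used) -- note the 70%-full outlier
    (10, 5.0), (10, 5.0), (10, 5.0), (10, 7.0),
]
full_ratio = 0.95   # along the lines of mon_osd_full_ratio
replicas = 3        # for EC, you'd scale by k/(k+m) instead

# The fullest OSD sets the usable fraction for everyone, since CRUSH
# keeps handing it its share of new writes.
worst_free_frac = min((size * full_ratio - used) / size for size, used in osds_tb)
raw_avail = worst_free_frac * sum(size for size, _ in osds_tb)
print(f"~{raw_avail / replicas:.1f} TB reported available")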


>>> Given the pseudorandom placement of
>>> objects to PGs, there is nothing to prevent you from having the worst
>>> luck ever and all the objects you create end up on the OSD with least
>>> free space.
>> 
>> This is why you need a decent amount of PGs, to not run into statistical
>> edge cases.
> 
> Yes, just take the experiment to someone with one PG only, then it can
> only fill one OSD. Someone with a pool with only 2 PGs could at the
> very best case only fill two and so on. If you have 100+ PGs per OSD,
> the chances for many files to end up only on a few PGs becomes very
> small.

Indeed, a healthy number of PG shards per OSD is important for this reason as 
well.  I use the analogy of filling a 55-gallon drum with sportsballs: you can 
fit maybe two beach balls in there with a ton of air space, but you could fit 
thousands of ping-pong balls with a lot less air space.
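
If anyone wants to see the drum-and-balls effect in numbers, here's a toy 
simulation (uniform random placement standing in for hashing and CRUSH, not 
Ceph's real code):

import random

# Objects are spread pseudorandomly over PGs, and PGs pseudorandomly over
# OSDs.  With few, large PGs (beach balls) the fullest OSD overshoots the
# mean badly; with many small PGs (ping-pong balls) it evens out.
def fullest_osd_vs_mean(num_osds, pgs_per_osd, num_objects=100_000, seed=1):
    rng = random.Random(seed)
    pg_num = num_osds * pgs_per_osd
    pg_fill = [0] * pg_num
    for _ in range(num_objects):
        pg_fill[rng.randrange(pg_num)] += 1        # object -> PG
    osd_fill = [0] * num_osds
    for n in pg_fill:
        osd_fill[rng.randrange(num_osds)] += n     # PG -> OSD (toy CRUSH)
    return max(osd_fill) / (num_objects / num_osds)

for pgs_per_osd in (1, 4, 32, 128):
    print(pgs_per_osd, round(fullest_osd_vs_mean(10, pgs_per_osd), 2))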

Having a power-of-two number of PGs per pool also helps with uniform 
distribution -- the explanation of why is a bit abstruse so I'll spare the 
list, but enquiring minds can read chapter 8 ;)
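
For anyone who does want a taste of it: when pg_num isn't a power of two, the 
object-to-PG mapping folds the leftover part of the hash space back onto a 
subset of the PGs, so those PGs carry roughly twice the data of the rest.  A 
toy version (modeled on the stable-mod idea, not the exact production code):

import random

def stable_mod(x, pg_num, mask):
    # mask = (next power of two at or above pg_num) - 1
    return x & mask if (x & mask) < pg_num else x & (mask >> 1)

def pg_spread(pg_num, objects=200_000, seed=1):
    rng = random.Random(seed)
    mask = (1 << (pg_num - 1).bit_length()) - 1
    counts = [0] * pg_num
    for _ in range(objects):
        counts[stable_mod(rng.getrandbits(32), pg_num, mask)] += 1
    return max(counts) / min(counts)

for pg_num in (128, 100):   # power of two vs. not
    print(pg_num, round(pg_spread(pg_num), 2))   # ~1.x vs ~2.x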

> and every client can't have a complete list of millions of objects in
> the cluster, so it does client-side computations.


This is one reason we have PGs -- so that there's a manageable number of things 
to juggle, while not being so few as to run into statistical and other 
imbalances.
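
A very condensed sketch of what a client computes per object (the hashing and 
the PG-to-OSD step below are simplified stand-ins, not librados/CRUSH 
internals):

import hashlib
import random

# The client needs only the object name, pg_num, and the cluster/CRUSH
# map -- never a listing of the millions of objects themselves.
def object_to_pg(name: str, pg_num: int) -> int:
    # stand-in for Ceph's rjenkins hash + stable mod
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    return h % pg_num

def pg_to_osds(pg_id: int, osd_ids: list, replicas: int) -> list:
    # stand-in for CRUSH: deterministic per PG, so every client and OSD
    # arrives at the same acting set without consulting a central index
    return random.Random(pg_id).sample(osd_ids, replicas)

pg = object_to_pg("some-object-name", pg_num=128)
print(pg, pg_to_osds(pg, list(range(12)), replicas=3))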

