[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space
Thank you Igor. Yes, the ~25K waste per rados object seems reasonable. A couple of questions, though:

1. Is the re-use by blobs of empty sub-sections of already allocated "min_alloc_size"-sized blocks something that happens only for RBD/CephFS? I read some blog posts (e.g. https://blog.51cto.com/u_15265005/2888373) about the onode -> extent -> blob -> min_alloc -> pextent -> disk flow, and how a write smaller than min_alloc_size counts as a small write: if another small write comes along later that fits into the empty area of the blob, it will use that area. I expected this re-use behaviour across RADOS in general. Is my understanding of how allocation works wrong, or does it simply not apply to S3 (maybe because RGW objects are immutable)? And do you know of any documentation about the allocation details? I couldn't find much official material about it.

2. We have a Ceph cluster that was upgraded to Pacific, but the OSDs came from a previous Octopus cluster with bluestore_min_alloc_size_hdd = 65536, bluestore_prefer_deferred_size_hdd = 65536, and bitmap as both bluestore_allocator and bluefs_allocator; the OSDs were not re-deployed after the upgrade. We are concerned that re-deploying the HDD OSDs with bluestore_min_alloc_size_hdd = 4096 might cause I/O performance issues, since the number of allocation units, and hence write operations, will increase. Do you know how this might affect the cluster?

Many thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
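[Editor's note: to make the arithmetic in this thread concrete, here is a small Python sketch of the rounding behaviour described above. It is a toy model of the space accounting, not BlueStore's actual allocator; the function and variable names are made up.]

```python
# Toy model of BlueStore space accounting for small RGW objects, as
# discussed in this thread: each object's data is rounded up to a
# multiple of min_alloc_size, and (per the test described in this
# thread) the padding is not re-used by later objects.

MIN_ALLOC_HDD = 64 * 1024   # bluestore_min_alloc_size_hdd = 65536 (pre-Pacific HDD default)

def allocated(size_bytes: int, min_alloc: int = MIN_ALLOC_HDD) -> int:
    """Bytes actually consumed on disk for one on-disk copy of an object."""
    units = -(-size_bytes // min_alloc)   # ceiling division
    return units * min_alloc

# A 1 KiB S3 object occupies a full 64 KiB allocation unit:
assert allocated(1024) == 64 * 1024

# The 2 GiB test OSD from this thread: it fills up after
# 2 GiB / 64 KiB = 32768 one-KiB objects, i.e. only 32 MiB stored.
osd_capacity = 2 * 1024**3
max_objects = osd_capacity // allocated(1024)
print(max_objects, max_objects * 1024 // 1024**2)   # -> 32768 32
```

This reproduces the observation above: the pool reports full once ~32 MiB of 1 KiB objects have been written, because every object pins a whole 64 KiB unit.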
[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space
Hi Motahare,

On 13/11/2023 14:44, Motahare S wrote:
> Hello everyone,
>
> Recently we noticed that the "stored" and "used" space reported by "ceph df" do not match: the amount of stored data * 1.5 (the EC factor) is still about 5 TiB short of the used amount:
>
> POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
>
> The blob and allocation configs are as follows:
>
> bluestore_min_alloc_size_hdd : 65536
> bluestore_min_alloc_size_ssd : 4096
> bluestore_max_blob_size_hdd : 524288
> bluestore_max_blob_size_ssd : 65536
> bluefs_shared_alloc_size : 65536
>
> From sources across the web about how Ceph actually writes to disk, I presumed that it zero-pads the extents of an object to match the 4 KB bdev_block_size, then writes them into a blob that matches min_alloc_size, but can later re-use parts of the blob's unwritten (yet allocated, because of min_alloc_size) space for another extent. The problem, though, is what we saw when testing different configs on a minimal Ceph Octopus cluster with a 2 GB OSD and bluestore_min_alloc_size_hdd = 65536. When we uploaded a 1 KB file with the aws s3 client, the used/stored space was 64 KB/1 KB. We then uploaded another 1 KB file and it went to 128 KB/2 KB; we kept going until 100% of the pool was used but only 32 MB was stored. I expected Ceph to start writing the new 1 KB files into the wasted 63 KB (60 KB) remainders of the min_alloc_size blocks, but the cluster behaved exactly like a full cluster and could no longer accept any new objects. Is this behaviour expected for S3? Does Ceph really use 64x the space if your dataset is made of 1 KB files, and should all your object sizes be a multiple of 64 KB? Note that 5 TB / (70.6M * 1.5) ~ 50, so about 50 KB is wasted per rados object on average.
>
> We didn't observe this problem in RBD pools, probably because RBD cuts all objects into 4 MB chunks.

The above analysis is correct: BlueStore will indeed waste up to 64K for every object whose size is not aligned to 64K (i.e. both a 1K and a 65K object will waste 63K).

Hence n * 1K objects take n * 64K bytes. And since S3 objects are unaligned, they tend to waste 32K bytes per object on average (assuming object sizes are distributed uniformly).

The only correction to the above math is due to the actual m+n EC layout: e.g. for a 2+1 EC profile the object-count multiplier would be 3, not 1.5. Hence the overhead per rados object is rather less than 50K in your case.

> I know that bluestore_min_alloc_size_hdd was changed to 4KB in Pacific, but I'm still curious how allocation really works and why it doesn't behave as expected. Also, re-deploying OSDs is a headache.
>
> Sincerely
> Motahare

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
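[Editor's note: the per-object overhead estimate above can be checked numerically against the "ceph df" figures quoted in this thread. This is a back-of-the-envelope sketch; it assumes a 2+1 EC profile and treats the 70.60M object count as exact.]

```python
# Back-of-the-envelope check of the per-object overhead, using the
# numbers from the "ceph df" output in this thread and assuming a
# 2+1 EC profile (each RADOS object is stored as 3 on-disk shards,
# while logical data is multiplied by 3/2 = 1.5).

TiB = 1024**4

stored = 144 * TiB        # logical data (STORED)
used = 221 * TiB          # raw space consumed (USED)
objects = 70.6e6          # RADOS object count (OBJECTS)

ec_data_factor = 1.5      # raw data per logical byte for 2+1 EC
shards_per_object = 3     # on-disk shards per RADOS object for 2+1 EC

waste = used - stored * ec_data_factor        # ~5 TiB of allocation padding
per_shard = waste / (objects * shards_per_object)
print(f"waste ~ {waste / TiB:.1f} TiB, ~ {per_shard / 1024:.0f} KiB per shard")
# -> waste ~ 5.0 TiB, ~ 25 KiB per shard
```

With the shard count (3) rather than the data factor (1.5) in the denominator, the average padding comes out near 25 KiB per on-disk shard, consistent with "rather less than 50K" above and with the ~25K figure in the follow-up.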