[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space

2023-11-15 Thread motaharesdq
Thank you Igor,

Yeah, the ~25K waste per rados object seems reasonable. A couple of questions 
though:

1. Does the behaviour of blobs re-using empty sub-sections of already allocated 
"min_alloc_size"ed blocks apply only to RBD/CephFS? I read some blogs about the 
onode -> extent -> blob -> min_alloc -> pextent -> disk flow: a write smaller 
than min_alloc_size counts as a small write, and if another small write arrives 
later that fits into the empty area of the blob, it will reuse that area. I 
expected this re-use behaviour across RADOS in general.
Is my understanding of how allocation works totally wrong, or does it just not 
apply to S3 (maybe because the object hints are immutable)? And do you know of 
any documentation on the allocation details? I couldn't find much official 
material about it.
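
To make what I mean concrete, here is a toy sketch of the two behaviours (purely 
illustrative, not BlueStore's actual allocator code; it also ignores the 4KB 
zero-padding):

MIN_ALLOC = 64 * 1024   # bluestore_min_alloc_size_hdd = 65536
OBJ_SIZE  = 1024        # 1 KiB S3 objects, as in our test

def alloc_with_reuse(n_objects):
    """Hypothetical packing: later small writes fill the empty tail
    of an already-allocated 64K unit (the behaviour I expected)."""
    units, free_in_last = 0, 0
    for _ in range(n_objects):
        if free_in_last < OBJ_SIZE:
            units += 1
            free_in_last = MIN_ALLOC
        free_in_last -= OBJ_SIZE
    return units * MIN_ALLOC

def alloc_without_reuse(n_objects):
    """Each object rounds up to its own min_alloc_size unit
    (what 'ceph df' suggests is happening in our RGW pool)."""
    return n_objects * MIN_ALLOC

n = 1000
print(alloc_with_reuse(n))     # 1,048,576 bytes  (16 units of 64K)
print(alloc_without_reuse(n))  # 65,536,000 bytes (1000 units of 64K)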

2. We have a Ceph cluster that was upgraded to Pacific, but the OSDs come from 
a previous Octopus cluster with bluestore_min_alloc_size_hdd = 64KB, 
bluestore_prefer_deferred_size_hdd = 65536, and both bluestore_allocator and 
bluefs_allocator set to bitmap; the OSDs were not re-deployed after the 
upgrade. We are concerned that re-deploying the HDD OSDs with 
bluestore_min_alloc_size_hdd = 4KB might cause I/O performance issues, since 
the number of blocks, and hence write operations, would increase. Do you know 
how it might affect the cluster?
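
For context, a rough back-of-the-envelope comparison of the two min_alloc_size 
values (just a sketch; it ignores EC, compression, deferred writes and BlueFS 
overhead):

def allocated(obj_size, min_alloc):
    """Space and allocation units consumed by one object, assuming it
    is simply rounded up to whole min_alloc_size units."""
    units = -(-obj_size // min_alloc)   # ceiling division
    return units, units * min_alloc

for size, label in [(1024, "1 KiB S3 object"), (4 * 1024 * 1024, "4 MiB RBD chunk")]:
    for min_alloc in (65536, 4096):
        units, alloc = allocated(size, min_alloc)
        print(f"{label:>16} @ min_alloc {min_alloc:>6}: "
              f"{units:>5} unit(s), {alloc:>8} bytes allocated")

# 1 KiB object:  1 unit / 65536 B at 64K  vs  1 unit / 4096 B at 4K
# 4 MiB chunk:  64 units at 64K           vs  1024 units at 4K
#   -> 4K saves a lot of space for small objects, but tracks 16x more
#      allocation units per large write, which is our I/O concern.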

Many thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space

2023-11-15 Thread motaharesdq
Igor Fedotov wrote:
> Hi Motahare,
> 
> On 13/11/2023 14:44, Motahare S wrote:
> >   Hello everyone,
> > 
> >  Recently we have noticed that the stored and used space in the "ceph df"
> >  output do not match: the amount of stored data * 1.5 (EC factor) is still
> >  about 5TB short of the used amount:
> > 
> >  POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> >  default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
> > 
> >  blob and alloc configs are as below:
> >  bluestore_min_alloc_size_hdd : 65536
> >  bluestore_min_alloc_size_ssd  : 4096
> >  bluestore_max_blob_size_hdd : 524288
> > 
> >  bluestore_max_blob_size_ssd : 65536
> > 
> >  bluefs_shared_alloc_size : 65536
> > 
> >  From sources across the web about how Ceph actually writes to disk, I
> >  presumed that it zero-pads the extents of an object to match the 4KB
> >  bdev_block_size and then writes them into a blob that matches the
> >  min_alloc_size, but that it can later re-use parts of the blob's unwritten
> >  (yet allocated, because of min_alloc_size) space for another extent.
> >  The problem, though, was that we tested different configs in a minimal Ceph
> >  Octopus cluster with a 2GB OSD and bluestore_min_alloc_size_hdd = 65536.
> >  When we uploaded a 1KB file with the AWS S3 client, the used/stored space
> >  was 64KB/1KB. We then uploaded another 1KB file and it went to 128KB/2KB;
> >  we kept doing this until 100% of the pool was used but only 32MB was
> >  stored. I expected Ceph to start writing new 1KB files into the wasted
> >  63KB (60KB after 4KB padding) of the min_alloc_size blocks, but the cluster
> >  behaved as completely full and could no longer receive any new object. Is
> >  this behaviour expected for S3? Does Ceph really use 64x the space if your
> >  dataset is made of 1KB files, and should all your object sizes be a
> >  multiple of 64KB? Note that 5TB / (70.6M*1.5) ~ 50KB, so about 50KB is
> >  wasted per rados object on average. We didn't observe this problem in RBD
> >  pools, probably because RBD cuts all objects into 4MB chunks.
> The above analysis is correct: indeed, BlueStore will waste up to 64K for 
> every object that is not aligned to 64K (i.e. both 1K and 65K objects will 
> waste 63K).
> 
> Hence n*1K objects take n*64K bytes.
> 
> And since S3 objects are generally unaligned, this tends to waste 32K bytes 
> per object on average (assuming sizes are uniformly distributed).
> 
> The only correction to the above math would be due to the actual m+n EC 
> layout: e.g. for 2+1 EC the object count multiplier would be 3, not 1.5. 
> Hence the overhead per rados object is rather less than 50K in your case.
> 
> >  I know that the min_alloc_size_hdd default was changed to 4KB in Pacific,
> >  but I'm still curious how allocation really works and why it doesn't
> >  behave as expected. Also, re-deploying OSDs is a headache.
> >
> > Sincerely
> > Motahare
> > ___
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Thank you Igor,

Yeah, the ~25K waste per rados object seems reasonable. A couple of questions 
though:

1. Is the whole flow of blobs re-using allocated space (empty sub-sections of 
already allocated "min_alloc_size"ed blocks) limited to RBD/CephFS? I read some 
blogs about the onode -> extent -> blob -> min_alloc -> pextent flow and how 
small writes re-use that space, and I expected this behaviour across RADOS in 
general, e.g. https://blog.51cto.com/u_15265005/2888373
Is my assumption totally wrong, or does it just not apply to S3 (maybe because 
the objects are immutable)?

2. We have a Ceph cluster that was upgraded to Pacific, but the OSDs come from 
a previous Octopus cluster and were upgraded but not re-deployed. We are 
concerned that re-deploying the OSDs with bluestore_min_alloc_size_hdd = 4KB 
might cause I/O performance issues, since the number of blocks, and hence r/w 
operations, would increase. Do you have any views on how it might affect our 
cluster?

Many thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space

2023-11-13 Thread Igor Fedotov

Hi Motahare,

On 13/11/2023 14:44, Motahare S wrote:

Hello everyone,

Recently we have noticed that the stored and used space in the "ceph df"
output do not match: the amount of stored data * 1.5 (EC factor) is still
about 5TB short of the used amount:

POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB

blob and alloc configs are as below:
bluestore_min_alloc_size_hdd : 65536
bluestore_min_alloc_size_ssd  : 4096
bluestore_max_blob_size_hdd : 524288

bluestore_max_blob_size_ssd : 65536

bluefs_shared_alloc_size : 65536

From sources across the web about how Ceph actually writes to disk, I
presumed that it zero-pads the extents of an object to match the 4KB
bdev_block_size and then writes them into a blob that matches the
min_alloc_size, but that it can later re-use parts of the blob's unwritten
(yet allocated, because of min_alloc_size) space for another extent.
The problem, though, was that we tested different configs in a minimal Ceph
Octopus cluster with a 2GB OSD and bluestore_min_alloc_size_hdd = 65536.
When we uploaded a 1KB file with the AWS S3 client, the used/stored space
was 64KB/1KB. We then uploaded another 1KB file and it went to 128KB/2KB;
we kept doing this until 100% of the pool was used but only 32MB was
stored. I expected Ceph to start writing new 1KB files into the wasted
63KB (60KB after 4KB padding) of the min_alloc_size blocks, but the cluster
behaved as completely full and could no longer receive any new object. Is
this behaviour expected for S3? Does Ceph really use 64x the space if your
dataset is made of 1KB files, and should all your object sizes be a
multiple of 64KB? Note that 5TB / (70.6M*1.5) ~ 50KB, so about 50KB is
wasted per rados object on average. We didn't observe this problem in RBD
pools, probably because RBD cuts all objects into 4MB chunks.
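
A quick sanity check of the numbers from that test (rough arithmetic only,
assuming every 1KB upload ends up in its own rados object and hence its own
64KB allocation unit):

OSD_SIZE  = 2 * 1024**3   # 2 GiB test OSD
MIN_ALLOC = 65536         # bluestore_min_alloc_size_hdd
OBJ_SIZE  = 1024          # 1 KiB uploads

max_objects = OSD_SIZE // MIN_ALLOC       # 32768 objects fill the OSD
stored      = max_objects * OBJ_SIZE      # 32 MiB "stored"
used        = max_objects * MIN_ALLOC     # 2 GiB "used"
print(max_objects, stored / 2**20, used / 2**30)  # 32768, 32.0 MiB, 2.0 GiB
print(used / stored)                              # 64x amplification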


The above analysis is correct: indeed, BlueStore will waste up to 64K for 
every object that is not aligned to 64K (i.e. both 1K and 65K objects will 
waste 63K).


Hence n*1K objects take n*64K bytes.

And since S3 objects are generally unaligned, this tends to waste 32K bytes 
per object on average (assuming sizes are uniformly distributed).


The only correction to the above math would be due to the actual m+n EC 
layout: e.g. for 2+1 EC the object count multiplier would be 3, not 1.5. 
Hence the overhead per rados object is rather less than 50K in your case.
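
Putting rough numbers on that correction (a sketch based on the "ceph df" 
figures above, not exact accounting):

TiB = 1024**4

stored  = 144 * TiB    # STORED from 'ceph df'
used    = 221 * TiB    # USED from 'ceph df'
objects = 70.6e6       # logical rados objects in the pool

overhead = used - stored * 1.5            # ~5 TiB not explained by the EC factor

# If the on-disk object count were objects * 1.5:
print(overhead / (objects * 1.5) / 1024)  # ~50.7 KiB per object

# With 2+1 EC each rados object becomes 3 bluestore onodes (2 data + 1 parity),
# so the per-onode overhead is roughly:
print(overhead / (objects * 3) / 1024)    # ~25.3 KiB per onode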



I know that the min_alloc_size_hdd default was changed to 4KB in Pacific, but
I'm still curious how allocation really works and why it doesn't behave as
expected. Also, re-deploying OSDs is a headache.




Sincerely
Motahare
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
