Hi Ilya,
Thank you for your illuminating response!
I thought I had checked `ceph df` during my experiments before, but
apparently not carefully enough. :)
On 25/10/2024 18:43, Ilya Dryomov wrote:
> "rbd du" can be very imprecise even with --exact flag: one can
> construct an image that would use less than 1% of its provisioned space
> but "rbd du --exact" would report 100% used. This is because "rbd du"
> works only at the object level, meaning that as long as even a small
> part of an object is there, the entire object is reported as used (for
> the most part, with one minor exception).
>
> The catch is that an object or some part of it being there doesn't mean
> that it actually consumes space on the OSDs.
Right, I now recall reading your remark [1] about `rbd du --exact` not
accounting for "holes" in the objects, and thus reporting numbers that
are too big.
With that in mind, I suppose I can build such an image with a large
discrepancy between actual space usage and usage reported by `rbd du
--exact` by writing data only to the "tail" of each 4M object.
I tried with an image (4G for nicer numbers) in an otherwise empty pool:
# rbd create -p vmpool test --size 4G
# rbd map -p vmpool test
# for i in $(seq 3 4 4096); do dd if=/dev/urandom \
    of=/dev/rbd/vmpool/test bs=1M oseek=${i} count=1; done
`rbd du --exact` reports:
# rbd du --exact -p vmpool test
NAME PROVISIONED USED
test 4 GiB 4 GiB
but according to `ceph df`, only 1 GiB is actually stored in the pool:
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
vmpool 4 32 1.0 GiB 1.03k 3.0 GiB 1.11 89 GiB
So this is an image where `rbd du --exact` reports 100% used, but which
takes up only 25% of provisioned space.
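To convince myself of the arithmetic, here is a quick back-of-the-envelope
model in Python (my own sketch, not anything from librbd; the sizes are
assumptions matching the experiment above):

```python
OBJECT_SIZE = 4 << 20   # 4 MiB RBD object size (default)
IMAGE_SIZE = 4 << 30    # 4 GiB test image
WRITE_SIZE = 1 << 20    # 1 MiB written at the tail of each object

num_objects = IMAGE_SIZE // OBJECT_SIZE  # 1024 objects

# "rbd du --exact" works at the object level: a write at offset 3 MiB
# within an object makes its apparent extent the full 4 MiB, holes
# notwithstanding, so every touched object counts in full.
du_exact = num_objects * OBJECT_SIZE

# "ceph df" STORED reflects only the bytes actually written.
stored = num_objects * WRITE_SIZE

print(du_exact >> 30, "GiB reported by rbd du --exact")  # 4 GiB
print(stored >> 30, "GiB actually stored")               # 1 GiB
```

which reproduces the 4 GiB vs 1 GiB discrepancy from the outputs above.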
>> attached via QEMU's librbd integration, fstrim seems to work much
>> better. I've found an earlier discussion [0] according to which, for
>> fstrim to work properly, the filesystem should be aligned to object size
>> (4M) boundaries. Indeed, in the test setups I've looked at, the
>> filesystem is not aligned to 4M boundaries.
>>
>> Still, I'm wondering if there might be a solution that doesn't require a
>> specific partitioning/filesystem layout. To have a simpler test setup,
>> I'm not looking at VMs and instead into unaligned blkdiscard on a
>> KRBD-backed block device (on the host).
>>
>> On my test cluster (for versions see [5]), I create a 1G test volume,
>> map it with default settings, write random data to it, and then issue
>> blkdiscard with a 1M offset (see [1] for complete commands):
>>
>>> # blkdiscard --offset 1M /dev/rbd/vmpool/test
>>
>> An `rbd du --exact` reports a size of 256M:
>>
>>> # rbd du --exact -p vmpool test
>>> NAME PROVISIONED USED
>>> test 1 GiB 256 MiB
>
> Try the same test, but look at the STORED column of "ceph df" output
> for the pool in question. Note the starting value, after writing 1G
> you should see it increase by 1G and after running that blkdiscard
> command it should decrease by 1023M, despite "rbd du --exact" reporting
> 256M as used.
Right, this is exactly what happens. After the blkdiscard:
# rbd du --exact -p vmpool test
NAME PROVISIONED USED
test 1 GiB 256 MiB
but:
# ceph df
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
[...]
vmpool 4 32 1.0 MiB 262 3.1 MiB 0 90 GiB
Only 1 MiB of data is STORED, though the objects are still there. I see
that `rbd sparsify` cleans up the objects, but it doesn't seem to play
well with a VM that is also accessing the block device (due to the
exclusive-lock feature). It would be nice if these objects could be
cleaned up somehow without having to stop the VM, but I agree that with
respect to the data actually stored on the OSDs, the objects probably
don't matter.
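As a sanity check on those numbers, here is a hypothetical per-object
model of the unaligned blkdiscard (my own sketch, assuming a 1 GiB
image, 4 MiB objects, and a discard from offset 1 MiB to the end of the
device):

```python
OBJECT_SIZE = 4 << 20   # 4 MiB RBD object size
IMAGE_SIZE = 1 << 30    # 1 GiB test image
DISCARD_OFF = 1 << 20   # blkdiscard --offset 1M

num_objects = IMAGE_SIZE // OBJECT_SIZE  # 256 objects

stored = 0
for obj in range(num_objects):
    obj_start = obj * OBJECT_SIZE
    # Only the bytes of this object that lie before the discarded
    # range survive; everything from DISCARD_OFF onward is gone.
    stored += max(0, min(DISCARD_OFF - obj_start, OBJECT_SIZE))

print(stored >> 20, "MiB still stored")  # 1 MiB, matching ceph df
```

Only the first object keeps its leading 1 MiB; the rest are emptied but
(as seen in the OBJECTS column) not deleted.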
>> My expectation is that this could negatively impact non-discard IO
>> performance (write amplification?). But I am unsure, as I ran a few
>> small benchmarks and couldn't really see any difference between the two
>> settings. Thus, my questions:
>>
>> - Should I expect any downside for non-discard IO after setting
>> `alloc_size` to 4M?
>
> There is a major downside even for discard I/O. Bumping alloc_size
> to 4M would make the RBD driver ignore _all_ discard requests that are
> smaller than 4M -- which would amount to nearly all discard requests
> in regular setups.
I tried to reproduce this and noticed that, indeed, with alloc_size=4M
most <4M discard requests are ignored -- with the exception of requests
corresponding exactly to an object tail, e.g.:
# grep '' /sys/class/block/rbd*/device/config_info
10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789
name=admin,key=client.admin,alloc_size=4194304 vmpool test -
# rbd du --exact -p vmpool test
NAME PROVISIONED USED
test 1 GiB 1 GiB
# blkdiscard --offset 1M --length 3M /dev/rbd/vmpool/test
# rbd du --exact -p vmpool test
NAME PROVISIONED USED
test 1 GiB 1021 MiB
I guess this is because the kernel driver doesn't enter the
corresponding `if` block when alloc_size == object_size and the discard
coincides with an object tail [2].
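For my own understanding, here is a simplified Python sketch of that
check as I read it from rbd.c at [2] (a model of the logic, not the
driver itself; the names alloc_size/object_size follow the kernel code):
a per-object discard is dropped when, after rounding the offset up and
the end down to alloc_size, nothing remains -- unless alloc_size equals
the object size and the request runs to the end of the object, in which
case a truncate is still possible.

```python
def discard_survives(obj_off, length, alloc_size, object_size=4 << 20):
    """Model of the per-object discard check in drivers/block/rbd.c.

    obj_off and length describe the discard range within one object.
    """
    is_tail = obj_off + length == object_size
    # Special case: with alloc_size == object_size, an object-tail
    # discard is still honoured (as a truncate).
    if alloc_size == object_size and is_tail:
        return True
    # Otherwise align the range inward to alloc_size boundaries and
    # drop the request if nothing is left.
    off = -(-obj_off // alloc_size) * alloc_size              # round up
    next_off = (obj_off + length) // alloc_size * alloc_size  # round down
    return off < next_off

M = 1 << 20
# 1M..4M within an object: tail case, survives even with alloc_size=4M.
print(discard_survives(1 * M, 3 * M, alloc_size=4 * M))   # True
# 0..1M: smaller than alloc_size and not a tail -> ignored.
print(discard_survives(0, 1 * M, alloc_size=4 * M))       # False
# Same 1M request with the default alloc_size=64k -> honoured.
print(discard_survives(0, 1 * M, alloc_size=64 * 1024))   # True
```

This matches both observations: the 1M+3M blkdiscard above (an object
tail) freed 3 MiB, while ordinary sub-4M discards were dropped.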
>> - If yes: would it be feasible for KRBD to decouple
>> `discard_granularity` and `minimum_io_size`, i.e., expose an option to
>> set only `discard_granularity` to 4M?
>
> I would advise against setting alloc_size option to anything higher
> than the default of 64k.
Makes sense. Thanks for clearing up my confusion!
Best wishes,
Friedrich
[1] https://www.spinics.net/lists/ceph-users/msg67776.html
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?h=v6.11&id=81983758430957d9a5cb3333fe324fd70cf63e7e#n2298
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]