On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <mre...@redhat.com> wrote:
>
> On 09.07.19 15:09, Stefano Garzarella wrote:
> > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mre...@redhat.com> wrote:
> >>>
> >>> On 09.07.19 10:55, Max Reitz wrote:
> >>>> On 09.07.19 05:08, Jason Dillaman wrote:
> >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarz...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>>>>
> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>>>>> to calculate the allocated size for the image.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stefano Garzarella <sgarz...@redhat.com>
> >>>>>>>> ---
> >>>>>>>> v3:
> >>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>>>>     [John, Jason]
> >>>>>>>> v2:
> >>>>>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>>>>>     enabled [Jason]
> >>>>>>>> ---
> >>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 54 insertions(+)
> >>>>>>>
> >>>>>>> Well, the librbd documentation is non-existing as always, but while
> >>>>>>> googling, I at least found that libvirt has exactly the same code.  So I
> >>>>>>> suppose it must be quite correct, then.
> >>>>>>>
> >>>>>>
> >>>>>> While I wrote this code I took a look at the libvirt implementation and
> >>>>>> also at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>>>>> src/tools/rbd/action/DiskUsage.cc
> >>>>>>
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 59757b3120..b6bed683e5 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>>>>      return info.size;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>>>>>>> +                                 void *arg)
> >>>>>>>> +{
> >>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
> >>>>>>>> +
> >>>>>>>> +    if (exists) {
> >>>>>>>> +        (*alloc_size) += len;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>>>>> +{
> >>>>>>>> +    BDRVRBDState *s = bs->opaque;
> >>>>>>>> +    uint64_t flags, features;
> >>>>>>>> +    int64_t alloc_size = 0;
> >>>>>>>> +    int r;
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_flags(s->image, &flags);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_features(s->image, &features);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image has the fast-diff
> >>>>>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>>>>> +     * very slow on a big image.
> >>>>>>>> +     */
> >>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>>>>> +        return -ENOTSUP;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> >>>>>>>> +     * the callback on all allocated regions of the image.
> >>>>>>>> +     */
> >>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> >>>>>>>
> >>>>>>> But I have a question.  This is basically block_status, right?  So it
> >>>>>>> gives us information on which areas are allocated and which are not.
> >>>>>>> The result thus gives us a lower bound on the allocation size, but is it
> >>>>>>> really exactly the allocation size?
> >>>>>>>
> >>>>>>> There are two things I’m concerned about:
> >>>>>>>
> >>>>>>> 1. What about metadata?
> >>>>>>
> >>>>>> Good question, I don't think it includes the size used by metadata and I
> >>>>>> don't know if there is a way to know it. I'll check better.
> >>>>>
> >>>>> It does not include the size of metadata, the "rbd_diff_iterate2"
> >>>>> function is literally just looking for touched data blocks within the
> >>>>> RBD image.
> >>>>>
> >>>>>>>
> >>>>>>> 2. If you have multiple snapshots, this will only report the overall
> >>>>>>> allocation information, right?  So say there is something like this:
> >>>>>>>
> >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>>>>
> >>>>>>> Snapshot 1: AAAA---
> >>>>>>> Snapshot 2: --AAAAA
> >>>>>>> Snapshot 3: -AAAA--
> >>>>>>>
> >>>>>>> I think the allocated data size is the number of As in total (13 MB).
> >>>>>>> But I suppose this API will just return 7 MB, because it looks at
> >>>>>>> everything and it sees the whole image range (7 MB) to be allocated.  It
> >>>>>>> doesn’t report in how many snapshots some region is allocated.
> >>>>>
> >>>>> It should return 13 dirty data blocks (multiplied by the size of the
> >>>>> data block) since when you don't provide a "from snapshot" name, it
> >>>>> will iterate from the first snapshot to the HEAD revision.
> >>>>
> >>>> Have you tested that?
> >>>>
> >>>> I’m so skeptical because the callback function interface has no way of
> >>>> distinguishing between different layers of snapshots.
> >>>>
> >>>> And also because we have the bdrv_block_status_above() function which
> >>>> just looks strikingly similar (with the difference that it does not
> >>>> invoke a callback but just returns the next allocated range).  If you
> >>>> pass base=NULL to it, it will also “interpret that as the beginning of
> >>>> time and return all allocated regions of the image” (or rather image
> >>>> chain, in our case).  But it would just return 7 MB as allocated.  (Even
> >>>> though it does in fact return layer information, i.e. where a given
> >>>> continuous chunk of data is allocated.)
> >>>>
> >>>> Sure, there is no good reason for why our interface should by chance be
> >>>> the same as librbd’s interface.  But without having tested it, the fact
> >>>> that the callback cannot detect which layer a chunk is allocated on just
> >>>> makes me wary.
> >>>
> >>> And the librbd code doesn’t alleviate my concerns.
> >>>
> >>> From what I can see, it first creates a bitmap (two bits per entry) that
> >>> covers the whole image and says which objects are allocated and which
> >>> aren’t.  Through the whole chain, that is.  So in the above example, the
> >>> bitmap would report everything as allocated.  (Or rather “updated” in
> >>> librbd’s terms.)
> >>>
> >>> Then it has this piece:
> >>>
> >>>> uint64_t off = m_offset;
> >>>> uint64_t left = m_length;
> >>>>
> >>>> while (left > 0) {
> >>>>   uint64_t period_off = off - (off % period);
> >>>>   uint64_t read_len = min(period_off + period - off, left);
> >>>>
> >>>>   // map to extents
> >>>>   map<object_t,vector<ObjectExtent> > object_extents;
> >>>>   Striper::file_to_extents(cct, m_image_ctx.format_string,
> >>>>                            &m_image_ctx.layout, off, read_len, 0,
> >>>>                            object_extents, 0);
> >>>>
> >>>>   // get snap info for each object
> >>>>   for (map<object_t,vector<ObjectExtent> >::iterator p =
> >>>>          object_extents.begin();
> >>>>        p != object_extents.end(); ++p)
> >>> [...]
> >>>>     for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> >>>>          q != p->second.end(); ++q) {
> >>>>       r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
> >>> [...]
> >>>>     }
> >>> [...]
> >>>>   }
> >>>>
> >>>>   left -= read_len;
> >>>>   off += read_len;
> >>>> }
> >>>
> >>> So that iterates over the whole image (in our case, because m_offset is
> >>> 0 and m_length is the image length), then picks out a chunk of read_len
> >>> length, converts that to objects, and then iterates over all of those
> >>> objects and all of their extents.
> >>>
> >>> file_to_extents looks like it’s just an arithmetic operation.  So it
> >>> doesn’t look like that function returns extents in multiple snapshots.
> >>> It just converts a range into subranges called “objects” and “extents”
> >>> (at least that’s the way it looks to me).
> >>>
> >>> So overall, this looks awfully like it simply iterates over the whole
> >>> image and then returns allocation information gathered as a top-down
> >>> view through all of the snapshots, but not for each snapshot individually.
> >>
> >> Sorry, you're correct.
> >> The API function was originally designed to
> >> support delta extents for supporting diff exports, so while it does
> >> open each snapshot's object-map, it only returns a unioned set of
> >> delta extents.  The rbd CLI's "disk-usage" action behaves how I
> >> described by counting the actual dirty block usage between snapshots.
> >
> > Max, Jason, thanks for the details!
> >
> > If you agree, I'll try to implement something similar, iterating on all
> > snapshots.
>
> Yes, that should work.
>
> > What about metadata?
> > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the
> > metadata.
> > An idea could be to iterate over them and sum the key-value sizes returned,
> > but I'm not sure they are returned in the same format as they are stored.
>
> I wouldn’t mind ignoring metadata too much.  If you can get something to
> work, even better, but I don’t consider that too important.
I don't think it's a good idea to attempt to estimate it. If there is
enough desire, we can always add a supported "disk-usage" API method that
takes into account things like the image metadata, object-map, journaling,
etc. overhead. However, I wouldn't expect it to be anywhere near the
actual block usage (unless something has gone terribly wrong w/ journaling
and it fails to trim your log).

> Max
>

-- 
Jason