On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote: > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella <sgarz...@redhat.com> > wrote: > > > > This patch adds the support of preallocation (off/full) for the RBD > > block driver. > > If available, we use rbd_writesame() to quickly fill the image when > > full preallocation is required. > > > > Signed-off-by: Stefano Garzarella <sgarz...@redhat.com> > > --- > > block/rbd.c | 149 ++++++++++++++++++++++++++++++++++++++----- > > qapi/block-core.json | 4 +- > > 2 files changed, 136 insertions(+), 17 deletions(-) > > > > diff --git a/block/rbd.c b/block/rbd.c > > index 0c549c9935..29dd1bb040 100644 > > --- a/block/rbd.c > > +++ b/block/rbd.c > > @@ -13,6 +13,7 @@ > > > > #include "qemu/osdep.h" > > > > +#include "qemu/units.h" > > #include <rbd/librbd.h> > > #include "qapi/error.h" > > #include "qemu/error-report.h" > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t > > offs) > > } > > } > > > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset, > > + PreallocMode prealloc, Error **errp) > > +{ > > + uint64_t current_length; > > + char *buf = NULL; > > + int ret; > > + > > + ret = rbd_get_size(image, ¤t_length); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "Failed to get file length"); > > + goto out; > > + } > > + > > + if (current_length > offset && prealloc != PREALLOC_MODE_OFF) { > > + error_setg(errp, "Cannot use preallocation for shrinking files"); > > + ret = -ENOTSUP; > > + goto out; > > + } > > + > > + switch (prealloc) { > > + case PREALLOC_MODE_FULL: { > > + uint64_t current_offset = current_length; > > + int buf_size = 64 * KiB; > > This should probably be 512B or 4KiB (which is the smallest striped > unit size). Not only will this avoid sending unnecessary zeroes to the > backing cluster, writesame silently turns into a standard write if > your buffer isn't properly aligned with the min(object size, stripe > unit size). >
Okay, I'll change it on v2. Should we query about the "stripe_unit" size or we simply use the smallest allowed? > > + ssize_t bytes; > > + > > + buf = g_malloc(buf_size); > > + /* > > + * Some versions of rbd_writesame() discards writes of buffers with > > + * all zeroes. In order to avoid this behaviour, we set the first > > byte > > + * to one. > > + */ > > + buf[0] = 1; > > You could also use "rados_conf_set(cluster, > "rbd_discard_on_zeroed_write_same", "false"). > I tried it, but it is not supported on all versions. (eg. I have Ceph v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is available) Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and "buf[0] = 1" > > + ret = rbd_resize(image, offset); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, "Failed to resize file"); > > + goto out; > > + } > > + > > +#ifdef LIBRBD_SUPPORTS_WRITESAME > > + while (offset - current_offset > buf_size) { > > + /* > > + * rbd_writesame() supports only request where the size of the > > + * operation is multiple of buffer size and it must be less or > > + * equal to INT_MAX. > > + */ > > + bytes = MIN(offset - current_offset, INT_MAX); > > + bytes -= bytes % buf_size; > > Using the default object size of 4MiB, this write size would result in > up to 512 concurrent ops to the backing cluster. Perhaps the size > should be bounded such that only a dozen or so concurrent requests are > issued per write, always rounded next largest object / stripe period > size. librbd and the rbd CLI usually try to bound themselves to the > value in the "rbd_concurrent_management_ops" configuration setting > (currently defaults to 10). > Do you suggest to use "rbd_concurrent_management_ops" to limit concurrent requests or use a new QEMU parameters for the RBD driver? Thanks for your comments, Stefano