On Mon, Apr 29, 2019 at 8:47 AM Stefano Garzarella <sgarz...@redhat.com> wrote: > > On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote: > > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella <sgarz...@redhat.com> > > wrote: > > > > > > This patch adds the support of preallocation (off/full) for the RBD > > > block driver. > > > If available, we use rbd_writesame() to quickly fill the image when > > > full preallocation is required. > > > > > > Signed-off-by: Stefano Garzarella <sgarz...@redhat.com> > > > --- > > > block/rbd.c | 149 ++++++++++++++++++++++++++++++++++++++----- > > > qapi/block-core.json | 4 +- > > > 2 files changed, 136 insertions(+), 17 deletions(-) > > > > > > diff --git a/block/rbd.c b/block/rbd.c > > > index 0c549c9935..29dd1bb040 100644 > > > --- a/block/rbd.c > > > +++ b/block/rbd.c > > > @@ -13,6 +13,7 @@ > > > > > > #include "qemu/osdep.h" > > > > > > +#include "qemu/units.h" > > > #include <rbd/librbd.h> > > > #include "qapi/error.h" > > > #include "qemu/error-report.h" > > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t > > > offs) > > > } > > > } > > > > > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset, > > > + PreallocMode prealloc, Error **errp) > > > +{ > > > + uint64_t current_length; > > > + char *buf = NULL; > > > + int ret; > > > + > > > + ret = rbd_get_size(image, ¤t_length); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "Failed to get file length"); > > > + goto out; > > > + } > > > + > > > + if (current_length > offset && prealloc != PREALLOC_MODE_OFF) { > > > + error_setg(errp, "Cannot use preallocation for shrinking files"); > > > + ret = -ENOTSUP; > > > + goto out; > > > + } > > > + > > > + switch (prealloc) { > > > + case PREALLOC_MODE_FULL: { > > > + uint64_t current_offset = current_length; > > > + int buf_size = 64 * KiB; > > > > This should probably be 512B or 4KiB (which is the smallest striped > > unit size). Not only will this avoid sending unnecessary zeroes to the > > backing cluster, writesame silently turns into a standard write if > > your buffer isn't properly aligned with the min(object size, stripe > > unit size). > > > > Okay, I'll change it on v2. > Should we query about the "stripe_unit" size or we simply use the > smallest allowed?
Technically we don't prevent a user from choosing terrible stripe unit sizes (e.g. 1 byte), so you are probably safe to just use 4KiB. > > > + ssize_t bytes; > > > + > > > + buf = g_malloc(buf_size); > > > + /* > > > + * Some versions of rbd_writesame() discards writes of buffers > > > with > > > + * all zeroes. In order to avoid this behaviour, we set the > > > first byte > > > + * to one. > > > + */ > > > + buf[0] = 1; > > > > You could also use "rados_conf_set(cluster, > > "rbd_discard_on_zeroed_write_same", "false"). > > > > I tried it, but it is not supported on all versions. (eg. I have Ceph > v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is > available) > > Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and > "buf[0] = 1" Probably not worth the effort if it's not supported across all releases. > > > + ret = rbd_resize(image, offset); > > > + if (ret < 0) { > > > + error_setg_errno(errp, -ret, "Failed to resize file"); > > > + goto out; > > > + } > > > + > > > +#ifdef LIBRBD_SUPPORTS_WRITESAME > > > + while (offset - current_offset > buf_size) { > > > + /* > > > + * rbd_writesame() supports only request where the size of > > > the > > > + * operation is multiple of buffer size and it must be less > > > or > > > + * equal to INT_MAX. > > > + */ > > > + bytes = MIN(offset - current_offset, INT_MAX); > > > + bytes -= bytes % buf_size; > > > > Using the default object size of 4MiB, this write size would result in > > up to 512 concurrent ops to the backing cluster. Perhaps the size > > should be bounded such that only a dozen or so concurrent requests are > > issued per write, always rounded next largest object / stripe period > > size. librbd and the rbd CLI usually try to bound themselves to the > > value in the "rbd_concurrent_management_ops" configuration setting > > (currently defaults to 10). > > > > Do you suggest to use "rbd_concurrent_management_ops" to limit > concurrent requests or use a new QEMU parameters for the RBD driver? I think it would be nicer to just query the "rbd_concurrent_management_ops" limit to derive your writesame size since the Ceph cluster admin can globally set that option to match the available parallelism of the cluster. > Thanks for your comments, > Stefano -- Jason