On Mon, Apr 29, 2019 at 8:47 AM Stefano Garzarella <sgarz...@redhat.com> wrote:
>
> On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote:
> > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella <sgarz...@redhat.com> 
> > wrote:
> > >
> > > This patch adds the support of preallocation (off/full) for the RBD
> > > block driver.
> > > If available, we use rbd_writesame() to quickly fill the image when
> > > full preallocation is required.
> > >
> > > Signed-off-by: Stefano Garzarella <sgarz...@redhat.com>
> > > ---
> > >  block/rbd.c          | 149 ++++++++++++++++++++++++++++++++++++++-----
> > >  qapi/block-core.json |   4 +-
> > >  2 files changed, 136 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 0c549c9935..29dd1bb040 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -13,6 +13,7 @@
> > >
> > >  #include "qemu/osdep.h"
> > >
> > > +#include "qemu/units.h"
> > >  #include <rbd/librbd.h>
> > >  #include "qapi/error.h"
> > >  #include "qemu/error-report.h"
> > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t 
> > > offs)
> > >      }
> > >  }
> > >
> > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset,
> > > +                                PreallocMode prealloc, Error **errp)
> > > +{
> > > +    uint64_t current_length;
> > > +    char *buf = NULL;
> > > +    int ret;
> > > +
> > > +    ret = rbd_get_size(image, &current_length);
> > > +    if (ret < 0) {
> > > +        error_setg_errno(errp, -ret, "Failed to get file length");
> > > +        goto out;
> > > +    }
> > > +
> > > +    if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > +        error_setg(errp, "Cannot use preallocation for shrinking files");
> > > +        ret = -ENOTSUP;
> > > +        goto out;
> > > +    }
> > > +
> > > +    switch (prealloc) {
> > > +    case PREALLOC_MODE_FULL: {
> > > +        uint64_t current_offset = current_length;
> > > +        int buf_size = 64 * KiB;
> >
> > This should probably be 512B or 4KiB (which is the smallest striped
> > unit size). Not only will this avoid sending unnecessary zeroes to the
> > backing cluster, writesame silently turns into a standard write if
> > your buffer isn't properly aligned with the min(object size, stripe
> > unit size).
> >
>
> Okay, I'll change it on v2.
> Should we query about the "stripe_unit" size or we simply use the
> smallest allowed?

Technically we don't prevent a user from choosing terrible stripe unit
sizes (e.g. 1 byte), so you are probably safe to just use 4KiB.

> > > +        ssize_t bytes;
> > > +
> > > +        buf = g_malloc(buf_size);
> > > +        /*
> > > +         * Some versions of rbd_writesame() discards writes of buffers 
> > > with
> > > +         * all zeroes. In order to avoid this behaviour, we set the 
> > > first byte
> > > +         * to one.
> > > +         */
> > > +        buf[0] = 1;
> >
> > You could also use "rados_conf_set(cluster,
> > "rbd_discard_on_zeroed_write_same", "false").
> >
>
> I tried it, but it is not supported on all versions. (eg. I have Ceph
> v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is
> available)
>
> Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and
> "buf[0] = 1"

Probably not worth the effort if it's not supported across all releases.

> > > +        ret = rbd_resize(image, offset);
> > > +        if (ret < 0) {
> > > +            error_setg_errno(errp, -ret, "Failed to resize file");
> > > +            goto out;
> > > +        }
> > > +
> > > +#ifdef LIBRBD_SUPPORTS_WRITESAME
> > > +        while (offset - current_offset > buf_size) {
> > > +            /*
> > > +             * rbd_writesame() supports only request where the size of 
> > > the
> > > +             * operation is multiple of buffer size and it must be less 
> > > or
> > > +             * equal to INT_MAX.
> > > +             */
> > > +            bytes = MIN(offset - current_offset, INT_MAX);
> > > +            bytes -= bytes % buf_size;
> >
> > Using the default object size of 4MiB, this write size would result in
> > up to 512 concurrent ops to the backing cluster. Perhaps the size
> > should be bounded such that only a dozen or so concurrent requests are
> > issued per write, always rounded next largest object / stripe period
> > size. librbd and the rbd CLI usually try to bound themselves to the
> > value in the "rbd_concurrent_management_ops" configuration setting
> > (currently defaults to 10).
> >
>
> Do you suggest to use "rbd_concurrent_management_ops" to limit
> concurrent requests or use a new QEMU parameters for the RBD driver?

I think it would be nicer to just query the
"rbd_concurrent_management_ops" limit to derive your writesame size
since the Ceph cluster admin can globally set that option to match the
available parallelism of the cluster.

> Thanks for your comments,
> Stefano


-- 
Jason

Reply via email to