Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-27 Thread Peter Lieven



> Am 26.06.2021 um 17:57 schrieb Ilya Dryomov :
> 
> On Mon, Jun 21, 2021 at 10:49 AM Peter Lieven  wrote:
>> 
>>> Am 18.06.21 um 12:34 schrieb Ilya Dryomov:
>>> On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:
 Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
> On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 37 -
>>  1 file changed, 36 insertions(+), 1 deletion(-)
>> 
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 0d8612a988..ee13f08a74 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -63,7 +63,8 @@ typedef enum {
>>  RBD_AIO_READ,
>>  RBD_AIO_WRITE,
>>  RBD_AIO_DISCARD,
>> -RBD_AIO_FLUSH
>> +RBD_AIO_FLUSH,
>> +RBD_AIO_WRITE_ZEROES
>>  } RBDAIOCmd;
>> 
>>  typedef struct BDRVRBDState {
>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, 
>> QDict *options, int flags,
>>  }
>>  }
>> 
>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
> does not really have a notion of non-efficient explicit zeroing.
 
 This is only true if thick provisioning is supported which is in Octopus 
 onwards, right?
>>> Since Pacific, I think.
>>> 
 So it would only be correct to set this if thick provisioning is supported 
 otherwise we could
 
 fail with ENOTSUP and then qemu emulates the zeroing with plain writes.
>>> I actually had a question about that.  Why are you returning ENOTSUP
>>> in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
>>> because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?
>>> 
>>> My understanding has always been that BDRV_REQ_MAY_UNMAP is just
>>> a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
>>> but should be perfectly acceptable.  It is certainly better than
>>> returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
>>> zeroing.
>> 
>> 
>> I think this was introduced to support different provisioning modes. If 
>> BDRV_REQ_MAY_UNMAP is not set
>> 
>> the caller of bdrv_write_zeroes expects that the driver does thick 
>> provisioning. If the driver cannot handle that (efficiently)
>> 
>> qemu does a plain zero write.
>> 
>> 
>> I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK 
>> flag. The original commit states that it was introduced for qemu-img to 
>> efficiently
>> 
>> zero out the target and avoid the slow fallback. When I last worked on 
>> qemu-img convert I remember that there was a call to zero out the target if 
>> bdrv_has_zero_init
>> 
>> is not 1. It seems hat meanwhile a target_is_zero cmdline switch for 
>> qemu-img convert was added to let the user assert that a preexisting target 
>> is zero.
>> 
>> Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK 
>> for rbd in either of the 2 cases (thick provisioning is support or not)?
> 
> Since no one spoke up I think we should
> 
> a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
>   and as a consequence always unmap if librbd is too old
> 
>   It's not clear what qemu's expectation is but in general Write
>   Zeroes is allowed to unmap.  The only guarantee is that subsequent
>   reads return zeroes, everything else is a hint.  This is how it is
>   specified in the kernel and in the NVMe spec.
> 
>   In particular, block/nvme.c implements it as follows:
> 
>   if (flags & BDRV_REQ_MAY_UNMAP) {
>   cdw12 |= (1 << 25);
>   }
> 
>   This sets the Deallocate bit.  But if it's not set, the device may
>   still deallocate:
> 
>   """
>   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
>   command, and the namespace supports clearing all bytes to 0h in the
>   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
>   from a deallocated logical block and its metadata (excluding
>   protection information), then for each specified logical block, the
>   controller:
>   - should deallocate that logical block;
> 
>   ...
> 
>   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
>   and the namespace supports clearing all bytes to 0h in the values
>   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
>   a deallocated logical block and its metadata (excluding protection
>   information), then, for each specified logical block, the
>   controller:
>   - may deallocate that logical block;
>   """
> 
>   
> https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
> 
> b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
> 
>   Again, it's not clear what qemu expects here, but without it we end
>   up in a ridiculous situation where specifying the "don't allow slow
>   fallback" 

Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-26 Thread Ilya Dryomov
On Mon, Jun 21, 2021 at 10:49 AM Peter Lieven  wrote:
>
> Am 18.06.21 um 12:34 schrieb Ilya Dryomov:
> > On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:
> >> Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
> >>> On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>  Signed-off-by: Peter Lieven 
>  ---
>    block/rbd.c | 37 -
>    1 file changed, 36 insertions(+), 1 deletion(-)
> 
>  diff --git a/block/rbd.c b/block/rbd.c
>  index 0d8612a988..ee13f08a74 100644
>  --- a/block/rbd.c
>  +++ b/block/rbd.c
>  @@ -63,7 +63,8 @@ typedef enum {
>    RBD_AIO_READ,
>    RBD_AIO_WRITE,
>    RBD_AIO_DISCARD,
>  -RBD_AIO_FLUSH
>  +RBD_AIO_FLUSH,
>  +RBD_AIO_WRITE_ZEROES
>    } RBDAIOCmd;
> 
>    typedef struct BDRVRBDState {
>  @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, 
>  QDict *options, int flags,
>    }
>    }
> 
>  +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>  +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >>> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
> >>> does not really have a notion of non-efficient explicit zeroing.
> >>
> >> This is only true if thick provisioning is supported which is in Octopus 
> >> onwards, right?
> > Since Pacific, I think.
> >
> >> So it would only be correct to set this if thick provisioning is supported 
> >> otherwise we could
> >>
> >> fail with ENOTSUP and then qemu emulates the zeroing with plain writes.
> > I actually had a question about that.  Why are you returning ENOTSUP
> > in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
> > because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?
> >
> > My understanding has always been that BDRV_REQ_MAY_UNMAP is just
> > a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
> > but should be perfectly acceptable.  It is certainly better than
> > returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
> > zeroing.
>
>
> I think this was introduced to support different provisioning modes. If 
> BDRV_REQ_MAY_UNMAP is not set
>
> the caller of bdrv_write_zeroes expects that the driver does thick 
> provisioning. If the driver cannot handle that (efficiently)
>
> qemu does a plain zero write.
>
>
> I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK 
> flag. The original commit states that it was introduced for qemu-img to 
> efficiently
>
> zero out the target and avoid the slow fallback. When I last worked on 
> qemu-img convert I remember that there was a call to zero out the target if 
> bdrv_has_zero_init
>
> is not 1. It seems hat meanwhile a target_is_zero cmdline switch for qemu-img 
> convert was added to let the user assert that a preexisting target is zero.
>
> Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK 
> for rbd in either of the 2 cases (thick provisioning is support or not)?

Since no one spoke up I think we should

a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
   and as a consequence always unmap if librbd is too old

   It's not clear what qemu's expectation is but in general Write
   Zeroes is allowed to unmap.  The only guarantee is that subsequent
   reads return zeroes, everything else is a hint.  This is how it is
   specified in the kernel and in the NVMe spec.

   In particular, block/nvme.c implements it as follows:

   if (flags & BDRV_REQ_MAY_UNMAP) {
   cdw12 |= (1 << 25);
   }

   This sets the Deallocate bit.  But if it's not set, the device may
   still deallocate:

   """
   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
   command, and the namespace supports clearing all bytes to 0h in the
   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
   from a deallocated logical block and its metadata (excluding
   protection information), then for each specified logical block, the
   controller:
   - should deallocate that logical block;

   ...

   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
   and the namespace supports clearing all bytes to 0h in the values
   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
   a deallocated logical block and its metadata (excluding protection
   information), then, for each specified logical block, the
   controller:
   - may deallocate that logical block;
   """

   
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf

b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags

   Again, it's not clear what qemu expects here, but without it we end
   up in a ridiculous situation where specifying the "don't allow slow
   fallback" switch immediately fails all efficient zeroing requests on
   a device where Write Zeroes is always efficient:

   $ qemu-io -c 'help write' | grep -- '-[zun]'
  

Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-21 Thread Peter Lieven

Am 18.06.21 um 12:34 schrieb Ilya Dryomov:

On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:

Am 16.06.21 um 14:34 schrieb Ilya Dryomov:

On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:

Signed-off-by: Peter Lieven 
---
  block/rbd.c | 37 -
  1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0d8612a988..ee13f08a74 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
  RBD_AIO_READ,
  RBD_AIO_WRITE,
  RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
  } RBDAIOCmd;

  typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
  }
  }

+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;

I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
does not really have a notion of non-efficient explicit zeroing.


This is only true if thick provisioning is supported which is in Octopus 
onwards, right?

Since Pacific, I think.


So it would only be correct to set this if thick provisioning is supported 
otherwise we could

fail with ENOTSUP and then qemu emulates the zeroing with plain writes.

I actually had a question about that.  Why are you returning ENOTSUP
in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?

My understanding has always been that BDRV_REQ_MAY_UNMAP is just
a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
but should be perfectly acceptable.  It is certainly better than
returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
zeroing.



I think this was introduced to support different provisioning modes. If 
BDRV_REQ_MAY_UNMAP is not set

the caller of bdrv_write_zeroes expects that the driver does thick 
provisioning. If the driver cannot handle that (efficiently)

qemu does a plain zero write.


I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK 
flag. The original commit states that it was introduced for qemu-img to 
efficiently

zero out the target and avoid the slow fallback. When I last worked on qemu-img 
convert I remember that there was a call to zero out the target if 
bdrv_has_zero_init

is not 1. It seems hat meanwhile a target_is_zero cmdline switch for qemu-img 
convert was added to let the user assert that a preexisting target is zero.

Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK 
for rbd in either of the 2 cases (thick provisioning is support or not)?


Thanks

Peter







Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-18 Thread Ilya Dryomov
On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:
>
> Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
> > On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
> >> Signed-off-by: Peter Lieven 
> >> ---
> >>  block/rbd.c | 37 -
> >>  1 file changed, 36 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index 0d8612a988..ee13f08a74 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -63,7 +63,8 @@ typedef enum {
> >>  RBD_AIO_READ,
> >>  RBD_AIO_WRITE,
> >>  RBD_AIO_DISCARD,
> >> -RBD_AIO_FLUSH
> >> +RBD_AIO_FLUSH,
> >> +RBD_AIO_WRITE_ZEROES
> >>  } RBDAIOCmd;
> >>
> >>  typedef struct BDRVRBDState {
> >> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> >> *options, int flags,
> >>  }
> >>  }
> >>
> >> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> >> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> > I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
> > does not really have a notion of non-efficient explicit zeroing.
>
>
> This is only true if thick provisioning is supported which is in Octopus 
> onwards, right?

Since Pacific, I think.

>
> So it would only be correct to set this if thick provisioning is supported 
> otherwise we could
>
> fail with ENOTSUP and then qemu emulates the zeroing with plain writes.

I actually had a question about that.  Why are you returning ENOTSUP
in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?

My understanding has always been that BDRV_REQ_MAY_UNMAP is just
a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
but should be perfectly acceptable.  It is certainly better than
returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
zeroing.

Thanks,

Ilya



Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-18 Thread Peter Lieven
Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
> On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 37 -
>>  1 file changed, 36 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 0d8612a988..ee13f08a74 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -63,7 +63,8 @@ typedef enum {
>>  RBD_AIO_READ,
>>  RBD_AIO_WRITE,
>>  RBD_AIO_DISCARD,
>> -RBD_AIO_FLUSH
>> +RBD_AIO_FLUSH,
>> +RBD_AIO_WRITE_ZEROES
>>  } RBDAIOCmd;
>>
>>  typedef struct BDRVRBDState {
>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
>> *options, int flags,
>>  }
>>  }
>>
>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
> does not really have a notion of non-efficient explicit zeroing.


This is only true if thick provisioning is supported which is in Octopus 
onwards, right?

So it would only be correct to set this if thick provisioning is supported 
otherwise we could

fail with ENOTSUP and then qemu emulates the zeroing with plain writes.


Peter






Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-16 Thread Ilya Dryomov
On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 37 -
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0d8612a988..ee13f08a74 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -63,7 +63,8 @@ typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
>  RBD_AIO_DISCARD,
> -RBD_AIO_FLUSH
> +RBD_AIO_FLUSH,
> +RBD_AIO_WRITE_ZEROES
>  } RBDAIOCmd;
>
>  typedef struct BDRVRBDState {
> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;

I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
does not really have a notion of non-efficient explicit zeroing.

Thanks,

Ilya



[PATCH V3 5/6] block/rbd: add write zeroes support

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0d8612a988..ee13f08a74 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+#endif
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -818,6 +823,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState 
*bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -888,6 +905,21 @@ static int coroutine_fn 
qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+#ifndef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+return -ENOTSUP;
+}
+#endif
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1113,6 +1145,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.17.1