Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-21 Thread Klaus Jensen
On Sep 21 13:44, Damien Le Moal wrote:
> On 9/20/22 17:51, Klaus Jensen wrote:
> > On Sep 10 13:27, Sam Li wrote:
> > > Add a new zoned_host_device BlockDriver. The zoned_host_device option
> > > accepts only zoned host block devices. By adding zone management
> > > operations in this new BlockDriver, users can use the new block
> > > layer APIs including Report Zone and four zone management operations
> > > (open, close, finish, reset).
> > > 
> > > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > > zone_finish(zf).
> > > 
> > > For example, to test zone_report, use following command:
> > > $ ./build/qemu-io --image-opts -n driver=zoned_host_device, 
> > > filename=/dev/nullb0
> > > -c "zrp offset nr_zones"
> > > 
> > > Signed-off-by: Sam Li 
> > > Reviewed-by: Hannes Reinecke 
> > > ---
> > >   block/block-backend.c | 145 ++
> > >   block/file-posix.c| 323 +-
> > >   block/io.c|  41 
> > >   include/block/block-io.h  |   7 +
> > >   include/block/block_int-common.h  |  21 ++
> > >   include/block/raw-aio.h   |   6 +-
> > >   include/sysemu/block-backend-io.h |  17 ++
> > >   meson.build   |   1 +
> > >   qapi/block-core.json  |   8 +-
> > >   qemu-io-cmds.c| 143 +
> > >   10 files changed, 708 insertions(+), 4 deletions(-)
> > > 
> > > +/*
> > > + * zone management operations - Execute an operation on a zone
> > > + */
> > > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, 
> > > BlockZoneOp op,
> > > +int64_t offset, int64_t len) {
> > > +#if defined(CONFIG_BLKZONED)
> > > +BDRVRawState *s = bs->opaque;
> > > +RawPosixAIOData acb;
> > > +int64_t zone_sector, zone_sector_mask;
> > > +const char *zone_op_name;
> > > +unsigned long zone_op;
> > > +bool is_all = false;
> > > +
> > > +zone_sector = bs->bl.zone_sectors;
> > > +zone_sector_mask = zone_sector - 1;
> > > +if (offset & zone_sector_mask) {
> > > +error_report("sector offset %" PRId64 " is not aligned to zone 
> > > size "
> > > + "%" PRId64 "", offset, zone_sector);
> > > +return -EINVAL;
> > > +}
> > > +
> > > +if (len & zone_sector_mask) {
> > > +error_report("number of sectors %" PRId64 " is not aligned to 
> > > zone size"
> > > +  " %" PRId64 "", len, zone_sector);
> > > +return -EINVAL;
> > > +}
> > 
> > These checks impose a power-of-two constraint on the zone size. Can they
> > be changed to divisions to lift that constraint? I don't see anything in
> > this patch set that relies on power of two zone sizes.
> 
> Given that Linux will only expose zoned devices that have a zone size that
> is a power of 2 number of LBAs, this will work as is and avoid divisions in
> the IO path. But given that zone management operations are not performance
> critical, generalizing the code should be fine.
> 

Aight. That's fine, we can relax the constraint later. But I don't see why.

> However, once we start adding the code for full zone emulation on top of a
> regular file or qcow image, sector-to-zone conversions requiring divisions
> will hurt. So I really would prefer the code be left as-is for now.
> 

Block driver based zone emulation would not hit the above code path
anyway (there would be no iotcl to call), so I don't think that is an
argument for keeping it as-is.

On the sector-to-zone convertions. Yes, they may potentially hurt, but
it would be emulation after all. Why would we care about optimizing
those conversions at the expense of not being able to emulate all valid
geometries? The performance option (which this series enables) is to use
an underlying zoned device through virtio (or, potentially, hw/nvme).
What would be the usecase for deploying a performance sensitive emulated
zoned device?


signature.asc
Description: PGP signature


Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-20 Thread Damien Le Moal

On 9/20/22 17:51, Klaus Jensen wrote:

On Sep 10 13:27, Sam Li wrote:

Add a new zoned_host_device BlockDriver. The zoned_host_device option
accepts only zoned host block devices. By adding zone management
operations in this new BlockDriver, users can use the new block
layer APIs including Report Zone and four zone management operations
(open, close, finish, reset).

Qemu-io uses the new APIs to perform zoned storage commands of the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use following command:
$ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
---
  block/block-backend.c | 145 ++
  block/file-posix.c| 323 +-
  block/io.c|  41 
  include/block/block-io.h  |   7 +
  include/block/block_int-common.h  |  21 ++
  include/block/raw-aio.h   |   6 +-
  include/sysemu/block-backend-io.h |  17 ++
  meson.build   |   1 +
  qapi/block-core.json  |   8 +-
  qemu-io-cmds.c| 143 +
  10 files changed, 708 insertions(+), 4 deletions(-)

+/*
+ * zone management operations - Execute an operation on a zone
+ */
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+int64_t offset, int64_t len) {
+#if defined(CONFIG_BLKZONED)
+BDRVRawState *s = bs->opaque;
+RawPosixAIOData acb;
+int64_t zone_sector, zone_sector_mask;
+const char *zone_op_name;
+unsigned long zone_op;
+bool is_all = false;
+
+zone_sector = bs->bl.zone_sectors;
+zone_sector_mask = zone_sector - 1;
+if (offset & zone_sector_mask) {
+error_report("sector offset %" PRId64 " is not aligned to zone size "
+ "%" PRId64 "", offset, zone_sector);
+return -EINVAL;
+}
+
+if (len & zone_sector_mask) {
+error_report("number of sectors %" PRId64 " is not aligned to zone 
size"
+  " %" PRId64 "", len, zone_sector);
+return -EINVAL;
+}


These checks impose a power-of-two constraint on the zone size. Can they
be changed to divisions to lift that constraint? I don't see anything in
this patch set that relies on power of two zone sizes.


Given that Linux will only expose zoned devices that have a zone size 
that is a power of 2 number of LBAs, this will work as is and avoid 
divisions in the IO path. But given that zone management operations are 
not performance critical, generalizing the code should be fine.


However, once we start adding the code for full zone emulation on top of 
a regular file or qcow image, sector-to-zone conversions requiring 
divisions will hurt. So I really would prefer the code be left as-is for 
now.



--
Damien Le Moal
Western Digital Research




Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-20 Thread Sam Li
Klaus Jensen  于2022年9月20日周二 16:51写道:
>
> On Sep 10 13:27, Sam Li wrote:
> > Add a new zoned_host_device BlockDriver. The zoned_host_device option
> > accepts only zoned host block devices. By adding zone management
> > operations in this new BlockDriver, users can use the new block
> > layer APIs including Report Zone and four zone management operations
> > (open, close, finish, reset).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use following command:
> > $ ./build/qemu-io --image-opts -n driver=zoned_host_device, 
> > filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Hannes Reinecke 
> > ---
> >  block/block-backend.c | 145 ++
> >  block/file-posix.c| 323 +-
> >  block/io.c|  41 
> >  include/block/block-io.h  |   7 +
> >  include/block/block_int-common.h  |  21 ++
> >  include/block/raw-aio.h   |   6 +-
> >  include/sysemu/block-backend-io.h |  17 ++
> >  meson.build   |   1 +
> >  qapi/block-core.json  |   8 +-
> >  qemu-io-cmds.c| 143 +
> >  10 files changed, 708 insertions(+), 4 deletions(-)
> >
> > +/*
> > + * zone management operations - Execute an operation on a zone
> > + */
> > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
> > op,
> > +int64_t offset, int64_t len) {
> > +#if defined(CONFIG_BLKZONED)
> > +BDRVRawState *s = bs->opaque;
> > +RawPosixAIOData acb;
> > +int64_t zone_sector, zone_sector_mask;
> > +const char *zone_op_name;
> > +unsigned long zone_op;
> > +bool is_all = false;
> > +
> > +zone_sector = bs->bl.zone_sectors;
> > +zone_sector_mask = zone_sector - 1;
> > +if (offset & zone_sector_mask) {
> > +error_report("sector offset %" PRId64 " is not aligned to zone 
> > size "
> > + "%" PRId64 "", offset, zone_sector);
> > +return -EINVAL;
> > +}
> > +
> > +if (len & zone_sector_mask) {
> > +error_report("number of sectors %" PRId64 " is not aligned to zone 
> > size"
> > +  " %" PRId64 "", len, zone_sector);
> > +return -EINVAL;
> > +}
>
> These checks impose a power-of-two constraint on the zone size. Can they
> be changed to divisions to lift that constraint? I don't see anything in
> this patch set that relies on power of two zone sizes.

Yes, it can be changed when the kernel block layer has non-power-of-2
zone sizes support. For now, Dmitry's work on zoned device support on
the kernel side put the constraint on zone sectors to be po2. The
checks here follow this constraint.

Dmitry's patches can be found here:
https://lore.kernel.org/linux-block/20220919022921.946344-1-dmitry.fomic...@wdc.com/T/#m9b0b3c8220de1307e53235d1a73dab1e3a10f62b


Sam



Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-20 Thread Klaus Jensen
On Sep 10 13:27, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, 
> filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Hannes Reinecke 
> ---
>  block/block-backend.c | 145 ++
>  block/file-posix.c| 323 +-
>  block/io.c|  41 
>  include/block/block-io.h  |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h   |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build   |   1 +
>  qapi/block-core.json  |   8 +-
>  qemu-io-cmds.c| 143 +
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
> op,
> +int64_t offset, int64_t len) {
> +#if defined(CONFIG_BLKZONED)
> +BDRVRawState *s = bs->opaque;
> +RawPosixAIOData acb;
> +int64_t zone_sector, zone_sector_mask;
> +const char *zone_op_name;
> +unsigned long zone_op;
> +bool is_all = false;
> +
> +zone_sector = bs->bl.zone_sectors;
> +zone_sector_mask = zone_sector - 1;
> +if (offset & zone_sector_mask) {
> +error_report("sector offset %" PRId64 " is not aligned to zone size "
> + "%" PRId64 "", offset, zone_sector);
> +return -EINVAL;
> +}
> +
> +if (len & zone_sector_mask) {
> +error_report("number of sectors %" PRId64 " is not aligned to zone 
> size"
> +  " %" PRId64 "", len, zone_sector);
> +return -EINVAL;
> +}

These checks impose a power-of-two constraint on the zone size. Can they
be changed to divisions to lift that constraint? I don't see anything in
this patch set that relies on power of two zone sizes.


signature.asc
Description: PGP signature


Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-17 Thread Stefan Hajnoczi
On Sat, Sep 10, 2022 at 01:27:55PM +0800, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, 
> filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Hannes Reinecke 
> ---
>  block/block-backend.c | 145 ++
>  block/file-posix.c| 323 +-
>  block/io.c|  41 
>  include/block/block-io.h  |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h   |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build   |   1 +
>  qapi/block-core.json  |   8 +-
>  qemu-io-cmds.c| 143 +
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d4a5df2ac2..ebe8d7bdf3 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
>  void *iobuf;
>  int ret;
>  BdrvRequestFlags flags;
> +union {
> +struct {
> +unsigned int *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +struct {
> +BlockZoneOp op;
> +} zone_mgmt;
> +};
>  } BlkRwCo;
>  
>  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> @@ -1775,6 +1784,142 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>  return ret;
>  }
>  
> +static void blk_aio_zone_report_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = >rwco;
> +
> +rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> +   rwco->zone_report.nr_zones,
> +   rwco->zone_report.zones);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor  *zones,
> +BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.offset = offset,
> +.ret= NOT_DONE,
> +.zone_report = {
> +.zones = zones,
> +.nr_zones = nr_zones,

Indentation is off here. QEMU uses 4-space indentation.

> +},
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return >common;
> +}
> +
> +static void blk_aio_zone_mgmt_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = >rwco;
> +
> +rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> + rwco->offset, acb->bytes);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +  int64_t offset, int64_t len,
> +  BlockCompletionFunc *cb, void *opaque) {
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.offset = offset,
> +.ret= NOT_DONE,
> +.zone_mgmt = {
> +.op = op,

Indentation is off here. QEMU uses 4-space indentation.

> +},
> +};
> +acb->bytes = len;
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return >common;
> +}
> +
> +/*
> + * Send a zone_report command.
> + * 

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-11 Thread Sam Li
Damien Le Moal  于2022年9月11日周日 14:48写道:
>
> On 2022/09/11 15:33, Sam Li wrote:
> > Damien Le Moal  于2022年9月11日周日 13:31写道:
> [...]
> >>> +/*
> >>> + * zone management operations - Execute an operation on a zone
> >>> + */
> >>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, 
> >>> BlockZoneOp op,
> >>> +int64_t offset, int64_t len) {
> >>> +#if defined(CONFIG_BLKZONED)
> >>> +BDRVRawState *s = bs->opaque;
> >>> +RawPosixAIOData acb;
> >>> +int64_t zone_sector, zone_sector_mask;
> >>> +const char *zone_op_name;
> >>> +unsigned long zone_op;
> >>> +bool is_all = false;
> >>> +
> >>> +zone_sector = bs->bl.zone_sectors;
> >>> +zone_sector_mask = zone_sector - 1;
> >>> +if (offset & zone_sector_mask) {
> >>> +error_report("sector offset %" PRId64 " is not aligned to zone 
> >>> size "
> >>> + "%" PRId64 "", offset, zone_sector);
> >>> +return -EINVAL;
> >>> +}
> >>> +
> >>> +if (len & zone_sector_mask) {
> >>
> >> Linux allows SMR drives to have a smaller last zone. So this needs to be
> >> accounted for here. Otherwise, a zone operation that includes the last 
> >> smaller
> >> zone would always fail. Something like this would work:
> >>
> >> if (((offset + len) < capacity &&
> >> len & zone_sector_mask) ||
> >> offset + len > capacity) {
> >>
> >
> > I see. I think the offset can be removed, like:
> > if (((len < capacity && len & zone_sector_mask) || len > capacity) {
> > Then if we use the previous zone's len for the last smaller zone, it
> > will be greater than its capacity.
>
> Nope, you cannot remove the offset since the zone operation may be for that 
> last
> zone only, that is, offset == last zone start and len == last zone smaller 
> size.
> In that case, len is alwats smaller than capacity.

Ok, I was mixing opening one zone with opening several zones.

>
> >
> > I will also include "opening the last zone" as a test case later.
>
> Note that you can create such smaller last zone on the host with null_blk by
> specifying a device capacity that is *not* a multiple of the zone size.
>
> >
> >>> +error_report("number of sectors %" PRId64 " is not aligned to 
> >>> zone size"
> >>> +  " %" PRId64 "", len, zone_sector);
> >>> +return -EINVAL;
> >>> +}
> >>> +
> >>> +switch (op) {
> >>> +case BLK_ZO_OPEN:
> >>> +zone_op_name = "BLKOPENZONE";
> >>> +zone_op = BLKOPENZONE;
> >>> +break;
> >>> +case BLK_ZO_CLOSE:
> >>> +zone_op_name = "BLKCLOSEZONE";
> >>> +zone_op = BLKCLOSEZONE;
> >>> +break;
> >>> +case BLK_ZO_FINISH:
> >>> +zone_op_name = "BLKFINISHZONE";
> >>> +zone_op = BLKFINISHZONE;
> >>> +break;
> >>> +case BLK_ZO_RESET:
> >>> +zone_op_name = "BLKRESETZONE";
> >>> +zone_op = BLKRESETZONE;
> >>> +break;
> >>> +default:
> >>> +g_assert_not_reached();
> >>> +}
> >>> +
> >>> +acb = (RawPosixAIOData) {
> >>> +.bs = bs,
> >>> +.aio_fildes = s->fd,
> >>> +.aio_type   = QEMU_AIO_ZONE_MGMT,
> >>> +.aio_offset = offset,
> >>> +.aio_nbytes = len,
> >>> +.zone_mgmt  = {
> >>> +.zone_op = zone_op,
> >>> +.zone_op_name = zone_op_name,
> >>> +.all = is_all,
> >>> +},
> >>> +};
> >>> +
> >>> +return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, );
> >>> +#else
> >>> +return -ENOTSUP;
> >>> +#endif
> >>> +}
>
> --
> Damien Le Moal
> Western Digital Research
>



Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-11 Thread Damien Le Moal
On 2022/09/10 14:27, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, 
> filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Hannes Reinecke 
> ---
>  block/block-backend.c | 145 ++
>  block/file-posix.c| 323 +-
>  block/io.c|  41 
>  include/block/block-io.h  |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h   |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build   |   1 +
>  qapi/block-core.json  |   8 +-
>  qemu-io-cmds.c| 143 +
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d4a5df2ac2..ebe8d7bdf3 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
>  void *iobuf;
>  int ret;
>  BdrvRequestFlags flags;
> +union {
> +struct {
> +unsigned int *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +struct {
> +BlockZoneOp op;
> +} zone_mgmt;
> +};
>  } BlkRwCo;
>  
>  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> @@ -1775,6 +1784,142 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>  return ret;
>  }
>  
> +static void blk_aio_zone_report_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = >rwco;
> +
> +rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> +   rwco->zone_report.nr_zones,
> +   rwco->zone_report.zones);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor  *zones,
> +BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.offset = offset,
> +.ret= NOT_DONE,
> +.zone_report = {
> +.zones = zones,
> +.nr_zones = nr_zones,
> +},
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return >common;
> +}
> +
> +static void blk_aio_zone_mgmt_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = >rwco;
> +
> +rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> + rwco->offset, acb->bytes);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +  int64_t offset, int64_t len,
> +  BlockCompletionFunc *cb, void *opaque) {
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.offset = offset,
> +.ret= NOT_DONE,
> +.zone_mgmt = {
> +.op = op,
> +},
> +};
> +acb->bytes = len;
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return >common;
> +}
> +
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT 

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-11 Thread Damien Le Moal
On 2022/09/11 15:33, Sam Li wrote:
> Damien Le Moal  于2022年9月11日周日 13:31写道:
[...]
>>> +/*
>>> + * zone management operations - Execute an operation on a zone
>>> + */
>>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
>>> op,
>>> +int64_t offset, int64_t len) {
>>> +#if defined(CONFIG_BLKZONED)
>>> +BDRVRawState *s = bs->opaque;
>>> +RawPosixAIOData acb;
>>> +int64_t zone_sector, zone_sector_mask;
>>> +const char *zone_op_name;
>>> +unsigned long zone_op;
>>> +bool is_all = false;
>>> +
>>> +zone_sector = bs->bl.zone_sectors;
>>> +zone_sector_mask = zone_sector - 1;
>>> +if (offset & zone_sector_mask) {
>>> +error_report("sector offset %" PRId64 " is not aligned to zone 
>>> size "
>>> + "%" PRId64 "", offset, zone_sector);
>>> +return -EINVAL;
>>> +}
>>> +
>>> +if (len & zone_sector_mask) {
>>
>> Linux allows SMR drives to have a smaller last zone. So this needs to be
>> accounted for here. Otherwise, a zone operation that includes the last 
>> smaller
>> zone would always fail. Something like this would work:
>>
>> if (((offset + len) < capacity &&
>> len & zone_sector_mask) ||
>> offset + len > capacity) {
>>
> 
> I see. I think the offset can be removed, like:
> if (((len < capacity && len & zone_sector_mask) || len > capacity) {
> Then if we use the previous zone's len for the last smaller zone, it
> will be greater than its capacity.

Nope, you cannot remove the offset since the zone operation may be for that last
zone only, that is, offset == last zone start and len == last zone smaller size.
In that case, len is alwats smaller than capacity.

> 
> I will also include "opening the last zone" as a test case later.

Note that you can create such smaller last zone on the host with null_blk by
specifying a device capacity that is *not* a multiple of the zone size.

> 
>>> +error_report("number of sectors %" PRId64 " is not aligned to zone 
>>> size"
>>> +  " %" PRId64 "", len, zone_sector);
>>> +return -EINVAL;
>>> +}
>>> +
>>> +switch (op) {
>>> +case BLK_ZO_OPEN:
>>> +zone_op_name = "BLKOPENZONE";
>>> +zone_op = BLKOPENZONE;
>>> +break;
>>> +case BLK_ZO_CLOSE:
>>> +zone_op_name = "BLKCLOSEZONE";
>>> +zone_op = BLKCLOSEZONE;
>>> +break;
>>> +case BLK_ZO_FINISH:
>>> +zone_op_name = "BLKFINISHZONE";
>>> +zone_op = BLKFINISHZONE;
>>> +break;
>>> +case BLK_ZO_RESET:
>>> +zone_op_name = "BLKRESETZONE";
>>> +zone_op = BLKRESETZONE;
>>> +break;
>>> +default:
>>> +g_assert_not_reached();
>>> +}
>>> +
>>> +acb = (RawPosixAIOData) {
>>> +.bs = bs,
>>> +.aio_fildes = s->fd,
>>> +.aio_type   = QEMU_AIO_ZONE_MGMT,
>>> +.aio_offset = offset,
>>> +.aio_nbytes = len,
>>> +.zone_mgmt  = {
>>> +.zone_op = zone_op,
>>> +.zone_op_name = zone_op_name,
>>> +.all = is_all,
>>> +},
>>> +};
>>> +
>>> +return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, );
>>> +#else
>>> +return -ENOTSUP;
>>> +#endif
>>> +}

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-11 Thread Sam Li
Damien Le Moal  于2022年9月11日周日 13:31写道:
>
> On 2022/09/10 14:27, Sam Li wrote:
> [...]
> > +/*
> > + * Send a zone_report command.
> > + * offset is a byte offset from the start of the device. No alignment
> > + * required for offset.
> > + * nr_zones represents IN maximum and OUT actual.
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +unsigned int *nr_zones,
> > +BlockZoneDescriptor *zones)
> > +{
> > +int ret;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk); /* increase before waiting */
> > +blk_wait_while_drained(blk);
> > +if (!blk_is_available(blk)) {
> > +blk_dec_in_flight(blk);
> > +return -ENOMEDIUM;
> > +}
> > +ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> > +/*
> > + * Send a zone_management command.
> > + * op is the zone operation;
> > + * offset is the byte offset from the start of the zoned device;
> > + * len is the maximum number of bytes the command should operate on. It
> > + * should be aligned with the zone sector size.
>
> This should read:
>
> * offset is the byte offset of the start of the first zone to operate on;
> * len is the maximum number of bytes the command should operate on. It
> * should be aligned with the device zone size.
>
> No ?

Right. The zone sector size here is meant for the zone size whose unit
is a 512-byte sector.

>
> > + */
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +int64_t offset, int64_t len)
> > +{
> > +int ret;
> > +IO_CODE();
> > +
> > +
> > +blk_inc_in_flight(blk);
> > +blk_wait_while_drained(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> >  void blk_drain(BlockBackend *blk)
> >  {
> >  BlockDriverState *bs = blk_bs(blk);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 0a8b4b426e..4edfa25d04 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -67,6 +67,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#if defined(CONFIG_BLKZONED)
> > +#include 
> > +#endif
> >  #include 
> >  #include 
> >  #include 
> > @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
> >  PreallocMode prealloc;
> >  Error **errp;
> >  } truncate;
> > +struct {
> > +unsigned int *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +struct {
> > +unsigned long zone_op;
> > +const char *zone_op_name;
> > +bool all;
> > +} zone_mgmt;
> >  };
> >  } RawPosixAIOData;
> >
> > @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> > Error **errp)
> >  #endif
> >
> >  if (bs->sg || S_ISBLK(st.st_mode)) {
> > -int ret = hdev_get_max_hw_transfer(s->fd, );
> > +ret = hdev_get_max_hw_transfer(s->fd, );
> >
> >  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> >  bs->bl.max_hw_transfer = ret;
> > @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> > Error **errp)
> >  zoned = BLK_Z_NONE;
> >  }
> >  bs->bl.zoned = zoned;
> > +if (zoned != BLK_Z_NONE) {
> > +ret = get_sysfs_long_val(, "chunk_sectors");
> > +if (ret > 0) {
> > +bs->bl.zone_sectors = ret;
> > +}
>
> It may be good to check that we are getting a valid zone size here. So may be
> change the check to something like this ?
>
> if (ret <= 0) {
> *** print some error message mentioning the invalid zone size ***
> bs->bl.zoned = BLK_Z_NONE;
> return;
> }
> bs->bl.zone_sectors = ret;
>

Ok, thanks!

> > +
> > +ret = get_sysfs_long_val(, "zone_append_max_bytes");
> > +if (ret > 0) {
> > +bs->bl.max_append_sectors = ret / 512;
> > +}
> > +
> > +ret = get_sysfs_long_val(, "max_open_zones");
> > +if (ret >= 0) {
> > +bs->bl.max_open_zones = ret;
> > +}
> > +
> > +ret = get_sysfs_long_val(, "max_active_zones");
> > +if (ret >= 0) {
> > +bs->bl.max_active_zones = ret;
> > +}
> > +}
> >  }
> >
> >  static int check_for_dasd(int fd)
> > @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t 
> > *in_off, int out_fd,
> >  }
> >  #endif
> >
> > +/*
> > + * parse_zone - Fill a zone descriptor
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> > +  const struct blk_zone *blkz) {
> > +zone->start = blkz->start;
> > +

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-10 Thread Damien Le Moal
On 2022/09/10 14:27, Sam Li wrote:
[...]
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor *zones)
> +{
> +int ret;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk); /* increase before waiting */
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the zone sector size.

This should read:

* offset is the byte offset of the start of the first zone to operate on;
* len is the maximum number of bytes the command should operate on. It
* should be aligned with the device zone size.

No ?

> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +int64_t offset, int64_t len)
> +{
> +int ret;
> +IO_CODE();
> +
> +
> +blk_inc_in_flight(blk);
> +blk_wait_while_drained(blk);
> +
> +ret = blk_check_byte_request(blk, offset, len);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>  BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0a8b4b426e..4edfa25d04 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include 
>  #include 
>  #include 
> +#if defined(CONFIG_BLKZONED)
> +#include 
> +#endif
>  #include 
>  #include 
>  #include 
> @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
>  PreallocMode prealloc;
>  Error **errp;
>  } truncate;
> +struct {
> +unsigned int *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +struct {
> +unsigned long zone_op;
> +const char *zone_op_name;
> +bool all;
> +} zone_mgmt;
>  };
>  } RawPosixAIOData;
>  
> @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  #endif
>  
>  if (bs->sg || S_ISBLK(st.st_mode)) {
> -int ret = hdev_get_max_hw_transfer(s->fd, );
> +ret = hdev_get_max_hw_transfer(s->fd, );
>  
>  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
>  bs->bl.max_hw_transfer = ret;
> @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  zoned = BLK_Z_NONE;
>  }
>  bs->bl.zoned = zoned;
> +if (zoned != BLK_Z_NONE) {
> +ret = get_sysfs_long_val(, "chunk_sectors");
> +if (ret > 0) {
> +bs->bl.zone_sectors = ret;
> +}

It may be good to check that we are getting a valid zone size here. So may be
change the check to something like this ?

if (ret <= 0) {
*** print some error message mentioning the invalid zone size ***
bs->bl.zoned = BLK_Z_NONE;
return;
}
bs->bl.zone_sectors = ret;

> +
> +ret = get_sysfs_long_val(, "zone_append_max_bytes");
> +if (ret > 0) {
> +bs->bl.max_append_sectors = ret / 512;
> +}
> +
> +ret = get_sysfs_long_val(, "max_open_zones");
> +if (ret >= 0) {
> +bs->bl.max_open_zones = ret;
> +}
> +
> +ret = get_sysfs_long_val(, "max_active_zones");
> +if (ret >= 0) {
> +bs->bl.max_active_zones = ret;
> +}
> +}
>  }
>  
>  static int check_for_dasd(int fd)
> @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t 
> *in_off, int out_fd,
>  }
>  #endif
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +  const struct blk_zone *blkz) {
> +zone->start = blkz->start;
> +zone->length = blkz->len;
> +zone->cap = blkz->capacity;
> +zone->wp = blkz->wp;
> +
> +switch (blkz->type) {
> +case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +zone->type = BLK_ZT_SWR;
> +break;
> +case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +zone->type = BLK_ZT_SWP;
> +break;
> +case BLK_ZONE_TYPE_CONVENTIONAL:
> +zone->type = BLK_ZT_CONV;
> +break;
> +default:
> +