On Thu, 2 Jun 2022 at 11:28, Sam Li <faithilike...@gmail.com> wrote:
>
> Stefan Hajnoczi <stefa...@gmail.com> wrote on Thu, 2 Jun 2022 at 16:05:
> >
> > On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilike...@gmail.com> wrote:
> > >
> > > Hi Stefan,
> > >
> > > Stefan Hajnoczi <stefa...@gmail.com> wrote on Wed, 1 Jun 2022 at 19:43:
> > > >
> > > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > > > <damien.lem...@opensource.wdc.com> wrote:
> > > > >
> > > > > On 6/1/22 11:57, Sam Li wrote:
> > > > > > Hi Stefan,
> > > > > >
> > > > > > Stefan Hajnoczi <stefa...@gmail.com> wrote on Mon, 30 May 2022 at 19:19:
> > > > > >
> > > > > >>
> > > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilike...@gmail.com>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>> Hi everyone,
> > > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > > > >>> device support to QEMU's virtio-blk emulation.
> > > > > >>>
> > > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux
> > > > > >>> ZBD ioctls, I think the naive approach would be to introduce a new
> > > > > >>> stable struct zbd_zone descriptor for the library function
> > > > > >>> interface. More specifically, what I'd like to add to the
> > > > > >>> BlockDriver struct are:
> > > > > >>> 1. zbd_info as zone block device information: includes the number
> > > > > >>> of zones and the size of logical and physical blocks.
> > > > > >>> 2. zbd_zone_type and zbd_zone_state
> > > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > > >>> With those basic structs, we can start to implement new functions
> > > > > >>> as bdrv*() APIs for BLOCK*ZONE ioctls.
> > > > > >>>
> > > > > >>> I'll start to finish this task based on the above description. If
> > > > > >>> there is any problem or something I may miss in the design,
> > > > > >>> please let me know.
> > > > > >>
> > > > > >> Hi Sam,
> > > > > >> Can you propose function prototypes for the new BlockDriver
> > > > > >> callbacks needed for zoned devices?
> > > > > >
> > > > > > I have made some modifications based on Damien's advice in design
> > > > > > part 1 and added the function prototypes in design part 2. If there
> > > > > > is any problem or part I missed, please let me know.
> > > > > >
> > > > > > Design of Block Layer APIs in BlockDriver:
> > > > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > > > function interface.
> > > > > > a. zbd_info as zone block device information: includes the number of
> > > > > > zones, the block size, and the write granularity in bytes (minimal
> > > > > > write size and alignment)
> > > > > >    - write granularity: 512e SMR drives write in units of the
> > > > > > physical block size, 4096 bytes; NVMe ZNS write granularity is equal
> > > > > > to the block size.
> > > > > >    - zone descriptor: start, length, capacity, write pointer, zone
> > > > > > type
> > > > > > b. zbd_zone_type
> > > > > >    - zone type: conventional, sequential write required, sequential
> > > > > > write preferred
> > > > > > c. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > >
> > > > > This explanation is a little hard to understand. It seems to be mixing
> > > > > up device-level information and per-zone information. I think it would
> > > > > be a lot simpler to write a struct definition to directly illustrate
> > > > > what you are planning.
> > > > >
> > > > > Is it something like this?
> > > > >
> > > > > struct zbd_zone {
> > > > >     enum zone_type type;
> > > > >     enum zone_cond cond;
> > > > >     uint64_t start;
> > > > >     uint32_t length;
> > > > >     uint32_t cap;
> > > > >     uint64_t wp;
> > > > > };
> > > > >
> > > > > struct zbd_dev {
> > > > >     enum zone_model model;
> > > > >     uint32_t block_size;
> > > > >     uint32_t write_granularity;
> > > > >     uint32_t nr_zones;
> > > > >     struct zbd_zone *zones; /* array of zones */
> > > > > };
> > > > >
> > > > > If yes, then my comments are as follows.
> > > > >
> > > > > For the device struct: it may be good to also have the maximum number
> > > > > of open zones and the maximum number of active zones.
> > > > >
> > > > > For the zone struct: you may need to add a read-write lock per zone to
> > > > > be able to write-lock zones to ensure a sequential write pattern
> > > > > (virtio devices can be multi-queue and so writes may be coming in from
> > > > > different contexts) and to correctly emulate zone append operations
> > > > > with an atomic update of the wp field.
> > > > >
> > > > > These need to be integrated into the generic block driver interface in
> > > > > include/block/block_int-common.h or include/block/block-common.h.
> > > >
> > > > QEMU's block layer has a few ways of exposing information about block
> > > > devices:
> > > >
> > > > int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> > > > ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > > >                                              Error **errp);
> > > >
> > > > These fetch information from the BlockDriver and are good when a small
> > > > amount of data is reported occasionally and consumed by the caller.
> > > >
> > > > For data that is continuously accessed or that could be large, it may
> > > > be necessary for the data to reside inside BlockDriverState so that it
> > > > can be accessed in place (without copying):
> > > >
> > > > void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> > > >
> > > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > > > is continuously accessed by the block layer while processing I/O
> > > > requests. The "refresh" function updates the data in case the
> > > > underlying storage device has changed somehow. If no update function
> > > > is necessary then the data can simply be populated during .bdrv_open()
> > > > and no new BlockDriver callback needs to be added.
> > > >
> > > > So in the simplest case BlockDriverState can be extended with a struct
> > > > zbd_dev field that is populated during .bdrv_open(). If the
> > > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > > > or the model field indicates that this is not a zoned storage device.
> > > >
> > > > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > > > expose this data to users like the hw/block/virtio-blk.c emulation
> > > > code or the qemu-io-cmds.c utility that can be used for testing. A
> > > > BlockBackend has a root pointer to a BlockDriverState graph (for
> > > > example, qcow2 on top of file-posix). It will be necessary to
> > > > propagate zoned storage information from the leaf BlockDriverState all
> > > > the way up to the BlockBackend. In simple cases the BB root points
> > > > directly to the file-posix BDS that has Linux ZBD support but the
> > > > design needs to account for additional BDS graph nodes.
> > >
> > > I think a simple way to think about the BlockBackend APIs is to follow
> > > the call chain: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> > > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> > > The last function call will call bdrv_co_zone_mgmt() in
> > > block/file-posix.c. If I understand the additional case correctly, the
> > > BlockBackend API can expose the zone information to the virtio-blk
> > > emulation now.
> >
> > Yes!
> >
> > block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
> > calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
> > raw-format.c driver usually sits on top of file-posix.c and has to
> > pass through requests.
> >
> > There are filter block drivers like block/throttle.c,
> > block/blkdebug.c, etc (git grep is_filter block/) that will also need
> > to be modified to pass through requests in the same way.
>
> Are the filter block drivers also on top of file-posix.c but below
> block-backend.c? I read that the filter block drivers and formats are
> designed to be manageable pieces so as to make block device
> configuration easier and clearer.
Yes, the filters are BlockDrivers and each instance has a
BlockDriverState. They are part of the same BlockDriverState graph as
file-posix.c.

BlockBackend has a "root" BlockDriverState pointer. It points to a
graph of BlockDriverStates and there are 3 types of nodes:

- Filter nodes like block/throttle.c that provide some extra
  functionality like I/O throttling.

- Format nodes like qcow2, raw, or vmdk that implement disk image file
  formats.

- Protocol nodes like file-posix, iSCSI, etc that implement access to
  underlying storage.

The graph is pretty flexible. It's possible to insert/remove nodes to
construct arbitrary graphs. Protocol nodes are the leaf nodes in the
graph. Filter and format nodes are above protocol nodes.

If we want to open /dev/nullb0 and limit I/O rates to 10 MB/s the
graph would be:

  throttle (10 MB/s) -> raw-format -> file-posix (/dev/nullb0)

The BlockBackend root would point at the throttle node. I/O requests
made using blk_*() APIs will be forwarded to the throttle node using
bdrv_*() APIs. The throttle node forwards requests to the raw-format
node using bdrv_*() APIs. The raw-format node forwards I/O requests to
the file-posix node using bdrv_*() APIs.

Here is block/raw-format.c's preadv implementation:

static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
                                      int64_t bytes, QEMUIOVector *qiov,
                                      BdrvRequestFlags flags)
{
    int ret;

    ret = raw_adjust_offset(bs, &offset, bytes, false);
    if (ret) {
        return ret;
    }

    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
           ^^^^^^^^^^^^^^^^^^^^^^^ forward I/O to child graph node
}

> > Based on what I've read in Dmitry's virtio-blk spec proposal, the
> > BlockBackend API could look something like:
> >
> > typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
> > void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);
> >
> > virtio-blk.c calls this to fill out configuration space fields and
> > determine whether the BlockBackend is a zoned device.
> >
> > Then there are 3 commands that happen in the I/O code path:
> >
> > typedef struct { ... } BlockZoneDescriptor;
> > BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> >     BlockZoneDescriptor *zones, size_t max_zones,
> >     BlockCompletionFunc *cb, void *opaque);
> >
> > typedef enum { ... } BlockZoneMgmtCmd;
> > BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
> >     BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb,
> >     void *opaque);
> >
> > typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
> >     int64_t new_wp);
> > BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
> >     QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb,
> >     void *opaque);
> > >
> > > Besides, comparing blk_aio_flush() with blk_flush() in
> > > block-backend.c, which leads to include/block/block-io.h, we may
> > > need to consider the case of calling block layer APIs from a
> > > non-coroutine context. Meanwhile, using
> > > bdrv_co_writev()/bdrv_co_readv() instead of a read-write lock per
> > > zone may be an option too.
> >
> > Yes, device emulation code usually uses the aio versions of the
> > BlockBackend I/O functions (read, write, flush). The QEMU block layer
> > runs the aio I/O request inside a coroutine and usually also exposes
> > coroutine versions of the same functions. For example, block jobs
> > (e.g. storage mirroring, backup, and migration background tasks)
> > usually call the coroutine versions of the BlockBackend APIs instead
> > of the aio ones.
> >
> > qemu-io-cmds.c will want synchronous versions of the aio commands
> > (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
> > block until the command completes. This is because the qemu-io
> > utility typically executes one command at a time and it's written
> > mostly in a blocking style rather than with async callbacks or
> > coroutines. docs/devel/block-coroutine-wrapper.rst describes how to
> > generate synchronous versions of coroutine functions.
> >
> > Do you want to start implementing blk_get_zone_info()? This will
> > require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
> > functions.
>
> I want to implement the smallest part that can be tested first and
> then move on to the next part. And I want to test the zone report
> operation first. Does the qemu-iotests test require the following
> parts to work: bdrv_co_zone_report() in file-posix.c,
> blk_get_zone_info() in block-backend.c, blk_aio_zone_report() in the
> I/O code path, and modified tests in tests/qemu-iotests?

If it does, then yes.

blk_aio_zone_report() can be implemented later. It is not needed by
qemu-io. blk_get_zone_info() will be needed soon but maybe you can
skip it while working on the first version of blk_co_zone_report().

The steps are:

1. Add a .bdrv_co_zone_report() callback to BlockDriver and define a
   BlockZoneDescriptor struct.

2. Implement the .bdrv_co_zone_report() callback in block/file-posix.c
   using ioctl(BLKREPORTZONE).

3. Implement bdrv_co_zone_report() in block/io.c. It calls
   bs->drv->bdrv_co_zone_report() or returns -ENOTSUP if
   bs->drv->bdrv_co_zone_report is NULL.

4. Implement blk_co_zone_report() in block/block-backend.c. It calls
   bdrv_co_zone_report(blk->root, ...).

5. Generate a synchronous blk_zone_report() wrapper. See
   docs/devel/block-coroutine-wrapper.rst.

You now have a working zone report command. It will work with
--blockdev file,filename=/dev/nullb0,node-name=blk0, which creates a
graph with just one block/file-posix.c BlockDriverState node.
It won't work with QEMU's older --drive
if=none,id=blk0,format=raw,file=/dev/nullb0 syntax because that
creates a raw-format -> file-posix graph and you haven't implemented
.bdrv_co_zone_report() in block/raw-format.c yet (but you can skip it
for now).

For testing you can add a qemu-io -c zone_report command to
qemu-io-cmds.c that calls blk_zone_report(). Then you can write a
tests/qemu-iotests/tests/zoned test script that reports zones using
qemu-io, writes to the first sectors of the disk using qemu-io, and
then reports zones again to prove that the output has changed. Use the
qemu-io --image-opts driver=host_device,filename=/dev/nullb0 option to
open a Linux null_blk device using the block/file-posix.c BlockDriver.

About qemu-iotests: the output from running the test case is diffed
against a reference file that contains the expected output. This is
quite convenient because you don't have to write code that checks for
the expected output; you just provide a
tests/qemu-iotests/tests/zoned.out file containing the output for a
passing test. There is documentation about qemu-iotests here:
https://qemu.readthedocs.io/en/latest/devel/testing.html#qemu-iotests

Stefan