On Thu, 2 Jun 2022 at 11:28, Sam Li <faithilike...@gmail.com> wrote:
>
> Stefan Hajnoczi <stefa...@gmail.com> wrote on Thu, 2 Jun 2022 at 16:05:
> >
> > On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilike...@gmail.com> wrote:
> > >
> > > Hi Stefan,
> > >
> > > Stefan Hajnoczi <stefa...@gmail.com> wrote on Wed, 1 Jun 2022 at 19:43:
> > > >
> > > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > > > <damien.lem...@opensource.wdc.com> wrote:
> > > > >
> > > > > On 6/1/22 11:57, Sam Li wrote:
> > > > > > Hi Stefan,
> > > > > >
> > > > > > Stefan Hajnoczi <stefa...@gmail.com> wrote on Mon, 30 May 2022 at 19:19:
> > > > > >
> > > > > >>
> > > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilike...@gmail.com>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>> Hi everyone,
> > > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > > > >>> device support to QEMU's virtio-blk emulation.
> > > > > >>>
> > > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux
> > > > > >>> ZBD ioctls, I think the naive approach would be to introduce a new
> > > > > >>> stable struct zbd_zone descriptor for the library function
> > > > > >>> interface. More specifically, what I'd like to add to the
> > > > > >>> BlockDriver struct are:
> > > > > >>> 1. zbd_info as zone block device information: includes the number
> > > > > >>> of zones and the size of logical and physical blocks.
> > > > > >>> 2. zbd_zone_type and zbd_zone_state
> > > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > > >>> With those basic structs, we can start to implement new functions
> > > > > >>> as bdrv*() APIs for BLOCK*ZONE ioctls.
> > > > > >>>
> > > > > >>> I'll start to finish this task based on the above description. If
> > > > > >>> there is any problem or something I may miss in the design,
> > > > > >>> please let me know.
> > > > > >>
> > > > > >> Hi Sam,
> > > > > >> Can you propose function prototypes for the new BlockDriver
> > > > > >> callbacks needed for zoned devices?
> > > > > >
> > > > > > I have made some modifications based on Damien's advice in design
> > > > > > part 1 and added the function prototypes in design part 2. If there
> > > > > > is any problem or part I missed, please let me know.
> > > > > >
> > > > > > Design of Block Layer APIs in BlockDriver:
> > > > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > > > function interface.
> > > > > > a. zbd_info as zone block device information: includes the number of
> > > > > > zones, the block size, and the write granularity in bytes (minimal
> > > > > > write size and alignment)
> > > > > >    - write granularity: 512e SMR drives write in units of the
> > > > > > physical block size, 4096 bytes; NVMe ZNS write granularity is equal
> > > > > > to the block size.
> > > > > >    - zone descriptor: start, length, capacity, write pointer, zone
> > > > > > type
> > > > > > b. zbd_zone_type
> > > > > >    - zone type: conventional, sequential write required, sequential
> > > > > > write preferred
> > > > > > c. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > >
> > > > > This explanation is a little hard to understand. It seems to be mixing
> > > > > up device-level information and per-zone information. I think it would
> > > > > be a lot simpler to write a struct definition to directly illustrate
> > > > > what you are planning.
> > > > >
> > > > > Is it something like this?
> > > > >
> > > > > struct zbd_zone {
> > > > >     enum zone_type type;
> > > > >     enum zone_cond cond;
> > > > >     uint64_t start;
> > > > >     uint32_t length;
> > > > >     uint32_t cap;
> > > > >     uint64_t wp;
> > > > > };
> > > > >
> > > > > struct zbd_dev {
> > > > >     enum zone_model model;
> > > > >     uint32_t block_size;
> > > > >     uint32_t write_granularity;
> > > > >     uint32_t nr_zones;
> > > > >     struct zbd_zone *zones; /* array of zones */
> > > > > };
> > > > >
> > > > > If yes, then my comments are as follows.
> > > > >
> > > > > For the device struct: it may be good to also have the maximum number
> > > > > of open zones and the maximum number of active zones.
> > > > >
> > > > > For the zone struct: you may need to add a read-write lock per zone to
> > > > > be able to write-lock zones to ensure a sequential write pattern
> > > > > (virtio devices can be multi-queue and so writes may be coming in from
> > > > > different contexts) and to correctly emulate zone append operations
> > > > > with an atomic update of the wp field.
> > > > >
> > > > > These need to be integrated into the generic block driver interface in
> > > > > include/block/block_int-common.h or include/block/block-common.h.
> > > >
> > > > QEMU's block layer has a few ways of exposing information about block
> > > > devices:
> > > >
> > > > int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> > > > ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > > >                                              Error **errp);
> > > >
> > > > These fetch information from the BlockDriver and are good when a small
> > > > amount of data is reported occasionally and consumed by the caller.
> > > >
> > > > For data that is continuously accessed or that could be large, it may
> > > > be necessary for the data to reside inside BlockDriverState so that it
> > > > can be accessed in place (without copying):
> > > >
> > > > void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> > > >
> > > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > > > is continuously accessed by the block layer while processing I/O
> > > > requests. The "refresh" function updates the data in case the
> > > > underlying storage device has changed somehow. If no update function
> > > > is necessary then the data can simply be populated during .bdrv_open()
> > > > and no new BlockDriver callback needs to be added.
> > > >
> > > > So in the simplest case BlockDriverState can be extended with a struct
> > > > zbd_dev field that is populated during .bdrv_open(). If the
> > > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > > > or the model field indicates that this is not a zoned storage device.
> > > >
> > > > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > > > expose this data to users like the hw/block/virtio-blk.c emulation
> > > > code or the qemu-io-cmds.c utility that can be used for testing. A
> > > > BlockBackend has a root pointer to a BlockDriverState graph (for
> > > > example, qcow2 on top of file-posix). It will be necessary to
> > > > propagate zoned storage information from the leaf BlockDriverState all
> > > > the way up to the BlockBackend. In simple cases the BB root points
> > > > directly to the file-posix BDS that has Linux ZBD support but the
> > > > design needs to account for additional BDS graph nodes.
> > >
> > > I think a simple way to think about the BlockBackend APIs is to follow
> > > the call chain: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> > > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> > > The last function call will call bdrv_co_zone_mgmt() in
> > > block/file-posix.c. If I understand the additional case correctly, the
> > > BlockBackend API can expose the zone information to the virtio-blk
> > > emulation now.
> >
> > Yes!
> >
> > block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
> > calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
> > raw-format.c driver usually sits on top of file-posix.c and has to
> > pass through requests.
> >
> > There are filter block drivers like block/throttle.c,
> > block/blkdebug.c, etc (git grep is_filter block/) that will also need
> > to be modified to pass through requests in the same way.
>
> Are the filter block drivers also on top of file-posix.c but below
> block-backend.c? I read that the filter block drivers and formats are
> designed to be manageable pieces so as to make block device
> configuration easier and clearer.
Yes, the filters are BlockDrivers and each instance has a
BlockDriverState. They are part of the same BlockDriverState graph as
file-posix.c.

BlockBackend has a "root" BlockDriverState pointer. It points to a
graph of BlockDriverStates and there are 3 types of nodes:

- Filter nodes like block/throttle.c that provide some extra
  functionality like I/O throttling.

- Format nodes like qcow2, raw, or vmdk that implement disk image file
  formats.

- Protocol nodes like file-posix, iSCSI, etc that implement access to
  underlying storage.

The graph is pretty flexible. It's possible to insert/remove nodes to
construct arbitrary graphs. Protocol nodes are the leaf nodes in the
graph. Filter and format nodes are above protocol nodes.

If we want to open /dev/nullb0 and limit I/O rates to 10 MB/s the
graph would be:

  throttle (10 MB/s) -> raw-format -> file-posix (/dev/nullb0)

The BlockBackend root would point at the throttle node. I/O requests
made using blk_*() APIs will be forwarded to the throttle node using
bdrv_*() APIs. The throttle node forwards requests to the raw-format
node using bdrv_*() APIs. The raw-format node forwards I/O requests to
the file-posix node using bdrv_*() APIs.

Here is block/raw-format.c's preadv implementation:

static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
                                      int64_t bytes, QEMUIOVector *qiov,
                                      BdrvRequestFlags flags)
{
    int ret;

    ret = raw_adjust_offset(bs, &offset, bytes, false);
    if (ret) {
        return ret;
    }

    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
           ^^^^^^^^^^^^^^^^^^^^^^^ forward I/O to child graph node
}

> > Based on what I've read in Dmitry's virtio-blk spec proposal, the
> > BlockBackend API could look something like:
> >
> > typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
> > void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);
> >
> > virtio-blk.c calls this to fill out configuration space fields and
> > determine whether the BlockBackend is a zoned device.
> >
> > Then there are 3 commands that happen in the I/O code path:
> >
> > typedef struct { ... } BlockZoneDescriptor;
> > BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> >     BlockZoneDescriptor *zones, size_t max_zones,
> >     BlockCompletionFunc *cb, void *opaque);
> >
> > typedef enum { ... } BlockZoneMgmtCmd;
> > BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
> >     BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb,
> >     void *opaque);
> >
> > typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
> >     int64_t new_wp);
> > BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
> >     QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb,
> >     void *opaque);
> > >
> > > Besides, comparing blk_aio_flush() with blk_flush() in
> > > block-backend.c, which leads to include/block/block-io.h, we may
> > > need to consider the case of calling block layer APIs from a
> > > non-coroutine context. Meanwhile, using
> > > bdrv_co_writev()/bdrv_co_readv() instead of a read-write lock per
> > > zone may be an option too.
> >
> > Yes, device emulation code usually uses the aio versions of the
> > BlockBackend I/O functions (read, write, flush). The QEMU block layer
> > runs the aio I/O request inside a coroutine and usually also exposes
> > coroutine versions of the same functions. For example, block jobs
> > (e.g. storage mirroring, backup, and migration background tasks)
> > usually call the coroutine versions of the BlockBackend APIs instead
> > of the aio ones.
> >
> > qemu-io-cmds.c will want synchronous versions of the aio commands
> > (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
> > block until the command completes. This is because the qemu-io
> > utility typically executes one command at a time and it's written
> > mostly in a blocking style rather than with async callbacks or
> > coroutines. docs/devel/block-coroutine-wrapper.rst describes how to
> > generate synchronous versions of coroutine functions.
> >
> > Do you want to start implementing blk_get_zone_info()? This will
> > require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
> > functions.
>
> I want to implement the smallest part that can be tested first and
> then move on to the next part. And I want to test the zone report
> operation first. Does the qemu-iotests test require the following
> parts to work: bdrv_co_zone_report() in file-posix.c,
> blk_get_zone_info() in block-backend.c, blk_aio_zone_report() in the
> I/O code path, and modified tests in tests/qemu-iotests?

If it does, then yes.

blk_aio_zone_report() can be implemented later. It is not needed by
qemu-io. blk_get_zone_info() will be needed soon but maybe you can
skip it while working on the first version of blk_co_zone_report().

The steps are:

1. Add a .bdrv_co_zone_report() callback to BlockDriver and define a
   BlockZoneDescriptor struct.

2. Implement the .bdrv_co_zone_report() callback in block/file-posix.c
   using ioctl(BLKREPORTZONE).

3. Implement bdrv_co_zone_report() in block/io.c. It calls
   bs->drv->bdrv_co_zone_report() or returns -ENOTSUP if
   bs->drv->bdrv_co_zone_report is NULL.

4. Implement blk_co_zone_report() in block/block-backend.c. It calls
   bdrv_co_zone_report(blk->root, ...).

5. Generate a synchronous blk_zone_report() wrapper. See
   docs/devel/block-coroutine-wrapper.rst.

You now have a working zone report command. It will work with
--blockdev file,filename=/dev/nullb0,node-name=blk0, which creates a
graph with just one block/file-posix.c BlockDriverState node.
It won't work with QEMU's older --drive
if=none,id=blk0,format=raw,file=/dev/nullb0 syntax because that
creates a raw-format -> file-posix graph and you haven't implemented
.bdrv_co_zone_report() in block/raw-format.c yet (but you can skip it
for now).

For testing you can add a qemu-io -c zone_report command to
qemu-io-cmds.c that calls blk_zone_report(). Then you can write a
tests/qemu-iotests/tests/zoned test script that reports zones using
qemu-io, writes to the first sectors of the disk using qemu-io, and
then reports zones again to prove that the output has changed. Use the
qemu-io --image-opts driver=host_device,filename=/dev/nullb0 option to
open a Linux null_blk device using the block/file-posix.c BlockDriver.

About qemu-iotests: the output from running the test case is diffed
against a reference file that contains the expected output. This is
quite convenient because you don't have to write code that checks for
the expected output; you just provide a
tests/qemu-iotests/tests/zoned.out file containing the output for a
passing test. There is documentation about qemu-iotests here:
https://qemu.readthedocs.io/en/latest/devel/testing.html#qemu-iotests

Stefan