Re: [PATCH v5 00/11] block: file-posix queue
On 25.06.21 10:52, Paolo Bonzini wrote: On 25/06/21 10:37, Max Reitz wrote: Thanks, looks good to me, though I’m afraid I’ll be on PTO the next two weeks so I can’t take this series through my tree... (I don’t really want to send a pull request today when I probably wouldn’t be able to deal with potential problems) I can take it too, if it's okay for you. Sure!
Re: [PATCH v5 00/11] block: file-posix queue
On 24.06.21 20:04, Paolo Bonzini wrote: New patches: - 3/4 (for review comments), - 9 (split for ease of review), - 11 (new bugfix) v1->v2: add missing patch v2->v3: add max_hw_transfer to BlockLimits v3->v4: fix compilation after patch 1, tweak commit messages according to Vladimir's review v4->v5: round down max_transfer and max_hw_transfer to request alignment checkpatch fixes return -ENOTSUP, -not -EIO if block limits ioctls fail handle host_cdrom like host_device in QAPI split "block: try BSD disk size ioctls one after another" new bugfix patch "file-posix: handle EINTR during ioctl" Thanks, looks good to me, though I’m afraid I’ll be on PTO the next two weeks so I can’t take this series through my tree... (I don’t really want to send a pull request today when I probably wouldn’t be able to deal with potential problems) Max
Re: [PATCH 11/11] file-posix: handle EINTR during ioctl
On 24.06.21 20:04, Paolo Bonzini wrote: Similar to other handle_aiocb_* functions, handle_aiocb_ioctl needs to cater for the possibility that ioctl is interrupted by a signal. Otherwise, the I/O is incorrectly reported as a failure to the guest. Reported-by: Gordon Watson Signed-off-by: Paolo Bonzini --- block/file-posix.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/file-posix.c b/block/file-posix.c index 74b8216077..a26eab0ac3 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1347,7 +1347,9 @@ static int handle_aiocb_ioctl(void *opaque) RawPosixAIOData *aiocb = opaque; int ret; -ret = ioctl(aiocb->aio_fildes, aiocb->ioctl.cmd, aiocb->ioctl.buf); +do { +ret = ioctl(aiocb->aio_fildes, aiocb->ioctl.cmd, aiocb->ioctl.buf); +} while (ret == -1 && errno == EINTR); if (ret == -1) { return -errno; } I think the macro TFR() from qemu-common.h could be applied here, probably like `TFR(ret = ioctl(...));`. No matter: Reviewed-by: Max Reitz
Re: [PATCH 10/11] block: detect DKIOCGETBLOCKCOUNT/SIZE before use
On 24.06.21 20:04, Paolo Bonzini wrote: From: Joelle van Dyne iOS hosts do not have these defined so we fallback to the default behaviour. Co-authored-by: Warner Losh Signed-off-by: Joelle van Dyne Signed-off-by: Paolo Bonzini --- block/file-posix.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Max Reitz
Re: [PATCH 09/11] block: try BSD disk size ioctls one after another
On 24.06.21 20:04, Paolo Bonzini wrote: Try all the possible ioctls for disk size as long as they are supported, to keep the #if ladder simple. Extracted and cleaned up from a patch by Joelle van Dyne and Warner Losh. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 34 -- 1 file changed, 16 insertions(+), 18 deletions(-) Thanks, that’s much better. Reviewed-by: Max Reitz
Re: [PATCH 07/11] block: feature detection for host block support
On 24.06.21 20:04, Paolo Bonzini wrote: From: Joelle van Dyne On Darwin (iOS), there are no system level APIs for directly accessing host block devices. We detect this at configure time. Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block/file-posix.c | 33 ++--- meson.build | 6 +- qapi/block-core.json | 14 ++ 3 files changed, 37 insertions(+), 16 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH 06/11] file-posix: try BLKSECTGET on block devices too, do not round to power of 2
On 24.06.21 20:04, Paolo Bonzini wrote: bs->sg is only true for character devices, but block devices can also be used with scsi-block and scsi-generic. Unfortunately BLKSECTGET returns bytes in an int for /dev/sgN devices, and sectors in a short for block devices, so account for that in the code. The maximum transfer also need not be a power of 2 (for example I have seen disks with 1280 KiB maximum transfer) so there's no need to pass the result through pow2floor. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 67 ++ 1 file changed, 38 insertions(+), 29 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH 05/11] block: add max_hw_transfer to BlockLimits
On 24.06.21 20:04, Paolo Bonzini wrote: For block host devices, I/O can happen through either the kernel file descriptor I/O system calls (preadv/pwritev, io_submit, io_uring) or the SCSI passthrough ioctl SG_IO. In the latter case, the size of each transfer can be limited by the HBA, while for file descriptor I/O the kernel is able to split and merge I/O in smaller pieces as needed. Applying the HBA limits to file descriptor I/O results in more system calls and suboptimal performance, so this patch splits the max_transfer limit in two: max_transfer remains valid and is used in general, while max_hw_transfer is limited to the maximum hardware size. max_hw_transfer can then be included by the scsi-generic driver in the block limits page, to ensure that the stricter hardware limit is used. Signed-off-by: Paolo Bonzini --- block/block-backend.c | 13 + block/file-posix.c | 2 +- block/io.c | 2 ++ hw/scsi/scsi-generic.c | 2 +- include/block/block_int.h | 7 +++ include/sysemu/block-backend.h | 1 + 6 files changed, 25 insertions(+), 2 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH 04/11] block-backend: align max_transfer to request alignment
On 24.06.21 20:04, Paolo Bonzini wrote: Block device requests must be aligned to bs->bl.request_alignment. It makes sense for drivers to align bs->bl.max_transfer the same way; however when there is no specified limit, blk_get_max_transfer just returns INT_MAX. Since the contract of the function does not specify that INT_MAX means "no maximum", just align the outcome of the function (whether INT_MAX or bs->bl.max_transfer) before returning it. Signed-off-by: Paolo Bonzini --- block/block-backend.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH 03/11] osdep: provide ROUND_DOWN macro
On 24.06.21 20:04, Paolo Bonzini wrote: osdep.h provides a ROUND_UP macro to hide bitwise operations for the purpose of rounding a number up to a power of two; add a ROUND_DOWN macro that does the same with truncation towards zero. While at it, change the formatting of some comments. Signed-off-by: Paolo Bonzini --- include/qemu/osdep.h | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) What a nice thing to have. Reviewed-by: Max Reitz
[PULL 1/1] block/snapshot: Clarify goto fallback behavior
In the bdrv_snapshot_goto() fallback code, we work with a pointer to either bs->file or bs->backing. We detach that child, close the node (with .bdrv_close()), apply the snapshot on the child node, and then re-open the node (with .bdrv_open()). In order for .bdrv_open() to attach the same child node that we had before, we pass "file={child-node}" or "backing={child-node}" to it. Therefore, when .bdrv_open() has returned success, we can assume that bs->file or bs->backing (respectively) points to our original child again. This is verified by an assertion. All of this is not immediately clear from a quick glance at the code, so add a comment to the assertion what it is for, and why it is valid. It certainly confused Coverity. Reported-by: Coverity (CID 1452774) Signed-off-by: Max Reitz Message-Id: <20210503095418.31521-1-mre...@redhat.com> [mreitz: s/close/detach/] Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/snapshot.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/snapshot.c b/block/snapshot.c index 6702c75e42..ccacda8bd5 100644 --- a/block/snapshot.c +++ b/block/snapshot.c @@ -275,13 +275,16 @@ int bdrv_snapshot_goto(BlockDriverState *bs, qobject_unref(file_options); g_free(subqdict_prefix); +/* Force .bdrv_open() below to re-attach fallback_bs on *fallback_ptr */ qdict_put_str(options, (*fallback_ptr)->name, bdrv_get_node_name(fallback_bs)); +/* Now close bs, apply the snapshot on fallback_bs, and re-open bs */ if (drv->bdrv_close) { drv->bdrv_close(bs); } +/* .bdrv_open() will re-attach it */ bdrv_unref_child(bs, *fallback_ptr); *fallback_ptr = NULL; @@ -296,7 +299,16 @@ int bdrv_snapshot_goto(BlockDriverState *bs, return ret < 0 ? ret : open_ret; } -assert(fallback_bs == (*fallback_ptr)->bs); +/* + * fallback_ptr is &bs->file or &bs->backing. *fallback_ptr + * was closed above and set to NULL, but the .bdrv_open() call + * has opened it again, because we set the respective option + * (with the qdict_put_str() call above). + * Assert that .bdrv_open() has attached some child on + * *fallback_ptr, and that it has attached the one we wanted + * it to (i.e., fallback_bs). + */ +assert(*fallback_ptr && fallback_bs == (*fallback_ptr)->bs); bdrv_unref(fallback_bs); return ret; } -- 2.31.1
[PULL 0/1] Block patch
The following changes since commit b22726abdfa54592d6ad88f65b0297c0e8b363e2: Merge remote-tracking branch 'remotes/vivier2/tags/linux-user-for-6.1-pull-request' into staging (2021-06-22 16:07:53 +0100) are available in the Git repository at: https://github.com/XanClic/qemu.git tags/pull-block-2021-06-24 for you to fetch changes up to 32a9a245d719a883eef2cbf07d2cf89efa0206d0: block/snapshot: Clarify goto fallback behavior (2021-06-24 09:49:04 +0200) Block patch: - Fix Coverity complaint in block/snapshot.c -------- Max Reitz (1): block/snapshot: Clarify goto fallback behavior block/snapshot.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) -- 2.31.1
Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
On 24.06.21 12:25, Vladimir Sementsov-Ogievskiy wrote: 24.06.2021 13:16, Max Reitz wrote: On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote: 23.06.2021 18:01, Max Reitz wrote: .bdrv_co_block_status() implementations are free to return a *pnum that exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp *pnum as necessary. On the other hand, if drivers' implementations return values for *pnum that are as large as possible, our recently introduced block-status cache will become more effective. So, make a note in block_int.h that @bytes is no upper limit for *pnum. Suggested-by: Eric Blake Signed-off-by: Max Reitz --- include/block/block_int.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/block/block_int.h b/include/block/block_int.h index fcb599dd1c..f85b96ed99 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -347,6 +347,11 @@ struct BlockDriver { * clamped to bdrv_getlength() and aligned to request_alignment, * as well as non-NULL pnum, map, and file; in turn, the driver * must return an error or set pnum to an aligned non-zero value. + * + * Note that @bytes is just a hint on how big of a region the + * caller wants to inspect. It is not a limit on *pnum. + * Implementations are free to return larger values of *pnum if + * doing so does not incur a performance penalty. Worth mention that the cache will benefit of it? Oh, right, absolutely. Like so: "block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to its caller, but it itself can still make use of the unclamped *pnum value. Specifically, the block-status cache for protocol nodes will benefit from storing as large a region as possible." Sounds good. Do you mean this as an addition or substitution? If the latter, I'd keep "if doing so does not incur a performance penalty I meant it as an addition, just a new paragraph. Max
Re: [PATCH v2 2/6] block: block-status cache for data regions
On 24.06.21 12:05, Vladimir Sementsov-Ogievskiy wrote: 23.06.2021 18:01, Max Reitz wrote: As we have attempted before (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), this patch seeks to reduce the number of SEEK_DATA/HOLE operations the file-posix driver has to perform. The main difference is that this time it is implemented as part of the general block layer code. The problem we face is that on some filesystems or in some circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the implementation is outside of qemu, there is little we can do about its performance. We have already introduced the want_zero parameter to bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls unless we really want zero information; but sometimes we do want that information, because for files that consist largely of zero areas, special-casing those areas can give large performance boosts. So the real problem is with files that consist largely of data, so that inquiring the block status does not gain us much performance, but where such an inquiry itself takes a lot of time. To address this, we want to cache data regions. Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz I'm new to RCU, so my review can't be reliable.. Yeah, well, same here. :) --- include/block/block_int.h | 47 ++ block.c | 84 +++ block/io.c | 61 ++-- 3 files changed, 189 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..fcb599dd1c 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,22 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @valid: Whether the cache is valid (should be accessed with atomic + * functions so this can be reset by RCU readers) + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + * the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { + bool valid; + int64_t data_start; + int64_t data_end; +} BdrvBlockStatusCache; + struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... @@ -997,6 +1013,11 @@ struct BlockDriverState { /* BdrvChild links to this node may never be frozen */ bool never_freeze; + + /* Lock for block-status cache RCU writers */ + CoMutex bsc_modify_lock; + /* Always non-NULL, but must only be dereferenced under an RCU read guard */ + BdrvBlockStatusCache *block_status_cache;> }; struct BlockBackendRootState { @@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs) */ void bdrv_drain_all_end_quiesce(BlockDriverState *bs); +/** + * Check whether the given offset is in the cached block-status data + * region. + * + * If it is, and @pnum is not NULL, *pnum is set to + * `bsc.data_end - offset`, i.e. how many bytes, starting from + * @offset, are data (according to the cache). + * Otherwise, *pnum is not touched. + */ +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum); + +/** + * If [offset, offset + bytes) overlaps with the currently cached + * block-status region, invalidate the cache. + * + * (To be used by I/O paths that cause data regions to be zero or + * holes.) + */ +void bdrv_bsc_invalidate_range(BlockDriverState *bs, + int64_t offset, int64_t bytes); + +/** + * Mark the range [offset, offset + bytes) as a data region. + */ +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes); + #endif /* BLOCK_INT_H */ diff --git a/block.c b/block.c index 3f456892d0..9ab9459f7a 100644 --- a/block.c +++ b/block.c @
Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote: 23.06.2021 18:01, Max Reitz wrote: .bdrv_co_block_status() implementations are free to return a *pnum that exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp *pnum as necessary. On the other hand, if drivers' implementations return values for *pnum that are as large as possible, our recently introduced block-status cache will become more effective. So, make a note in block_int.h that @bytes is no upper limit for *pnum. Suggested-by: Eric Blake Signed-off-by: Max Reitz --- include/block/block_int.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/block/block_int.h b/include/block/block_int.h index fcb599dd1c..f85b96ed99 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -347,6 +347,11 @@ struct BlockDriver { * clamped to bdrv_getlength() and aligned to request_alignment, * as well as non-NULL pnum, map, and file; in turn, the driver * must return an error or set pnum to an aligned non-zero value. + * + * Note that @bytes is just a hint on how big of a region the + * caller wants to inspect. It is not a limit on *pnum. + * Implementations are free to return larger values of *pnum if + * doing so does not incur a performance penalty. Worth mention that the cache will benefit of it? Oh, right, absolutely. Like so: "block/io.c's bdrv_co_block_status() will clamp *pnum before returning it to its caller, but it itself can still make use of the unclamped *pnum value. Specifically, the block-status cache for protocol nodes will benefit from storing as large a region as possible." ? Max
[PATCH v2 2/2] iotests/307: Test iothread conflict for exports
Passing fixed-iothread=true should make iothread conflicts fatal, whereas fixed-iothread=false should not. Combine the second case with an error condition that is checked after the iothread is handled, to verify that qemu does not crash if there is such an error after changing the iothread failed. Signed-off-by: Max Reitz --- tests/qemu-iotests/307 | 15 +++ tests/qemu-iotests/307.out | 8 2 files changed, 23 insertions(+) diff --git a/tests/qemu-iotests/307 b/tests/qemu-iotests/307 index c7685347bc..b429b5aa50 100755 --- a/tests/qemu-iotests/307 +++ b/tests/qemu-iotests/307 @@ -41,9 +41,11 @@ with iotests.FilePath('image') as img, \ iotests.log('=== Launch VM ===') vm.add_object('iothread,id=iothread0') +vm.add_object('iothread,id=iothread1') vm.add_blockdev(f'file,filename={img},node-name=file') vm.add_blockdev(f'{iotests.imgfmt},file=file,node-name=fmt') vm.add_blockdev('raw,file=file,node-name=ro,read-only=on') +vm.add_blockdev('null-co,node-name=null') vm.add_device(f'id=scsi0,driver=virtio-scsi,iothread=iothread0') vm.launch() @@ -74,6 +76,19 @@ with iotests.FilePath('image') as img, \ vm.qmp_log('query-block-exports') iotests.qemu_nbd_list_log('-k', socket) +iotests.log('\n=== Add export with conflicting iothread ===') + +vm.qmp_log('device_add', id='sdb', driver='scsi-hd', drive='null') + +# Should fail because of fixed-iothread +vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null', + iothread='iothread1', fixed_iothread=True, writable=True) + +# Should ignore the iothread conflict, but then fail because of the +# permission conflict (and not crash) +vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null', + iothread='iothread1', fixed_iothread=False, writable=True) + iotests.log('\n=== Add a writable export ===') # This fails because share-rw=off diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out index 4b0c7e155a..ec8d2be0e0 100644 --- a/tests/qemu-iotests/307.out +++ b/tests/qemu-iotests/307.out @@ -51,6 +51,14 @@ exports available: 1 base:allocation +=== Add export with conflicting iothread === +{"execute": "device_add", "arguments": {"drive": "null", "driver": "scsi-hd", "id": "sdb"}} +{"return": {}} +{"execute": "block-export-add", "arguments": {"fixed-iothread": true, "id": "export1", "iothread": "iothread1", "node-name": "null", "type": "nbd", "writable": true}} +{"error": {"class": "GenericError", "desc": "Cannot change iothread of active block backend"}} +{"execute": "block-export-add", "arguments": {"fixed-iothread": false, "id": "export1", "iothread": "iothread1", "node-name": "null", "type": "nbd", "writable": true}} +{"error": {"class": "GenericError", "desc": "Permission conflict on node 'null': permissions 'write' are both required by an unnamed block device (uses node 'null' as 'root' child) and unshared by block device 'sdb' (uses node 'null' as 'root' child)."}} + === Add a writable export === {"execute": "block-export-add", "arguments": {"description": "This is the writable second export", "id": "export1", "name": "export1", "node-name": "fmt", "type": "nbd", "writable": true, "writethrough": true}} {"error": {"class": "GenericError", "desc": "Permission conflict on node 'fmt': permissions 'write' are both required by an unnamed block device (uses node 'fmt' as 'root' child) and unshared by block device 'sda' (uses node 'fmt' as 'root' child)."}} -- 2.31.1
[PATCH v2 0/2] block/export: Conditionally ignore set-context error
Hi, I had this v2 lying around for quite some time but for some reason never sent it. I probably just forgot. Sorry. v1: https://lists.nongnu.org/archive/html/qemu-block/2021-04/msg00584.html Changes in v2: - Letting `bdrv_try_set_aio_context()` return an error and then just discarding it conditionally is kind of not right. If we want to ignore the error, decide so from the start: Depending on `fixed_iothread`, pass either `errp` or `NULL`. - Patch 2: Reference output has changed because of 30ebb9aa920. git-backport-diff against v1: Key: [] : patches are identical [] : number of functional differences between upstream/downstream patch [down] : patch is downstream-only The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively 001/2:[0009] [FC] 'block/export: Conditionally ignore set-context error' 002/2:[0002] [FC] 'iotests/307: Test iothread conflict for exports' Max Reitz (2): block/export: Conditionally ignore set-context error iotests/307: Test iothread conflict for exports block/export/export.c | 5 - tests/qemu-iotests/307 | 15 +++ tests/qemu-iotests/307.out | 8 3 files changed, 27 insertions(+), 1 deletion(-) -- 2.31.1
[PATCH v2 1/2] block/export: Conditionally ignore set-context error
When invoking block-export-add with some iothread and fixed-iothread=false, and changing the node's iothread fails, the error is supposed to be ignored. However, it is still stored in *errp, which is wrong. If a second error occurs, the "*errp must be NULL" assertion in error_setv() fails: qemu-system-x86_64: ../util/error.c:59: error_setv: Assertion `*errp == NULL' failed. So if fixed-iothread=false, we should ignore the error by passing NULL to bdrv_try_set_aio_context(). Fixes: f51d23c80af73c95e0ce703ad06a300f1b3d63ef ("block/export: add iothread and fixed-iothread options") Signed-off-by: Max Reitz --- block/export/export.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/block/export/export.c b/block/export/export.c index fec7d9f738..6d3b9964c8 100644 --- a/block/export/export.c +++ b/block/export/export.c @@ -111,6 +111,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp) if (export->has_iothread) { IOThread *iothread; AioContext *new_ctx; +Error **set_context_errp; iothread = iothread_by_id(export->iothread); if (!iothread) { @@ -120,7 +121,9 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp) new_ctx = iothread_get_aio_context(iothread); -ret = bdrv_try_set_aio_context(bs, new_ctx, errp); +/* Ignore errors with fixed-iothread=false */ +set_context_errp = fixed_iothread ? errp : NULL; +ret = bdrv_try_set_aio_context(bs, new_ctx, set_context_errp); if (ret == 0) { aio_context_release(ctx); aio_context_acquire(new_ctx); -- 2.31.1
Re: [PATCH v4 7/7] block: detect DKIOCGETBLOCKCOUNT/SIZE before use
On 08.06.21 15:16, Paolo Bonzini wrote: From: Joelle van Dyne iOS hosts do not have these defined so we fallback to the default behaviour. Co-authored-by: Warner Losh Reviewed-by: Peter Maydell Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block/file-posix.c | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 5821e1afed..4e2f7cf508 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2322,8 +2322,11 @@ static int64_t raw_getlength(BlockDriverState *bs) again: #endif if (!fstat(fd, &sb) && (S_IFCHR & sb.st_mode)) { +size = 0; #ifdef DIOCGMEDIASIZE -if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) +if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) { +size = 0; +} #elif defined(DIOCGPART) { struct partinfo pi; @@ -2332,9 +2335,7 @@ again: else size = 0; } -if (size == 0) -#endif -#if defined(__APPLE__) && defined(__MACH__) +#elif defined(DKIOCGETBLOCKCOUNT) && defined(DKIOCGETBLOCKSIZE) In v3, I was wondering whether it’s intentional that the following DKIOCGETBLOCKCOUNT/SIZE block would no longer be used as a fallback if the DIOCGMEDIASIZE ioctl failed (which it was before this patch). I’m still wondering. Max { uint64_t sectors = 0; uint32_t sector_size = 0; @@ -2342,19 +2343,15 @@ again: if (ioctl(fd, DKIOCGETBLOCKCOUNT, §ors) == 0 && ioctl(fd, DKIOCGETBLOCKSIZE, §or_size) == 0) { size = sectors * sector_size; -} else { -size = lseek(fd, 0LL, SEEK_END); -if (size < 0) { -return -errno; -} } } -#else -size = lseek(fd, 0LL, SEEK_END); +#endif +if (size == 0) { +size = lseek(fd, 0LL, SEEK_END); +} if (size < 0) { return -errno; } -#endif #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) switch(s->type) { case FTYPE_CD:
Re: [PATCH v4 5/7] block: feature detection for host block support
On 08.06.21 15:16, Paolo Bonzini wrote: From: Joelle van Dyne On Darwin (iOS), there are no system level APIs for directly accessing host block devices. We detect this at configure time. Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block/file-posix.c | 33 ++--- meson.build | 6 +- qapi/block-core.json | 10 +++--- 3 files changed, 34 insertions(+), 15 deletions(-) [...] diff --git a/qapi/block-core.json b/qapi/block-core.json index 2ea294129e..2dd41be156 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -897,7 +897,8 @@ 'discriminator': 'driver', 'data': { 'file': 'BlockStatsSpecificFile', - 'host_device': 'BlockStatsSpecificFile', + 'host_device': { 'type': 'BlockStatsSpecificFile', + 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, 'nvme': 'BlockStatsSpecificNvme' } } ## @@ -2814,7 +2815,9 @@ { 'enum': 'BlockdevDriver', 'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs', 'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', -'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi', +'gluster', 'host_cdrom', +{'name': 'host_device', 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, +'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' }, @@ -3996,7 +3999,8 @@ 'ftps': 'BlockdevOptionsCurlFtps', 'gluster':'BlockdevOptionsGluster', 'host_cdrom': 'BlockdevOptionsFile', - 'host_device':'BlockdevOptionsFile', + 'host_device': { 'type': 'BlockdevOptionsFile', + 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, 'http': 'BlockdevOptionsCurlHttp', 'https': 'BlockdevOptionsCurlHttps', 'iscsi': 'BlockdevOptionsIscsi', As I asked in v3: What about host_cdrom? Max
Re: [PATCH v4 1/7] file-posix: fix max_iov for /dev/sg devices
On 08.06.21 21:14, Vladimir Sementsov-Ogievskiy wrote: 08.06.2021 16:16, Paolo Bonzini wrote: Even though it was only called for devices that have bs->sg set (which must be character devices), sg_get_max_segments looked at /sys/dev/block which only works for block devices. On Linux the sg driver has its own way to provide the maximum number of iovecs in a scatter/gather list, so add support for it. The block device path is kept because it will be reinstated in the next patches. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/file-posix.c b/block/file-posix.c index f37dfc10b3..536998a1d6 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1180,6 +1180,17 @@ static int sg_get_max_segments(int fd) goto out; } + if (S_ISCHR(st.st_mode)) { Why not check "if (bs->sg) {" instead? It seems to be more consistent with issuing SG_ ioctl. Or what I miss? I dismissed this in v3, because I didn’t understand why you’d raise this point. The function is called sg_*(), and it’s only called if bs->sg is true anyway. So clearly we can use SG_ ioctls, because the whole function is intended only for SG devices anyway. This time, I looked forward, and perhaps starting at patch 4 I can understand where you’re coming from, because then the function is used for host devices in general. So now I don’t particularly mind. I think it’s still clear that if there’s a host device here that’s a character device, then that’s going to be an SG device, so I don’t really have a preference between S_ISCHR() and bs->sg. Max + if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { + return ret; + } + return -ENOTSUP; + } + + if (!S_ISBLK(st.st_mode)) { + return -ENOTSUP; + } + sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments", major(st.st_rdev), minor(st.st_rdev)); sysfd = open(sysfspath, O_RDONLY);
[PATCH v2 2/6] block: block-status cache for data regions
As we have attempted before (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), this patch seeks to reduce the number of SEEK_DATA/HOLE operations the file-posix driver has to perform. The main difference is that this time it is implemented as part of the general block layer code. The problem we face is that on some filesystems or in some circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the implementation is outside of qemu, there is little we can do about its performance. We have already introduced the want_zero parameter to bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls unless we really want zero information; but sometimes we do want that information, because for files that consist largely of zero areas, special-casing those areas can give large performance boosts. So the real problem is with files that consist largely of data, so that inquiring the block status does not gain us much performance, but where such an inquiry itself takes a lot of time. To address this, we want to cache data regions. Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz --- include/block/block_int.h | 47 ++ block.c | 84 +++ block/io.c| 61 ++-- 3 files changed, 189 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..fcb599dd1c 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,22 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @valid: Whether the cache is valid (should be accessed with atomic + * functions so this can be reset by RCU readers) + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + *the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { +bool valid; +int64_t data_start; +int64_t data_end; +} BdrvBlockStatusCache; + struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... @@ -997,6 +1013,11 @@ struct BlockDriverState { /* BdrvChild links to this node may never be frozen */ bool never_freeze; + +/* Lock for block-status cache RCU writers */ +CoMutex bsc_modify_lock; +/* Always non-NULL, but must only be dereferenced under an RCU read guard */ +BdrvBlockStatusCache *block_status_cache; }; struct BlockBackendRootState { @@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs) */ void bdrv_drain_all_end_quiesce(BlockDriverState *bs); +/** + * Check whether the given offset is in the cached block-status data + * region. + * + * If it is, and @pnum is not NULL, *pnum is set to + * `bsc.data_end - offset`, i.e. how many bytes, starting from + * @offset, are data (according to the cache). + * Otherwise, *pnum is not touched. + */ +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum); + +/** + * If [offset, offset + bytes) overlaps with the currently cached + * block-status region, invalidate the cache. + * + * (To be used by I/O paths that cause data regions to be zero or + * holes.) + */ +void bdrv_bsc_invalidate_range(BlockDriverState *bs, + int64_t offset, int64_t bytes); + +/** + * Mark the range [offset, offset + bytes) as a data region. + */ +void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes); + #endif /* BLOCK_INT_H */ diff --git a/block.c b/block.c index 3f456892d0..9ab9459f7a 100644 --- a/block.c +++ b/block.c @@ -49,6 +49,8 @@ #include "qemu/timer.h" #include "qemu/cutils.h" #include "qemu/id.h" +#include "qemu/range.h" +#include "qemu/rcu.h" #include "bl
[PATCH v2 4/6] block/file-posix: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/file-posix.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b3fbb9bd63..aeb370d5bb 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t start, * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. */ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, bool want_zero, @@ -2727,7 +2728,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(bytes, hole - offset); +*pnum = hole - offset; /* * We are not allowed to return partial sectors, though, so @@ -2746,7 +2747,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); -*pnum = MIN(bytes, data - offset); +*pnum = data - offset; ret = BDRV_BLOCK_ZERO; } *map = offset; -- 2.31.1
[PATCH v2 6/6] block/iscsi: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/iscsi.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 4d2a416ce7..852384086b 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -781,9 +781,6 @@ retry: iscsi_allocmap_set_allocated(iscsilun, offset, *pnum); } -if (*pnum > bytes) { -*pnum = bytes; -} out_unlock: qemu_mutex_unlock(&iscsilun->mutex); g_free(iTask.err_str); -- 2.31.1
[PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append()
There is a comment above the BDS definition stating care must be taken to consider handling newly added fields in bdrv_append(). Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac (nine years ago), and in any case, bdrv_swap() was dropped in 8e419aefa (six years ago). So no such care is necessary anymore. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy --- include/block/block_int.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index 057d88b1fc..a8f9598102 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,12 +832,6 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; -/* - * Note: the function bdrv_append() copies and swaps contents of - * BlockDriverStates, so if you add new fields to this struct, please - * inspect bdrv_append() to determine if the new fields need to be - * copied as well. - */ struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... -- 2.31.1
[PATCH v2 5/6] block/gluster: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/gluster.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index e8ee14c8e9..8ef7bb18d5 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1461,7 +1461,8 @@ exit: * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. * * (Based on raw_co_block_status() from file-posix.c.) */ @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(bytes, hole - offset); +*pnum = hole - offset; ret = BDRV_BLOCK_DATA; } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); -*pnum = MIN(bytes, data - offset); +*pnum = data - offset; ret = BDRV_BLOCK_ZERO; } -- 2.31.1
[PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum
.bdrv_co_block_status() implementations are free to return a *pnum that exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp *pnum as necessary. On the other hand, if drivers' implementations return values for *pnum that are as large as possible, our recently introduced block-status cache will become more effective. So, make a note in block_int.h that @bytes is no upper limit for *pnum. Suggested-by: Eric Blake Signed-off-by: Max Reitz --- include/block/block_int.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/block/block_int.h b/include/block/block_int.h index fcb599dd1c..f85b96ed99 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -347,6 +347,11 @@ struct BlockDriver { * clamped to bdrv_getlength() and aligned to request_alignment, * as well as non-NULL pnum, map, and file; in turn, the driver * must return an error or set pnum to an aligned non-zero value. + * + * Note that @bytes is just a hint on how big of a region the + * caller wants to inspect. It is not a limit on *pnum. + * Implementations are free to return larger values of *pnum if + * doing so does not incur a performance penalty. */ int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs, bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum, -- 2.31.1
[PATCH v2 0/6] block: block-status cache for data regions
Hi, See the cover letter from v1 for the general idea: https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html The biggest change here in v2 is that instead of having a common CoMutex protect the block-status cache, we’re using RCU now. So to read from the cache, or even to invalidate it, no lock is needed, only to update it with new data. Disclaimer: I have no experience with RCU in practice so far, neither in qemu nor anywhere else. So I hope I’ve used it correctly... Differences to v1 in detail: - Patch 2: - Moved BdrvBlockStatusCache.lock up to BDS, it is now the RCU writer lock - BDS.block_status_cache is now a pointer, so it can be replaced with RCU - Moved all cache access functionality into helper functions (bdrv_bsc_is_data(), bdrv_bsc_invalidate_range(), bdrv_bsc_fill()) in block.c - Guard BSC accesses with RCU (BSC.valid is to be accessed atomically, which allows resetting it without taking an RCU write lock) - Check QLIST_EMPTY(&bs->children) not just when reading from the cache, but when filling it, too (so we don’t need an RCU update when it won’t make sense) - Patch 3: Added - Dropped the block/nbd patch (because it would make NBD query a larger range; the patch’s intent was to get more information for free, which this would not be) git-backport-diff against v1: Key: [] : patches are identical [] : number of functional differences between upstream/downstream patch [down] : patch is downstream-only The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively 001/6:[] [--] 'block: Drop BDS comment regarding bdrv_append()' 002/6:[0169] [FC] 'block: block-status cache for data regions' 003/6:[down] 'block: Clarify that @bytes is no limit on *pnum' 004/6:[] [--] 'block/file-posix: Do not force-cap *pnum' 005/6:[] [--] 'block/gluster: Do not force-cap *pnum' 006/6:[----] [--] 'block/iscsi: Do not force-cap *pnum' Max Reitz (6): block: Drop BDS comment regarding bdrv_append() block: block-status cache for data regions block: Clarify that @bytes is no limit on *pnum block/file-posix: Do not force-cap *pnum block/gluster: Do not force-cap *pnum block/iscsi: Do not force-cap *pnum include/block/block_int.h | 54 +++-- block.c | 84 +++ block/file-posix.c| 7 ++-- block/gluster.c | 7 ++-- block/io.c| 61 ++-- block/iscsi.c | 3 -- 6 files changed, 200 insertions(+), 16 deletions(-) -- 2.31.1
Re: [PATCH 4/4] iotests/308: Test allow-other
On 22.06.21 17:08, Kevin Wolf wrote: Am 14.06.2021 um 16:44 hat Max Reitz geschrieben: We cannot reasonably test the main point of allow-other, which is to allow users other than the current one to access the FUSE export, because that would require access to sudo, which this test most likely will not have. (Also, we would need to figure out some user/group that is on the machine and that is not the current user/group, which may become a bit hairy.) But we can test some byproducts: First, whether changing permissions works (our FUSE code only allows so for allow-other=true), and second, whether the kernel applies permission checks with allow-other=true (because that implies default_permissions). Signed-off-by: Max Reitz This seems to have the problem that you mentioned: --- /home/kwolf/source/qemu/tests/qemu-iotests/308.out +++ 308.out.bad @@ -205,7 +205,9 @@ 'writable': true, 'allow-other': true } } -{"return": {}} +fusermount3: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf +{"error": {"class": "GenericError", "desc": "Failed to mount FUSE session to export"}} +Timeout waiting for return on handle 2 (Invoking chmod) Permissions post-chmod: 666 (Removing all permissions) Maybe it should be a separate test case that is skipped with user_allow_other is disabled. Right. tests/qemu-iotests/308 | 91 ++ tests/qemu-iotests/308.out | 47 2 files changed, 138 insertions(+) diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308 index f122065d0f..1b2f908947 100755 --- a/tests/qemu-iotests/308 +++ b/tests/qemu-iotests/308 @@ -334,6 +334,97 @@ echo '=== Compare copy with original ===' $QEMU_IMG compare -f raw -F $IMGFMT "$COPIED_IMG" "$TEST_IMG" +echo +echo '=== Test permissions ===' + +# Test that you can only change permissions on the export with allow-other=true. +# We cannot really test the primary reason behind allow-other (i.e. to allow +# users other than the current one access to the export), because for that we +# would need sudo, which realistically nobody will allow this test to use. +# What we can do is test that allow-other=true also enables default_permissions, +# i.e. whether we can still read from the file if we remove the read permission. We already have other test cases that use sudo if available. Though I guess it means that these tests aren't run very often. Yes, I know, but honestly I don’t really want to deal with user management either. I had a paragraph about that in a preliminary version but decided to cut it, because, I thought it wouldn’t really matter. That problem is that I’d need to run qemu-io as some different user, and the question is, who is a different user? I suppose I could rely on “root” and “nobody” being valid users on any system, but I don’t think I can be sure that the user running the tests isn’t either of those. So I would have to check whether the current user is “root”, and then run it as “nobody”, or otherwise run it as “root”, but that just seems like I’m getting in too deep for something that isn’t really useful anyway, because on developers’ machines, it will most likely be skipped anyway. Max
Re: [PATCH 3/4] export/fuse: Let permissions be adjustable
On 22.06.21 17:02, Kevin Wolf wrote: Am 14.06.2021 um 16:44 hat Max Reitz geschrieben: Allow changing the file mode, UID, and GID through SETATTR. This only really makes sense with allow-other, though (because without it, the effective access mode is fixed to be 0600 (u+rw) with qemu's user being the file's owner), so changing these stat fields is not allowed without allow-other. Signed-off-by: Max Reitz --- block/export/fuse.c | 48 ++--- 1 file changed, 37 insertions(+), 11 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 1d54286d90..742e0af657 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -47,6 +47,10 @@ typedef struct FuseExport { bool writable; bool growable; bool allow_other; + +mode_t st_mode; +uid_t st_uid; +gid_t st_gid; } FuseExport; static GHashTable *exports; @@ -120,6 +124,13 @@ static int fuse_export_create(BlockExport *blk_exp, exp->growable = args->growable; exp->allow_other = args->allow_other; +exp->st_mode = S_IFREG | S_IRUSR; +if (exp->writable) { +exp->st_mode |= S_IWUSR; +} +exp->st_uid = getuid(); +exp->st_gid = getgid(); + ret = setup_fuse_export(exp, args->mountpoint, errp); if (ret < 0) { goto fail; @@ -329,7 +340,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode, int64_t length, allocated_blocks; time_t now = time(NULL); FuseExport *exp = fuse_req_userdata(req); -mode_t mode; length = blk_getlength(exp->common.blk); if (length < 0) { @@ -344,17 +354,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode, allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512); } -mode = S_IFREG | S_IRUSR; -if (exp->writable) { -mode |= S_IWUSR; -} - statbuf = (struct stat) { .st_ino = inode, -.st_mode= mode, +.st_mode= exp->st_mode, .st_nlink = 1, -.st_uid = getuid(), -.st_gid = getgid(), +.st_uid = exp->st_uid, +.st_gid = exp->st_gid, .st_size= length, .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment, .st_blocks = allocated_blocks, @@ -400,15 +405,23 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size, } /** - * Let clients set file attributes. Only resizing is supported. + * Let clients set file attributes. With allow_other, only resizing and + * changing permissions (st_mode, st_uid, st_gid) is allowed. Without + * allow_other, only resizing is supported. */ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf, int to_set, struct fuse_file_info *fi) { FuseExport *exp = fuse_req_userdata(req); +int supported_attrs; int ret; -if (to_set & ~FUSE_SET_ATTR_SIZE) { +supported_attrs = FUSE_SET_ATTR_SIZE; +if (exp->allow_other) { +supported_attrs |= FUSE_SET_ATTR_MODE | FUSE_SET_ATTR_UID | +FUSE_SET_ATTR_GID; +} +if (to_set & ~supported_attrs) { fuse_reply_err(req, ENOTSUP); return; } @@ -426,6 +439,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf, } } +if (to_set & FUSE_SET_ATTR_MODE) { +/* Only allow changing the file mode, not the type */ +exp->st_mode = (statbuf->st_mode & 0) | S_IFREG; +} Should we check that the mode actually makes sense? Not sure if making an image executable has a good use case, and making it writable in the permissions for a read-only export isn't a good idea either. I mean, I don’t mind what the user does. It doesn’t really faze us, I believe. If the image contains an executable ELF and the user wants to run it directly from FUSE... I don’t mind. As for +w on RO exports, I’m not sure. This reminds me of `sudo chattr +i $file`, which effectively makes any regular file read-only, too, and it can still keep +w. So the file permissions are basically just ACLs, getting permission for something doesn’t mean you can actually do it. OTOH, the difference to `chattr +i` is that we’d allow opening the export R/W, only writing would then fail. `chattr +i` does give EPERM when opening the file. So I’m not quite sure. I don’t really want to prevent the user from setting any access restrictions they want, but on the other hand, if writing can never work, then there really is no point in allowing +w. (Then I’m wondering, if we don’t allow +w, should we silently drop it or return an error? I guess returning success but not actually succeeding is weird, so we probably should return EROFS.) But +x can technically work, so I wouldn’t disallow it. Max
[PATCH] block: BDRV_O_NO_IO for backing file on creation
When creating an image file with a backing file, we generally try to open the backing file (unless -u was specified), mostly to verify that it is there, but also to get the file size if none was specified for the new image. For neither of these things do we need data I/O, and so we can pass BDRV_O_NO_IO when opening the backing file. This allows us to open even encrypted backing images without requiring the user to provide a secret. This makes the -u switch in iotests 189 and 198 unnecessary (and the $size parameter), so drop it, because this way we get regression tests for this patch here. Fixes: https://gitlab.com/qemu-project/qemu/-/issues/441 Signed-off-by: Max Reitz --- block.c| 6 +- tests/qemu-iotests/189 | 2 +- tests/qemu-iotests/198 | 2 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/block.c b/block.c index 3f456892d0..b459502632 100644 --- a/block.c +++ b/block.c @@ -6553,9 +6553,13 @@ void bdrv_img_create(const char *filename, const char *fmt, } assert(full_backing); -/* backing files always opened read-only */ +/* + * No need to do I/O here, which allows us to open encrypted + * backing images without needing the secret + */ back_flags = flags; back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING); +back_flags |= BDRV_O_NO_IO; backing_options = qdict_new(); if (backing_fmt) { diff --git a/tests/qemu-iotests/189 b/tests/qemu-iotests/189 index 4e463385b2..801494c6b9 100755 --- a/tests/qemu-iotests/189 +++ b/tests/qemu-iotests/189 @@ -67,7 +67,7 @@ echo "== verify pattern ==" $QEMU_IO --object $SECRET0 -c "read -P 0xa 0 $size" --image-opts $IMGSPECBASE | _filter_qemu_io | _filter_testdir echo "== create overlay ==" -_make_test_img --object $SECRET1 -o "encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -u -b "$TEST_IMG_BASE" -F $IMGFMT $size +_make_test_img --object $SECRET1 -o "encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -b "$TEST_IMG_BASE" -F $IMGFMT echo echo "== writing part of a cluster ==" diff --git a/tests/qemu-iotests/198 b/tests/qemu-iotests/198 index b333a8f281..1c93dea1f7 100755 --- a/tests/qemu-iotests/198 +++ b/tests/qemu-iotests/198 @@ -64,7 +64,7 @@ echo "== writing whole image base ==" $QEMU_IO --object $SECRET0 -c "write -P 0xa 0 $size" --image-opts $IMGSPECBASE | _filter_qemu_io | _filter_testdir echo "== create overlay ==" -_make_test_img --object $SECRET1 -o "encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -u -b "$TEST_IMG_BASE" -F $IMGFMT $size +_make_test_img --object $SECRET1 -o "encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -b "$TEST_IMG_BASE" -F $IMGFMT echo echo "== writing whole image layer ==" -- 2.31.1
Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
On 21.06.21 17:51, Vivek Goyal wrote: On Mon, Jun 21, 2021 at 11:02:16AM +0200, Max Reitz wrote: On 18.06.21 20:29, Vivek Goyal wrote: [snip] What am I not able to understand is that why we can't return error if we run into a temporary issue and can't generate file handle. What's that requirement that we need to hide the error and try to cover it up by some other means. There is no requirement, it’s just that we need to hide ENOTSUPP errors anyway (because e.g. some submounted filesystem may not support file handles), so I considered hiding temporary errors ENOTSUPP is not a temporary error I guess. But this is a good point that if host filesystem does not support file handles, should we return error so that user is forced to remove "-o inode_file_handles" option. I can see multiple modes and they all seem to be useful in different scenarios. A. Do not use file handles at all. B. Use file handles if filesystem supports it. Also this could do some kind of mix and matching so that some inodes use file handles while others use fd. We could think of implementing some threshold and if open fds go above this threshold, switch to file handle stuff. C. Strictly use file handles otherwise return error to caller. One possibility is that we implement two options inode_file_handles and no_inode_file_handles. - User specified -o no_inode_file_handles, implement A. - User specified -o inode_file_handles, implement C. - User did not specify anything, then use file handles opportunistically as seen fit by daemon. This provides us the maximum flexibility and this practically implements option B. IOW, if user did not specify anything, then we can think of implementing file handles by default and fallback to using O_PATH fds if filesystem does not support file handles (ENOTSUPP). But if user did specify "-o no_inode_file_handles" or "-o inode_file_handles", this kind of points to strictly implementing respective policy, IMHO. That's how I have implemented some other options. Alternatively, we could think of adding one more option say "inode_file_handles_only. - "-o no_inode_files_handles" implements A. - "-o inode_files_handles" implements B. - "-o inode_files_handles_only" implements C. - If user did not specify anything on command line, then its up to the daemon whatever default policy it wants and default can change over time. I think it makes sense not to punish the user for wanting to use file handles as much as possible and still gracefully handling submounts that don’t support them. So I’d want to implement B first, and have that be -o inode_files_handles. I think A as -o no_inode_file_handles is also there, automatically...? I don’t think there’s much of a problem with implementing C, except we probably want to log ENOTSUPP errors then, so the user can figure out what’s going on. I feel like we can still wait with such an option until there’s a demand for it. (Except that perhaps the demand would come in the form of “please help I use -o inode_file_handles but virtiofsd’s FD count is still too high I don’t know what to do”, and that probably wouldn’t be so great. But then again, inodes_files_handles_only wouldn’t help a user in that case either, because it changes “works badly” to “doesn’t work at all”. Now I’m wondering what the actual use case of inodes_files_handles_only would be.) not to really add complexity. (Which is true, as can be seen from the diff stat I posted below: Dropping the second hash table and making the handle part of lo_key, so that temporary errors are now fatal, generates a diff of +37/-66, where -20 are just two comments (which realistically I’d need to replace by different comments), so in terms of code, it’s +37/-46.) I’d just rather handle errors gracefully when I find it doesn’t really cost complexity. However, you made a good point in that we must require name_to_handle_at() to work if it worked before for some inode, not because it would be simpler, but because it would be wrong otherwise. As for the other way around... Well, now I don’t have a strong opinion on it. Handling temporary name_to_handle_at() failure after it worked the first time should not add extra complexity, but it wouldn’t be symmetric. Like, allowing temporary failure sometimes but not at other times. Right. If we decided that we need to generate file handle for an inode and underlying filesystem supports file handles, then handling temporary failures only sometimes will make it assymetric. At this point of time I am more inclined to return error to caller on temporary failures. But if this does not work well in practice, I am open to change the policy. The next question is, how do we detect temporary failure, because if we look up some new inode, name_to_handle_at() fails, we ignore it, and then it starts to work and we fail all further lookups, tha
Re: [PATCH 0/4] export/fuse: Allow other users access to the export
On 21.06.21 18:12, Kevin Wolf wrote: Am 14.06.2021 um 16:44 hat Max Reitz geschrieben: Hi, With the default mount options, FUSE mounts are not accessible to any users but the one who did the mount, not even to root. To allow such accesses, allow_other must be passed. This is probably useful to some people (it certainly is to me, e.g. when exporting some image as my normal user, and then trying to loop mount it as root), so this series adds a QAPI allow-other bool that will make the FUSE export code pass allow_other,default_permissions to FUSE. (default_permissions will make the kernel do the usual UNIX permission checks, which is something that makes a lot of sense when allowing other users access to the export.) This also requires our SETATTR code to be able to handle permission changes, though, so the user can then run chmod/chown/chgrp on the export to adjust its permissions to their need. The final patch adds a test. If there is even a use case for leaving the option off (not trusting root?), it must certainly be the less common case? So I'm not sure if allow-other should be an option at all, but if it is, enabling it by default would make more sense to me. Is there a reason why you picked false as the default, except that it is the old behaviour? No. :) Well, mostly. I also thought, if FUSE thinks allow_other shouldn’t be the default, who am I to decide otherwise. Now that I tried to find out why FUSE has it as the default (I only remember vague “security reasons”), I still couldn’t find out why, but I did find that using this option as non-root user requires /etc/fuse.conf to have user_allow_other in it, which I don’t think we can require. So I think it must be an option. As for which value should be the default, that probably depends on how common having user_allow_other in /etc/fuse.conf is. I know I never put it there, and it’s both on my Fedora and my Arch system. So I guess it seems rather common? Max
Re: [PATCH 2/6] block: block-status cache for data regions
On 19.06.21 12:20, Vladimir Sementsov-Ogievskiy wrote: 17.06.2021 18:52, Max Reitz wrote: As we have attempted before (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), this patch seeks to reduce the number of SEEK_DATA/HOLE operations the file-posix driver has to perform. The main difference is that this time it is implemented as part of the general block layer code. The problem we face is that on some filesystems or in some circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the implementation is outside of qemu, there is little we can do about its performance. We have already introduced the want_zero parameter to bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls unless we really want zero information; but sometimes we do want that information, because for files that consist largely of zero areas, special-casing those areas can give large performance boosts. So the real problem is with files that consist largely of data, so that inquiring the block status does not gain us much performance, but where such an inquiry itself takes a lot of time. To address this, we want to cache data regions. Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz --- include/block/block_int.h | 19 ++ block.c | 2 + block/io.c | 80 +-- 3 files changed, 98 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..c09512556a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,23 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @lock: Lock for accessing this object's fields + * @valid: Whether the cache is valid + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + * the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { + CoMutex lock; + bool valid; + int64_t data_start; + int64_t data_end; +} BdrvBlockStatusCache; + struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... @@ -997,6 +1014,8 @@ struct BlockDriverState { /* BdrvChild links to this node may never be frozen */ bool never_freeze; + + BdrvBlockStatusCache block_status_cache; }; struct BlockBackendRootState { diff --git a/block.c b/block.c index 3f456892d0..bad64d5c4f 100644 --- a/block.c +++ b/block.c @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void) qemu_co_queue_init(&bs->flush_queue); + qemu_co_mutex_init(&bs->block_status_cache.lock); + for (i = 0; i < bdrv_drain_all_count; i++) { bdrv_drained_begin(bs); } diff --git a/block/io.c b/block/io.c index 323854d063..320638cc48 100644 --- a/block/io.c +++ b/block/io.c @@ -35,6 +35,7 @@ #include "qapi/error.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" +#include "qemu/range.h" #include "sysemu/replay.h" /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, bool need_flush = false; int head = 0; int tail = 0; + BdrvBlockStatusCache *bsc = &bs->block_status_cache; int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); int alignment = MAX(bs->bl.pwrite_zeroes_alignment, @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, return -ENOTSUP; } + /* Invalidate the cached block-status data range if this write overlaps */ + qemu_co_mutex_lock(&bsc->lock); + if (bsc->valid && + ranges_overlap(o
Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum
On 19.06.21 12:53, Vladimir Sementsov-Ogievskiy wrote: 17.06.2021 18:52, Max Reitz wrote: bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz --- block/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/nbd.c b/block/nbd.c index 616f9ae6c4..930bd234de 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( .type = NBD_CMD_BLOCK_STATUS, .from = offset, .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), - MIN(bytes, s->info.size - offset)), + s->info.size - offset), .flags = NBD_CMD_FLAG_REQ_ONE, }; Hmm.. I don't that this change is correct. In contrast with file-posix you don't get extra information for free, you just make a larger request. This means that server will have to do more work. Oh, oops. Seems I was blind in my rage to replace this MIN() pattern. You’re absolutely right. So this patch should be dropped. Max (look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop). For example, assume that nbd export is a qcow2 image with all clusters allocated. With this change, nbd server will loop through the whole qcow2 image, load all L2 tables to return big allocated extent. So, only server can decide, could it add some extra free information to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't allow it.
Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum
On 19.06.21 12:36, Vladimir Sementsov-Ogievskiy wrote: 17.06.2021 18:52, Max Reitz wrote: bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/gluster.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index e8ee14c8e9..8ef7bb18d5 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1461,7 +1461,8 @@ exit: * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. * * (Based on raw_co_block_status() from file-posix.c.) */ @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ - *pnum = MIN(bytes, hole - offset); + *pnum = hole - offset; ret = BDRV_BLOCK_DATA; Interesting, isn't it a bug that we don't ROUND_UP *pnum to request_alignment here like it is done in file-posix ? Guess I forgot gluster in 9c3db310ff0 O:) I don’t think I’ll be able to reproduce it for gluster, but I suppose just doing the same thing for gluster should be fine... Max
Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum
On 18.06.21 22:16, Eric Blake wrote: On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote: bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. We should update the documentation in include/block/block_int.h to mention that the driver's block_status callback may treat *pnum as a soft cap, and that returning a larger value is fine. Oh, sure. Max But I agree with this change in the individual drivers, as long as we remember to make our global contract explicit that we can now rely on it ;) Reviewed-by: Eric Blake
Re: [PATCH 2/6] block: block-status cache for data regions
On 18.06.21 20:51, Eric Blake wrote: On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote: To address this, we want to cache data regions. Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. Here's hoping third time's the charm! Indeed :) (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz --- include/block/block_int.h | 19 ++ block.c | 2 + block/io.c| 80 +-- 3 files changed, 98 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..c09512556a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,23 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @lock: Lock for accessing this object's fields + * @valid: Whether the cache is valid + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + *the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { +CoMutex lock; +bool valid; +int64_t data_start; +int64_t data_end; +} BdrvBlockStatusCache; Looks like the right bits of information, and I'm glad you documented the need to be prepared for protocols that report split data sections rather than consolidated. +++ b/block/io.c @@ -35,6 +35,7 @@ #include "qapi/error.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" +#include "qemu/range.h" #include "sysemu/replay.h" /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, bool need_flush = false; int head = 0; int tail = 0; +BdrvBlockStatusCache *bsc = &bs->block_status_cache; int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); int alignment = MAX(bs->bl.pwrite_zeroes_alignment, @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, return -ENOTSUP; } +/* Invalidate the cached block-status data range if this write overlaps */ +qemu_co_mutex_lock(&bsc->lock); Are we going to be suffering from a lot of lock contention performance degradation? Is there a way to take advantage of RCU access patterns for any more performance without sacrificing correctness? The critical section is so short that I considered it fine. I wanted to use RW locks, but then realized that every RW lock operation is internally locked by another mutex, so it wouldn’t gain anything. I’m not sure whether RCU is worth it here. We could try something very crude, namely to just not take a lock and make `valid` an atomic. After all, it doesn’t really matter whether `data_start` and `data_end` are consistent values, and resetting `valid` to false is always safe. The worst that could happen is that a concurrent block-status call tries to set up an overlapping data area, which we thus fail to recognize here. But if such a thing were to happen, it could just as well happen before said concurrent call took any lock on `bsc`. +if (bsc->valid && +ranges_overlap(offset, bytes, bsc->data_start, + bsc->data_end - bsc->data_start)) +{ +bsc->valid = false; +} Do we want to invalidate the entire bsc, or can we be smart and leave the prefix intact (if offset > bsc->data_start, then set bsc->data_end to offset)? Perhaps we could be smart, but I don’t think it really makes a difference in practice, so I think keeping it simple is better. +qemu_co_mutex_unlock(&bsc->lock); Worth using WITH_QEMU_LOCK_GUARD? I knew I forgot something, right. Will use! Max
Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
On 18.06.21 20:29, Vivek Goyal wrote: On Fri, Jun 18, 2021 at 10:28:38AM +0200, Max Reitz wrote: On 17.06.21 23:21, Vivek Goyal wrote: On Wed, Jun 16, 2021 at 03:38:13PM +0200, Max Reitz wrote: On 11.06.21 22:04, Vivek Goyal wrote: On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote: Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Hi Max, What are the cases where we can not rely being able to generate file handles? I have no idea, but it’s much easier to claim we can’t than to prove that we can. I’d rather be resilient. Assuming that we can not genererate file handles all the time and hence mainitaing another inode cache seems little problematic to me. How so? It is extra complexity. Need to worry about one more hash table. Sure, I give it to you that we are not creating two copies of inodes. Same inode object is being added to two different hash tables and is being looked up using two different keys. I would rather start with that we can generate file handles and have a single inode cache. The assumption that we can generate file handles all the time is a much stronger one (and one which needs to be proven) than assuming that failure is possible. So if temporary failures can happen (like -ENOMEM, as you mentioned), these can happen with openat() too. And in that case we return error to caller. So why to try to hide temporary failures from name_to_handle_at(). Well, for openat() we don’t have a choice, if that fails, there is no fallback, so we must return an error. For name_to_handle_at(), there is a fallback. I am still reading your code and trying to understand it. But one question came to mind. What happens if we can generate file handle during lookup. But can't generate when same file is looked up again. - A file foo.txt is looked. We can create file handle and we add it to lo->inodes_by_handle as well as lo->inodes_by_ds. - Say somebody deleted file and created again and inode number got reused. - Now during ->revalidation path, lookup happens again. This time say we can't generate file handle. If am reading lo_do_find() code correctly, it will find the old inode using ids and return same inode as result of lookup. And we did not recognize that inode number has been reused. Oh, that’s a good point. If an lo_inode has no O_PATH fd but is only addressed by handle, we must always look it up by handle. And issues might arise if we could not generate file handle in first lookup but could generate in second. - A file foo.txt is lookedup. We can not create file handle and we add it to lo->inodes_by_ids. - Say somebody deleted file and created again and inode number got reused. This is not possible, because if we could not generate a file handle on the first lookup, the lo_inode will have an O_PATH fd attached to it, so the inode number cannot be reused while the lo_inode still exists. - Now during ->revalidation path, lookup happens again. This time say we can generate file handle. If am reading lo_do_find() code correctly, it will find the old inode using ids and return same inode as result of lookup. And we did not recognize that inode number has been reused. IOW, because we could not generate the file handle, we lost the ability to recognize that inode number has been reused. That means either we don't switch to using file handles or if we do switch, it is important that we can generate file handle to differentiate whether inode number has been used or not. If not, return error to caller. Caller probably will mark inode bad and let and do a lookup again. Also, it is still a single inode cache. I'
Re: [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
On 09.06.21 17:55, Max Reitz wrote: Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Therefore, we still enter every lo_inode object into inodes_by_ids, but having an entry in inodes_by_handle is optional. A potential inodes_by_handle entry then has precedence, the inodes_by_ids entry is just a fallback. Note that we do not generate lo_fhandle objects yet, and so we also do not enter anything into the inodes_by_handle map yet. Also, all lookups skip that map. We might manually create file handles with some code that is immediately removed by the next patch again, but that would break the assumption in lo_find() that every lo_inode with a non-NULL .fhandle must have an entry in inodes_by_handle and vice versa. So we leave actually using the inodes_by_handle map for the next patch. [1] If some application in the guest still has the file open, there is going to be a corresponding FD mapping in lo_data.fd_map. In such a case, the inode will only go away once every application in the guest has closed it. The problem described only applies to cases where the guest does not have the file open, and it is just in the dentry cache, basically. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 80 +--- 1 file changed, 64 insertions(+), 16 deletions(-) [...] @@ -3931,10 +3954,34 @@ static gboolean lo_key_equal(gconstpointer a, gconstpointer b) return la->ino == lb->ino && la->dev == lb->dev && la->mnt_id == lb->mnt_id; } +static guint lo_fhandle_hash(gconstpointer key) +{ +const struct lo_fhandle *fh = key; +guint hash; +size_t i; + +/* Basically g_str_hash() */ +hash = 5381; +for (i = 0; i < sizeof(fh->padding); i++) { +hash += hash * 33 + (unsigned char)fh->padding[i]; +} +hash += hash * 33 + fh->mount_id; Just spotted: These `+=` should be `=`. Max + +return hash; +}
Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
On 17.06.21 23:21, Vivek Goyal wrote: On Wed, Jun 16, 2021 at 03:38:13PM +0200, Max Reitz wrote: On 11.06.21 22:04, Vivek Goyal wrote: On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote: Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Hi Max, What are the cases where we can not rely being able to generate file handles? I have no idea, but it’s much easier to claim we can’t than to prove that we can. I’d rather be resilient. Assuming that we can not genererate file handles all the time and hence mainitaing another inode cache seems little problematic to me. How so? I would rather start with that we can generate file handles and have a single inode cache. The assumption that we can generate file handles all the time is a much stronger one (and one which needs to be proven) than assuming that failure is possible. Also, it is still a single inode cache. I'm just adding a third key. (The two existing keys are lo_key (through lo->inodes) and fuse_ino_t (through lo->ino_map).) Therefore, we still enter every lo_inode object into inodes_by_ids, but having an entry in inodes_by_handle is optional. A potential inodes_by_handle entry then has precedence, the inodes_by_ids entry is just a fallback. If we have to keep inodes_by_ids around, then can we just add fhandle to the lo_key. That way we can manage with single hash table and still be able to detect if inode ID has been reused. We cannot, because I assume we cannot rely on name_to_handle_at() working every time. I guess either we need concrete information that we can't generate file handle every time or we should assume we can until we are proven wrong. And then fix it accordingly, IMHO. I don’t know why we need this other than because it would simplify the code. Under the assumption that for a specific file we can either generate file handles all the time or never, the code as it is will behave correct. It’s just a bit more complicated than it would need to be, but I don’t find the diffstat of +64/-16 to be indicative of something that’s really bad. Therefore, maybe at one point we can generate a file handle, and at another, we cannot – we should still be able to look up the inode regardless. If the file handle were part of inodes_by_ids, then we can look up inodes only if we can generate a file handle either every time (for a given inode) or never. Right. And is there a reason to belive that for the same file we can sometimes generate file handles and other times not. Looking into name_to_handle_at()’s man page, there is no error listed that I could imagine happening only sometimes. But it doesn’t give an explicit guarantee that it will either always succeed or fail for a given inode. Looking into the kernel, I can see that most filesystems only fail .encode_fh() if the buffer is too small. Overlayfs can also fail with ENOMEM (which will be translated to EOVERFLOW along the way, so so much for name_to_handle_at()’s list of error conditions), and EIO on conditions I don’t understand well enough (again, will become EOVERFLOW for the user). You’re probably right that at least in practice we don’t need to accommodate for name_to_handle_at() sometimes working for some inode and sometimes not. But I feel quite uneasy relying on this being the case, and being the case forever, when I find it quite simple to just have some added complexity to deal with it. It’s just a third key for our inode cache. If you want to, I can write a 10/9 patch that simplifies the code under the assumption that name_to_handle_at() will either fail or not, but frankly I wouldn’t want to have my name under it. (Which is wh
[PATCH 5/6] block/nbd: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz --- block/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/nbd.c b/block/nbd.c index 616f9ae6c4..930bd234de 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status( .type = NBD_CMD_BLOCK_STATUS, .from = offset, .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment), - MIN(bytes, s->info.size - offset)), + s->info.size - offset), .flags = NBD_CMD_FLAG_REQ_ONE, }; -- 2.31.1
[PATCH 4/6] block/gluster: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz --- block/gluster.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index e8ee14c8e9..8ef7bb18d5 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1461,7 +1461,8 @@ exit: * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. * * (Based on raw_co_block_status() from file-posix.c.) */ @@ -1500,12 +1501,12 @@ static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(bytes, hole - offset); +*pnum = hole - offset; ret = BDRV_BLOCK_DATA; } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); -*pnum = MIN(bytes, data - offset); +*pnum = data - offset; ret = BDRV_BLOCK_ZERO; } -- 2.31.1
[PATCH 3/6] block/file-posix: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz --- block/file-posix.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b3fbb9bd63..aeb370d5bb 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t start, * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'bytes' is the max value 'pnum' should be set to. + * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may + * well exceed it. */ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, bool want_zero, @@ -2727,7 +2728,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(bytes, hole - offset); +*pnum = hole - offset; /* * We are not allowed to return partial sectors, though, so @@ -2746,7 +2747,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); -*pnum = MIN(bytes, data - offset); +*pnum = data - offset; ret = BDRV_BLOCK_ZERO; } *map = offset; -- 2.31.1
[PATCH 2/6] block: block-status cache for data regions
As we have attempted before (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), this patch seeks to reduce the number of SEEK_DATA/HOLE operations the file-posix driver has to perform. The main difference is that this time it is implemented as part of the general block layer code. The problem we face is that on some filesystems or in some circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the implementation is outside of qemu, there is little we can do about its performance. We have already introduced the want_zero parameter to bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls unless we really want zero information; but sometimes we do want that information, because for files that consist largely of zero areas, special-casing those areas can give large performance boosts. So the real problem is with files that consist largely of data, so that inquiring the block status does not gain us much performance, but where such an inquiry itself takes a lot of time. To address this, we want to cache data regions. Most of the time, when bad performance is reported, it is in places where the image is iterated over from start to end (qemu-img convert or the mirror job), so a simple yet effective solution is to cache only the current data region. (Note that only caching data regions but not zero regions means that returning false information from the cache is not catastrophic: Treating zeroes as data is fine. While we try to invalidate the cache on zero writes and discards, such incongruences may still occur when there are other processes writing to the image.) We only use the cache for nodes without children (i.e. protocol nodes), because that is where the problem is: Drivers that rely on block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE). Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307 Signed-off-by: Max Reitz --- include/block/block_int.h | 19 ++ block.c | 2 + block/io.c| 80 +-- 3 files changed, 98 insertions(+), 3 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a8f9598102..c09512556a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,6 +832,23 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; +/* + * Allows bdrv_co_block_status() to cache one data region for a + * protocol node. + * + * @lock: Lock for accessing this object's fields + * @valid: Whether the cache is valid + * @data_start: Offset where we know (or strongly assume) is data + * @data_end: Offset where the data region ends (which is not necessarily + *the start of a zeroed region) + */ +typedef struct BdrvBlockStatusCache { +CoMutex lock; +bool valid; +int64_t data_start; +int64_t data_end; +} BdrvBlockStatusCache; + struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... @@ -997,6 +1014,8 @@ struct BlockDriverState { /* BdrvChild links to this node may never be frozen */ bool never_freeze; + +BdrvBlockStatusCache block_status_cache; }; struct BlockBackendRootState { diff --git a/block.c b/block.c index 3f456892d0..bad64d5c4f 100644 --- a/block.c +++ b/block.c @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void) qemu_co_queue_init(&bs->flush_queue); +qemu_co_mutex_init(&bs->block_status_cache.lock); + for (i = 0; i < bdrv_drain_all_count; i++) { bdrv_drained_begin(bs); } diff --git a/block/io.c b/block/io.c index 323854d063..320638cc48 100644 --- a/block/io.c +++ b/block/io.c @@ -35,6 +35,7 @@ #include "qapi/error.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" +#include "qemu/range.h" #include "sysemu/replay.h" /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */ @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, bool need_flush = false; int head = 0; int tail = 0; +BdrvBlockStatusCache *bsc = &bs->block_status_cache; int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX); int alignment = MAX(bs->bl.pwrite_zeroes_alignment, @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, return -ENOTSUP; } +/* Invalidate the cached block-status data range if this write overlaps */ +qemu_co_mutex_lock(&bsc->lock); +if (bsc->valid && +ranges_overlap(offset, bytes, bsc->data_start, + bsc->data_end - bsc->data_start)) +{ +bsc->va
[PATCH 6/6] block/iscsi: Do not force-cap *pnum
bdrv_co_block_status() does it for us, we do not need to do it here. The advantage of not capping *pnum is that bdrv_co_block_status() can cache larger data regions than requested by its caller. Signed-off-by: Max Reitz --- block/iscsi.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 4d2a416ce7..852384086b 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -781,9 +781,6 @@ retry: iscsi_allocmap_set_allocated(iscsilun, offset, *pnum); } -if (*pnum > bytes) { -*pnum = bytes; -} out_unlock: qemu_mutex_unlock(&iscsilun->mutex); g_free(iTask.err_str); -- 2.31.1
[PATCH 0/6] block: block-status cache for data regions
Hi, We’ve already had two attempts at introducing a block-status cache for data regions (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html, "file-posix: Cache lseek result for data regions"; https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html, "file-posix: Cache next hole"), both of which were local to file-posix. I have a gitlab issue assigned to me (https://gitlab.com/qemu-project/qemu/-/issues/307), so I suppose it’s time for round three. This series tries to address what I gathered from the reviews for both: (1) We should try to have this cache in the general block layer for all protocol drivers; (2) Just to be sure, we should have a mutex for it; (3) External writers don’t matter if we just cache data regions, because (for a protocol node) reporting a zero region as data isn’t fatal. Patch 1 is clean-up, patch 2 is the main one, patches 3 to 6 make it more useful: Some protocol drivers force-clamp the returned *pnum based on the @bytes parameter, but that’s completely unnecessary, because bdrv_co_block_status() will clamp the value, too. It’s beneficial to return as large a *pnum value as possible to bdrv_co_block_status(), so our cache can be more effective. Max Reitz (6): block: Drop BDS comment regarding bdrv_append() block: block-status cache for data regions block/file-posix: Do not force-cap *pnum block/gluster: Do not force-cap *pnum block/nbd: Do not force-cap *pnum block/iscsi: Do not force-cap *pnum include/block/block_int.h | 21 -- block.c | 2 + block/file-posix.c| 7 ++-- block/gluster.c | 7 ++-- block/io.c| 80 +-- block/iscsi.c | 3 -- block/nbd.c | 2 +- 7 files changed, 105 insertions(+), 17 deletions(-) -- 2.31.1
[PATCH 1/6] block: Drop BDS comment regarding bdrv_append()
There is a comment above the BDS definition stating care must be taken to consider handling newly added fields in bdrv_append(). Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac (nine years ago), and in any case, bdrv_swap() was dropped in 8e419aefa (six years ago). So no such care is necessary anymore. Signed-off-by: Max Reitz --- include/block/block_int.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index 057d88b1fc..a8f9598102 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -832,12 +832,6 @@ struct BdrvChild { QLIST_ENTRY(BdrvChild) next_parent; }; -/* - * Note: the function bdrv_append() copies and swaps contents of - * BlockDriverStates, so if you add new fields to this struct, please - * inspect bdrv_append() to determine if the new fields need to be - * copied as well. - */ struct BlockDriverState { /* Protected by big QEMU lock or read-only after opening. No special * locking needed during I/O... -- 2.31.1
Re: [PATCH v3 3/7] block: add max_hw_transfer to BlockLimits
On 16.06.21 15:18, Paolo Bonzini wrote: On 15/06/21 18:18, Max Reitz wrote: } +/* Returns the maximum hardware transfer length, in bytes; guaranteed nonzero */ +uint64_t blk_get_max_hw_transfer(BlockBackend *blk) +{ + BlockDriverState *bs = blk_bs(blk); + uint64_t max = INT_MAX; + + if (bs) { + max = MIN_NON_ZERO(bs->bl.max_hw_transfer, bs->bl.max_transfer); + } + return max; Both `max_hw_transfer` and `max_transfer` can be 0, so this could return 0, contrary to what the comment above promises. Should `max` be initialized to 0 with a `MIN_NON_ZERO(max, INT_MAX)` here (like `blk_get_max_transfer()` does it)? Yes, something to that effect. (As for the rest, I think aligning to the request alignment makes sense, but then again we don’t do that for max_transfer either, so... this at least wouldn’t be a new bug. Ok, will do. I will also add a new patch to align max_transfer to the request alignment. Regarding the comment, checkpatch complains about it, so it should be fixed so that /* is on its own line. That makes it different from every other comment in block_int.h though. Is it okay to fix all of them in a follow-up? The reason it’s different is that the comment style in question was added to checkpatch only relatively recently. I can’t speak for others, but I’m a simple person. I just do what makes checkpatch happy. :) Given that the checkpatch complaint is only a warning, I think it’s OK to keep the comment as it is here, and perhaps optionally fix all comments in block_int.h in a follow-up. I don’t think we need to fix existing comments, but, well, it wouldn’t be wrong. Max
Re: [PATCH v2 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs
On 11.06.21 21:19, Vivek Goyal wrote: On Wed, Jun 09, 2021 at 05:55:42PM +0200, Max Reitz wrote: Hi, v1 cover letter for an overview: https://listman.redhat.com/archives/virtio-fs/2021-June/msg00033.html Hi Max, What's the impact of these patches on performance? Just trying to get some idea what to expect. Performance remains more or less same or we expect a hit. I definitely expect a hit if you just benchmark the right way. Having to open FDs all the time (for metadata operations) will have an impact (when you do lookups all the time, or open files and close them immediately). How much of an impact, that’s completely up to the filesystem’s open_by_handle_at() implementation, I presume. I don’t expect it to be significant for real-world use cases, though. When really doing I/O, there should be absolutely no difference, because then we’re operating with these FUSE handles that have actual (non-path) FDs attached to them. Max
Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
On 11.06.21 22:04, Vivek Goyal wrote: On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote: Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Hi Max, What are the cases where we can not rely being able to generate file handles? I have no idea, but it’s much easier to claim we can’t than to prove that we can. I’d rather be resilient. Therefore, we still enter every lo_inode object into inodes_by_ids, but having an entry in inodes_by_handle is optional. A potential inodes_by_handle entry then has precedence, the inodes_by_ids entry is just a fallback. If we have to keep inodes_by_ids around, then can we just add fhandle to the lo_key. That way we can manage with single hash table and still be able to detect if inode ID has been reused. We cannot, because I assume we cannot rely on name_to_handle_at() working every time. Therefore, maybe at one point we can generate a file handle, and at another, we cannot – we should still be able to look up the inode regardless. If the file handle were part of inodes_by_ids, then we can look up inodes only if we can generate a file handle either every time (for a given inode) or never. Or, well, I suppose we could always create two entries, one with the file handles zeroed out, and one with the file handle specified, but I wouldn’t find that very beautiful. Max Note that we do not generate lo_fhandle objects yet, and so we also do not enter anything into the inodes_by_handle map yet. Also, all lookups skip that map. We might manually create file handles with some code that is immediately removed by the next patch again, but that would break the assumption in lo_find() that every lo_inode with a non-NULL .fhandle must have an entry in inodes_by_handle and vice versa. So we leave actually using the inodes_by_handle map for the next patch. [1] If some application in the guest still has the file open, there is going to be a corresponding FD mapping in lo_data.fd_map. In such a case, the inode will only go away once every application in the guest has closed it. The problem described only applies to cases where the guest does not have the file open, and it is just in the dentry cache, basically. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 80 +--- 1 file changed, 64 insertions(+), 16 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index e665575401..793d2c333e 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -179,7 +179,8 @@ struct lo_data { int announce_submounts; bool use_statx; struct lo_inode root; -GHashTable *inodes; /* protected by lo->mutex */ +GHashTable *inodes_by_ids; /* protected by lo->mutex */ +GHashTable *inodes_by_handle; /* protected by lo->mutex */ struct lo_map ino_map; /* protected by lo->mutex */ struct lo_map dirp_map; /* protected by lo->mutex */ struct lo_map fd_map; /* protected by lo->mutex */ @@ -257,8 +258,9 @@ static struct { /* That we loaded cap-ng in the current thread from the saved */ static __thread bool cap_loaded = 0; -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st, -uint64_t mnt_id); +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id); static int xattr_map_client(const struct lo_data *lo, const char *client_name, char **out_name); @@ -1032,18 +1034,39 @@ out_er
Re: [PATCH v3 7/7] block: detect DKIOCGETBLOCKCOUNT/SIZE before use
On 03.06.21 15:37, Paolo Bonzini wrote: From: Joelle van Dyne iOS hosts do not have these defined so we fallback to the default behaviour. Co-authored-by: Warner Losh Reviewed-by: Peter Maydell Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block/file-posix.c | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 5821e1afed..4e2f7cf508 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2322,8 +2322,11 @@ static int64_t raw_getlength(BlockDriverState *bs) again: #endif if (!fstat(fd, &sb) && (S_IFCHR & sb.st_mode)) { +size = 0; #ifdef DIOCGMEDIASIZE -if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) +if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) { Pre-existing, but I feel compelled to express my unease about this cast. +size = 0; +} #elif defined(DIOCGPART) { struct partinfo pi; @@ -2332,9 +2335,7 @@ again: else size = 0; } -if (size == 0) -#endif -#if defined(__APPLE__) && defined(__MACH__) +#elif defined(DKIOCGETBLOCKCOUNT) && defined(DKIOCGETBLOCKSIZE) As far a I can tell, before this patch, if the DIOCGMEDIASIZE ioctl failed, we fell back to this DKIOCGETBLOCKCOUNT/SIZE block (if compiled in). Now this is an #elif and so will not be used if DIOCGMEDIASIZE was defined. Is that intentional? This may be fine, and apart from that, this patch looks good to me, but this change in behavior wasn’t mentioned in the commit message, hence me asking. { uint64_t sectors = 0; uint32_t sector_size = 0; @@ -2342,19 +2343,15 @@ again: if (ioctl(fd, DKIOCGETBLOCKCOUNT, §ors) == 0 && ioctl(fd, DKIOCGETBLOCKSIZE, §or_size) == 0) { size = sectors * sector_size; -} else { -size = lseek(fd, 0LL, SEEK_END); -if (size < 0) { -return -errno; -} } } -#else -size = lseek(fd, 0LL, SEEK_END); +#endif +if (size == 0) { +size = lseek(fd, 0LL, SEEK_END); +} if (size < 0) { return -errno; } -#endif #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) switch(s->type) { case FTYPE_CD:
Re: [PATCH v3 6/7] block: check for sys/disk.h
On 03.06.21 15:37, Paolo Bonzini wrote: From: Joelle van Dyne Some BSD platforms do not have this header. Reviewed-by: Peter Maydell Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block.c | 2 +- meson.build | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) Reviewed-by: Max Reitz
Re: [PATCH v3 5/7] block: feature detection for host block support
On 03.06.21 15:37, Paolo Bonzini wrote: From: Joelle van Dyne On Darwin (iOS), there are no system level APIs for directly accessing host block devices. We detect this at configure time. Signed-off-by: Joelle van Dyne Message-Id: <20210315180341.31638-...@getutm.app> Signed-off-by: Paolo Bonzini --- block/file-posix.c | 33 ++--- meson.build | 6 +- qapi/block-core.json | 10 +++--- 3 files changed, 34 insertions(+), 15 deletions(-) [...] diff --git a/qapi/block-core.json b/qapi/block-core.json index 2ea294129e..2dd41be156 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -897,7 +897,8 @@ 'discriminator': 'driver', 'data': { 'file': 'BlockStatsSpecificFile', - 'host_device': 'BlockStatsSpecificFile', + 'host_device': { 'type': 'BlockStatsSpecificFile', + 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, 'nvme': 'BlockStatsSpecificNvme' } } ## @@ -2814,7 +2815,9 @@ { 'enum': 'BlockdevDriver', 'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs', 'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', -'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi', +'gluster', 'host_cdrom', +{'name': 'host_device', 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, Shouldn’t this be done for host_cdrom, too? (and below) Apart from that, looks good to me. +'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' }, @@ -3996,7 +3999,8 @@ 'ftps': 'BlockdevOptionsCurlFtps', 'gluster':'BlockdevOptionsGluster', 'host_cdrom': 'BlockdevOptionsFile', - 'host_device':'BlockdevOptionsFile', + 'host_device': { 'type': 'BlockdevOptionsFile', + 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' }, 'http': 'BlockdevOptionsCurlHttp', 'https': 'BlockdevOptionsCurlHttps', 'iscsi': 'BlockdevOptionsIscsi',
Re: [PATCH v3 4/7] file-posix: try BLKSECTGET on block devices too, do not round to power of 2
On 03.06.21 15:37, Paolo Bonzini wrote: bs->sg is only true for character devices, but block devices can also be used with scsi-block and scsi-generic. Unfortunately BLKSECTGET returns bytes in an int for /dev/sgN devices, and sectors in a short for block devices, so account for that in the code. The maximum transfer also need not be a power of 2 (for example I have seen disks with 1280 KiB maximum transfer) so there's no need to pass the result through pow2floor. This patch reintroduces the logic that was removed in commit 867eccfed8 ("file-posix: Use max transfer length/segment count only for SCSI passthrough", 2019-07-12). The removal was motivated by the performance regression when using a block device's max_transfer with kernel I/O; the new, separate max_hw_transfer limit avoids reintroducing the same regression. (And the fact that the result of hdev_get_max_segments() is no longer used to cap max_transfer, but is just stored in max_iov instead.) Signed-off-by: Paolo Bonzini --- block/file-posix.c | 40 ++-- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index f55f92d0f5..1439293f63 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1149,22 +1149,27 @@ static void raw_reopen_abort(BDRVReopenState *state) s->reopen_state = NULL; } -static int sg_get_max_transfer_length(int fd) +static int hdev_get_max_hw_transfer(int fd, struct stat *st) { #ifdef BLKSECTGET -int max_bytes = 0; - -if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) { -return max_bytes; +if (S_ISBLK(st->st_mode)) { +unsigned short max_sectors = 0; +if (ioctl(fd, BLKSECTGET, &max_sectors) == 0) { +return max_sectors * 512; +} } else { -return -errno; +int max_bytes = 0; +if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) { +return max_bytes; +} } +return -errno; #else return -ENOSYS; #endif } -static int sg_get_max_segments(int fd) +static int hdev_get_max_segments(int fd, struct stat *st) OK, now I see why in patch 1 you treated `st` as a pointer. Still, needs to be `st.` in patch 1, and changed to `st->` in this patch only. { #ifdef CONFIG_LINUX char buf[32]; @@ -1173,12 +1178,6 @@ static int sg_get_max_segments(int fd) int ret; int sysfd = -1; long max_segments; -struct stat st; - -if (fstat(fd, &st)) { -ret = -errno; -goto out; -} if (S_ISCHR(st->st_mode)) { if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { @@ -1192,7 +1191,7 @@ static int sg_get_max_segments(int fd) } sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments", -major(st.st_rdev), minor(st.st_rdev)); +major(st->st_rdev), minor(st->st_rdev)); sysfd = open(sysfspath, O_RDONLY); if (sysfd == -1) { ret = -errno; @@ -1229,15 +1228,20 @@ out: static void raw_refresh_limits(BlockDriverState *bs, Error **errp) { BDRVRawState *s = bs->opaque; +struct stat st; + +if (fstat(s->fd, &st)) { +return; +} First, I think if we ignore an error, there should be a comment explaining why that’s OK. It is OK here because inquiring transfer limits is best-effort. However, the alignment probes below the following block are completely independent of `st`. Therefore, I don’t think we should skip them if `fstat()` failed here; only the `if (bs->sg || ...)` block should be skipped. Max -if (bs->sg) { -int ret = sg_get_max_transfer_length(s->fd); +if (bs->sg || S_ISBLK(st.st_mode)) { +int ret = hdev_get_max_hw_transfer(s->fd, &st); if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) { -bs->bl.max_hw_transfer = pow2floor(ret); +bs->bl.max_hw_transfer = ret; } -ret = sg_get_max_segments(s->fd); +ret = hdev_get_max_segments(s->fd, &st); if (ret > 0) { bs->bl.max_iov = ret; }
Re: [PATCH v3 2/7] scsi-generic: pass max_segments via max_iov field in BlockLimits
On 03.06.21 15:37, Paolo Bonzini wrote: I/O to a disk via read/write is not limited by the number of segments allowed by the host adapter; the kernel can split requests if needed, and the limit imposed by the host adapter can be very low (256k or so) to avoid that SG_IO returns EINVAL if memory is heavily fragmented. Since this value is only interesting for SG_IO-based I/O, do not include it in the max_transfer and only take it into account when patching the block limits VPD page in the scsi-generic device. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 3 +-- hw/scsi/scsi-generic.c | 6 -- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 58db526cc2..e3241a0dd3 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1239,8 +1239,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) ret = sg_get_max_segments(s->fd); if (ret > 0) { -bs->bl.max_transfer = MIN(bs->bl.max_transfer, - ret * qemu_real_host_page_size); +bs->bl.max_iov = ret; } } diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c index 98c30c5d5c..82e1e2ee79 100644 --- a/hw/scsi/scsi-generic.c +++ b/hw/scsi/scsi-generic.c @@ -179,10 +179,12 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s) (r->req.cmd.buf[1] & 0x01)) { page = r->req.cmd.buf[2]; if (page == 0xb0) { -uint32_t max_transfer = -blk_get_max_transfer(s->conf.blk) / s->blocksize; +uint32_t max_transfer = blk_get_max_transfer(s->conf.blk); +uint32_t max_iov = blk_get_max_iov(s->conf.blk); assert(max_transfer); +max_transfer = MIN_NON_ZERO(max_transfer, max_iov * qemu_real_host_page_size) +/ s->blocksize; Now that I ran checkpatch for patch 3, I saw that it complains about this line being longer than 80 characters. I think it could be split so it doesn’t exceed that limit. It looks a bit like you intentionally exceeded the warning limit, but didn’t exceed the error limit (of 90). Is that so? Max
Re: [PATCH v3 3/7] block: add max_hw_transfer to BlockLimits
On 03.06.21 15:37, Paolo Bonzini wrote: For block host devices, I/O can happen through either the kernel file descriptor I/O system calls (preadv/pwritev, io_submit, io_uring) or the SCSI passthrough ioctl SG_IO. In the latter case, the size of each transfer can be limited by the HBA, while for file descriptor I/O the kernel is able to split and merge I/O in smaller pieces as needed. Applying the HBA limits to file descriptor I/O results in more system calls and suboptimal performance, so this patch splits the max_transfer limit in two: max_transfer remains valid and is used in general, while max_hw_transfer is limited to the maximum hardware size. max_hw_transfer can then be included by the scsi-generic driver in the block limits page, to ensure that the stricter hardware limit is used. Signed-off-by: Paolo Bonzini --- block/block-backend.c | 12 block/file-posix.c | 2 +- block/io.c | 1 + hw/scsi/scsi-generic.c | 2 +- include/block/block_int.h | 7 +++ include/sysemu/block-backend.h | 1 + 6 files changed, 23 insertions(+), 2 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 15f1ea4288..2ea1412a54 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1953,6 +1953,18 @@ uint32_t blk_get_request_alignment(BlockBackend *blk) return bs ? bs->bl.request_alignment : BDRV_SECTOR_SIZE; } +/* Returns the maximum hardware transfer length, in bytes; guaranteed nonzero */ +uint64_t blk_get_max_hw_transfer(BlockBackend *blk) +{ +BlockDriverState *bs = blk_bs(blk); +uint64_t max = INT_MAX; + +if (bs) { +max = MIN_NON_ZERO(bs->bl.max_hw_transfer, bs->bl.max_transfer); +} +return max; Both `max_hw_transfer` and `max_transfer` can be 0, so this could return 0, contrary to what the comment above promises. Should `max` be initialized to 0 with a `MIN_NON_ZERO(max, INT_MAX)` here (like `blk_get_max_transfer()` does it)? (As for the rest, I think aligning to the request alignment makes sense, but then again we don’t do that for max_transfer either, so... this at least wouldn’t be a new bug. Regarding the comment, checkpatch complains about it, so it should be fixed so that /* is on its own line. Speaking of checkpatch, now that I ran it, it also complains about the new line in bdrv_merge_limits() exceeding 80 characters, so that should be fixed, too.) Max
Re: [PATCH v3 2/7] scsi-generic: pass max_segments via max_iov field in BlockLimits
On 03.06.21 15:37, Paolo Bonzini wrote: I/O to a disk via read/write is not limited by the number of segments allowed by the host adapter; the kernel can split requests if needed, and the limit imposed by the host adapter can be very low (256k or so) to avoid that SG_IO returns EINVAL if memory is heavily fragmented. Since this value is only interesting for SG_IO-based I/O, do not include it in the max_transfer and only take it into account when patching the block limits VPD page in the scsi-generic device. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 3 +-- hw/scsi/scsi-generic.c | 6 -- 2 files changed, 5 insertions(+), 4 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v3 1/7] file-posix: fix max_iov for /dev/sg devices
On 03.06.21 15:37, Paolo Bonzini wrote: Even though it was only called for devices that have bs->sg set (which must be character devices), sg_get_max_segments looked at /sys/dev/block which only works for block devices. On Linux the sg driver has its own way to provide the maximum number of iovecs in a scatter/gather list. Signed-off-by: Paolo Bonzini --- block/file-posix.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/file-posix.c b/block/file-posix.c index f37dfc10b3..58db526cc2 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1180,6 +1180,17 @@ static int sg_get_max_segments(int fd) goto out; } +if (S_ISCHR(st->st_mode)) { +if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { +return ret; +} +return -EIO; +} + +if (!S_ISBLK(st->st_mode)) { +return -ENOTSUP; +} + With %s/->/./: Reviewed-by: Max Reitz (To answer Vladimir’s question, I don’t believe the condition should be bs->sg, because bs->sg == true is a given for this function anyway. Therefore, there’s no need to check whether the char device is an SG device.)
[PATCH 3/4] export/fuse: Let permissions be adjustable
Allow changing the file mode, UID, and GID through SETATTR. This only really makes sense with allow-other, though (because without it, the effective access mode is fixed to be 0600 (u+rw) with qemu's user being the file's owner), so changing these stat fields is not allowed without allow-other. Signed-off-by: Max Reitz --- block/export/fuse.c | 48 ++--- 1 file changed, 37 insertions(+), 11 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 1d54286d90..742e0af657 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -47,6 +47,10 @@ typedef struct FuseExport { bool writable; bool growable; bool allow_other; + +mode_t st_mode; +uid_t st_uid; +gid_t st_gid; } FuseExport; static GHashTable *exports; @@ -120,6 +124,13 @@ static int fuse_export_create(BlockExport *blk_exp, exp->growable = args->growable; exp->allow_other = args->allow_other; +exp->st_mode = S_IFREG | S_IRUSR; +if (exp->writable) { +exp->st_mode |= S_IWUSR; +} +exp->st_uid = getuid(); +exp->st_gid = getgid(); + ret = setup_fuse_export(exp, args->mountpoint, errp); if (ret < 0) { goto fail; @@ -329,7 +340,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode, int64_t length, allocated_blocks; time_t now = time(NULL); FuseExport *exp = fuse_req_userdata(req); -mode_t mode; length = blk_getlength(exp->common.blk); if (length < 0) { @@ -344,17 +354,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode, allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512); } -mode = S_IFREG | S_IRUSR; -if (exp->writable) { -mode |= S_IWUSR; -} - statbuf = (struct stat) { .st_ino = inode, -.st_mode= mode, +.st_mode= exp->st_mode, .st_nlink = 1, -.st_uid = getuid(), -.st_gid = getgid(), +.st_uid = exp->st_uid, +.st_gid = exp->st_gid, .st_size= length, .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment, .st_blocks = allocated_blocks, @@ -400,15 +405,23 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size, } /** - * Let clients set file attributes. Only resizing is supported. + * Let clients set file attributes. With allow_other, only resizing and + * changing permissions (st_mode, st_uid, st_gid) is allowed. Without + * allow_other, only resizing is supported. */ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf, int to_set, struct fuse_file_info *fi) { FuseExport *exp = fuse_req_userdata(req); +int supported_attrs; int ret; -if (to_set & ~FUSE_SET_ATTR_SIZE) { +supported_attrs = FUSE_SET_ATTR_SIZE; +if (exp->allow_other) { +supported_attrs |= FUSE_SET_ATTR_MODE | FUSE_SET_ATTR_UID | +FUSE_SET_ATTR_GID; +} +if (to_set & ~supported_attrs) { fuse_reply_err(req, ENOTSUP); return; } @@ -426,6 +439,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf, } } +if (to_set & FUSE_SET_ATTR_MODE) { +/* Only allow changing the file mode, not the type */ +exp->st_mode = (statbuf->st_mode & 0) | S_IFREG; +} + +if (to_set & FUSE_SET_ATTR_UID) { +exp->st_uid = statbuf->st_uid; +} + +if (to_set & FUSE_SET_ATTR_GID) { +exp->st_gid = statbuf->st_gid; +} + fuse_getattr(req, inode, fi); } -- 2.31.1
[PATCH 1/4] export/fuse: Add allow-other option
Without the allow_other mount option, no user (not even root) but the one who started qemu/the storage daemon can access the export. Allow users to configure the export such that such accesses are possible. When we do pass allow_other, we should also pass default_permissions, because our export code performs no permission checks. With default_permissions, we can just let the kernel do it. Signed-off-by: Max Reitz --- qapi/block-export.json | 11 ++- block/export/fuse.c| 17 +++-- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/qapi/block-export.json b/qapi/block-export.json index e819e70cac..5d1cc04ac4 100644 --- a/qapi/block-export.json +++ b/qapi/block-export.json @@ -132,11 +132,20 @@ # @growable: Whether writes beyond the EOF should grow the block node #accordingly. (default: false) # +# @allow-other: By default (if this is false), only qemu's user is allowed +# access to this export. That cannot be changed even with +# chmod or chown. +# This option will allow other users access to the export with the +# FUSE mount options "allow_other,default_permissions". +# (default_permissions enables standard UNIX permission checks.) +# (since 6.1; default: false) +# # Since: 6.0 ## { 'struct': 'BlockExportOptionsFuse', 'data': { 'mountpoint': 'str', -'*growable': 'bool' }, +'*growable': 'bool', +'*allow-other': 'bool' }, 'if': 'defined(CONFIG_FUSE)' } ## diff --git a/block/export/fuse.c b/block/export/fuse.c index 38f74c94da..34a5836ece 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -46,6 +46,7 @@ typedef struct FuseExport { char *mountpoint; bool writable; bool growable; +bool allow_other; } FuseExport; static GHashTable *exports; @@ -117,6 +118,7 @@ static int fuse_export_create(BlockExport *blk_exp, exp->mountpoint = g_strdup(args->mountpoint); exp->writable = blk_exp_args->writable; exp->growable = args->growable; +exp->allow_other = args->allow_other; ret = setup_fuse_export(exp, args->mountpoint, errp); if (ret < 0) { @@ -150,11 +152,22 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint, { const char *fuse_argv[4]; char *mount_opts; +const char *allow_other; struct fuse_args fuse_args; int ret; -/* Needs to match what fuse_init() sets. Only max_read must be supplied. */ -mount_opts = g_strdup_printf("max_read=%zu", FUSE_MAX_BOUNCE_BYTES); +if (exp->allow_other) { +allow_other = ",allow_other,default_permissions"; +} else { +allow_other = ""; +} + +/* + * max_read needs to match what fuse_init() sets. + * max_write need not be supplied. + */ +mount_opts = g_strdup_printf("max_read=%zu%s", FUSE_MAX_BOUNCE_BYTES, + allow_other); fuse_argv[0] = ""; /* Dummy program name */ fuse_argv[1] = "-o"; -- 2.31.1
[PATCH 0/4] export/fuse: Allow other users access to the export
Hi, With the default mount options, FUSE mounts are not accessible to any users but the one who did the mount, not even to root. To allow such accesses, allow_other must be passed. This is probably useful to some people (it certainly is to me, e.g. when exporting some image as my normal user, and then trying to loop mount it as root), so this series adds a QAPI allow-other bool that will make the FUSE export code pass allow_other,default_permissions to FUSE. (default_permissions will make the kernel do the usual UNIX permission checks, which is something that makes a lot of sense when allowing other users access to the export.) This also requires our SETATTR code to be able to handle permission changes, though, so the user can then run chmod/chown/chgrp on the export to adjust its permissions to their need. The final patch adds a test. Max Reitz (4): export/fuse: Add allow-other option export/fuse: Give SET_ATTR_SIZE its own branch export/fuse: Let permissions be adjustable iotests/308: Test allow-other qapi/block-export.json | 11 - block/export/fuse.c| 83 +- tests/qemu-iotests/308 | 91 ++ tests/qemu-iotests/308.out | 47 4 files changed, 210 insertions(+), 22 deletions(-) -- 2.31.1
[PATCH 4/4] iotests/308: Test allow-other
We cannot reasonably test the main point of allow-other, which is to allow users other than the current one to access the FUSE export, because that would require access to sudo, which this test most likely will not have. (Also, we would need to figure out some user/group that is on the machine and that is not the current user/group, which may become a bit hairy.) But we can test some byproducts: First, whether changing permissions works (our FUSE code only allows so for allow-other=true), and second, whether the kernel applies permission checks with allow-other=true (because that implies default_permissions). Signed-off-by: Max Reitz --- tests/qemu-iotests/308 | 91 ++ tests/qemu-iotests/308.out | 47 2 files changed, 138 insertions(+) diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308 index f122065d0f..1b2f908947 100755 --- a/tests/qemu-iotests/308 +++ b/tests/qemu-iotests/308 @@ -334,6 +334,97 @@ echo '=== Compare copy with original ===' $QEMU_IMG compare -f raw -F $IMGFMT "$COPIED_IMG" "$TEST_IMG" +echo +echo '=== Test permissions ===' + +# Test that you can only change permissions on the export with allow-other=true. +# We cannot really test the primary reason behind allow-other (i.e. to allow +# users other than the current one access to the export), because for that we +# would need sudo, which realistically nobody will allow this test to use. +# What we can do is test that allow-other=true also enables default_permissions, +# i.e. whether we can still read from the file if we remove the read permission. + +# $1: allow-other value ('true' or 'false') +run_permission_test() +{ +# Below, we want to test permission checks on the export. To do that +# properly, we need to ensure that those checks are not bypassed, so we have +# to drop cap_dac_override. +# (Note that cap_dac_read_search bypasses read permission checks, which is +# why we try to open the file R/W below, so we do not have to care about +# cap_dac_read_search here.) +# $capsh is effectively a shell command, i.e. can be used like: +# $capsh -c "$command $args..." +# By default it is just bash. +capsh='/usr/bin/env bash' +if type -p capsh > /dev/null; then +if ! capsh --has-p=cap_dac_override 2>/dev/null; then +# No cap_dac_override, we are good to go +true # pass +elif capsh --has-p=cap_setpcap 2>/dev/null; then +# We will need to drop cap_dac_override, but doing so requires +# cap_setpcap in turn +capsh='capsh --drop=cap_dac_override --' +else +# Hopefully a rare case +_notrun 'CAP_DAC_OVERRIDE must be dropped, but this cannot be' \ +'done without CAP_SETPCAP' +fi +elif [ "$(id -u)" -ne 0 ]; then +# Non-root users probably do not have those capabilities, so try to get +# by without capsh +true # pass +else +# Running this test as root without capsh on the system should be a rare +# case... +_notrun 'No capsh found while run as root' +fi + +_launch_qemu \ +-blockdev \ + "$IMGFMT,node-name=node-format,file.driver=file,file.filename=$TEST_IMG" + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'qmp_capabilities'}" \ +'return' + +# Writable so we can open it R/W below +fuse_export_add 'export' \ +"'mountpoint': '$EXT_MP', + 'writable': true, + 'allow-other': $1" + +echo '(Invoking chmod)' +chmod 666 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt +stat -c 'Permissions post-chmod: %a' "$EXT_MP" + +echo '(Removing all permissions)' +chmod 000 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt + +# We want a permission denied here (with allow-other=true) +# (Use $QEMU_IO_PROG, because $capsh will not know the wrapper that is +# $QEMU_IO) +# Use blkdebug to force-take the WRITE permission so this will definitely +# try to open the export with O_RDWR. +imgopts="driver=blkdebug,take-child-perms.0=write," +imgopts+="image.driver=file,image.filename=$EXT_MP" +$capsh -c "$QEMU_IO_PROG -c quit --image-opts '$imgopts'" 2>&1 \ +| _filter_qemu_io | _filter_testdir | _filter_imgfmt + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'quit'}" \ +'return' + +wait=yes _cleanup_qemu +} + +for ao in 'false' 'true'; do +echo +echo "--- allow-other=$ao ---
[PATCH 2/4] export/fuse: Give SET_ATTR_SIZE its own branch
In order to support changing other attributes than the file size in fuse_setattr(), we have to give each its own independent branch. This also applies to the only attribute we do support right now. Signed-off-by: Max Reitz --- block/export/fuse.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 34a5836ece..1d54286d90 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -408,20 +408,22 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf, FuseExport *exp = fuse_req_userdata(req); int ret; -if (!exp->writable) { -fuse_reply_err(req, EACCES); -return; -} - if (to_set & ~FUSE_SET_ATTR_SIZE) { fuse_reply_err(req, ENOTSUP); return; } -ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF); -if (ret < 0) { -fuse_reply_err(req, -ret); -return; +if (to_set & FUSE_SET_ATTR_SIZE) { +if (!exp->writable) { +fuse_reply_err(req, EACCES); +return; +} + +ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF); +if (ret < 0) { +fuse_reply_err(req, -ret); +return; +} } fuse_getattr(req, inode, fi); -- 2.31.1
[PATCH v2 9/9] virtiofsd: Add lazy lo_do_find()
lo_find() right now takes two lookup keys for two maps, namely the file handle for inodes_by_handle and the statx information for inodes_by_ids. However, we only need the statx information if looking up the inode by the file handle failed. There are two callers of lo_find(): The first one, lo_do_lookup(), has both keys anyway, so passing them does not incur any additional cost. The second one, lookup_name(), though, needs to explicitly invoke name_to_handle_at() (through get_file_handle()) and statx() (through do_statx()). We need to try to get a file handle as the primary key, so we cannot get rid of get_file_handle(), but we only need the statx information if looking up an inode by handle failed; so we can defer that until the lookup has indeed failed. To this end, replace lo_find()'s st/mnt_id parameters by a get_ids() closure that is invoked to fill the lo_key struct if necessary. Also, lo_find() is renamed to lo_do_find(), so we can add a new lo_find() wrapper whose closure just initializes the lo_key from the st/mnt_id parameters, just like the old lo_find() did. lookup_name() directly calls lo_do_find() now and passes its own closure, which performs the do_statx() call. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 93 ++-- 1 file changed, 76 insertions(+), 17 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 2e56c40b2f..8990fd5bd2 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -1168,22 +1168,23 @@ out_err: fuse_reply_err(req, saverr); } -static struct lo_inode *lo_find(struct lo_data *lo, -const struct lo_fhandle *fhandle, -struct stat *st, uint64_t mnt_id) +/* + * get_ids() will be called to get the key for lo->inodes_by_ids if + * the lookup by file handle has failed. + */ +static struct lo_inode *lo_do_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +int (*get_ids)(struct lo_key *, const void *), +const void *get_ids_opaque) { struct lo_inode *p = NULL; -struct lo_key ids_key = { -.ino = st->st_ino, -.dev = st->st_dev, -.mnt_id = mnt_id, -}; +struct lo_key ids_key; pthread_mutex_lock(&lo->mutex); if (fhandle) { p = g_hash_table_lookup(lo->inodes_by_handle, fhandle); } -if (!p) { +if (!p && get_ids(&ids_key, get_ids_opaque) == 0) { p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key); /* * When we had to fall back to looking up an inode by its IDs, @@ -1211,6 +1212,36 @@ static struct lo_inode *lo_find(struct lo_data *lo, return p; } +struct lo_find_get_ids_key_opaque { +const struct stat *st; +uint64_t mnt_id; +}; + +static int lo_find_get_ids_key(struct lo_key *ids_key, const void *opaque) +{ +const struct lo_find_get_ids_key_opaque *stat_info = opaque; + +*ids_key = (struct lo_key){ +.ino = stat_info->st->st_ino, +.dev = stat_info->st->st_dev, +.mnt_id = stat_info->mnt_id, +}; + +return 0; +} + +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id) +{ +const struct lo_find_get_ids_key_opaque stat_info = { +.st = st, +.mnt_id = mnt_id, +}; + +return lo_do_find(lo, fhandle, lo_find_get_ids_key, &stat_info); +} + /* value_destroy_func for posix_locks GHashTable */ static void posix_locks_value_destroy(gpointer data) { @@ -1682,14 +1713,41 @@ out_err: fuse_reply_err(req, saverr); } +struct lookup_name_get_ids_key_opaque { +struct lo_data *lo; +int parent_fd; +const char *name; +}; + +static int lookup_name_get_ids_key(struct lo_key *ids_key, const void *opaque) +{ +const struct lookup_name_get_ids_key_opaque *stat_params = opaque; +uint64_t mnt_id; +struct stat attr; +int res; + +res = do_statx(stat_params->lo, stat_params->parent_fd, stat_params->name, + &attr, AT_SYMLINK_NOFOLLOW, &mnt_id); +if (res < 0) { +return -errno; +} + +*ids_key = (struct lo_key){ +.ino = attr.st_ino, +.dev = attr.st_dev, +.mnt_id = mnt_id, +}; + +return 0; +} + /* Increments nlookup and caller must release refcount using lo_inode_put() */ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent, const char *name) { g_auto(TempFd) dir_fd = TEMP_FD_INIT; int res; -uint64_t mnt_id; -struct stat attr; +struct lookup_name_get_ids_key_opaque stat_params; struct lo_fhandle *fh; struct lo_data *lo = lo_data(req); struct lo_inode *dir = lo_inode(req, parent); @@ -1
[PATCH v2 4/9] virtiofsd: Let lo_fd() return a TempFd
Accessing lo_inode.fd must generally happen through lo_inode_fd(), and lo_fd() is no exception; and then it must pass on the TempFd it has received from lo_inode_fd(). (Note that all lo_fd() calls now use proper error handling, where all of them were in-line before; i.e. they were used in place of the fd argument of some function call. This only worked because the only error that could occur was that lo_inode() failed to find the inode ID: Then -1 would be passed as the fd, which would result in an EBADF error, which is precisely what we would want to return to the guest for an invalid inode ID. Now, though, lo_inode_fd() might potentially invoke open_by_handle_at(), which can return many different errors, and they should be properly handled and returned to the guest. So we can no longer allow lo_fd() to be used in-line, and instead need to do proper error handling for it.) Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 55 +--- 1 file changed, 44 insertions(+), 11 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 46c9dfe200..8f64bcd6c5 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -644,18 +644,19 @@ static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) * they are done with the fd. This will be done in a later patch to make * review easier. */ -static int lo_fd(fuse_req_t req, fuse_ino_t ino) +static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) { struct lo_inode *inode = lo_inode(req, ino); -int fd; +int res; if (!inode) { -return -1; +return -EBADF; } -fd = inode->fd; +res = lo_inode_fd(inode, tfd); + lo_inode_put(lo_data(req), &inode); -return fd; +return res; } /* @@ -766,14 +767,19 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn) static void lo_getattr(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi) { +g_auto(TempFd) ino_fd = TEMP_FD_INIT; int res; struct stat buf; struct lo_data *lo = lo_data(req); (void)fi; -res = -fstatat(lo_fd(req, ino), "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); +res = lo_fd(req, ino, &ino_fd); +if (res < 0) { +return (void)fuse_reply_err(req, -res); +} + +res = fstatat(ino_fd.fd, "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); if (res == -1) { return (void)fuse_reply_err(req, errno); } @@ -1441,6 +1447,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent, static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name) { +g_auto(TempFd) parent_fd = TEMP_FD_INIT; int res; struct lo_inode *inode; struct lo_data *lo = lo_data(req); @@ -1455,13 +1462,19 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name) return; } +res = lo_fd(req, parent, &parent_fd); +if (res < 0) { +fuse_reply_err(req, -res); +return; +} + inode = lookup_name(req, parent, name); if (!inode) { fuse_reply_err(req, EIO); return; } -res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR); +res = unlinkat(parent_fd.fd, name, AT_REMOVEDIR); fuse_reply_err(req, res == -1 ? errno : 0); unref_inode_lolocked(lo, inode, 1); @@ -1547,6 +1560,7 @@ out: static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name) { +g_auto(TempFd) parent_fd = TEMP_FD_INIT; int res; struct lo_inode *inode; struct lo_data *lo = lo_data(req); @@ -1561,13 +1575,19 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name) return; } +res = lo_fd(req, parent, &parent_fd); +if (res < 0) { +fuse_reply_err(req, -res); +return; +} + inode = lookup_name(req, parent, name); if (!inode) { fuse_reply_err(req, EIO); return; } -res = unlinkat(lo_fd(req, parent), name, 0); +res = unlinkat(parent_fd.fd, name, 0); fuse_reply_err(req, res == -1 ? errno : 0); unref_inode_lolocked(lo, inode, 1); @@ -1647,10 +1667,16 @@ static void lo_forget_multi(fuse_req_t req, size_t count, static void lo_readlink(fuse_req_t req, fuse_ino_t ino) { +g_auto(TempFd) ino_fd = TEMP_FD_INIT; char buf[PATH_MAX + 1]; int res; -res = readlinkat(lo_fd(req, ino), "", buf, sizeof(buf)); +res = lo_fd(req, ino, &ino_fd); +if (res < 0) { +return (void)fuse_reply_err(req, -res); +} + +res = readlinkat(ino_fd.fd, "", buf, sizeof(buf)); if (res == -1) { return (void)fuse_reply_err(req, errno); } @@ -2447,10 +2473,17 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino, static void lo_statfs(fuse_re
[PATCH v2 8/9] virtiofsd: Optionally fill lo_inode.fhandle
When the inode_file_handles option is set, try to generate a file handle for new inodes instead of opening an O_PATH FD. Being able to open these again will require CAP_DAC_READ_SEARCH, so the description text tells the user they will also need to specify -o modcaps=+dac_read_search. Generating a file handle returns the mount ID it is valid for. Opening it will require an FD instead. We have mount_fds to map an ID to an FD. get_file_handle() fills the hash map by opening the file we have generated a handle for. To verify that the resulting FD indeed represents the handle's mount ID, we use statx(). Therefore, using file handles requires statx() support. Signed-off-by: Max Reitz --- tools/virtiofsd/helper.c | 3 + tools/virtiofsd/passthrough_ll.c | 197 -- tools/virtiofsd/passthrough_seccomp.c | 1 + 3 files changed, 192 insertions(+), 9 deletions(-) diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c index 5e98ed702b..954f8639e6 100644 --- a/tools/virtiofsd/helper.c +++ b/tools/virtiofsd/helper.c @@ -186,6 +186,9 @@ void fuse_cmdline_help(void) " to virtiofsd from guest applications.\n" " default: no_allow_direct_io\n" "-o announce_submounts Announce sub-mount points to the guest\n" + "-o inode_file_handles Use file handles to reference inodes\n" + " instead of O_PATH file descriptors\n" + " (requires -o modcaps=+dac_read_search)\n" ); } diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 793d2c333e..2e56c40b2f 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -190,6 +190,7 @@ struct lo_data { /* An O_PATH file descriptor to /proc/self/fd/ */ int proc_self_fd; int user_killpriv_v2, killpriv_v2; +int inode_file_handles; }; /** @@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = { { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 }, { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 }, { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 }, +{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 }, +{ "no_inode_file_handles", + offsetof(struct lo_data, inode_file_handles), + 0 }, FUSE_OPT_END }; static bool use_syslog = false; @@ -315,6 +320,135 @@ static int temp_fd_steal(TempFd *temp_fd) } } +/** + * Generate a file handle for the given dirfd/name combination. + * + * If mount_fds does not yet contain an entry for the handle's mount + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds + * as the FD for that mount ID. (That is the file that we have + * generated a handle for, so it should be representative for the + * mount ID. However, to be sure (and to rule out races), we use + * statx() to verify that our assumption is correct.) + */ +static struct lo_fhandle *get_file_handle(struct lo_data *lo, + int dirfd, const char *name) +{ +/* We need statx() to verify the mount ID */ +#if defined(CONFIG_STATX) && defined(STATX_MNT_ID) +struct lo_fhandle *fh; +int ret; + +if (!lo->use_statx || !lo->inode_file_handles) { +return NULL; +} + +fh = g_new0(struct lo_fhandle, 1); + +fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle); +ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id, +AT_EMPTY_PATH); +if (ret < 0) { +goto fail; +} + +if (pthread_rwlock_rdlock(&mount_fds_lock)) { +goto fail; +} +if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) { +g_auto(TempFd) path_fd = TEMP_FD_INIT; +struct statx stx; +char procname[64]; +int fd; + +pthread_rwlock_unlock(&mount_fds_lock); + +/* + * Before opening an O_RDONLY fd, check whether dirfd/name is a regular + * file or directory, because we must not open anything else with + * anything but O_PATH. + * (And we use that occasion to verify that the file has the mount ID we + * need.) + */ +if (name[0]) { +path_fd.fd = openat(dirfd, name, O_PATH); +if (path_fd.fd < 0) { +goto fail; +} +path_fd.owned = true; +} else { +path_fd.fd = dirfd; +path_fd.owned = false; +} + +ret = statx(path_fd.fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, +
[PATCH v2 2/9] virtiofsd: Use lo_inode_open() instead of openat()
The xattr functions want a non-O_PATH FD, so they reopen the lo_inode.fd with the flags they need through /proc/self/fd. Similarly, lo_opendir() needs an O_RDONLY FD. Instead of the /proc/self/fd trick, it just uses openat(fd, "."), because the FD is guaranteed to be a directory, so this works. All cases have one problem in common, though: In the future, when we may have a file handle in the lo_inode instead of an FD, querying an lo_inode FD may incur an open_by_handle_at() call. It does not make sense to then reopen that FD with custom flags, those should have been passed to open_by_handle_at() instead. Use lo_inode_open() instead of openat(). As part of the file handle change, lo_inode_open() will be made to invoke openat() only if lo_inode.fd is valid. Otherwise, it will invoke open_by_handle_at() with the right flags from the start. Consequently, after this patch, lo_inode_open() is the only place to invoke openat() to reopen an existing FD with different flags. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 43 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index a4674aba80..436f771d2a 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -1645,18 +1645,26 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, { int error = ENOMEM; struct lo_data *lo = lo_data(req); -struct lo_dirp *d; +struct lo_inode *inode; +struct lo_dirp *d = NULL; int fd; ssize_t fh; +inode = lo_inode(req, ino); +if (!inode) { +error = EBADF; +goto out_err; +} + d = calloc(1, sizeof(struct lo_dirp)); if (d == NULL) { goto out_err; } -fd = openat(lo_fd(req, ino), ".", O_RDONLY); -if (fd == -1) { -goto out_errno; +fd = lo_inode_open(lo, inode, O_RDONLY); +if (fd < 0) { +error = -fd; +goto out_err; } d->dp = fdopendir(fd); @@ -1685,6 +1693,7 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, out_errno: error = errno; out_err: +lo_inode_put(lo, &inode); if (d) { if (d->dp) { closedir(d->dp); @@ -2827,7 +2836,6 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, } } -sprintf(procname, "%i", inode->fd); /* * It is not safe to open() non-regular/non-dir files in file server * unless O_PATH is used, so use that method for regular files/dir @@ -2835,12 +2843,14 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, * Otherwise, call fchdir() to avoid open(). */ if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -goto out_err; +saverr = -fd; +goto out; } ret = fgetxattr(fd, name, value, size); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = getxattr(procname, name, value, size); @@ -2906,14 +2916,15 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t ino, size_t size) } } -sprintf(procname, "%i", inode->fd); if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -goto out_err; +saverr = -fd; +goto out; } ret = flistxattr(fd, value, size); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = listxattr(procname, value, size); @@ -3039,15 +3050,15 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64 ", name=%s value=%s size=%zd)\n", ino, name, value, size); -sprintf(procname, "%i", inode->fd); if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -saverr = errno; +saverr = -fd; goto out; } ret = fsetxattr(fd, name, value, size, flags); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = setxattr(procname, name, value, size, flags); @@ -3
[PATCH v2 6/9] virtiofsd: Add lo_inode.fhandle
This new field is an alternative to lo_inode.fd: Either of the two must be set. In case an O_PATH FD is needed for some lo_inode, it is either taken from lo_inode.fd, if valid, or a temporary FD is opened with open_by_handle_at(). Using a file handle instead of an FD has the advantage of keeping the number of open file descriptors low. Because open_by_handle_at() requires a mount FD (i.e. a non-O_PATH FD opened on the filesystem to which the file handle refers), but every lo_fhandle only has a mount ID (as returned by name_to_handle_at()), we keep a hash map of such FDs in mount_fds (mapping ID to FD). get_file_handle(), which is added by a later patch, will ensure that every mount ID for which we have generated a handle has a corresponding entry in mount_fds. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 116 ++ tools/virtiofsd/passthrough_seccomp.c | 1 + 2 files changed, 102 insertions(+), 15 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 3014e8baf8..e665575401 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -88,8 +88,25 @@ struct lo_key { uint64_t mnt_id; }; +struct lo_fhandle { +union { +struct file_handle handle; +char padding[sizeof(struct file_handle) + MAX_HANDLE_SZ]; +}; +int mount_id; +}; + +/* Maps mount IDs to an FD that we can pass to open_by_handle_at() */ +static GHashTable *mount_fds; +pthread_rwlock_t mount_fds_lock = PTHREAD_RWLOCK_INITIALIZER; + struct lo_inode { +/* + * Either of fd or fhandle must be set (i.e. >= 0 or non-NULL, + * respectively). + */ int fd; +struct lo_fhandle *fhandle; /* * Atomic reference count for this object. The nlookup field holds a @@ -296,6 +313,44 @@ static int temp_fd_steal(TempFd *temp_fd) } } +/** + * Open the given file handle with the given flags. + * + * The mount FD to pass to open_by_handle_at() is taken from the + * mount_fds hash map. + * + * On error, return -errno. + */ +static int open_file_handle(const struct lo_fhandle *fh, int flags) +{ +gpointer mount_fd_ptr; +int mount_fd; +bool found; +int ret; + +ret = pthread_rwlock_rdlock(&mount_fds_lock); +if (ret) { +return -ret; +} + +/* mount_fd == 0 is valid, so we need lookup_extended */ +found = g_hash_table_lookup_extended(mount_fds, + GINT_TO_POINTER(fh->mount_id), + NULL, &mount_fd_ptr); +pthread_rwlock_unlock(&mount_fds_lock); +if (!found) { +return -EINVAL; +} +mount_fd = GPOINTER_TO_INT(mount_fd_ptr); + +ret = open_by_handle_at(mount_fd, (struct file_handle *)&fh->handle, flags); +if (ret < 0) { +return -errno; +} + +return ret; +} + /* * Load capng's state from our saved state if the current thread * hadn't previously been loaded. @@ -602,7 +657,11 @@ static void lo_inode_put(struct lo_data *lo, struct lo_inode **inodep) *inodep = NULL; if (g_atomic_int_dec_and_test(&inode->refcount)) { -close(inode->fd); +if (inode->fd >= 0) { +close(inode->fd); +} else { +g_free(inode->fhandle); +} free(inode); } } @@ -629,10 +688,25 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino) static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) { -*tfd = (TempFd) { -.fd = inode->fd, -.owned = false, -}; +if (inode->fd >= 0) { +*tfd = (TempFd) { +.fd = inode->fd, +.owned = false, +}; +} else { +int fd; + +assert(inode->fhandle != NULL); +fd = open_file_handle(inode->fhandle, O_PATH); +if (fd < 0) { +return -errno; +} + +*tfd = (TempFd) { +.fd = fd, +.owned = true, +}; +} return 0; } @@ -672,22 +746,32 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) static int lo_inode_open(const struct lo_data *lo, const struct lo_inode *inode, int open_flags, TempFd *tfd) { -g_autofree char *fd_str = g_strdup_printf("%d", inode->fd); +g_autofree char *fd_str = NULL; int fd; if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) { return -EBADF; } -/* - * The file is a symlink so O_NOFOLLOW must be ignored. We checked earlier - * that the inode is not a special file but if an external process races - * with us then symlinks are traversed here. It is not possible to escape - * the shared directory since it is mounted as "/" though. - */ -fd = openat(lo->proc_self_fd, fd_str, ope
[PATCH v2 5/9] virtiofsd: Let lo_inode_open() return a TempFd
Strictly speaking, this is not necessary, because lo_inode_open() will always return a new FD owned by the caller, so TempFd.owned will always be true. However, auto-cleanup is nice, and in some cases this plays nicely with an lo_inode_fd() call in another conditional branch (see lo_setattr()). Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 137 +-- 1 file changed, 59 insertions(+), 78 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 8f64bcd6c5..3014e8baf8 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -285,10 +285,8 @@ static void temp_fd_clear(TempFd *temp_fd) /** * Return an owned fd from *temp_fd that will not be closed when * *temp_fd goes out of scope. - * - * (TODO: Remove __attribute__ once this is used.) */ -static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd) +static int temp_fd_steal(TempFd *temp_fd) { if (temp_fd->owned) { temp_fd->owned = false; @@ -667,9 +665,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) * when a malicious client opens special files such as block device nodes. * Symlink inodes are also rejected since symlinks must already have been * traversed on the client side. + * + * The fd is returned in tfd->fd. The return value is 0 on success and -errno + * otherwise. */ -static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode, - int open_flags) +static int lo_inode_open(const struct lo_data *lo, const struct lo_inode *inode, + int open_flags, TempFd *tfd) { g_autofree char *fd_str = g_strdup_printf("%d", inode->fd); int fd; @@ -688,7 +689,13 @@ static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode, if (fd < 0) { return -errno; } -return fd; + +*tfd = (TempFd) { +.fd = fd, +.owned = true, +}; + +return 0; } static void lo_init(void *userdata, struct fuse_conn_info *conn) @@ -820,7 +827,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, return; } -res = lo_inode_fd(inode, &inode_fd); +if (!fi && (valid & FUSE_SET_ATTR_SIZE)) { +/* We need an O_RDWR FD for ftruncate() */ +res = lo_inode_open(lo, inode, O_RDWR, &inode_fd); +} else { +res = lo_inode_fd(inode, &inode_fd); +} if (res < 0) { saverr = -res; goto out_err; @@ -868,18 +880,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { truncfd = fd; } else { -truncfd = lo_inode_open(lo, inode, O_RDWR); -if (truncfd < 0) { -saverr = -truncfd; -goto out_err; -} +truncfd = inode_fd.fd; } saverr = drop_security_capability(lo, truncfd); if (saverr) { -if (!fi) { -close(truncfd); -} goto out_err; } @@ -887,9 +892,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, res = drop_effective_cap("FSETID", &cap_fsetid_dropped); if (res != 0) { saverr = res; -if (!fi) { -close(truncfd); -} goto out_err; } } @@ -902,9 +904,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, fuse_log(FUSE_LOG_ERR, "Failed to gain CAP_FSETID\n"); } } -if (!fi) { -close(truncfd); -} if (res == -1) { goto out_err; } @@ -1734,11 +1733,12 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct fuse_file_info *fi) static void lo_opendir(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi) { +g_auto(TempFd) inode_fd = TEMP_FD_INIT; int error = ENOMEM; struct lo_data *lo = lo_data(req); struct lo_inode *inode; struct lo_dirp *d = NULL; -int fd; +int res; ssize_t fh; inode = lo_inode(req, ino); @@ -1752,13 +1752,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, goto out_err; } -fd = lo_inode_open(lo, inode, O_RDONLY); -if (fd < 0) { -error = -fd; +res = lo_inode_open(lo, inode, O_RDONLY, &inode_fd); +if (res < 0) { +error = -res; goto out_err; } -d->dp = fdopendir(fd); +d->dp = fdopendir(temp_fd_steal(&inode_fd)); if (d->dp == NULL) { goto out_errno; } @@ -1788,8 +1788,6 @@ out_err: if (d) { if (d->dp) { closedir(d->dp); -} else if (fd != -1) { -close(fd)
[PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table
Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Therefore, we still enter every lo_inode object into inodes_by_ids, but having an entry in inodes_by_handle is optional. A potential inodes_by_handle entry then has precedence, the inodes_by_ids entry is just a fallback. Note that we do not generate lo_fhandle objects yet, and so we also do not enter anything into the inodes_by_handle map yet. Also, all lookups skip that map. We might manually create file handles with some code that is immediately removed by the next patch again, but that would break the assumption in lo_find() that every lo_inode with a non-NULL .fhandle must have an entry in inodes_by_handle and vice versa. So we leave actually using the inodes_by_handle map for the next patch. [1] If some application in the guest still has the file open, there is going to be a corresponding FD mapping in lo_data.fd_map. In such a case, the inode will only go away once every application in the guest has closed it. The problem described only applies to cases where the guest does not have the file open, and it is just in the dentry cache, basically. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 80 +--- 1 file changed, 64 insertions(+), 16 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index e665575401..793d2c333e 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -179,7 +179,8 @@ struct lo_data { int announce_submounts; bool use_statx; struct lo_inode root; -GHashTable *inodes; /* protected by lo->mutex */ +GHashTable *inodes_by_ids; /* protected by lo->mutex */ +GHashTable *inodes_by_handle; /* protected by lo->mutex */ struct lo_map ino_map; /* protected by lo->mutex */ struct lo_map dirp_map; /* protected by lo->mutex */ struct lo_map fd_map; /* protected by lo->mutex */ @@ -257,8 +258,9 @@ static struct { /* That we loaded cap-ng in the current thread from the saved */ static __thread bool cap_loaded = 0; -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st, -uint64_t mnt_id); +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id); static int xattr_map_client(const struct lo_data *lo, const char *client_name, char **out_name); @@ -1032,18 +1034,39 @@ out_err: fuse_reply_err(req, saverr); } -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st, -uint64_t mnt_id) +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id) { -struct lo_inode *p; -struct lo_key key = { +struct lo_inode *p = NULL; +struct lo_key ids_key = { .ino = st->st_ino, .dev = st->st_dev, .mnt_id = mnt_id, }; pthread_mutex_lock(&lo->mutex); -p = g_hash_table_lookup(lo->inodes, &key); +if (fhandle) { +p = g_hash_table_lookup(lo->inodes_by_handle, fhandle); +} +if (!p) { +p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key); +/* + * When we had to fall back to looking up an inode by its IDs, + * ensure that we hit an entry that does not have a file + * handle. Entries with file handles must also have a handle + * alt key, so if w
[PATCH v2 3/9] virtiofsd: Add lo_inode_fd() helper
Once we let lo_inode.fd be optional, we will need its users to open the file handle stored in lo_inode instead. This function will do that. For now, it just returns lo_inode.fd, though. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 138 ++- 1 file changed, 117 insertions(+), 21 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 436f771d2a..46c9dfe200 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -629,6 +629,16 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino) return elem->inode; } +static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) +{ +*tfd = (TempFd) { +.fd = inode->fd, +.owned = false, +}; + +return 0; +} + /* * TODO Remove this helper and force callers to hold an inode refcount until * they are done with the fd. This will be done in a later patch to make @@ -790,11 +800,11 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info *fi) static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { +g_auto(TempFd) inode_fd = TEMP_FD_INIT; int saverr; char procname[64]; struct lo_data *lo = lo_data(req); struct lo_inode *inode; -int ifd; int res; int fd = -1; @@ -804,7 +814,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, return; } -ifd = inode->fd; +res = lo_inode_fd(inode, &inode_fd); +if (res < 0) { +saverr = -res; +goto out_err; +} /* If fi->fh is invalid we'll report EBADF later */ if (fi) { @@ -815,7 +829,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { res = fchmod(fd, attr->st_mode); } else { -sprintf(procname, "%i", ifd); +sprintf(procname, "%i", inode_fd.fd); res = fchmodat(lo->proc_self_fd, procname, attr->st_mode, 0); } if (res == -1) { @@ -827,12 +841,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, uid_t uid = (valid & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)-1; gid_t gid = (valid & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)-1; -saverr = drop_security_capability(lo, ifd); +saverr = drop_security_capability(lo, inode_fd.fd); if (saverr) { goto out_err; } -res = fchownat(ifd, "", uid, gid, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); +res = fchownat(inode_fd.fd, "", uid, gid, + AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); if (res == -1) { saverr = errno; goto out_err; @@ -911,7 +926,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { res = futimens(fd, tv); } else { -sprintf(procname, "%i", inode->fd); +sprintf(procname, "%i", inode_fd.fd); res = utimensat(lo->proc_self_fd, procname, tv, 0); } if (res == -1) { @@ -1026,7 +1041,8 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, struct fuse_entry_param *e, struct lo_inode **inodep) { -int newfd; +g_auto(TempFd) dir_fd = TEMP_FD_INIT; +int newfd = -1; int res; int saverr; uint64_t mnt_id; @@ -1056,7 +1072,13 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, name = "."; } -newfd = openat(dir->fd, name, O_PATH | O_NOFOLLOW); +res = lo_inode_fd(dir, &dir_fd); +if (res < 0) { +saverr = -res; +goto out; +} + +newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW); if (newfd == -1) { goto out_err; } @@ -1123,6 +1145,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, out_err: saverr = errno; +out: if (newfd != -1) { close(newfd); } @@ -1228,6 +1251,7 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent, const char *name, mode_t mode, dev_t rdev, const char *link) { +g_auto(TempFd) dir_fd = TEMP_FD_INIT; int res; int saverr; struct lo_data *lo = lo_data(req); @@ -1251,12 +1275,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent, return; } +res = lo_inode_fd(dir, &dir_fd); +if (res < 0) { +saverr = -res; +goto out; +} + saverr = lo_change_cred(req, &old); if (saverr) { goto out; } -res = mknod_wrapper(dir->f
[PATCH v2 1/9] virtiofsd: Add TempFd structure
We are planning to add file handles to lo_inode objects as an alternative to lo_inode.fd. That means that everywhere where we currently reference lo_inode.fd, we will have to open a temporary file descriptor that needs to be closed after use. So instead of directly accessing lo_inode.fd, there will be a helper function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new file descriptor with open_by_handle_at(). It encapsulates this result in a TempFd structure to let the caller know whether the FD needs to be closed after use (opened from the handle) or not (copied from lo_inode.fd). By using g_auto(TempFd) to store this result, callers will not even have to care about closing a temporary FD after use. It will be done automatically once the object goes out of scope. Signed-off-by: Max Reitz Reviewed-by: Connor Kuehl --- tools/virtiofsd/passthrough_ll.c | 49 1 file changed, 49 insertions(+) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 49c21fd855..a4674aba80 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -174,6 +174,28 @@ struct lo_data { int user_killpriv_v2, killpriv_v2; }; +/** + * Represents a file descriptor that may either be owned by this + * TempFd, or only referenced (i.e. the ownership belongs to some + * other object, and the value has just been copied into this TempFd). + * + * The purpose of this encapsulation is to be used as g_auto(TempFd) + * to automatically clean up owned file descriptors when this object + * goes out of scope. + * + * Use temp_fd_steal() to get an owned file descriptor that will not + * be closed when the TempFd goes out of scope. + */ +typedef struct { +int fd; +bool owned; /* fd owned by this object? */ +} TempFd; + +#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false }) + +static void temp_fd_clear(TempFd *temp_fd); +G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear); + static const struct fuse_opt lo_opts[] = { { "sandbox=namespace", offsetof(struct lo_data, sandbox), @@ -249,6 +271,33 @@ static struct lo_data *lo_data(fuse_req_t req) return (struct lo_data *)fuse_req_userdata(req); } +/** + * Clean-up function for TempFds + */ +static void temp_fd_clear(TempFd *temp_fd) +{ +if (temp_fd->owned) { +close(temp_fd->fd); +*temp_fd = TEMP_FD_INIT; +} +} + +/** + * Return an owned fd from *temp_fd that will not be closed when + * *temp_fd goes out of scope. + * + * (TODO: Remove __attribute__ once this is used.) + */ +static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd) +{ +if (temp_fd->owned) { +temp_fd->owned = false; +return temp_fd->fd; +} else { +return dup(temp_fd->fd); +} +} + /* * Load capng's state from our saved state if the current thread * hadn't previously been loaded. -- 2.31.1
[PATCH v2 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs
Hi, v1 cover letter for an overview: https://listman.redhat.com/archives/virtio-fs/2021-June/msg00033.html In v2, I (tried to) fix the bug Dave found, which is that get_file_handle() indiscriminately opened the given dirfd/name combination to get an O_RDONLY fd without checking whether we’re actually allowed to open dirfd/name; namely, we don’t allow ourselves to open files that aren’t regular files or directories. So that openat(.., O_RDONLY) is changed to an openat(..., O_PATH), and then check the file type with the statx() we’re doing anyway. If the file is OK to open, we reopen it O_RDONLY with the help of /proc/self/fd, like we always do. (This only affects patch 8.) git-backport-diff against v1: Key: [] : patches are identical [] : number of functional differences between upstream/downstream patch [down] : patch is downstream-only The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively 001/9:[] [--] 'virtiofsd: Add TempFd structure' 002/9:[] [--] 'virtiofsd: Use lo_inode_open() instead of openat()' 003/9:[] [--] 'virtiofsd: Add lo_inode_fd() helper' 004/9:[] [--] 'virtiofsd: Let lo_fd() return a TempFd' 005/9:[] [--] 'virtiofsd: Let lo_inode_open() return a TempFd' 006/9:[] [--] 'virtiofsd: Add lo_inode.fhandle' 007/9:[] [--] 'virtiofsd: Add inodes_by_handle hash table' 008/9:[0045] [FC] 'virtiofsd: Optionally fill lo_inode.fhandle' 009/9:[] [--] 'virtiofsd: Add lazy lo_do_find()' Max Reitz (9): virtiofsd: Add TempFd structure virtiofsd: Use lo_inode_open() instead of openat() virtiofsd: Add lo_inode_fd() helper virtiofsd: Let lo_fd() return a TempFd virtiofsd: Let lo_inode_open() return a TempFd virtiofsd: Add lo_inode.fhandle virtiofsd: Add inodes_by_handle hash table virtiofsd: Optionally fill lo_inode.fhandle virtiofsd: Add lazy lo_do_find() tools/virtiofsd/helper.c | 3 + tools/virtiofsd/passthrough_ll.c | 836 +- tools/virtiofsd/passthrough_seccomp.c | 2 + 3 files changed, 694 insertions(+), 147 deletions(-) -- 2.31.1
Re: [PATCH 8/9] virtiofsd: Optionally fill lo_inode.fhandle
On 08.06.21 12:43, Dr. David Alan Gilbert wrote: * Max Reitz (mre...@redhat.com) wrote: When the inode_file_handles option is set, try to generate a file handle for new inodes instead of opening an O_PATH FD. Being able to open these again will require CAP_DAC_READ_SEARCH, so the description text tells the user they will also need to specify -o modcaps=+dac_read_search. Generating a file handle returns the mount ID it is valid for. Opening it will require an FD instead. We have mount_fds to map an ID to an FD. get_file_handle() fills the hash map by opening the file we have generated a handle for. To verify that the resulting FD indeed represents the handle's mount ID, we use statx(). Therefore, using file handles requires statx() support. Signed-off-by: Max Reitz --- tools/virtiofsd/helper.c | 3 + tools/virtiofsd/passthrough_ll.c | 170 -- tools/virtiofsd/passthrough_seccomp.c | 1 + 3 files changed, 165 insertions(+), 9 deletions(-) diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c index 5e98ed702b..954f8639e6 100644 --- a/tools/virtiofsd/helper.c +++ b/tools/virtiofsd/helper.c @@ -186,6 +186,9 @@ void fuse_cmdline_help(void) " to virtiofsd from guest applications.\n" " default: no_allow_direct_io\n" "-o announce_submounts Announce sub-mount points to the guest\n" + "-o inode_file_handles Use file handles to reference inodes\n" + " instead of O_PATH file descriptors\n" + " (requires -o modcaps=+dac_read_search)\n" ); } diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 793d2c333e..d01f9d3a59 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -190,6 +190,7 @@ struct lo_data { /* An O_PATH file descriptor to /proc/self/fd/ */ int proc_self_fd; int user_killpriv_v2, killpriv_v2; +int inode_file_handles; }; /** @@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = { { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 }, { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 }, { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 }, +{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 }, +{ "no_inode_file_handles", + offsetof(struct lo_data, inode_file_handles), + 0 }, FUSE_OPT_END }; static bool use_syslog = false; @@ -315,6 +320,108 @@ static int temp_fd_steal(TempFd *temp_fd) } } +/** + * Generate a file handle for the given dirfd/name combination. + * + * If mount_fds does not yet contain an entry for the handle's mount + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds + * as the FD for that mount ID. (That is the file that we have + * generated a handle for, so it should be representative for the + * mount ID. However, to be sure (and to rule out races), we use + * statx() to verify that our assumption is correct.) + */ +static struct lo_fhandle *get_file_handle(struct lo_data *lo, + int dirfd, const char *name) +{ +/* We need statx() to verify the mount ID */ +#if defined(CONFIG_STATX) && defined(STATX_MNT_ID) +struct lo_fhandle *fh; +int ret; + +if (!lo->use_statx || !lo->inode_file_handles) { +return NULL; +} + +fh = g_new0(struct lo_fhandle, 1); + +fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle); +ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id, +AT_EMPTY_PATH); +if (ret < 0) { +goto fail; +} + +if (pthread_rwlock_rdlock(&mount_fds_lock)) { +goto fail; +} +if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) { +struct statx stx; +int fd; + +pthread_rwlock_unlock(&mount_fds_lock); + +if (name[0]) { +fd = openat(dirfd, name, O_RDONLY); But can't that be a device file or other special file that you must not open? Yes. So I think the right thing to do here is to use O_PATH (which should be safe), then statx() to find out what we need to know (now not only the mount ID, but the file type, too), and if it’s safe, we can then openat(proc_self_fd, str(fd), O_RDONLY) (and close the O_PATH fd). If it isn’t safe, we just return an error, and the inode gets an O_PATH fd. Max
[PATCH 9/9] virtiofsd: Add lazy lo_do_find()
lo_find() right now takes two lookup keys for two maps, namely the file handle for inodes_by_handle and the statx information for inodes_by_ids. However, we only need the statx information if looking up the inode by the file handle failed. There are two callers of lo_find(): The first one, lo_do_lookup(), has both keys anyway, so passing them does not incur any additional cost. The second one, lookup_name(), though, needs to explicitly invoke name_to_handle_at() (through get_file_handle()) and statx() (through do_statx()). We need to try to get a file handle as the primary key, so we cannot get rid of get_file_handle(), but we only need the statx information if looking up an inode by handle failed; so we can defer that until the lookup has indeed failed. To this end, replace lo_find()'s st/mnt_id parameters by a get_ids() closure that is invoked to fill the lo_key struct if necessary. Also, lo_find() is renamed to lo_do_find(), so we can add a new lo_find() wrapper whose closure just initializes the lo_key from the st/mnt_id parameters, just like the old lo_find() did. lookup_name() directly calls lo_do_find() now and passes its own closure, which performs the do_statx() call. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 93 ++-- 1 file changed, 76 insertions(+), 17 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index d01f9d3a59..ef10a3ace6 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -1141,22 +1141,23 @@ out_err: fuse_reply_err(req, saverr); } -static struct lo_inode *lo_find(struct lo_data *lo, -const struct lo_fhandle *fhandle, -struct stat *st, uint64_t mnt_id) +/* + * get_ids() will be called to get the key for lo->inodes_by_ids if + * the lookup by file handle has failed. + */ +static struct lo_inode *lo_do_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +int (*get_ids)(struct lo_key *, const void *), +const void *get_ids_opaque) { struct lo_inode *p = NULL; -struct lo_key ids_key = { -.ino = st->st_ino, -.dev = st->st_dev, -.mnt_id = mnt_id, -}; +struct lo_key ids_key; pthread_mutex_lock(&lo->mutex); if (fhandle) { p = g_hash_table_lookup(lo->inodes_by_handle, fhandle); } -if (!p) { +if (!p && get_ids(&ids_key, get_ids_opaque) == 0) { p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key); /* * When we had to fall back to looking up an inode by its IDs, @@ -1184,6 +1185,36 @@ static struct lo_inode *lo_find(struct lo_data *lo, return p; } +struct lo_find_get_ids_key_opaque { +const struct stat *st; +uint64_t mnt_id; +}; + +static int lo_find_get_ids_key(struct lo_key *ids_key, const void *opaque) +{ +const struct lo_find_get_ids_key_opaque *stat_info = opaque; + +*ids_key = (struct lo_key){ +.ino = stat_info->st->st_ino, +.dev = stat_info->st->st_dev, +.mnt_id = stat_info->mnt_id, +}; + +return 0; +} + +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id) +{ +const struct lo_find_get_ids_key_opaque stat_info = { +.st = st, +.mnt_id = mnt_id, +}; + +return lo_do_find(lo, fhandle, lo_find_get_ids_key, &stat_info); +} + /* value_destroy_func for posix_locks GHashTable */ static void posix_locks_value_destroy(gpointer data) { @@ -1655,14 +1686,41 @@ out_err: fuse_reply_err(req, saverr); } +struct lookup_name_get_ids_key_opaque { +struct lo_data *lo; +int parent_fd; +const char *name; +}; + +static int lookup_name_get_ids_key(struct lo_key *ids_key, const void *opaque) +{ +const struct lookup_name_get_ids_key_opaque *stat_params = opaque; +uint64_t mnt_id; +struct stat attr; +int res; + +res = do_statx(stat_params->lo, stat_params->parent_fd, stat_params->name, + &attr, AT_SYMLINK_NOFOLLOW, &mnt_id); +if (res < 0) { +return -errno; +} + +*ids_key = (struct lo_key){ +.ino = attr.st_ino, +.dev = attr.st_dev, +.mnt_id = mnt_id, +}; + +return 0; +} + /* Increments nlookup and caller must release refcount using lo_inode_put() */ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent, const char *name) { g_auto(TempFd) dir_fd = TEMP_FD_INIT; int res; -uint64_t mnt_id; -struct stat attr; +struct lookup_name_get_ids_key_opaque stat_params; struct lo_fhandle *fh; struct lo_data *lo = lo_data(req); struct lo_inode *dir = lo_inode(req, parent); @@ -1680,13 +1738,14
[PATCH 1/9] virtiofsd: Add TempFd structure
We are planning to add file handles to lo_inode objects as an alternative to lo_inode.fd. That means that everywhere where we currently reference lo_inode.fd, we will have to open a temporary file descriptor that needs to be closed after use. So instead of directly accessing lo_inode.fd, there will be a helper function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new file descriptor with open_by_handle_at(). It encapsulates this result in a TempFd structure to let the caller know whether the FD needs to be closed after use (opened from the handle) or not (copied from lo_inode.fd). By using g_auto(TempFd) to store this result, callers will not even have to care about closing a temporary FD after use. It will be done automatically once the object goes out of scope. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 49 1 file changed, 49 insertions(+) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 49c21fd855..a4674aba80 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -174,6 +174,28 @@ struct lo_data { int user_killpriv_v2, killpriv_v2; }; +/** + * Represents a file descriptor that may either be owned by this + * TempFd, or only referenced (i.e. the ownership belongs to some + * other object, and the value has just been copied into this TempFd). + * + * The purpose of this encapsulation is to be used as g_auto(TempFd) + * to automatically clean up owned file descriptors when this object + * goes out of scope. + * + * Use temp_fd_steal() to get an owned file descriptor that will not + * be closed when the TempFd goes out of scope. + */ +typedef struct { +int fd; +bool owned; /* fd owned by this object? */ +} TempFd; + +#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false }) + +static void temp_fd_clear(TempFd *temp_fd); +G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear); + static const struct fuse_opt lo_opts[] = { { "sandbox=namespace", offsetof(struct lo_data, sandbox), @@ -249,6 +271,33 @@ static struct lo_data *lo_data(fuse_req_t req) return (struct lo_data *)fuse_req_userdata(req); } +/** + * Clean-up function for TempFds + */ +static void temp_fd_clear(TempFd *temp_fd) +{ +if (temp_fd->owned) { +close(temp_fd->fd); +*temp_fd = TEMP_FD_INIT; +} +} + +/** + * Return an owned fd from *temp_fd that will not be closed when + * *temp_fd goes out of scope. + * + * (TODO: Remove __attribute__ once this is used.) + */ +static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd) +{ +if (temp_fd->owned) { +temp_fd->owned = false; +return temp_fd->fd; +} else { +return dup(temp_fd->fd); +} +} + /* * Load capng's state from our saved state if the current thread * hadn't previously been loaded. -- 2.31.1
[PATCH 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs
Hi, This is the C virtiofsd equivalent to https://gitlab.com/virtio-fs/virtiofsd-rs/-/merge_requests/26. As such, the summary is pretty much the same: Storing an O_PATH file descriptor in every lo_inode object means we have a lot of FDs open, which is sometimes bad. This series adds an option (-o inode_file_handles) that will make virtiofsd attempt to generate a file handle for new inodes and store that instead of an FD. When an FD is needed for a given inode, we open the handle. To accomplish this, lo_inode.fd is should not be accessed directly anymore, but only through helper functions (mainly lo_inode_fd() and lo_inode_open()). A TempFd object is added to hide the difference between FDs that are bound to the lo_inode object (and so need not be closed after use) and temporary FDs from open_by_handle_at() (which do need to be closed after use). To prevent the problem I spent a long time talking about (if we don’t have an FD open for every inode, the inode can be deleted, its ID reused, and then the lookup in lo_data.inodes will find the old deleted inode), patch 7 adds a second hash table lo_data.inodes_by_handle that maps file handles to lo_inode objects. (Because file handles include a generation ID, so we can discern between the old and the new inode.) Patch 9 is completely optional, but I just really felt compelled to write it. Max Reitz (9): virtiofsd: Add TempFd structure virtiofsd: Use lo_inode_open() instead of openat() virtiofsd: Add lo_inode_fd() helper virtiofsd: Let lo_fd() return a TempFd virtiofsd: Let lo_inode_open() return a TempFd virtiofsd: Add lo_inode.fhandle virtiofsd: Add inodes_by_handle hash table virtiofsd: Optionally fill lo_inode.fhandle virtiofsd: Add lazy lo_do_find() tools/virtiofsd/helper.c | 3 + tools/virtiofsd/passthrough_ll.c | 809 +- tools/virtiofsd/passthrough_seccomp.c | 2 + 3 files changed, 667 insertions(+), 147 deletions(-) -- 2.31.1
Re: [PATCH 1/3] virtiofsd: Find original inode ID of mount points
On 02.06.21 20:59, Miklos Szeredi wrote: On Wed, 2 Jun 2021 at 20:20, Vivek Goyal wrote: On Wed, May 12, 2021 at 02:55:42PM +0200, Max Reitz wrote: Mount point directories represent two inodes: On one hand, they are a normal directory on their parent filesystem. On the other, they are the root node of the filesystem mounted there. Thus, they have two inode IDs. Right now, we only report the latter inode ID (i.e. the inode ID of the mounted filesystem's root node). This is fine once the guest has auto-mounted a submount there (so this inode ID goes with a device ID that is distinct from the parent filesystem), but before the auto-mount, they have the device ID of the parent and the inode ID for the submount. This is problematic because this is likely exactly the same st_dev/st_ino combination as the parent filesystem's root node. This leads to problems for example with `find`, which will thus complain about a filesystem loop if it has visited the parent filesystem's root node before, and then refuse to descend into the submount. There is a way to find the mount directory's original inode ID, and that is to readdir(3) the parent directory, look for the mount directory, and read the dirent.d_ino field. Using this, we can let lookup and readdirplus return that original inode ID, which the guest will thus show until the submount is auto-mounted. (Then, it will invoke getattr and that stat(2) call will return the inode ID for the submount.) [ CC miklos ] Hi Max, Though we discussed this in chat room, I am still responding to this email with the concern I have, so that there is a record of it. That sounds scary :) So with this patch for FUSE_LOOKUP we always return submount's parentinode id and with GETATTR request we return actual inode id of submount. That kind of bothers me a bit as we are assuming that there is always going to be a GETATTR request after FUSE_LOOKUP. FUSE_LOOKUP itself has attrs returned with it and it might happen that after FUSE_LOOKUP, FUSE_GETATTR might not be called at all because FUSE_LOOKUP itself got the latest updated attrs with certain timeout. For example, if I call stat on a normal file (not submount), I see that after FUSE_LOOKUP, no FUSE_GETATTR request is going and fuse_update_get_attr() is using attrs from locally cached inode attrs. But same thing does not seem to be happening in case of submount. Once submount is created in guest, I see that we never seem to be calling ->revalidate() on newly created dentry of submount root. I am not sure why. And that means we don't do FUSE_LOOKUP and that means FUSE_GETATTR always gets called. Even if it worked reliably, I have to admit it’s kind of relying on at most implementation-defined behavior. Having two inodes would definitely be the right solution. Yes, this sounds normal. The lookup sequence for "/mnt/virtiofs/dir1/submount/file" will be: LOOKUP(1, "dir") create inode I1 in sb1 create dentry "dir" with inode I1 in sb1 LOOKUP(I1, "submount") create inode I2 in sb1, set S_AUTOMOUNT create dentry "submount" with inode I2 in sb1 automount triggered: create sb2 create inode I2 in sb2 create dentry "/" with inode I2 in sb2 GETATTR(I2) fill inode I2 LOOKUP(I2, "file") create inode I3 create dentry "file" with inode I3 in sb2 Notice how there's two inodes numbered I2, but in separate sb's. However the server has only the notion of a single I2 inode which is the root of the submount, since the mountpoint is not visible (except for d_ino in readdir()). Now AFAICS what this patch does is set the inode number in the attributes returned by LOOKUP(I1, "submount") to the covered inode, so that AT_NO_AUTOMOUNT stat will return the correct value. While this seems to work, it's not a fundamental fix to the problem, since the attributes on sb1/I2 will time out and the next stat(..., AT_NO_AUTOMOUNT) will trigger a GETATTR request, which will return the inode number of the submount root. Ah, yeah. Solving this properly would mean that the server would have to generate separate nodeid's for the mountpoint and the submount root and the protocol would have to be extended so that this information could be transferred to the client. Yes, we’d somehow need to do a separate lookup for the root inode of the submount... Not sure whether this is worth it or not. I’m wondering the same. This series was mostly the result of an incidental finding and seeing that getting it to work and do what I want seemed to be pretty easy. Turns out doing something right can never really be easy... But we have already decided that we don’t deem the inode ID problem too important (not least considering NFS has the same issue), so fixing it is not really top priority. Maybe I’ll get back to it, but I think for now I consider it on the backlog. Max
[PATCH 8/9] virtiofsd: Optionally fill lo_inode.fhandle
When the inode_file_handles option is set, try to generate a file handle for new inodes instead of opening an O_PATH FD. Being able to open these again will require CAP_DAC_READ_SEARCH, so the description text tells the user they will also need to specify -o modcaps=+dac_read_search. Generating a file handle returns the mount ID it is valid for. Opening it will require an FD instead. We have mount_fds to map an ID to an FD. get_file_handle() fills the hash map by opening the file we have generated a handle for. To verify that the resulting FD indeed represents the handle's mount ID, we use statx(). Therefore, using file handles requires statx() support. Signed-off-by: Max Reitz --- tools/virtiofsd/helper.c | 3 + tools/virtiofsd/passthrough_ll.c | 170 -- tools/virtiofsd/passthrough_seccomp.c | 1 + 3 files changed, 165 insertions(+), 9 deletions(-) diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c index 5e98ed702b..954f8639e6 100644 --- a/tools/virtiofsd/helper.c +++ b/tools/virtiofsd/helper.c @@ -186,6 +186,9 @@ void fuse_cmdline_help(void) " to virtiofsd from guest applications.\n" " default: no_allow_direct_io\n" "-o announce_submounts Announce sub-mount points to the guest\n" + "-o inode_file_handles Use file handles to reference inodes\n" + " instead of O_PATH file descriptors\n" + " (requires -o modcaps=+dac_read_search)\n" ); } diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 793d2c333e..d01f9d3a59 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -190,6 +190,7 @@ struct lo_data { /* An O_PATH file descriptor to /proc/self/fd/ */ int proc_self_fd; int user_killpriv_v2, killpriv_v2; +int inode_file_handles; }; /** @@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = { { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 }, { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 }, { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 }, +{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 }, +{ "no_inode_file_handles", + offsetof(struct lo_data, inode_file_handles), + 0 }, FUSE_OPT_END }; static bool use_syslog = false; @@ -315,6 +320,108 @@ static int temp_fd_steal(TempFd *temp_fd) } } +/** + * Generate a file handle for the given dirfd/name combination. + * + * If mount_fds does not yet contain an entry for the handle's mount + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds + * as the FD for that mount ID. (That is the file that we have + * generated a handle for, so it should be representative for the + * mount ID. However, to be sure (and to rule out races), we use + * statx() to verify that our assumption is correct.) + */ +static struct lo_fhandle *get_file_handle(struct lo_data *lo, + int dirfd, const char *name) +{ +/* We need statx() to verify the mount ID */ +#if defined(CONFIG_STATX) && defined(STATX_MNT_ID) +struct lo_fhandle *fh; +int ret; + +if (!lo->use_statx || !lo->inode_file_handles) { +return NULL; +} + +fh = g_new0(struct lo_fhandle, 1); + +fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle); +ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id, +AT_EMPTY_PATH); +if (ret < 0) { +goto fail; +} + +if (pthread_rwlock_rdlock(&mount_fds_lock)) { +goto fail; +} +if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) { +struct statx stx; +int fd; + +pthread_rwlock_unlock(&mount_fds_lock); + +if (name[0]) { +fd = openat(dirfd, name, O_RDONLY); +} else { +char procname[64]; +snprintf(procname, sizeof(procname), "%i", dirfd); +fd = openat(lo->proc_self_fd, procname, O_RDONLY); +} +if (fd < 0) { +goto fail; +} + +ret = statx(fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, +STATX_MNT_ID, &stx); +if (ret < 0) { +if (errno == ENOSYS) { +lo->use_statx = false; +fuse_log(FUSE_LOG_WARNING, + "statx() does not work: Will not be able to use file " + "handles for inodes\n"); +} +goto fail; +
[PATCH 4/9] virtiofsd: Let lo_fd() return a TempFd
Accessing lo_inode.fd must generally happen through lo_inode_fd(), and lo_fd() is no exception; and then it must pass on the TempFd it has received from lo_inode_fd(). (Note that all lo_fd() calls now use proper error handling, where all of them were in-line before; i.e. they were used in place of the fd argument of some function call. This only worked because the only error that could occur was that lo_inode() failed to find the inode ID: Then -1 would be passed as the fd, which would result in an EBADF error, which is precisely what we would want to return to the guest for an invalid inode ID. Now, though, lo_inode_fd() might potentially invoke open_by_handle_at(), which can return many different errors, and they should be properly handled and returned to the guest. So we can no longer allow lo_fd() to be used in-line, and instead need to do proper error handling for it.) Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 55 +--- 1 file changed, 44 insertions(+), 11 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 46c9dfe200..8f64bcd6c5 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -644,18 +644,19 @@ static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) * they are done with the fd. This will be done in a later patch to make * review easier. */ -static int lo_fd(fuse_req_t req, fuse_ino_t ino) +static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) { struct lo_inode *inode = lo_inode(req, ino); -int fd; +int res; if (!inode) { -return -1; +return -EBADF; } -fd = inode->fd; +res = lo_inode_fd(inode, tfd); + lo_inode_put(lo_data(req), &inode); -return fd; +return res; } /* @@ -766,14 +767,19 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn) static void lo_getattr(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi) { +g_auto(TempFd) ino_fd = TEMP_FD_INIT; int res; struct stat buf; struct lo_data *lo = lo_data(req); (void)fi; -res = -fstatat(lo_fd(req, ino), "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); +res = lo_fd(req, ino, &ino_fd); +if (res < 0) { +return (void)fuse_reply_err(req, -res); +} + +res = fstatat(ino_fd.fd, "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); if (res == -1) { return (void)fuse_reply_err(req, errno); } @@ -1441,6 +1447,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent, static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name) { +g_auto(TempFd) parent_fd = TEMP_FD_INIT; int res; struct lo_inode *inode; struct lo_data *lo = lo_data(req); @@ -1455,13 +1462,19 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name) return; } +res = lo_fd(req, parent, &parent_fd); +if (res < 0) { +fuse_reply_err(req, -res); +return; +} + inode = lookup_name(req, parent, name); if (!inode) { fuse_reply_err(req, EIO); return; } -res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR); +res = unlinkat(parent_fd.fd, name, AT_REMOVEDIR); fuse_reply_err(req, res == -1 ? errno : 0); unref_inode_lolocked(lo, inode, 1); @@ -1547,6 +1560,7 @@ out: static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name) { +g_auto(TempFd) parent_fd = TEMP_FD_INIT; int res; struct lo_inode *inode; struct lo_data *lo = lo_data(req); @@ -1561,13 +1575,19 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name) return; } +res = lo_fd(req, parent, &parent_fd); +if (res < 0) { +fuse_reply_err(req, -res); +return; +} + inode = lookup_name(req, parent, name); if (!inode) { fuse_reply_err(req, EIO); return; } -res = unlinkat(lo_fd(req, parent), name, 0); +res = unlinkat(parent_fd.fd, name, 0); fuse_reply_err(req, res == -1 ? errno : 0); unref_inode_lolocked(lo, inode, 1); @@ -1647,10 +1667,16 @@ static void lo_forget_multi(fuse_req_t req, size_t count, static void lo_readlink(fuse_req_t req, fuse_ino_t ino) { +g_auto(TempFd) ino_fd = TEMP_FD_INIT; char buf[PATH_MAX + 1]; int res; -res = readlinkat(lo_fd(req, ino), "", buf, sizeof(buf)); +res = lo_fd(req, ino, &ino_fd); +if (res < 0) { +return (void)fuse_reply_err(req, -res); +} + +res = readlinkat(ino_fd.fd, "", buf, sizeof(buf)); if (res == -1) { return (void)fuse_reply_err(req, errno); } @@ -2447,10 +2473,17 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino, static void lo_statfs(fuse_req_t req, fuse_ino_t ino) {
[PATCH 7/9] virtiofsd: Add inodes_by_handle hash table
Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH FD in lo_inode.fd. Therefore, when the respective inode is unlinked, its inode ID will remain in use until we drop our lo_inode (and lo_inode_put() thus closes the FD). Therefore, lo_find() can safely use the inode ID as an lo_inode key, because any inode with an inode ID we find in lo_data.inodes (on the same filesystem) must be the exact same file. This will change when we start setting lo_inode.fhandle so we do not have to keep an O_PATH FD open. Then, unlinking such an inode will immediately remove it, so its ID can then be reused by newly created files, even while the lo_inode object is still there[1]. So creating a new file can then reuse the old file's inode ID, and looking up the new file would lead to us finding the old file's lo_inode, which is not ideal. Luckily, just as file handles cause this problem, they also solve it: A file handle contains a generation ID, which changes when an inode ID is reused, so the new file can be distinguished from the old one. So all we need to do is to add a second map besides lo_data.inodes that maps file handles to lo_inodes, namely lo_data.inodes_by_handle. For clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids. Unfortunately, we cannot rely on being able to generate file handles every time. Therefore, we still enter every lo_inode object into inodes_by_ids, but having an entry in inodes_by_handle is optional. A potential inodes_by_handle entry then has precedence, the inodes_by_ids entry is just a fallback. Note that we do not generate lo_fhandle objects yet, and so we also do not enter anything into the inodes_by_handle map yet. Also, all lookups skip that map. We might manually create file handles with some code that is immediately removed by the next patch again, but that would break the assumption in lo_find() that every lo_inode with a non-NULL .fhandle must have an entry in inodes_by_handle and vice versa. So we leave actually using the inodes_by_handle map for the next patch. [1] If some application in the guest still has the file open, there is going to be a corresponding FD mapping in lo_data.fd_map. In such a case, the inode will only go away once every application in the guest has closed it. The problem described only applies to cases where the guest does not have the file open, and it is just in the dentry cache, basically. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 80 +--- 1 file changed, 64 insertions(+), 16 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index e665575401..793d2c333e 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -179,7 +179,8 @@ struct lo_data { int announce_submounts; bool use_statx; struct lo_inode root; -GHashTable *inodes; /* protected by lo->mutex */ +GHashTable *inodes_by_ids; /* protected by lo->mutex */ +GHashTable *inodes_by_handle; /* protected by lo->mutex */ struct lo_map ino_map; /* protected by lo->mutex */ struct lo_map dirp_map; /* protected by lo->mutex */ struct lo_map fd_map; /* protected by lo->mutex */ @@ -257,8 +258,9 @@ static struct { /* That we loaded cap-ng in the current thread from the saved */ static __thread bool cap_loaded = 0; -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st, -uint64_t mnt_id); +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id); static int xattr_map_client(const struct lo_data *lo, const char *client_name, char **out_name); @@ -1032,18 +1034,39 @@ out_err: fuse_reply_err(req, saverr); } -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st, -uint64_t mnt_id) +static struct lo_inode *lo_find(struct lo_data *lo, +const struct lo_fhandle *fhandle, +struct stat *st, uint64_t mnt_id) { -struct lo_inode *p; -struct lo_key key = { +struct lo_inode *p = NULL; +struct lo_key ids_key = { .ino = st->st_ino, .dev = st->st_dev, .mnt_id = mnt_id, }; pthread_mutex_lock(&lo->mutex); -p = g_hash_table_lookup(lo->inodes, &key); +if (fhandle) { +p = g_hash_table_lookup(lo->inodes_by_handle, fhandle); +} +if (!p) { +p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key); +/* + * When we had to fall back to looking up an inode by its IDs, + * ensure that we hit an entry that does not have a file + * handle. Entries with file handles must also have a handle + * alt key, so if we have not found it by th
[PATCH 3/9] virtiofsd: Add lo_inode_fd() helper
Once we let lo_inode.fd be optional, we will need its users to open the file handle stored in lo_inode instead. This function will do that. For now, it just returns lo_inode.fd, though. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 138 ++- 1 file changed, 117 insertions(+), 21 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 436f771d2a..46c9dfe200 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -629,6 +629,16 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino) return elem->inode; } +static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) +{ +*tfd = (TempFd) { +.fd = inode->fd, +.owned = false, +}; + +return 0; +} + /* * TODO Remove this helper and force callers to hold an inode refcount until * they are done with the fd. This will be done in a later patch to make @@ -790,11 +800,11 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info *fi) static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int valid, struct fuse_file_info *fi) { +g_auto(TempFd) inode_fd = TEMP_FD_INIT; int saverr; char procname[64]; struct lo_data *lo = lo_data(req); struct lo_inode *inode; -int ifd; int res; int fd = -1; @@ -804,7 +814,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, return; } -ifd = inode->fd; +res = lo_inode_fd(inode, &inode_fd); +if (res < 0) { +saverr = -res; +goto out_err; +} /* If fi->fh is invalid we'll report EBADF later */ if (fi) { @@ -815,7 +829,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { res = fchmod(fd, attr->st_mode); } else { -sprintf(procname, "%i", ifd); +sprintf(procname, "%i", inode_fd.fd); res = fchmodat(lo->proc_self_fd, procname, attr->st_mode, 0); } if (res == -1) { @@ -827,12 +841,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, uid_t uid = (valid & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)-1; gid_t gid = (valid & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)-1; -saverr = drop_security_capability(lo, ifd); +saverr = drop_security_capability(lo, inode_fd.fd); if (saverr) { goto out_err; } -res = fchownat(ifd, "", uid, gid, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); +res = fchownat(inode_fd.fd, "", uid, gid, + AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW); if (res == -1) { saverr = errno; goto out_err; @@ -911,7 +926,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { res = futimens(fd, tv); } else { -sprintf(procname, "%i", inode->fd); +sprintf(procname, "%i", inode_fd.fd); res = utimensat(lo->proc_self_fd, procname, tv, 0); } if (res == -1) { @@ -1026,7 +1041,8 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, struct fuse_entry_param *e, struct lo_inode **inodep) { -int newfd; +g_auto(TempFd) dir_fd = TEMP_FD_INIT; +int newfd = -1; int res; int saverr; uint64_t mnt_id; @@ -1056,7 +1072,13 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, name = "."; } -newfd = openat(dir->fd, name, O_PATH | O_NOFOLLOW); +res = lo_inode_fd(dir, &dir_fd); +if (res < 0) { +saverr = -res; +goto out; +} + +newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW); if (newfd == -1) { goto out_err; } @@ -1123,6 +1145,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name, out_err: saverr = errno; +out: if (newfd != -1) { close(newfd); } @@ -1228,6 +1251,7 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent, const char *name, mode_t mode, dev_t rdev, const char *link) { +g_auto(TempFd) dir_fd = TEMP_FD_INIT; int res; int saverr; struct lo_data *lo = lo_data(req); @@ -1251,12 +1275,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent, return; } +res = lo_inode_fd(dir, &dir_fd); +if (res < 0) { +saverr = -res; +goto out; +} + saverr = lo_change_cred(req, &old); if (saverr) { goto out; } -res = mknod_wrapper(dir->fd, name, link, mode, rdev)
[PATCH 2/9] virtiofsd: Use lo_inode_open() instead of openat()
The xattr functions want a non-O_PATH FD, so they reopen the lo_inode.fd with the flags they need through /proc/self/fd. Similarly, lo_opendir() needs an O_RDONLY FD. Instead of the /proc/self/fd trick, it just uses openat(fd, "."), because the FD is guaranteed to be a directory, so this works. All cases have one problem in common, though: In the future, when we may have a file handle in the lo_inode instead of an FD, querying an lo_inode FD may incur an open_by_handle_at() call. It does not make sense to then reopen that FD with custom flags, those should have been passed to open_by_handle_at() instead. Use lo_inode_open() instead of openat(). As part of the file handle change, lo_inode_open() will be made to invoke openat() only if lo_inode.fd is valid. Otherwise, it will invoke open_by_handle_at() with the right flags from the start. Consequently, after this patch, lo_inode_open() is the only place to invoke openat() to reopen an existing FD with different flags. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 43 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index a4674aba80..436f771d2a 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -1645,18 +1645,26 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, { int error = ENOMEM; struct lo_data *lo = lo_data(req); -struct lo_dirp *d; +struct lo_inode *inode; +struct lo_dirp *d = NULL; int fd; ssize_t fh; +inode = lo_inode(req, ino); +if (!inode) { +error = EBADF; +goto out_err; +} + d = calloc(1, sizeof(struct lo_dirp)); if (d == NULL) { goto out_err; } -fd = openat(lo_fd(req, ino), ".", O_RDONLY); -if (fd == -1) { -goto out_errno; +fd = lo_inode_open(lo, inode, O_RDONLY); +if (fd < 0) { +error = -fd; +goto out_err; } d->dp = fdopendir(fd); @@ -1685,6 +1693,7 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, out_errno: error = errno; out_err: +lo_inode_put(lo, &inode); if (d) { if (d->dp) { closedir(d->dp); @@ -2827,7 +2836,6 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, } } -sprintf(procname, "%i", inode->fd); /* * It is not safe to open() non-regular/non-dir files in file server * unless O_PATH is used, so use that method for regular files/dir @@ -2835,12 +2843,14 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, * Otherwise, call fchdir() to avoid open(). */ if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -goto out_err; +saverr = -fd; +goto out; } ret = fgetxattr(fd, name, value, size); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = getxattr(procname, name, value, size); @@ -2906,14 +2916,15 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t ino, size_t size) } } -sprintf(procname, "%i", inode->fd); if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -goto out_err; +saverr = -fd; +goto out; } ret = flistxattr(fd, value, size); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = listxattr(procname, value, size); @@ -3039,15 +3050,15 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name, fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64 ", name=%s value=%s size=%zd)\n", ino, name, value, size); -sprintf(procname, "%i", inode->fd); if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) { -fd = openat(lo->proc_self_fd, procname, O_RDONLY); +fd = lo_inode_open(lo, inode, O_RDONLY); if (fd < 0) { -saverr = errno; +saverr = -fd; goto out; } ret = fsetxattr(fd, name, value, size, flags); } else { +sprintf(procname, "%i", inode->fd); /* fchdir should not fail here */ FCHDIR_NOFAIL(lo->proc_self_fd); ret = setxattr(procname, name, value, size, flags); @@ -3105,15 +3116,15 @@ s
[PATCH 6/9] virtiofsd: Add lo_inode.fhandle
This new field is an alternative to lo_inode.fd: Either of the two must be set. In case an O_PATH FD is needed for some lo_inode, it is either taken from lo_inode.fd, if valid, or a temporary FD is opened with open_by_handle_at(). Using a file handle instead of an FD has the advantage of keeping the number of open file descriptors low. Because open_by_handle_at() requires a mount FD (i.e. a non-O_PATH FD opened on the filesystem to which the file handle refers), but every lo_fhandle only has a mount ID (as returned by name_to_handle_at()), we keep a hash map of such FDs in mount_fds (mapping ID to FD). get_file_handle(), which is added by a later patch, will ensure that every mount ID for which we have generated a handle has a corresponding entry in mount_fds. Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 116 ++ tools/virtiofsd/passthrough_seccomp.c | 1 + 2 files changed, 102 insertions(+), 15 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 3014e8baf8..e665575401 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -88,8 +88,25 @@ struct lo_key { uint64_t mnt_id; }; +struct lo_fhandle { +union { +struct file_handle handle; +char padding[sizeof(struct file_handle) + MAX_HANDLE_SZ]; +}; +int mount_id; +}; + +/* Maps mount IDs to an FD that we can pass to open_by_handle_at() */ +static GHashTable *mount_fds; +pthread_rwlock_t mount_fds_lock = PTHREAD_RWLOCK_INITIALIZER; + struct lo_inode { +/* + * Either of fd or fhandle must be set (i.e. >= 0 or non-NULL, + * respectively). + */ int fd; +struct lo_fhandle *fhandle; /* * Atomic reference count for this object. The nlookup field holds a @@ -296,6 +313,44 @@ static int temp_fd_steal(TempFd *temp_fd) } } +/** + * Open the given file handle with the given flags. + * + * The mount FD to pass to open_by_handle_at() is taken from the + * mount_fds hash map. + * + * On error, return -errno. + */ +static int open_file_handle(const struct lo_fhandle *fh, int flags) +{ +gpointer mount_fd_ptr; +int mount_fd; +bool found; +int ret; + +ret = pthread_rwlock_rdlock(&mount_fds_lock); +if (ret) { +return -ret; +} + +/* mount_fd == 0 is valid, so we need lookup_extended */ +found = g_hash_table_lookup_extended(mount_fds, + GINT_TO_POINTER(fh->mount_id), + NULL, &mount_fd_ptr); +pthread_rwlock_unlock(&mount_fds_lock); +if (!found) { +return -EINVAL; +} +mount_fd = GPOINTER_TO_INT(mount_fd_ptr); + +ret = open_by_handle_at(mount_fd, (struct file_handle *)&fh->handle, flags); +if (ret < 0) { +return -errno; +} + +return ret; +} + /* * Load capng's state from our saved state if the current thread * hadn't previously been loaded. @@ -602,7 +657,11 @@ static void lo_inode_put(struct lo_data *lo, struct lo_inode **inodep) *inodep = NULL; if (g_atomic_int_dec_and_test(&inode->refcount)) { -close(inode->fd); +if (inode->fd >= 0) { +close(inode->fd); +} else { +g_free(inode->fhandle); +} free(inode); } } @@ -629,10 +688,25 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino) static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd) { -*tfd = (TempFd) { -.fd = inode->fd, -.owned = false, -}; +if (inode->fd >= 0) { +*tfd = (TempFd) { +.fd = inode->fd, +.owned = false, +}; +} else { +int fd; + +assert(inode->fhandle != NULL); +fd = open_file_handle(inode->fhandle, O_PATH); +if (fd < 0) { +return -errno; +} + +*tfd = (TempFd) { +.fd = fd, +.owned = true, +}; +} return 0; } @@ -672,22 +746,32 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) static int lo_inode_open(const struct lo_data *lo, const struct lo_inode *inode, int open_flags, TempFd *tfd) { -g_autofree char *fd_str = g_strdup_printf("%d", inode->fd); +g_autofree char *fd_str = NULL; int fd; if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) { return -EBADF; } -/* - * The file is a symlink so O_NOFOLLOW must be ignored. We checked earlier - * that the inode is not a special file but if an external process races - * with us then symlinks are traversed here. It is not possible to escape - * the shared directory since it is mounted as "/" though. - */ -fd = openat(lo->proc_self_fd, fd_str, open_flags & ~O_NOFO
Re: [PATCH] block/snapshot: Clarify goto fallback behavior
On 03.06.21 18:02, Peter Maydell wrote: On Mon, 3 May 2021 at 10:55, Max Reitz wrote: In the bdrv_snapshot_goto() fallback code, we work with a pointer to either bs->file or bs->backing. We close that child, close the node (with .bdrv_close()), apply the snapshot on the child node, and then re-open the node (with .bdrv_open()). In order for .bdrv_open() to attach the same child node that we had before, we pass "file={child-node}" or "backing={child-node}" to it. Therefore, when .bdrv_open() has returned success, we can assume that bs->file or bs->backing (respectively) points to our original child again. This is verified by an assertion. All of this is not immediately clear from a quick glance at the code, so add a comment to the assertion what it is for, and why it is valid. It certainly confused Coverity. Reported-by: Coverity (CID 1452774) Signed-off-by: Max Reitz Did this patch get lost? I was just going through outstanding coverity issues and noticed it was posted a month ago and not in master... Oh, right, sorry. Thanks for the reminder. Now applied to my block branch: https://github.com/XanClic/qemu/commits/block
[PATCH 5/9] virtiofsd: Let lo_inode_open() return a TempFd
Strictly speaking, this is not necessary, because lo_inode_open() will always return a new FD owned by the caller, so TempFd.owned will always be true. However, auto-cleanup is nice, and in some cases this plays nicely with an lo_inode_fd() call in another conditional branch (see lo_setattr()). Signed-off-by: Max Reitz --- tools/virtiofsd/passthrough_ll.c | 137 +-- 1 file changed, 59 insertions(+), 78 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 8f64bcd6c5..3014e8baf8 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -285,10 +285,8 @@ static void temp_fd_clear(TempFd *temp_fd) /** * Return an owned fd from *temp_fd that will not be closed when * *temp_fd goes out of scope. - * - * (TODO: Remove __attribute__ once this is used.) */ -static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd) +static int temp_fd_steal(TempFd *temp_fd) { if (temp_fd->owned) { temp_fd->owned = false; @@ -667,9 +665,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd) * when a malicious client opens special files such as block device nodes. * Symlink inodes are also rejected since symlinks must already have been * traversed on the client side. + * + * The fd is returned in tfd->fd. The return value is 0 on success and -errno + * otherwise. */ -static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode, - int open_flags) +static int lo_inode_open(const struct lo_data *lo, const struct lo_inode *inode, + int open_flags, TempFd *tfd) { g_autofree char *fd_str = g_strdup_printf("%d", inode->fd); int fd; @@ -688,7 +689,13 @@ static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode, if (fd < 0) { return -errno; } -return fd; + +*tfd = (TempFd) { +.fd = fd, +.owned = true, +}; + +return 0; } static void lo_init(void *userdata, struct fuse_conn_info *conn) @@ -820,7 +827,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, return; } -res = lo_inode_fd(inode, &inode_fd); +if (!fi && (valid & FUSE_SET_ATTR_SIZE)) { +/* We need an O_RDWR FD for ftruncate() */ +res = lo_inode_open(lo, inode, O_RDWR, &inode_fd); +} else { +res = lo_inode_fd(inode, &inode_fd); +} if (res < 0) { saverr = -res; goto out_err; @@ -868,18 +880,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, if (fi) { truncfd = fd; } else { -truncfd = lo_inode_open(lo, inode, O_RDWR); -if (truncfd < 0) { -saverr = -truncfd; -goto out_err; -} +truncfd = inode_fd.fd; } saverr = drop_security_capability(lo, truncfd); if (saverr) { -if (!fi) { -close(truncfd); -} goto out_err; } @@ -887,9 +892,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, res = drop_effective_cap("FSETID", &cap_fsetid_dropped); if (res != 0) { saverr = res; -if (!fi) { -close(truncfd); -} goto out_err; } } @@ -902,9 +904,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, fuse_log(FUSE_LOG_ERR, "Failed to gain CAP_FSETID\n"); } } -if (!fi) { -close(truncfd); -} if (res == -1) { goto out_err; } @@ -1734,11 +1733,12 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct fuse_file_info *fi) static void lo_opendir(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi) { +g_auto(TempFd) inode_fd = TEMP_FD_INIT; int error = ENOMEM; struct lo_data *lo = lo_data(req); struct lo_inode *inode; struct lo_dirp *d = NULL; -int fd; +int res; ssize_t fh; inode = lo_inode(req, ino); @@ -1752,13 +1752,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino, goto out_err; } -fd = lo_inode_open(lo, inode, O_RDONLY); -if (fd < 0) { -error = -fd; +res = lo_inode_open(lo, inode, O_RDONLY, &inode_fd); +if (res < 0) { +error = -res; goto out_err; } -d->dp = fdopendir(fd); +d->dp = fdopendir(temp_fd_steal(&inode_fd)); if (d->dp == NULL) { goto out_errno; } @@ -1788,8 +1788,6 @@ out_err: if (d) { if (d->dp) { closedir(d->dp); -} else if (fd != -1) { -close(fd); } free
Re: [PATCH v2 33/33] iotests/image-fleecing: add test-case for copy-before-write filter
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: New fleecing method becomes available: copy-before-write filter. Actually we don't need backup job to setup image fleecing. Add test for new recommended way of image fleecing. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/tests/image-fleecing | 50 +- tests/qemu-iotests/tests/image-fleecing.out | 72 + 2 files changed, 107 insertions(+), 15 deletions(-) diff --git a/tests/qemu-iotests/tests/image-fleecing b/tests/qemu-iotests/tests/image-fleecing index e210c00d28..404ebc00f1 100755 --- a/tests/qemu-iotests/tests/image-fleecing +++ b/tests/qemu-iotests/tests/image-fleecing [...] @@ -90,12 +91,22 @@ def do_test(base_img_path, fleece_img_path, nbd_sock_path, vm): 'backing': src_node, })) -# Establish COW from source to fleecing node -log(vm.qmp('blockdev-backup', - job_id='fleecing', - device=src_node, - target=tmp_node, - sync='none')) +# Establish CBW from source to fleecing node +if use_cbw: +log(vm.qmp('blockdev-add', **{ I thought this should work without ** now. With them dropped: Reviewed-by: Max Reitz +'driver': 'copy-before-write', +'node-name': 'fl-cbw', +'file': src_node, +'target': tmp_node +})) + +log(vm.qmp('qom-set', path=qom_path, property='drive', value='fl-cbw')) +else: +log(vm.qmp('blockdev-backup', + job_id='fleecing', + device=src_node, + target=tmp_node, + sync='none')) log('') log('--- Setting up NBD Export ---')
Re: [PATCH v2 32/33] iotests/image-fleecing: prepare for adding new test-case
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: We are going to add a test-case with some behavior modifications. So, let's prepare a function to be reused. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/tests/image-fleecing | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 31/33] iotests/image-fleecing: rename tgt_node
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: Actually target of backup(sync=None) is not a final backup target: image fleecing is intended to be used with external tool, which will copy data from fleecing node to some real backup target. Also, we are going to add a test case for "push backup with fleecing", where instead of exporting fleecing node by NBD, we'll start a backup job from fleecing node to real backup target. To avoid confusion, let's rename temporary fleecing node now. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/tests/image-fleecing | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 30/33] iotests/image-fleecing: proper source device
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: Define scsi device to operate with it by qom-set in further patch. Give a new node-name to source block node, to not look like device name. Job now don't want to work without giving explicit id, so, let's call it "fleecing". Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/tests/image-fleecing | 12 tests/qemu-iotests/tests/image-fleecing.out | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 29/33] iotests.py: hmp_qemu_io: support qdev
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/iotests.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 28/33] iotests: move 222 to tests/image-fleecing
On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote: Give a good name to test file. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/{222 => tests/image-fleecing} | 0 tests/qemu-iotests/{222.out => tests/image-fleecing.out} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename tests/qemu-iotests/{222 => tests/image-fleecing} (100%) rename tests/qemu-iotests/{222.out => tests/image-fleecing.out} (100%) Reviewed-by: Max Reitz
Re: [PATCH v2 27/33] iotests/222: constantly use single quotes for strings
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: The file use both single and double quotes for strings. Let's be consistent. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/222 | 68 +- 1 file changed, 34 insertions(+), 34 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 26/33] iotests/222: fix pylint and mypy complains
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: Here: - long line - move to new interface of vm.qmp() (direct passing dict), to avoid mypy false-positive, as it thinks that unpacked dict is a positional argument. - extra parenthesis - handle event_wait possible None value Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/222 | 20 +++- tests/qemu-iotests/297 | 2 +- 2 files changed, 12 insertions(+), 10 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 25/33] iotests.py: VM: add own __enter__ method
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: In superclass __enter__ method is defined with return value type hint 'QEMUMachine'. So, mypy thinks that return value of VM.__enter__ is QEMUMachine. Let's redefine __enter__ in VM class, to give it correct type hint. Signed-off-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/iotests.py | 4 1 file changed, 4 insertions(+) Reviewed-by: Max Reitz
Re: [PATCH v2 24/33] python/qemu/machine: QEMUMachine: improve qmp() method
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: We often call qmp() with unpacking dict, like qmp('foo', **{...}). mypy don't really like it, it things that passed unpacked dict is a s/things/thinks/ positional argument and complains that it type should be bool (because second argument of qmp() is conv_keys: bool). I guess it would have been nice to see some examples of this that can now be written in a cleaner way, but well. Allow possing dict directly, simplifying interface, and giving a way to s/possing/passing/ satisfy mypy. Signed-off-by: Vladimir Sementsov-Ogievskiy --- python/qemu/machine.py | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) Reviewed-by: Max Reitz
Re: [PATCH v2 23/33] python/qemu/machine.py: refactor _qemu_args()
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: - use shorter construction - don't create new dict if not needed - drop extra unpacking key-val arguments - drop extra default values Signed-off-by: Vladimir Sementsov-Ogievskiy --- python/qemu/machine.py | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) Reviewed-by: Max Reitz
Re: [PATCH v2 22/33] qapi: publish copy-before-write filter
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: Signed-off-by: Vladimir Sementsov-Ogievskiy --- qapi/block-core.json | 30 ++ 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 2ea294129e..8c4801a13d 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2808,15 +2808,17 @@ # @blklogwrites: Since 3.0 # @blkreplay: Since 4.2 # @compress: Since 5.0 +# @copy-before-write: Since 6.1 # # Since: 2.9 ## { 'enum': 'BlockdevDriver', 'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs', -'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps', -'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi', -'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', -'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', +'cloop', 'compress', 'copy-before-write', 'copy-on-read', 'dmg', +'file', 'ftp', 'ftps', 'gluster', 'host_cdrom', 'host_device', +'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs', 'null-aio', +'null-co', 'nvme', 'parallels', 'preallocate', 'qcow', 'qcow2', +'qed', 'quorum', 'raw', 'rbd', { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' }, 'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } @@ -3937,6 +3939,25 @@ 'base': 'BlockdevOptionsGenericFormat', 'data': { '*bottom': 'str' } } +## +# @BlockdevOptionsCbw: +# +# Driver specific block device options for the copy-before-write driver, +# which does so called copy-before-write operations: when data is +# written to the filter, the filter firstly reads corresponding blocks +# from its file child and copies them to @target child. After successful +# copying the write request is propagated to file child. If copying +# filed, the original write request is failed too and no data is written s/filed/failed/ With that fixed: Reviewed-by: Max Reitz +# to file child. +# +# @target: The target for copy-before-write operations. +# +# Since: 6.1 +## +{ 'struct': 'BlockdevOptionsCbw', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { 'target': 'BlockdevRef' } } + ## # @BlockdevOptions: # @@ -3989,6 +4010,7 @@ 'bochs': 'BlockdevOptionsGenericFormat', 'cloop': 'BlockdevOptionsGenericFormat', 'compress': 'BlockdevOptionsGenericFormat', + 'copy-before-write':'BlockdevOptionsCbw', 'copy-on-read':'BlockdevOptionsCor', 'dmg':'BlockdevOptionsGenericFormat', 'file': 'BlockdevOptionsFile',
Re: [PATCH v2 19/33] block/copy-before-write: initialize block-copy bitmap
On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote: We are going to publish copy-before-write filter to be used in separate of backup. Future step would support bitmap for the filter. But let's start from full set bitmap. We have to modify backup, as bitmap is first initialized by copy-before-write filter, and then backup modifies it. Signed-off-by: Vladimir Sementsov-Ogievskiy --- block/backup.c| 16 +++- block/copy-before-write.c | 4 2 files changed, 11 insertions(+), 9 deletions(-) Reviewed-by: Max Reitz