Re: [PATCH v5 00/11] block: file-posix queue

2021-06-25 Thread Max Reitz

On 25.06.21 10:52, Paolo Bonzini wrote:

On 25/06/21 10:37, Max Reitz wrote:
Thanks, looks good to me, though I’m afraid I’ll be on PTO the next 
two weeks so I can’t take this series through my tree... (I don’t 
really want to send a pull request today when I probably wouldn’t be 
able to deal with potential problems)


I can take it too, if it's okay for you.


Sure!




Re: [PATCH v5 00/11] block: file-posix queue

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

New patches:
- 3/4 (for review comments),
- 9 (split for ease of review),
- 11 (new bugfix)

v1->v2: add missing patch

v2->v3: add max_hw_transfer to BlockLimits

v3->v4: fix compilation after patch 1, tweak commit messages according
 to Vladimir's review

v4->v5: round down max_transfer and max_hw_transfer to request alignment
 checkpatch fixes
 return -ENOTSUP, -not -EIO if block limits ioctls fail
 handle host_cdrom like host_device in QAPI
 split "block: try BSD disk size ioctls one after another"
 new bugfix patch "file-posix: handle EINTR during ioctl"


Thanks, looks good to me, though I’m afraid I’ll be on PTO the next two 
weeks so I can’t take this series through my tree... (I don’t really 
want to send a pull request today when I probably wouldn’t be able to 
deal with potential problems)


Max




Re: [PATCH 11/11] file-posix: handle EINTR during ioctl

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

Similar to other handle_aiocb_* functions, handle_aiocb_ioctl needs to cater
for the possibility that ioctl is interrupted by a signal.  Otherwise, the
I/O is incorrectly reported as a failure to the guest.

Reported-by: Gordon Watson 
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 74b8216077..a26eab0ac3 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1347,7 +1347,9 @@ static int handle_aiocb_ioctl(void *opaque)
  RawPosixAIOData *aiocb = opaque;
  int ret;
  
-ret = ioctl(aiocb->aio_fildes, aiocb->ioctl.cmd, aiocb->ioctl.buf);

+do {
+ret = ioctl(aiocb->aio_fildes, aiocb->ioctl.cmd, aiocb->ioctl.buf);
+} while (ret == -1 && errno == EINTR);
  if (ret == -1) {
  return -errno;
  }


I think the macro TFR() from qemu-common.h could be applied here, 
probably like `TFR(ret = ioctl(...));`.


No matter:

Reviewed-by: Max Reitz 




Re: [PATCH 10/11] block: detect DKIOCGETBLOCKCOUNT/SIZE before use

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

From: Joelle van Dyne 

iOS hosts do not have these defined so we fallback to the
default behaviour.

Co-authored-by: Warner Losh 
Signed-off-by: Joelle van Dyne 
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Max Reitz 




Re: [PATCH 09/11] block: try BSD disk size ioctls one after another

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

Try all the possible ioctls for disk size as long as they are
supported, to keep the #if ladder simple.

Extracted and cleaned up from a patch by Joelle van Dyne and
Warner Losh.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 34 --
  1 file changed, 16 insertions(+), 18 deletions(-)


Thanks, that’s much better.

Reviewed-by: Max Reitz 




Re: [PATCH 07/11] block: feature detection for host block support

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

From: Joelle van Dyne 

On Darwin (iOS), there are no system level APIs for directly accessing
host block devices. We detect this at configure time.

Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c   | 33 ++---
  meson.build  |  6 +-
  qapi/block-core.json | 14 ++
  3 files changed, 37 insertions(+), 16 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 06/11] file-posix: try BLKSECTGET on block devices too, do not round to power of 2

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

bs->sg is only true for character devices, but block devices can also
be used with scsi-block and scsi-generic.  Unfortunately BLKSECTGET
returns bytes in an int for /dev/sgN devices, and sectors in a short
for block devices, so account for that in the code.

The maximum transfer also need not be a power of 2 (for example I have
seen disks with 1280 KiB maximum transfer) so there's no need to pass
the result through pow2floor.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 67 ++
  1 file changed, 38 insertions(+), 29 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 05/11] block: add max_hw_transfer to BlockLimits

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

For block host devices, I/O can happen through either the kernel file
descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
or the SCSI passthrough ioctl SG_IO.

In the latter case, the size of each transfer can be limited by the
HBA, while for file descriptor I/O the kernel is able to split and
merge I/O in smaller pieces as needed.  Applying the HBA limits to
file descriptor I/O results in more system calls and suboptimal
performance, so this patch splits the max_transfer limit in two:
max_transfer remains valid and is used in general, while max_hw_transfer
is limited to the maximum hardware size.  max_hw_transfer can then be
included by the scsi-generic driver in the block limits page, to ensure
that the stricter hardware limit is used.

Signed-off-by: Paolo Bonzini 
---
  block/block-backend.c  | 13 +
  block/file-posix.c |  2 +-
  block/io.c |  2 ++
  hw/scsi/scsi-generic.c |  2 +-
  include/block/block_int.h  |  7 +++
  include/sysemu/block-backend.h |  1 +
  6 files changed, 25 insertions(+), 2 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 04/11] block-backend: align max_transfer to request alignment

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

Block device requests must be aligned to bs->bl.request_alignment.
It makes sense for drivers to align bs->bl.max_transfer the same
way; however when there is no specified limit, blk_get_max_transfer
just returns INT_MAX.  Since the contract of the function does not
specify that INT_MAX means "no maximum", just align the outcome
of the function (whether INT_MAX or bs->bl.max_transfer) before
returning it.

Signed-off-by: Paolo Bonzini 
---
  block/block-backend.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH 03/11] osdep: provide ROUND_DOWN macro

2021-06-25 Thread Max Reitz

On 24.06.21 20:04, Paolo Bonzini wrote:

osdep.h provides a ROUND_UP macro to hide bitwise operations for the
purpose of rounding a number up to a power of two; add a ROUND_DOWN
macro that does the same with truncation towards zero.

While at it, change the formatting of some comments.

Signed-off-by: Paolo Bonzini 
---
  include/qemu/osdep.h | 28 ++--
  1 file changed, 22 insertions(+), 6 deletions(-)


What a nice thing to have.

Reviewed-by: Max Reitz 




[PULL 1/1] block/snapshot: Clarify goto fallback behavior

2021-06-24 Thread Max Reitz
In the bdrv_snapshot_goto() fallback code, we work with a pointer to
either bs->file or bs->backing.  We detach that child, close the node
(with .bdrv_close()), apply the snapshot on the child node, and then
re-open the node (with .bdrv_open()).

In order for .bdrv_open() to attach the same child node that we had
before, we pass "file={child-node}" or "backing={child-node}" to it.
Therefore, when .bdrv_open() has returned success, we can assume that
bs->file or bs->backing (respectively) points to our original child
again.  This is verified by an assertion.

All of this is not immediately clear from a quick glance at the code,
so add a comment to the assertion what it is for, and why it is valid.
It certainly confused Coverity.

Reported-by: Coverity (CID 1452774)
Signed-off-by: Max Reitz 
Message-Id: <20210503095418.31521-1-mre...@redhat.com>
[mreitz: s/close/detach/]
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/snapshot.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index 6702c75e42..ccacda8bd5 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -275,13 +275,16 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 qobject_unref(file_options);
 g_free(subqdict_prefix);
 
+/* Force .bdrv_open() below to re-attach fallback_bs on *fallback_ptr 
*/
 qdict_put_str(options, (*fallback_ptr)->name,
   bdrv_get_node_name(fallback_bs));
 
+/* Now close bs, apply the snapshot on fallback_bs, and re-open bs */
 if (drv->bdrv_close) {
 drv->bdrv_close(bs);
 }
 
+/* .bdrv_open() will re-attach it */
 bdrv_unref_child(bs, *fallback_ptr);
 *fallback_ptr = NULL;
 
@@ -296,7 +299,16 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 return ret < 0 ? ret : open_ret;
 }
 
-assert(fallback_bs == (*fallback_ptr)->bs);
+/*
+ * fallback_ptr is &bs->file or &bs->backing.  *fallback_ptr
+ * was closed above and set to NULL, but the .bdrv_open() call
+ * has opened it again, because we set the respective option
+ * (with the qdict_put_str() call above).
+ * Assert that .bdrv_open() has attached some child on
+ * *fallback_ptr, and that it has attached the one we wanted
+ * it to (i.e., fallback_bs).
+ */
+assert(*fallback_ptr && fallback_bs == (*fallback_ptr)->bs);
 bdrv_unref(fallback_bs);
 return ret;
 }
-- 
2.31.1




[PULL 0/1] Block patch

2021-06-24 Thread Max Reitz
The following changes since commit b22726abdfa54592d6ad88f65b0297c0e8b363e2:

  Merge remote-tracking branch 
'remotes/vivier2/tags/linux-user-for-6.1-pull-request' into staging (2021-06-22 
16:07:53 +0100)

are available in the Git repository at:

  https://github.com/XanClic/qemu.git tags/pull-block-2021-06-24

for you to fetch changes up to 32a9a245d719a883eef2cbf07d2cf89efa0206d0:

  block/snapshot: Clarify goto fallback behavior (2021-06-24 09:49:04 +0200)


Block patch:
- Fix Coverity complaint in block/snapshot.c

--------
Max Reitz (1):
  block/snapshot: Clarify goto fallback behavior

 block/snapshot.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

-- 
2.31.1




Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum

2021-06-24 Thread Max Reitz

On 24.06.21 12:25, Vladimir Sementsov-Ogievskiy wrote:

24.06.2021 13:16, Max Reitz wrote:

On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

23.06.2021 18:01, Max Reitz wrote:
.bdrv_co_block_status() implementations are free to return a *pnum 
that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will 
clamp

*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for 
*pnum.


Suggested-by: Eric Blake 
Signed-off-by: Max Reitz 
---
  include/block/block_int.h | 5 +
  1 file changed, 5 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index fcb599dd1c..f85b96ed99 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -347,6 +347,11 @@ struct BlockDriver {
   * clamped to bdrv_getlength() and aligned to request_alignment,
   * as well as non-NULL pnum, map, and file; in turn, the driver
   * must return an error or set pnum to an aligned non-zero 
value.

+ *
+ * Note that @bytes is just a hint on how big of a region the
+ * caller wants to inspect.  It is not a limit on *pnum.
+ * Implementations are free to return larger values of *pnum if
+ * doing so does not incur a performance penalty.


Worth mention that the cache will benefit of it?


Oh, right, absolutely.  Like so:

"block/io.c's bdrv_co_block_status() will clamp *pnum before 
returning it to its caller, but it itself can still make use of the 
unclamped *pnum value.  Specifically, the block-status cache for 
protocol nodes will benefit from storing as large a region as possible."




Sounds good. Do you mean this as an addition or substitution? If the 
latter, I'd keep "if doing so does not incur a performance penalty 


I meant it as an addition, just a new paragraph.

Max




Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-06-24 Thread Max Reitz

On 24.06.21 12:05, Vladimir Sementsov-Ogievskiy wrote:

23.06.2021 18:01, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 


I'm new to RCU, so my review can't be reliable..


Yeah, well, same here. :)


---
  include/block/block_int.h | 47 ++
  block.c   | 84 +++
  block/io.c    | 61 ++--
  3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  +/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not 
necessarily

+ *    the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
  struct BlockDriverState {
  /* Protected by big QEMU lock or read-only after opening. No 
special

   * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
    /* BdrvChild links to this node may never be frozen */
  bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU 
read guard */

+    BdrvBlockStatusCache *block_status_cache;>   };
    struct BlockBackendRootState {
@@ -1422,4 +1443,30 @@ static inline BlockDriverState 
*bdrv_primary_bs(BlockDriverState *bs)

   */
  void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
  +/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t 
*pnum);

+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+   int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t 
bytes);

+
  #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@

Re: [PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum

2021-06-24 Thread Max Reitz

On 24.06.21 11:15, Vladimir Sementsov-Ogievskiy wrote:

23.06.2021 18:01, Max Reitz wrote:

.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.

Suggested-by: Eric Blake 
Signed-off-by: Max Reitz 
---
  include/block/block_int.h | 5 +
  1 file changed, 5 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index fcb599dd1c..f85b96ed99 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -347,6 +347,11 @@ struct BlockDriver {
   * clamped to bdrv_getlength() and aligned to request_alignment,
   * as well as non-NULL pnum, map, and file; in turn, the driver
   * must return an error or set pnum to an aligned non-zero value.
+ *
+ * Note that @bytes is just a hint on how big of a region the
+ * caller wants to inspect.  It is not a limit on *pnum.
+ * Implementations are free to return larger values of *pnum if
+ * doing so does not incur a performance penalty.


Worth mention that the cache will benefit of it?


Oh, right, absolutely.  Like so:

"block/io.c's bdrv_co_block_status() will clamp *pnum before returning 
it to its caller, but it itself can still make use of the unclamped 
*pnum value.  Specifically, the block-status cache for protocol nodes 
will benefit from storing as large a region as possible."


?

Max




[PATCH v2 2/2] iotests/307: Test iothread conflict for exports

2021-06-24 Thread Max Reitz
Passing fixed-iothread=true should make iothread conflicts fatal,
whereas fixed-iothread=false should not.

Combine the second case with an error condition that is checked after
the iothread is handled, to verify that qemu does not crash if there is
such an error after changing the iothread failed.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/307 | 15 +++
 tests/qemu-iotests/307.out |  8 
 2 files changed, 23 insertions(+)

diff --git a/tests/qemu-iotests/307 b/tests/qemu-iotests/307
index c7685347bc..b429b5aa50 100755
--- a/tests/qemu-iotests/307
+++ b/tests/qemu-iotests/307
@@ -41,9 +41,11 @@ with iotests.FilePath('image') as img, \
 iotests.log('=== Launch VM ===')
 
 vm.add_object('iothread,id=iothread0')
+vm.add_object('iothread,id=iothread1')
 vm.add_blockdev(f'file,filename={img},node-name=file')
 vm.add_blockdev(f'{iotests.imgfmt},file=file,node-name=fmt')
 vm.add_blockdev('raw,file=file,node-name=ro,read-only=on')
+vm.add_blockdev('null-co,node-name=null')
 vm.add_device(f'id=scsi0,driver=virtio-scsi,iothread=iothread0')
 vm.launch()
 
@@ -74,6 +76,19 @@ with iotests.FilePath('image') as img, \
 vm.qmp_log('query-block-exports')
 iotests.qemu_nbd_list_log('-k', socket)
 
+iotests.log('\n=== Add export with conflicting iothread ===')
+
+vm.qmp_log('device_add', id='sdb', driver='scsi-hd', drive='null')
+
+# Should fail because of fixed-iothread
+vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null',
+   iothread='iothread1', fixed_iothread=True, writable=True)
+
+# Should ignore the iothread conflict, but then fail because of the
+# permission conflict (and not crash)
+vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null',
+   iothread='iothread1', fixed_iothread=False, writable=True)
+
 iotests.log('\n=== Add a writable export ===')
 
 # This fails because share-rw=off
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index 4b0c7e155a..ec8d2be0e0 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -51,6 +51,14 @@ exports available: 1
base:allocation
 
 
+=== Add export with conflicting iothread ===
+{"execute": "device_add", "arguments": {"drive": "null", "driver": "scsi-hd", 
"id": "sdb"}}
+{"return": {}}
+{"execute": "block-export-add", "arguments": {"fixed-iothread": true, "id": 
"export1", "iothread": "iothread1", "node-name": "null", "type": "nbd", 
"writable": true}}
+{"error": {"class": "GenericError", "desc": "Cannot change iothread of active 
block backend"}}
+{"execute": "block-export-add", "arguments": {"fixed-iothread": false, "id": 
"export1", "iothread": "iothread1", "node-name": "null", "type": "nbd", 
"writable": true}}
+{"error": {"class": "GenericError", "desc": "Permission conflict on node 
'null': permissions 'write' are both required by an unnamed block device (uses 
node 'null' as 'root' child) and unshared by block device 'sdb' (uses node 
'null' as 'root' child)."}}
+
 === Add a writable export ===
 {"execute": "block-export-add", "arguments": {"description": "This is the 
writable second export", "id": "export1", "name": "export1", "node-name": 
"fmt", "type": "nbd", "writable": true, "writethrough": true}}
 {"error": {"class": "GenericError", "desc": "Permission conflict on node 
'fmt': permissions 'write' are both required by an unnamed block device (uses 
node 'fmt' as 'root' child) and unshared by block device 'sda' (uses node 'fmt' 
as 'root' child)."}}
-- 
2.31.1




[PATCH v2 0/2] block/export: Conditionally ignore set-context error

2021-06-24 Thread Max Reitz
Hi,

I had this v2 lying around for quite some time but for some reason never
sent it.  I probably just forgot.  Sorry.

v1:
https://lists.nongnu.org/archive/html/qemu-block/2021-04/msg00584.html

Changes in v2:
- Letting `bdrv_try_set_aio_context()` return an error and then just
  discarding it conditionally is kind of not right.
  If we want to ignore the error, decide so from the start: Depending on
  `fixed_iothread`, pass either `errp` or `NULL`.

- Patch 2: Reference output has changed because of 30ebb9aa920.


git-backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/2:[0009] [FC] 'block/export: Conditionally ignore set-context error'
002/2:[0002] [FC] 'iotests/307: Test iothread conflict for exports'


Max Reitz (2):
  block/export: Conditionally ignore set-context error
  iotests/307: Test iothread conflict for exports

 block/export/export.c  |  5 -
 tests/qemu-iotests/307 | 15 +++
 tests/qemu-iotests/307.out |  8 
 3 files changed, 27 insertions(+), 1 deletion(-)

-- 
2.31.1




[PATCH v2 1/2] block/export: Conditionally ignore set-context error

2021-06-24 Thread Max Reitz
When invoking block-export-add with some iothread and
fixed-iothread=false, and changing the node's iothread fails, the error
is supposed to be ignored.

However, it is still stored in *errp, which is wrong.  If a second error
occurs, the "*errp must be NULL" assertion in error_setv() fails:

  qemu-system-x86_64: ../util/error.c:59: error_setv: Assertion
  `*errp == NULL' failed.

So if fixed-iothread=false, we should ignore the error by passing NULL
to bdrv_try_set_aio_context().

Fixes: f51d23c80af73c95e0ce703ad06a300f1b3d63ef
   ("block/export: add iothread and fixed-iothread options")
Signed-off-by: Max Reitz 
---
 block/export/export.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/export/export.c b/block/export/export.c
index fec7d9f738..6d3b9964c8 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -111,6 +111,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error 
**errp)
 if (export->has_iothread) {
 IOThread *iothread;
 AioContext *new_ctx;
+Error **set_context_errp;
 
 iothread = iothread_by_id(export->iothread);
 if (!iothread) {
@@ -120,7 +121,9 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error 
**errp)
 
 new_ctx = iothread_get_aio_context(iothread);
 
-ret = bdrv_try_set_aio_context(bs, new_ctx, errp);
+/* Ignore errors with fixed-iothread=false */
+set_context_errp = fixed_iothread ? errp : NULL;
+ret = bdrv_try_set_aio_context(bs, new_ctx, set_context_errp);
 if (ret == 0) {
 aio_context_release(ctx);
 aio_context_acquire(new_ctx);
-- 
2.31.1




Re: [PATCH v4 7/7] block: detect DKIOCGETBLOCKCOUNT/SIZE before use

2021-06-23 Thread Max Reitz

On 08.06.21 15:16, Paolo Bonzini wrote:

From: Joelle van Dyne 

iOS hosts do not have these defined so we fallback to the
default behaviour.

Co-authored-by: Warner Losh 
Reviewed-by: Peter Maydell 
Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 21 +
  1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5821e1afed..4e2f7cf508 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2322,8 +2322,11 @@ static int64_t raw_getlength(BlockDriverState *bs)
  again:
  #endif
  if (!fstat(fd, &sb) && (S_IFCHR & sb.st_mode)) {
+size = 0;
  #ifdef DIOCGMEDIASIZE
-if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size))
+if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) {
+size = 0;
+}
  #elif defined(DIOCGPART)
  {
  struct partinfo pi;
@@ -2332,9 +2335,7 @@ again:
  else
  size = 0;
  }
-if (size == 0)
-#endif
-#if defined(__APPLE__) && defined(__MACH__)
+#elif defined(DKIOCGETBLOCKCOUNT) && defined(DKIOCGETBLOCKSIZE)


In v3, I was wondering whether it’s intentional that the following 
DKIOCGETBLOCKCOUNT/SIZE block would no longer be used as a fallback if 
the DIOCGMEDIASIZE ioctl failed (which it was before this patch).  I’m 
still wondering.


Max


  {
  uint64_t sectors = 0;
  uint32_t sector_size = 0;
@@ -2342,19 +2343,15 @@ again:
  if (ioctl(fd, DKIOCGETBLOCKCOUNT, §ors) == 0
 && ioctl(fd, DKIOCGETBLOCKSIZE, §or_size) == 0) {
  size = sectors * sector_size;
-} else {
-size = lseek(fd, 0LL, SEEK_END);
-if (size < 0) {
-return -errno;
-}
  }
  }
-#else
-size = lseek(fd, 0LL, SEEK_END);
+#endif
+if (size == 0) {
+size = lseek(fd, 0LL, SEEK_END);
+}
  if (size < 0) {
  return -errno;
  }
-#endif
  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
  switch(s->type) {
  case FTYPE_CD:





Re: [PATCH v4 5/7] block: feature detection for host block support

2021-06-23 Thread Max Reitz

On 08.06.21 15:16, Paolo Bonzini wrote:

From: Joelle van Dyne 

On Darwin (iOS), there are no system level APIs for directly accessing
host block devices. We detect this at configure time.

Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c   | 33 ++---
  meson.build  |  6 +-
  qapi/block-core.json | 10 +++---
  3 files changed, 34 insertions(+), 15 deletions(-)


[...]


diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2ea294129e..2dd41be156 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -897,7 +897,8 @@
'discriminator': 'driver',
'data': {
'file': 'BlockStatsSpecificFile',
-  'host_device': 'BlockStatsSpecificFile',
+  'host_device': { 'type': 'BlockStatsSpecificFile',
+   'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },
'nvme': 'BlockStatsSpecificNvme' } }
  
  ##

@@ -2814,7 +2815,9 @@
  { 'enum': 'BlockdevDriver',
'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs',
  'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
-'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
+'gluster', 'host_cdrom',
+{'name': 'host_device', 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },
+'http', 'https', 'iscsi',
  'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
  'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
  { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
@@ -3996,7 +3999,8 @@
'ftps':   'BlockdevOptionsCurlFtps',
'gluster':'BlockdevOptionsGluster',
'host_cdrom': 'BlockdevOptionsFile',
-  'host_device':'BlockdevOptionsFile',
+  'host_device': { 'type': 'BlockdevOptionsFile',
+   'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },
'http':   'BlockdevOptionsCurlHttp',
'https':  'BlockdevOptionsCurlHttps',
'iscsi':  'BlockdevOptionsIscsi',


As I asked in v3: What about host_cdrom?

Max




Re: [PATCH v4 1/7] file-posix: fix max_iov for /dev/sg devices

2021-06-23 Thread Max Reitz

On 08.06.21 21:14, Vladimir Sementsov-Ogievskiy wrote:

08.06.2021 16:16, Paolo Bonzini wrote:

Even though it was only called for devices that have bs->sg set (which
must be character devices), sg_get_max_segments looked at /sys/dev/block
which only works for block devices.

On Linux the sg driver has its own way to provide the maximum number of
iovecs in a scatter/gather list, so add support for it.  The block 
device

path is kept because it will be reinstated in the next patches.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index f37dfc10b3..536998a1d6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1180,6 +1180,17 @@ static int sg_get_max_segments(int fd)
  goto out;
  }
  +    if (S_ISCHR(st.st_mode)) {


Why not check "if (bs->sg) {" instead? It seems to be more consistent 
with issuing SG_ ioctl. Or what I miss?


I dismissed this in v3, because I didn’t understand why you’d raise this 
point.  The function is called sg_*(), and it’s only called if bs->sg is 
true anyway.  So clearly we can use SG_ ioctls, because the whole 
function is intended only for SG devices anyway.


This time, I looked forward, and perhaps starting at patch 4 I can 
understand where you’re coming from, because then the function is used 
for host devices in general.


So now I don’t particularly mind.  I think it’s still clear that if 
there’s a host device here that’s a character device, then that’s going 
to be an SG device, so I don’t really have a preference between 
S_ISCHR() and bs->sg.


Max


+    if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+    return ret;
+    }
+    return -ENOTSUP;
+    }
+
+    if (!S_ISBLK(st.st_mode)) {
+    return -ENOTSUP;
+    }
+
  sysfspath = 
g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",

  major(st.st_rdev), minor(st.st_rdev));
  sysfd = open(sysfspath, O_RDONLY);









[PATCH v2 2/6] block: block-status cache for data regions

2021-06-23 Thread Max Reitz
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 47 ++
 block.c   | 84 +++
 block/io.c| 61 ++--
 3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
 /* Protected by big QEMU lock or read-only after opening.  No special
  * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
 
 /* BdrvChild links to this node may never be frozen */
 bool never_freeze;
+
+/* Lock for block-status cache RCU writers */
+CoMutex bsc_modify_lock;
+/* Always non-NULL, but must only be dereferenced under an RCU read guard 
*/
+BdrvBlockStatusCache *block_status_cache;
 };
 
 struct BlockBackendRootState {
@@ -1422,4 +1443,30 @@ static inline BlockDriverState 
*bdrv_primary_bs(BlockDriverState *bs)
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+   int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
 #include "qemu/timer.h"
 #include "qemu/cutils.h"
 #include "qemu/id.h"
+#include "qemu/range.h"
+#include "qemu/rcu.h"
 #include "bl

[PATCH v2 4/6] block/file-posix: Do not force-cap *pnum

2021-06-23 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/file-posix.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b3fbb9bd63..aeb370d5bb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t 
start,
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  */
 static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
 bool want_zero,
@@ -2727,7 +2728,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
 } else if (data == offset) {
 /* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(bytes, hole - offset);
+*pnum = hole - offset;
 
 /*
  * We are not allowed to return partial sectors, though, so
@@ -2746,7 +2747,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
 } else {
 /* On a hole, compute bytes to the beginning of the next extent.  */
 assert(hole == offset);
-*pnum = MIN(bytes, data - offset);
+*pnum = data - offset;
 ret = BDRV_BLOCK_ZERO;
 }
 *map = offset;
-- 
2.31.1




[PATCH v2 6/6] block/iscsi: Do not force-cap *pnum

2021-06-23 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/iscsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4d2a416ce7..852384086b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -781,9 +781,6 @@ retry:
 iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
 }
 
-if (*pnum > bytes) {
-*pnum = bytes;
-}
 out_unlock:
 qemu_mutex_unlock(&iscsilun->mutex);
 g_free(iTask.err_str);
-- 
2.31.1




[PATCH v2 1/6] block: Drop BDS comment regarding bdrv_append()

2021-06-23 Thread Max Reitz
There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 057d88b1fc..a8f9598102 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,12 +832,6 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
-/*
- * Note: the function bdrv_append() copies and swaps contents of
- * BlockDriverStates, so if you add new fields to this struct, please
- * inspect bdrv_append() to determine if the new fields need to be
- * copied as well.
- */
 struct BlockDriverState {
 /* Protected by big QEMU lock or read-only after opening.  No special
  * locking needed during I/O...
-- 
2.31.1




[PATCH v2 5/6] block/gluster: Do not force-cap *pnum

2021-06-23 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/gluster.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  *
  * (Based on raw_co_block_status() from file-posix.c.)
  */
@@ -1500,12 +1501,12 @@ static int coroutine_fn 
qemu_gluster_co_block_status(BlockDriverState *bs,
 } else if (data == offset) {
 /* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(bytes, hole - offset);
+*pnum = hole - offset;
 ret = BDRV_BLOCK_DATA;
 } else {
 /* On a hole, compute bytes to the beginning of the next extent.  */
 assert(hole == offset);
-*pnum = MIN(bytes, data - offset);
+*pnum = data - offset;
 ret = BDRV_BLOCK_ZERO;
 }
 
-- 
2.31.1




[PATCH v2 3/6] block: Clarify that @bytes is no limit on *pnum

2021-06-23 Thread Max Reitz
.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.

Suggested-by: Eric Blake 
Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index fcb599dd1c..f85b96ed99 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -347,6 +347,11 @@ struct BlockDriver {
  * clamped to bdrv_getlength() and aligned to request_alignment,
  * as well as non-NULL pnum, map, and file; in turn, the driver
  * must return an error or set pnum to an aligned non-zero value.
+ *
+ * Note that @bytes is just a hint on how big of a region the
+ * caller wants to inspect.  It is not a limit on *pnum.
+ * Implementations are free to return larger values of *pnum if
+ * doing so does not incur a performance penalty.
  */
 int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
 bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
-- 
2.31.1




[PATCH v2 0/6] block: block-status cache for data regions

2021-06-23 Thread Max Reitz
Hi,

See the cover letter from v1 for the general idea:

https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html


The biggest change here in v2 is that instead of having a common CoMutex
protect the block-status cache, we’re using RCU now.  So to read from
the cache, or even to invalidate it, no lock is needed, only to update
it with new data.

Disclaimer: I have no experience with RCU in practice so far, neither in
qemu nor anywhere else.  So I hope I’ve used it correctly...


Differences to v1 in detail:
- Patch 2:
  - Moved BdrvBlockStatusCache.lock up to BDS, it is now the RCU writer
lock
  - BDS.block_status_cache is now a pointer, so it can be replaced with
RCU
  - Moved all cache access functionality into helper functions
(bdrv_bsc_is_data(), bdrv_bsc_invalidate_range(), bdrv_bsc_fill())
in block.c
  - Guard BSC accesses with RCU
(BSC.valid is to be accessed atomically, which allows resetting it
without taking an RCU write lock)
  - Check QLIST_EMPTY(&bs->children) not just when reading from the
cache, but when filling it, too (so we don’t need an RCU update when
it won’t make sense)
- Patch 3: Added
- Dropped the block/nbd patch (because it would make NBD query a larger
  range; the patch’s intent was to get more information for free, which
  this would not be)


git-backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/6:[] [--] 'block: Drop BDS comment regarding bdrv_append()'
002/6:[0169] [FC] 'block: block-status cache for data regions'
003/6:[down] 'block: Clarify that @bytes is no limit on *pnum'
004/6:[] [--] 'block/file-posix: Do not force-cap *pnum'
005/6:[] [--] 'block/gluster: Do not force-cap *pnum'
006/6:[----] [--] 'block/iscsi: Do not force-cap *pnum'


Max Reitz (6):
  block: Drop BDS comment regarding bdrv_append()
  block: block-status cache for data regions
  block: Clarify that @bytes is no limit on *pnum
  block/file-posix: Do not force-cap *pnum
  block/gluster: Do not force-cap *pnum
  block/iscsi: Do not force-cap *pnum

 include/block/block_int.h | 54 +++--
 block.c   | 84 +++
 block/file-posix.c|  7 ++--
 block/gluster.c   |  7 ++--
 block/io.c| 61 ++--
 block/iscsi.c |  3 --
 6 files changed, 200 insertions(+), 16 deletions(-)

-- 
2.31.1




Re: [PATCH 4/4] iotests/308: Test allow-other

2021-06-22 Thread Max Reitz

On 22.06.21 17:08, Kevin Wolf wrote:

Am 14.06.2021 um 16:44 hat Max Reitz geschrieben:

We cannot reasonably test the main point of allow-other, which is to
allow users other than the current one to access the FUSE export,
because that would require access to sudo, which this test most likely
will not have.  (Also, we would need to figure out some user/group that
is on the machine and that is not the current user/group, which may
become a bit hairy.)

But we can test some byproducts: First, whether changing permissions
works (our FUSE code only allows so for allow-other=true), and second,
whether the kernel applies permission checks with allow-other=true
(because that implies default_permissions).

Signed-off-by: Max Reitz 

This seems to have the problem that you mentioned:

--- /home/kwolf/source/qemu/tests/qemu-iotests/308.out
+++ 308.out.bad
@@ -205,7 +205,9 @@
   'writable': true,
   'allow-other': true
} }
-{"return": {}}
+fusermount3: option allow_other only allowed if 'user_allow_other' is set in 
/etc/fuse.conf
+{"error": {"class": "GenericError", "desc": "Failed to mount FUSE session to 
export"}}
+Timeout waiting for return on handle 2
  (Invoking chmod)
  Permissions post-chmod: 666
  (Removing all permissions)

Maybe it should be a separate test case that is skipped with
user_allow_other is disabled.


Right.


  tests/qemu-iotests/308 | 91 ++
  tests/qemu-iotests/308.out | 47 
  2 files changed, 138 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index f122065d0f..1b2f908947 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -334,6 +334,97 @@ echo '=== Compare copy with original ==='
  
  $QEMU_IMG compare -f raw -F $IMGFMT "$COPIED_IMG" "$TEST_IMG"
  
+echo

+echo '=== Test permissions ==='
+
+# Test that you can only change permissions on the export with 
allow-other=true.
+# We cannot really test the primary reason behind allow-other (i.e. to allow
+# users other than the current one access to the export), because for that we
+# would need sudo, which realistically nobody will allow this test to use.
+# What we can do is test that allow-other=true also enables 
default_permissions,
+# i.e. whether we can still read from the file if we remove the read 
permission.

We already have other test cases that use sudo if available. Though I
guess it means that these tests aren't run very often.


Yes, I know, but honestly I don’t really want to deal with user 
management either.  I had a paragraph about that in a preliminary 
version but decided to cut it, because, I thought it wouldn’t really matter.


That problem is that I’d need to run qemu-io as some different user, and 
the question is, who is a different user?  I suppose I could rely on 
“root” and “nobody” being valid users on any system, but I don’t think I 
can be sure that the user running the tests isn’t either of those.  So I 
would have to check whether the current user is “root”, and then run it 
as “nobody”, or otherwise run it as “root”, but that just seems like I’m 
getting in too deep for something that isn’t really useful anyway, 
because on developers’ machines, it will most likely be skipped anyway.


Max




Re: [PATCH 3/4] export/fuse: Let permissions be adjustable

2021-06-22 Thread Max Reitz

On 22.06.21 17:02, Kevin Wolf wrote:

Am 14.06.2021 um 16:44 hat Max Reitz geschrieben:

Allow changing the file mode, UID, and GID through SETATTR.

This only really makes sense with allow-other, though (because without
it, the effective access mode is fixed to be 0600 (u+rw) with qemu's
user being the file's owner), so changing these stat fields is not
allowed without allow-other.

Signed-off-by: Max Reitz 
---
  block/export/fuse.c | 48 ++---
  1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 1d54286d90..742e0af657 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -47,6 +47,10 @@ typedef struct FuseExport {
  bool writable;
  bool growable;
  bool allow_other;
+
+mode_t st_mode;
+uid_t st_uid;
+gid_t st_gid;
  } FuseExport;
  
  static GHashTable *exports;

@@ -120,6 +124,13 @@ static int fuse_export_create(BlockExport *blk_exp,
  exp->growable = args->growable;
  exp->allow_other = args->allow_other;
  
+exp->st_mode = S_IFREG | S_IRUSR;

+if (exp->writable) {
+exp->st_mode |= S_IWUSR;
+}
+exp->st_uid = getuid();
+exp->st_gid = getgid();
+
  ret = setup_fuse_export(exp, args->mountpoint, errp);
  if (ret < 0) {
  goto fail;
@@ -329,7 +340,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
  int64_t length, allocated_blocks;
  time_t now = time(NULL);
  FuseExport *exp = fuse_req_userdata(req);
-mode_t mode;
  
  length = blk_getlength(exp->common.blk);

  if (length < 0) {
@@ -344,17 +354,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
  allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
  }
  
-mode = S_IFREG | S_IRUSR;

-if (exp->writable) {
-mode |= S_IWUSR;
-}
-
  statbuf = (struct stat) {
  .st_ino = inode,
-.st_mode= mode,
+.st_mode= exp->st_mode,
  .st_nlink   = 1,
-.st_uid = getuid(),
-.st_gid = getgid(),
+.st_uid = exp->st_uid,
+.st_gid = exp->st_gid,
  .st_size= length,
  .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
  .st_blocks  = allocated_blocks,
@@ -400,15 +405,23 @@ static int fuse_do_truncate(const FuseExport *exp, 
int64_t size,
  }
  
  /**

- * Let clients set file attributes.  Only resizing is supported.
+ * Let clients set file attributes.  With allow_other, only resizing and
+ * changing permissions (st_mode, st_uid, st_gid) is allowed.  Without
+ * allow_other, only resizing is supported.
   */
  static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat 
*statbuf,
   int to_set, struct fuse_file_info *fi)
  {
  FuseExport *exp = fuse_req_userdata(req);
+int supported_attrs;
  int ret;
  
-if (to_set & ~FUSE_SET_ATTR_SIZE) {

+supported_attrs = FUSE_SET_ATTR_SIZE;
+if (exp->allow_other) {
+supported_attrs |= FUSE_SET_ATTR_MODE | FUSE_SET_ATTR_UID |
+FUSE_SET_ATTR_GID;
+}
+if (to_set & ~supported_attrs) {
  fuse_reply_err(req, ENOTSUP);
  return;
  }
@@ -426,6 +439,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, 
struct stat *statbuf,
  }
  }
  
+if (to_set & FUSE_SET_ATTR_MODE) {

+/* Only allow changing the file mode, not the type */
+exp->st_mode = (statbuf->st_mode & 0) | S_IFREG;
+}

Should we check that the mode actually makes sense? Not sure if making
an image executable has a good use case, and making it writable in the
permissions for a read-only export isn't a good idea either.


I mean, I don’t mind what the user does.  It doesn’t really faze us, I 
believe.  If the image contains an executable ELF and the user wants to 
run it directly from FUSE...  I don’t mind.


As for +w on RO exports, I’m not sure.  This reminds me of `sudo chattr 
+i $file`, which effectively makes any regular file read-only, too, and 
it can still keep +w.  So the file permissions are basically just ACLs, 
getting permission for something doesn’t mean you can actually do it.


OTOH, the difference to `chattr +i` is that we’d allow opening the 
export R/W, only writing would then fail.  `chattr +i` does give EPERM 
when opening the file.


So I’m not quite sure.  I don’t really want to prevent the user from 
setting any access restrictions they want, but on the other hand, if 
writing can never work, then there really is no point in allowing +w.  
(Then I’m wondering, if we don’t allow +w, should we silently drop it or 
return an error?  I guess returning success but not actually succeeding 
is weird, so we probably should return EROFS.)


But +x can technically work, so I wouldn’t disallow it.

Max




[PATCH] block: BDRV_O_NO_IO for backing file on creation

2021-06-22 Thread Max Reitz
When creating an image file with a backing file, we generally try to
open the backing file (unless -u was specified), mostly to verify that
it is there, but also to get the file size if none was specified for the
new image.

For neither of these things do we need data I/O, and so we can pass
BDRV_O_NO_IO when opening the backing file.  This allows us to open even
encrypted backing images without requiring the user to provide a secret.

This makes the -u switch in iotests 189 and 198 unnecessary (and the
$size parameter), so drop it, because this way we get regression tests
for this patch here.

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/441
Signed-off-by: Max Reitz 
---
 block.c| 6 +-
 tests/qemu-iotests/189 | 2 +-
 tests/qemu-iotests/198 | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 3f456892d0..b459502632 100644
--- a/block.c
+++ b/block.c
@@ -6553,9 +6553,13 @@ void bdrv_img_create(const char *filename, const char 
*fmt,
 }
 assert(full_backing);
 
-/* backing files always opened read-only */
+/*
+ * No need to do I/O here, which allows us to open encrypted
+ * backing images without needing the secret
+ */
 back_flags = flags;
 back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);
+back_flags |= BDRV_O_NO_IO;
 
 backing_options = qdict_new();
 if (backing_fmt) {
diff --git a/tests/qemu-iotests/189 b/tests/qemu-iotests/189
index 4e463385b2..801494c6b9 100755
--- a/tests/qemu-iotests/189
+++ b/tests/qemu-iotests/189
@@ -67,7 +67,7 @@ echo "== verify pattern =="
 $QEMU_IO --object $SECRET0 -c "read -P 0xa 0 $size" --image-opts $IMGSPECBASE 
| _filter_qemu_io | _filter_testdir
 
 echo "== create overlay =="
-_make_test_img --object $SECRET1 -o 
"encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -u -b 
"$TEST_IMG_BASE" -F $IMGFMT $size
+_make_test_img --object $SECRET1 -o 
"encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -b 
"$TEST_IMG_BASE" -F $IMGFMT
 
 echo
 echo "== writing part of a cluster =="
diff --git a/tests/qemu-iotests/198 b/tests/qemu-iotests/198
index b333a8f281..1c93dea1f7 100755
--- a/tests/qemu-iotests/198
+++ b/tests/qemu-iotests/198
@@ -64,7 +64,7 @@ echo "== writing whole image base =="
 $QEMU_IO --object $SECRET0 -c "write -P 0xa 0 $size" --image-opts $IMGSPECBASE 
| _filter_qemu_io | _filter_testdir
 
 echo "== create overlay =="
-_make_test_img --object $SECRET1 -o 
"encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -u -b 
"$TEST_IMG_BASE" -F $IMGFMT $size
+_make_test_img --object $SECRET1 -o 
"encrypt.format=luks,encrypt.key-secret=sec1,encrypt.iter-time=10" -b 
"$TEST_IMG_BASE" -F $IMGFMT
 
 echo
 echo "== writing whole image layer =="
-- 
2.31.1




Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-21 Thread Max Reitz

On 21.06.21 17:51, Vivek Goyal wrote:

On Mon, Jun 21, 2021 at 11:02:16AM +0200, Max Reitz wrote:

On 18.06.21 20:29, Vivek Goyal wrote:


[snip]


What am I not able to understand is that why we can't return error if
we run into a temporary issue and can't generate file handle. What's
that requirement that we need to hide the error and try to cover it
up by some other means.

There is no requirement, it’s just that we need to hide ENOTSUPP errors
anyway (because e.g. some submounted filesystem may not support file
handles), so I considered hiding temporary errors

ENOTSUPP is not a temporary error I guess. But this is a good point that
if host filesystem does not support file handles, should we return
error so that user is forced to remove "-o inode_file_handles" option.

I can see multiple modes and they all seem to be useful in different
scenarios.

A. Do not use file handles at all.

B. Use file handles if filesystem supports it. Also this could do some
kind of mix and matching so that some inodes use file handles while
others use fd. We could think of implementing some threshold and
if open fds go above this threshold, switch to file handle stuff.

C. Strictly use file handles otherwise return error to caller.

One possibility is that we implement two options inode_file_handles
and no_inode_file_handles.

- User specified -o no_inode_file_handles, implement A.
- User specified -o inode_file_handles, implement C.
- User did not specify anything, then use file handles opportunistically
   as seen fit by daemon. This provides us the maximum flexibility and
   this practically implements option B.

IOW, if user did not specify anything, then we can think of implementing
file handles by default and fallback to using O_PATH fds if filesystem
does not support file handles (ENOTSUPP). But if user did specify
"-o no_inode_file_handles" or "-o inode_file_handles", this kind
of points to strictly implementing respective policy, IMHO. That's how
I have implemented some other options.

Alternatively, we could think of adding one more option say
"inode_file_handles_only.

- "-o no_inode_files_handles" implements A.
- "-o inode_files_handles" implements B.
- "-o inode_files_handles_only" implements C.
- If user did not specify anything on command line, then its up to the
   daemon whatever default policy it wants and default can change
   over time.


I think it makes sense not to punish the user for wanting to use file 
handles as much as possible and still gracefully handling submounts that 
don’t support them.  So I’d want to implement B first, and have that be 
-o inode_files_handles.  I think A as -o no_inode_file_handles is also 
there, automatically...?


I don’t think there’s much of a problem with implementing C, except we 
probably want to log ENOTSUPP errors then, so the user can figure out 
what’s going on.  I feel like we can still wait with such an option 
until there’s a demand for it.


(Except that perhaps the demand would come in the form of “please help I 
use -o inode_file_handles but virtiofsd’s FD count is still too high I 
don’t know what to do”, and that probably wouldn’t be so great.  But 
then again, inodes_files_handles_only wouldn’t help a user in that case 
either, because it changes “works badly” to “doesn’t work at all”.  Now 
I’m wondering what the actual use case of inodes_files_handles_only 
would be.)



not to really add
complexity.  (Which is true, as can be seen from the diff stat I posted
below: Dropping the second hash table and making the handle part of lo_key,
so that temporary errors are now fatal, generates a diff of +37/-66, where
-20 are just two comments (which realistically I’d need to replace by
different comments), so in terms of code, it’s +37/-46.)

I’d just rather handle errors gracefully when I find it doesn’t really cost
complexity.


However, you made a good point in that we must require name_to_handle_at()
to work if it worked before for some inode, not because it would be simpler,
but because it would be wrong otherwise.

As for the other way around...  Well, now I don’t have a strong opinion on
it.  Handling temporary name_to_handle_at() failure after it worked the
first time should not add extra complexity, but it wouldn’t be symmetric.
Like, allowing temporary failure sometimes but not at other times.

Right. If we decided that we need to generate file handle for an inode
and underlying filesystem supports file handles, then handling temporary
failures only sometimes will make it assymetric. At this point of time
I am more inclined to return error to caller on temporary failures. But
if this does not work well in practice, I am open to change the policy.


The next question is, how do we detect temporary failure, because if we look
up some new inode, name_to_handle_at() fails, we ignore it, and then it
starts to work and we fail all further lookups, tha

Re: [PATCH 0/4] export/fuse: Allow other users access to the export

2021-06-21 Thread Max Reitz

On 21.06.21 18:12, Kevin Wolf wrote:

Am 14.06.2021 um 16:44 hat Max Reitz geschrieben:

Hi,

With the default mount options, FUSE mounts are not accessible to any
users but the one who did the mount, not even to root.  To allow such
accesses, allow_other must be passed.

This is probably useful to some people (it certainly is to me, e.g. when
exporting some image as my normal user, and then trying to loop mount it
as root), so this series adds a QAPI allow-other bool that will make the
FUSE export code pass allow_other,default_permissions to FUSE.

(default_permissions will make the kernel do the usual UNIX permission
checks, which is something that makes a lot of sense when allowing other
users access to the export.)

This also requires our SETATTR code to be able to handle permission
changes, though, so the user can then run chmod/chown/chgrp on the
export to adjust its permissions to their need.

The final patch adds a test.

If there is even a use case for leaving the option off (not trusting
root?), it must certainly be the less common case? So I'm not sure if
allow-other should be an option at all, but if it is, enabling it by
default would make more sense to me.

Is there a reason why you picked false as the default, except that it is
the old behaviour?


No. :)

Well, mostly.  I also thought, if FUSE thinks allow_other shouldn’t be 
the default, who am I to decide otherwise.


Now that I tried to find out why FUSE has it as the default (I only 
remember vague “security reasons”), I still couldn’t find out why, but I 
did find that using this option as non-root user requires /etc/fuse.conf 
to have user_allow_other in it, which I don’t think we can require.


So I think it must be an option.  As for which value should be the 
default, that probably depends on how common having user_allow_other in 
/etc/fuse.conf is.  I know I never put it there, and it’s both on my 
Fedora and my Arch system.  So I guess it seems rather common?


Max




Re: [PATCH 2/6] block: block-status cache for data regions

2021-06-21 Thread Max Reitz

On 19.06.21 12:20, Vladimir Sementsov-Ogievskiy wrote:

17.06.2021 18:52, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
  include/block/block_int.h | 19 ++
  block.c   |  2 +
  block/io.c    | 80 +--
  3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  +/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not 
necessarily

+ *    the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    CoMutex lock;
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
  struct BlockDriverState {
  /* Protected by big QEMU lock or read-only after opening. No 
special

   * locking needed during I/O...
@@ -997,6 +1014,8 @@ struct BlockDriverState {
    /* BdrvChild links to this node may never be frozen */
  bool never_freeze;
+
+    BdrvBlockStatusCache block_status_cache;
  };
    struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 3f456892d0..bad64d5c4f 100644
--- a/block.c
+++ b/block.c
@@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
    qemu_co_queue_init(&bs->flush_queue);
  +    qemu_co_mutex_init(&bs->block_status_cache.lock);
+
  for (i = 0; i < bdrv_drain_all_count; i++) {
  bdrv_drained_begin(bs);
  }
diff --git a/block/io.c b/block/io.c
index 323854d063..320638cc48 100644
--- a/block/io.c
+++ b/block/io.c
@@ -35,6 +35,7 @@
  #include "qapi/error.h"
  #include "qemu/error-report.h"
  #include "qemu/main-loop.h"
+#include "qemu/range.h"
  #include "sysemu/replay.h"
    /* Maximum bounce buffer for copy-on-read and write zeroes, in 
bytes */
@@ -1862,6 +1863,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

  bool need_flush = false;
  int head = 0;
  int tail = 0;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
    int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, 
INT_MAX);

  int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

  return -ENOTSUP;
  }
  +    /* Invalidate the cached block-status data range if this write 
overlaps */

+    qemu_co_mutex_lock(&bsc->lock);
+    if (bsc->valid &&
+    ranges_overlap(o

Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum

2021-06-21 Thread Max Reitz

On 19.06.21 12:53, Vladimir Sementsov-Ogievskiy wrote:

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
  block/nbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 616f9ae6c4..930bd234de 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1702,7 +1702,7 @@ static int coroutine_fn 
nbd_client_co_block_status(

  .type = NBD_CMD_BLOCK_STATUS,
  .from = offset,
  .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-   MIN(bytes, s->info.size - offset)),
+   s->info.size - offset),
  .flags = NBD_CMD_FLAG_REQ_ONE,
  };



Hmm..

I don't that this change is correct. In contrast with file-posix you 
don't get extra information for free, you just make a larger request. 
This means that server will have to do more work.


Oh, oops.  Seems I was blind in my rage to replace this MIN() pattern.

You’re absolutely right.  So this patch should be dropped.

Max

(look at blockstatus_to_extents, it calls bdrv_block_status_above in a 
loop).


For example, assume that nbd export is a qcow2 image with all clusters 
allocated. With this change, nbd server will loop through the whole 
qcow2 image, load all L2 tables to return big allocated extent.


So, only server can decide, could it add some extra free information 
to request or not. But unfortunately NBD_CMD_FLAG_REQ_ONE doesn't 
allow it.







Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum

2021-06-21 Thread Max Reitz

On 19.06.21 12:36, Vladimir Sementsov-Ogievskiy wrote:

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 



Reviewed-by: Vladimir Sementsov-Ogievskiy 



---
  block/gluster.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
   * the specified offset) that are known to be in the same
   * allocated/unallocated state.
   *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 
'pnum' may

+ * well exceed it.
   *
   * (Based on raw_co_block_status() from file-posix.c.)
   */
@@ -1500,12 +1501,12 @@ static int coroutine_fn 
qemu_gluster_co_block_status(BlockDriverState *bs,

  } else if (data == offset) {
  /* On a data extent, compute bytes to the end of the extent,
   * possibly including a partial sector at EOF. */
-    *pnum = MIN(bytes, hole - offset);
+    *pnum = hole - offset;
  ret = BDRV_BLOCK_DATA;


Interesting, isn't it a bug that we don't ROUND_UP *pnum to 
request_alignment here like it is done in file-posix ?


Guess I forgot gluster in 9c3db310ff0 O:)

I don’t think I’ll be able to reproduce it for gluster, but I suppose 
just doing the same thing for gluster should be fine...


Max




Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum

2021-06-21 Thread Max Reitz

On 18.06.21 22:16, Eric Blake wrote:

On Thu, Jun 17, 2021 at 05:52:44PM +0200, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

We should update the documentation in include/block/block_int.h to
mention that the driver's block_status callback may treat *pnum as a
soft cap, and that returning a larger value is fine.


Oh, sure.

Max


But I agree with this change in the individual drivers, as long as we
remember to make our global contract explicit that we can now rely on
it ;)

Reviewed-by: Eric Blake 






Re: [PATCH 2/6] block: block-status cache for data regions

2021-06-21 Thread Max Reitz

On 18.06.21 20:51, Eric Blake wrote:

On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote:

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

Here's hoping third time's the charm!


Indeed :)


(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
  include/block/block_int.h | 19 ++
  block.c   |  2 +
  block/io.c| 80 +--
  3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  
+/*

+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+CoMutex lock;
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;

Looks like the right bits of information, and I'm glad you documented
the need to be prepared for protocols that report split data sections
rather than consolidated.


+++ b/block/io.c
@@ -35,6 +35,7 @@
  #include "qapi/error.h"
  #include "qemu/error-report.h"
  #include "qemu/main-loop.h"
+#include "qemu/range.h"
  #include "sysemu/replay.h"
  
  /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */

@@ -1862,6 +1863,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  bool need_flush = false;
  int head = 0;
  int tail = 0;
+BdrvBlockStatusCache *bsc = &bs->block_status_cache;
  
  int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);

  int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  return -ENOTSUP;
  }
  
+/* Invalidate the cached block-status data range if this write overlaps */

+qemu_co_mutex_lock(&bsc->lock);

Are we going to be suffering from a lot of lock contention performance
degradation?  Is there a way to take advantage of RCU access patterns
for any more performance without sacrificing correctness?


The critical section is so short that I considered it fine.  I wanted to 
use RW locks, but then realized that every RW lock operation is 
internally locked by another mutex, so it wouldn’t gain anything.


I’m not sure whether RCU is worth it here.

We could try something very crude, namely to just not take a lock and 
make `valid` an atomic.  After all, it doesn’t really matter whether 
`data_start` and `data_end` are consistent values, and resetting `valid` 
to false is always safe.


The worst that could happen is that a concurrent block-status call tries 
to set up an overlapping data area, which we thus fail to recognize 
here.  But if such a thing were to happen, it could just as well happen 
before said concurrent call took any lock on `bsc`.



+if (bsc->valid &&
+ranges_overlap(offset, bytes, bsc->data_start,
+   bsc->data_end - bsc->data_start))
+{
+bsc->valid = false;
+}

Do we want to invalidate the entire bsc, or can we be smart and leave
the prefix intact (if offset > bsc->data_start, then set bsc->data_end
to offset)?


Perhaps we could be smart, but I don’t think it really makes a 
difference in practice, so I think keeping it simple is better.



+qemu_co_mutex_unlock(&bsc->lock);

Worth using WITH_QEMU_LOCK_GUARD?


I knew I forgot something, right.  Will use!

Max




Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-21 Thread Max Reitz

On 18.06.21 20:29, Vivek Goyal wrote:

On Fri, Jun 18, 2021 at 10:28:38AM +0200, Max Reitz wrote:

On 17.06.21 23:21, Vivek Goyal wrote:

On Wed, Jun 16, 2021 at 03:38:13PM +0200, Max Reitz wrote:

On 11.06.21 22:04, Vivek Goyal wrote:

On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote:

Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.

Hi Max,

What are the cases where we can not rely being able to generate file
handles?

I have no idea, but it’s much easier to claim we can’t than to prove that we
can. I’d rather be resilient.

Assuming that we can not genererate file handles all the time and hence
mainitaing another inode cache seems little problematic to me.

How so?

It is extra complexity. Need to worry about one more hash table. Sure,
I give it to you that we are not creating two copies of inodes. Same
inode object is being added to two different hash tables and is being
looked up using two different keys.


I would rather start with that we can generate file handles and have
a single inode cache.

The assumption that we can generate file handles all the time is a much
stronger one (and one which needs to be proven) than assuming that failure
is possible.

So if temporary failures can happen (like -ENOMEM, as you mentioned),
these can happen with openat() too. And in that case we return error
to caller. So why to try to hide temporary failures from
name_to_handle_at().


Well, for openat() we don’t have a choice, if that fails, there is no 
fallback, so we must return an error.  For name_to_handle_at(), there is 
a fallback.



I am still reading your code and trying to understand it. But one
question came to mind. What happens if we can generate file handle
during lookup. But can't generate when same file is looked up again.

- A file foo.txt is looked. We can create file handle and we add it
   to lo->inodes_by_handle as well as lo->inodes_by_ds.

- Say somebody deleted file and created again and inode number got
   reused.

- Now during ->revalidation path, lookup happens again. This time say
   we can't generate file handle. If am reading lo_do_find() code
   correctly, it will find the old inode using ids and return same
   inode as result of lookup. And we did not recognize that inode
   number has been reused.


Oh, that’s a good point.  If an lo_inode has no O_PATH fd but is only 
addressed by handle, we must always look it up by handle.



And issues might arise if we could not generate file handle in first
lookup but could generate in second.

- A file foo.txt is lookedup. We can not create file handle and we add it
   to lo->inodes_by_ids.

- Say somebody deleted file and created again and inode number got
   reused.


This is not possible, because if we could not generate a file handle on 
the first lookup, the lo_inode will have an O_PATH fd attached to it, so 
the inode number cannot be reused while the lo_inode still exists.



- Now during ->revalidation path, lookup happens again. This time say
   we can generate file handle. If am reading lo_do_find() code
   correctly, it will find the old inode using ids and return same
   inode as result of lookup. And we did not recognize that inode
   number has been reused.

IOW, because we could not generate the file handle, we lost the
ability to recognize that inode number has been reused. That means
either we don't switch to using file handles or if we do switch,
it is important that we can generate file handle to differentiate
whether inode number has been used or not. If not, return error to
caller. Caller probably will mark inode bad and let and do a lookup
again.


Also, it is still a single inode cache. I'

Re: [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-18 Thread Max Reitz

On 09.06.21 17:55, Max Reitz wrote:

Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.  Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

Note that we do not generate lo_fhandle objects yet, and so we also do
not enter anything into the inodes_by_handle map yet.  Also, all lookups
skip that map.  We might manually create file handles with some code
that is immediately removed by the next patch again, but that would
break the assumption in lo_find() that every lo_inode with a non-NULL
.fhandle must have an entry in inodes_by_handle and vice versa.  So we
leave actually using the inodes_by_handle map for the next patch.

[1] If some application in the guest still has the file open, there is
going to be a corresponding FD mapping in lo_data.fd_map.  In such a
case, the inode will only go away once every application in the guest
has closed it.  The problem described only applies to cases where the
guest does not have the file open, and it is just in the dentry cache,
basically.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
  tools/virtiofsd/passthrough_ll.c | 80 +---
  1 file changed, 64 insertions(+), 16 deletions(-)


[...]


@@ -3931,10 +3954,34 @@ static gboolean lo_key_equal(gconstpointer a, 
gconstpointer b)
  return la->ino == lb->ino && la->dev == lb->dev && la->mnt_id == 
lb->mnt_id;
  }
  
+static guint lo_fhandle_hash(gconstpointer key)

+{
+const struct lo_fhandle *fh = key;
+guint hash;
+size_t i;
+
+/* Basically g_str_hash() */
+hash = 5381;
+for (i = 0; i < sizeof(fh->padding); i++) {
+hash += hash * 33 + (unsigned char)fh->padding[i];
+}
+hash += hash * 33 + fh->mount_id;


Just spotted: These `+=` should be `=`.

Max


+
+return hash;
+}





Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-18 Thread Max Reitz

On 17.06.21 23:21, Vivek Goyal wrote:

On Wed, Jun 16, 2021 at 03:38:13PM +0200, Max Reitz wrote:

On 11.06.21 22:04, Vivek Goyal wrote:

On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote:

Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.

Hi Max,

What are the cases where we can not rely being able to generate file
handles?

I have no idea, but it’s much easier to claim we can’t than to prove that we
can. I’d rather be resilient.

Assuming that we can not genererate file handles all the time and hence
mainitaing another inode cache seems little problematic to me.


How so?


I would rather start with that we can generate file handles and have
a single inode cache.


The assumption that we can generate file handles all the time is a much 
stronger one (and one which needs to be proven) than assuming that 
failure is possible.


Also, it is still a single inode cache. I'm just adding a third key. 
(The two existing keys are lo_key (through lo->inodes) and fuse_ino_t 
(through lo->ino_map).)



Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

If we have to keep inodes_by_ids around, then can we just add fhandle
to the lo_key. That way we can manage with single hash table and still
be able to detect if inode ID has been reused.

We cannot, because I assume we cannot rely on name_to_handle_at() working
every time.

I guess either we need concrete information that we can't generate
file handle every time or we should assume we can until we are proven
wrong. And then fix it accordingly, IMHO.


I don’t know why we need this other than because it would simplify the code.

Under the assumption that for a specific file we can either generate 
file handles all the time or never, the code as it is will behave 
correct. It’s just a bit more complicated than it would need to be, but 
I don’t find the diffstat of +64/-16 to be indicative of something 
that’s really bad.



Therefore, maybe at one point we can generate a file handle, and
at another, we cannot – we should still be able to look up the inode
regardless.

If the file handle were part of inodes_by_ids, then we can look up inodes
only if we can generate a file handle either every time (for a given inode)
or never.

Right. And is there a reason to belive that for the same file we can
sometimes generate file handles and other times not.


Looking into name_to_handle_at()’s man page, there is no error listed 
that I could imagine happening only sometimes. But it doesn’t give an 
explicit guarantee that it will either always succeed or fail for a 
given inode.


Looking into the kernel, I can see that most filesystems only fail 
.encode_fh() if the buffer is too small. Overlayfs can also fail with 
ENOMEM (which will be translated to EOVERFLOW along the way, so so much 
for name_to_handle_at()’s list of error conditions), and EIO on 
conditions I don’t understand well enough (again, will become EOVERFLOW 
for the user).


You’re probably right that at least in practice we don’t need to 
accommodate for name_to_handle_at() sometimes working for some inode and 
sometimes not.


But I feel quite uneasy relying on this being the case, and being the 
case forever, when I find it quite simple to just have some added 
complexity to deal with it. It’s just a third key for our inode cache.


If you want to, I can write a 10/9 patch that simplifies the code under 
the assumption that name_to_handle_at() will either fail or not, but 
frankly I wouldn’t want to have my name under it. (Which is wh

[PATCH 5/6] block/nbd: Do not force-cap *pnum

2021-06-17 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
 block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 616f9ae6c4..930bd234de 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
 .type = NBD_CMD_BLOCK_STATUS,
 .from = offset,
 .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-   MIN(bytes, s->info.size - offset)),
+   s->info.size - offset),
 .flags = NBD_CMD_FLAG_REQ_ONE,
 };
 
-- 
2.31.1




[PATCH 4/6] block/gluster: Do not force-cap *pnum

2021-06-17 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
 block/gluster.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  *
  * (Based on raw_co_block_status() from file-posix.c.)
  */
@@ -1500,12 +1501,12 @@ static int coroutine_fn 
qemu_gluster_co_block_status(BlockDriverState *bs,
 } else if (data == offset) {
 /* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(bytes, hole - offset);
+*pnum = hole - offset;
 ret = BDRV_BLOCK_DATA;
 } else {
 /* On a hole, compute bytes to the beginning of the next extent.  */
 assert(hole == offset);
-*pnum = MIN(bytes, data - offset);
+*pnum = data - offset;
 ret = BDRV_BLOCK_ZERO;
 }
 
-- 
2.31.1




[PATCH 3/6] block/file-posix: Do not force-cap *pnum

2021-06-17 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
 block/file-posix.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b3fbb9bd63..aeb370d5bb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2689,7 +2689,8 @@ static int find_allocation(BlockDriverState *bs, off_t 
start,
  * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
  */
 static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
 bool want_zero,
@@ -2727,7 +2728,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
 } else if (data == offset) {
 /* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(bytes, hole - offset);
+*pnum = hole - offset;
 
 /*
  * We are not allowed to return partial sectors, though, so
@@ -2746,7 +2747,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
 } else {
 /* On a hole, compute bytes to the beginning of the next extent.  */
 assert(hole == offset);
-*pnum = MIN(bytes, data - offset);
+*pnum = data - offset;
 ret = BDRV_BLOCK_ZERO;
 }
 *map = offset;
-- 
2.31.1




[PATCH 2/6] block: block-status cache for data regions

2021-06-17 Thread Max Reitz
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 19 ++
 block.c   |  2 +
 block/io.c| 80 +--
 3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+CoMutex lock;
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
 /* Protected by big QEMU lock or read-only after opening.  No special
  * locking needed during I/O...
@@ -997,6 +1014,8 @@ struct BlockDriverState {
 
 /* BdrvChild links to this node may never be frozen */
 bool never_freeze;
+
+BdrvBlockStatusCache block_status_cache;
 };
 
 struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 3f456892d0..bad64d5c4f 100644
--- a/block.c
+++ b/block.c
@@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
 
 qemu_co_queue_init(&bs->flush_queue);
 
+qemu_co_mutex_init(&bs->block_status_cache.lock);
+
 for (i = 0; i < bdrv_drain_all_count; i++) {
 bdrv_drained_begin(bs);
 }
diff --git a/block/io.c b/block/io.c
index 323854d063..320638cc48 100644
--- a/block/io.c
+++ b/block/io.c
@@ -35,6 +35,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/range.h"
 #include "sysemu/replay.h"
 
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
@@ -1862,6 +1863,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 bool need_flush = false;
 int head = 0;
 int tail = 0;
+BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
 int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
 int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 return -ENOTSUP;
 }
 
+/* Invalidate the cached block-status data range if this write overlaps */
+qemu_co_mutex_lock(&bsc->lock);
+if (bsc->valid &&
+ranges_overlap(offset, bytes, bsc->data_start,
+   bsc->data_end - bsc->data_start))
+{
+bsc->va

[PATCH 6/6] block/iscsi: Do not force-cap *pnum

2021-06-17 Thread Max Reitz
bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
 block/iscsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4d2a416ce7..852384086b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -781,9 +781,6 @@ retry:
 iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
 }
 
-if (*pnum > bytes) {
-*pnum = bytes;
-}
 out_unlock:
 qemu_mutex_unlock(&iscsilun->mutex);
 g_free(iTask.err_str);
-- 
2.31.1




[PATCH 0/6] block: block-status cache for data regions

2021-06-17 Thread Max Reitz
Hi,

We’ve already had two attempts at introducing a block-status cache for
data regions
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), both of which were local to file-posix.

I have a gitlab issue assigned to me
(https://gitlab.com/qemu-project/qemu/-/issues/307), so I suppose it’s
time for round three.

This series tries to address what I gathered from the reviews for both:
(1) We should try to have this cache in the general block layer for all
protocol drivers; (2) Just to be sure, we should have a mutex for it;
(3) External writers don’t matter if we just cache data regions, because
(for a protocol node) reporting a zero region as data isn’t fatal.

Patch 1 is clean-up, patch 2 is the main one, patches 3 to 6 make it
more useful: Some protocol drivers force-clamp the returned *pnum based
on the @bytes parameter, but that’s completely unnecessary, because
bdrv_co_block_status() will clamp the value, too.  It’s beneficial to
return as large a *pnum value as possible to bdrv_co_block_status(), so
our cache can be more effective.


Max Reitz (6):
  block: Drop BDS comment regarding bdrv_append()
  block: block-status cache for data regions
  block/file-posix: Do not force-cap *pnum
  block/gluster: Do not force-cap *pnum
  block/nbd: Do not force-cap *pnum
  block/iscsi: Do not force-cap *pnum

 include/block/block_int.h | 21 --
 block.c   |  2 +
 block/file-posix.c|  7 ++--
 block/gluster.c   |  7 ++--
 block/io.c| 80 +--
 block/iscsi.c |  3 --
 block/nbd.c   |  2 +-
 7 files changed, 105 insertions(+), 17 deletions(-)

-- 
2.31.1




[PATCH 1/6] block: Drop BDS comment regarding bdrv_append()

2021-06-17 Thread Max Reitz
There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 057d88b1fc..a8f9598102 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,12 +832,6 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
-/*
- * Note: the function bdrv_append() copies and swaps contents of
- * BlockDriverStates, so if you add new fields to this struct, please
- * inspect bdrv_append() to determine if the new fields need to be
- * copied as well.
- */
 struct BlockDriverState {
 /* Protected by big QEMU lock or read-only after opening.  No special
  * locking needed during I/O...
-- 
2.31.1




Re: [PATCH v3 3/7] block: add max_hw_transfer to BlockLimits

2021-06-16 Thread Max Reitz

On 16.06.21 15:18, Paolo Bonzini wrote:

On 15/06/21 18:18, Max Reitz wrote:

  }
+/* Returns the maximum hardware transfer length, in bytes; 
guaranteed nonzero */

+uint64_t blk_get_max_hw_transfer(BlockBackend *blk)
+{
+    BlockDriverState *bs = blk_bs(blk);
+    uint64_t max = INT_MAX;
+
+    if (bs) {
+    max = MIN_NON_ZERO(bs->bl.max_hw_transfer, 
bs->bl.max_transfer);

+    }
+    return max;


Both `max_hw_transfer` and `max_transfer` can be 0, so this could 
return 0, contrary to what the comment above promises.


Should `max` be initialized to 0 with a `MIN_NON_ZERO(max, INT_MAX)` 
here (like `blk_get_max_transfer()` does it)?


Yes, something to that effect.

(As for the rest, I think aligning to the request alignment makes 
sense, but then again we don’t do that for max_transfer either, so... 
this at least wouldn’t be a new bug.


Ok, will do.  I will also add a new patch to align max_transfer to the 
request alignment.


Regarding the comment, checkpatch complains about it, so it should be 
fixed so that /* is on its own line.


That makes it different from every other comment in block_int.h 
though.  Is it okay to fix all of them in a follow-up?


The reason it’s different is that the comment style in question was 
added to checkpatch only relatively recently. I can’t speak for others, 
but I’m a simple person. I just do what makes checkpatch happy. :)


Given that the checkpatch complaint is only a warning, I think it’s OK 
to keep the comment as it is here, and perhaps optionally fix all 
comments in block_int.h in a follow-up. I don’t think we need to fix 
existing comments, but, well, it wouldn’t be wrong.


Max




Re: [PATCH v2 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs

2021-06-16 Thread Max Reitz

On 11.06.21 21:19, Vivek Goyal wrote:

On Wed, Jun 09, 2021 at 05:55:42PM +0200, Max Reitz wrote:

Hi,

v1 cover letter for an overview:
https://listman.redhat.com/archives/virtio-fs/2021-June/msg00033.html

Hi Max,

What's the impact of these patches on performance? Just trying to
get some idea what to expect. Performance remains more or less
same or we expect a hit.
I definitely expect a hit if you just benchmark the right way. Having to 
open FDs all the time (for metadata operations) will have an impact 
(when you do lookups all the time, or open files and close them 
immediately). How much of an impact, that’s completely up to the 
filesystem’s open_by_handle_at() implementation, I presume.


I don’t expect it to be significant for real-world use cases, though. 
When really doing I/O, there should be absolutely no difference, because 
then we’re operating with these FUSE handles that have actual (non-path) 
FDs attached to them.


Max




Re: [Virtio-fs] [PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-16 Thread Max Reitz

On 11.06.21 22:04, Vivek Goyal wrote:

On Wed, Jun 09, 2021 at 05:55:49PM +0200, Max Reitz wrote:

Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.

Hi Max,

What are the cases where we can not rely being able to generate file
handles?


I have no idea, but it’s much easier to claim we can’t than to prove 
that we can. I’d rather be resilient.



Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

If we have to keep inodes_by_ids around, then can we just add fhandle
to the lo_key. That way we can manage with single hash table and still
be able to detect if inode ID has been reused.


We cannot, because I assume we cannot rely on name_to_handle_at() 
working every time. Therefore, maybe at one point we can generate a file 
handle, and at another, we cannot – we should still be able to look up 
the inode regardless.


If the file handle were part of inodes_by_ids, then we can look up 
inodes only if we can generate a file handle either every time (for a 
given inode) or never. Or, well, I suppose we could always create two 
entries, one with the file handles zeroed out, and one with the file 
handle specified, but I wouldn’t find that very beautiful.


Max


Note that we do not generate lo_fhandle objects yet, and so we also do
not enter anything into the inodes_by_handle map yet.  Also, all lookups
skip that map.  We might manually create file handles with some code
that is immediately removed by the next patch again, but that would
break the assumption in lo_find() that every lo_inode with a non-NULL
.fhandle must have an entry in inodes_by_handle and vice versa.  So we
leave actually using the inodes_by_handle map for the next patch.

[1] If some application in the guest still has the file open, there is
going to be a corresponding FD mapping in lo_data.fd_map.  In such a
case, the inode will only go away once every application in the guest
has closed it.  The problem described only applies to cases where the
guest does not have the file open, and it is just in the dentry cache,
basically.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
  tools/virtiofsd/passthrough_ll.c | 80 +---
  1 file changed, 64 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index e665575401..793d2c333e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -179,7 +179,8 @@ struct lo_data {
  int announce_submounts;
  bool use_statx;
  struct lo_inode root;
-GHashTable *inodes; /* protected by lo->mutex */
+GHashTable *inodes_by_ids; /* protected by lo->mutex */
+GHashTable *inodes_by_handle; /* protected by lo->mutex */
  struct lo_map ino_map; /* protected by lo->mutex */
  struct lo_map dirp_map; /* protected by lo->mutex */
  struct lo_map fd_map; /* protected by lo->mutex */
@@ -257,8 +258,9 @@ static struct {
  /* That we loaded cap-ng in the current thread from the saved */
  static __thread bool cap_loaded = 0;
  
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,

-uint64_t mnt_id);
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id);
  static int xattr_map_client(const struct lo_data *lo, const char *client_name,
  char **out_name);
  
@@ -1032,18 +1034,39 @@ out_er

Re: [PATCH v3 7/7] block: detect DKIOCGETBLOCKCOUNT/SIZE before use

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

From: Joelle van Dyne 

iOS hosts do not have these defined so we fallback to the
default behaviour.

Co-authored-by: Warner Losh 
Reviewed-by: Peter Maydell 
Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 21 +
  1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5821e1afed..4e2f7cf508 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2322,8 +2322,11 @@ static int64_t raw_getlength(BlockDriverState *bs)
  again:
  #endif
  if (!fstat(fd, &sb) && (S_IFCHR & sb.st_mode)) {
+size = 0;
  #ifdef DIOCGMEDIASIZE
-if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size))
+if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) {


Pre-existing, but I feel compelled to express my unease about this cast.


+size = 0;
+}
  #elif defined(DIOCGPART)
  {
  struct partinfo pi;
@@ -2332,9 +2335,7 @@ again:
  else
  size = 0;
  }
-if (size == 0)
-#endif
-#if defined(__APPLE__) && defined(__MACH__)
+#elif defined(DKIOCGETBLOCKCOUNT) && defined(DKIOCGETBLOCKSIZE)


As far a I can tell, before this patch, if the DIOCGMEDIASIZE ioctl 
failed, we fell back to this DKIOCGETBLOCKCOUNT/SIZE block (if compiled 
in). Now this is an #elif and so will not be used if DIOCGMEDIASIZE was 
defined. Is that intentional?


This may be fine, and apart from that, this patch looks good to me, but 
this change in behavior wasn’t mentioned in the commit message, hence me 
asking.



  {
  uint64_t sectors = 0;
  uint32_t sector_size = 0;
@@ -2342,19 +2343,15 @@ again:
  if (ioctl(fd, DKIOCGETBLOCKCOUNT, §ors) == 0
 && ioctl(fd, DKIOCGETBLOCKSIZE, §or_size) == 0) {
  size = sectors * sector_size;
-} else {
-size = lseek(fd, 0LL, SEEK_END);
-if (size < 0) {
-return -errno;
-}
  }
  }
-#else
-size = lseek(fd, 0LL, SEEK_END);
+#endif
+if (size == 0) {
+size = lseek(fd, 0LL, SEEK_END);
+}
  if (size < 0) {
  return -errno;
  }
-#endif
  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
  switch(s->type) {
  case FTYPE_CD:





Re: [PATCH v3 6/7] block: check for sys/disk.h

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

From: Joelle van Dyne 

Some BSD platforms do not have this header.

Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block.c | 2 +-
  meson.build | 1 +
  2 files changed, 2 insertions(+), 1 deletion(-)


Reviewed-by: Max Reitz 




Re: [PATCH v3 5/7] block: feature detection for host block support

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

From: Joelle van Dyne 

On Darwin (iOS), there are no system level APIs for directly accessing
host block devices. We detect this at configure time.

Signed-off-by: Joelle van Dyne 
Message-Id: <20210315180341.31638-...@getutm.app>
Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c   | 33 ++---
  meson.build  |  6 +-
  qapi/block-core.json | 10 +++---
  3 files changed, 34 insertions(+), 15 deletions(-)


[...]


diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2ea294129e..2dd41be156 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -897,7 +897,8 @@
'discriminator': 'driver',
'data': {
'file': 'BlockStatsSpecificFile',
-  'host_device': 'BlockStatsSpecificFile',
+  'host_device': { 'type': 'BlockStatsSpecificFile',
+   'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },
'nvme': 'BlockStatsSpecificNvme' } }
  
  ##

@@ -2814,7 +2815,9 @@
  { 'enum': 'BlockdevDriver',
'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs',
  'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
-'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
+'gluster', 'host_cdrom',
+{'name': 'host_device', 'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },


Shouldn’t this be done for host_cdrom, too? (and below)

Apart from that, looks good to me.


+'http', 'https', 'iscsi',
  'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
  'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
  { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
@@ -3996,7 +3999,8 @@
'ftps':   'BlockdevOptionsCurlFtps',
'gluster':'BlockdevOptionsGluster',
'host_cdrom': 'BlockdevOptionsFile',
-  'host_device':'BlockdevOptionsFile',
+  'host_device': { 'type': 'BlockdevOptionsFile',
+   'if': 'defined(HAVE_HOST_BLOCK_DEVICE)' },
'http':   'BlockdevOptionsCurlHttp',
'https':  'BlockdevOptionsCurlHttps',
'iscsi':  'BlockdevOptionsIscsi',





Re: [PATCH v3 4/7] file-posix: try BLKSECTGET on block devices too, do not round to power of 2

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

bs->sg is only true for character devices, but block devices can also
be used with scsi-block and scsi-generic.  Unfortunately BLKSECTGET
returns bytes in an int for /dev/sgN devices, and sectors in a short
for block devices, so account for that in the code.

The maximum transfer also need not be a power of 2 (for example I have
seen disks with 1280 KiB maximum transfer) so there's no need to pass
the result through pow2floor.

This patch reintroduces the logic that was removed in commit 867eccfed8
("file-posix: Use max transfer length/segment count only for SCSI
passthrough", 2019-07-12).  The removal was motivated by the performance
regression when using a block device's max_transfer with kernel I/O;
the new, separate max_hw_transfer limit avoids reintroducing the same
regression.


(And the fact that the result of hdev_get_max_segments() is no longer 
used to cap max_transfer, but is just stored in max_iov instead.)



Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 40 ++--
  1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index f55f92d0f5..1439293f63 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1149,22 +1149,27 @@ static void raw_reopen_abort(BDRVReopenState *state)
  s->reopen_state = NULL;
  }
  
-static int sg_get_max_transfer_length(int fd)

+static int hdev_get_max_hw_transfer(int fd, struct stat *st)
  {
  #ifdef BLKSECTGET
-int max_bytes = 0;
-
-if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
-return max_bytes;
+if (S_ISBLK(st->st_mode)) {
+unsigned short max_sectors = 0;
+if (ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
+return max_sectors * 512;
+}
  } else {
-return -errno;
+int max_bytes = 0;
+if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
+return max_bytes;
+}
  }
+return -errno;
  #else
  return -ENOSYS;
  #endif
  }
  
-static int sg_get_max_segments(int fd)

+static int hdev_get_max_segments(int fd, struct stat *st)


OK, now I see why in patch 1 you treated `st` as a pointer. Still, needs 
to be `st.` in patch 1, and changed to `st->` in this patch only.



  {
  #ifdef CONFIG_LINUX
  char buf[32];
@@ -1173,12 +1178,6 @@ static int sg_get_max_segments(int fd)
  int ret;
  int sysfd = -1;
  long max_segments;
-struct stat st;
-
-if (fstat(fd, &st)) {
-ret = -errno;
-goto out;
-}
  
  if (S_ISCHR(st->st_mode)) {

  if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -1192,7 +1191,7 @@ static int sg_get_max_segments(int fd)
  }
  
  sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",

-major(st.st_rdev), minor(st.st_rdev));
+major(st->st_rdev), minor(st->st_rdev));
  sysfd = open(sysfspath, O_RDONLY);
  if (sysfd == -1) {
  ret = -errno;
@@ -1229,15 +1228,20 @@ out:
  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
  {
  BDRVRawState *s = bs->opaque;
+struct stat st;
+
+if (fstat(s->fd, &st)) {
+return;
+}


First, I think if we ignore an error, there should be a comment 
explaining why that’s OK.


It is OK here because inquiring transfer limits is best-effort.

However, the alignment probes below the following block are completely 
independent of `st`. Therefore, I don’t think we should skip them if 
`fstat()` failed here; only the `if (bs->sg || ...)` block should be 
skipped.


Max

  
-if (bs->sg) {

-int ret = sg_get_max_transfer_length(s->fd);
+if (bs->sg || S_ISBLK(st.st_mode)) {
+int ret = hdev_get_max_hw_transfer(s->fd, &st);
  
  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {

-bs->bl.max_hw_transfer = pow2floor(ret);
+bs->bl.max_hw_transfer = ret;
  }
  
-ret = sg_get_max_segments(s->fd);

+ret = hdev_get_max_segments(s->fd, &st);
  if (ret > 0) {
  bs->bl.max_iov = ret;
  }





Re: [PATCH v3 2/7] scsi-generic: pass max_segments via max_iov field in BlockLimits

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

I/O to a disk via read/write is not limited by the number of segments allowed
by the host adapter; the kernel can split requests if needed, and the limit
imposed by the host adapter can be very low (256k or so) to avoid that SG_IO
returns EINVAL if memory is heavily fragmented.

Since this value is only interesting for SG_IO-based I/O, do not include
it in the max_transfer and only take it into account when patching the
block limits VPD page in the scsi-generic device.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 3 +--
  hw/scsi/scsi-generic.c | 6 --
  2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 58db526cc2..e3241a0dd3 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1239,8 +1239,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
  
  ret = sg_get_max_segments(s->fd);

  if (ret > 0) {
-bs->bl.max_transfer = MIN(bs->bl.max_transfer,
-  ret * qemu_real_host_page_size);
+bs->bl.max_iov = ret;
  }
  }
  
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c

index 98c30c5d5c..82e1e2ee79 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -179,10 +179,12 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, 
SCSIDevice *s)
  (r->req.cmd.buf[1] & 0x01)) {
  page = r->req.cmd.buf[2];
  if (page == 0xb0) {
-uint32_t max_transfer =
-blk_get_max_transfer(s->conf.blk) / s->blocksize;
+uint32_t max_transfer = blk_get_max_transfer(s->conf.blk);
+uint32_t max_iov = blk_get_max_iov(s->conf.blk);
  
  assert(max_transfer);

+max_transfer = MIN_NON_ZERO(max_transfer, max_iov * 
qemu_real_host_page_size)
+/ s->blocksize;


Now that I ran checkpatch for patch 3, I saw that it complains about 
this line being longer than 80 characters. I think it could be split so 
it doesn’t exceed that limit. It looks a bit like you intentionally 
exceeded the warning limit, but didn’t exceed the error limit (of 90). 
Is that so?


Max




Re: [PATCH v3 3/7] block: add max_hw_transfer to BlockLimits

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

For block host devices, I/O can happen through either the kernel file
descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
or the SCSI passthrough ioctl SG_IO.

In the latter case, the size of each transfer can be limited by the
HBA, while for file descriptor I/O the kernel is able to split and
merge I/O in smaller pieces as needed.  Applying the HBA limits to
file descriptor I/O results in more system calls and suboptimal
performance, so this patch splits the max_transfer limit in two:
max_transfer remains valid and is used in general, while max_hw_transfer
is limited to the maximum hardware size.  max_hw_transfer can then be
included by the scsi-generic driver in the block limits page, to ensure
that the stricter hardware limit is used.

Signed-off-by: Paolo Bonzini 
---
  block/block-backend.c  | 12 
  block/file-posix.c |  2 +-
  block/io.c |  1 +
  hw/scsi/scsi-generic.c |  2 +-
  include/block/block_int.h  |  7 +++
  include/sysemu/block-backend.h |  1 +
  6 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 15f1ea4288..2ea1412a54 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1953,6 +1953,18 @@ uint32_t blk_get_request_alignment(BlockBackend *blk)
  return bs ? bs->bl.request_alignment : BDRV_SECTOR_SIZE;
  }
  
+/* Returns the maximum hardware transfer length, in bytes; guaranteed nonzero */

+uint64_t blk_get_max_hw_transfer(BlockBackend *blk)
+{
+BlockDriverState *bs = blk_bs(blk);
+uint64_t max = INT_MAX;
+
+if (bs) {
+max = MIN_NON_ZERO(bs->bl.max_hw_transfer, bs->bl.max_transfer);
+}
+return max;


Both `max_hw_transfer` and `max_transfer` can be 0, so this could return 
0, contrary to what the comment above promises.


Should `max` be initialized to 0 with a `MIN_NON_ZERO(max, INT_MAX)` 
here (like `blk_get_max_transfer()` does it)?


(As for the rest, I think aligning to the request alignment makes sense, 
but then again we don’t do that for max_transfer either, so... this at 
least wouldn’t be a new bug.


Regarding the comment, checkpatch complains about it, so it should be 
fixed so that /* is on its own line.


Speaking of checkpatch, now that I ran it, it also complains about the 
new line in bdrv_merge_limits() exceeding 80 characters, so that should 
be fixed, too.)


Max




Re: [PATCH v3 2/7] scsi-generic: pass max_segments via max_iov field in BlockLimits

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

I/O to a disk via read/write is not limited by the number of segments allowed
by the host adapter; the kernel can split requests if needed, and the limit
imposed by the host adapter can be very low (256k or so) to avoid that SG_IO
returns EINVAL if memory is heavily fragmented.

Since this value is only interesting for SG_IO-based I/O, do not include
it in the max_transfer and only take it into account when patching the
block limits VPD page in the scsi-generic device.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 3 +--
  hw/scsi/scsi-generic.c | 6 --
  2 files changed, 5 insertions(+), 4 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v3 1/7] file-posix: fix max_iov for /dev/sg devices

2021-06-15 Thread Max Reitz

On 03.06.21 15:37, Paolo Bonzini wrote:

Even though it was only called for devices that have bs->sg set (which
must be character devices),
sg_get_max_segments looked at /sys/dev/block which only works for
block devices.

On Linux the sg driver has its own way to provide the maximum number of
iovecs in a scatter/gather list.

Signed-off-by: Paolo Bonzini 
---
  block/file-posix.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index f37dfc10b3..58db526cc2 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1180,6 +1180,17 @@ static int sg_get_max_segments(int fd)
  goto out;
  }
  
+if (S_ISCHR(st->st_mode)) {

+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -EIO;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+


With %s/->/./:

Reviewed-by: Max Reitz 

(To answer Vladimir’s question, I don’t believe the condition should be 
bs->sg, because bs->sg == true is a given for this function anyway. 
Therefore, there’s no need to check whether the char device is an SG 
device.)





[PATCH 3/4] export/fuse: Let permissions be adjustable

2021-06-14 Thread Max Reitz
Allow changing the file mode, UID, and GID through SETATTR.

This only really makes sense with allow-other, though (because without
it, the effective access mode is fixed to be 0600 (u+rw) with qemu's
user being the file's owner), so changing these stat fields is not
allowed without allow-other.

Signed-off-by: Max Reitz 
---
 block/export/fuse.c | 48 ++---
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 1d54286d90..742e0af657 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -47,6 +47,10 @@ typedef struct FuseExport {
 bool writable;
 bool growable;
 bool allow_other;
+
+mode_t st_mode;
+uid_t st_uid;
+gid_t st_gid;
 } FuseExport;
 
 static GHashTable *exports;
@@ -120,6 +124,13 @@ static int fuse_export_create(BlockExport *blk_exp,
 exp->growable = args->growable;
 exp->allow_other = args->allow_other;
 
+exp->st_mode = S_IFREG | S_IRUSR;
+if (exp->writable) {
+exp->st_mode |= S_IWUSR;
+}
+exp->st_uid = getuid();
+exp->st_gid = getgid();
+
 ret = setup_fuse_export(exp, args->mountpoint, errp);
 if (ret < 0) {
 goto fail;
@@ -329,7 +340,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 int64_t length, allocated_blocks;
 time_t now = time(NULL);
 FuseExport *exp = fuse_req_userdata(req);
-mode_t mode;
 
 length = blk_getlength(exp->common.blk);
 if (length < 0) {
@@ -344,17 +354,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
 }
 
-mode = S_IFREG | S_IRUSR;
-if (exp->writable) {
-mode |= S_IWUSR;
-}
-
 statbuf = (struct stat) {
 .st_ino = inode,
-.st_mode= mode,
+.st_mode= exp->st_mode,
 .st_nlink   = 1,
-.st_uid = getuid(),
-.st_gid = getgid(),
+.st_uid = exp->st_uid,
+.st_gid = exp->st_gid,
 .st_size= length,
 .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
 .st_blocks  = allocated_blocks,
@@ -400,15 +405,23 @@ static int fuse_do_truncate(const FuseExport *exp, 
int64_t size,
 }
 
 /**
- * Let clients set file attributes.  Only resizing is supported.
+ * Let clients set file attributes.  With allow_other, only resizing and
+ * changing permissions (st_mode, st_uid, st_gid) is allowed.  Without
+ * allow_other, only resizing is supported.
  */
 static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat 
*statbuf,
  int to_set, struct fuse_file_info *fi)
 {
 FuseExport *exp = fuse_req_userdata(req);
+int supported_attrs;
 int ret;
 
-if (to_set & ~FUSE_SET_ATTR_SIZE) {
+supported_attrs = FUSE_SET_ATTR_SIZE;
+if (exp->allow_other) {
+supported_attrs |= FUSE_SET_ATTR_MODE | FUSE_SET_ATTR_UID |
+FUSE_SET_ATTR_GID;
+}
+if (to_set & ~supported_attrs) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
@@ -426,6 +439,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, 
struct stat *statbuf,
 }
 }
 
+if (to_set & FUSE_SET_ATTR_MODE) {
+/* Only allow changing the file mode, not the type */
+exp->st_mode = (statbuf->st_mode & 0) | S_IFREG;
+}
+
+if (to_set & FUSE_SET_ATTR_UID) {
+exp->st_uid = statbuf->st_uid;
+}
+
+if (to_set & FUSE_SET_ATTR_GID) {
+exp->st_gid = statbuf->st_gid;
+}
+
 fuse_getattr(req, inode, fi);
 }
 
-- 
2.31.1




[PATCH 1/4] export/fuse: Add allow-other option

2021-06-14 Thread Max Reitz
Without the allow_other mount option, no user (not even root) but the
one who started qemu/the storage daemon can access the export.  Allow
users to configure the export such that such accesses are possible.

When we do pass allow_other, we should also pass default_permissions,
because our export code performs no permission checks.  With
default_permissions, we can just let the kernel do it.

Signed-off-by: Max Reitz 
---
 qapi/block-export.json | 11 ++-
 block/export/fuse.c| 17 +++--
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index e819e70cac..5d1cc04ac4 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -132,11 +132,20 @@
 # @growable: Whether writes beyond the EOF should grow the block node
 #accordingly. (default: false)
 #
+# @allow-other: By default (if this is false), only qemu's user is allowed
+#   access to this export.  That cannot be changed even with
+#   chmod or chown.
+#   This option will allow other users access to the export with 
the
+#   FUSE mount options "allow_other,default_permissions".
+#   (default_permissions enables standard UNIX permission checks.)
+#   (since 6.1; default: false)
+#
 # Since: 6.0
 ##
 { 'struct': 'BlockExportOptionsFuse',
   'data': { 'mountpoint': 'str',
-'*growable': 'bool' },
+'*growable': 'bool',
+'*allow-other': 'bool' },
   'if': 'defined(CONFIG_FUSE)' }
 
 ##
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 38f74c94da..34a5836ece 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -46,6 +46,7 @@ typedef struct FuseExport {
 char *mountpoint;
 bool writable;
 bool growable;
+bool allow_other;
 } FuseExport;
 
 static GHashTable *exports;
@@ -117,6 +118,7 @@ static int fuse_export_create(BlockExport *blk_exp,
 exp->mountpoint = g_strdup(args->mountpoint);
 exp->writable = blk_exp_args->writable;
 exp->growable = args->growable;
+exp->allow_other = args->allow_other;
 
 ret = setup_fuse_export(exp, args->mountpoint, errp);
 if (ret < 0) {
@@ -150,11 +152,22 @@ static int setup_fuse_export(FuseExport *exp, const char 
*mountpoint,
 {
 const char *fuse_argv[4];
 char *mount_opts;
+const char *allow_other;
 struct fuse_args fuse_args;
 int ret;
 
-/* Needs to match what fuse_init() sets.  Only max_read must be supplied. 
*/
-mount_opts = g_strdup_printf("max_read=%zu", FUSE_MAX_BOUNCE_BYTES);
+if (exp->allow_other) {
+allow_other = ",allow_other,default_permissions";
+} else {
+allow_other = "";
+}
+
+/*
+ * max_read needs to match what fuse_init() sets.
+ * max_write need not be supplied.
+ */
+mount_opts = g_strdup_printf("max_read=%zu%s", FUSE_MAX_BOUNCE_BYTES,
+ allow_other);
 
 fuse_argv[0] = ""; /* Dummy program name */
 fuse_argv[1] = "-o";
-- 
2.31.1




[PATCH 0/4] export/fuse: Allow other users access to the export

2021-06-14 Thread Max Reitz
Hi,

With the default mount options, FUSE mounts are not accessible to any
users but the one who did the mount, not even to root.  To allow such
accesses, allow_other must be passed.

This is probably useful to some people (it certainly is to me, e.g. when
exporting some image as my normal user, and then trying to loop mount it
as root), so this series adds a QAPI allow-other bool that will make the
FUSE export code pass allow_other,default_permissions to FUSE.

(default_permissions will make the kernel do the usual UNIX permission
checks, which is something that makes a lot of sense when allowing other
users access to the export.)

This also requires our SETATTR code to be able to handle permission
changes, though, so the user can then run chmod/chown/chgrp on the
export to adjust its permissions to their need.

The final patch adds a test.


Max Reitz (4):
  export/fuse: Add allow-other option
  export/fuse: Give SET_ATTR_SIZE its own branch
  export/fuse: Let permissions be adjustable
  iotests/308: Test allow-other

 qapi/block-export.json | 11 -
 block/export/fuse.c| 83 +-
 tests/qemu-iotests/308 | 91 ++
 tests/qemu-iotests/308.out | 47 
 4 files changed, 210 insertions(+), 22 deletions(-)

-- 
2.31.1




[PATCH 4/4] iotests/308: Test allow-other

2021-06-14 Thread Max Reitz
We cannot reasonably test the main point of allow-other, which is to
allow users other than the current one to access the FUSE export,
because that would require access to sudo, which this test most likely
will not have.  (Also, we would need to figure out some user/group that
is on the machine and that is not the current user/group, which may
become a bit hairy.)

But we can test some byproducts: First, whether changing permissions
works (our FUSE code only allows so for allow-other=true), and second,
whether the kernel applies permission checks with allow-other=true
(because that implies default_permissions).

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/308 | 91 ++
 tests/qemu-iotests/308.out | 47 
 2 files changed, 138 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index f122065d0f..1b2f908947 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -334,6 +334,97 @@ echo '=== Compare copy with original ==='
 
 $QEMU_IMG compare -f raw -F $IMGFMT "$COPIED_IMG" "$TEST_IMG"
 
+echo
+echo '=== Test permissions ==='
+
+# Test that you can only change permissions on the export with 
allow-other=true.
+# We cannot really test the primary reason behind allow-other (i.e. to allow
+# users other than the current one access to the export), because for that we
+# would need sudo, which realistically nobody will allow this test to use.
+# What we can do is test that allow-other=true also enables 
default_permissions,
+# i.e. whether we can still read from the file if we remove the read 
permission.
+
+# $1: allow-other value ('true' or 'false')
+run_permission_test()
+{
+# Below, we want to test permission checks on the export.  To do that
+# properly, we need to ensure that those checks are not bypassed, so we 
have
+# to drop cap_dac_override.
+# (Note that cap_dac_read_search bypasses read permission checks, which is
+# why we try to open the file R/W below, so we do not have to care about
+# cap_dac_read_search here.)
+# $capsh is effectively a shell command, i.e. can be used like:
+#   $capsh -c "$command $args..."
+# By default it is just bash.
+capsh='/usr/bin/env bash'
+if type -p capsh > /dev/null; then
+if ! capsh --has-p=cap_dac_override 2>/dev/null; then
+# No cap_dac_override, we are good to go
+true # pass
+elif capsh --has-p=cap_setpcap 2>/dev/null; then
+# We will need to drop cap_dac_override, but doing so requires
+# cap_setpcap in turn
+capsh='capsh --drop=cap_dac_override --'
+else
+# Hopefully a rare case
+_notrun 'CAP_DAC_OVERRIDE must be dropped, but this cannot be' \
+'done without CAP_SETPCAP'
+fi
+elif [ "$(id -u)" -ne 0 ]; then
+# Non-root users probably do not have those capabilities, so try to get
+# by without capsh
+true # pass
+else
+# Running this test as root without capsh on the system should be a 
rare
+# case...
+_notrun 'No capsh found while run as root'
+fi
+
+_launch_qemu \
+-blockdev \
+
"$IMGFMT,node-name=node-format,file.driver=file,file.filename=$TEST_IMG"
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'qmp_capabilities'}" \
+'return'
+
+# Writable so we can open it R/W below
+fuse_export_add 'export' \
+"'mountpoint': '$EXT_MP',
+ 'writable': true,
+ 'allow-other': $1"
+
+echo '(Invoking chmod)'
+chmod 666 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+echo '(Removing all permissions)'
+chmod 000 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+
+# We want a permission denied here (with allow-other=true)
+# (Use $QEMU_IO_PROG, because $capsh will not know the wrapper that is
+# $QEMU_IO)
+# Use blkdebug to force-take the WRITE permission so this will definitely
+# try to open the export with O_RDWR.
+imgopts="driver=blkdebug,take-child-perms.0=write,"
+imgopts+="image.driver=file,image.filename=$EXT_MP"
+$capsh -c "$QEMU_IO_PROG -c quit --image-opts '$imgopts'" 2>&1 \
+| _filter_qemu_io | _filter_testdir | _filter_imgfmt
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'quit'}" \
+'return'
+
+wait=yes _cleanup_qemu
+}
+
+for ao in 'false' 'true'; do
+echo
+echo "--- allow-other=$ao ---

[PATCH 2/4] export/fuse: Give SET_ATTR_SIZE its own branch

2021-06-14 Thread Max Reitz
In order to support changing other attributes than the file size in
fuse_setattr(), we have to give each its own independent branch.  This
also applies to the only attribute we do support right now.

Signed-off-by: Max Reitz 
---
 block/export/fuse.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 34a5836ece..1d54286d90 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -408,20 +408,22 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t 
inode, struct stat *statbuf,
 FuseExport *exp = fuse_req_userdata(req);
 int ret;
 
-if (!exp->writable) {
-fuse_reply_err(req, EACCES);
-return;
-}
-
 if (to_set & ~FUSE_SET_ATTR_SIZE) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
 
-ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
-if (ret < 0) {
-fuse_reply_err(req, -ret);
-return;
+if (to_set & FUSE_SET_ATTR_SIZE) {
+if (!exp->writable) {
+fuse_reply_err(req, EACCES);
+return;
+}
+
+ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+if (ret < 0) {
+fuse_reply_err(req, -ret);
+return;
+}
 }
 
 fuse_getattr(req, inode, fi);
-- 
2.31.1




[PATCH v2 9/9] virtiofsd: Add lazy lo_do_find()

2021-06-09 Thread Max Reitz
lo_find() right now takes two lookup keys for two maps, namely the file
handle for inodes_by_handle and the statx information for inodes_by_ids.
However, we only need the statx information if looking up the inode by
the file handle failed.

There are two callers of lo_find(): The first one, lo_do_lookup(), has
both keys anyway, so passing them does not incur any additional cost.
The second one, lookup_name(), though, needs to explicitly invoke
name_to_handle_at() (through get_file_handle()) and statx() (through
do_statx()).  We need to try to get a file handle as the primary key, so
we cannot get rid of get_file_handle(), but we only need the statx
information if looking up an inode by handle failed; so we can defer
that until the lookup has indeed failed.

To this end, replace lo_find()'s st/mnt_id parameters by a get_ids()
closure that is invoked to fill the lo_key struct if necessary.

Also, lo_find() is renamed to lo_do_find(), so we can add a new
lo_find() wrapper whose closure just initializes the lo_key from the
st/mnt_id parameters, just like the old lo_find() did.

lookup_name() directly calls lo_do_find() now and passes its own
closure, which performs the do_statx() call.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 93 ++--
 1 file changed, 76 insertions(+), 17 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 2e56c40b2f..8990fd5bd2 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1168,22 +1168,23 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo,
-const struct lo_fhandle *fhandle,
-struct stat *st, uint64_t mnt_id)
+/*
+ * get_ids() will be called to get the key for lo->inodes_by_ids if
+ * the lookup by file handle has failed.
+ */
+static struct lo_inode *lo_do_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+int (*get_ids)(struct lo_key *, const void *),
+const void *get_ids_opaque)
 {
 struct lo_inode *p = NULL;
-struct lo_key ids_key = {
-.ino = st->st_ino,
-.dev = st->st_dev,
-.mnt_id = mnt_id,
-};
+struct lo_key ids_key;
 
 pthread_mutex_lock(&lo->mutex);
 if (fhandle) {
 p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
 }
-if (!p) {
+if (!p && get_ids(&ids_key, get_ids_opaque) == 0) {
 p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
 /*
  * When we had to fall back to looking up an inode by its IDs,
@@ -1211,6 +1212,36 @@ static struct lo_inode *lo_find(struct lo_data *lo,
 return p;
 }
 
+struct lo_find_get_ids_key_opaque {
+const struct stat *st;
+uint64_t mnt_id;
+};
+
+static int lo_find_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lo_find_get_ids_key_opaque *stat_info = opaque;
+
+*ids_key = (struct lo_key){
+.ino = stat_info->st->st_ino,
+.dev = stat_info->st->st_dev,
+.mnt_id = stat_info->mnt_id,
+};
+
+return 0;
+}
+
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
+{
+const struct lo_find_get_ids_key_opaque stat_info = {
+.st = st,
+.mnt_id = mnt_id,
+};
+
+return lo_do_find(lo, fhandle, lo_find_get_ids_key, &stat_info);
+}
+
 /* value_destroy_func for posix_locks GHashTable */
 static void posix_locks_value_destroy(gpointer data)
 {
@@ -1682,14 +1713,41 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
+struct lookup_name_get_ids_key_opaque {
+struct lo_data *lo;
+int parent_fd;
+const char *name;
+};
+
+static int lookup_name_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lookup_name_get_ids_key_opaque *stat_params = opaque;
+uint64_t mnt_id;
+struct stat attr;
+int res;
+
+res = do_statx(stat_params->lo, stat_params->parent_fd, stat_params->name,
+   &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
+if (res < 0) {
+return -errno;
+}
+
+*ids_key = (struct lo_key){
+.ino = attr.st_ino,
+.dev = attr.st_dev,
+.mnt_id = mnt_id,
+};
+
+return 0;
+}
+
 /* Increments nlookup and caller must release refcount using lo_inode_put() */
 static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
 const char *name)
 {
 g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
-uint64_t mnt_id;
-struct stat attr;
+struct lookup_name_get_ids_key_opaque stat_params;
 struct lo_fhandle *fh;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *dir = lo_inode(req, parent);
@@ -1

[PATCH v2 4/9] virtiofsd: Let lo_fd() return a TempFd

2021-06-09 Thread Max Reitz
Accessing lo_inode.fd must generally happen through lo_inode_fd(), and
lo_fd() is no exception; and then it must pass on the TempFd it has
received from lo_inode_fd().

(Note that all lo_fd() calls now use proper error handling, where all of
them were in-line before; i.e. they were used in place of the fd
argument of some function call.  This only worked because the only error
that could occur was that lo_inode() failed to find the inode ID: Then
-1 would be passed as the fd, which would result in an EBADF error,
which is precisely what we would want to return to the guest for an
invalid inode ID.
Now, though, lo_inode_fd() might potentially invoke open_by_handle_at(),
which can return many different errors, and they should be properly
handled and returned to the guest.  So we can no longer allow lo_fd() to
be used in-line, and instead need to do proper error handling for it.)

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 55 +---
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 46c9dfe200..8f64bcd6c5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -644,18 +644,19 @@ static int lo_inode_fd(const struct lo_inode *inode, 
TempFd *tfd)
  * they are done with the fd.  This will be done in a later patch to make
  * review easier.
  */
-static int lo_fd(fuse_req_t req, fuse_ino_t ino)
+static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd)
 {
 struct lo_inode *inode = lo_inode(req, ino);
-int fd;
+int res;
 
 if (!inode) {
-return -1;
+return -EBADF;
 }
 
-fd = inode->fd;
+res = lo_inode_fd(inode, tfd);
+
 lo_inode_put(lo_data(req), &inode);
-return fd;
+return res;
 }
 
 /*
@@ -766,14 +767,19 @@ static void lo_init(void *userdata, struct fuse_conn_info 
*conn)
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 int res;
 struct stat buf;
 struct lo_data *lo = lo_data(req);
 
 (void)fi;
 
-res =
-fstatat(lo_fd(req, ino), "", &buf, AT_EMPTY_PATH | 
AT_SYMLINK_NOFOLLOW);
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = fstatat(ino_fd.fd, "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -1441,6 +1447,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, 
fuse_ino_t parent,
 
 static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1455,13 +1462,19 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, 
const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR);
+res = unlinkat(parent_fd.fd, name, AT_REMOVEDIR);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1547,6 +1560,7 @@ out:
 
 static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1561,13 +1575,19 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t 
parent, const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, 0);
+res = unlinkat(parent_fd.fd, name, 0);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1647,10 +1667,16 @@ static void lo_forget_multi(fuse_req_t req, size_t 
count,
 
 static void lo_readlink(fuse_req_t req, fuse_ino_t ino)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 char buf[PATH_MAX + 1];
 int res;
 
-res = readlinkat(lo_fd(req, ino), "", buf, sizeof(buf));
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = readlinkat(ino_fd.fd, "", buf, sizeof(buf));
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -2447,10 +2473,17 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
 
 static void lo_statfs(fuse_re

[PATCH v2 8/9] virtiofsd: Optionally fill lo_inode.fhandle

2021-06-09 Thread Max Reitz
When the inode_file_handles option is set, try to generate a file handle
for new inodes instead of opening an O_PATH FD.

Being able to open these again will require CAP_DAC_READ_SEARCH, so the
description text tells the user they will also need to specify
-o modcaps=+dac_read_search.

Generating a file handle returns the mount ID it is valid for.  Opening
it will require an FD instead.  We have mount_fds to map an ID to an FD.
get_file_handle() fills the hash map by opening the file we have
generated a handle for.  To verify that the resulting FD indeed
represents the handle's mount ID, we use statx().  Therefore, using file
handles requires statx() support.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 197 --
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 3 files changed, 192 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 5e98ed702b..954f8639e6 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -186,6 +186,9 @@ void fuse_cmdline_help(void)
"   to virtiofsd from guest 
applications.\n"
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the 
guest\n"
+   "-o inode_file_handles  Use file handles to reference 
inodes\n"
+   "   instead of O_PATH file 
descriptors\n"
+   "   (requires -o 
modcaps=+dac_read_search)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 793d2c333e..2e56c40b2f 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -190,6 +190,7 @@ struct lo_data {
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
 int user_killpriv_v2, killpriv_v2;
+int inode_file_handles;
 };
 
 /**
@@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = {
 { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 },
 { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 },
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
+{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
+{ "no_inode_file_handles",
+  offsetof(struct lo_data, inode_file_handles),
+  0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -315,6 +320,135 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Generate a file handle for the given dirfd/name combination.
+ *
+ * If mount_fds does not yet contain an entry for the handle's mount
+ * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
+ * as the FD for that mount ID.  (That is the file that we have
+ * generated a handle for, so it should be representative for the
+ * mount ID.  However, to be sure (and to rule out races), we use
+ * statx() to verify that our assumption is correct.)
+ */
+static struct lo_fhandle *get_file_handle(struct lo_data *lo,
+  int dirfd, const char *name)
+{
+/* We need statx() to verify the mount ID */
+#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
+struct lo_fhandle *fh;
+int ret;
+
+if (!lo->use_statx || !lo->inode_file_handles) {
+return NULL;
+}
+
+fh = g_new0(struct lo_fhandle, 1);
+
+fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
+ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
+AT_EMPTY_PATH);
+if (ret < 0) {
+goto fail;
+}
+
+if (pthread_rwlock_rdlock(&mount_fds_lock)) {
+goto fail;
+}
+if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
+g_auto(TempFd) path_fd = TEMP_FD_INIT;
+struct statx stx;
+char procname[64];
+int fd;
+
+pthread_rwlock_unlock(&mount_fds_lock);
+
+/*
+ * Before opening an O_RDONLY fd, check whether dirfd/name is a regular
+ * file or directory, because we must not open anything else with
+ * anything but O_PATH.
+ * (And we use that occasion to verify that the file has the mount ID 
we
+ * need.)
+ */
+if (name[0]) {
+path_fd.fd = openat(dirfd, name, O_PATH);
+if (path_fd.fd < 0) {
+goto fail;
+}
+path_fd.owned = true;
+} else {
+path_fd.fd = dirfd;
+path_fd.owned = false;
+}
+
+ret = statx(path_fd.fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
+

[PATCH v2 2/9] virtiofsd: Use lo_inode_open() instead of openat()

2021-06-09 Thread Max Reitz
The xattr functions want a non-O_PATH FD, so they reopen the lo_inode.fd
with the flags they need through /proc/self/fd.

Similarly, lo_opendir() needs an O_RDONLY FD.  Instead of the
/proc/self/fd trick, it just uses openat(fd, "."), because the FD is
guaranteed to be a directory, so this works.

All cases have one problem in common, though: In the future, when we may
have a file handle in the lo_inode instead of an FD, querying an
lo_inode FD may incur an open_by_handle_at() call.  It does not make
sense to then reopen that FD with custom flags, those should have been
passed to open_by_handle_at() instead.

Use lo_inode_open() instead of openat().  As part of the file handle
change, lo_inode_open() will be made to invoke openat() only if
lo_inode.fd is valid.  Otherwise, it will invoke open_by_handle_at()
with the right flags from the start.

Consequently, after this patch, lo_inode_open() is the only place to
invoke openat() to reopen an existing FD with different flags.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 43 
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index a4674aba80..436f771d2a 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1645,18 +1645,26 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 {
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
-struct lo_dirp *d;
+struct lo_inode *inode;
+struct lo_dirp *d = NULL;
 int fd;
 ssize_t fh;
 
+inode = lo_inode(req, ino);
+if (!inode) {
+error = EBADF;
+goto out_err;
+}
+
 d = calloc(1, sizeof(struct lo_dirp));
 if (d == NULL) {
 goto out_err;
 }
 
-fd = openat(lo_fd(req, ino), ".", O_RDONLY);
-if (fd == -1) {
-goto out_errno;
+fd = lo_inode_open(lo, inode, O_RDONLY);
+if (fd < 0) {
+error = -fd;
+goto out_err;
 }
 
 d->dp = fdopendir(fd);
@@ -1685,6 +1693,7 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 out_errno:
 error = errno;
 out_err:
+lo_inode_put(lo, &inode);
 if (d) {
 if (d->dp) {
 closedir(d->dp);
@@ -2827,7 +2836,6 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 /*
  * It is not safe to open() non-regular/non-dir files in file server
  * unless O_PATH is used, so use that method for regular files/dir
@@ -2835,12 +2843,14 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
  * Otherwise, call fchdir() to avoid open().
  */
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = fgetxattr(fd, name, value, size);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = getxattr(procname, name, value, size);
@@ -2906,14 +2916,15 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t 
ino, size_t size)
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = flistxattr(fd, value, size);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = listxattr(procname, value, size);
@@ -3039,15 +3050,15 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64
  ", name=%s value=%s size=%zd)\n", ino, name, value, size);
 
-sprintf(procname, "%i", inode->fd);
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-saverr = errno;
+saverr = -fd;
 goto out;
 }
 ret = fsetxattr(fd, name, value, size, flags);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = setxattr(procname, name, value, size, flags);
@@ -3

[PATCH v2 6/9] virtiofsd: Add lo_inode.fhandle

2021-06-09 Thread Max Reitz
This new field is an alternative to lo_inode.fd: Either of the two must
be set.  In case an O_PATH FD is needed for some lo_inode, it is either
taken from lo_inode.fd, if valid, or a temporary FD is opened with
open_by_handle_at().

Using a file handle instead of an FD has the advantage of keeping the
number of open file descriptors low.

Because open_by_handle_at() requires a mount FD (i.e. a non-O_PATH FD
opened on the filesystem to which the file handle refers), but every
lo_fhandle only has a mount ID (as returned by name_to_handle_at()), we
keep a hash map of such FDs in mount_fds (mapping ID to FD).
get_file_handle(), which is added by a later patch, will ensure that
every mount ID for which we have generated a handle has a corresponding
entry in mount_fds.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c  | 116 ++
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 2 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 3014e8baf8..e665575401 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -88,8 +88,25 @@ struct lo_key {
 uint64_t mnt_id;
 };
 
+struct lo_fhandle {
+union {
+struct file_handle handle;
+char padding[sizeof(struct file_handle) + MAX_HANDLE_SZ];
+};
+int mount_id;
+};
+
+/* Maps mount IDs to an FD that we can pass to open_by_handle_at() */
+static GHashTable *mount_fds;
+pthread_rwlock_t mount_fds_lock = PTHREAD_RWLOCK_INITIALIZER;
+
 struct lo_inode {
+/*
+ * Either of fd or fhandle must be set (i.e. >= 0 or non-NULL,
+ * respectively).
+ */
 int fd;
+struct lo_fhandle *fhandle;
 
 /*
  * Atomic reference count for this object.  The nlookup field holds a
@@ -296,6 +313,44 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Open the given file handle with the given flags.
+ *
+ * The mount FD to pass to open_by_handle_at() is taken from the
+ * mount_fds hash map.
+ *
+ * On error, return -errno.
+ */
+static int open_file_handle(const struct lo_fhandle *fh, int flags)
+{
+gpointer mount_fd_ptr;
+int mount_fd;
+bool found;
+int ret;
+
+ret = pthread_rwlock_rdlock(&mount_fds_lock);
+if (ret) {
+return -ret;
+}
+
+/* mount_fd == 0 is valid, so we need lookup_extended */
+found = g_hash_table_lookup_extended(mount_fds,
+ GINT_TO_POINTER(fh->mount_id),
+ NULL, &mount_fd_ptr);
+pthread_rwlock_unlock(&mount_fds_lock);
+if (!found) {
+return -EINVAL;
+}
+mount_fd = GPOINTER_TO_INT(mount_fd_ptr);
+
+ret = open_by_handle_at(mount_fd, (struct file_handle *)&fh->handle, 
flags);
+if (ret < 0) {
+return -errno;
+}
+
+return ret;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -602,7 +657,11 @@ static void lo_inode_put(struct lo_data *lo, struct 
lo_inode **inodep)
 *inodep = NULL;
 
 if (g_atomic_int_dec_and_test(&inode->refcount)) {
-close(inode->fd);
+if (inode->fd >= 0) {
+close(inode->fd);
+} else {
+g_free(inode->fhandle);
+}
 free(inode);
 }
 }
@@ -629,10 +688,25 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 
 static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
 {
-*tfd = (TempFd) {
-.fd = inode->fd,
-.owned = false,
-};
+if (inode->fd >= 0) {
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+} else {
+int fd;
+
+assert(inode->fhandle != NULL);
+fd = open_file_handle(inode->fhandle, O_PATH);
+if (fd < 0) {
+return -errno;
+}
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+}
 
 return 0;
 }
@@ -672,22 +746,32 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
 static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
  int open_flags, TempFd *tfd)
 {
-g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
+g_autofree char *fd_str = NULL;
 int fd;
 
 if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) {
 return -EBADF;
 }
 
-/*
- * The file is a symlink so O_NOFOLLOW must be ignored. We checked earlier
- * that the inode is not a special file but if an external process races
- * with us then symlinks are traversed here. It is not possible to escape
- * the shared directory since it is mounted as "/" though.
- */
-fd = openat(lo->proc_self_fd, fd_str, ope

[PATCH v2 5/9] virtiofsd: Let lo_inode_open() return a TempFd

2021-06-09 Thread Max Reitz
Strictly speaking, this is not necessary, because lo_inode_open() will
always return a new FD owned by the caller, so TempFd.owned will always
be true.

However, auto-cleanup is nice, and in some cases this plays nicely with
an lo_inode_fd() call in another conditional branch (see lo_setattr()).

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 137 +--
 1 file changed, 59 insertions(+), 78 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 8f64bcd6c5..3014e8baf8 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -285,10 +285,8 @@ static void temp_fd_clear(TempFd *temp_fd)
 /**
  * Return an owned fd from *temp_fd that will not be closed when
  * *temp_fd goes out of scope.
- *
- * (TODO: Remove __attribute__ once this is used.)
  */
-static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+static int temp_fd_steal(TempFd *temp_fd)
 {
 if (temp_fd->owned) {
 temp_fd->owned = false;
@@ -667,9 +665,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
  * when a malicious client opens special files such as block device nodes.
  * Symlink inodes are also rejected since symlinks must already have been
  * traversed on the client side.
+ *
+ * The fd is returned in tfd->fd.  The return value is 0 on success and -errno
+ * otherwise.
  */
-static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
- int open_flags)
+static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
+ int open_flags, TempFd *tfd)
 {
 g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
 int fd;
@@ -688,7 +689,13 @@ static int lo_inode_open(struct lo_data *lo, struct 
lo_inode *inode,
 if (fd < 0) {
 return -errno;
 }
-return fd;
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+
+return 0;
 }
 
 static void lo_init(void *userdata, struct fuse_conn_info *conn)
@@ -820,7 +827,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-res = lo_inode_fd(inode, &inode_fd);
+if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
+/* We need an O_RDWR FD for ftruncate() */
+res = lo_inode_open(lo, inode, O_RDWR, &inode_fd);
+} else {
+res = lo_inode_fd(inode, &inode_fd);
+}
 if (res < 0) {
 saverr = -res;
 goto out_err;
@@ -868,18 +880,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 truncfd = fd;
 } else {
-truncfd = lo_inode_open(lo, inode, O_RDWR);
-if (truncfd < 0) {
-saverr = -truncfd;
-goto out_err;
-}
+truncfd = inode_fd.fd;
 }
 
 saverr = drop_security_capability(lo, truncfd);
 if (saverr) {
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 
@@ -887,9 +892,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 res = drop_effective_cap("FSETID", &cap_fsetid_dropped);
 if (res != 0) {
 saverr = res;
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 }
@@ -902,9 +904,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 fuse_log(FUSE_LOG_ERR, "Failed to gain CAP_FSETID\n");
 }
 }
-if (!fi) {
-close(truncfd);
-}
 if (res == -1) {
 goto out_err;
 }
@@ -1734,11 +1733,12 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct 
fuse_file_info *fi)
 static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
 struct lo_dirp *d = NULL;
-int fd;
+int res;
 ssize_t fh;
 
 inode = lo_inode(req, ino);
@@ -1752,13 +1752,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 goto out_err;
 }
 
-fd = lo_inode_open(lo, inode, O_RDONLY);
-if (fd < 0) {
-error = -fd;
+res = lo_inode_open(lo, inode, O_RDONLY, &inode_fd);
+if (res < 0) {
+error = -res;
 goto out_err;
 }
 
-d->dp = fdopendir(fd);
+d->dp = fdopendir(temp_fd_steal(&inode_fd));
 if (d->dp == NULL) {
 goto out_errno;
 }
@@ -1788,8 +1788,6 @@ out_err:
 if (d) {
 if (d->dp) {
 closedir(d->dp);
-} else if (fd != -1) {
-close(fd)

[PATCH v2 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-09 Thread Max Reitz
Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.  Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

Note that we do not generate lo_fhandle objects yet, and so we also do
not enter anything into the inodes_by_handle map yet.  Also, all lookups
skip that map.  We might manually create file handles with some code
that is immediately removed by the next patch again, but that would
break the assumption in lo_find() that every lo_inode with a non-NULL
.fhandle must have an entry in inodes_by_handle and vice versa.  So we
leave actually using the inodes_by_handle map for the next patch.

[1] If some application in the guest still has the file open, there is
going to be a corresponding FD mapping in lo_data.fd_map.  In such a
case, the inode will only go away once every application in the guest
has closed it.  The problem described only applies to cases where the
guest does not have the file open, and it is just in the dentry cache,
basically.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 80 +---
 1 file changed, 64 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index e665575401..793d2c333e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -179,7 +179,8 @@ struct lo_data {
 int announce_submounts;
 bool use_statx;
 struct lo_inode root;
-GHashTable *inodes; /* protected by lo->mutex */
+GHashTable *inodes_by_ids; /* protected by lo->mutex */
+GHashTable *inodes_by_handle; /* protected by lo->mutex */
 struct lo_map ino_map; /* protected by lo->mutex */
 struct lo_map dirp_map; /* protected by lo->mutex */
 struct lo_map fd_map; /* protected by lo->mutex */
@@ -257,8 +258,9 @@ static struct {
 /* That we loaded cap-ng in the current thread from the saved */
 static __thread bool cap_loaded = 0;
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id);
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id);
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
@@ -1032,18 +1034,39 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id)
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
 {
-struct lo_inode *p;
-struct lo_key key = {
+struct lo_inode *p = NULL;
+struct lo_key ids_key = {
 .ino = st->st_ino,
 .dev = st->st_dev,
 .mnt_id = mnt_id,
 };
 
 pthread_mutex_lock(&lo->mutex);
-p = g_hash_table_lookup(lo->inodes, &key);
+if (fhandle) {
+p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
+}
+if (!p) {
+p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
+/*
+ * When we had to fall back to looking up an inode by its IDs,
+ * ensure that we hit an entry that does not have a file
+ * handle.  Entries with file handles must also have a handle
+ * alt key, so if w

[PATCH v2 3/9] virtiofsd: Add lo_inode_fd() helper

2021-06-09 Thread Max Reitz
Once we let lo_inode.fd be optional, we will need its users to open the
file handle stored in lo_inode instead.  This function will do that.

For now, it just returns lo_inode.fd, though.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 138 ++-
 1 file changed, 117 insertions(+), 21 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 436f771d2a..46c9dfe200 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -629,6 +629,16 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 return elem->inode;
 }
 
+static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
+{
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+
+return 0;
+}
+
 /*
  * TODO Remove this helper and force callers to hold an inode refcount until
  * they are done with the fd.  This will be done in a later patch to make
@@ -790,11 +800,11 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info 
*fi)
 static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int saverr;
 char procname[64];
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
-int ifd;
 int res;
 int fd = -1;
 
@@ -804,7 +814,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-ifd = inode->fd;
+res = lo_inode_fd(inode, &inode_fd);
+if (res < 0) {
+saverr = -res;
+goto out_err;
+}
 
 /* If fi->fh is invalid we'll report EBADF later */
 if (fi) {
@@ -815,7 +829,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = fchmod(fd, attr->st_mode);
 } else {
-sprintf(procname, "%i", ifd);
+sprintf(procname, "%i", inode_fd.fd);
 res = fchmodat(lo->proc_self_fd, procname, attr->st_mode, 0);
 }
 if (res == -1) {
@@ -827,12 +841,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 uid_t uid = (valid & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)-1;
 gid_t gid = (valid & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)-1;
 
-saverr = drop_security_capability(lo, ifd);
+saverr = drop_security_capability(lo, inode_fd.fd);
 if (saverr) {
 goto out_err;
 }
 
-res = fchownat(ifd, "", uid, gid, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
+res = fchownat(inode_fd.fd, "", uid, gid,
+   AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 saverr = errno;
 goto out_err;
@@ -911,7 +926,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = futimens(fd, tv);
 } else {
-sprintf(procname, "%i", inode->fd);
+sprintf(procname, "%i", inode_fd.fd);
 res = utimensat(lo->proc_self_fd, procname, tv, 0);
 }
 if (res == -1) {
@@ -1026,7 +1041,8 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 struct fuse_entry_param *e,
 struct lo_inode **inodep)
 {
-int newfd;
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
+int newfd = -1;
 int res;
 int saverr;
 uint64_t mnt_id;
@@ -1056,7 +1072,13 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 name = ".";
 }
 
-newfd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
+newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
 if (newfd == -1) {
 goto out_err;
 }
@@ -1123,6 +1145,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 
 out_err:
 saverr = errno;
+out:
 if (newfd != -1) {
 close(newfd);
 }
@@ -1228,6 +1251,7 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
  const char *name, mode_t mode, dev_t rdev,
  const char *link)
 {
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
 int saverr;
 struct lo_data *lo = lo_data(req);
@@ -1251,12 +1275,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
 return;
 }
 
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
 saverr = lo_change_cred(req, &old);
 if (saverr) {
 goto out;
 }
 
-res = mknod_wrapper(dir->f

[PATCH v2 1/9] virtiofsd: Add TempFd structure

2021-06-09 Thread Max Reitz
We are planning to add file handles to lo_inode objects as an
alternative to lo_inode.fd.  That means that everywhere where we
currently reference lo_inode.fd, we will have to open a temporary file
descriptor that needs to be closed after use.

So instead of directly accessing lo_inode.fd, there will be a helper
function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new
file descriptor with open_by_handle_at().  It encapsulates this result
in a TempFd structure to let the caller know whether the FD needs to be
closed after use (opened from the handle) or not (copied from
lo_inode.fd).

By using g_auto(TempFd) to store this result, callers will not even have
to care about closing a temporary FD after use.  It will be done
automatically once the object goes out of scope.

Signed-off-by: Max Reitz 
Reviewed-by: Connor Kuehl 
---
 tools/virtiofsd/passthrough_ll.c | 49 
 1 file changed, 49 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 49c21fd855..a4674aba80 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -174,6 +174,28 @@ struct lo_data {
 int user_killpriv_v2, killpriv_v2;
 };
 
+/**
+ * Represents a file descriptor that may either be owned by this
+ * TempFd, or only referenced (i.e. the ownership belongs to some
+ * other object, and the value has just been copied into this TempFd).
+ *
+ * The purpose of this encapsulation is to be used as g_auto(TempFd)
+ * to automatically clean up owned file descriptors when this object
+ * goes out of scope.
+ *
+ * Use temp_fd_steal() to get an owned file descriptor that will not
+ * be closed when the TempFd goes out of scope.
+ */
+typedef struct {
+int fd;
+bool owned; /* fd owned by this object? */
+} TempFd;
+
+#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false })
+
+static void temp_fd_clear(TempFd *temp_fd);
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear);
+
 static const struct fuse_opt lo_opts[] = {
 { "sandbox=namespace",
   offsetof(struct lo_data, sandbox),
@@ -249,6 +271,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/**
+ * Clean-up function for TempFds
+ */
+static void temp_fd_clear(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+close(temp_fd->fd);
+*temp_fd = TEMP_FD_INIT;
+}
+}
+
+/**
+ * Return an owned fd from *temp_fd that will not be closed when
+ * *temp_fd goes out of scope.
+ *
+ * (TODO: Remove __attribute__ once this is used.)
+ */
+static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+temp_fd->owned = false;
+return temp_fd->fd;
+} else {
+return dup(temp_fd->fd);
+}
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
-- 
2.31.1




[PATCH v2 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs

2021-06-09 Thread Max Reitz
Hi,

v1 cover letter for an overview:
https://listman.redhat.com/archives/virtio-fs/2021-June/msg00033.html

In v2, I (tried to) fix the bug Dave found, which is that
get_file_handle() indiscriminately opened the given dirfd/name
combination to get an O_RDONLY fd without checking whether we’re
actually allowed to open dirfd/name; namely, we don’t allow ourselves to
open files that aren’t regular files or directories.

So that openat(.., O_RDONLY) is changed to an openat(..., O_PATH), and
then check the file type with the statx() we’re doing anyway.  If the
file is OK to open, we reopen it O_RDONLY with the help of
/proc/self/fd, like we always do.

(This only affects patch 8.)


git-backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/9:[] [--] 'virtiofsd: Add TempFd structure'
002/9:[] [--] 'virtiofsd: Use lo_inode_open() instead of openat()'
003/9:[] [--] 'virtiofsd: Add lo_inode_fd() helper'
004/9:[] [--] 'virtiofsd: Let lo_fd() return a TempFd'
005/9:[] [--] 'virtiofsd: Let lo_inode_open() return a TempFd'
006/9:[] [--] 'virtiofsd: Add lo_inode.fhandle'
007/9:[] [--] 'virtiofsd: Add inodes_by_handle hash table'
008/9:[0045] [FC] 'virtiofsd: Optionally fill lo_inode.fhandle'
009/9:[] [--] 'virtiofsd: Add lazy lo_do_find()'


Max Reitz (9):
  virtiofsd: Add TempFd structure
  virtiofsd: Use lo_inode_open() instead of openat()
  virtiofsd: Add lo_inode_fd() helper
  virtiofsd: Let lo_fd() return a TempFd
  virtiofsd: Let lo_inode_open() return a TempFd
  virtiofsd: Add lo_inode.fhandle
  virtiofsd: Add inodes_by_handle hash table
  virtiofsd: Optionally fill lo_inode.fhandle
  virtiofsd: Add lazy lo_do_find()

 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 836 +-
 tools/virtiofsd/passthrough_seccomp.c |   2 +
 3 files changed, 694 insertions(+), 147 deletions(-)

-- 
2.31.1




Re: [PATCH 8/9] virtiofsd: Optionally fill lo_inode.fhandle

2021-06-08 Thread Max Reitz

On 08.06.21 12:43, Dr. David Alan Gilbert wrote:

* Max Reitz (mre...@redhat.com) wrote:

When the inode_file_handles option is set, try to generate a file handle
for new inodes instead of opening an O_PATH FD.

Being able to open these again will require CAP_DAC_READ_SEARCH, so the
description text tells the user they will also need to specify
-o modcaps=+dac_read_search.

Generating a file handle returns the mount ID it is valid for.  Opening
it will require an FD instead.  We have mount_fds to map an ID to an FD.
get_file_handle() fills the hash map by opening the file we have
generated a handle for.  To verify that the resulting FD indeed
represents the handle's mount ID, we use statx().  Therefore, using file
handles requires statx() support.

Signed-off-by: Max Reitz 
---
  tools/virtiofsd/helper.c  |   3 +
  tools/virtiofsd/passthrough_ll.c  | 170 --
  tools/virtiofsd/passthrough_seccomp.c |   1 +
  3 files changed, 165 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 5e98ed702b..954f8639e6 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -186,6 +186,9 @@ void fuse_cmdline_help(void)
 "   to virtiofsd from guest 
applications.\n"
 "   default: no_allow_direct_io\n"
 "-o announce_submounts  Announce sub-mount points to the 
guest\n"
+   "-o inode_file_handles  Use file handles to reference 
inodes\n"
+   "   instead of O_PATH file 
descriptors\n"
+   "   (requires -o 
modcaps=+dac_read_search)\n"
 );
  }
  
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c

index 793d2c333e..d01f9d3a59 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -190,6 +190,7 @@ struct lo_data {
  /* An O_PATH file descriptor to /proc/self/fd/ */
  int proc_self_fd;
  int user_killpriv_v2, killpriv_v2;
+int inode_file_handles;
  };
  
  /**

@@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = {
  { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 },
  { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 },
  { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
+{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
+{ "no_inode_file_handles",
+  offsetof(struct lo_data, inode_file_handles),
+  0 },
  FUSE_OPT_END
  };
  static bool use_syslog = false;
@@ -315,6 +320,108 @@ static int temp_fd_steal(TempFd *temp_fd)
  }
  }
  
+/**

+ * Generate a file handle for the given dirfd/name combination.
+ *
+ * If mount_fds does not yet contain an entry for the handle's mount
+ * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
+ * as the FD for that mount ID.  (That is the file that we have
+ * generated a handle for, so it should be representative for the
+ * mount ID.  However, to be sure (and to rule out races), we use
+ * statx() to verify that our assumption is correct.)
+ */
+static struct lo_fhandle *get_file_handle(struct lo_data *lo,
+  int dirfd, const char *name)
+{
+/* We need statx() to verify the mount ID */
+#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
+struct lo_fhandle *fh;
+int ret;
+
+if (!lo->use_statx || !lo->inode_file_handles) {
+return NULL;
+}
+
+fh = g_new0(struct lo_fhandle, 1);
+
+fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
+ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
+AT_EMPTY_PATH);
+if (ret < 0) {
+goto fail;
+}
+
+if (pthread_rwlock_rdlock(&mount_fds_lock)) {
+goto fail;
+}
+if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
+struct statx stx;
+int fd;
+
+pthread_rwlock_unlock(&mount_fds_lock);
+
+if (name[0]) {
+fd = openat(dirfd, name, O_RDONLY);

But can't that be a device file or other special file that you must not
open?


Yes. So I think the right thing to do here is to use O_PATH (which 
should be safe), then statx() to find out what we need to know (now not 
only the mount ID, but the file type, too), and if it’s safe, we can 
then openat(proc_self_fd, str(fd), O_RDONLY) (and close the O_PATH fd).


If it isn’t safe, we just return an error, and the inode gets an O_PATH fd.

Max




[PATCH 9/9] virtiofsd: Add lazy lo_do_find()

2021-06-04 Thread Max Reitz
lo_find() right now takes two lookup keys for two maps, namely the file
handle for inodes_by_handle and the statx information for inodes_by_ids.
However, we only need the statx information if looking up the inode by
the file handle failed.

There are two callers of lo_find(): The first one, lo_do_lookup(), has
both keys anyway, so passing them does not incur any additional cost.
The second one, lookup_name(), though, needs to explicitly invoke
name_to_handle_at() (through get_file_handle()) and statx() (through
do_statx()).  We need to try to get a file handle as the primary key, so
we cannot get rid of get_file_handle(), but we only need the statx
information if looking up an inode by handle failed; so we can defer
that until the lookup has indeed failed.

To this end, replace lo_find()'s st/mnt_id parameters by a get_ids()
closure that is invoked to fill the lo_key struct if necessary.

Also, lo_find() is renamed to lo_do_find(), so we can add a new
lo_find() wrapper whose closure just initializes the lo_key from the
st/mnt_id parameters, just like the old lo_find() did.

lookup_name() directly calls lo_do_find() now and passes its own
closure, which performs the do_statx() call.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 93 ++--
 1 file changed, 76 insertions(+), 17 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index d01f9d3a59..ef10a3ace6 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1141,22 +1141,23 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo,
-const struct lo_fhandle *fhandle,
-struct stat *st, uint64_t mnt_id)
+/*
+ * get_ids() will be called to get the key for lo->inodes_by_ids if
+ * the lookup by file handle has failed.
+ */
+static struct lo_inode *lo_do_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+int (*get_ids)(struct lo_key *, const void *),
+const void *get_ids_opaque)
 {
 struct lo_inode *p = NULL;
-struct lo_key ids_key = {
-.ino = st->st_ino,
-.dev = st->st_dev,
-.mnt_id = mnt_id,
-};
+struct lo_key ids_key;
 
 pthread_mutex_lock(&lo->mutex);
 if (fhandle) {
 p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
 }
-if (!p) {
+if (!p && get_ids(&ids_key, get_ids_opaque) == 0) {
 p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
 /*
  * When we had to fall back to looking up an inode by its IDs,
@@ -1184,6 +1185,36 @@ static struct lo_inode *lo_find(struct lo_data *lo,
 return p;
 }
 
+struct lo_find_get_ids_key_opaque {
+const struct stat *st;
+uint64_t mnt_id;
+};
+
+static int lo_find_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lo_find_get_ids_key_opaque *stat_info = opaque;
+
+*ids_key = (struct lo_key){
+.ino = stat_info->st->st_ino,
+.dev = stat_info->st->st_dev,
+.mnt_id = stat_info->mnt_id,
+};
+
+return 0;
+}
+
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
+{
+const struct lo_find_get_ids_key_opaque stat_info = {
+.st = st,
+.mnt_id = mnt_id,
+};
+
+return lo_do_find(lo, fhandle, lo_find_get_ids_key, &stat_info);
+}
+
 /* value_destroy_func for posix_locks GHashTable */
 static void posix_locks_value_destroy(gpointer data)
 {
@@ -1655,14 +1686,41 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
+struct lookup_name_get_ids_key_opaque {
+struct lo_data *lo;
+int parent_fd;
+const char *name;
+};
+
+static int lookup_name_get_ids_key(struct lo_key *ids_key, const void *opaque)
+{
+const struct lookup_name_get_ids_key_opaque *stat_params = opaque;
+uint64_t mnt_id;
+struct stat attr;
+int res;
+
+res = do_statx(stat_params->lo, stat_params->parent_fd, stat_params->name,
+   &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
+if (res < 0) {
+return -errno;
+}
+
+*ids_key = (struct lo_key){
+.ino = attr.st_ino,
+.dev = attr.st_dev,
+.mnt_id = mnt_id,
+};
+
+return 0;
+}
+
 /* Increments nlookup and caller must release refcount using lo_inode_put() */
 static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
 const char *name)
 {
 g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
-uint64_t mnt_id;
-struct stat attr;
+struct lookup_name_get_ids_key_opaque stat_params;
 struct lo_fhandle *fh;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *dir = lo_inode(req, parent);
@@ -1680,13 +1738,14 

[PATCH 1/9] virtiofsd: Add TempFd structure

2021-06-04 Thread Max Reitz
We are planning to add file handles to lo_inode objects as an
alternative to lo_inode.fd.  That means that everywhere where we
currently reference lo_inode.fd, we will have to open a temporary file
descriptor that needs to be closed after use.

So instead of directly accessing lo_inode.fd, there will be a helper
function (lo_inode_fd()) that either returns lo_inode.fd, or opens a new
file descriptor with open_by_handle_at().  It encapsulates this result
in a TempFd structure to let the caller know whether the FD needs to be
closed after use (opened from the handle) or not (copied from
lo_inode.fd).

By using g_auto(TempFd) to store this result, callers will not even have
to care about closing a temporary FD after use.  It will be done
automatically once the object goes out of scope.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 49 
 1 file changed, 49 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 49c21fd855..a4674aba80 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -174,6 +174,28 @@ struct lo_data {
 int user_killpriv_v2, killpriv_v2;
 };
 
+/**
+ * Represents a file descriptor that may either be owned by this
+ * TempFd, or only referenced (i.e. the ownership belongs to some
+ * other object, and the value has just been copied into this TempFd).
+ *
+ * The purpose of this encapsulation is to be used as g_auto(TempFd)
+ * to automatically clean up owned file descriptors when this object
+ * goes out of scope.
+ *
+ * Use temp_fd_steal() to get an owned file descriptor that will not
+ * be closed when the TempFd goes out of scope.
+ */
+typedef struct {
+int fd;
+bool owned; /* fd owned by this object? */
+} TempFd;
+
+#define TEMP_FD_INIT ((TempFd) { .fd = -1, .owned = false })
+
+static void temp_fd_clear(TempFd *temp_fd);
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(TempFd, temp_fd_clear);
+
 static const struct fuse_opt lo_opts[] = {
 { "sandbox=namespace",
   offsetof(struct lo_data, sandbox),
@@ -249,6 +271,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/**
+ * Clean-up function for TempFds
+ */
+static void temp_fd_clear(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+close(temp_fd->fd);
+*temp_fd = TEMP_FD_INIT;
+}
+}
+
+/**
+ * Return an owned fd from *temp_fd that will not be closed when
+ * *temp_fd goes out of scope.
+ *
+ * (TODO: Remove __attribute__ once this is used.)
+ */
+static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+{
+if (temp_fd->owned) {
+temp_fd->owned = false;
+return temp_fd->fd;
+} else {
+return dup(temp_fd->fd);
+}
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
-- 
2.31.1




[PATCH 0/9] virtiofsd: Allow using file handles instead of O_PATH FDs

2021-06-04 Thread Max Reitz
Hi,

This is the C virtiofsd equivalent to
https://gitlab.com/virtio-fs/virtiofsd-rs/-/merge_requests/26.  As such,
the summary is pretty much the same:

Storing an O_PATH file descriptor in every lo_inode object means we have
a lot of FDs open, which is sometimes bad.  This series adds an option
(-o inode_file_handles) that will make virtiofsd attempt to generate a
file handle for new inodes and store that instead of an FD.  When an FD
is needed for a given inode, we open the handle.

To accomplish this, lo_inode.fd is should not be accessed directly
anymore, but only through helper functions (mainly lo_inode_fd() and
lo_inode_open()).  A TempFd object is added to hide the difference
between FDs that are bound to the lo_inode object (and so need not be
closed after use) and temporary FDs from open_by_handle_at() (which do
need to be closed after use).

To prevent the problem I spent a long time talking about (if we don’t
have an FD open for every inode, the inode can be deleted, its ID
reused, and then the lookup in lo_data.inodes will find the old deleted
inode), patch 7 adds a second hash table lo_data.inodes_by_handle that
maps file handles to lo_inode objects.  (Because file handles include a
generation ID, so we can discern between the old and the new inode.)

Patch 9 is completely optional, but I just really felt compelled to
write it.


Max Reitz (9):
  virtiofsd: Add TempFd structure
  virtiofsd: Use lo_inode_open() instead of openat()
  virtiofsd: Add lo_inode_fd() helper
  virtiofsd: Let lo_fd() return a TempFd
  virtiofsd: Let lo_inode_open() return a TempFd
  virtiofsd: Add lo_inode.fhandle
  virtiofsd: Add inodes_by_handle hash table
  virtiofsd: Optionally fill lo_inode.fhandle
  virtiofsd: Add lazy lo_do_find()

 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 809 +-
 tools/virtiofsd/passthrough_seccomp.c |   2 +
 3 files changed, 667 insertions(+), 147 deletions(-)

-- 
2.31.1




Re: [PATCH 1/3] virtiofsd: Find original inode ID of mount points

2021-06-04 Thread Max Reitz

On 02.06.21 20:59, Miklos Szeredi wrote:

On Wed, 2 Jun 2021 at 20:20, Vivek Goyal  wrote:

On Wed, May 12, 2021 at 02:55:42PM +0200, Max Reitz wrote:

Mount point directories represent two inodes: On one hand, they are a
normal directory on their parent filesystem.  On the other, they are the
root node of the filesystem mounted there.  Thus, they have two inode
IDs.

Right now, we only report the latter inode ID (i.e. the inode ID of the
mounted filesystem's root node).  This is fine once the guest has
auto-mounted a submount there (so this inode ID goes with a device ID
that is distinct from the parent filesystem), but before the auto-mount,
they have the device ID of the parent and the inode ID for the submount.
This is problematic because this is likely exactly the same
st_dev/st_ino combination as the parent filesystem's root node.  This
leads to problems for example with `find`, which will thus complain
about a filesystem loop if it has visited the parent filesystem's root
node before, and then refuse to descend into the submount.

There is a way to find the mount directory's original inode ID, and that
is to readdir(3) the parent directory, look for the mount directory, and
read the dirent.d_ino field.  Using this, we can let lookup and
readdirplus return that original inode ID, which the guest will thus
show until the submount is auto-mounted.  (Then, it will invoke getattr
and that stat(2) call will return the inode ID for the submount.)

[ CC miklos ]

Hi Max,

Though we discussed this in chat room, I am still responding to this
email with the concern I have, so that there is a record of it.


That sounds scary :)


So with this patch for FUSE_LOOKUP  we always return submount's parentinode
id and with GETATTR request we return actual inode id of submount. That
kind of bothers me a bit as we are assuming that there is always going
to be a GETATTR request after FUSE_LOOKUP. FUSE_LOOKUP itself has attrs
returned with it and it might happen that after FUSE_LOOKUP, FUSE_GETATTR
might not be called at all because FUSE_LOOKUP itself got the latest
updated attrs with certain timeout.

For example, if I call stat on a normal file (not submount), I see that
after FUSE_LOOKUP, no FUSE_GETATTR request is going and
fuse_update_get_attr() is using attrs from locally cached inode attrs.

But same thing does not seem to be happening in case of submount. Once
submount is created in guest, I see that we never seem to be calling
->revalidate() on newly created dentry of submount root. I am not sure
why. And that means we don't do FUSE_LOOKUP and that means FUSE_GETATTR
always gets called.


Even if it worked reliably, I have to admit it’s kind of relying on at 
most implementation-defined behavior.  Having two inodes would 
definitely be the right solution.



Yes, this sounds normal.

The lookup sequence for "/mnt/virtiofs/dir1/submount/file" will be:

LOOKUP(1, "dir")
 create inode I1 in sb1
 create dentry "dir" with inode I1 in sb1
LOOKUP(I1, "submount")
 create inode I2 in sb1, set S_AUTOMOUNT
 create dentry "submount" with inode I2 in sb1
 automount triggered:
 create sb2
 create inode I2 in sb2
 create dentry "/" with inode I2 in sb2
GETATTR(I2)
  fill inode I2
LOOKUP(I2, "file")
  create inode I3
  create dentry "file" with inode I3 in sb2

Notice how there's two inodes numbered I2, but in separate sb's.
However the server has only the notion of a single I2 inode which is
the root of the submount, since the mountpoint is not visible (except
for d_ino in readdir()).

Now AFAICS what this patch does is set the inode number in the
attributes returned by LOOKUP(I1, "submount") to the covered inode, so
that AT_NO_AUTOMOUNT stat will return the correct value.   While this
seems to work, it's not a fundamental fix to the problem, since the
attributes on sb1/I2 will time out and the next stat(...,
AT_NO_AUTOMOUNT) will trigger a GETATTR request, which will return the
inode number of the submount root.


Ah, yeah.


Solving this properly would mean that the server would have to
generate separate nodeid's for the mountpoint and the submount root
and the protocol would have to be extended so that this information
could be transferred to the client.


Yes, we’d somehow need to do a separate lookup for the root inode of the 
submount...



Not sure whether this is worth it or not.


I’m wondering the same.  This series was mostly the result of an 
incidental finding and seeing that getting it to work and do what I want 
seemed to be pretty easy.  Turns out doing something right can never 
really be easy...


But we have already decided that we don’t deem the inode ID problem too 
important (not least considering NFS has the same issue), so fixing it 
is not really top priority.  Maybe I’ll get back to it, but I think for 
now I consider it on the backlog.


Max




[PATCH 8/9] virtiofsd: Optionally fill lo_inode.fhandle

2021-06-04 Thread Max Reitz
When the inode_file_handles option is set, try to generate a file handle
for new inodes instead of opening an O_PATH FD.

Being able to open these again will require CAP_DAC_READ_SEARCH, so the
description text tells the user they will also need to specify
-o modcaps=+dac_read_search.

Generating a file handle returns the mount ID it is valid for.  Opening
it will require an FD instead.  We have mount_fds to map an ID to an FD.
get_file_handle() fills the hash map by opening the file we have
generated a handle for.  To verify that the resulting FD indeed
represents the handle's mount ID, we use statx().  Therefore, using file
handles requires statx() support.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/helper.c  |   3 +
 tools/virtiofsd/passthrough_ll.c  | 170 --
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 3 files changed, 165 insertions(+), 9 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 5e98ed702b..954f8639e6 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -186,6 +186,9 @@ void fuse_cmdline_help(void)
"   to virtiofsd from guest 
applications.\n"
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the 
guest\n"
+   "-o inode_file_handles  Use file handles to reference 
inodes\n"
+   "   instead of O_PATH file 
descriptors\n"
+   "   (requires -o 
modcaps=+dac_read_search)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 793d2c333e..d01f9d3a59 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -190,6 +190,7 @@ struct lo_data {
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
 int user_killpriv_v2, killpriv_v2;
+int inode_file_handles;
 };
 
 /**
@@ -244,6 +245,10 @@ static const struct fuse_opt lo_opts[] = {
 { "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 },
 { "killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 1 },
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
+{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
+{ "no_inode_file_handles",
+  offsetof(struct lo_data, inode_file_handles),
+  0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -315,6 +320,108 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Generate a file handle for the given dirfd/name combination.
+ *
+ * If mount_fds does not yet contain an entry for the handle's mount
+ * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
+ * as the FD for that mount ID.  (That is the file that we have
+ * generated a handle for, so it should be representative for the
+ * mount ID.  However, to be sure (and to rule out races), we use
+ * statx() to verify that our assumption is correct.)
+ */
+static struct lo_fhandle *get_file_handle(struct lo_data *lo,
+  int dirfd, const char *name)
+{
+/* We need statx() to verify the mount ID */
+#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
+struct lo_fhandle *fh;
+int ret;
+
+if (!lo->use_statx || !lo->inode_file_handles) {
+return NULL;
+}
+
+fh = g_new0(struct lo_fhandle, 1);
+
+fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
+ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
+AT_EMPTY_PATH);
+if (ret < 0) {
+goto fail;
+}
+
+if (pthread_rwlock_rdlock(&mount_fds_lock)) {
+goto fail;
+}
+if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
+struct statx stx;
+int fd;
+
+pthread_rwlock_unlock(&mount_fds_lock);
+
+if (name[0]) {
+fd = openat(dirfd, name, O_RDONLY);
+} else {
+char procname[64];
+snprintf(procname, sizeof(procname), "%i", dirfd);
+fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+}
+if (fd < 0) {
+goto fail;
+}
+
+ret = statx(fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
+STATX_MNT_ID, &stx);
+if (ret < 0) {
+if (errno == ENOSYS) {
+lo->use_statx = false;
+fuse_log(FUSE_LOG_WARNING,
+ "statx() does not work: Will not be able to use file "
+ "handles for inodes\n");
+}
+goto fail;
+ 

[PATCH 4/9] virtiofsd: Let lo_fd() return a TempFd

2021-06-04 Thread Max Reitz
Accessing lo_inode.fd must generally happen through lo_inode_fd(), and
lo_fd() is no exception; and then it must pass on the TempFd it has
received from lo_inode_fd().

(Note that all lo_fd() calls now use proper error handling, where all of
them were in-line before; i.e. they were used in place of the fd
argument of some function call.  This only worked because the only error
that could occur was that lo_inode() failed to find the inode ID: Then
-1 would be passed as the fd, which would result in an EBADF error,
which is precisely what we would want to return to the guest for an
invalid inode ID.
Now, though, lo_inode_fd() might potentially invoke open_by_handle_at(),
which can return many different errors, and they should be properly
handled and returned to the guest.  So we can no longer allow lo_fd() to
be used in-line, and instead need to do proper error handling for it.)

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 55 +---
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 46c9dfe200..8f64bcd6c5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -644,18 +644,19 @@ static int lo_inode_fd(const struct lo_inode *inode, 
TempFd *tfd)
  * they are done with the fd.  This will be done in a later patch to make
  * review easier.
  */
-static int lo_fd(fuse_req_t req, fuse_ino_t ino)
+static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd *tfd)
 {
 struct lo_inode *inode = lo_inode(req, ino);
-int fd;
+int res;
 
 if (!inode) {
-return -1;
+return -EBADF;
 }
 
-fd = inode->fd;
+res = lo_inode_fd(inode, tfd);
+
 lo_inode_put(lo_data(req), &inode);
-return fd;
+return res;
 }
 
 /*
@@ -766,14 +767,19 @@ static void lo_init(void *userdata, struct fuse_conn_info 
*conn)
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 int res;
 struct stat buf;
 struct lo_data *lo = lo_data(req);
 
 (void)fi;
 
-res =
-fstatat(lo_fd(req, ino), "", &buf, AT_EMPTY_PATH | 
AT_SYMLINK_NOFOLLOW);
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = fstatat(ino_fd.fd, "", &buf, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -1441,6 +1447,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, 
fuse_ino_t parent,
 
 static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1455,13 +1462,19 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, 
const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR);
+res = unlinkat(parent_fd.fd, name, AT_REMOVEDIR);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1547,6 +1560,7 @@ out:
 
 static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
 {
+g_auto(TempFd) parent_fd = TEMP_FD_INIT;
 int res;
 struct lo_inode *inode;
 struct lo_data *lo = lo_data(req);
@@ -1561,13 +1575,19 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t 
parent, const char *name)
 return;
 }
 
+res = lo_fd(req, parent, &parent_fd);
+if (res < 0) {
+fuse_reply_err(req, -res);
+return;
+}
+
 inode = lookup_name(req, parent, name);
 if (!inode) {
 fuse_reply_err(req, EIO);
 return;
 }
 
-res = unlinkat(lo_fd(req, parent), name, 0);
+res = unlinkat(parent_fd.fd, name, 0);
 
 fuse_reply_err(req, res == -1 ? errno : 0);
 unref_inode_lolocked(lo, inode, 1);
@@ -1647,10 +1667,16 @@ static void lo_forget_multi(fuse_req_t req, size_t 
count,
 
 static void lo_readlink(fuse_req_t req, fuse_ino_t ino)
 {
+g_auto(TempFd) ino_fd = TEMP_FD_INIT;
 char buf[PATH_MAX + 1];
 int res;
 
-res = readlinkat(lo_fd(req, ino), "", buf, sizeof(buf));
+res = lo_fd(req, ino, &ino_fd);
+if (res < 0) {
+return (void)fuse_reply_err(req, -res);
+}
+
+res = readlinkat(ino_fd.fd, "", buf, sizeof(buf));
 if (res == -1) {
 return (void)fuse_reply_err(req, errno);
 }
@@ -2447,10 +2473,17 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
 
 static void lo_statfs(fuse_req_t req, fuse_ino_t ino)
 {

[PATCH 7/9] virtiofsd: Add inodes_by_handle hash table

2021-06-04 Thread Max Reitz
Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
its inode ID will remain in use until we drop our lo_inode (and
lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
the inode ID as an lo_inode key, because any inode with an inode ID we
find in lo_data.inodes (on the same filesystem) must be the exact same
file.

This will change when we start setting lo_inode.fhandle so we do not
have to keep an O_PATH FD open.  Then, unlinking such an inode will
immediately remove it, so its ID can then be reused by newly created
files, even while the lo_inode object is still there[1].

So creating a new file can then reuse the old file's inode ID, and
looking up the new file would lead to us finding the old file's
lo_inode, which is not ideal.

Luckily, just as file handles cause this problem, they also solve it:  A
file handle contains a generation ID, which changes when an inode ID is
reused, so the new file can be distinguished from the old one.  So all
we need to do is to add a second map besides lo_data.inodes that maps
file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.

Unfortunately, we cannot rely on being able to generate file handles
every time.  Therefore, we still enter every lo_inode object into
inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
potential inodes_by_handle entry then has precedence, the inodes_by_ids
entry is just a fallback.

Note that we do not generate lo_fhandle objects yet, and so we also do
not enter anything into the inodes_by_handle map yet.  Also, all lookups
skip that map.  We might manually create file handles with some code
that is immediately removed by the next patch again, but that would
break the assumption in lo_find() that every lo_inode with a non-NULL
.fhandle must have an entry in inodes_by_handle and vice versa.  So we
leave actually using the inodes_by_handle map for the next patch.

[1] If some application in the guest still has the file open, there is
going to be a corresponding FD mapping in lo_data.fd_map.  In such a
case, the inode will only go away once every application in the guest
has closed it.  The problem described only applies to cases where the
guest does not have the file open, and it is just in the dentry cache,
basically.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 80 +---
 1 file changed, 64 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index e665575401..793d2c333e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -179,7 +179,8 @@ struct lo_data {
 int announce_submounts;
 bool use_statx;
 struct lo_inode root;
-GHashTable *inodes; /* protected by lo->mutex */
+GHashTable *inodes_by_ids; /* protected by lo->mutex */
+GHashTable *inodes_by_handle; /* protected by lo->mutex */
 struct lo_map ino_map; /* protected by lo->mutex */
 struct lo_map dirp_map; /* protected by lo->mutex */
 struct lo_map fd_map; /* protected by lo->mutex */
@@ -257,8 +258,9 @@ static struct {
 /* That we loaded cap-ng in the current thread from the saved */
 static __thread bool cap_loaded = 0;
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id);
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id);
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
@@ -1032,18 +1034,39 @@ out_err:
 fuse_reply_err(req, saverr);
 }
 
-static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
-uint64_t mnt_id)
+static struct lo_inode *lo_find(struct lo_data *lo,
+const struct lo_fhandle *fhandle,
+struct stat *st, uint64_t mnt_id)
 {
-struct lo_inode *p;
-struct lo_key key = {
+struct lo_inode *p = NULL;
+struct lo_key ids_key = {
 .ino = st->st_ino,
 .dev = st->st_dev,
 .mnt_id = mnt_id,
 };
 
 pthread_mutex_lock(&lo->mutex);
-p = g_hash_table_lookup(lo->inodes, &key);
+if (fhandle) {
+p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
+}
+if (!p) {
+p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
+/*
+ * When we had to fall back to looking up an inode by its IDs,
+ * ensure that we hit an entry that does not have a file
+ * handle.  Entries with file handles must also have a handle
+ * alt key, so if we have not found it by th

[PATCH 3/9] virtiofsd: Add lo_inode_fd() helper

2021-06-04 Thread Max Reitz
Once we let lo_inode.fd be optional, we will need its users to open the
file handle stored in lo_inode instead.  This function will do that.

For now, it just returns lo_inode.fd, though.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 138 ++-
 1 file changed, 117 insertions(+), 21 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 436f771d2a..46c9dfe200 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -629,6 +629,16 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 return elem->inode;
 }
 
+static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
+{
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+
+return 0;
+}
+
 /*
  * TODO Remove this helper and force callers to hold an inode refcount until
  * they are done with the fd.  This will be done in a later patch to make
@@ -790,11 +800,11 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info 
*fi)
 static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int saverr;
 char procname[64];
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
-int ifd;
 int res;
 int fd = -1;
 
@@ -804,7 +814,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-ifd = inode->fd;
+res = lo_inode_fd(inode, &inode_fd);
+if (res < 0) {
+saverr = -res;
+goto out_err;
+}
 
 /* If fi->fh is invalid we'll report EBADF later */
 if (fi) {
@@ -815,7 +829,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = fchmod(fd, attr->st_mode);
 } else {
-sprintf(procname, "%i", ifd);
+sprintf(procname, "%i", inode_fd.fd);
 res = fchmodat(lo->proc_self_fd, procname, attr->st_mode, 0);
 }
 if (res == -1) {
@@ -827,12 +841,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 uid_t uid = (valid & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)-1;
 gid_t gid = (valid & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)-1;
 
-saverr = drop_security_capability(lo, ifd);
+saverr = drop_security_capability(lo, inode_fd.fd);
 if (saverr) {
 goto out_err;
 }
 
-res = fchownat(ifd, "", uid, gid, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
+res = fchownat(inode_fd.fd, "", uid, gid,
+   AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
 if (res == -1) {
 saverr = errno;
 goto out_err;
@@ -911,7 +926,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 res = futimens(fd, tv);
 } else {
-sprintf(procname, "%i", inode->fd);
+sprintf(procname, "%i", inode_fd.fd);
 res = utimensat(lo->proc_self_fd, procname, tv, 0);
 }
 if (res == -1) {
@@ -1026,7 +1041,8 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 struct fuse_entry_param *e,
 struct lo_inode **inodep)
 {
-int newfd;
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
+int newfd = -1;
 int res;
 int saverr;
 uint64_t mnt_id;
@@ -1056,7 +1072,13 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 name = ".";
 }
 
-newfd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
+newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
 if (newfd == -1) {
 goto out_err;
 }
@@ -1123,6 +1145,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
parent, const char *name,
 
 out_err:
 saverr = errno;
+out:
 if (newfd != -1) {
 close(newfd);
 }
@@ -1228,6 +1251,7 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
  const char *name, mode_t mode, dev_t rdev,
  const char *link)
 {
+g_auto(TempFd) dir_fd = TEMP_FD_INIT;
 int res;
 int saverr;
 struct lo_data *lo = lo_data(req);
@@ -1251,12 +1275,18 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t 
parent,
 return;
 }
 
+res = lo_inode_fd(dir, &dir_fd);
+if (res < 0) {
+saverr = -res;
+goto out;
+}
+
 saverr = lo_change_cred(req, &old);
 if (saverr) {
 goto out;
 }
 
-res = mknod_wrapper(dir->fd, name, link, mode, rdev)

[PATCH 2/9] virtiofsd: Use lo_inode_open() instead of openat()

2021-06-04 Thread Max Reitz
The xattr functions want a non-O_PATH FD, so they reopen the lo_inode.fd
with the flags they need through /proc/self/fd.

Similarly, lo_opendir() needs an O_RDONLY FD.  Instead of the
/proc/self/fd trick, it just uses openat(fd, "."), because the FD is
guaranteed to be a directory, so this works.

All cases have one problem in common, though: In the future, when we may
have a file handle in the lo_inode instead of an FD, querying an
lo_inode FD may incur an open_by_handle_at() call.  It does not make
sense to then reopen that FD with custom flags, those should have been
passed to open_by_handle_at() instead.

Use lo_inode_open() instead of openat().  As part of the file handle
change, lo_inode_open() will be made to invoke openat() only if
lo_inode.fd is valid.  Otherwise, it will invoke open_by_handle_at()
with the right flags from the start.

Consequently, after this patch, lo_inode_open() is the only place to
invoke openat() to reopen an existing FD with different flags.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 43 
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index a4674aba80..436f771d2a 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1645,18 +1645,26 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 {
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
-struct lo_dirp *d;
+struct lo_inode *inode;
+struct lo_dirp *d = NULL;
 int fd;
 ssize_t fh;
 
+inode = lo_inode(req, ino);
+if (!inode) {
+error = EBADF;
+goto out_err;
+}
+
 d = calloc(1, sizeof(struct lo_dirp));
 if (d == NULL) {
 goto out_err;
 }
 
-fd = openat(lo_fd(req, ino), ".", O_RDONLY);
-if (fd == -1) {
-goto out_errno;
+fd = lo_inode_open(lo, inode, O_RDONLY);
+if (fd < 0) {
+error = -fd;
+goto out_err;
 }
 
 d->dp = fdopendir(fd);
@@ -1685,6 +1693,7 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 out_errno:
 error = errno;
 out_err:
+lo_inode_put(lo, &inode);
 if (d) {
 if (d->dp) {
 closedir(d->dp);
@@ -2827,7 +2836,6 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 /*
  * It is not safe to open() non-regular/non-dir files in file server
  * unless O_PATH is used, so use that method for regular files/dir
@@ -2835,12 +2843,14 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
  * Otherwise, call fchdir() to avoid open().
  */
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = fgetxattr(fd, name, value, size);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = getxattr(procname, name, value, size);
@@ -2906,14 +2916,15 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t 
ino, size_t size)
 }
 }
 
-sprintf(procname, "%i", inode->fd);
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-goto out_err;
+saverr = -fd;
+goto out;
 }
 ret = flistxattr(fd, value, size);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = listxattr(procname, value, size);
@@ -3039,15 +3050,15 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
const char *in_name,
 fuse_log(FUSE_LOG_DEBUG, "lo_setxattr(ino=%" PRIu64
  ", name=%s value=%s size=%zd)\n", ino, name, value, size);
 
-sprintf(procname, "%i", inode->fd);
 if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
-fd = openat(lo->proc_self_fd, procname, O_RDONLY);
+fd = lo_inode_open(lo, inode, O_RDONLY);
 if (fd < 0) {
-saverr = errno;
+saverr = -fd;
 goto out;
 }
 ret = fsetxattr(fd, name, value, size, flags);
 } else {
+sprintf(procname, "%i", inode->fd);
 /* fchdir should not fail here */
 FCHDIR_NOFAIL(lo->proc_self_fd);
 ret = setxattr(procname, name, value, size, flags);
@@ -3105,15 +3116,15 @@ s

[PATCH 6/9] virtiofsd: Add lo_inode.fhandle

2021-06-04 Thread Max Reitz
This new field is an alternative to lo_inode.fd: Either of the two must
be set.  In case an O_PATH FD is needed for some lo_inode, it is either
taken from lo_inode.fd, if valid, or a temporary FD is opened with
open_by_handle_at().

Using a file handle instead of an FD has the advantage of keeping the
number of open file descriptors low.

Because open_by_handle_at() requires a mount FD (i.e. a non-O_PATH FD
opened on the filesystem to which the file handle refers), but every
lo_fhandle only has a mount ID (as returned by name_to_handle_at()), we
keep a hash map of such FDs in mount_fds (mapping ID to FD).
get_file_handle(), which is added by a later patch, will ensure that
every mount ID for which we have generated a handle has a corresponding
entry in mount_fds.

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c  | 116 ++
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 2 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 3014e8baf8..e665575401 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -88,8 +88,25 @@ struct lo_key {
 uint64_t mnt_id;
 };
 
+struct lo_fhandle {
+union {
+struct file_handle handle;
+char padding[sizeof(struct file_handle) + MAX_HANDLE_SZ];
+};
+int mount_id;
+};
+
+/* Maps mount IDs to an FD that we can pass to open_by_handle_at() */
+static GHashTable *mount_fds;
+pthread_rwlock_t mount_fds_lock = PTHREAD_RWLOCK_INITIALIZER;
+
 struct lo_inode {
+/*
+ * Either of fd or fhandle must be set (i.e. >= 0 or non-NULL,
+ * respectively).
+ */
 int fd;
+struct lo_fhandle *fhandle;
 
 /*
  * Atomic reference count for this object.  The nlookup field holds a
@@ -296,6 +313,44 @@ static int temp_fd_steal(TempFd *temp_fd)
 }
 }
 
+/**
+ * Open the given file handle with the given flags.
+ *
+ * The mount FD to pass to open_by_handle_at() is taken from the
+ * mount_fds hash map.
+ *
+ * On error, return -errno.
+ */
+static int open_file_handle(const struct lo_fhandle *fh, int flags)
+{
+gpointer mount_fd_ptr;
+int mount_fd;
+bool found;
+int ret;
+
+ret = pthread_rwlock_rdlock(&mount_fds_lock);
+if (ret) {
+return -ret;
+}
+
+/* mount_fd == 0 is valid, so we need lookup_extended */
+found = g_hash_table_lookup_extended(mount_fds,
+ GINT_TO_POINTER(fh->mount_id),
+ NULL, &mount_fd_ptr);
+pthread_rwlock_unlock(&mount_fds_lock);
+if (!found) {
+return -EINVAL;
+}
+mount_fd = GPOINTER_TO_INT(mount_fd_ptr);
+
+ret = open_by_handle_at(mount_fd, (struct file_handle *)&fh->handle, 
flags);
+if (ret < 0) {
+return -errno;
+}
+
+return ret;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -602,7 +657,11 @@ static void lo_inode_put(struct lo_data *lo, struct 
lo_inode **inodep)
 *inodep = NULL;
 
 if (g_atomic_int_dec_and_test(&inode->refcount)) {
-close(inode->fd);
+if (inode->fd >= 0) {
+close(inode->fd);
+} else {
+g_free(inode->fhandle);
+}
 free(inode);
 }
 }
@@ -629,10 +688,25 @@ static struct lo_inode *lo_inode(fuse_req_t req, 
fuse_ino_t ino)
 
 static int lo_inode_fd(const struct lo_inode *inode, TempFd *tfd)
 {
-*tfd = (TempFd) {
-.fd = inode->fd,
-.owned = false,
-};
+if (inode->fd >= 0) {
+*tfd = (TempFd) {
+.fd = inode->fd,
+.owned = false,
+};
+} else {
+int fd;
+
+assert(inode->fhandle != NULL);
+fd = open_file_handle(inode->fhandle, O_PATH);
+if (fd < 0) {
+return -errno;
+}
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+}
 
 return 0;
 }
@@ -672,22 +746,32 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
 static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
  int open_flags, TempFd *tfd)
 {
-g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
+g_autofree char *fd_str = NULL;
 int fd;
 
 if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) {
 return -EBADF;
 }
 
-/*
- * The file is a symlink so O_NOFOLLOW must be ignored. We checked earlier
- * that the inode is not a special file but if an external process races
- * with us then symlinks are traversed here. It is not possible to escape
- * the shared directory since it is mounted as "/" though.
- */
-fd = openat(lo->proc_self_fd, fd_str, open_flags & ~O_NOFO

Re: [PATCH] block/snapshot: Clarify goto fallback behavior

2021-06-04 Thread Max Reitz

On 03.06.21 18:02, Peter Maydell wrote:

On Mon, 3 May 2021 at 10:55, Max Reitz  wrote:

In the bdrv_snapshot_goto() fallback code, we work with a pointer to
either bs->file or bs->backing.  We close that child, close the node
(with .bdrv_close()), apply the snapshot on the child node, and then
re-open the node (with .bdrv_open()).

In order for .bdrv_open() to attach the same child node that we had
before, we pass "file={child-node}" or "backing={child-node}" to it.
Therefore, when .bdrv_open() has returned success, we can assume that
bs->file or bs->backing (respectively) points to our original child
again.  This is verified by an assertion.

All of this is not immediately clear from a quick glance at the code,
so add a comment to the assertion what it is for, and why it is valid.
It certainly confused Coverity.

Reported-by: Coverity (CID 1452774)
Signed-off-by: Max Reitz 

Did this patch get lost? I was just going through outstanding
coverity issues and noticed it was posted a month ago and not
in master...


Oh, right, sorry.  Thanks for the reminder.

Now applied to my block branch:

https://github.com/XanClic/qemu/commits/block




[PATCH 5/9] virtiofsd: Let lo_inode_open() return a TempFd

2021-06-04 Thread Max Reitz
Strictly speaking, this is not necessary, because lo_inode_open() will
always return a new FD owned by the caller, so TempFd.owned will always
be true.

However, auto-cleanup is nice, and in some cases this plays nicely with
an lo_inode_fd() call in another conditional branch (see lo_setattr()).

Signed-off-by: Max Reitz 
---
 tools/virtiofsd/passthrough_ll.c | 137 +--
 1 file changed, 59 insertions(+), 78 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 8f64bcd6c5..3014e8baf8 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -285,10 +285,8 @@ static void temp_fd_clear(TempFd *temp_fd)
 /**
  * Return an owned fd from *temp_fd that will not be closed when
  * *temp_fd goes out of scope.
- *
- * (TODO: Remove __attribute__ once this is used.)
  */
-static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
+static int temp_fd_steal(TempFd *temp_fd)
 {
 if (temp_fd->owned) {
 temp_fd->owned = false;
@@ -667,9 +665,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
*tfd)
  * when a malicious client opens special files such as block device nodes.
  * Symlink inodes are also rejected since symlinks must already have been
  * traversed on the client side.
+ *
+ * The fd is returned in tfd->fd.  The return value is 0 on success and -errno
+ * otherwise.
  */
-static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
- int open_flags)
+static int lo_inode_open(const struct lo_data *lo, const struct lo_inode 
*inode,
+ int open_flags, TempFd *tfd)
 {
 g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
 int fd;
@@ -688,7 +689,13 @@ static int lo_inode_open(struct lo_data *lo, struct 
lo_inode *inode,
 if (fd < 0) {
 return -errno;
 }
-return fd;
+
+*tfd = (TempFd) {
+.fd = fd,
+.owned = true,
+};
+
+return 0;
 }
 
 static void lo_init(void *userdata, struct fuse_conn_info *conn)
@@ -820,7 +827,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 return;
 }
 
-res = lo_inode_fd(inode, &inode_fd);
+if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
+/* We need an O_RDWR FD for ftruncate() */
+res = lo_inode_open(lo, inode, O_RDWR, &inode_fd);
+} else {
+res = lo_inode_fd(inode, &inode_fd);
+}
 if (res < 0) {
 saverr = -res;
 goto out_err;
@@ -868,18 +880,11 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 if (fi) {
 truncfd = fd;
 } else {
-truncfd = lo_inode_open(lo, inode, O_RDWR);
-if (truncfd < 0) {
-saverr = -truncfd;
-goto out_err;
-}
+truncfd = inode_fd.fd;
 }
 
 saverr = drop_security_capability(lo, truncfd);
 if (saverr) {
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 
@@ -887,9 +892,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 res = drop_effective_cap("FSETID", &cap_fsetid_dropped);
 if (res != 0) {
 saverr = res;
-if (!fi) {
-close(truncfd);
-}
 goto out_err;
 }
 }
@@ -902,9 +904,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
struct stat *attr,
 fuse_log(FUSE_LOG_ERR, "Failed to gain CAP_FSETID\n");
 }
 }
-if (!fi) {
-close(truncfd);
-}
 if (res == -1) {
 goto out_err;
 }
@@ -1734,11 +1733,12 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct 
fuse_file_info *fi)
 static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
 {
+g_auto(TempFd) inode_fd = TEMP_FD_INIT;
 int error = ENOMEM;
 struct lo_data *lo = lo_data(req);
 struct lo_inode *inode;
 struct lo_dirp *d = NULL;
-int fd;
+int res;
 ssize_t fh;
 
 inode = lo_inode(req, ino);
@@ -1752,13 +1752,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
 goto out_err;
 }
 
-fd = lo_inode_open(lo, inode, O_RDONLY);
-if (fd < 0) {
-error = -fd;
+res = lo_inode_open(lo, inode, O_RDONLY, &inode_fd);
+if (res < 0) {
+error = -res;
 goto out_err;
 }
 
-d->dp = fdopendir(fd);
+d->dp = fdopendir(temp_fd_steal(&inode_fd));
 if (d->dp == NULL) {
 goto out_errno;
 }
@@ -1788,8 +1788,6 @@ out_err:
 if (d) {
 if (d->dp) {
 closedir(d->dp);
-} else if (fd != -1) {
-close(fd);
 }
 free

Re: [PATCH v2 33/33] iotests/image-fleecing: add test-case for copy-before-write filter

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

New fleecing method becomes available: copy-before-write filter.

Actually we don't need backup job to setup image fleecing. Add test
for new recommended way of image fleecing.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/image-fleecing | 50 +-
  tests/qemu-iotests/tests/image-fleecing.out | 72 +
  2 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/tests/qemu-iotests/tests/image-fleecing 
b/tests/qemu-iotests/tests/image-fleecing
index e210c00d28..404ebc00f1 100755
--- a/tests/qemu-iotests/tests/image-fleecing
+++ b/tests/qemu-iotests/tests/image-fleecing


[...]


@@ -90,12 +91,22 @@ def do_test(base_img_path, fleece_img_path, nbd_sock_path, 
vm):
  'backing': src_node,
  }))
  
-# Establish COW from source to fleecing node

-log(vm.qmp('blockdev-backup',
-   job_id='fleecing',
-   device=src_node,
-   target=tmp_node,
-   sync='none'))
+# Establish CBW from source to fleecing node
+if use_cbw:
+log(vm.qmp('blockdev-add', **{


I thought this should work without ** now.

With them dropped:

Reviewed-by: Max Reitz 


+'driver': 'copy-before-write',
+'node-name': 'fl-cbw',
+'file': src_node,
+'target': tmp_node
+}))
+
+log(vm.qmp('qom-set', path=qom_path, property='drive', value='fl-cbw'))
+else:
+log(vm.qmp('blockdev-backup',
+   job_id='fleecing',
+   device=src_node,
+   target=tmp_node,
+   sync='none'))
  
  log('')

  log('--- Setting up NBD Export ---')





Re: [PATCH v2 32/33] iotests/image-fleecing: prepare for adding new test-case

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

We are going to add a test-case with some behavior modifications. So,
let's prepare a function to be reused.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/image-fleecing | 19 +--
  1 file changed, 13 insertions(+), 6 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 31/33] iotests/image-fleecing: rename tgt_node

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

Actually target of backup(sync=None) is not a final backup target:
image fleecing is intended to be used with external tool, which will
copy data from fleecing node to some real backup target.

Also, we are going to add a test case for "push backup with fleecing",
where instead of exporting fleecing node by NBD, we'll start a backup
job from fleecing node to real backup target.

To avoid confusion, let's rename temporary fleecing node now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/image-fleecing | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 30/33] iotests/image-fleecing: proper source device

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

Define scsi device to operate with it by qom-set in further patch.

Give a new node-name to source block node, to not look like device
name.

Job now don't want to work without giving explicit id, so, let's call
it "fleecing".

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/image-fleecing | 12 
  tests/qemu-iotests/tests/image-fleecing.out |  2 +-
  2 files changed, 9 insertions(+), 5 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 29/33] iotests.py: hmp_qemu_io: support qdev

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/iotests.py | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 28/33] iotests: move 222 to tests/image-fleecing

2021-06-01 Thread Max Reitz

On 20.05.21 16:22, Vladimir Sementsov-Ogievskiy wrote:

Give a good name to test file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/{222 => tests/image-fleecing} | 0
  tests/qemu-iotests/{222.out => tests/image-fleecing.out} | 0
  2 files changed, 0 insertions(+), 0 deletions(-)
  rename tests/qemu-iotests/{222 => tests/image-fleecing} (100%)
  rename tests/qemu-iotests/{222.out => tests/image-fleecing.out} (100%)


Reviewed-by: Max Reitz 




Re: [PATCH v2 27/33] iotests/222: constantly use single quotes for strings

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

The file use both single and double quotes for strings. Let's be
consistent.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/222 | 68 +-
  1 file changed, 34 insertions(+), 34 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 26/33] iotests/222: fix pylint and mypy complains

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

Here:
  - long line
  - move to new interface of vm.qmp() (direct passing dict), to avoid
mypy false-positive, as it thinks that unpacked dict is a positional
argument.
  - extra parenthesis
  - handle event_wait possible None value

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/222 | 20 +++-
  tests/qemu-iotests/297 |  2 +-
  2 files changed, 12 insertions(+), 10 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 25/33] iotests.py: VM: add own __enter__ method

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

In superclass __enter__ method is defined with return value type hint
'QEMUMachine'. So, mypy thinks that return value of VM.__enter__ is
QEMUMachine. Let's redefine __enter__ in VM class, to give it correct
type hint.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/iotests.py | 4 
  1 file changed, 4 insertions(+)


Reviewed-by: Max Reitz 




Re: [PATCH v2 24/33] python/qemu/machine: QEMUMachine: improve qmp() method

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

We often call qmp() with unpacking dict, like qmp('foo', **{...}).
mypy don't really like it, it things that passed unpacked dict is a


s/things/thinks/


positional argument and complains that it type should be bool (because
second argument of qmp() is conv_keys: bool).


I guess it would have been nice to see some examples of this that can 
now be written in a cleaner way, but well.



Allow possing dict directly, simplifying interface, and giving a way to


s/possing/passing/


satisfy mypy.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  python/qemu/machine.py | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 23/33] python/qemu/machine.py: refactor _qemu_args()

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

  - use shorter construction
  - don't create new dict if not needed
  - drop extra unpacking key-val arguments
  - drop extra default values

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  python/qemu/machine.py | 18 --
  1 file changed, 8 insertions(+), 10 deletions(-)


Reviewed-by: Max Reitz 




Re: [PATCH v2 22/33] qapi: publish copy-before-write filter

2021-06-01 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json | 30 ++
  1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2ea294129e..8c4801a13d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2808,15 +2808,17 @@
  # @blklogwrites: Since 3.0
  # @blkreplay: Since 4.2
  # @compress: Since 5.0
+# @copy-before-write: Since 6.1
  #
  # Since: 2.9
  ##
  { 'enum': 'BlockdevDriver',
'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs',
-'cloop', 'compress', 'copy-on-read', 'dmg', 'file', 'ftp', 'ftps',
-'gluster', 'host_cdrom', 'host_device', 'http', 'https', 'iscsi',
-'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
-'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+'cloop', 'compress', 'copy-before-write', 'copy-on-read', 'dmg',
+'file', 'ftp', 'ftps', 'gluster', 'host_cdrom', 'host_device',
+'http', 'https', 'iscsi', 'luks', 'nbd', 'nfs', 'null-aio',
+'null-co', 'nvme', 'parallels', 'preallocate', 'qcow', 'qcow2',
+'qed', 'quorum', 'raw', 'rbd',
  { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
  'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
  
@@ -3937,6 +3939,25 @@

'base': 'BlockdevOptionsGenericFormat',
'data': { '*bottom': 'str' } }
  
+##

+# @BlockdevOptionsCbw:
+#
+# Driver specific block device options for the copy-before-write driver,
+# which does so called copy-before-write operations: when data is
+# written to the filter, the filter firstly reads corresponding blocks
+# from its file child and copies them to @target child. After successful
+# copying the write request is propagated to file child. If copying
+# filed, the original write request is failed too and no data is written


s/filed/failed/

With that fixed:

Reviewed-by: Max Reitz 


+# to file child.
+#
+# @target: The target for copy-before-write operations.
+#
+# Since: 6.1
+##
+{ 'struct': 'BlockdevOptionsCbw',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'target': 'BlockdevRef' } }
+
  ##
  # @BlockdevOptions:
  #
@@ -3989,6 +4010,7 @@
'bochs':  'BlockdevOptionsGenericFormat',
'cloop':  'BlockdevOptionsGenericFormat',
'compress':   'BlockdevOptionsGenericFormat',
+  'copy-before-write':'BlockdevOptionsCbw',
'copy-on-read':'BlockdevOptionsCor',
'dmg':'BlockdevOptionsGenericFormat',
'file':   'BlockdevOptionsFile',





Re: [PATCH v2 19/33] block/copy-before-write: initialize block-copy bitmap

2021-05-31 Thread Max Reitz

On 20.05.21 16:21, Vladimir Sementsov-Ogievskiy wrote:

We are going to publish copy-before-write filter to be used in separate
of backup. Future step would support bitmap for the filter. But let's
start from full set bitmap.

We have to modify backup, as bitmap is first initialized by
copy-before-write filter, and then backup modifies it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/backup.c| 16 +++-
  block/copy-before-write.c |  4 
  2 files changed, 11 insertions(+), 9 deletions(-)



Reviewed-by: Max Reitz 




<    1   2   3   4   5   6   7   8   9   10   >