Re: [PATCH 1/6] block: Drop BDS comment regarding bdrv_append()

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Max Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 2/6] block: block-status cache for data regions

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
  include/block/block_int.h | 19 ++
  block.c   |  2 +
  block/io.c| 80 +--
  3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+CoMutex lock;
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;
+
  struct BlockDriverState {
  /* Protected by big QEMU lock or read-only after opening.  No special
   * locking needed during I/O...
@@ -997,6 +1014,8 @@ struct BlockDriverState {
  
  /* BdrvChild links to this node may never be frozen */

  bool never_freeze;
+
+BdrvBlockStatusCache block_status_cache;
  };
  
  struct BlockBackendRootState {

diff --git a/block.c b/block.c
index 3f456892d0..bad64d5c4f 100644
--- a/block.c
+++ b/block.c
@@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
  
  qemu_co_queue_init(&bs->flush_queue);
  
+qemu_co_mutex_init(&bs->block_status_cache.lock);

+
  for (i = 0; i < bdrv_drain_all_count; i++) {
  bdrv_drained_begin(bs);
  }
diff --git a/block/io.c b/block/io.c
index 323854d063..320638cc48 100644
--- a/block/io.c
+++ b/block/io.c
@@ -35,6 +35,7 @@
  #include "qapi/error.h"
  #include "qemu/error-report.h"
  #include "qemu/main-loop.h"
+#include "qemu/range.h"
  #include "sysemu/replay.h"
  
  /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */

@@ -1862,6 +1863,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  bool need_flush = false;
  int head = 0;
  int tail = 0;
+BdrvBlockStatusCache *bsc = &bs->block_status_cache;
  
  int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);

  int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  return -ENOTSUP;
  }
  
+/* Invalidate the cached block-status data range if this write overlaps */
+qemu_co_mutex_lock(&bsc->lock);
+if (bsc->valid &&
+ranges_overlap(offset, bytes, bsc->data_start,
+   bsc->data_end - bsc->data_start))
+{
+bsc->valid = false;
+}
+qemu_co_mutex_unlock(&bsc->lock);


I th

Re: [PATCH 3/6] block/file-posix: Do not force-cap *pnum

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH 4/6] block/gluster: Do not force-cap *pnum

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 



Reviewed-by: Vladimir Sementsov-Ogievskiy 



---
  block/gluster.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index e8ee14c8e9..8ef7bb18d5 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1461,7 +1461,8 @@ exit:
   * the specified offset) that are known to be in the same
   * allocated/unallocated state.
   *
- * 'bytes' is the max value 'pnum' should be set to.
+ * 'bytes' is a soft cap for 'pnum'.  If the information is free, 'pnum' may
+ * well exceed it.
   *
   * (Based on raw_co_block_status() from file-posix.c.)
   */
@@ -1500,12 +1501,12 @@ static int coroutine_fn 
qemu_gluster_co_block_status(BlockDriverState *bs,
  } else if (data == offset) {
  /* On a data extent, compute bytes to the end of the extent,
   * possibly including a partial sector at EOF. */
-*pnum = MIN(bytes, hole - offset);
+*pnum = hole - offset;
  ret = BDRV_BLOCK_DATA;


Interesting: isn't it a bug that we don't ROUND_UP *pnum to request_alignment
here, as is done in file-posix?


  } else {
  /* On a hole, compute bytes to the beginning of the next extent.  */
  assert(hole == offset);
-*pnum = MIN(bytes, data - offset);
+*pnum = data - offset;
  ret = BDRV_BLOCK_ZERO;
  }
  




--
Best regards,
Vladimir



Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
  block/nbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 616f9ae6c4..930bd234de 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
  .type = NBD_CMD_BLOCK_STATUS,
  .from = offset,
  .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-   MIN(bytes, s->info.size - offset)),
+   s->info.size - offset),
  .flags = NBD_CMD_FLAG_REQ_ONE,
  };
  



Hmm..

I don't think this change is correct. In contrast with file-posix, you don't
get the extra information for free; you just make a larger request. This means
that the server will have to do more work.

(look at blockstatus_to_extents, it calls bdrv_block_status_above in a loop).

For example, assume that the NBD export is a qcow2 image with all clusters
allocated. With this change, the NBD server will loop through the whole qcow2
image, loading all L2 tables, just to return one big allocated extent.

So, only the server can decide whether it can add some extra information to
the reply for free. But unfortunately, NBD_CMD_FLAG_REQ_ONE doesn't allow it.

--
Best regards,
Vladimir



Re: [PATCH 5/6] block/nbd: Do not force-cap *pnum

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

18.06.2021 23:20, Eric Blake wrote:

On Thu, Jun 17, 2021 at 05:52:46PM +0200, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz 
---
  block/nbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Eric Blake 



diff --git a/block/nbd.c b/block/nbd.c
index 616f9ae6c4..930bd234de 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1702,7 +1702,7 @@ static int coroutine_fn nbd_client_co_block_status(
  .type = NBD_CMD_BLOCK_STATUS,
  .from = offset,
  .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-   MIN(bytes, s->info.size - offset)),
+   s->info.size - offset),
  .flags = NBD_CMD_FLAG_REQ_ONE,


I'd love to someday get rid of using NBD_CMD_FLAG_REQ_ONE (so the
server can reply with more extents in one go), but that's a bigger
task and unrelated to your block-layer cache.



I think that, for this to work, the generic block-status code should be
updated so that we can work with several extents in one go.


--
Best regards,
Vladimir



Re: [PATCH 6/6] block/iscsi: Do not force-cap *pnum

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 18:52, Max Reitz wrote:

bdrv_co_block_status() does it for us, we do not need to do it here.

The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.

Signed-off-by: Max Reitz



Reviewed-by: Vladimir Sementsov-Ogievskiy 


--
Best regards,
Vladimir



Re: [PATCH v3 4/5] progressmeter: protect with a mutex

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 11:11, Emanuele Giuseppe Esposito wrote:

Progressmeter is protected by the AioContext mutex, which
is taken by the block jobs and their caller (like blockdev).

We would like to remove the dependency of block layer code on the
AioContext mutex, since most drivers and the core I/O code are already
not relying on it.

Create a new C file to implement the ProgressMeter API, but keep the
struct as public, to avoid forcing allocation on the heap.

Also add a mutex to be able to provide an accurate snapshot of the
progress values to the caller.

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v3 2/5] block-copy: let ratelimit handle a speed of 0

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 11:11, Emanuele Giuseppe Esposito wrote:

From: Paolo Bonzini 

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-copy.c | 28 +++-
  1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 5808cfe657..943e30b7e6 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -114,7 +114,6 @@ typedef struct BlockCopyState {
  
  SharedResource *mem;
  
-uint64_t speed;

  RateLimit rate_limit;
  } BlockCopyState;
  
@@ -647,21 +646,19 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)

  task->copy_range = false;
  }
  
-if (s->speed) {

-if (!call_state->ignore_ratelimit) {
-uint64_t ns = ratelimit_calculate_delay(&s->rate_limit, 0);
-if (ns > 0) {
-block_copy_task_end(task, -EAGAIN);
-g_free(task);
-qemu_co_sleep_ns_wakeable(&call_state->sleep,
-  QEMU_CLOCK_REALTIME, ns);
-continue;
-}
+if (!call_state->ignore_ratelimit) {
+uint64_t ns = ratelimit_calculate_delay(&s->rate_limit, 0);
+if (ns > 0) {
+block_copy_task_end(task, -EAGAIN);
+g_free(task);
+qemu_co_sleep_ns_wakeable(&call_state->sleep,
+QEMU_CLOCK_REALTIME, ns);


The indentation is broken. It can be fixed when applying; not worth resending.


+continue;
  }
-
-ratelimit_calculate_delay(&s->rate_limit, task->bytes);
  }
  
+ratelimit_calculate_delay(&s->rate_limit, task->bytes);

+
  trace_block_copy_process(s, task->offset);
  
  co_get_from_shres(s->mem, task->bytes);

@@ -853,10 +850,7 @@ void block_copy_set_skip_unallocated(BlockCopyState *s, 
bool skip)
  
  void block_copy_set_speed(BlockCopyState *s, uint64_t speed)

  {
-s->speed = speed;
-if (speed > 0) {
-ratelimit_set_speed(&s->rate_limit, speed, BLOCK_COPY_SLICE_TIME);
-}
+ratelimit_set_speed(&s->rate_limit, speed, BLOCK_COPY_SLICE_TIME);
  
  /*

   * Note: it's good to kick all call states from here, but it should be 
done




--
Best regards,
Vladimir



Re: [PATCH v3 0/5] block-copy: make helper APIs thread safe

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 11:17, Emanuele Giuseppe Esposito wrote:



On 14/06/2021 10:11, Emanuele Giuseppe Esposito wrote:

This series of patches brings thread safety to the smaller APIs used by
block-copy, namely ratelimit, progressmeter, co-shared-resource
and aiotask.
The end goal is to reduce the usage of the AioContext lock in block-copy
by introducing finer-grained locks, thus making the block layer
thread safe.

What's missing for block-copy to be fully thread-safe is fixing
the CoSleep API to allow cross-thread sleep and wakeup.
Paolo is working on it and will post the patches once his new
CoSleep API is accepted.

Patches 1-3 work on ratelimit, 4 covers progressmeter and
5 co-shared-resources.

Signed-off-by: Emanuele Giuseppe Esposito 
---
v3:
* Rebase on current master (had conflicts in block-copy), remove based-on in
   cover letter


Hi Kevin & Max,

I think this series has been reviewed, and I have just rebased it onto current
master. Can you give it a look and let me know if it can be merged?

Thank you,
Emanuele



I think I can queue it myself as a block-job series. ratelimit and
progressmeter are not mentioned in the Block Jobs section of MAINTAINERS, but
these APIs are actually used only by block jobs.

I remember Stefan had a complaint against patch 5 and against the general
design of adding a mutex to every structure. Stefan, what do you think now?
Paolo, is this v3 OK for you?

If everybody is silent, I don't see a reason to slow down this series any
longer and will make a pull request on Tuesday.

--
Best regards,
Vladimir



Re: [PATCH v5 5/6] block/blkdebug: remove new_state field and instead use a local variable

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 11:29, Emanuele Giuseppe Esposito wrote:

There seems to be no benefit in using a field. Replace it with a local
variable, and move the state update before the yields.

The state update has to be done before the yields because, now that a
local variable is used, the updated state would otherwise not be visible
to the other yields.


I am not sure that this sentence helps. I just can't imagine how the state
update should work, keeping these yields in mind :) And making the state
update more atomic makes sense to me simply because it's easier to understand.



Signed-off-by: Emanuele Giuseppe Esposito 


Reviewed-by: Vladimir Sementsov-Ogievskiy 



--
Best regards,
Vladimir



Re: [PATCH v5 6/6] blkdebug: protect rules and suspended_reqs with a lock

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 11:29, Emanuele Giuseppe Esposito wrote:

First, categorize the structure fields to identify what needs
to be protected and what doesn't.

We essentially need to protect only .state, and the 3 lists in
BDRVBlkdebugState.

Then, add the lock and mark the functions accordingly.

Co-developed-by: Paolo Bonzini
Signed-off-by: Emanuele Giuseppe Esposito



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH RFC] meson: add option to use zstd for qcow2 compression by default

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

17.06.2021 22:51, Vladimir Sementsov-Ogievskiy wrote:

Recently we added support for zstd to the qcow2 format, as zstd seems to be
better than zlib in general; importantly (since qcow2 compression is used
mostly for backups), compressed writes are faster with zstd.

Let's add a build option to use zstd by default.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

Hi all! We want to use zstd as the default compression type for newly
created qcow2 images in the upcoming Virtuozzo 8.

I am not sure how interested the community is in such an option;
probably I should just keep a downstream-only patch changing the
default.

But if you like it, I'd be happy to just set the new config option in our
QEMU build instead of maintaining an extra downstream-only patch :)

So, it's an RFC. I can also split the patch so that the refactoring of
qcow2_co_create() goes in a separate preparation patch.

Another RFC question: shouldn't we move to zstd by default upstream
too?

  configure | 10 +-
  meson.build   |  4 
  block/qcow2.c | 32 +---
  meson_options.txt |  2 ++
  4 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/configure b/configure
index debd50c085..b19af43525 100755
--- a/configure
+++ b/configure
@@ -385,6 +385,7 @@ snappy="auto"
  bzip2="auto"
  lzfse="auto"
  zstd="auto"
+qcow2_zstd_default="no"


s/no/disabled/


  guest_agent="$default_feature"
  guest_agent_with_vss="no"
  guest_agent_ntddscsi="no"
@@ -1318,6 +1319,10 @@ for opt do
;;
--enable-zstd) zstd="enabled"
;;
+  --disable-qcow2-zstd-default) qcow2_zstd_default="disabled"
+  ;;
+  --enable-qcow2-zstd-default) qcow2_zstd_default="enabled"
+  ;;
--enable-guest-agent) guest_agent="yes"
;;
--disable-guest-agent) guest_agent="no"
@@ -1903,6 +1908,8 @@ disabled with --disable-FEATURE, default is enabled if 
available
(for reading lzfse-compressed dmg images)
zstdsupport for zstd compression library
(for migration compression and qcow2 cluster compression)
+  qcow2-zstd-default  Use zstd by default for qcow2 image creation
+  (requires zstd enabled)
seccomp seccomp support
coroutine-pool  coroutine freelist (better performance)
glusterfs   GlusterFS backend
@@ -6424,7 +6431,8 @@ if test "$skip_meson" = no; then
  -Dcurl=$curl -Dglusterfs=$glusterfs -Dbzip2=$bzip2 
-Dlibiscsi=$libiscsi \
  -Dlibnfs=$libnfs -Diconv=$iconv -Dcurses=$curses -Dlibudev=$libudev\
  -Drbd=$rbd -Dlzo=$lzo -Dsnappy=$snappy -Dlzfse=$lzfse \
--Dzstd=$zstd -Dseccomp=$seccomp -Dvirtfs=$virtfs -Dcap_ng=$cap_ng \
+-Dzstd=$zstd -Dqcow2_zstd_default=$qcow2_zstd_default \
+-Dseccomp=$seccomp -Dvirtfs=$virtfs -Dcap_ng=$cap_ng \
  -Dattr=$attr -Ddefault_devices=$default_devices \
  -Ddocs=$docs -Dsphinx_build=$sphinx_build -Dinstall_blobs=$blobs \
  -Dvhost_user_blk_server=$vhost_user_blk_server 
-Dmultiprocess=$multiprocess \
diff --git a/meson.build b/meson.build
index d8a92666fb..3d65b6c46b 100644
--- a/meson.build
+++ b/meson.build
@@ -484,6 +484,9 @@ if not get_option('zstd').auto() or have_block
  required: get_option('zstd'),
  method: 'pkg-config', kwargs: static_kwargs)
  endif
+if not zstd.found() and get_option('qcow2_zstd_default').enabled()
+  error('--enable-qcow2-zstd-default: Cannot use zstd by default without 
enabling zstd')
+endif
  gbm = not_found
  if 'CONFIG_GBM' in config_host
gbm = declare_dependency(compile_args: config_host['GBM_CFLAGS'].split(),
@@ -1168,6 +1171,7 @@ config_host_data.set('CONFIG_GETTID', has_gettid)
  config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
  config_host_data.set('CONFIG_STATX', has_statx)
  config_host_data.set('CONFIG_ZSTD', zstd.found())
+config_host_data.set('CONFIG_QCOW2_ZSTD_DEFAULT', 
get_option('qcow2_zstd_default').enabled())
  config_host_data.set('CONFIG_FUSE', fuse.found())
  config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
  config_host_data.set('CONFIG_X11', x11.found())
diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..06bfbbf7b8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3540,17 +3540,36 @@ qcow2_co_create(BlockdevCreateOptions *create_options, 
Error **errp)
  }
  }
  
-if (qcow2_opts->has_compression_type &&

-qcow2_opts->compression_type != QCOW2_COMPRESSION_TYPE_ZLIB) {
-
-ret = -EINVAL;
-
-if (version < 3) {
+if (version < 3) {
+if (qcow2_opts->has_compression_type &&
+qcow2_opts->compression_type != QCOW2_COMPRESSION_TYPE_ZLIB)
+{
+ret = -EINVAL;
  error_setg(errp, "Non-zlib compression type is only supported with 
"
 "compatibility level 1.1 and above (use version=v3 or "
 "greater)");
  goto out;
  }
+} else {
+if (qco

[PATCH 0/2] introduce QEMU_AUTO_VFREE

2021-06-19 Thread Vladimir Sementsov-Ogievskiy
Hi all!

There is a good movement to use the g_autofree macro, which helps to
automatically call g_free on exit from a code block.

We lack a similar possibility for the qemu_memalign() function family. Let's
add one; it seems rather simple with the help of the "cleanup" attribute.

I'll update more places in a follow-up if this is accepted.

Vladimir Sementsov-Ogievskiy (2):
  introduce QEMU_AUTO_VFREE
  block/commit: use QEMU_AUTO_VFREE

 include/qemu/osdep.h |  2 ++
 block/commit.c   | 25 +
 2 files changed, 11 insertions(+), 16 deletions(-)

-- 
2.29.2




[PATCH 1/2] introduce QEMU_AUTO_VFREE

2021-06-19 Thread Vladimir Sementsov-Ogievskiy
Introduce a convenient macro that works for qemu_memalign() the way
g_autofree works for g_malloc().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/osdep.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 236a045671..844658a764 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -366,6 +366,8 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, 
bool shared);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
+#define QEMU_AUTO_VFREE __attribute__((cleanup(qemu_vfree)))
+
 #define QEMU_MADV_INVALID -1
 
 #if defined(CONFIG_MADVISE)
-- 
2.29.2




[PATCH 2/2] block/commit: use QEMU_AUTO_VFREE

2021-06-19 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/commit.c | 25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index b7f0c7c061..42792b4556 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -119,24 +119,24 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
 uint64_t delay_ns = 0;
 int ret = 0;
 int64_t n = 0; /* bytes */
-void *buf = NULL;
+QEMU_AUTO_VFREE void *buf = NULL;
 int64_t len, base_len;
 
-ret = len = blk_getlength(s->top);
+len = blk_getlength(s->top);
 if (len < 0) {
-goto out;
+return len;
 }
 job_progress_set_remaining(&s->common.job, len);
 
-ret = base_len = blk_getlength(s->base);
+base_len = blk_getlength(s->base);
 if (base_len < 0) {
-goto out;
+return base_len;
 }
 
 if (base_len < len) {
 ret = blk_truncate(s->base, len, false, PREALLOC_MODE_OFF, 0, NULL);
 if (ret) {
-goto out;
+return ret;
 }
 }
 
@@ -174,7 +174,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
 block_job_error_action(&s->common, s->on_error,
error_in_source, -ret);
 if (action == BLOCK_ERROR_ACTION_REPORT) {
-goto out;
+return ret;
 } else {
 n = 0;
 continue;
@@ -190,12 +190,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
 }
 }
 
-ret = 0;
-
-out:
-qemu_vfree(buf);
-
-return ret;
+return 0;
 }
 
 static const BlockJobDriver commit_job_driver = {
@@ -435,7 +430,7 @@ int bdrv_commit(BlockDriverState *bs)
 int ro;
 int64_t n;
 int ret = 0;
-uint8_t *buf = NULL;
+QEMU_AUTO_VFREE uint8_t *buf = NULL;
 Error *local_err = NULL;
 
 if (!drv)
@@ -556,8 +551,6 @@ int bdrv_commit(BlockDriverState *bs)
 
 ret = 0;
 ro_cleanup:
-qemu_vfree(buf);
-
 blk_unref(backing);
 if (bdrv_cow_bs(bs) != backing_file_bs) {
 bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
-- 
2.29.2




Re: [PATCH v4 1/6] block-copy: small refactor in block_copy_task_entry and block_copy_common

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

Use a local variable instead of referencing BlockCopyState through a
BlockCopyCallState or BlockCopyTask every time.
This is in preparation for next patches.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v4 2/6] block-copy: streamline choice of copy_range vs. read/write

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

From: Paolo Bonzini 

Put the logic to determine the copy size in a separate function, so
that there is a simple state machine for the possible methods of
copying data from one BlockDriverState to the other.

Use .method instead of .copy_range as in-out argument, and
include also .zeroes as an additional copy method.

While at it, store the common computation of block_copy_max_transfer
into a new field of BlockCopyState, and make sure that we always
obey max_transfer; that's more efficient even for the
COPY_RANGE_READ_WRITE case.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Paolo Bonzini 
---
  block/block-copy.c | 174 +++--
  1 file changed, 89 insertions(+), 85 deletions(-)



[..]


@@ -267,28 +293,28 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
BdrvChild *target,
  .len = bdrv_dirty_bitmap_size(copy_bitmap),
  .write_flags = write_flags,
  .mem = shres_create(BLOCK_COPY_MAX_MEM),
+.max_transfer = QEMU_ALIGN_DOWN(
+block_copy_max_transfer(source, target),
+cluster_size),
  };
  
-if (block_copy_max_transfer(source, target) < cluster_size) {

+if (s->max_transfer < cluster_size) {
  /*
   * copy_range does not respect max_transfer. We don't want to bother
   * with requests smaller than block-copy cluster size, so fallback to
   * buffered copying (read and write respect max_transfer on their
   * behalf).
   */
-s->use_copy_range = false;
-s->copy_size = cluster_size;
+s->method = COPY_READ_WRITE_CLUSTER;
  } else if (write_flags & BDRV_REQ_WRITE_COMPRESSED) {
  /* Compression supports only cluster-size writes and no copy-range. */
-s->use_copy_range = false;
-s->copy_size = cluster_size;
+s->method = COPY_READ_WRITE_CLUSTER;
  } else {
  /*
   * We enable copy-range, but keep small copy_size, until first
   * successful copy_range (look at block_copy_do_copy).
   */


The comment is outdated. Maybe:

/* if copy range enabled, start with COPY_RANGE_SMALL, until first successful 
copy_range (look at block_copy_do_copy). */


-s->use_copy_range = use_copy_range;
-s->copy_size = MAX(s->cluster_size, BLOCK_COPY_MAX_BUFFER);
+s->method = use_copy_range ? COPY_RANGE_SMALL : COPY_READ_WRITE;
  }
  
  ratelimit_init(&s->rate_limit);


[..]


@@ -377,76 +400,59 @@ static int coroutine_fn block_copy_do_copy(BlockCopyState 
*s,
  *error_is_read = false;
  }
  return ret;
-}
  
-if (*copy_range) {

+case COPY_RANGE_SMALL:
+case COPY_RANGE_FULL:
  ret = bdrv_co_copy_range(s->source, offset, s->target, offset, nbytes,
   0, s->write_flags);
-if (ret < 0) {
-trace_block_copy_copy_range_fail(s, offset, ret);
-*copy_range = false;
-/* Fallback to read+write with allocated buffer */
-} else {
+if (ret >= 0) {
+/* Successful copy-range, increase copy_size.  */


There is no copy_size now, so: "increase chunk size".


+*method = COPY_RANGE_FULL;
  return 0;
  }
-}
  


The comment tweaks are non-critical; they can be done in flight.

Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v4 3/6] block-copy: improve comments of BlockCopyTask and BlockCopyState types and functions

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

As done in BlockCopyCallState, categorize BlockCopyTask
and BlockCopyState in IN, State and OUT fields.
This is just to understand which field has to be protected with a lock.

.sleep_state is handled in the series "coroutine: new sleep/wake API"
and thus here left as TODO.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-copy.c | 49 +-
  1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 3f26be8ddc..5ff7764e87 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -52,29 +52,35 @@ typedef struct BlockCopyCallState {
  /* Coroutine where async block-copy is running */
  Coroutine *co;
  
-/* To reference all call states from BlockCopyState */

-QLIST_ENTRY(BlockCopyCallState) list;
-
  /* State */
-int ret;
  bool finished;
-QemuCoSleep sleep;
-bool cancelled;
+QemuCoSleep sleep; /* TODO: protect API with a lock */
+
+/* To reference all call states from BlockCopyState */
+QLIST_ENTRY(BlockCopyCallState) list;
  
  /* OUT parameters */

+bool cancelled;
  bool error_is_read;
+int ret;
  } BlockCopyCallState;
  
  typedef struct BlockCopyTask {

  AioTask task;
  
+/*

+ * IN parameters. Initialized in block_copy_task_create()
+ * and never changed.
+ */


That's just not true for the method field :(


  BlockCopyState *s;
  BlockCopyCallState *call_state;
  int64_t offset;
-int64_t bytes;
  BlockCopyMethod method;
-QLIST_ENTRY(BlockCopyTask) list;
+
+/* State */
  CoQueue wait_queue; /* coroutines blocked on this task */
+int64_t bytes;
+QLIST_ENTRY(BlockCopyTask) list;
  } BlockCopyTask;
  
  static int64_t task_end(BlockCopyTask *task)

@@ -90,15 +96,25 @@ typedef struct BlockCopyState {
   */
  BdrvChild *source;
  BdrvChild *target;
-BdrvDirtyBitmap *copy_bitmap;
+
+/* State */
  int64_t in_flight_bytes;
-int64_t cluster_size;
  BlockCopyMethod method;
-int64_t max_transfer;
-uint64_t len;
  QLIST_HEAD(, BlockCopyTask) tasks; /* All tasks from all block-copy calls 
*/
  QLIST_HEAD(, BlockCopyCallState) calls;
+/* State fields that use a thread-safe API */
+BdrvDirtyBitmap *copy_bitmap;
+ProgressMeter *progress;
+SharedResource *mem;
+RateLimit rate_limit;
  
+/*

+ * IN parameters. Initialized in block_copy_state_new()
+ * and never changed.
+ */
+int64_t cluster_size;
+int64_t max_transfer;
+uint64_t len;
  BdrvRequestFlags write_flags;
  
  /*

@@ -114,14 +130,11 @@ typedef struct BlockCopyState {
   * In this case, block_copy() will query the source’s allocation status,
   * skip unallocated regions, clear them in the copy_bitmap, and invoke
   * block_copy_reset_unallocated() every time it does.
+ *
+ * This field is set in backup_run() before coroutines are run,
+ * therefore is an IN.


That's not true: it is modified from backup_run() when block-copy is already
initialized, and block_copy() calls may be made from the backup-top filter.


   */
  bool skip_unallocated;
-
-ProgressMeter *progress;
-
-SharedResource *mem;
-
-RateLimit rate_limit;
  } BlockCopyState;
  
  static BlockCopyTask *find_conflicting_task(BlockCopyState *s,





--
Best regards,
Vladimir



[PATCH 12/15] hw/scsi/megasas: Improve megasas_queue_ops min_access_size

2021-06-19 Thread Richard Henderson
This device reads as zeros and ignores writes, so we can trivially support
operations of any size.

Cc: qemu-block@nongnu.org
Cc: Fam Zheng  
Cc: Hannes Reinecke 
Signed-off-by: Richard Henderson 
---
 hw/scsi/megasas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index c98cb7a499..197b75225f 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -2207,7 +2207,7 @@ static const MemoryRegionOps megasas_queue_ops = {
 .write = megasas_queue_write,
 .endianness = DEVICE_LITTLE_ENDIAN,
 .impl = {
-.min_access_size = 8,
+.min_access_size = 1,
 .max_access_size = 8,
 }
 };
-- 
2.25.1




[PATCH 11/15] hw/scsi/megasas: Fix megasas_mmio_ops sizes

2021-06-19 Thread Richard Henderson
All of the megasas mmio registers are 32-bit not 64-bit.

Cc: qemu-block@nongnu.org
Cc: Fam Zheng 
Cc: Hannes Reinecke 
Signed-off-by: Richard Henderson 
---
 hw/scsi/megasas.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index 8f2389d2c6..c98cb7a499 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -2163,8 +2163,8 @@ static const MemoryRegionOps megasas_mmio_ops = {
 .write = megasas_mmio_write,
 .endianness = DEVICE_LITTLE_ENDIAN,
 .impl = {
-.min_access_size = 8,
-.max_access_size = 8,
+.min_access_size = 4,
+.max_access_size = 4,
 }
 };
 
-- 
2.25.1




Re: [PATCH v4 3/6] block-copy: improve comments of BlockCopyTask and BlockCopyState types and functions

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

As done in BlockCopyCallState, categorize BlockCopyTask
and BlockCopyState in IN, State and OUT fields.
This is just to understand which field has to be protected with a lock.

.sleep_state is handled in the series "coroutine: new sleep/wake API"
and thus here left as TODO.

Signed-off-by: Emanuele Giuseppe Esposito
---
  block/block-copy.c | 49 +-
  1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 3f26be8ddc..5ff7764e87 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -52,29 +52,35 @@ typedef struct BlockCopyCallState {
  /* Coroutine where async block-copy is running */
  Coroutine *co;
  
-/* To reference all call states from BlockCopyState */
-QLIST_ENTRY(BlockCopyCallState) list;
-
  /* State */
-int ret;
  bool finished;
-QemuCoSleep sleep;
-bool cancelled;
+QemuCoSleep sleep; /* TODO: protect API with a lock */
+
+/* To reference all call states from BlockCopyState */
+QLIST_ENTRY(BlockCopyCallState) list;
  
  /* OUT parameters */

+bool cancelled;


Actually, cancelled is not an OUT field. It's set by the user to cancel the operation, 
and block-copy checks the field to decide whether it should cancel at the moment or 
not. So, it's part of the state.

Also, I only now realize that "OUT parameter" sounds strange here, since a "parameter" is 
generally an input. We may have "out parameters" for functions, as all parameters of a 
function are called "parameters" anyway, but I think "OUT fields" is more correct for a 
struct. I don't insist, as I'm not an expert in English (for sure, you'll find mistakes 
even in this paragraph :\

--
Best regards,
Vladimir



Re: [PATCH v4 2/6] block-copy: streamline choice of copy_range vs. read/write

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

@@ -158,8 +183,9 @@ static BlockCopyTask *block_copy_task_create(BlockCopyState 
*s,
   int64_t offset, int64_t bytes)
  {
  BlockCopyTask *task;
-int64_t max_chunk = MIN_NON_ZERO(s->copy_size, call_state->max_chunk);
+int64_t max_chunk = block_copy_chunk_size(s);
  
+max_chunk = MIN_NON_ZERO(max_chunk, call_state->max_chunk);


Note that you modify this code in patch 4. So, you may write here

int64_t max_chunk;

max_chunk = MIN_NON_ZERO(block_copy_chunk_size(s), call_state->max_chunk);

to not modify it later.

--
Best regards,
Vladimir



Re: [PATCH v4 3/6] block-copy: improve comments of BlockCopyTask and BlockCopyState types and functions

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

19.06.2021 18:23, Vladimir Sementsov-Ogievskiy wrote:

  typedef struct BlockCopyTask {
  AioTask task;
+    /*
+ * IN parameters. Initialized in block_copy_task_create()
+ * and never changed.
+ */


That's just not true for the method field :(


I think we just need to document that @method is never accessed concurrently.

--
Best regards,
Vladimir



Re: [PATCH v4 3/6] block-copy: improve comments of BlockCopyTask and BlockCopyState types and functions

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -52,29 +52,35 @@ typedef struct BlockCopyCallState {
  /* Coroutine where async block-copy is running */
  Coroutine *co;
  
-/* To reference all call states from BlockCopyState */

-QLIST_ENTRY(BlockCopyCallState) list;
-
  /* State */
-int ret;
  bool finished;
-QemuCoSleep sleep;
-bool cancelled;
+QemuCoSleep sleep; /* TODO: protect API with a lock */
+
+/* To reference all call states from BlockCopyState */
+QLIST_ENTRY(BlockCopyCallState) list;
  
  /* OUT parameters */

+bool cancelled;
  bool error_is_read;
+int ret;


Hmm, about that: is @ret an "OUT parameter"? Yes, it is.

But someone may think that OUT parameters don't need locking the way "State" 
fields do (otherwise, what is the difference for the person reading these comments?). 
That would be wrong: ret is modified under the mutex for a reason.

Actually, the full description of the "ret" field's usage may look as follows:

Set concurrently by tasks under the mutex. Only set once, by the first failed 
task (and untouched if no task failed).
After finish (i.e. once call_state->finished is true) it is not modified anymore 
and may be read safely without the mutex.

So, before finished, ret is a kind of "State" too: it is both read and written 
by tasks.

This shows me that dividing the fields into "IN", "State" and "OUT" doesn't 
really help here. In this series we use different policies of concurrent access 
for different fields: some are accessed only under the mutex, others have a more 
complex usage scenario (like @ret here), and some need atomic access.

--
Best regards,
Vladimir



Re: [PATCH v4 5/6] block-copy: add a CoMutex

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

Add a CoMutex to protect concurrent access of block-copy
data structures.

This mutex also protects .copy_bitmap, because its thread-safe
API does not prevent it from assigning two tasks to the same
bitmap region.

.finished, .cancelled and reads to .ret and .error_is_read will be
protected in the following patch, because are used also outside
coroutines.

Also set block_copy_task_create as coroutine_fn because:
1) it is static and only invoked by coroutine functions
2) this patch introduces and uses a CoMutex lock there

Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-copy.c | 79 +-
  include/block/block-copy.h |  2 +
  2 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index afa2f484f0..6416929abd 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -61,6 +61,7 @@ typedef struct BlockCopyCallState {
  
  /* OUT parameters */

  bool cancelled;
+/* Fields protected by lock in BlockCopyState */
  bool error_is_read;
  int ret;
  } BlockCopyCallState;
@@ -77,8 +78,12 @@ typedef struct BlockCopyTask {
  int64_t offset;
  BlockCopyMethod method;
  
-/* State */

+/* State. Protected by lock in BlockCopyState */
  CoQueue wait_queue; /* coroutines blocked on this task */
+/*
+ * Only protect the case of parallel read while updating @bytes
+ * value in block_copy_task_shrink().
+ */


So, you have to add new comments and modify a comment added in the previous commit. 
That's another sign that patch 03 should just be merged here.


  int64_t bytes;
  QLIST_ENTRY(BlockCopyTask) list;
  } BlockCopyTask;
@@ -97,7 +102,8 @@ typedef struct BlockCopyState {
  BdrvChild *source;
  BdrvChild *target;
  
-/* State */

+/* State. Protected by lock */
+CoMutex lock;
  int64_t in_flight_bytes;
  BlockCopyMethod method;
  QLIST_HEAD(, BlockCopyTask) tasks; /* All tasks from all block-copy calls 
*/
@@ -137,6 +143,7 @@ typedef struct BlockCopyState {
  bool skip_unallocated;
  } BlockCopyState;
  
+/* Called with lock held */

  static BlockCopyTask *find_conflicting_task(BlockCopyState *s,
  int64_t offset, int64_t bytes)
  {
@@ -154,6 +161,8 @@ static BlockCopyTask *find_conflicting_task(BlockCopyState 
*s,
  /*
   * If there are no intersecting tasks return false. Otherwise, wait for the
   * first found intersecting tasks to finish and return true.
+ *
+ * Called with lock held.


Please:

Called with lock held; may temporarily release the lock. A return value of 0 proves 
that the lock was NOT released.


   */
  static bool coroutine_fn block_copy_wait_one(BlockCopyState *s, int64_t 
offset,
   int64_t bytes)
@@ -164,7 +173,7 @@ static bool coroutine_fn block_copy_wait_one(BlockCopyState 
*s, int64_t offset,
  return false;
  }
  
-qemu_co_queue_wait(&task->wait_queue, NULL);
+qemu_co_queue_wait(&task->wait_queue, &s->lock);
  
  return true;

  }
@@ -191,14 +200,15 @@ static int64_t block_copy_chunk_size(BlockCopyState *s)


Hmm, you had a /* Called with lock held */ comment for block_copy_chunk_size() in 
a previous version. Why was it dropped?


   * Search for the first dirty area in offset/bytes range and create task at
   * the beginning of it.
   */
-static BlockCopyTask *block_copy_task_create(BlockCopyState *s,
- BlockCopyCallState *call_state,
- int64_t offset, int64_t bytes)
+static coroutine_fn BlockCopyTask *block_copy_task_create(BlockCopyState *s,
+BlockCopyCallState *call_state,
+int64_t offset, int64_t bytes)


This breaks the QEMU coding style (docs/devel/style.rst, "Multiline Indent").

The function "type" becomes too long; common practice is to keep the type on a separate 
line:

static coroutine_fn BlockCopyTask *
block_copy_task_create(BlockCopyState *s, BlockCopyCallState *call_state,
   int64_t offset, int64_t bytes)

(this style is already used in one place in the file, so it would be consistent to do 
the same thing here)

Another way is a 4-space indent:

static coroutine_fn BlockCopyTask *block_copy_task_create(BlockCopyState *s,
BlockCopyCallState *call_state, int64_t offset, int64_t bytes)


Choose whichever you prefer.


  {
  BlockCopyTask *task;
-int64_t max_chunk = block_copy_chunk_size(s);
+int64_t max_chunk;
  
-max_chunk = MIN_NON_ZERO(max_chunk, call_state->max_chunk);

+QEMU_LOCK_GUARD(&s->lock);
+max_chunk = MIN_NON_ZERO(block_copy_chunk_size(s), call_state->max_chunk);
  if (!bdrv_dirty_bitmap_next_dirty_area(s->copy_bitmap,
 offset, offset + bytes,
 max_ch

Re: [PATCH] block/rbd: Add support for rbd image encryption

2021-06-19 Thread Ilya Dryomov
On Thu, Jun 17, 2021 at 6:05 PM Or Ozeri  wrote:
>
> Starting from ceph Pacific, RBD has built-in support for image-level 
> encryption.
> Currently supported formats are LUKS version 1 and 2.
>
> There are 2 new relevant librbd APIs for controlling encryption, both expect 
> an
> open image context:
>
> rbd_encryption_format: formats an image (i.e. writes the LUKS header)
> rbd_encryption_load: loads encryptor/decryptor to the image IO stack
>
> This commit extends the qemu rbd driver API to support the above.
>
> Signed-off-by: Or Ozeri 
> ---
>  block/raw-format.c   |   7 +
>  block/rbd.c  | 371 ++-
>  qapi/block-core.json | 110 -
>  3 files changed, 482 insertions(+), 6 deletions(-)
>
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 7717578ed6..f6e70e2356 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -369,6 +369,12 @@ static int raw_get_info(BlockDriverState *bs, 
> BlockDriverInfo *bdi)
>  return bdrv_get_info(bs->file->bs, bdi);
>  }
>
> +static ImageInfoSpecific *raw_get_specific_info(BlockDriverState *bs,
> +Error **errp)
> +{
> +return bdrv_get_specific_info(bs->file->bs, errp);
> +}
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  if (bs->probed) {
> @@ -603,6 +609,7 @@ BlockDriver bdrv_raw = {
>  .has_variable_length  = true,
>  .bdrv_measure = &raw_measure,
>  .bdrv_get_info= &raw_get_info,
> +.bdrv_get_specific_info = &raw_get_specific_info,

Hi Or,

This feels a bit contentious to me.

AFAIU ImageInfoSpecific is for format-specific information.  "raw"
is a format, and passing the request down the stack this way results
in a somewhat confusing output such as

$ qemu-img info rbd:foo/bar
image: json:{"driver": "raw", "file": {"pool": "foo", "image":
"bar", "driver": "rbd", "namespace": ""}}
file format: raw
...
Format specific information:
   

I think this should be broken out into its own patch and get separate
acks.

>  .bdrv_refresh_limits  = &raw_refresh_limits,
>  .bdrv_probe_blocksizes = &raw_probe_blocksizes,
>  .bdrv_probe_geometry  = &raw_probe_geometry,
> diff --git a/block/rbd.c b/block/rbd.c
> index f098a89c7b..183b17cd84 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -73,6 +73,18 @@
>  #define LIBRBD_USE_IOVEC 0
>  #endif
>
> +#define RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN 8
> +
> +static const char rbd_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -341,6 +353,206 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>  }
>  }
>
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION
> +static int qemu_rbd_convert_luks_options(
> +RbdEncryptionOptionsLUKSBase *luks_opts,
> +char **passphrase,
> +Error **errp)
> +{
> +int r = 0;
> +
> +if (!luks_opts->has_key_secret) {
> +r = -EINVAL;
> +error_setg_errno(errp, -r, "missing encrypt.key-secret");
> +return r;
> +}

Why is key-secret optional?

> +
> +*passphrase = qcrypto_secret_lookup_as_utf8(luks_opts->key_secret, errp);
> +if (!*passphrase) {
> +return -EIO;
> +}
> +
> +return 0;
> +}
> +
> +static int qemu_rbd_convert_luks_create_options(
> +RbdEncryptionCreateOptionsLUKSBase *luks_opts,
> +rbd_encryption_algorithm_t *alg,
> +char **passphrase,
> +Error **errp)
> +{
> +int r = 0;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionCreateOptionsLUKSBase_base(luks_opts),
> +passphrase, errp);
> +if (r < 0) {
> +return r;
> +}
> +
> +if (luks_opts->has_cipher_alg) {
> +switch (luks_opts->cipher_alg) {
> +case QCRYPTO_CIPHER_ALG_AES_128: {
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES128;
> +break;
> +}
> +case QCRYPTO_CIPHER_ALG_AES_256: {
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
> +break;
> +}
> +default: {
> +r = -ENOTSUP;
> +error_setg_errno(errp, -r, "unknown encryption algorithm: 
> %u",
> + luks_opts->cipher_alg);
> +return r;
> +}
> +}
> +} else {
> +/* default alg */
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
> +}
> +
> +return 0;
> +}
> +
> +static int qemu_rbd_encryption_format(rbd_image_t image,
> +  RbdEncryptionCreateOptions *encrypt,
> +  Error **errp)
> 

Re: [PATCH V3 3/6] block/rbd: update s->image_size in qemu_rbd_getlength

2021-06-19 Thread Ilya Dryomov
On Wed, May 19, 2021 at 4:26 PM Peter Lieven  wrote:
>
> in case the image size changed we should adjust our internally stored size as 
> well.
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index b4caea4f1b..97a2ae4c84 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -976,6 +976,7 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>  return r;
>  }
>
> +s->image_size = info.size;
>  return info.size;

Since you are touching this function, you might as well switch to
rbd_get_size() to put the size directly into s->image_size and return
s->image_size, skipping rbd_image_info_t.

Thanks,

Ilya



Re: [PATCH V3 2/6] block/rbd: store object_size in BDRVRBDState

2021-06-19 Thread Ilya Dryomov
On Wed, May 19, 2021 at 4:29 PM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 18 +++---
>  1 file changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 6b1cbe1d75..b4caea4f1b 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -90,6 +90,7 @@ typedef struct BDRVRBDState {
>  char *snap;
>  char *namespace;
>  uint64_t image_size;
> +uint64_t object_size;
>  } BDRVRBDState;
>
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -675,6 +676,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  const QDictEntry *e;
>  Error *local_err = NULL;
>  char *keypairs, *secretid;
> +rbd_image_info_t info;
>  int r;
>
>  keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
> @@ -739,13 +741,15 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto failed_open;
>  }
>
> -r = rbd_get_size(s->image, &s->image_size);
> +r = rbd_stat(s->image, &info, sizeof(info));
>  if (r < 0) {
> -error_setg_errno(errp, -r, "error getting image size from %s",
> +error_setg_errno(errp, -r, "error getting image info from %s",
>   s->image_name);
>  rbd_close(s->image);
>  goto failed_open;
>  }
> +s->image_size = info.size;
> +s->object_size = info.obj_size;
>
>  /* If we are using an rbd snapshot, we must be r/o, otherwise
>   * leave as-is */
> @@ -957,15 +961,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState 
> *bs,
>  static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
>  {
>  BDRVRBDState *s = bs->opaque;
> -rbd_image_info_t info;
> -int r;
> -
> -r = rbd_stat(s->image, &info, sizeof(info));
> -if (r < 0) {
> -return r;
> -}
> -
> -bdi->cluster_size = info.obj_size;
> +bdi->cluster_size = s->object_size;
>  return 0;
>  }
>
> --
> 2.17.1
>
>
>

Reviewed-by: Ilya Dryomov 

Thanks,

Ilya



Re: [PATCH v4 6/6] block-copy: atomic .cancelled and .finished fields in BlockCopyCallState

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

14.06.2021 10:33, Emanuele Giuseppe Esposito wrote:

By adding acquire/release pairs, we ensure that .ret and .error_is_read
fields are written by block_copy_dirty_clusters before .finished is true.


And that they are read by the API user only after .finished is true.



The atomic here are necessary because the fields are concurrently modified
also outside coroutines.


To be honest, finished is only modified inside coroutines, and read outside.



Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-copy.c | 33 ++---
  1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 6416929abd..5348e1f61b 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -53,14 +53,14 @@ typedef struct BlockCopyCallState {
  Coroutine *co;
  
  /* State */

-bool finished;
+bool finished; /* atomic */


So, the logic around finished:

The block_copy thread does:
0. finished is false
1. tasks set ret and error_is_read
2. qatomic_store_release() sets finished -> true
3. after that point, ret and error_is_read are not modified

Other threads can:

- qatomic_read() finished, just to check whether we are finished or not

- if finished, read ret and error_is_read safely. If you are not sure that 
block-copy has finished, use qatomic_load_acquire() on finished first, to be sure 
that you read ret and error_is_read AFTER finished has been read and checked to be true.


  QemuCoSleep sleep; /* TODO: protect API with a lock */
  
  /* To reference all call states from BlockCopyState */

  QLIST_ENTRY(BlockCopyCallState) list;
  
  /* OUT parameters */

-bool cancelled;
+bool cancelled; /* atomic */


Logic around cancelled is simpler:

- false at start

- qatomic_read is allowed from any thread

- qatomic_write to true is allowed from any thread

- never write to false

Note that cancelling and finishing are racy: the user can cancel a block-copy that has 
already finished. We could probably change that, but I'm not sure it is worth doing. 
Still, maybe leave a comment about it in the API documentation.


  /* Fields protected by lock in BlockCopyState */
  bool error_is_read;
  int ret;
@@ -650,7 +650,8 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
  assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
  assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
  
-while (bytes && aio_task_pool_status(aio) == 0 && !call_state->cancelled) {
+while (bytes && aio_task_pool_status(aio) == 0 &&
+   !qatomic_read(&call_state->cancelled)) {
  BlockCopyTask *task;
  int64_t status_bytes;
  
@@ -761,7 +762,7 @@ static int coroutine_fn block_copy_common(BlockCopyCallState *call_state)

  do {
  ret = block_copy_dirty_clusters(call_state);
  
-if (ret == 0 && !call_state->cancelled) {

+if (ret == 0 && !qatomic_read(&call_state->cancelled)) {
  WITH_QEMU_LOCK_GUARD(&s->lock) {
  /*
   * Check that there is no task we still need to
@@ -792,9 +793,9 @@ static int coroutine_fn 
block_copy_common(BlockCopyCallState *call_state)
   * 2. We have waited for some intersecting block-copy request
   *It may have failed and produced new dirty bits.
   */
-} while (ret > 0 && !call_state->cancelled);
+} while (ret > 0 && !qatomic_read(&call_state->cancelled));
  
-call_state->finished = true;

+qatomic_store_release(&call_state->finished, true);


so, all writes to ret and error_is_read are finished to this point.

  
  if (call_state->cb) {

  call_state->cb(call_state->cb_opaque);
@@ -857,35 +858,37 @@ void block_copy_call_free(BlockCopyCallState *call_state)
  return;
  }
  
-assert(call_state->finished);

+assert(qatomic_load_acquire(&call_state->finished));


Here we don't need load_acquire, as we don't read other fields; qatomic_read() is 
enough.


  g_free(call_state);
  }
  
  bool block_copy_call_finished(BlockCopyCallState *call_state)

  {
-return call_state->finished;
+return qatomic_load_acquire(&call_state->finished);


and here


  }
  
  bool block_copy_call_succeeded(BlockCopyCallState *call_state)

  {
-return call_state->finished && !call_state->cancelled &&
-call_state->ret == 0;
+return qatomic_load_acquire(&call_state->finished) &&
+   !qatomic_read(&call_state->cancelled) &&
+   call_state->ret == 0;


Here qatomic_load_acquire() is reasonable: it protects the read of ->ret.


  }
  
  bool block_copy_call_failed(BlockCopyCallState *call_state)

  {
-return call_state->finished && !call_state->cancelled &&
-call_state->ret < 0;
+return qatomic_load_acquire(&call_state->finished) &&
+   !qatomic_read(&call_state->cancelled) &&
+   call_state->ret < 0;


Reasonable here, too.


  }
  
  bool block_copy_call_cancelled(BlockCopyCallState *call_state)

  {
-return call_state->cancelled;
+return

Re: [PATCH V3 6/6] block/rbd: drop qemu_rbd_refresh_limits

2021-06-19 Thread Ilya Dryomov
On Wed, May 19, 2021 at 4:26 PM Peter Lieven  wrote:
>
> librbd supports 1 byte alignment for all aio operations.
>
> Currently, there is no API call to query limits from the ceph backend.
> So drop the bdrv_refresh_limits completely until there is such an API call.
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index ee13f08a74..368a674aa0 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -228,14 +228,6 @@ done:
>  return;
>  }
>
> -
> -static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
> -{
> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> -bs->bl.request_alignment = 512;
> -}
> -
> -
>  static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
>   Error **errp)
>  {
> @@ -1128,7 +1120,6 @@ static BlockDriver bdrv_rbd = {
>  .format_name= "rbd",
>  .instance_size  = sizeof(BDRVRBDState),
>  .bdrv_parse_filename= qemu_rbd_parse_filename,
> -.bdrv_refresh_limits= qemu_rbd_refresh_limits,
>  .bdrv_file_open = qemu_rbd_open,
>  .bdrv_close = qemu_rbd_close,
>  .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
> --
> 2.17.1
>
>
>

librbd does support 1-byte-aligned I/O (with the caveat that those
code paths are probably not very well tested).  Regardless, I think
it is better to do read-modify-write and similar handling in librbd
than in QEMU.

Reviewed-by: Ilya Dryomov 

Thanks,

Ilya



Re: [PATCH v3 0/5] block-copy: make helper APIs thread safe

2021-06-19 Thread Vladimir Sementsov-Ogievskiy

19.06.2021 15:21, Vladimir Sementsov-Ogievskiy wrote:

14.06.2021 11:17, Emanuele Giuseppe Esposito wrote:



On 14/06/2021 10:11, Emanuele Giuseppe Esposito wrote:

This serie of patches bring thread safety to the smaller APIs used by
block-copy, namely ratelimit, progressmeter, co-shared-resource
and aiotask.
The end goal is to reduce the usage of AioContexlock in block-copy,
by introducing smaller granularity locks thus on making the block layer
thread safe.

What's missing for block-copy to be fully thread-safe is fixing
the CoSleep API to allow cross-thread sleep and wakeup.
Paolo is working on it and will post the patches once his new
CoSleep API is accepted.

Patches 1-3 work on ratelimit, 4 covers progressmeter and
5 co-shared-resources.

Signed-off-by: Emanuele Giuseppe Esposito 
---
v3:
* Rebase on current master (had conflicts in block-copy), remove based-on in
   cover letter


Hi Kevin & Max,

I think this series has been reviewed and I have just rebased it onto current master. 
Can you give it a look and let me know if it can be merged?

Thank you,
Emanuele



I think I can queue it myself as a block-jobs series. Ratelimit and 
progressmeter are not mentioned in the Block Jobs section of MAINTAINERS, but 
these APIs are actually used only by block jobs.

I remember Stefan had a complaint about patch 5 and about the general design of 
adding a mutex to every structure. Stefan, what do you think now? Paolo, is this 
v3 OK for you?

If everybody is silent, I don't see a reason to slow this series down any more, and 
will make a pull request on Tuesday.



Hmm, actually, I'll wait for the final version of "[PATCH v4 0/6] block-copy: protect 
block-copy internal structures", which seems close, and pull them together.


--
Best regards,
Vladimir