Re: [Qemu-block] [PATCH v9 01/13] qcow2: Unallocate unmapped zero clusters if no backing file

2017-05-12 Thread John Snow


On 05/12/2017 12:06 PM, Max Reitz wrote:
> On 2017-05-11 16:56, Eric Blake wrote:
>> [revisiting this older patch version, even though the final version in
>> today's pull request changed somewhat from this approach]
>>
>> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>>> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
 'qemu-img map' already coalesces information about unallocated
 clusters (those with status 0) and pure zero clusters (those
 with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
 qcow2 images with no backing file already report all unallocated
 clusters (in the preallocation sense of clusters with no offset)
 as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
 set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
 BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
 to external users), thanks to generic block layer code in
 bdrv_co_get_block_status().

 So, for an image with no backing file, having bdrv_pwrite_zeroes
 mark clusters as unallocated (defer to backing file) rather than
 reads-as-zero (regardless of backing file) makes no difference
 to normal behavior, but may potentially allow for fewer writes to
 the L2 table of a freshly-created image where the L2 table is
 initially written to all-zeroes (although I don't actually know
 if we skip an L2 update and flush when re-writing the same
 contents as already present).
>>>
>>> I don't get this. Allocating a cluster always involves an L2 update, no
>>> matter whether it was previously unallocated or a zero cluster.
>>
>> On IRC, John, Kevin, and I were discussing the current situation with
>> libvirt NBD storage migration.  When libvirt creates a file on the
>> destination (rather than the user pre-creating it), it currently
>> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
>> (arguably something that could be improved in libvirt, but
>> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
>> bug).
>>
>> Therefore, the use case of doing a mirror job to a v2 image, and having
>> that image become thick even though the source was thin, is happening
>> more than we'd like
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
>> a point that in the common case we ALWAYS want to turn an unallocated
>> cluster into a zero cluster (so that we don't have to audit whether all
>> callers are properly accounting for the case where a backing image is
>> added later), our conversation on IRC today conceded that we may want to
>> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
>> particular callers can use to request that if a cluster already reads as
>> zeroes, the write zero request does NOT have to change it.  Normal guest
>> operations would not use the flag, but mirroring IS a case that would
>> use the flag, so that we can end up with thinner mirrors even to 0.10
>> images.
>>
>> The other consideration is that on 0.10 images, even if we have to
>> allocate, right now our allocation is done by way of failing with
>> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
>> may be worth teaching the qcow2 layer to explicitly handle write zeroes,
>> even on 0.10 images, by allocating a cluster (as needed) but then
>> telling bs->file to write zeroes (punching a hole as appropriate) so
>> that the file is still thin.  In fact, it matches the fact that we
>> already have code that probes whether a qcow2 cluster that reports
>> BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID then probes bs->file to see if
>> there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
>> bdrv_get_block_status.
>>
>> I don't know which one of us will tackle patches along these lines, but
>> wanted to at least capture the gist of the IRC conversation in the
>> archives for a bit more permanence.
> 
> Just short ideas:
> 
> (1) I do consider it a bug if v2 images are created. The BZ wasn't
> closed for a very good reason, but because nobody answered this question:
> 
>> is this really a problem?
> 
> And apparently it is.
> 
> (2) There is a way of creating zero clusters on v2 images, but we've
> never done it (yet): You create a single data cluster containing zeroes
> and then you just point to it whenever you need a zero cluster (until
> its refcount overflows, at which point you allocate another cluster).
> Maybe that helps.
> 
> I'd be really careful with optimizations for v2 images. You have already
> proven to me that my fears of too much complexity are out of proportions
> sometimes, but still. v2 upstream is old legacy and should be treated as
> such, at least IMHO.
> 
> Max
> 
> 

I agree that V2 changes should be limited in nature, but there are
less-fiddly ways to have thin 0.10 images on sparse filesystems. This
way feels a little too clever and mostly likely too intrusive for an old
format.

I'd also think that it'd probably confuse 

[Qemu-block] QEMU seg-fault with intermediate image streaming -- bdrv_reopen() in stream_start()

2017-05-12 Thread Kashyap Chamarthy
Reproducer
--

[Disk image chain: disk1.qcow2 <- b.qcow2 <- c.qcow2]

$ qemu-system-x86_64 -display none -nodefconfig -nodefaults \
-m 512 -device virtio-scsi-pci,id=scsi \
-device virtio-serial-pci  \
-drive driver=qcow2,file.driver=file,file.filename=./disk1.qcow2,id=virtio0 
\
-monitor stdio -qmp unix:./qmp-sock,server,nowait

Create two overlays (I used `qmp-shell`):

(QEMU) blockdev-snapshot-sync device=virtio0 snapshot-file=b.qcow2
(QEMU) blockdev-snapshot-sync device=virtio0 snapshot-file=c.qcow2


[Figure out the (format) 'node-name' of 'b.qcow2', from the output of
QMP `query-named-block-nodes` so that it can be supplied to the 'device'
parameter]

Try to perform intermediate streaming (pull clusters from 'disk1.qcow2'
into 'b.qcow2':

(QEMU) block-stream device=#block832 base=disk1.qcow2


Result
--

QEMU crashes with SIGSEGV:

[...]
Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x5593d8f7 in stream_start (job_id=0x0, bs=0x58646e20, 
base=0x568548c0, backing_file_str=0x5863d710 "disk1.qcow2", speed=0, 
on_error=BLOCKDEV_ON_ERROR_REPORT, 
errp=0x7fffbcf8) at /home/kashyapc/tinker-space/qemu/block/stream.c:283
283 bdrv_reopen(bs, s->bs_flags, NULL);
[...]

* * *

NOTE: Of course, streaming to active layer works.


Stack traces


I've attached the stack traces from GDB to this email.


Version
---

v2.9.0-304-gca7305b


`git blame` seems to point to this commit:

commit a170a91fd3eab6155da39e740381867e80bcc93e
[...]
stream: Use real permissions in streaming block job

The correct permissions are relatively obvious here (and explained in
code comments). For intermediate streaming, we need to reopen the top
node read-write before creating the job now because the permissions
system catches attempts to get the BLK_PERM_WRITE_UNCHANGED permission
on a read-only node.


-- 
/kashyap
(gdb) thread apply all bt full

Thread 4 (Thread 0x7fffc4c8e700 (LWP 730)):
#0  0x7fffdccb4bd0 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x55c83e8f in qemu_cond_wait (cond=0x568b9980, 
mutex=0x56323fc0 ) at 
/home/kashyapc/tinker-space/qemu/util/qemu-thread-posix.c:133
err = 21845
__func__ = "qemu_cond_wait"
#2  0x557a74c0 in qemu_tcg_wait_io_event (cpu=0x56886dc0) at 
/home/kashyapc/tinker-space/qemu/cpus.c:1074
#3  0x557a7d10 in qemu_tcg_rr_cpu_thread_fn (arg=0x56886dc0) at 
/home/kashyapc/tinker-space/qemu/cpus.c:1385
cpu = 0x0
#4  0x7fffdccaf5ca in start_thread () at /lib64/libpthread.so.0
#5  0x7fffdc9e90ed in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7fffd0b01700 (LWP 728)):
#0  0x7fffdc9e3239 in syscall () at /lib64/libc.so.6
#1  0x55c8421d in qemu_futex_wait (f=0x56757184 
, val=4294967295) at 
/home/kashyapc/tinker-space/qemu/include/qemu/futex.h:26
#2  0x55c84320 in qemu_event_wait (ev=0x56757184 
) at 
/home/kashyapc/tinker-space/qemu/util/qemu-thread-posix.c:399
value = 1
#3  0x55c9b7fd in call_rcu_thread (opaque=0x0) at 
/home/kashyapc/tinker-space/qemu/util/rcu.c:249
tries = 0
n = 0
node = 0x7fff941f9c10
#4  0x7fffdccaf5ca in start_thread () at /lib64/libpthread.so.0
#5  0x7fffdc9e90ed in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x77ee0f80 (LWP 724)):
#0  0x5593d8f7 in stream_start (job_id=0x0, bs=0x58646e20, 
base=0x568548c0, backing_file_str=0x5863d710 "disk1.qcow2", speed=0, 
on_error=BLOCKDEV_ON_ERROR_REPORT, errp=0x
7fffbcf8) at /home/kashyapc/tinker-space/qemu/block/stream.c:283
s = 0x0
iter = 0xe5685e050
orig_bs_flags = 8192
---Type  to continue, or q  to quit---
#1  0x558f8acf in qmp_block_stream (has_job_id=false, job_id=0x0, 
device=0x586282f0 "#block830", has_base=true, base=0x5863d710 
"disk1.qcow2", has_base_node=false, base_node=
0x0, has_backing_file=false, backing_file=0x0, has_speed=false, speed=0, 
has_on_error=false, on_error=BLOCKDEV_ON_ERROR_REPORT, errp=0x7fffbda0)
at /home/kashyapc/tinker-space/qemu/blockdev.c:3033
bs = 0x58646e20
iter = 0x568548c0
base_bs = 0x568548c0
aio_context = 0x5683cb40
local_err = 0x5684a230
base_name = 0x5863d710 "disk1.qcow2"
__func__ = "qmp_block_stream"
__PRETTY_FUNCTION__ = "qmp_block_stream"
#2  0x5590f6e8 in qmp_marshal_block_stream (args=0x5689ddd0, 
ret=0x7fffbe90, errp=0x7fffbe88) at qmp-marshal.c:488
err = 0x0
v = 0x5779cd80
arg = 
  {has_job_id = false, job_id = 0x0, device = 0x586282f0 
"#block830", has_base = true, base = 0x5863d710 "disk1.qcow2", 

Re: [Qemu-block] [PATCH 7/7] curl: do not do aio_poll when waiting for a free CURLState

2017-05-12 Thread Jeff Cody
On Wed, May 10, 2017 at 04:32:05PM +0200, Paolo Bonzini wrote:
> Instead, put the CURLAIOCB on a wait list; curl_clean_state will
> wake the corresponding coroutine.
> 
> Because of CURL's callback-based structure, we cannot easily convert
> everything to CoMutex/CoQueue; keeping the QemuMutex is simpler.
> However, CoQueue is a simple wrapper around a linked list, so we can
> use QSIMPLEQ easily to open-code a CoQueue that has a QemuMutex's
> protection instead of a CoMutex's.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/curl.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 80870bd60c..4ccdf63510 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -98,6 +98,8 @@ typedef struct CURLAIOCB {
>  
>  size_t start;
>  size_t end;
> +
> +QSIMPLEQ_ENTRY(CURLAIOCB) next;
>  } CURLAIOCB;
>  
>  typedef struct CURLSocket {
> @@ -133,6 +135,7 @@ typedef struct BDRVCURLState {
>  bool accept_range;
>  AioContext *aio_context;
>  QemuMutex mutex;
> +QSIMPLEQ_HEAD(, CURLAIOCB) free_state_waitq;
>  char *username;
>  char *password;
>  char *proxyusername;
> @@ -532,6 +535,7 @@ static int curl_init_state(BDRVCURLState *s, CURLState 
> *state)
>  /* Called with s->mutex held.  */
>  static void curl_clean_state(CURLState *s)
>  {
> +CURLAIOCB *next;
>  int j;
>  for (j=0; j  assert(!s->acb[j]);
> @@ -548,6 +552,14 @@ static void curl_clean_state(CURLState *s)
>  }
>  
>  s->in_use = 0;
> +
> +next = QSIMPLEQ_FIRST(>s->free_state_waitq);
> +if (next) {
> +QSIMPLEQ_REMOVE_HEAD(>s->free_state_waitq, next);
> +qemu_mutex_unlock(>s->mutex);
> +aio_co_wake(next->co);
> +qemu_mutex_lock(>s->mutex);
> +}
>  }
>  
>  static void curl_parse_filename(const char *filename, QDict *options,
> @@ -744,6 +756,7 @@ static int curl_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  
>  DPRINTF("CURL: Opening %s\n", file);
>  qemu_mutex_init(>mutex);
> +QSIMPLEQ_INIT(>free_state_waitq);
>  s->aio_context = bdrv_get_aio_context(bs);
>  s->url = g_strdup(file);
>  qemu_mutex_lock(>mutex);
> @@ -843,8 +856,9 @@ static void curl_setup_preadv(BlockDriverState *bs, 
> CURLAIOCB *acb)
>  if (state) {
>  break;
>  }
> +QSIMPLEQ_INSERT_TAIL(>free_state_waitq, acb, next);
>  qemu_mutex_unlock(>mutex);
> -aio_poll(bdrv_get_aio_context(bs), true);
> +qemu_coroutine_yield();
>  qemu_mutex_lock(>mutex);
>  }
>  
> -- 
> 2.12.2
>

Reviewed-by: Jeff Cody 



Re: [Qemu-block] [PATCH 6/7] curl: convert readv to coroutines

2017-05-12 Thread Jeff Cody
On Wed, May 10, 2017 at 04:32:04PM +0200, Paolo Bonzini wrote:
> This is pretty simple.  The bottom half goes away because, unlike
> bdrv_aio_readv, coroutine-based read can return immediately without
> yielding.  However, for simplicity I kept the former bottom half
> handler in a separate function.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/curl.c | 94 
> 
>  1 file changed, 38 insertions(+), 56 deletions(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 3e288f2bc7..80870bd60c 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -76,10 +76,6 @@ static CURLMcode __curl_multi_socket_action(CURLM 
> *multi_handle,
>  #define CURL_TIMEOUT_DEFAULT 5
>  #define CURL_TIMEOUT_MAX 1
>  
> -#define FIND_RET_NONE   0
> -#define FIND_RET_OK 1
> -#define FIND_RET_WAIT   2
> -
>  #define CURL_BLOCK_OPT_URL   "url"
>  #define CURL_BLOCK_OPT_READAHEAD "readahead"
>  #define CURL_BLOCK_OPT_SSLVERIFY "sslverify"
> @@ -93,11 +89,12 @@ static CURLMcode __curl_multi_socket_action(CURLM 
> *multi_handle,
>  struct BDRVCURLState;
>  
>  typedef struct CURLAIOCB {
> -BlockAIOCB common;
> +Coroutine *co;
>  QEMUIOVector *qiov;
>  
>  uint64_t offset;
>  uint64_t bytes;
> +int ret;
>  
>  size_t start;
>  size_t end;
> @@ -268,11 +265,11 @@ static size_t curl_read_cb(void *ptr, size_t size, 
> size_t nmemb, void *opaque)
>request_length - offset);
>  }
>  
> +acb->ret = 0;
> +s->acb[i] = NULL;
>  qemu_mutex_unlock(>s->mutex);
> -acb->common.cb(acb->common.opaque, 0);
> +aio_co_wake(acb->co);
>  qemu_mutex_lock(>s->mutex);
> -qemu_aio_unref(acb);
> -s->acb[i] = NULL;
>  }
>  }
>  
> @@ -282,8 +279,8 @@ read_end:
>  }
>  
>  /* Called with s->mutex held.  */
> -static int curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> - CURLAIOCB *acb)
> +static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> +  CURLAIOCB *acb)
>  {
>  int i;
>  uint64_t end = start + len;
> @@ -312,7 +309,8 @@ static int curl_find_buf(BDRVCURLState *s, uint64_t 
> start, uint64_t len,
>  if (clamped_len < len) {
>  qemu_iovec_memset(acb->qiov, clamped_len, 0, len - 
> clamped_len);
>  }
> -return FIND_RET_OK;
> +acb->ret = 0;
> +return true;
>  }
>  
>  // Wait for unfinished chunks
> @@ -330,13 +328,13 @@ static int curl_find_buf(BDRVCURLState *s, uint64_t 
> start, uint64_t len,
>  for (j=0; j  if (!state->acb[j]) {
>  state->acb[j] = acb;
> -return FIND_RET_WAIT;
> +return true;
>  }
>  }
>  }
>  }
>  
> -return FIND_RET_NONE;
> +return false;
>  }
>  
>  /* Called with s->mutex held.  */
> @@ -381,11 +379,11 @@ static void curl_multi_check_completion(BDRVCURLState 
> *s)
>  continue;
>  }
>  
> +acb->ret = -EIO;
> +state->acb[i] = NULL;
>  qemu_mutex_unlock(>mutex);
> -acb->common.cb(acb->common.opaque, -EIO);
> +aio_co_wake(acb->co);
>  qemu_mutex_lock(>mutex);
> -qemu_aio_unref(acb);
> -state->acb[i] = NULL;
>  }
>  }
>  
> @@ -821,19 +819,11 @@ out_noclean:
>  return -EINVAL;
>  }
>  
> -static const AIOCBInfo curl_aiocb_info = {
> -.aiocb_size = sizeof(CURLAIOCB),
> -};
> -
> -
> -static void curl_readv_bh_cb(void *p)
> +static void curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>  {
>  CURLState *state;
>  int running;
> -int ret = -EINPROGRESS;
>  
> -CURLAIOCB *acb = p;
> -BlockDriverState *bs = acb->common.bs;
>  BDRVCURLState *s = bs->opaque;
>  
>  uint64_t start = acb->offset;
> @@ -843,14 +833,8 @@ static void curl_readv_bh_cb(void *p)
>  
>  // In case we have the requested data already (e.g. read-ahead),
>  // we can just call the callback and be done.
> -switch (curl_find_buf(s, start, acb->bytes, acb)) {
> -case FIND_RET_OK:
> -ret = 0;
> -goto out;
> -case FIND_RET_WAIT:
> -goto out;
> -default:
> -break;
> +if (curl_find_buf(s, start, acb->bytes, acb)) {
> +goto out;
>  }
>  
>  // No cache found, so let's start a new request
> @@ -866,7 +850,7 @@ static void curl_readv_bh_cb(void *p)
>  
>  if (curl_init_state(s, state) < 0) {
>  curl_clean_state(state);
> -ret = -EIO;
> +acb->ret = -EIO;
>  

Re: [Qemu-block] [PATCH 5/7] curl: convert CURLAIOCB to byte values

2017-05-12 Thread Jeff Cody
On Wed, May 10, 2017 at 04:32:03PM +0200, Paolo Bonzini wrote:
> This is in preparation for the conversion from bdrv_aio_readv to
> bdrv_co_preadv, and it also requires changing some of the size_t values
> to uint64_t.  This was broken before for disks > 2TB, but now it would
> break at 4GB.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/curl.c | 44 ++--
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 4b4d5a2389..3e288f2bc7 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -96,8 +96,8 @@ typedef struct CURLAIOCB {
>  BlockAIOCB common;
>  QEMUIOVector *qiov;
>  
> -int64_t sector_num;
> -int nb_sectors;
> +uint64_t offset;
> +uint64_t bytes;
>  
>  size_t start;
>  size_t end;
> @@ -115,7 +115,7 @@ typedef struct CURLState
>  CURL *curl;
>  QLIST_HEAD(, CURLSocket) sockets;
>  char *orig_buf;
> -size_t buf_start;
> +uint64_t buf_start;
>  size_t buf_off;
>  size_t buf_len;
>  char range[128];
> @@ -126,7 +126,7 @@ typedef struct CURLState
>  typedef struct BDRVCURLState {
>  CURLM *multi;
>  QEMUTimer timer;
> -size_t len;
> +uint64_t len;
>  CURLState states[CURL_NUM_STATES];
>  char *url;
>  size_t readahead_size;
> @@ -257,7 +257,7 @@ static size_t curl_read_cb(void *ptr, size_t size, size_t 
> nmemb, void *opaque)
>  continue;
>  
>  if ((s->buf_off >= acb->end)) {
> -size_t request_length = acb->nb_sectors * BDRV_SECTOR_SIZE;
> +size_t request_length = acb->bytes;
>  
>  qemu_iovec_from_buf(acb->qiov, 0, s->orig_buf + acb->start,
>  acb->end - acb->start);
> @@ -282,18 +282,18 @@ read_end:
>  }
>  
>  /* Called with s->mutex held.  */
> -static int curl_find_buf(BDRVCURLState *s, size_t start, size_t len,
> +static int curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
>   CURLAIOCB *acb)
>  {
>  int i;
> -size_t end = start + len;
> -size_t clamped_end = MIN(end, s->len);
> -size_t clamped_len = clamped_end - start;
> +uint64_t end = start + len;
> +uint64_t clamped_end = MIN(end, s->len);
> +uint64_t clamped_len = clamped_end - start;
>  
>  for (i=0; i  CURLState *state = >states[i];
> -size_t buf_end = (state->buf_start + state->buf_off);
> -size_t buf_fend = (state->buf_start + state->buf_len);
> +uint64_t buf_end = (state->buf_start + state->buf_off);
> +uint64_t buf_fend = (state->buf_start + state->buf_len);
>  
>  if (!state->orig_buf)
>  continue;
> @@ -788,7 +788,7 @@ static int curl_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  #endif
>  
> -s->len = (size_t)d;
> +s->len = d;
>  
>  if ((!strncasecmp(s->url, "http://;, strlen("http://;))
>  || !strncasecmp(s->url, "https://;, strlen("https://;)))
> @@ -797,7 +797,7 @@ static int curl_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  "Server does not support 'range' (byte ranges).");
>  goto out;
>  }
> -DPRINTF("CURL: Size = %zd\n", s->len);
> +DPRINTF("CURL: Size = %" PRIu64 "\n", s->len);
>  
>  qemu_mutex_lock(>mutex);
>  curl_clean_state(state);
> @@ -836,14 +836,14 @@ static void curl_readv_bh_cb(void *p)
>  BlockDriverState *bs = acb->common.bs;
>  BDRVCURLState *s = bs->opaque;
>  
> -size_t start = acb->sector_num * BDRV_SECTOR_SIZE;
> -size_t end;
> +uint64_t start = acb->offset;
> +uint64_t end;
>  
>  qemu_mutex_lock(>mutex);
>  
>  // In case we have the requested data already (e.g. read-ahead),
>  // we can just call the callback and be done.
> -switch (curl_find_buf(s, start, acb->nb_sectors * BDRV_SECTOR_SIZE, 
> acb)) {
> +switch (curl_find_buf(s, start, acb->bytes, acb)) {
>  case FIND_RET_OK:
>  ret = 0;
>  goto out;
> @@ -871,7 +871,7 @@ static void curl_readv_bh_cb(void *p)
>  }
>  
>  acb->start = 0;
> -acb->end = MIN(acb->nb_sectors * BDRV_SECTOR_SIZE, s->len - start);
> +acb->end = MIN(acb->bytes, s->len - start);
>  
>  state->buf_off = 0;
>  g_free(state->orig_buf);
> @@ -886,9 +886,9 @@ static void curl_readv_bh_cb(void *p)
>  }
>  state->acb[0] = acb;
>  
> -snprintf(state->range, 127, "%zd-%zd", start, end);
> -DPRINTF("CURL (AIO): Reading %llu at %zd (%s)\n",
> -(acb->nb_sectors * BDRV_SECTOR_SIZE), start, state->range);
> +snprintf(state->range, 127, "%" PRIu64 "-%" PRIu64, start, end);
> +DPRINTF("CURL (AIO): Reading %" PRIu64 " at %" PRIu64 " (%s)\n",
> +acb->bytes, start, state->range);
>  curl_easy_setopt(state->curl, CURLOPT_RANGE, state->range);
>  
>  curl_multi_add_handle(s->multi, state->curl);
> @@ 

Re: [Qemu-block] [PATCH 4/7] curl: split curl_find_state/curl_init_state

2017-05-12 Thread Jeff Cody
On Wed, May 10, 2017 at 04:32:02PM +0200, Paolo Bonzini wrote:
> If curl_easy_init fails, a CURLState is left with s->in_use = 1.  Split
> curl_init_state in two, so that we can distinguish the two failures and
> call curl_clean_state if needed.
> 
> While at it, simplify curl_find_state, removing a dummy loop.  The
> aio_poll loop is moved to the sole caller that needs it.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/curl.c | 52 ++--
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index b18e79bf54..4b4d5a2389 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -455,34 +455,27 @@ static void curl_multi_timeout_do(void *arg)
>  }
>  
>  /* Called with s->mutex held.  */
> -static CURLState *curl_init_state(BlockDriverState *bs, BDRVCURLState *s)
> +static CURLState *curl_find_state(BDRVCURLState *s)
>  {
>  CURLState *state = NULL;
> -int i, j;
> -
> -do {
> -for (i=0; i -for (j=0; j -if (s->states[i].acb[j])
> -continue;
> -if (s->states[i].in_use)
> -continue;
> +int i;
>  
> +for (i=0; i +if (!s->states[i].in_use) {
>  state = >states[i];
>  state->in_use = 1;
>  break;
>  }
> -if (!state) {
> -qemu_mutex_unlock(>mutex);
> -aio_poll(bdrv_get_aio_context(bs), true);
> -qemu_mutex_lock(>mutex);
> -}
> -} while(!state);
> +}
> +return state;
> +}
>  
> +static int curl_init_state(BDRVCURLState *s, CURLState *state)
> +{
>  if (!state->curl) {
>  state->curl = curl_easy_init();
>  if (!state->curl) {
> -return NULL;
> +return -EIO;
>  }
>  curl_easy_setopt(state->curl, CURLOPT_URL, s->url);
>  curl_easy_setopt(state->curl, CURLOPT_SSL_VERIFYPEER,
> @@ -535,7 +528,7 @@ static CURLState *curl_init_state(BlockDriverState *bs, 
> BDRVCURLState *s)
>  QLIST_INIT(>sockets);
>  state->s = s;
>  
> -return state;
> +return 0;
>  }
>  
>  /* Called with s->mutex held.  */
> @@ -756,13 +749,18 @@ static int curl_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  s->aio_context = bdrv_get_aio_context(bs);
>  s->url = g_strdup(file);
>  qemu_mutex_lock(>mutex);
> -state = curl_init_state(bs, s);
> +state = curl_find_state(s);
>  qemu_mutex_unlock(>mutex);
> -if (!state)
> +if (!state) {
>  goto out_noclean;
> +}
>  
>  // Get file size
>  
> +if (curl_init_state(s, state) < 0) {
> +goto out;
> +}
> +
>  s->accept_range = false;
>  curl_easy_setopt(state->curl, CURLOPT_NOBODY, 1);
>  curl_easy_setopt(state->curl, CURLOPT_HEADERFUNCTION,
> @@ -856,8 +854,18 @@ static void curl_readv_bh_cb(void *p)
>  }
>  
>  // No cache found, so let's start a new request
> -state = curl_init_state(acb->common.bs, s);
> -if (!state) {
> +for (;;) {
> +state = curl_find_state(s);
> +if (state) {
> +break;
> +}
> +qemu_mutex_unlock(>mutex);
> +aio_poll(bdrv_get_aio_context(bs), true);
> +qemu_mutex_lock(>mutex);
> +}
> +
> +if (curl_init_state(s, state) < 0) {
> +curl_clean_state(state);

For some reason, I initially thought this might cause problems with the
assert in curl_clean_state(), but that isn't the case.

Reviewed-by: Jeff Cody 

>  ret = -EIO;
>  goto out;
>  }
> -- 
> 2.12.2
> 
> 



Re: [Qemu-block] [PATCH v4] qemu-img: Check for backing image if specified during create

2017-05-12 Thread John Snow


On 05/12/2017 03:46 PM, Eric Blake wrote:
> On 05/12/2017 01:07 PM, Max Reitz wrote:
>> On 2017-05-11 20:27, John Snow wrote:
>>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1213786
>>>
>>> Or, rather, force the open of a backing image if one was specified
>>> for creation. Using a similar -unsafe option as rebase, allow qemu-img
>>> to ignore the backing file validation if possible.
>>>
> 
>>> +++ b/block.c
>>> @@ -4275,37 +4275,37 @@ void bdrv_img_create(const char *filename, const 
>>> char *fmt,
>>>  // The size for the image must always be specified, with one exception:
>>>  // If we are using a backing file, we can obtain the size from there
>>>  size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
>>> -if (size == -1) {
>>
>> "Hang on, why should this be -1 when the defval is 0? Where does the -1
>> come from?"
>> "..."
>> "Oh, the option exists and is set to -1? Why is that?"
>> "..."
>> "Oh, because this function always sets it itself, and because @img_size
>> is set to (uint64_t)-1."
> 
> I had pretty much the same conversation on my v1 review.
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01097.html
> 
>>
>> First, I won't start with how signed integer overflow is
>> implementation-defined in C because I hope you have thrashed that out
>> with Eric (I hope that "to thrash out" is a good translation for
>> "auskaspern" (lit. "to buffoon out").).
> 
> Sounds like a reasonable choice of words, even if I don't speak the
> counterpart language to validate your translation.
> 
> (uint64_t)-1 is well-defined in C (so I think we're just fine here). But
> (int64_t)UINT64_MAX is where signed integer overflow does indeed throw
> wrinkles at you.
> 
> I seem to recall that qemu has chosen to use compiler flags and/or
> assumptions that we are using 2s-complement arithmetic with sane
> behavior (that is, tighter behavior than the bare minimum that C
> requires), because it was easier than auditing our code for strict C
> compliance on border cases of conversions from unsigned to signed that
> trigger undefined behavior.  But again, I don't think it affects this
> patch (where our conversion is only from signed to unsigned, and that is
> well-defined behavior).
> 
> 
>>
>> Second, well, at least we should put -1 as the default value here, then.
> 
> Indeed, now that two reviewers have tripped on it,
> qemu_opt_get_size(,,-1) would be nicer.
> 
>>
>> Not strictly your fault or something that you need to fix, but it is
>> just a single line in the vicinity...
>>
>> Let me know if you want to address this, for now I'll leave a
>>
>> Reviewed-by: Max Reitz 
>>
>> here if you don't want to.
> 
> I'm okay whether you want to squash that fix into this patch, or whether
> you do it as a separate followup patch.
> 

I had considered the issue separate; but you're welcome to either write
a patch or squish it into this one, I'm not going to be picky.

--js



Re: [Qemu-block] [PATCH v4] qemu-img: Check for backing image if specified during create

2017-05-12 Thread Eric Blake
On 05/12/2017 01:07 PM, Max Reitz wrote:
> On 2017-05-11 20:27, John Snow wrote:
>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1213786
>>
>> Or, rather, force the open of a backing image if one was specified
>> for creation. Using a similar -unsafe option as rebase, allow qemu-img
>> to ignore the backing file validation if possible.
>>

>> +++ b/block.c
>> @@ -4275,37 +4275,37 @@ void bdrv_img_create(const char *filename, const 
>> char *fmt,
>>  // The size for the image must always be specified, with one exception:
>>  // If we are using a backing file, we can obtain the size from there
>>  size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
>> -if (size == -1) {
> 
> "Hang on, why should this be -1 when the defval is 0? Where does the -1
> come from?"
> "..."
> "Oh, the option exists and is set to -1? Why is that?"
> "..."
> "Oh, because this function always sets it itself, and because @img_size
> is set to (uint64_t)-1."

I had pretty much the same conversation on my v1 review.
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01097.html

> 
> First, I won't start with how signed integer overflow is
> implementation-defined in C because I hope you have thrashed that out
> with Eric (I hope that "to thrash out" is a good translation for
> "auskaspern" (lit. "to buffoon out").).

Sounds like a reasonable choice of words, even if I don't speak the
counterpart language to validate your translation.

(uint64_t)-1 is well-defined in C (so I think we're just fine here). But
(int64_t)UINT64_MAX is where signed integer overflow does indeed throw
wrinkles at you.

I seem to recall that qemu has chosen to use compiler flags and/or
assumptions that we are using 2s-complement arithmetic with sane
behavior (that is, tighter behavior than the bare minimum that C
requires), because it was easier than auditing our code for strict C
compliance on border cases of conversions from unsigned to signed that
trigger undefined behavior.  But again, I don't think it affects this
patch (where our conversion is only from signed to unsigned, and that is
well-defined behavior).


> 
> Second, well, at least we should put -1 as the default value here, then.

Indeed, now that two reviewers have tripped on it,
qemu_opt_get_size(,,-1) would be nicer.

> 
> Not strictly your fault or something that you need to fix, but it is
> just a single line in the vicinity...
> 
> Let me know if you want to address this, for now I'll leave a
> 
> Reviewed-by: Max Reitz 
> 
> here if you don't want to.

I'm okay whether you want to squash that fix into this patch, or whether
you do it as a separate followup patch.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v4] qemu-img: Check for backing image if specified during create

2017-05-12 Thread Max Reitz
On 2017-05-11 20:27, John Snow wrote:
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1213786
> 
> Or, rather, force the open of a backing image if one was specified
> for creation. Using a similar -unsafe option as rebase, allow qemu-img
> to ignore the backing file validation if possible.
> 
> It may not always be possible, as in the existing case when a filesize
> for the new image was not specified.
> 
> This is accomplished by shifting around the conditionals in
> bdrv_img_create, such that a backing file is always opened unless we
> provide BDRV_O_NO_BACKING. qemu-img is adjusted to pass this new flag
> when -u is provided to create.
> 
> Sorry for the heinous looking diffstat, but it's mostly whitespace.
> 
> Reported-by: Yi Sun 
> Signed-off-by: John Snow 
> Reviewed-by: Eric Blake 
> ---
> 
> v4: Actually do the things Eric told me to.
> v3: Rebased
> v2: Rebased for 2.10
> Corrected some of my less cromulent grammar
> 
> 
>  block.c| 73 
> +++---
>  qemu-img-cmds.hx   |  4 +--
>  qemu-img.c | 16 ++
>  tests/qemu-iotests/082 |  4 +--
>  tests/qemu-iotests/082.out |  4 +--
>  5 files changed, 54 insertions(+), 47 deletions(-)
> 
> diff --git a/block.c b/block.c
> index a45b9b5..3c3df54 100644
> --- a/block.c
> +++ b/block.c
> @@ -4275,37 +4275,37 @@ void bdrv_img_create(const char *filename, const char 
> *fmt,
>  // The size for the image must always be specified, with one exception:
>  // If we are using a backing file, we can obtain the size from there
>  size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
> -if (size == -1) {

"Hang on, why should this be -1 when the defval is 0? Where does the -1
come from?"
"..."
"Oh, the option exists and is set to -1? Why is that?"
"..."
"Oh, because this function always sets it itself, and because @img_size
is set to (uint64_t)-1."

First, I won't start with how signed integer overflow is
implementation-defined in C because I hope you have thrashed that out
with Eric (I hope that "to thrash out" is a good translation for
"auskaspern" (lit. "to buffoon out").).

Second, well, at least we should put -1 as the default value here, then.

Not strictly your fault or something that you need to fix, but it is
just a single line in the vicinity...

Let me know if you want to address this, for now I'll leave a

Reviewed-by: Max Reitz 

here if you don't want to.

Max

> -if (backing_file) {
> -BlockDriverState *bs;
> -char *full_backing = g_new0(char, PATH_MAX);
> -int64_t size;
> -int back_flags;
> -QDict *backing_options = NULL;
> -
> -bdrv_get_full_backing_filename_from_filename(filename, 
> backing_file,
> - full_backing, 
> PATH_MAX,
> - _err);
> -if (local_err) {
> -g_free(full_backing);
> -goto out;
> -}
> -
> -/* backing files always opened read-only */
> -back_flags = flags;
> -back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | 
> BDRV_O_NO_BACKING);
> -
> -if (backing_fmt) {
> -backing_options = qdict_new();
> -qdict_put_str(backing_options, "driver", backing_fmt);
> -}
> -
> -bs = bdrv_open(full_backing, NULL, backing_options, back_flags,
> -   _err);
> +if (backing_file && !(flags & BDRV_O_NO_BACKING)) {
> +BlockDriverState *bs;
> +char *full_backing = g_new0(char, PATH_MAX);
> +int back_flags;
> +QDict *backing_options = NULL;
> +
> +bdrv_get_full_backing_filename_from_filename(filename, backing_file,
> + full_backing, PATH_MAX,
> + _err);
> +if (local_err) {
>  g_free(full_backing);
> -if (!bs) {
> -goto out;
> -}
> +goto out;
> +}
> +
> +/* backing files always opened read-only */
> +back_flags = flags;
> +back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);
> +
> +if (backing_fmt) {
> +backing_options = qdict_new();
> +qdict_put_str(backing_options, "driver", backing_fmt);
> +}
> +
> +bs = bdrv_open(full_backing, NULL, backing_options, back_flags,
> +   _err);
> +g_free(full_backing);
> +if (!bs) {
> +goto out;
> +}
> +
> +if (size == -1) {
>  size = bdrv_getlength(bs);
>  if (size < 0) {
>  error_setg_errno(errp, -size, "Could not get size of '%s'",
> @@ -4313,14 +4313,15 @@ void 

Re: [Qemu-block] [PATCH v8 0/4] Improve convert and dd commands

2017-05-12 Thread Max Reitz
On 2017-05-09 11:48, Daniel P. Berrange wrote:
> Update to
> 
>   v1: https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05699.html
>   v2: https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00728.html
>   v3: https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg04391.html
>   v4: https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02153.html
>   v5: https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04109.html
>   v6: https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg00215.html
> 
> This series is in response to Max pointing out that you cannot
> use 'convert' for an encrypted target image.
> 
> The 'convert' and 'dd' commands need to first create the image
> and then open it. The bdrv_create() method takes a set of options
> for creating the image, which let us provide a key-secret for the
> encryption key. When the commands then open the new image, they
> don't provide any options, so the image is unable to be opened
> due to lack of encryption key. It is also not possible to use
> the --image-opts argument to provide structured options in the
> target image name - it must be a plain filename to satisfy the
> bdrv_create() API contract.
> 
> This series addresses these problems to some extent
> 
>  - Adds a new --target-image-opts flag which is used to say
>that the target filename is using structured options.
>It is *only* permitted to use this when -n is also set.
>ie the target image must be pre-created so convert/dd
>don't need to run bdrv_create().
> 
>  - When --target-image-opts is not used, add special case
>code that identifies options passed to bdrv_create()
>named "*key-secret" and adds them to the options used
>to open the new image
> 
> In future it is desirable to make --target-image-opts work even when -n is
> *not* given. This requires considerable work to create a new bdrv_create()
> API impl.
> 
> The first patch fixes a bug in the 'dd' command while the second adds support
> for the missing '--object' arg to 'dd', allowing it to reference secrets when
> opening files.  The last two patches implement the new features described 
> above
> for the 'convert' command.
> 
> NB v8 is based against git master once more, since the img_convert changes
> previously in block-next have now merged.

Changes from the previous version look good, but unfortunately here's
the "but": The image locking series has brought even more changes to
qemu-img. :-(

I tried resolving them, but the following backport-diff didn't look like
I should proceed:

001/4:[] [-C] 'qemu-img: add support for --object with 'dd' command'
002/4:[0004] [FC] 'qemu-img: fix --image-opts usage with dd command'
003/4:[0015] [FC] 'qemu-img: introduce --target-image-opts for 'convert'
command'
004/4:[0024] [FC] 'qemu-img: copy *key-secret opts when opening newly
created files'

The fun is increased by the fact that the locking series has
(inadvertently) removed the -B documentation from convert, so there is
another conflict looming in the future...

(Or you just inadvertently add it back. Then we'd have resolved the
issue altogether...)

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PULL 05/58] qemu-img: Update documentation for -U

2017-05-12 Thread Max Reitz
On 2017-05-11 16:32, Kevin Wolf wrote:
> From: Fam Zheng 
> 
> Signed-off-by: Fam Zheng 
> Signed-off-by: Kevin Wolf 
> ---
>  qemu-img-cmds.hx | 36 ++--
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> index bf4ce59..e5bc28f 100644
> --- a/qemu-img-cmds.hx
> +++ b/qemu-img-cmds.hx

[...]

>  DEF("convert", img_convert,
> -"convert [--object objectdef] [--image-opts] [-c] [-p] [-q] [-n] [-f 
> fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] 
> [-s snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] [-m 
> num_coroutines] [-W] filename [filename2 [...]] output_filename")
> +"convert [--object objectdef] [--image-opts] [-U] [-c] [-p] [-q] [-n] 
> [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-o options] [-s 
> snapshot_id_or_name] [-l snapshot_param] [-S sparse_size] [-m num_coroutines] 
> [-W] filename [filename2 [...]] output_filename")
>  STEXI
> -@item convert [--object @var{objectdef}] [--image-opts] [-c] [-p] [-q] [-n] 
> [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] 
> [-B @var{backing_file}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l 
> @var{snapshot_param}] [-S @var{sparse_size}] [-m @var{num_coroutines}] [-W] 
> @var{filename} [@var{filename2} [...]] @var{output_filename}
> +@item convert [--object @var{objectdef}] [--image-opts] [-U] [-c] [-p] [-q] 
> [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O 
> @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l 
> @var{snapshot_param}] [-S @var{sparse_size}] [-m @var{num_coroutines}] [-W] 
> @var{filename} [@var{filename2} [...]] @var{output_filename}

Soo... Who gets to add the -B documentation back?

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] qcow2: remove extra local_error variable

2017-05-12 Thread Max Reitz
On 2017-05-11 17:03, Alberto Garcia wrote:
> Commit d7086422b1c1e75e320519cfe26176db6ec97a37 added a local_err
> variable global to the qcow2_amend_options() function, so there's no
> need to have this other one.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Thanks, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v9 01/13] qcow2: Unallocate unmapped zero clusters if no backing file

2017-05-12 Thread Max Reitz
On 2017-05-11 16:56, Eric Blake wrote:
> [revisiting this older patch version, even though the final version in
> today's pull request changed somewhat from this approach]
> 
> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>> Am 11.04.2017 um 03:17 hat Eric Blake geschrieben:
>>> 'qemu-img map' already coalesces information about unallocated
>>> clusters (those with status 0) and pure zero clusters (those
>>> with status BDRV_BLOCK_ZERO and no offset).  Furthermore, all
>>> qcow2 images with no backing file already report all unallocated
>>> clusters (in the preallocation sense of clusters with no offset)
>>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>>> to external users), thanks to generic block layer code in
>>> bdrv_co_get_block_status().
>>>
>>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>>> mark clusters as unallocated (defer to backing file) rather than
>>> reads-as-zero (regardless of backing file) makes no difference
>>> to normal behavior, but may potentially allow for fewer writes to
>>> the L2 table of a freshly-created image where the L2 table is
>>> initially written to all-zeroes (although I don't actually know
>>> if we skip an L2 update and flush when re-writing the same
>>> contents as already present).
>>
>> I don't get this. Allocating a cluster always involves an L2 update, no
>> matter whether it was previously unallocated or a zero cluster.
> 
> On IRC, John, Kevin, and I were discussing the current situation with
> libvirt NBD storage migration.  When libvirt creates a file on the
> destination (rather than the user pre-creating it), it currently
> defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
> (arguably something that could be improved in libvirt, but
> https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
> bug).
> 
> Therefore, the use case of doing a mirror job to a v2 image, and having
> that image become thick even though the source was thin, is happening
> more than we'd like
> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749).  While Kevin had
> a point that in the common case we ALWAYS want to turn an unallocated
> cluster into a zero cluster (so that we don't have to audit whether all
> callers are properly accounting for the case where a backing image is
> added later), our conversation on IRC today conceded that we may want to
> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
> particular callers can use to request that if a cluster already reads as
> zeroes, the write zero request does NOT have to change it.  Normal guest
> operations would not use the flag, but mirroring IS a case that would
> use the flag, so that we can end up with thinner mirrors even to 0.10
> images.
> 
> The other consideration is that on 0.10 images, even if we have to
> allocate, right now our allocation is done by way of failing with
> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.  It
> may be worth teaching the qcow2 layer to explicitly handle write zeroes,
> even on 0.10 images, by allocating a cluster (as needed) but then
> telling bs->file to write zeroes (punching a hole as appropriate) so
> that the file is still thin.  In fact, it matches the fact that we
> already have code that probes whether a qcow2 cluster that reports
> BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID then probes bs->file to see if
> there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
> bdrv_get_block_status.
> 
> I don't know which one of us will tackle patches along these lines, but
> wanted to at least capture the gist of the IRC conversation in the
> archives for a bit more permanence.

Just short ideas:

(1) I do consider it a bug if v2 images are created. The BZ wasn't
closed for a very good reason, but because nobody answered this question:

> is this really a problem?

And apparently it is.

(2) There is a way of creating zero clusters on v2 images, but we've
never done it (yet): You create a single data cluster containing zeroes
and then you just point to it whenever you need a zero cluster (until
its refcount overflows, at which point you allocate another cluster).
Maybe that helps.

I'd be really careful with optimizations for v2 images. You have already
proven to me that my fears of too much complexity are out of proportions
sometimes, but still. v2 upstream is old legacy and should be treated as
such, at least IMHO.

Max




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH 3/7] curl: avoid recursive locking of BDRVCURLState mutex

2017-05-12 Thread Paolo Bonzini


On 11/05/2017 22:56, Jeff Cody wrote:
> On Wed, May 10, 2017 at 04:32:01PM +0200, Paolo Bonzini wrote:
>> The curl driver has a ugly hack where, if it cannot find an empty CURLState,
>> it just uses aio_poll to wait for one to be empty.  This is probably
>> buggy when used together with dataplane, and the simplest way to fix it
>> is to use coroutines instead.
>>
>> A more immediate effect of the bug however is that it can cause a
>> recursive call to curl_readv_bh_cb and recursively taking the
>> BDRVCURLState mutex.  This causes a deadlock.
>>
>> The fix is to unlock the mutex around aio_poll, but for cleanliness we
>> should also take the mutex around all calls to curl_init_state, even if
>> reaching the unlock/lock pair is impossible.  The same is true for
>> curl_clean_state.
>>
>> Reported-by: Richard W.M. Jones 
>> Cc: jc...@redhat.com
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  block/curl.c | 13 -
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/curl.c b/block/curl.c
>> index 9a00fdc28e..b18e79bf54 100644
>> --- a/block/curl.c
>> +++ b/block/curl.c
>> @@ -281,6 +281,7 @@ read_end:
>>  return size * nmemb;
>>  }
>>  
>> +/* Called with s->mutex held.  */
>>  static int curl_find_buf(BDRVCURLState *s, size_t start, size_t len,
>>   CURLAIOCB *acb)
>>  {
>> @@ -453,6 +454,7 @@ static void curl_multi_timeout_do(void *arg)
>>  #endif
>>  }
>>  
>> +/* Called with s->mutex held.  */
>>  static CURLState *curl_init_state(BlockDriverState *bs, BDRVCURLState *s)
>>  {
>>  CURLState *state = NULL;
>> @@ -471,7 +473,9 @@ static CURLState *curl_init_state(BlockDriverState *bs, 
>> BDRVCURLState *s)
>>  break;
>>  }
>>  if (!state) {
>> +qemu_mutex_unlock(>mutex);
>>  aio_poll(bdrv_get_aio_context(bs), true);
>> +qemu_mutex_lock(>mutex);
>>  }
>>  } while(!state);
>>  
>> @@ -534,6 +538,7 @@ static CURLState *curl_init_state(BlockDriverState *bs, 
>> BDRVCURLState *s)
>>  return state;
>>  }
>>  
>> +/* Called with s->mutex held.  */
>>  static void curl_clean_state(CURLState *s)
>>  {
>>  int j;
>> @@ -565,6 +570,7 @@ static void curl_detach_aio_context(BlockDriverState *bs)
>>  BDRVCURLState *s = bs->opaque;
>>  int i;
>>  
>> +qemu_mutex_lock(>mutex);
>>  for (i = 0; i < CURL_NUM_STATES; i++) {
>>  if (s->states[i].in_use) {
>>  curl_clean_state(>states[i]);
>> @@ -580,6 +586,7 @@ static void curl_detach_aio_context(BlockDriverState *bs)
>>  curl_multi_cleanup(s->multi);
>>  s->multi = NULL;
>>  }
>> +qemu_mutex_unlock(>mutex);
>>  
>>  timer_del(>timer);
>>  }
>> @@ -745,9 +752,12 @@ static int curl_open(BlockDriverState *bs, QDict 
>> *options, int flags,
>>  }
>>  
>>  DPRINTF("CURL: Opening %s\n", file);
>> +qemu_mutex_init(>mutex);
> 
> This mutex init is now done above possible returns on error, so we should
> call qemu_mutex_destroy() on errors after this point.

Ok, I'll wait for you to complete the review and send v3.

Paolo



Re: [Qemu-block] [PULL 00/58] Block layer patches

2017-05-12 Thread Stefan Hajnoczi
On Thu, May 11, 2017 at 04:32:03PM +0200, Kevin Wolf wrote:
> The following changes since commit 76d20ea0f1b26ebd5da2f5fb2fdf3250cde887bb:
> 
>   Merge remote-tracking branch 'armbru/tags/pull-qapi-2017-05-04-v3' into 
> staging (2017-05-09 15:49:14 -0400)
> 
> are available in the git repository at:
> 
> 
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
> 
> for you to fetch changes up to d541e201bd3ad888f02abeddf0e14f7b0c126529:
> 
>   Merge remote-tracking branch 'mreitz/tags/pull-block-2017-05-11' into 
> queue-block (2017-05-11 14:34:56 +0200)
> 
> 
> 
> Block layer patches
> 
> 
> Anton Nefedov (1):
>   qemu-img: wait for convert coroutines to complete
> 
> Christoph Hellwig (1):
>   nvme: Implement Write Zeroes
> 
> Eric Blake (21):
>   qemu-io: Improve alignment checks
>   qemu-io: Switch 'alloc' command to byte-based length
>   qemu-io: Switch 'map' output to byte-based reporting
>   blkdebug: Sanity check block layer guarantees
>   blkdebug: Refactor error injection
>   blkdebug: Add pass-through write_zero and discard support
>   blkdebug: Simplify override logic
>   blkdebug: Add ability to override unmap geometries
>   tests: Add coverage for recent block geometry fixes
>   qcow2: Nicer variable names in qcow2_update_snapshot_refcount()
>   qcow2: Use consistent switch indentation
>   block: Update comments on BDRV_BLOCK_* meanings
>   qcow2: Correctly report status of preallocated zero clusters
>   qcow2: Name typedef for cluster type
>   qcow2: Make distinction between zero cluster types obvious
>   qcow2: Optimize zero_single_l2() to minimize L2 churn
>   iotests: Improve _filter_qemu_img_map
>   iotests: Add test 179 to cover write zeroes with unmap
>   qcow2: Optimize write zero of unaligned tail cluster
>   qcow2: Assert that cluster operations are aligned
>   qcow2: Discard/zero clusters by byte count
> 
> Fam Zheng (23):
>   block: Make bdrv_perm_names public
>   block: Add, parse and store "force-share" option
>   block: Respect "force-share" in perm propagating
>   qemu-img: Add --force-share option to subcommands
>   qemu-img: Update documentation for -U
>   qemu-io: Add --force-share option
>   iotests: 030: Prepare for image locking
>   iotests: 046: Prepare for image locking
>   iotests: 055: Don't attach the target image already for drive-backup
>   iotests: 085: Avoid image locking conflict
>   iotests: 087: Don't attach test image twice
>   iotests: 091: Quit QEMU before checking image
>   iotests: 172: Use separate images for multiple devices
>   tests: Use null-co:// instead of /dev/null as the dummy image
>   file-posix: Add 'locking' option
>   file-win32: Error out if locking=on
>   tests: Disable image lock in test-replication
>   block: Reuse bs as backing hd for drive-backup sync=none
>   osdep: Add qemu_lock_fd and qemu_unlock_fd
>   osdep: Fall back to posix lock when OFD lock is unavailable
>   file-posix: Add image locking to perm operations
>   qemu-iotests: Add test case 153 for image locking
>   tests: Add POSIX image locking test case 182
> 
> John Snow (1):
>   blockdev: use drained_begin/end for qmp_block_resize
> 
> Kevin Wolf (7):
>   migration: Unify block node activation error handling
>   block: New BdrvChildRole.activate() for blk_resume_after_migration()
>   block: Drop permissions when migration completes
>   block: Inactivate parents before children
>   block: Fix write/resize permissions for inactive images
>   file-posix: Remove .bdrv_inactivate/invalidate_cache
>   Merge remote-tracking branch 'mreitz/tags/pull-block-2017-05-11' into 
> queue-block
> 
> Max Reitz (5):
>   qcow2: Fix preallocation size formula
>   qcow2: Reuse preallocated zero clusters
>   qcow2: Discard preallocated zero clusters
>   iotests: Extend test 066
>   MAINTAINERS: Add qemu-progress to the block layer
> 
>  MAINTAINERS   |   1 +
>  block.c   | 127 +++--
>  block/blkdebug.c  | 264 +-
>  block/block-backend.c |  81 +---
>  block/file-posix.c| 248 +++-
>  block/file-win32.c|   5 +
>  block/qcow2-cluster.c | 252 ++--
>  block/qcow2-refcount.c| 148 +++
>  block/qcow2-snapshot.c|   7 +-
>  block/qcow2.c |  47 +++--
>  block/qcow2.h |  26 ++-
>  blockdev.c|  20 +-
>  hw/block/nvme.c   |  26 +++
>  hw/block/nvme.h   |   1 +
>  include/block/block.h |  41 ++--
>  

[Qemu-block] [PATCH v2] blockdev: Print a warning for legacy drive options that belong to -device

2017-05-12 Thread Thomas Huth
We likely do not want to carry these legacy -drive options along forever.
Let's emit a deprecation warning for the -drive options that have a
replacement with the -device option, so that the (hopefully few) remaining
users are aware of this and can adapt their scripts / behaviour accordingly.

Signed-off-by: Thomas Huth 
---
 v2:
 - Check for !qtest_enabled() since tests/hd-geo-test still uses these
 - Added "addr" to the list, too
 - Also mark the options as deprecated in the documentation

 blockdev.c  | 14 ++
 qemu-options.hx |  5 -
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index 0b38c3d..aef38f0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -50,6 +50,7 @@
 #include "qmp-commands.h"
 #include "block/trace.h"
 #include "sysemu/arch_init.h"
+#include "sysemu/qtest.h"
 #include "qemu/cutils.h"
 #include "qemu/help_option.h"
 #include "qemu/throttle-options.h"
@@ -797,6 +798,9 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType 
block_default_type)
 const char *filename;
 Error *local_err = NULL;
 int i;
+const char *deprecated[] = {
+"serial", "trans", "secs", "heads", "cyls", "addr"
+};
 
 /* Change legacy command line options into QMP ones */
 static const struct {
@@ -880,6 +884,16 @@ DriveInfo *drive_new(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 "update your scripts.\n");
 }
 
+/* Other deprecated options */
+if (!qtest_enabled()) {
+for (i = 0; i < ARRAY_SIZE(deprecated); i++) {
+if (qemu_opt_get(legacy_opts, deprecated[i]) != NULL) {
+error_report("'%s' is deprecated, please use the corresponding 
"
+ "option of '-device' instead", deprecated[i]);
+}
+}
+}
+
 /* Media type */
 value = qemu_opt_get(legacy_opts, "media");
 if (value) {
diff --git a/qemu-options.hx b/qemu-options.hx
index 9d7964d..2f66f1a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -615,6 +615,8 @@ of available connectors of a given interface type.
 This option defines the type of the media: disk or cdrom.
 @item cyls=@var{c},heads=@var{h},secs=@var{s}[,trans=@var{t}]
 These options have the same definition as they have in @option{-hdachs}.
+These parameters are deprecated, use the corresponding parameters
+of @code{-device} instead.
 @item snapshot=@var{snapshot}
 @var{snapshot} is "on" or "off" and controls snapshot mode for the given drive
 (see @option{-snapshot}).
@@ -631,7 +633,8 @@ an untrusted format header.
 @item serial=@var{serial}
 This option specifies the serial number to assign to the device.
 @item addr=@var{addr}
-Specify the controller's PCI address (if=virtio only).
+Specify the controller's PCI address (if=virtio only). This parameter is
+deprecated, use the corresponding parameter of @code{-device} instead.
 @item werror=@var{action},rerror=@var{action}
 Specify which @var{action} to take on write and read errors. Valid actions are:
 "ignore" (ignore the error and try to continue), "stop" (pause QEMU),
-- 
1.8.3.1




Re: [Qemu-block] [PATCH 02/18] block: access quiesce_counter with atomic ops

2017-05-12 Thread Alberto Garcia
On Thu 11 May 2017 04:41:52 PM CEST, Paolo Bonzini wrote:
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Paolo Bonzini 
> ---

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-block] [PATCH v4 0/6] COLO block replication supports shared disk case

2017-05-12 Thread Hailiang Zhang

On 2017/5/12 3:17, Stefan Hajnoczi wrote:

On Wed, Apr 12, 2017 at 10:05:15PM +0800, zhanghailiang wrote:

COLO block replication doesn't support the shared disk case,
Here we try to implement it and this is the 4th version.

Please review and any commits are welcomed.

Cc: Dr. David Alan Gilbert (git) 
Cc: eddie.d...@intel.com

Sorry for the delay.  Feel free to ping me if I don't review within a
few days when you post a patch.


That is OK. :) , I was doing other things these days, and it is not quite 
urgent ... thanks.


v4:
- Add proper comment for primary_disk in patch 2 (Stefan)
- Call bdrv_invalidate_cache() while do checkpoint for shared disk in patch 5

v3:
- Fix some comments from Stefan and Eric

v2:
- Drop the patch which add a blk_root() helper
- Fix some comments from Changlong

zhanghailiang (6):
   docs/block-replication: Add description for shared-disk case
   replication: add shared-disk and shared-disk-id options
   replication: Split out backup_do_checkpoint() from
 secondary_do_checkpoint()
   replication: fix code logic with the new shared_disk option
   replication: Implement block replication for shared disk case
   nbd/replication: implement .bdrv_get_info() for nbd and replication
 driver

  block/nbd.c|  12 +++
  block/replication.c| 198 ++---
  docs/block-replication.txt | 139 ++-
  qapi/block-core.json   |  10 ++-
  4 files changed, 306 insertions(+), 53 deletions(-)

--
1.8.3.1









Re: [Qemu-block] [Qemu-devel] [PATCH v4 5/6] replication: Implement block replication for shared disk case

2017-05-12 Thread Hailiang Zhang

On 2017/5/12 3:15, Stefan Hajnoczi wrote:

On Wed, Apr 12, 2017 at 10:05:20PM +0800, zhanghailiang wrote:

@@ -612,6 +644,16 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  error_propagate(errp, local_err);
  break;
  }
+} else {
+/*
+ * For shared disk, we need to force SVM to re-read metadata
+ * that is loaded in memory, or there will be inconsistent.
+ */
+bdrv_invalidate_cache(s->secondary_disk->bs, _err);

I'm not sure this call has any effect:

 if (!(bs->open_flags & BDRV_O_INACTIVE)) {
 return;
 }

Is BDRV_O_INACTIVE set?


No, you are right, it does not take any effect. So should we set this flag for 
secondary_disk ?
Is it enough to set this flag only, or should we call bdrv_inactivate_recurse() 
?
To be honest, i'm not quite familiar with this parts.

Thanks,
Hailiang





Re: [Qemu-block] [PATCH v4 2/6] replication: add shared-disk and shared-disk-id options

2017-05-12 Thread Hailiang Zhang

On 2017/5/12 3:08, Stefan Hajnoczi wrote:

On Wed, Apr 12, 2017 at 10:05:17PM +0800, zhanghailiang wrote:

We use these two options to identify which disk is
shared

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
v4:
- Add proper comment for primary_disk (Stefan)
v2:
- Move g_free(s->shared_disk_id) to the common fail process place (Stefan)
- Fix comments for these two options
---
  block/replication.c  | 43 +--
  qapi/block-core.json | 10 +-
  2 files changed, 50 insertions(+), 3 deletions(-)

Aside from the ongoing discussion about this patch...

Reviewed-by: Stefan Hajnoczi 


Thanks,  I'll fix the related problems found by changlong.




Re: [Qemu-block] [PATCH v4 2/6] replication: add shared-disk and shared-disk-id options

2017-05-12 Thread Hailiang Zhang

On 2017/4/18 13:59, Xie Changlong wrote:


On 04/12/2017 10:05 PM, zhanghailiang wrote:

We use these two options to identify which disk is
shared

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
v4:
- Add proper comment for primary_disk (Stefan)
v2:
- Move g_free(s->shared_disk_id) to the common fail process place (Stefan)
- Fix comments for these two options
---
   block/replication.c  | 43 +--
   qapi/block-core.json | 10 +-
   2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index bf3c395..418b81b 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -25,9 +25,12 @@
   typedef struct BDRVReplicationState {
   ReplicationMode mode;
   int replication_state;
+bool is_shared_disk;
+char *shared_disk_id;
   BdrvChild *active_disk;
   BdrvChild *hidden_disk;
   BdrvChild *secondary_disk;
+BdrvChild *primary_disk;
   char *top_id;
   ReplicationState *rs;
   Error *blocker;
@@ -53,6 +56,9 @@ static void replication_stop(ReplicationState *rs, bool 
failover,
   
   #define REPLICATION_MODE"mode"

   #define REPLICATION_TOP_ID  "top-id"
+#define REPLICATION_SHARED_DISK "shared-disk"
+#define REPLICATION_SHARED_DISK_ID "shared-disk-id"
+
   static QemuOptsList replication_runtime_opts = {
   .name = "replication",
   .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
@@ -65,6 +71,14 @@ static QemuOptsList replication_runtime_opts = {
   .name = REPLICATION_TOP_ID,
   .type = QEMU_OPT_STRING,
   },
+{
+.name = REPLICATION_SHARED_DISK_ID,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_SHARED_DISK,
+.type = QEMU_OPT_BOOL,
+},
   { /* end of list */ }
   },
   };
@@ -85,6 +99,9 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
   QemuOpts *opts = NULL;
   const char *mode;
   const char *top_id;
+const char *shared_disk_id;
+BlockBackend *blk;
+BlockDriverState *tmp_bs;
   
   bs->file = bdrv_open_child(NULL, options, "file", bs, _file,

  false, errp);
@@ -125,12 +142,33 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
  "The option mode's value should be primary or secondary");
   goto fail;
   }
+s->is_shared_disk = qemu_opt_get_bool(opts, REPLICATION_SHARED_DISK,
+

What If secondary side is supplied with 'REPLICATION_SHARED_DISK_ID'?
Pls refer f4f2539bc to pefect the logical.


Hmm, we should not configure it for secondary side, i'll fix it in next version.



  false);
+if (s->is_shared_disk && (s->mode == REPLICATION_MODE_PRIMARY)) {
+shared_disk_id = qemu_opt_get(opts, REPLICATION_SHARED_DISK_ID);
+if (!shared_disk_id) {
+error_setg(_err, "Missing shared disk blk option");
+goto fail;
+}
+s->shared_disk_id = g_strdup(shared_disk_id);
+blk = blk_by_name(s->shared_disk_id);
+if (!blk) {
+error_setg(_err, "There is no %s block", s->shared_disk_id);
+goto fail;
+}
+/* We have a BlockBackend for the primary disk but use BdrvChild for
+ * consistency - active_disk, secondary_disk, etc are also BdrvChild.
+ */
+tmp_bs = blk_bs(blk);
+s->primary_disk = QLIST_FIRST(_bs->parents);
+}
   
   s->rs = replication_new(bs, _ops);
   
-ret = 0;

-
+qemu_opts_del(opts);
+return 0;
   fail:
+g_free(s->shared_disk_id);
   qemu_opts_del(opts);
   error_propagate(errp, local_err);
   
@@ -141,6 +179,7 @@ static void replication_close(BlockDriverState *bs)

   {
   BDRVReplicationState *s = bs->opaque;
   
+g_free(s->shared_disk_id);

   if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
   replication_stop(s->rs, false, NULL);
   }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 033457c..361c932 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2661,12 +2661,20 @@
   #  node who owns the replication node chain. Must not be given in
   #  primary mode.
   #
+# @shared-disk-id: Id of shared disk while is replication mode, if @shared-disk
+#  is true, this option is required (Since: 2.10)
+#

Further explanations:

For @shared-disk-id, it must/only be given when @shared-disk enable on
Primary side.


OK.

+# @shared-disk: To indicate whether or not a disk is shared by primary VM
+#   and secondary VM. (The default is false) (Since: 2.10)
+#

Further explanations:

For @shared-disk, it must be given or not-given on both side at the same