date:20190417

Re: [Qemu-block] [Qemu-devel] [PATCH RFC 1/1] block/rbd: increase dynamically the image size

2019-04-17 Thread Stefano Garzarella

On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> 
> I think a potential actual use case could be persistent dirty bitmaps
> for incremental backup. Though maybe that would be better served by
> using the rbd image just as a raw external data file and keeping the
> qcow2 metadata on a filesystem.

Thanks to point it out! I'll take a look to understand how to keep
metadata separated from the data.

> 
> How fast is rbd_resize()? Does automatically resizing for every write
> request actually work reasonably well in practice? If it does, there is
> probably little reason not to allow it, even if the use cases are rather
> obscure.

I'll try to measure the percentage of the time spent in the rbd_resize.

Another solution could be to pass to the rbd driver the virtual size of
the image and resize it only one time also if the preallocation is
disabled, because RBD will not allocate blocks but IIUC it only set the max
size.

Do you think make sense? Is it feasible?

Thanks,
Stefano

Re: [Qemu-block] [Qemu-devel] [PATCH RFC 1/1] block/rbd: increase dynamically the image size

2019-04-17 Thread Kevin Wolf

Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > 
> > I think a potential actual use case could be persistent dirty bitmaps
> > for incremental backup. Though maybe that would be better served by
> > using the rbd image just as a raw external data file and keeping the
> > qcow2 metadata on a filesystem.
> 
> Thanks to point it out! I'll take a look to understand how to keep
> metadata separated from the data.

I'd consider the feature still experimental, but for local files, it
works like this:

qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G

And then just use test.qcow2. As long as you can put everything you need
into an rbd URL, the same approach should work. Otherwise, you may need
to use QMP blockdev-create on creation and possibly the data-file option
of the qcow2 driver for opening.

> > How fast is rbd_resize()? Does automatically resizing for every write
> > request actually work reasonably well in practice? If it does, there is
> > probably little reason not to allow it, even if the use cases are rather
> > obscure.
> 
> I'll try to measure the percentage of the time spent in the rbd_resize.
> 
> Another solution could be to pass to the rbd driver the virtual size of
> the image and resize it only one time also if the preallocation is
> disabled, because RBD will not allocate blocks but IIUC it only set the max
> size.
> 
> Do you think make sense? Is it feasible?

In theory yes, though it requires modification of every driver that
should be usable together with rbd (i.e. ideally all of the drivers). If
automatic resize works good enough, I'd prefer that.

Kevin

[Qemu-block] [PATCH for-4.0? 0/2] Fix overflow bug in qcow2 discard

2019-04-17 Thread Vladimir Sementsov-Ogievskiy

Hi all. We faced an interesting bug, which may be simply reproduced:

prepare image:
./qemu-img create -f qcow2 -o cluster_size=1M /ssd/test 2300M
./qemu-io -c 'write 100M 2000M' -c 'write 2100M 200M' -c 'write 0 100M' 
/ssd/test

shrink:
./qemu-img resize --shrink /ssd/test 50M

bug:
./qemu-img info /ssd/test
image: /ssd/test
file format: qcow2
virtual size: 50M (52428800 bytes)
disk size: 2.2G
cluster_size: 1048576
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

Virtual size is shrunk, but file - not. It is due to the fact,
that merged qcow2 discard may exceed 2G, and then converting from
uint64_t to int in qcow2_process_discards when we call bdrv_pdiscard
make wrong thing.

So, here are proposal of fix and new iotest for it.

Vladimir Sementsov-Ogievskiy (2):
  block/io: bdrv_pdiscard: support int64_t bytes parameter
  iotests: test big qcow2 shrink

 include/block/block.h  |  4 +--
 block/io.c | 19 ++-
 tests/qemu-iotests/249 | 69 ++
 tests/qemu-iotests/249.out | 30 +
 tests/qemu-iotests/group   |  1 +
 5 files changed, 112 insertions(+), 11 deletions(-)
 create mode 100755 tests/qemu-iotests/249
 create mode 100644 tests/qemu-iotests/249.out

-- 
2.18.0

[Qemu-block] [PATCH 1/2] block/io: bdrv_pdiscard: support int64_t bytes parameter

2019-04-17 Thread Vladimir Sementsov-Ogievskiy

This fixes at least one overflow in qcow2_process_discards.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  4 ++--
 block/io.c| 19 ++-
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index c7a26199aa..69fa18867e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -432,8 +432,8 @@ void bdrv_drain_all(void);
 AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),  \
cond); })
 
-int bdrv_pdiscard(BdrvChild *child, int64_t offset, int bytes);
-int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int bytes);
+int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
+int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
 int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
diff --git a/block/io.c b/block/io.c
index dfc153b8d8..c6429380e8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2653,7 +2653,7 @@ int bdrv_flush(BlockDriverState *bs)
 typedef struct DiscardCo {
 BdrvChild *child;
 int64_t offset;
-int bytes;
+int64_t bytes;
 int ret;
 } DiscardCo;
 static void coroutine_fn bdrv_pdiscard_co_entry(void *opaque)
@@ -2664,14 +2664,15 @@ static void coroutine_fn bdrv_pdiscard_co_entry(void 
*opaque)
 aio_wait_kick();
 }
 
-int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int bytes)
+int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
+  int64_t bytes)
 {
 BdrvTrackedRequest req;
 int max_pdiscard, ret;
 int head, tail, align;
 BlockDriverState *bs = child->bs;
 
-if (!bs || !bs->drv) {
+if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
 return -ENOMEDIUM;
 }
 
@@ -2679,9 +2680,8 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, 
int64_t offset, int bytes)
 return -EPERM;
 }
 
-ret = bdrv_check_byte_request(bs, offset, bytes);
-if (ret < 0) {
-return ret;
+if (offset < 0) {
+return -EIO;
 }
 
 /* Do nothing if disabled.  */
@@ -2711,12 +2711,13 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, 
int64_t offset, int bytes)
 goto out;
 }
 
-max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
+max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard,
+BDRV_REQUEST_MAX_BYTES),
align);
 assert(max_pdiscard >= bs->bl.request_alignment);
 
 while (bytes > 0) {
-int num = bytes;
+int64_t num = bytes;
 
 if (head) {
 /* Make small requests to get to alignment boundaries. */
@@ -2778,7 +2779,7 @@ out:
 return ret;
 }
 
-int bdrv_pdiscard(BdrvChild *child, int64_t offset, int bytes)
+int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
 {
 Coroutine *co;
 DiscardCo rwco = {
-- 
2.18.0

[Qemu-block] [PATCH 2/2] iotests: test big qcow2 shrink

2019-04-17 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/249 | 69 ++
 tests/qemu-iotests/249.out | 30 +
 tests/qemu-iotests/group   |  1 +
 3 files changed, 100 insertions(+)
 create mode 100755 tests/qemu-iotests/249
 create mode 100644 tests/qemu-iotests/249.out

diff --git a/tests/qemu-iotests/249 b/tests/qemu-iotests/249
new file mode 100755
index 00..f0140461ad
--- /dev/null
+++ b/tests/qemu-iotests/249
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+#
+# Test big discard in qcow2 shrink
+#
+# Copyright (c) 2019 Virtuozzo International GmbH. All rights reserved.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=vsement...@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+# Requires backing files and .bdrv_change_backing_file support
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+# This test checks bug in qcow2_process_discards, fixed by previous commit.
+# The problem was that bdrv_pdiscard was called with uint64_t bytes parameter
+# which was cropped to int.
+# To reproduce bug we need to overflow int by one sequential discard, so we
+# need size > 2G, bigger cluster size (as with default 64k we may have maximum
+# of 512M sequential data, corresponding to one L1 entry, and we need to write
+# at offset 0 at last, to prevent bdrv_co_truncate(bs->file) in
+# qcow2_co_truncate to stole the whole effect of failed discard.
+
+size=2300M
+IMGOPTS="cluster_size=1M"
+
+_make_test_img $size
+$QEMU_IO -c 'write 100M 2000M' -c 'write 2100M 200M' \
+-c 'write 0 100M' "$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG info "$TEST_IMG" | _filter_testdir
+
+$QEMU_IMG resize --shrink "$TEST_IMG" 50M
+
+$QEMU_IMG info "$TEST_IMG" | _filter_testdir
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/249.out b/tests/qemu-iotests/249.out
new file mode 100644
index 00..18bf7b2fb2
--- /dev/null
+++ b/tests/qemu-iotests/249.out
@@ -0,0 +1,30 @@
+QA output created by 249
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=2411724800
+wrote 2097152000/2097152000 bytes at offset 104857600
+1.953 GiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 209715200/209715200 bytes at offset 2202009600
+200 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 104857600/104857600 bytes at offset 0
+100 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.qcow2
+file format: qcow2
+virtual size: 2.2G (2411724800 bytes)
+disk size: 2.3G
+cluster_size: 1048576
+Format specific information:
+compat: 1.1
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+Image resized.
+image: TEST_DIR/t.qcow2
+file format: qcow2
+virtual size: 50M (52428800 bytes)
+disk size: 54M
+cluster_size: 1048576
+Format specific information:
+compat: 1.1
+lazy refcounts: false
+refcount bits: 16
+corrupt: false
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index bae7718380..3a40dbfe69 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -248,3 +248,4 @@
 246 rw auto quick
 247 rw auto quick
 248 rw auto quick
+249 rw auto
-- 
2.18.0

[Qemu-block] [PATCH] scsi-generic: prevent guest from exceeding SG_IO limits

2019-04-17 Thread Paolo Bonzini

Linux places a limit of UIO_MAXIOV pages on SG_IO ioctls (and if the limit
is exceeded, a confusing ENOMEM error is returned[1]).  Prevent the guest
from exceeding these limits, by capping the maximum transfer length to
that value in the block limits VPD page.

[1] Oh well, at least it was easier to follow the kernel source knowing
it had to end as ENOMEM...

Cc: qemu-sta...@nongnu.org
Cc: qemu-block@nongnu.org
Cc: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 hw/scsi/scsi-generic.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index d82b462be4..4134d8389b 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -131,6 +131,17 @@ static int execute_command(BlockBackend *blk,
 return 0;
 }
 
+/*
+ * Linux places a hard limit on SG_IO transfers equal to UIO_MAXIOV
+ * pages, which we need to factor in the block limits we return.
+ */
+static uint32_t sg_max_transfer(SCSIDevice *s)
+{
+uint32_t max_transfer = blk_get_max_transfer(s->conf.blk);
+max_transfer = MIN_NON_ZERO(max_transfer, UIO_MAXIOV * 
qemu_real_host_page_size);
+return max_transfer / s->blocksize;
+}
+
 static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s)
 {
 uint8_t page, page_idx;
@@ -162,10 +173,8 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, 
SCSIDevice *s)
 if (s->type == TYPE_DISK && (r->req.cmd.buf[1] & 0x01)) {
 page = r->req.cmd.buf[2];
 if (page == 0xb0) {
-uint32_t max_transfer =
-blk_get_max_transfer(s->conf.blk) / s->blocksize;
+uint32_t max_transfer = sg_max_transfer(s);
 
-assert(max_transfer);
 stl_be_p(&r->buf[8], max_transfer);
 /* Also take care of the opt xfer len. */
 stl_be_p(&r->buf[12],
@@ -206,7 +215,7 @@ static int scsi_generic_emulate_block_limits(SCSIGenericReq 
*r, SCSIDevice *s)
 uint8_t buf[64];
 
 SCSIBlockLimits bl = {
-.max_io_sectors = blk_get_max_transfer(s->conf.blk) / s->blocksize
+.max_io_sectors = sg_max_transfer(s)
 };
 
 memset(r->buf, 0, r->buflen);
-- 
2.21.0

Re: [Qemu-block] [PATCH] scsi-generic: prevent guest from exceeding SG_IO limits

2019-04-17 Thread Stefan Hajnoczi

On Wed, Apr 17, 2019 at 12:54 PM Paolo Bonzini  wrote:
> Linux places a limit of UIO_MAXIOV pages on SG_IO ioctls (and if the limit
> is exceeded, a confusing ENOMEM error is returned[1]).  Prevent the guest
> from exceeding these limits, by capping the maximum transfer length to
> that value in the block limits VPD page.
>
> [1] Oh well, at least it was easier to follow the kernel source knowing
> it had to end as ENOMEM...
>
> Cc: qemu-sta...@nongnu.org
> Cc: qemu-block@nongnu.org
> Cc: Stefan Hajnoczi 
> Signed-off-by: Paolo Bonzini 
> ---
>  hw/scsi/scsi-generic.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi

Re: [Qemu-block] [Qemu-devel] [PATCH for-4.0? 0/2] Fix overflow bug in qcow2 discard

2019-04-17 Thread Eric Blake

On 4/17/19 5:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> Hi all. We faced an interesting bug, which may be simply reproduced:
> 
> prepare image:
> ./qemu-img create -f qcow2 -o cluster_size=1M /ssd/test 2300M
> ./qemu-io -c 'write 100M 2000M' -c 'write 2100M 200M' -c 'write 0 100M' 
> /ssd/test
> 
> shrink:
> ./qemu-img resize --shrink /ssd/test 50M
> 
> bug:
> ./qemu-img info /ssd/test
> image: /ssd/test
> file format: qcow2
> virtual size: 50M (52428800 bytes)
> disk size: 2.2G
> cluster_size: 1048576
> Format specific information:
> compat: 1.1
> lazy refcounts: false
> refcount bits: 16
> corrupt: false
> 
> Virtual size is shrunk, but file - not. It is due to the fact,
> that merged qcow2 discard may exceed 2G, and then converting from
> uint64_t to int in qcow2_process_discards when we call bdrv_pdiscard
> make wrong thing.

Too late for 4.0, but also not a regression new to this release, since
the problem appears to have been present since its introduction in
commit 0b919fae (1.6.0) (that is, even back then, Qcow2DiscardRegion was
introduced with a 64-bit discard length, but qcow2_process_discards
blindly passed that through bdrv_discard() at the time, which took 'int
nb_sectors').

> 
> So, here are proposal of fix and new iotest for it.
> 
> Vladimir Sementsov-Ogievskiy (2):
>   block/io: bdrv_pdiscard: support int64_t bytes parameter
>   iotests: test big qcow2 shrink
> 
>  include/block/block.h  |  4 +--
>  block/io.c | 19 ++-
>  tests/qemu-iotests/249 | 69 ++
>  tests/qemu-iotests/249.out | 30 +
>  tests/qemu-iotests/group   |  1 +
>  5 files changed, 112 insertions(+), 11 deletions(-)
>  create mode 100755 tests/qemu-iotests/249
>  create mode 100644 tests/qemu-iotests/249.out
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] block/io: bdrv_pdiscard: support int64_t bytes parameter

2019-04-17 Thread Eric Blake

On 4/17/19 5:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> This fixes at least one overflow in qcow2_process_discards.

It's worth calling out how long the problem of passing >2G discard
requests has been present (my reply to the cover letter tracked down
0b919fae as tracking a 64-bit discard region but passing it to
bdrv_discard() which took an int sectors; I'm not sure if later changes
to byte-based rather than sector-based made a difference).

> @@ -2664,14 +2664,15 @@ static void coroutine_fn bdrv_pdiscard_co_entry(void 
> *opaque)
>  aio_wait_kick();
>  }
>  
> -int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int 
> bytes)
> +int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
> +  int64_t bytes)
>  {
>  BdrvTrackedRequest req;
>  int max_pdiscard, ret;
>  int head, tail, align;
>  BlockDriverState *bs = child->bs;
>  
> -if (!bs || !bs->drv) {
> +if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {

This change seems unrelated? Oh, it's because you are inlining the rest
of what bdrv_check_byte_request used to do.

>  return -ENOMEDIUM;
>  }
>  
> @@ -2679,9 +2680,8 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, 
> int64_t offset, int bytes)
>  return -EPERM;
>  }
>  
> -ret = bdrv_check_byte_request(bs, offset, bytes);
> -if (ret < 0) {
> -return ret;

If we keep this call in place, we can flag if there were any other
callers that were passing truncated 64-bit quantities. But I also agree
that now that we are switching to a 64-bit interface, we no longer have
to check whether callers were properly limiting their requests.

Hmm - I just realized that bdrv_check_byte_request() takes a size_t
(rather than int64_t) size argument - could this result in any other
truncations on a 32-bit platform that don't affect 64-bit platforms?

> @@ -2711,12 +2711,13 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, 
> int64_t offset, int bytes)
>  goto out;
>  }
>  
> -max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, 
> INT_MAX),
> +max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard,
> +BDRV_REQUEST_MAX_BYTES),
> align);

This change is a no-op, since BDRV_REQUEST_MAX_BYTES is already INT_MAX
aligned down to sector size, and align is at least sector size.

>  assert(max_pdiscard >= bs->bl.request_alignment);
>  
>  while (bytes > 0) {
> -int num = bytes;
> +int64_t num = bytes;
>  
>  if (head) {
>  /* Make small requests to get to alignment boundaries. */
> @@ -2778,7 +2779,7 @@ out:
>  return ret;
>  }
>  
> -int bdrv_pdiscard(BdrvChild *child, int64_t offset, int bytes)
> +int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
>  {
>  Coroutine *co;
>  DiscardCo rwco = {
> 

I'm not sure the patch is perfect, but I definitely agree that we want
to support 64-byte discard length (where the block layer fragments the
request as needed) rather than the current 32-byte discard length (where
callers have to be careful to not suffer from truncation).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] iotests: test big qcow2 shrink

2019-04-17 Thread Eric Blake

On 4/17/19 5:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  tests/qemu-iotests/249 | 69 ++
>  tests/qemu-iotests/249.out | 30 +
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 100 insertions(+)
>  create mode 100755 tests/qemu-iotests/249
>  create mode 100644 tests/qemu-iotests/249.out

> +
> +# This test checks bug in qcow2_process_discards, fixed by previous commit.

This sentence makes sense in a commit message, but is a bit stale in the
test itself. Simpler to just state:

This test checks that qcow2_process_discards does not truncate a discard
request > 2G.

> +# The problem was that bdrv_pdiscard was called with uint64_t bytes parameter
> +# which was cropped to int.

then drop this sentence.

> +# To reproduce bug we need to overflow int by one sequential discard, so we
> +# need size > 2G, bigger cluster size (as with default 64k we may have 
> maximum
> +# of 512M sequential data, corresponding to one L1 entry, and we need to 
> write

s/entry,/entry),/

> +# at offset 0 at last, to prevent bdrv_co_truncate(bs->file) in
> +# qcow2_co_truncate to stole the whole effect of failed discard.

and we need a second write at offset 0 to prevent
bdrv_co_truncate(bs->file) from short-circuiting the entire discard.

> +
> +size=2300M
> +IMGOPTS="cluster_size=1M"
> +
> +_make_test_img $size
> +$QEMU_IO -c 'write 100M 2000M' -c 'write 2100M 200M' \
> +-c 'write 0 100M' "$TEST_IMG" | _filter_qemu_io

This takes a LONG time to produce; and when I tried it on a 32-bit
machine, I got an ENOMEM failure for being unable to allocate a 2000M
buffer. Is there any faster way to produce a similar setup, perhaps by
manually messing with the qcow2 file rather than directly writing that
much consecutive data?  Even if there isn't, you'll want to at least
make the test gracefully skip rather than fail if it hits ENOMEM.

> +Image resized.
> +image: TEST_DIR/t.qcow2
> +file format: qcow2
> +virtual size: 50M (52428800 bytes)
> +disk size: 54M

I confirmed that when patch 1/2 is not present, this reports 2.2G
instead of 54M.

The output here will need tweaking if my patch to rework qemu-img
human-readable sizing lands first (it's currently queued on Kevin's
block-next tree, but needs a rework to fix a few more iotests and to
resolve a 32-bit compilation issue).

> +++ b/tests/qemu-iotests/group
> @@ -248,3 +248,4 @@
>  246 rw auto quick
>  247 rw auto quick
>  248 rw auto quick
> +249 rw auto
> 

I agree that this isn't quick. I will also point out that at least three
earlier messages have also claimed 249:

Berto: [PATCH for-4.1 v2 2/2] iotests: Check that images are in
read-only mode after block-commit

Max: [PATCH v3 6/7] iotests: Test qemu-img convert --salvage

and yourself :) [PATCH v6 7/7] iotests: test nbd reconnect

so someone gets to rename.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH] iotests: Make 182 do without device_add

2019-04-17 Thread Max Reitz

182 fails if qemu has no support for hotplugging of a virtio-blk device.
Using an NBD server instead works just as well for the test, even on
qemus without hotplugging support.

Fixes: 6d0a4a0fb5c8f10c8eb68b52cfda0082b00ae963
Reported-by: Danilo C. L. de Paula 
Signed-off-by: Max Reitz 
---
To reintroduce the bug for which this test was written, revert the
following commits:

$ git revert -n \
23dece19da41724349809873923e20a48b619cb7 \
6ceabe6f77e4ae5ac2fa3d2ac1be11dc95021941 \
a6aeca0ca530f104b5a5dd6704fca22b2c5edefa \
577a133988c76e4ebf01d050d0d758d207a1baf7
---
 tests/qemu-iotests/182 | 22 +-
 tests/qemu-iotests/182.out |  1 +
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/182 b/tests/qemu-iotests/182
index ff3d7e7ec1..38959bf276 100755
--- a/tests/qemu-iotests/182
+++ b/tests/qemu-iotests/182
@@ -31,6 +31,7 @@ _cleanup()
 {
 _cleanup_test_img
 rm -f "$TEST_IMG.overlay"
+rm -f "$TEST_DIR/nbd.socket"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
@@ -126,15 +127,26 @@ success_or_failure=y _send_qemu_cmd $QEMU_HANDLE \
 'return' \
 'error'
 
-# Now we attach the image to a virtio-blk device.  This device does
-# require some permissions (at least WRITE and READ_CONSISTENT), so if
+# Start an NBD server to which we can attach node1
+success_or_failure=y _send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'nbd-server-start',
+  'arguments': {
+  'addr': {
+  'type': 'unix',
+  'data': {
+  'path': '$TEST_DIR/nbd.socket'
+  } } } }" \
+'return' \
+'error'
+
+# Now we attach the image to the NBD server.  This server does require
+# some permissions (at least WRITE and READ_CONSISTENT), so if
 # reopening node0 unshared any (which it should not have), this will
 # fail (but it should not).
 success_or_failure=y _send_qemu_cmd $QEMU_HANDLE \
-"{'execute': 'device_add',
+"{'execute': 'nbd-server-add',
   'arguments': {
-  'driver': 'virtio-blk',
-  'drive': 'node1'
+  'device': 'node1'
   } }" \
 'return' \
 'error'
diff --git a/tests/qemu-iotests/182.out b/tests/qemu-iotests/182.out
index af501ca3f3..33d41eea91 100644
--- a/tests/qemu-iotests/182.out
+++ b/tests/qemu-iotests/182.out
@@ -14,4 +14,5 @@ Formatting 'TEST_DIR/t.qcow2.overlay', fmt=qcow2 size=197120 
backing_file=TEST_D
 {"return": {}}
 {"return": {}}
 {"return": {}}
+{"return": {}}
 *** done
-- 
2.20.1

Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: Make 182 do without device_add

2019-04-17 Thread Eric Blake

On 4/17/19 10:30 AM, Max Reitz wrote:
> 182 fails if qemu has no support for hotplugging of a virtio-blk device.
> Using an NBD server instead works just as well for the test, even on
> qemus without hotplugging support.
> 
> Fixes: 6d0a4a0fb5c8f10c8eb68b52cfda0082b00ae963
> Reported-by: Danilo C. L. de Paula 
> Signed-off-by: Max Reitz 
> ---
> To reintroduce the bug for which this test was written, revert the
> following commits:
> 
> $ git revert -n \
> 23dece19da41724349809873923e20a48b619cb7 \
> 6ceabe6f77e4ae5ac2fa3d2ac1be11dc95021941 \
> a6aeca0ca530f104b5a5dd6704fca22b2c5edefa \
> 577a133988c76e4ebf01d050d0d758d207a1baf7

Useful instructions; with them, I was able to:

Tested-by: Eric Blake 

> ---
>  tests/qemu-iotests/182 | 22 +-
>  tests/qemu-iotests/182.out |  1 +
>  2 files changed, 18 insertions(+), 5 deletions(-)

Reviewed-by: Eric Blake 

Do you want me to take this through my NBD tree, since it now depends on
NBD? I'm also fine letting it go through your iotests tree

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH] block: Fix AioContext switch for bs->drv == NULL

2019-04-17 Thread Kevin Wolf

Even for block nodes with bs->drv == NULL, we can't just ignore a
bdrv_set_aio_context() call. Leaving the node in its old context can
mean that it's still in an iothread context in bdrv_close_all() during
shutdown, resulting in an attempted unlock of the AioContext lock which
we don't hold.

This is an example stack trace of a related crash:

 #0  0x759da57f in raise () at /lib64/libc.so.6
 #1  0x759c4895 in abort () at /lib64/libc.so.6
 #2  0x55b97b1e in error_exit (err=, 
msg=msg@entry=0x55d386d0 <__func__.19059> "qemu_mutex_unlock_impl") at 
util/qemu-thread-posix.c:36
 #3  0x55b97f7f in qemu_mutex_unlock_impl 
(mutex=mutex@entry=0x568002f0, file=file@entry=0x55d378df 
"util/async.c", line=line@entry=507) at util/qemu-thread-posix.c:97
 #4  0x55b92f55 in aio_context_release (ctx=ctx@entry=0x56800290) 
at util/async.c:507
 #5  0x55b05cf8 in bdrv_prwv_co (child=child@entry=0x7fffc80012f0, 
offset=offset@entry=131072, qiov=qiov@entry=0x7fffd4f0, 
is_write=is_write@entry=true, flags=flags@entry=0)
 at block/io.c:833
 #6  0x55b060a9 in bdrv_pwritev (qiov=0x7fffd4f0, offset=131072, 
child=0x7fffc80012f0) at block/io.c:990
 #7  0x55b060a9 in bdrv_pwrite (child=0x7fffc80012f0, offset=131072, 
buf=, bytes=) at block/io.c:990
 #8  0x55ae172b in qcow2_cache_entry_flush (bs=bs@entry=0x56810680, 
c=c@entry=0x568cc740, i=i@entry=0) at block/qcow2-cache.c:51
 #9  0x55ae18dd in qcow2_cache_write (bs=bs@entry=0x56810680, 
c=0x568cc740) at block/qcow2-cache.c:248
 #10 0x55ae15de in qcow2_cache_flush (bs=0x56810680, c=) at block/qcow2-cache.c:259
 #11 0x55ae16b1 in qcow2_cache_flush_dependency (c=0x568a1700, 
c=0x568a1700, bs=0x56810680) at block/qcow2-cache.c:194
 #12 0x55ae16b1 in qcow2_cache_entry_flush (bs=bs@entry=0x56810680, 
c=c@entry=0x568a1700, i=i@entry=0) at block/qcow2-cache.c:194
 #13 0x55ae18dd in qcow2_cache_write (bs=bs@entry=0x56810680, 
c=0x568a1700) at block/qcow2-cache.c:248
 #14 0x55ae15de in qcow2_cache_flush (bs=bs@entry=0x56810680, 
c=) at block/qcow2-cache.c:259
 #15 0x55ad242c in qcow2_inactivate (bs=bs@entry=0x56810680) at 
block/qcow2.c:2124
 #16 0x55ad2590 in qcow2_close (bs=0x56810680) at block/qcow2.c:2153
 #17 0x55ab0c62 in bdrv_close (bs=0x56810680) at block.c:3358
 #18 0x55ab0c62 in bdrv_delete (bs=0x56810680) at block.c:3542
 #19 0x55ab0c62 in bdrv_unref (bs=0x56810680) at block.c:4598
 #20 0x55af4d72 in blk_remove_bs (blk=blk@entry=0x568103d0) at 
block/block-backend.c:785
 #21 0x55af4dbb in blk_remove_all_bs () at block/block-backend.c:483
 #22 0x55aae02f in bdrv_close_all () at block.c:3412
 #23 0x557f9796 in main (argc=, argv=, 
envp=) at vl.c:4776

The reproducer I used is a qcow2 image on gluster volume, where the
virtual disk size (4 GB) is larger than the gluster volume size (64M),
so we can easily trigger an ENOSPC. This backend is assigned to a
virtio-blk device using an iothread, and then from the guest a
'dd if=/dev/zero of=/dev/vda bs=1G count=1' causes the VM to stop
because of an I/O error. qemu_gluster_co_flush_to_disk() sets
bs->drv = NULL on error, so when virtio-blk stops the dataplane, the
block nodes stay in the iothread AioContext. A 'quit' monitor command
issued from this paused state crashes the process.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1631227
Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
---
 block.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 16615bc876..9ae5c0ed2f 100644
--- a/block.c
+++ b/block.c
@@ -5672,10 +5672,6 @@ void bdrv_detach_aio_context(BlockDriverState *bs)
 BdrvAioNotifier *baf, *baf_tmp;
 BdrvChild *child;
 
-if (!bs->drv) {
-return;
-}
-
 assert(!bs->walking_aio_notifiers);
 bs->walking_aio_notifiers = true;
 QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
@@ -5690,7 +5686,7 @@ void bdrv_detach_aio_context(BlockDriverState *bs)
  */
 bs->walking_aio_notifiers = false;
 
-if (bs->drv->bdrv_detach_aio_context) {
+if (bs->drv && bs->drv->bdrv_detach_aio_context) {
 bs->drv->bdrv_detach_aio_context(bs);
 }
 QLIST_FOREACH(child, &bs->children, next) {
@@ -5709,10 +5705,6 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
 BdrvAioNotifier *ban, *ban_tmp;
 BdrvChild *child;
 
-if (!bs->drv) {
-return;
-}
-
 if (bs->quiesce_counter) {
 aio_disable_external(new_context);
 }
@@ -5722,7 +5714,7 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
 QLIST_FOREACH(child, &bs->children, next) {
 bdrv_attach_aio_context(child->bs, new_context);
 }
-if (bs->drv->bdrv_attach_aio_context) {
+if (bs->drv && bs->drv->bdrv_attach_aio_co

Re: [Qemu-block] [Qemu-devel] [PATCH] block: Fix AioContext switch for bs->drv == NULL

2019-04-17 Thread Eric Blake

On 4/17/19 10:48 AM, Kevin Wolf wrote:
> Even for block nodes with bs->drv == NULL, we can't just ignore a
> bdrv_set_aio_context() call. Leaving the node in its old context can
> mean that it's still in an iothread context in bdrv_close_all() during
> shutdown, resulting in an attempted unlock of the AioContext lock which
> we don't hold.
> 

Seems like an "interesting" bug to diagnose. At any rate, getting rid of
the short-circuiting makes sense, and it's nice that you did manage to
come up with a reliable reproducer.

> The reproducer I used is a qcow2 image on gluster volume, where the
> virtual disk size (4 GB) is larger than the gluster volume size (64M),
> so we can easily trigger an ENOSPC. This backend is assigned to a
> virtio-blk device using an iothread, and then from the guest a
> 'dd if=/dev/zero of=/dev/vda bs=1G count=1' causes the VM to stop
> because of an I/O error. qemu_gluster_co_flush_to_disk() sets
> bs->drv = NULL on error, so when virtio-blk stops the dataplane, the
> block nodes stay in the iothread AioContext. A 'quit' monitor command
> issued from this paused state crashes the process.
> 
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1631227
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Kevin Wolf 
> ---
>  block.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions

2019-04-17 Thread Max Reitz

On 16.04.19 12:02, Vladimir Sementsov-Ogievskiy wrote:
> 10.04.2019 23:20, Max Reitz wrote:
>> What bs->file and bs->backing mean depends on the node.  For filter
>> nodes, both signify a node that will eventually receive all R/W
>> accesses.  For format nodes, bs->file contains metadata and data, and
>> bs->backing will not receive writes -- instead, writes are COWed to
>> bs->file.  Usually.
>>
>> In any case, it is not trivial to guess what a child means exactly with
>> our currently limited form of expression.  It is better to introduce
>> some functions that actually guarantee a meaning:
>>
>> - bdrv_filtered_cow_child() will return the child that receives requests
>>filtered through COW.  That is, reads may or may not be forwarded
>>(depending on the overlay's allocation status), but writes never go to
>>this child.
>>
>> - bdrv_filtered_rw_child() will return the child that receives requests
>>filtered through some very plain process.  Reads and writes issued to
>>the parent will go to the child as well (although timing, etc. may be
>>modified).
>>
>> - All drivers but quorum (but quorum is pretty opaque to the general
>>block layer anyway) always only have one of these children: All read
>>requests must be served from the filtered_rw_child (if it exists), so
>>if there was a filtered_cow_child in addition, it would not receive
>>any requests at all.
>>(The closest here is mirror, where all requests are passed on to the
>>source, but with write-blocking, write requests are "COWed" to the
>>target.  But that just means that the target is a special child that
>>cannot be introspected by the generic block layer functions, and that
>>source is a filtered_rw_child.)
>>Therefore, we can also add bdrv_filtered_child() which returns that
>>one child (or NULL, if there is no filtered child).
>>
>> Also, many places in the current block layer should be skipping filters
>> (all filters or just the ones added implicitly, it depends) when going
>> through a block node chain.  They do not do that currently, but this
>> patch makes them.
>>
>> One example for this is qemu-img map, which should skip filters and only
>> look at the COW elements in the graph.  The change to iotest 204's
>> reference output shows how using blkdebug on top of a COW node used to
>> make qemu-img map disregard the rest of the backing chain, but with this
>> patch, the allocation in the base image is reported correctly.
>>
>> Furthermore, a note should be made that sometimes we do want to access
>> bs->backing directly.  This is whenever the operation in question is not
>> about accessing the COW child, but the "backing" child, be it COW or
>> not.  This is the case in functions such as bdrv_open_backing_file() or
>> whenever we have to deal with the special behavior of @backing as a
>> blockdev option, which is that it does not default to null like all
>> other child references do.
>>
>> Finally, the query functions (query-block and query-named-block-nodes)
>> are modified to return any filtered child under "backing", not just
>> bs->backing or COW children.  This is so that filters do not interrupt
>> the reported backing chain.  This changes the output of iotest 184, as
>> the throttled node now appears as a backing child.
>>
>> Signed-off-by: Max Reitz 
>> ---
>>   qapi/block-core.json   |   4 +
>>   include/block/block.h  |   1 +
>>   include/block/block_int.h  |  40 +--
>>   block.c| 210 +++--
>>   block/backup.c |   8 +-
>>   block/block-backend.c  |  16 ++-
>>   block/commit.c |  33 +++---
>>   block/io.c |  45 ---
>>   block/mirror.c |  21 ++--
>>   block/qapi.c   |  30 +++--
>>   block/stream.c |  13 +-
>>   blockdev.c |  88 +++---
>>   migration/block-dirty-bitmap.c |   4 +-
>>   nbd/server.c   |   6 +-
>>   qemu-img.c |  29 ++---
>>   tests/qemu-iotests/184.out |   7 +-
>>   tests/qemu-iotests/204.out |   1 +
>>   17 files changed, 411 insertions(+), 145 deletions(-)
> 
> really huge... didn't you consider conversion file-by-file?

Frankly, no, I just didn’t consider it.

Hm.  I don’t know, 30-patch series always look so frightening.

>> diff --git a/block.c b/block.c
>> index 16615bc876..e8f6febda0 100644
>> --- a/block.c
>> +++ b/block.c
> 
> [..]
> 
>>   
>> @@ -3467,14 +3469,17 @@ static int bdrv_reopen_parse_backing(BDRVReopenState 
>> *reopen_state,
>>   /*
>>* Find the "actual" backing file by skipping all links that point
>>* to an implicit node, if any (e.g. a commit filter node).
>> + * We cannot use any of the bdrv_skip_*() functions here because
>> + * those return the first explicit node, while we are looking for
>> + * its overlay here.
>>*/
>>   o

Re: [Qemu-block] [PATCH v3 3/7] blkdebug: Add @iotype error option

2019-04-17 Thread Max Reitz

On 16.04.19 09:18, Vladimir Sementsov-Ogievskiy wrote:
> 13.04.2019 19:53, Max Reitz wrote:
>> This new error option allows users of blkdebug to inject errors only on
>> certain kinds of I/O operations.  Users usually want to make a very
>> specific operation fail, not just any; but right now they simply hope
>> that the event that triggers the error injection is followed up with
>> that very operation.  That may not be true, however, because the block
>> layer is changing (including blkdebug, which may increase the number of
>> types of I/O operations on which to inject errors).
>>
>> The new option's default has been chosen to keep backwards
>> compatibility.
>>
>> Note that similar to the internal representation, we could choose to
>> expose this option as a list of I/O types.  But there is no practical
>> use for this, because as described above, users usually know exactly
>> which kind of operation they want to make fail, so there is no need to
>> specify multiple I/O types at once.  In addition, exposing this option
>> as a list would require non-trivial changes to qemu_opts_absorb_qdict().
>>
>> Signed-off-by: Max Reitz 
>> ---
>>   qapi/block-core.json | 26 +++
>>   block/blkdebug.c | 50 
>>   2 files changed, 67 insertions(+), 9 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 7ccbfff9d0..5141e91172 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3235,6 +3235,26 @@
>>   'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
>>   'cor_write'] }
>>   
>> +##
>> +# @BlkdebugIOType:
>> +#
>> +# Kinds of I/O that blkdebug can inject errors in.
>> +#
>> +# @read: .bdrv_co_preadv()
>> +#
>> +# @write: .bdrv_co_pwritev()
>> +#
>> +# @write-zeroes: .bdrv_co_pwrite_zeroes()
>> +#
>> +# @discard: .bdrv_co_pdiscard()
>> +#
>> +# @flush: .bdrv_co_flush_to_disk()
>> +#
>> +# Since: 4.1
>> +##
>> +{ 'enum': 'BlkdebugIOType',
>> +  'data': [ 'read', 'write', 'write-zeroes', 'discard', 'flush' ] }
>> +
>>   ##
>>   # @BlkdebugInjectErrorOptions:
>>   #
>> @@ -3245,6 +3265,11 @@
>>   # @state:   the state identifier blkdebug needs to be in to
>>   #   actually trigger the event; defaults to "any"
>>   #
>> +# @iotype:  the type of I/O operations on which this error should
>> +#   be injected; defaults to "all read, write,
>> +#   write-zeroes, discard, and flush operations"
> 
> Don't you want defaults to any?

See the commit message:

> The new option's default has been chosen to keep backwards
> compatibility.

I can’t let it default to any because I’d like to add the block-status
I/O type, and the current behavior is not to inject errors then.
Changing that would break a range of iotests, and I’d rather avoid that.

I think it makes sense to require users to specify @iotype if they want
anything but the basic I/O types fail.  Well, in fact, I think users
always want to specify this option, but, well, we didn’t have it.

>> +#   (since: 4.1)
>> +#
>>   # @errno:   error identifier (errno) to be returned; defaults to
>>   #   EIO
>>   #
>> @@ -3262,6 +3287,7 @@
>>   { 'struct': 'BlkdebugInjectErrorOptions',
>> 'data': { 'event': 'BlkdebugEvent',
>>   '*state': 'int',
>> +'*iotype': 'BlkdebugIOType',
>>   '*errno': 'int',
>>   '*sector': 'int',
>>   '*once': 'bool',
>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>> index efd9441625..ca84b12e32 100644
>> --- a/block/blkdebug.c
>> +++ b/block/blkdebug.c
>> @@ -75,6 +75,7 @@ typedef struct BlkdebugRule {
>>   int state;
>>   union {
>>   struct {
>> +uint64_t iotype_mask;
>>   int error;
>>   int immediately;
>>   int once;
>> @@ -91,6 +92,9 @@ typedef struct BlkdebugRule {
>>   QSIMPLEQ_ENTRY(BlkdebugRule) active_next;
>>   } BlkdebugRule;
>>   
>> +QEMU_BUILD_BUG_MSG(BLKDEBUGIO_TYPE__MAX > 64,
>> +   "BlkdebugIOType mask does not fit into an uint64_t");
>> +
>>   static QemuOptsList inject_error_opts = {
>>   .name = "inject-error",
>>   .head = QTAILQ_HEAD_INITIALIZER(inject_error_opts.head),
>> @@ -103,6 +107,10 @@ static QemuOptsList inject_error_opts = {
>>   .name = "state",
>>   .type = QEMU_OPT_NUMBER,
>>   },
>> +{
>> +.name = "iotype",
>> +.type = QEMU_OPT_STRING,
>> +},
>>   {
>>   .name = "errno",
>>   .type = QEMU_OPT_NUMBER,
>> @@ -162,6 +170,8 @@ static int add_rule(void *opaque, QemuOpts *opts, Error 
>> **errp)
>>   int event;
>>   struct BlkdebugRule *rule;
>>   int64_t sector;
>> +BlkdebugIOType iotype;
>> +Error *local_error = NULL;
>>   
>>   /* Find the right event for the rule */
>>   event_name = qemu_opt_get(opts, "event");
>> @@ -19

Re: [Qemu-block] [PATCH v3 6/7] iotests: Test qemu-img convert --salvage

2019-04-17 Thread Max Reitz

On 16.04.19 10:02, Vladimir Sementsov-Ogievskiy wrote:
> 13.04.2019 19:53, Max Reitz wrote:
>> This test converts a simple image to another, but blkdebug injects
>> block_status and read faults at some offsets.  The resulting image
>> should be the same as the input image, except that sectors that could
>> not be read have to be 0.
>>
>> Signed-off-by: Max Reitz 
>> ---
>>   tests/qemu-iotests/249 | 163 +
>>   tests/qemu-iotests/249.out |  43 ++
>>   tests/qemu-iotests/group   |   1 +
>>   3 files changed, 207 insertions(+)
>>   create mode 100755 tests/qemu-iotests/249
>>   create mode 100644 tests/qemu-iotests/249.out
>>
>> diff --git a/tests/qemu-iotests/249 b/tests/qemu-iotests/249
>> new file mode 100755
>> index 00..d3883d75f3
>> --- /dev/null
>> +++ b/tests/qemu-iotests/249
>> @@ -0,0 +1,163 @@
>> +#!/bin/bash
>> +#
>> +# Test qemu-img convert --salvage
>> +#
>> +# Copyright (C) 2018 Red Hat, Inc.
> 
> 2019
> 
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License as published by
>> +# the Free Software Foundation; either version 2 of the License, or
>> +# (at your option) any later version.
>> +#
>> +# This program is distributed in the hope that it will be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program.  If not, see .
>> +#
>> +
>> +# creator
>> +owner=mre...@redhat.com
>> +
>> +seq=$(basename $0)
>> +echo "QA output created by $seq"
>> +
>> +here=$PWD
> 
> this variable is not defined in latest bash-based iotests, is it needed 
> really?

Hm, no, it’s just here to remind me that this test really is from 2018. :-)

>> +status=1# failure is the default!
>> +
>> +_cleanup()
>> +{
>> +_cleanup_test_img
> 
> as I understand, you also need to remove source img which is $TEST_IMG.orig

That’s done automatically by _cleanup_test_img().

(It removes $TEST_IMG, $TEST_IMG.orig, and $TEST_IMG.base.)

>> +}
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +# get standard environment, filters and checks
>> +. ./common.rc
>> +. ./common.filter
>> +. ./common.qemu
>> +
>> +_supported_fmt generic
>> +_supported_proto file
>> +_supported_os Linux
>> +
>> +
>> +TEST_IMG="$TEST_IMG.orig" _make_test_img 64M
>> +
>> +$QEMU_IO -c 'write -P 42 0 64M' "$TEST_IMG.orig" | _filter_qemu_io
>> +
>> +
>> +sector_size=512
>> +
>> +# Offsets on which to fail block-status.  Keep in ascending order so
>> +# the indexing done by _filter_offsets will appear in ascending order
>> +# in the output as well.
>> +status_fail_offsets="$((16 * 1024 * 1024 + 8192))
>> + $((33 * 1024 * 1024 + 512))"
>> +
>> +# Offsets on which to fail reads.  Keep in ascending order for the
>> +# same reason.
>> +# Element 1 is shared with $status_fail_offsets on purpose.
> 
> you mean element 2
> 
>> +# Elements 2 and later test what happens when a continuous range of
> 
> and 3..
> 
> ah no, you just counting from zero..

Of course I am. :-)

I can rewrite the comment to "The second element" and "Starting with the
third element, we test...".

>> +# sectors is inaccessible.
>> +read_fail_offsets="$((32 * 1024 * 1024 - 65536))
>> +   $((33 * 1024 * 1024 + 512))
>> +   $(seq $((34 * 1024 * 1024)) $sector_size \
>> + $((34 * 1024 * 1024 + 4096 - $sector_size)))"
>> +
>> +
>> +# blkdebug must be above the format layer so it can intercept all
>> +# block-status events
>> +source_img="json:{'driver': 'blkdebug',
>> +  'image': {
>> +  'driver': '$IMGFMT',
>> +  'file': {
>> +  'driver': 'file',
>> +  'filename': '$TEST_IMG.orig'
>> +  }
>> +  },
>> +  'inject-error': ["
>> +
>> +for ofs in $status_fail_offsets
>> +do
>> +source_img+="{ 'event': 'none',
> 
> idea: may be make @event optional with default of 'none'?

I don’t know whether that makes sense.  'none' is a special case, so my
personal feeling is not to make it the default.

>> +   'iotype': 'block-status',
>> +   'errno': 5,
>> +   'sector': $((ofs / sector_size)) },"
>> +done
>> +
>> +for ofs in $read_fail_offsets
>> +do
>> +source_img+="{ 'event': 'none',
>> +   'iotype': 'read',
>> +   'errno': 5,
>> +   'sector': $((ofs / sector_size)) },"
>> +done
>> +
>> +# Remove the trailing comma and terminate @inject-error and json:{}
>> +source_img="${source_img%,} ] }"
>> +
>> +
>> +echo
>> +
>> +
>> +_filter_offsets() {
> 
> why to filter them?

Re: [Qemu-block] [PATCH v3 7/7] iotests: Test qemu-img convert -C --salvage

2019-04-17 Thread Max Reitz

On 16.04.19 10:13, Vladimir Sementsov-Ogievskiy wrote:
> 13.04.2019 19:53, Max Reitz wrote:
>> We do not support this combination (yet), so this should yield an error
>> message.
>>
>> Signed-off-by: Max Reitz 
> 
> Tested-by: Vladimir Sementsov-Ogievskiy 
> [only -qcow2, as -nfs -qcow2 is already broken]
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> Hmm, interesting, that this test is defined as
> _supported_fmt qcow2
> _supported_proto file nfs
> 
> And of course, it doesn't work for -qcow2 -nfs combination..

Of course, nobody ever tests nfs. :-)

Thanks for reviewing!

Max



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH] block: Fix AioContext switch for bs->drv == NULL

2019-04-17 Thread Max Reitz

On 17.04.19 17:48, Kevin Wolf wrote:
> Even for block nodes with bs->drv == NULL, we can't just ignore a
> bdrv_set_aio_context() call. Leaving the node in its old context can
> mean that it's still in an iothread context in bdrv_close_all() during
> shutdown, resulting in an attempted unlock of the AioContext lock which
> we don't hold.
> 
> This is an example stack trace of a related crash:
> 
>  #0  0x759da57f in raise () at /lib64/libc.so.6
>  #1  0x759c4895 in abort () at /lib64/libc.so.6
>  #2  0x55b97b1e in error_exit (err=, 
> msg=msg@entry=0x55d386d0 <__func__.19059> "qemu_mutex_unlock_impl") at 
> util/qemu-thread-posix.c:36
>  #3  0x55b97f7f in qemu_mutex_unlock_impl 
> (mutex=mutex@entry=0x568002f0, file=file@entry=0x55d378df 
> "util/async.c", line=line@entry=507) at util/qemu-thread-posix.c:97
>  #4  0x55b92f55 in aio_context_release (ctx=ctx@entry=0x56800290) 
> at util/async.c:507
>  #5  0x55b05cf8 in bdrv_prwv_co (child=child@entry=0x7fffc80012f0, 
> offset=offset@entry=131072, qiov=qiov@entry=0x7fffd4f0, 
> is_write=is_write@entry=true, flags=flags@entry=0)
>  at block/io.c:833
>  #6  0x55b060a9 in bdrv_pwritev (qiov=0x7fffd4f0, offset=131072, 
> child=0x7fffc80012f0) at block/io.c:990
>  #7  0x55b060a9 in bdrv_pwrite (child=0x7fffc80012f0, offset=131072, 
> buf=, bytes=) at block/io.c:990
>  #8  0x55ae172b in qcow2_cache_entry_flush 
> (bs=bs@entry=0x56810680, c=c@entry=0x568cc740, i=i@entry=0) at 
> block/qcow2-cache.c:51
>  #9  0x55ae18dd in qcow2_cache_write (bs=bs@entry=0x56810680, 
> c=0x568cc740) at block/qcow2-cache.c:248
>  #10 0x55ae15de in qcow2_cache_flush (bs=0x56810680, c= out>) at block/qcow2-cache.c:259
>  #11 0x55ae16b1 in qcow2_cache_flush_dependency (c=0x568a1700, 
> c=0x568a1700, bs=0x56810680) at block/qcow2-cache.c:194
>  #12 0x55ae16b1 in qcow2_cache_entry_flush 
> (bs=bs@entry=0x56810680, c=c@entry=0x568a1700, i=i@entry=0) at 
> block/qcow2-cache.c:194
>  #13 0x55ae18dd in qcow2_cache_write (bs=bs@entry=0x56810680, 
> c=0x568a1700) at block/qcow2-cache.c:248
>  #14 0x55ae15de in qcow2_cache_flush (bs=bs@entry=0x56810680, 
> c=) at block/qcow2-cache.c:259
>  #15 0x55ad242c in qcow2_inactivate (bs=bs@entry=0x56810680) at 
> block/qcow2.c:2124
>  #16 0x55ad2590 in qcow2_close (bs=0x56810680) at 
> block/qcow2.c:2153
>  #17 0x55ab0c62 in bdrv_close (bs=0x56810680) at block.c:3358
>  #18 0x55ab0c62 in bdrv_delete (bs=0x56810680) at block.c:3542
>  #19 0x55ab0c62 in bdrv_unref (bs=0x56810680) at block.c:4598
>  #20 0x55af4d72 in blk_remove_bs (blk=blk@entry=0x568103d0) at 
> block/block-backend.c:785
>  #21 0x55af4dbb in blk_remove_all_bs () at block/block-backend.c:483
>  #22 0x55aae02f in bdrv_close_all () at block.c:3412
>  #23 0x557f9796 in main (argc=, argv=, 
> envp=) at vl.c:4776
> 
> The reproducer I used is a qcow2 image on gluster volume, where the
> virtual disk size (4 GB) is larger than the gluster volume size (64M),
> so we can easily trigger an ENOSPC. This backend is assigned to a
> virtio-blk device using an iothread, and then from the guest a
> 'dd if=/dev/zero of=/dev/vda bs=1G count=1' causes the VM to stop
> because of an I/O error. qemu_gluster_co_flush_to_disk() sets
> bs->drv = NULL on error, so when virtio-blk stops the dataplane, the
> block nodes stay in the iothread AioContext. A 'quit' monitor command
> issued from this paused state crashes the process.
> 
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1631227
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Kevin Wolf 
> ---
>  block.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH for-4.1 v2] qemu-img: Saner printing of large file sizes

2019-04-17 Thread Eric Blake

On 4/10/19 2:28 PM, Max Reitz wrote:
> On 01.04.19 16:57, Eric Blake wrote:
>> Disk sizes close to INT64_MAX cause overflow, for some pretty
>> ridiculous output:
>>
>>   $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
>>   image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
>>   file format: raw
>>   virtual size: -8388607T (9223372036854775296 bytes)
>>   disk size: unavailable
>>

> 
> There are more iotests this breaks.  First, there is 059 for vmdk, which
> looks just like the rest.

Easy enough; I was able to reproduce.

> 
> But for -m32, it gets a bit more difficult.  Every size above 999 GB
> (1000 GB gets rounded to 1 TB, which is 2^31 * 512) gets printed as
> "inf [unit]":

Ouch - pre-existing bug in size_to_str(); I'll have to fix that, then
send a v3.

Kevin, do you want to unqueue this from block-next until then?

> 
> But regardless of the iotests, we shouldn’t show the size as infinite
> just because of -m32.

Is there an easy docker setup for building -m32?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH v3 0/2] Saner printing of large file sizes

2019-04-17 Thread Eric Blake

since v2: Fix problems pointed out by Max:
vmdk (test 59) output had not actually been tested
32-bit builds of size_to_str() have been been broken since 2.10

Eric Blake (2):
  cutils: Fix size_to_str() on 32-bit platforms
  qemu-img: Saner printing of large file sizes

 block/qapi.c   | 49 +-
 util/cutils.c  |  2 +-
 tests/qemu-iotests/043.out |  6 ++---
 tests/qemu-iotests/053.out |  2 +-
 tests/qemu-iotests/059.out | 10 
 tests/qemu-iotests/060.out | 10 
 tests/qemu-iotests/061.out | 12 +-
 tests/qemu-iotests/070.out |  2 +-
 tests/qemu-iotests/082.out | 26 ++--
 tests/qemu-iotests/084.out |  8 +++
 tests/qemu-iotests/089.out |  2 +-
 tests/qemu-iotests/095.out |  4 ++--
 tests/qemu-iotests/104.out |  6 ++---
 tests/qemu-iotests/110.out |  6 ++---
 tests/qemu-iotests/114.out |  2 +-
 tests/qemu-iotests/126.out |  4 ++--
 tests/qemu-iotests/130.out | 10 
 tests/qemu-iotests/153.out |  2 +-
 tests/qemu-iotests/191.out |  8 +++
 tests/qemu-iotests/195.out |  4 ++--
 tests/qemu-iotests/198.out |  4 ++--
 tests/qemu-iotests/206.out | 10 
 tests/qemu-iotests/207.out | 12 +-
 tests/qemu-iotests/210.out |  8 +++
 tests/qemu-iotests/211.out | 10 
 tests/qemu-iotests/212.out | 10 
 tests/qemu-iotests/213.out | 10 
 tests/qemu-iotests/233.out |  4 ++--
 tests/qemu-iotests/237.out | 22 -
 tests/qemu-iotests/242.out | 10 
 30 files changed, 124 insertions(+), 151 deletions(-)

-- 
2.20.1

[Qemu-block] [PATCH v3 1/2] cutils: Fix size_to_str() on 32-bit platforms

2019-04-17 Thread Eric Blake

When extracting a human-readable size formatter, we changed 'uint64_t
div' pre-patch to 'unsigned long div' post-patch. Which breaks on
32-bit platforms, resulting in 'inf' instead of intended values larger
than 999GB.

Fixes: 22951aaa
CC: qemu-sta...@nongnu.org
Reported-by: Max Reitz 
Signed-off-by: Eric Blake 
---
 util/cutils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/cutils.c b/util/cutils.c
index e098debdc0c..d682c909015 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -825,7 +825,7 @@ const char *qemu_ether_ntoa(const MACAddr *mac)
 char *size_to_str(uint64_t val)
 {
 static const char *suffixes[] = { "", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei" };
-unsigned long div;
+uint64_t div;
 int i;

 /*
-- 
2.20.1

[Qemu-block] [PATCH v3 2/2] qemu-img: Saner printing of large file sizes

2019-04-17 Thread Eric Blake

Disk sizes close to INT64_MAX cause overflow, for some pretty
ridiculous output:

  $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
  image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
  file format: raw
  virtual size: -8388607T (9223372036854775296 bytes)
  disk size: unavailable

But there's no reason to have two separate implementations of integer
to human-readable abbreviation, where one has overflow and stops at
'T', while the other avoids overflow and goes all the way to 'E'. With
this patch, the output now claims 8EiB instead of -8388607T, which
really is the correct rounding of largest file size supported by qemu
(we could go 511 bytes larger if we used byte-accurate sizing instead
of rounding up to the next sector boundary, but that wouldn't change
the human-readable result).

Quite a few iotests need updates to expected output to match.

Reported-by: Richard W.M. Jones 
Signed-off-by: Eric Blake 
Tested-by: Richard W.M. Jones 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v3: fix 59 for vmdk [Max]
---
 block/qapi.c   | 49 +-
 tests/qemu-iotests/043.out |  6 ++---
 tests/qemu-iotests/053.out |  2 +-
 tests/qemu-iotests/059.out | 10 
 tests/qemu-iotests/060.out | 10 
 tests/qemu-iotests/061.out | 12 +-
 tests/qemu-iotests/070.out |  2 +-
 tests/qemu-iotests/082.out | 26 ++--
 tests/qemu-iotests/084.out |  8 +++
 tests/qemu-iotests/089.out |  2 +-
 tests/qemu-iotests/095.out |  4 ++--
 tests/qemu-iotests/104.out |  6 ++---
 tests/qemu-iotests/110.out |  6 ++---
 tests/qemu-iotests/114.out |  2 +-
 tests/qemu-iotests/126.out |  4 ++--
 tests/qemu-iotests/130.out | 10 
 tests/qemu-iotests/153.out |  2 +-
 tests/qemu-iotests/191.out |  8 +++
 tests/qemu-iotests/195.out |  4 ++--
 tests/qemu-iotests/198.out |  4 ++--
 tests/qemu-iotests/206.out | 10 
 tests/qemu-iotests/207.out | 12 +-
 tests/qemu-iotests/210.out |  8 +++
 tests/qemu-iotests/211.out | 10 
 tests/qemu-iotests/212.out | 10 
 tests/qemu-iotests/213.out | 10 
 tests/qemu-iotests/233.out |  4 ++--
 tests/qemu-iotests/237.out | 22 -
 tests/qemu-iotests/242.out | 10 
 29 files changed, 123 insertions(+), 150 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 21edab34fca..110d05dc575 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -630,43 +630,14 @@ BlockStatsList *qmp_query_blockstats(bool has_query_nodes,
 return head;
 }

-#define NB_SUFFIXES 4
-
-static char *get_human_readable_size(char *buf, int buf_size, int64_t size)
-{
-static const char suffixes[NB_SUFFIXES] = {'K', 'M', 'G', 'T'};
-int64_t base;
-int i;
-
-if (size <= 999) {
-snprintf(buf, buf_size, "%" PRId64, size);
-} else {
-base = 1024;
-for (i = 0; i < NB_SUFFIXES; i++) {
-if (size < (10 * base)) {
-snprintf(buf, buf_size, "%0.1f%c",
- (double)size / base,
- suffixes[i]);
-break;
-} else if (size < (1000 * base) || i == (NB_SUFFIXES - 1)) {
-snprintf(buf, buf_size, "%" PRId64 "%c",
- ((size + (base >> 1)) / base),
- suffixes[i]);
-break;
-}
-base = base * 1024;
-}
-}
-return buf;
-}
-
 void bdrv_snapshot_dump(fprintf_function func_fprintf, void *f,
 QEMUSnapshotInfo *sn)
 {
-char buf1[128], date_buf[128], clock_buf[128];
+char date_buf[128], clock_buf[128];
 struct tm tm;
 time_t ti;
 int64_t secs;
+char *sizing = NULL;

 if (!sn) {
 func_fprintf(f,
@@ -684,14 +655,15 @@ void bdrv_snapshot_dump(fprintf_function func_fprintf, 
void *f,
  (int)((secs / 60) % 60),
  (int)(secs % 60),
  (int)((sn->vm_clock_nsec / 100) % 1000));
+sizing = size_to_str(sn->vm_state_size);
 func_fprintf(f,
  "%-10s%-20s%7s%20s%15s",
  sn->id_str, sn->name,
- get_human_readable_size(buf1, sizeof(buf1),
- sn->vm_state_size),
+ sizing,
  date_buf,
  clock_buf);
 }
+g_free(sizing);
 }

 static void dump_qdict(fprintf_function func_fprintf, void *f, int indentation,
@@ -796,14 +768,13 @@ void bdrv_image_info_specific_dump(fprintf_function 
func_fprintf, void *f,
 void bdrv_image_info_dump(fprintf_function func_fprintf, void *f,
   ImageInfo *info)
 {
-char size_buf[128], dsize_buf[128];
+char *size_buf, *dsize_buf;
 if (!info->has_actual_size) {
-snprintf(dsize_buf, sizeof(dsize_buf), "unavailable");
+dsize_buf = g_strdup("unavailable");
 } else {
-g

Re: [Qemu-block] [PATCH for-4.1 v2] qemu-img: Saner printing of large file sizes

2019-04-17 Thread Max Reitz

On 17.04.19 18:58, Eric Blake wrote:
> On 4/10/19 2:28 PM, Max Reitz wrote:
>> On 01.04.19 16:57, Eric Blake wrote:
>>> Disk sizes close to INT64_MAX cause overflow, for some pretty
>>> ridiculous output:
>>>
>>>   $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
>>>   image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
>>>   file format: raw
>>>   virtual size: -8388607T (9223372036854775296 bytes)
>>>   disk size: unavailable
>>>
> 
>>
>> There are more iotests this breaks.  First, there is 059 for vmdk, which
>> looks just like the rest.
> 
> Easy enough; I was able to reproduce.
> 
>>
>> But for -m32, it gets a bit more difficult.  Every size above 999 GB
>> (1000 GB gets rounded to 1 TB, which is 2^31 * 512) gets printed as
>> "inf [unit]":
> 
> Ouch - pre-existing bug in size_to_str(); I'll have to fix that, then
> send a v3.
> 
> Kevin, do you want to unqueue this from block-next until then?
> 
>>
>> But regardless of the iotests, we shouldn’t show the size as infinite
>> just because of -m32.
> 
> Is there an easy docker setup for building -m32?

I don’t know.  In theory it's as simple as (on Fedora):

$ export PKG_CONFIG_LIBDIR=/usr/lib/pkgconfig
$ ../configure --target-list=x86_64-softmmu --extra-cflags=-m32
$ make

In practice, Fedora 29 has a broken glib2-devel.i686 package, which you
cannot install alongside the x86_64 version, so it doesn’t work
(https://bugzilla.redhat.com/show_bug.cgi?id=1651231).

I had hoped it was fixed by now... :-/

(In case you’re wondering, I’m building the 32-bit version on Arch.)

Max



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH v3 1/2] cutils: Fix size_to_str() on 32-bit platforms

2019-04-17 Thread Max Reitz

On 17.04.19 19:11, Eric Blake wrote:
> When extracting a human-readable size formatter, we changed 'uint64_t
> div' pre-patch to 'unsigned long div' post-patch. Which breaks on
> 32-bit platforms, resulting in 'inf' instead of intended values larger
> than 999GB.
> 
> Fixes: 22951aaa
> CC: qemu-sta...@nongnu.org
> Reported-by: Max Reitz 
> Signed-off-by: Eric Blake 
> ---
>  util/cutils.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH for-4.1 v2] qemu-img: Saner printing of large file sizes

2019-04-17 Thread Daniel P . Berrangé

On Wed, Apr 17, 2019 at 11:58:57AM -0500, Eric Blake wrote:
> On 4/10/19 2:28 PM, Max Reitz wrote:
> > On 01.04.19 16:57, Eric Blake wrote:
> >> Disk sizes close to INT64_MAX cause overflow, for some pretty
> >> ridiculous output:
> >>
> >>   $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
> >>   image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
> >>   file format: raw
> >>   virtual size: -8388607T (9223372036854775296 bytes)
> >>   disk size: unavailable
> >>
> 
> > 
> > There are more iotests this breaks.  First, there is 059 for vmdk, which
> > looks just like the rest.
> 
> Easy enough; I was able to reproduce.
> 
> > 
> > But for -m32, it gets a bit more difficult.  Every size above 999 GB
> > (1000 GB gets rounded to 1 TB, which is 2^31 * 512) gets printed as
> > "inf [unit]":
> 
> Ouch - pre-existing bug in size_to_str(); I'll have to fix that, then
> send a v3.
> 
> Kevin, do you want to unqueue this from block-next until then?
> 
> > 
> > But regardless of the iotests, we shouldn’t show the size as infinite
> > just because of -m32.
> 
> Is there an easy docker setup for building -m32?

QEMU has a "fedora-i386-cross" image you can at least do builds in
Hopefully its good enough for make check too

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [Qemu-block] [PATCH v3 2/2] qemu-img: Saner printing of large file sizes

2019-04-17 Thread Max Reitz

On 17.04.19 19:11, Eric Blake wrote:
> Disk sizes close to INT64_MAX cause overflow, for some pretty
> ridiculous output:
> 
>   $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
>   image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
>   file format: raw
>   virtual size: -8388607T (9223372036854775296 bytes)
>   disk size: unavailable
> 
> But there's no reason to have two separate implementations of integer
> to human-readable abbreviation, where one has overflow and stops at
> 'T', while the other avoids overflow and goes all the way to 'E'. With
> this patch, the output now claims 8EiB instead of -8388607T, which
> really is the correct rounding of largest file size supported by qemu
> (we could go 511 bytes larger if we used byte-accurate sizing instead
> of rounding up to the next sector boundary, but that wouldn't change
> the human-readable result).
> 
> Quite a few iotests need updates to expected output to match.
> 
> Reported-by: Richard W.M. Jones 
> Signed-off-by: Eric Blake 
> Tested-by: Richard W.M. Jones 
> Reviewed-by: Alberto Garcia 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> ---
> v3: fix 59 for vmdk [Max]

Tested-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH v3 02/15] block/ssh: Do not report read/write/flush errors to the user

2019-04-17 Thread Markus Armbruster

Callbacks ssh_co_readv(), ssh_co_writev(), ssh_co_flush() report
errors to the user with error_printf().  They shouldn't, it's their
caller's job.  Replace by a suitable trace point.  While there, drop
the unreachable !s->sftp case.

Perhaps we should convert this part of the block driver interface to
Error, so block drivers can pass more detail to their callers.  Not
today.

Cc: "Richard W.M. Jones" 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster 
Reviewed-by: Eric Blake 
---
 block/ssh.c| 38 +-
 block/trace-events |  3 +++
 2 files changed, 16 insertions(+), 25 deletions(-)

diff --git a/block/ssh.c b/block/ssh.c
index 190ef95300..859249113d 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -159,31 +159,19 @@ sftp_error_setg(Error **errp, BDRVSSHState *s, const char 
*fs, ...)
 g_free(msg);
 }
 
-static void GCC_FMT_ATTR(2, 3)
-sftp_error_report(BDRVSSHState *s, const char *fs, ...)
+static void sftp_error_trace(BDRVSSHState *s, const char *op)
 {
-va_list args;
+char *ssh_err;
+int ssh_err_code;
+unsigned long sftp_err_code;
 
-va_start(args, fs);
-error_vprintf(fs, args);
+/* This is not an errno.  See . */
+ssh_err_code = libssh2_session_last_error(s->session,
+  &ssh_err, NULL, 0);
+/* See . */
+sftp_err_code = libssh2_sftp_last_error((s)->sftp);
 
-if ((s)->sftp) {
-char *ssh_err;
-int ssh_err_code;
-unsigned long sftp_err_code;
-
-/* This is not an errno.  See . */
-ssh_err_code = libssh2_session_last_error(s->session,
-  &ssh_err, NULL, 0);
-/* See . */
-sftp_err_code = libssh2_sftp_last_error((s)->sftp);
-
-error_printf(": %s (libssh2 error code: %d, sftp error code: %lu)",
- ssh_err, ssh_err_code, sftp_err_code);
-}
-
-va_end(args);
-error_printf("\n");
+trace_sftp_error(op, ssh_err, ssh_err_code, sftp_err_code);
 }
 
 static int parse_uri(const char *filename, QDict *options, Error **errp)
@@ -1035,7 +1023,7 @@ static coroutine_fn int ssh_read(BDRVSSHState *s, 
BlockDriverState *bs,
 goto again;
 }
 if (r < 0) {
-sftp_error_report(s, "read failed");
+sftp_error_trace(s, "read");
 s->offset = -1;
 return -EIO;
 }
@@ -1105,7 +1093,7 @@ static int ssh_write(BDRVSSHState *s, BlockDriverState 
*bs,
 goto again;
 }
 if (r < 0) {
-sftp_error_report(s, "write failed");
+sftp_error_trace(s, "write");
 s->offset = -1;
 return -EIO;
 }
@@ -1188,7 +1176,7 @@ static coroutine_fn int ssh_flush(BDRVSSHState *s, 
BlockDriverState *bs)
 return 0;
 }
 if (r < 0) {
-sftp_error_report(s, "fsync failed");
+sftp_error_trace(s, "fsync");
 return -EIO;
 }
 
diff --git a/block/trace-events b/block/trace-events
index 7335a42540..79ccd8d824 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -208,3 +208,6 @@ sheepdog_co_rw_vector_new(uint64_t oid) "new oid 0x%" PRIx64
 sheepdog_snapshot_create_info(const char *sn_name, const char *id, const char 
*name, int64_t size, int is_snapshot) "sn_info: name %s id_str %s s: name %s 
vm_state_size %" PRId64 " " "is_snapshot %d"
 sheepdog_snapshot_create(const char *sn_name, const char *id) "%s %s"
 sheepdog_snapshot_create_inode(const char *name, uint32_t snap, uint32_t vdi) 
"s->inode: name %s snap_id 0x%" PRIx32 " vdi 0x%" PRIx32
+
+# ssh.c
+sftp_error(const char *op, const char *ssh_err, int ssh_err_code, unsigned 
long sftp_err_code) "%s failed: %s (libssh2 error code: %d, sftp error code: 
%lu)"
-- 
2.17.2

[Qemu-block] [PATCH v3 01/15] qemu-img: Use error_vreport() in error_exit()

2019-04-17 Thread Markus Armbruster

error_exit() uses low-level error_printf() to report errors.
Modernize it to use error_vreport().

Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster 
Reviewed-by: Eric Blake 
---
 qemu-img.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 8c7b2437f0..c376e91ca0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -85,13 +85,11 @@ static void QEMU_NORETURN GCC_FMT_ATTR(1, 2) 
error_exit(const char *fmt, ...)
 {
 va_list ap;
 
-error_printf("qemu-img: ");
-
 va_start(ap, fmt);
-error_vprintf(fmt, ap);
+error_vreport(fmt, ap);
 va_end(ap);
 
-error_printf("\nTry 'qemu-img --help' for more information\n");
+error_printf("Try 'qemu-img --help' for more information\n");
 exit(EXIT_FAILURE);
 }
 
-- 
2.17.2

[Qemu-block] [PATCH v3 12/15] blockdev: Make -drive format=help print to stdout

2019-04-17 Thread Markus Armbruster

Command line help explicitly requested by the user should be printed
to stdout, not stderr.  We do elsewhere.  Adjust -drive to match: use
qemu_printf() instead of error_printf().  Plain printf() would be
wrong because we need to print to the current monitor for "drive_add
... format=help".

Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
---
 blockdev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 4775a07d93..79fbac8450 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -40,6 +40,7 @@
 #include "monitor/monitor.h"
 #include "qemu/error-report.h"
 #include "qemu/option.h"
+#include "qemu/qemu-print.h"
 #include "qemu/config-file.h"
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-transaction.h"
@@ -301,7 +302,7 @@ DriveInfo *drive_get_next(BlockInterfaceType type)
 
 static void bdrv_format_print(void *opaque, const char *name)
 {
-error_printf(" %s", name);
+qemu_printf(" %s", name);
 }
 
 typedef struct {
@@ -530,11 +531,11 @@ static BlockBackend *blockdev_init(const char *file, 
QDict *bs_opts,
 
 if ((buf = qemu_opt_get(opts, "format")) != NULL) {
 if (is_help_option(buf)) {
-error_printf("Supported formats:");
+qemu_printf("Supported formats:");
 bdrv_iterate_format(bdrv_format_print, NULL, false);
-error_printf("\nSupported formats (read-only):");
+qemu_printf("\nSupported formats (read-only):");
 bdrv_iterate_format(bdrv_format_print, NULL, true);
-error_printf("\n");
+qemu_printf("\n");
 goto early_err;
 }
 
-- 
2.17.2

[Qemu-block] [PATCH v2 2/5] block/nvme: fix doorbell stride

2019-04-17 Thread Maxim Levitsky

Fix the math involving non standard doorbell stride

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 2d208000df..208242cf1f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -216,7 +216,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 error_propagate(errp, local_err);
 goto fail;
 }
-q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
 
 return q;
 fail:
-- 
2.17.2

[Qemu-block] [PATCH v2 1/5] block/nvme: don't flip CQ phase bits

2019-04-17 Thread Maxim Levitsky

Phase bits are only set by the hardware to indicate new completions
and not by the device driver.

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 0684bbd077..2d208000df 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -340,8 +340,6 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 qemu_mutex_lock(&q->lock);
 c->cid = cpu_to_le16(0);
 q->inflight--;
-/* Flip Phase Tag bit. */
-c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
 progress = true;
 }
 if (progress) {
-- 
2.17.2

[Qemu-block] [PATCH v2 0/5] Few fixes for userspace NVME driver

2019-04-17 Thread Maxim Levitsky

Hi!
These are few assorted fixes and features for the userspace
nvme driver.

Tested that on my laptop with my Samsung X5 thunderbolt drive, which
happens to have 4K sectors, support for discard and write zeros.

Also bunch of fixes sitting in my queue from the period when I developed
the nvme-mdev driver.

Best regards,
Maxim Levitsky

Maxim Levitsky (5):
  block/nvme: don't flip CQ phase bits
  block/nvme: fix doorbell stride
  block/nvme: support larger that 512 bytes sector devices
  block/nvme: add support for write zeros
  block/nvme: add support for discard

 block/nvme.c | 193 +--
 block/trace-events   |   3 +
 include/block/nvme.h |  19 -
 3 files changed, 205 insertions(+), 10 deletions(-)

-- 
2.17.2

[Qemu-block] [PATCH v2 3/5] block/nvme: support larger that 512 bytes sector devices

2019-04-17 Thread Maxim Levitsky

Currently the driver hardcodes the sector size to 512,
and doesn't check the underlying device

Also fail if underlying nvme device is formatted with metadata
as this needs special support.

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 40 +++-
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 208242cf1f..0b1da54574 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -101,8 +101,11 @@ typedef struct {
 size_t doorbell_scale;
 bool write_cache_supported;
 EventNotifier irq_notifier;
+
 uint64_t nsze; /* Namespace size reported by identify command */
 int nsid;  /* The namespace id to read/write data. */
+size_t blkshift;
+
 uint64_t max_transfer;
 bool plugged;
 
@@ -415,8 +418,9 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 BDRVNVMeState *s = bs->opaque;
 NvmeIdCtrl *idctrl;
 NvmeIdNs *idns;
+NvmeLBAF *lbaf;
 uint8_t *resp;
-int r;
+int r, hwsect_size;
 uint64_t iova;
 NvmeCmd cmd = {
 .opcode = NVME_ADM_CMD_IDENTIFY,
@@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 }
 
 s->nsze = le64_to_cpu(idns->nsze);
+lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
+
+if (lbaf->ms) {
+error_setg(errp, "Namespaces with metadata are not yet supported");
+goto out;
+}
+
+hwsect_size = 1 << lbaf->ds;
+
+if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {
+error_setg(errp, "Namespace has unsupported block size (%d)",
+hwsect_size);
+goto out;
+}
 
+s->blkshift = lbaf->ds;
 out:
 qemu_vfio_dma_unmap(s->vfio, resp);
 qemu_vfree(resp);
@@ -782,8 +801,17 @@ fail:
 static int64_t nvme_getlength(BlockDriverState *bs)
 {
 BDRVNVMeState *s = bs->opaque;
+return s->nsze << s->blkshift;
+}
+
 
-return s->nsze << BDRV_SECTOR_BITS;
+static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+BDRVNVMeState *s = bs->opaque;
+assert(s->blkshift >= 9);
+bsz->phys = 1 << s->blkshift;
+bsz->log = 1 << s->blkshift;
+return 0;
 }
 
 /* Called with s->dma_map_lock */
@@ -914,13 +942,14 @@ static coroutine_fn int 
nvme_co_prw_aligned(BlockDriverState *bs,
 BDRVNVMeState *s = bs->opaque;
 NVMeQueuePair *ioq = s->queues[1];
 NVMeRequest *req;
-uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0x) |
+
+uint32_t cdw12 = (((bytes >> s->blkshift) - 1) & 0x) |
(flags & BDRV_REQ_FUA ? 1 << 30 : 0);
 NvmeCmd cmd = {
 .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
 .nsid = cpu_to_le32(s->nsid),
-.cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0x),
-.cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 
0x),
+.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x),
+.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x),
 .cdw12 = cpu_to_le32(cdw12),
 };
 NVMeCoData data = {
@@ -1151,6 +1180,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_file_open   = nvme_file_open,
 .bdrv_close   = nvme_close,
 .bdrv_getlength   = nvme_getlength,
+.bdrv_probe_blocksizes= nvme_probe_blocksizes,
 
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
-- 
2.17.2

[Qemu-block] [PATCH v2 5/5] block/nvme: add support for discard

2019-04-17 Thread Maxim Levitsky

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c   | 80 ++
 block/trace-events |  2 ++
 2 files changed, 82 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 35b925899f..b83912c627 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -110,6 +110,7 @@ typedef struct {
 bool plugged;
 
 bool supports_write_zeros;
+bool supports_discard;
 
 CoMutex dma_map_lock;
 CoQueue dma_flush_queue;
@@ -462,6 +463,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 
 
 s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
+s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
 
 memset(resp, 0, 4096);
 
@@ -1144,6 +1146,83 @@ static coroutine_fn int 
nvme_co_pwrite_zeroes(BlockDriverState *bs,
 }
 
 
+static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
+int64_t offset, int bytes)
+{
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *ioq = s->queues[1];
+NVMeRequest *req;
+NvmeDsmRange *buf;
+QEMUIOVector local_qiov;
+int r;
+
+NvmeCmd cmd = {
+.opcode = NVME_CMD_DSM,
+.nsid = cpu_to_le32(s->nsid),
+.cdw10 = 0, /*number of ranges - 0 based*/
+.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
+};
+
+NVMeCoData data = {
+.ctx = bdrv_get_aio_context(bs),
+.ret = -EINPROGRESS,
+};
+
+if (!s->supports_discard) {
+return -ENOTSUP;
+}
+
+assert(s->nr_queues > 1);
+
+buf = qemu_try_blockalign0(bs, 4096);
+if (!buf) {
+return -ENOMEM;
+}
+
+buf->nlb = bytes >> s->blkshift;
+buf->slba = offset >> s->blkshift;
+buf->cattr = 0;
+
+qemu_iovec_init(&local_qiov, 1);
+qemu_iovec_add(&local_qiov, buf, 4096);
+
+req = nvme_get_free_req(ioq);
+assert(req);
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+
+if (r) {
+req->busy = false;
+return r;
+}
+
+trace_nvme_dsm(s, offset, bytes);
+
+nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+data.co = qemu_coroutine_self();
+while (data.ret == -EINPROGRESS) {
+qemu_coroutine_yield();
+}
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_unmap_qiov(bs, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+if (r) {
+return r;
+}
+
+trace_nvme_dsm_done(s, offset, bytes, data.ret);
+
+qemu_iovec_destroy(&local_qiov);
+qemu_vfree(buf);
+return data.ret;
+
+}
+
+
 static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp)
 {
@@ -1250,6 +1329,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_co_pwritev  = nvme_co_pwritev,
 
 .bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
+.bdrv_co_pdiscard = nvme_co_pdiscard,
 
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,
diff --git a/block/trace-events b/block/trace-events
index 943a58569f..e55ac5c40b 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -148,6 +148,8 @@ nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, 
int flags) "s %p offs
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
 nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
 nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
+nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset %"PRId64" 
bytes %"PRId64""
+nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
%"PRId64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *q) "q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
-- 
2.17.2

[Qemu-block] [PATCH v2 4/5] block/nvme: add support for write zeros

2019-04-17 Thread Maxim Levitsky

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 69 +++-
 block/trace-events   |  1 +
 include/block/nvme.h | 19 +++-
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 0b1da54574..35b925899f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -109,6 +109,8 @@ typedef struct {
 uint64_t max_transfer;
 bool plugged;
 
+bool supports_write_zeros;
+
 CoMutex dma_map_lock;
 CoQueue dma_flush_queue;
 
@@ -457,6 +459,10 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->max_transfer = MIN_NON_ZERO(s->max_transfer,
   s->page_size / sizeof(uint64_t) * s->page_size);
 
+
+
+s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
+
 memset(resp, 0, 4096);
 
 cmd.cdw10 = 0;
@@ -469,6 +475,11 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->nsze = le64_to_cpu(idns->nsze);
 lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
 
+if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
+NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
+NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
+bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
+
 if (lbaf->ms) {
 error_setg(errp, "Namespaces with metadata are not yet supported");
 goto out;
@@ -763,6 +774,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 int ret;
 BDRVNVMeState *s = bs->opaque;
 
+bs->supported_write_flags = BDRV_REQ_FUA;
+
 opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
 qemu_opts_absorb_qdict(opts, options, &error_abort);
 device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
@@ -791,7 +804,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto fail;
 }
 }
-bs->supported_write_flags = BDRV_REQ_FUA;
 return 0;
 fail:
 nvme_close(bs);
@@ -1080,6 +1092,58 @@ static coroutine_fn int nvme_co_flush(BlockDriverState 
*bs)
 }
 
 
+static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int bytes, BdrvRequestFlags flags)
+{
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *ioq = s->queues[1];
+NVMeRequest *req;
+
+if (!s->supports_write_zeros) {
+return -ENOTSUP;
+}
+
+uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0x;
+
+NvmeCmd cmd = {
+.opcode = NVME_CMD_WRITE_ZEROS,
+.nsid = cpu_to_le32(s->nsid),
+.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x),
+.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x),
+};
+
+NVMeCoData data = {
+.ctx = bdrv_get_aio_context(bs),
+.ret = -EINPROGRESS,
+};
+
+if (flags & BDRV_REQ_MAY_UNMAP) {
+cdw12 |= (1 << 25);
+}
+
+if (flags & BDRV_REQ_FUA) {
+cdw12 |= (1 << 30);
+}
+
+cmd.cdw12 = cpu_to_le32(cdw12);
+
+trace_nvme_write_zeros(s, offset, bytes, flags);
+assert(s->nr_queues > 1);
+req = nvme_get_free_req(ioq);
+assert(req);
+
+nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+data.co = qemu_coroutine_self();
+while (data.ret == -EINPROGRESS) {
+qemu_coroutine_yield();
+}
+
+trace_nvme_rw_done(s, true, offset, bytes, data.ret);
+return data.ret;
+}
+
+
 static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp)
 {
@@ -1184,6 +1248,9 @@ static BlockDriver bdrv_nvme = {
 
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
+
+.bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
+
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,
 
diff --git a/block/trace-events b/block/trace-events
index 7335a42540..943a58569f 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -144,6 +144,7 @@ nvme_submit_command_raw(int c0, int c1, int c2, int c3, int 
c4, int c5, int c6,
 nvme_handle_event(void *s) "s %p"
 nvme_poll_cb(void *s) "s %p"
 nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d 
niov %d"
+nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
offset %"PRId64" bytes %"PRId64" flags %d"
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
 nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
 nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 0eae

Re: [Qemu-block] [Qemu-devel] [PATCH 0/5] Few fixes for userspace NVME driver

2019-04-17 Thread Maxim Levitsky

On Tue, 2019-04-16 at 15:50 +0200, Paolo Bonzini wrote:
> On 15/04/19 15:57, Maxim Levitsky wrote:
> > 
> > 
> > Hi!
> > These are few assorted fixes and features for the userspace
> > nvme driver.
> > 
> > Tested that on my laptop with my Samsung X5 thunderbolt drive, which
> > happens to have 4K sectors, support for discard and write zeros.
> > 
> > Also bunch of fixes sitting in my queue from the period when I developed
> > the nvme-mdev driver.
> > 
> > Best regards,
> > Maxim Levitsky
> 
> I only pointed out just two cosmetic changes.  Thanks very much!
> 
> Paolo
> 
Hi Paolo!
Thank you very much!

I have just resent V2 of the series with the fixes.

Best regards,
Maxim Levitsky

[Qemu-block] [PATCH v3 00/10] block/pflash_cfi02: Implement missing AMD pflash functionality

2019-04-17 Thread Stephen Checkoway

The goal of this patch series implement the following AMD command-set parallel
flash functionality:
- flash interleaving;
- nonuniform sector sizes;
- erase suspend/resume commands; and
- multi-sector erase.

During refactoring and implementation, I discovered several bugs that are
fixed here as well:
- flash commands use only 11-bits of the address in most cases, but the
  current code uses all of them [1];
- entering CFI mode from autoselect mode and then exiting CFI mode should
  return the chip to autoselect mode, but the current code returns to read
  array mode; and
- reset command should be ignored during sector/chip erase, but the current
  code performs the reset.

The first patch in the series adds a test for the existing behavior. Tests for
additional behavior/bug fixes are added in the relevant patch.

1. I found firmware in the wild that relies on the 11-bit address behavior,
   probably due to a bug in the firmware itself.

Changes from v1:
- Fix missing spaces around *, -, and ?;
- Fix missing Signed-off-by line on patch 7; and
- Replace use of errc with g_printerr and exit.

Changes from v2:
- Remove global_qtest from tests; and
- Test the CFI table changes.

Stephen Checkoway (10):
  block/pflash_cfi02: Add test for supported commands
  block/pflash_cfi02: Refactor, NFC intended
  block/pflash_cfi02: Fix command address comparison
  block/pflash_cfi02: Implement intereleaved flash devices
  block/pflash_cfi02: Implement nonuniform sector sizes
  block/pflash_cfi02: Fix CFI in autoselect mode
  block/pflash_cfi02: Fix reset command not ignored during erase
  block/pflash_cfi02: Implement multi-sector erase
  block/pflash_cfi02: Implement erase suspend/resume
  block/pflash_cfi02: Use the chip erase time specified in the CFI table

 hw/block/pflash_cfi02.c   | 843 +++---
 tests/Makefile.include|   2 +
 tests/pflash-cfi02-test.c | 815 
 3 files changed, 1423 insertions(+), 237 deletions(-)
 create mode 100644 tests/pflash-cfi02-test.c

-- 
2.20.1 (Apple Git-117)

[Qemu-block] [PATCH v3 07/10] block/pflash_cfi02: Fix reset command not ignored during erase

2019-04-17 Thread Stephen Checkoway

When the flash device is performing a chip erase, all commands are
ignored. When it is performing a sector erase, only the erase suspend
command is valid, which is currently not supported.

In particular, the reset command should not cause the device to reset to
read array mode while programming is on going.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index be10036886..cb1160eb35 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -325,7 +325,8 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 pfl->bank_width * 2, value);
 }
 
-if (cmd == 0xF0) {
+/* Reset does nothing during chip erase and sector erase. */
+if (cmd == 0xF0 && pfl->cmd != 0x10 && pfl->cmd != 0x30) {
 if (pfl->wcycle == WCYCLE_AUTOSELECT_CFI) {
 /* Return to autoselect mode. */
 pfl->wcycle = 3;
-- 
2.20.1 (Apple Git-117)

[Qemu-block] [PATCH v3 01/10] block/pflash_cfi02: Add test for supported commands

2019-04-17 Thread Stephen Checkoway

Test the AMD command set for parallel flash chips. This test uses an
ARM musicpal board with a pflash drive to test the following list of
currently-supported commands.
- Autoselect
- CFI
- Sector erase
- Chip erase
- Program
- Unlock bypass
- Reset

Signed-off-by: Stephen Checkoway 
---
 tests/Makefile.include|   2 +
 tests/pflash-cfi02-test.c | 227 ++
 2 files changed, 229 insertions(+)
 create mode 100644 tests/pflash-cfi02-test.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 6b904d7430..0a26eacce0 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -263,6 +263,7 @@ check-qtest-arm-y += tests/m25p80-test$(EXESUF)
 check-qtest-arm-y += tests/test-arm-mptimer$(EXESUF)
 check-qtest-arm-y += tests/boot-serial-test$(EXESUF)
 check-qtest-arm-y += tests/hexloader-test$(EXESUF)
+check-qtest-arm-$(CONFIG_PFLASH_CFI02) += tests/pflash-cfi02-test$(EXESUF)
 
 check-qtest-aarch64-y = tests/numa-test$(EXESUF)
 check-qtest-aarch64-y += tests/boot-serial-test$(EXESUF)
@@ -773,6 +774,7 @@ tests/device-introspect-test$(EXESUF): 
tests/device-introspect-test.o
 tests/rtc-test$(EXESUF): tests/rtc-test.o
 tests/m48t59-test$(EXESUF): tests/m48t59-test.o
 tests/hexloader-test$(EXESUF): tests/hexloader-test.o
+tests/pflash-cfi02$(EXESUF): tests/pflash-cfi02-test.o
 tests/endianness-test$(EXESUF): tests/endianness-test.o
 tests/prom-env-test$(EXESUF): tests/prom-env-test.o $(libqos-obj-y)
 tests/rtas-test$(EXESUF): tests/rtas-test.o $(libqos-spapr-obj-y)
diff --git a/tests/pflash-cfi02-test.c b/tests/pflash-cfi02-test.c
new file mode 100644
index 00..b113fca5af
--- /dev/null
+++ b/tests/pflash-cfi02-test.c
@@ -0,0 +1,227 @@
+/*
+ * QTest testcase for parallel flash with AMD command set
+ *
+ * Copyright (c) 2018 Stephen Checkoway
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+#include 
+#include "libqtest.h"
+
+/*
+ * To test the pflash_cfi02 device, we run QEMU with the musicpal machine with
+ * a pflash drive. This enables us to test some flash configurations, but not
+ * all. In particular, we're limited to a 16-bit wide flash device.
+ */
+
+#define MP_FLASH_SIZE_MAX (32 * 1024 * 1024)
+#define BASE_ADDR (0x1ULL - MP_FLASH_SIZE_MAX)
+
+#define FLASH_WIDTH 2
+#define CFI_ADDR (FLASH_WIDTH * 0x55)
+#define UNLOCK0_ADDR (FLASH_WIDTH * 0x)
+#define UNLOCK1_ADDR (FLASH_WIDTH * 0x2AAA)
+
+#define CFI_CMD 0x98
+#define UNLOCK0_CMD 0xAA
+#define UNLOCK1_CMD 0x55
+#define AUTOSELECT_CMD 0x90
+#define RESET_CMD 0xF0
+#define PROGRAM_CMD 0xA0
+#define SECTOR_ERASE_CMD 0x30
+#define CHIP_ERASE_CMD 0x10
+#define UNLOCK_BYPASS_CMD 0x20
+#define UNLOCK_BYPASS_RESET_CMD 0x00
+
+static char image_path[] = "/tmp/qtest.XX";
+
+static inline void flash_write(uint64_t byte_addr, uint16_t data)
+{
+qtest_writew(global_qtest, BASE_ADDR + byte_addr, data);
+}
+
+static inline uint16_t flash_read(uint64_t byte_addr)
+{
+return qtest_readw(global_qtest, BASE_ADDR + byte_addr);
+}
+
+static void unlock(void)
+{
+flash_write(UNLOCK0_ADDR, UNLOCK0_CMD);
+flash_write(UNLOCK1_ADDR, UNLOCK1_CMD);
+}
+
+static void reset(void)
+{
+flash_write(0, RESET_CMD);
+}
+
+static void sector_erase(uint64_t byte_addr)
+{
+unlock();
+flash_write(UNLOCK0_ADDR, 0x80);
+unlock();
+flash_write(byte_addr, SECTOR_ERASE_CMD);
+}
+
+static void wait_for_completion(uint64_t byte_addr)
+{
+/* If DQ6 is toggling, step the clock and ensure the toggle stops. */
+if ((flash_read(byte_addr) & 0x40) ^ (flash_read(byte_addr) & 0x40)) {
+/* Wait for erase or program to finish. */
+clock_step_next();
+/* Ensure that DQ6 has stopped toggling. */
+g_assert_cmpint(flash_read(byte_addr), ==, flash_read(byte_addr));
+}
+}
+
+static void bypass_program(uint64_t byte_addr, uint16_t data)
+{
+flash_write(UNLOCK0_ADDR, PROGRAM_CMD);
+flash_write(byte_addr, data);
+/*
+ * Data isn't valid until DQ6 stops toggling. We don't model this as
+ * writes are immediate, but if this changes in the future, we can wait
+ * until the program is complete.
+ */
+wait_for_completion(byte_addr);
+}
+
+static void program(uint64_t byte_addr, uint16_t data)
+{
+unlock();
+bypass_program(byte_addr, data);
+}
+
+static void chip_erase(void)
+{
+unlock();
+flash_write(UNLOCK0_ADDR, 0x80);
+unlock();
+flash_write(UNLOCK0_ADDR, SECTOR_ERASE_CMD);
+}
+
+static void test_flash(void)
+{
+global_qtest = qtest_initf("-M musicpal,accel=qtest "
+   "-drive 
if=pflash,file=%s,format=raw,copy-on-read",
+   image_path);
+/* Check the IDs. */
+unlock();
+flash_write(UNLOCK0_ADDR, AUTOSELECT_CMD);
+g_assert_cmpint(flash_read(FLASH_WIDTH * 0x), ==, 0x00BF);
+g_assert_cmpint(flash_read(FL

[Qemu-block] [PATCH v3 06/10] block/pflash_cfi02: Fix CFI in autoselect mode

2019-04-17 Thread Stephen Checkoway

After a flash device enters CFI mode from autoselect mode, the reset
command returns the device to autoselect mode. An additional reset
command is necessary to return to read array mode.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   | 21 +
 tests/pflash-cfi02-test.c | 39 +++
 2 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index c4efbe8cdf..be10036886 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -61,8 +61,9 @@ do {   \
  */
 #define PFLASH_MAX_ERASE_REGIONS 4
 
-/* Special write cycle for CFI queries. */
+/* Special write cycles for CFI queries. */
 #define WCYCLE_CFI 7
+#define WCYCLE_AUTOSELECT_CFI 8
 
 struct PFlashCFI02 {
 /*< private >*/
@@ -325,6 +326,12 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 }
 
 if (cmd == 0xF0) {
+if (pfl->wcycle == WCYCLE_AUTOSELECT_CFI) {
+/* Return to autoselect mode. */
+pfl->wcycle = 3;
+pfl->cmd = 0x90;
+return;
+}
 goto reset_flash;
 }
 }
@@ -350,7 +357,6 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 /* We're in read mode */
 check_unlock0:
 if (masked_addr == 0x55 && cmd == 0x98) {
-enter_CFI_mode:
 /* Enter CFI query mode */
 pfl->wcycle = WCYCLE_CFI;
 pfl->cmd = 0x98;
@@ -427,9 +433,15 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 /* Unlock bypass reset */
 goto reset_flash;
 }
-/* We can enter CFI query mode from autoselect mode */
+/*
+ * We can enter CFI query mode from autoselect mode, but we must
+ * return to autoselect mode after a reset.
+ */
 if (masked_addr == 0x55 && cmd == 0x98) {
-goto enter_CFI_mode;
+/* Enter autoselect CFI query mode */
+pfl->wcycle = WCYCLE_AUTOSELECT_CFI;
+pfl->cmd = 0x98;
+return;
 }
 /* No break here */
 default:
@@ -510,6 +522,7 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 }
 break;
 case WCYCLE_CFI: /* Special value for CFI queries */
+case WCYCLE_AUTOSELECT_CFI:
 DPRINTF("%s: invalid write in CFI query mode\n", __func__);
 goto reset_flash;
 default:
diff --git a/tests/pflash-cfi02-test.c b/tests/pflash-cfi02-test.c
index 82bc5695e1..4039647604 100644
--- a/tests/pflash-cfi02-test.c
+++ b/tests/pflash-cfi02-test.c
@@ -480,6 +480,42 @@ static void test_geometry(const void *opaque)
 qtest_quit(qtest);
 }
 
+/*
+ * Test that
+ * 1. enter autoselect mode;
+ * 2. enter CFI mode; and then
+ * 3. exit CFI mode
+ * leaves the flash device in autoselect mode.
+ */
+static void test_cfi_in_autoselect(const void *opaque)
+{
+const FlashConfig *config = opaque;
+QTestState *qtest = qtest_initf("-M musicpal,accel=qtest"
+" -drive if=pflash,file=%s,format=raw,"
+"copy-on-read",
+image_path);
+FlashConfig explicit_config = expand_config_defaults(config);
+explicit_config.qtest = qtest;
+const FlashConfig *c = &explicit_config;
+
+/* 1. Enter autoselect. */
+unlock(c);
+flash_cmd(c, UNLOCK0_ADDR, AUTOSELECT_CMD);
+g_assert_cmpint(flash_query(c, FLASH_ADDR(0)), ==, replicate(c, 0xBF));
+
+/* 2. Enter CFI. */
+flash_cmd(c, CFI_ADDR, CFI_CMD);
+g_assert_cmpint(flash_query(c, FLASH_ADDR(0x10)), ==, replicate(c, 'Q'));
+g_assert_cmpint(flash_query(c, FLASH_ADDR(0x11)), ==, replicate(c, 'R'));
+g_assert_cmpint(flash_query(c, FLASH_ADDR(0x12)), ==, replicate(c, 'Y'));
+
+/* 3. Exit CFI. */
+reset(c);
+g_assert_cmpint(flash_query(c, FLASH_ADDR(0)), ==, replicate(c, 0xBF));
+
+qtest_quit(qtest);
+}
+
 static void cleanup(void *opaque)
 {
 unlink(image_path);
@@ -605,6 +641,9 @@ int main(int argc, char **argv)
 qtest_add_data_func(path, config, test_geometry);
 g_free(path);
 }
+
+qtest_add_data_func("pflash-cfi02/cfi-in-autoselect", &configuration[0],
+test_cfi_in_autoselect);
 int result = g_test_run();
 cleanup(NULL);
 return result;
-- 
2.20.1 (Apple Git-117)

[Qemu-block] [PATCH v3 08/10] block/pflash_cfi02: Implement multi-sector erase

2019-04-17 Thread Stephen Checkoway

After two unlock cycles and a sector erase command, the AMD flash chips
start a 50 us erase time out. Any additional sector erase commands add a
sector to be erased and restart the 50 us timeout. During the timeout,
status bit DQ3 is cleared. After the time out, DQ3 is asserted during
erasure.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   | 94 +++
 tests/pflash-cfi02-test.c | 59 ++--
 2 files changed, 131 insertions(+), 22 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index cb1160eb35..21ceb0823b 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -30,7 +30,6 @@
  *
  * It does not implement software data protection as found in many real chips
  * It does not implement erase suspend/resume commands
- * It does not implement multiple sectors erase
  */
 
 #include "qemu/osdep.h"
@@ -106,6 +105,7 @@ struct PFlashCFI02 {
 MemoryRegion orig_mem;
 int rom_mode;
 int read_counter; /* used for lazy switch-back to rom mode */
+int sectors_to_erase;
 char *name;
 void *storage;
 };
@@ -136,6 +136,22 @@ static inline void toggle_dq6(PFlashCFI02 *pfl)
 pfl->status ^= pfl->interleave_multiplier * 0x40;
 }
 
+/*
+ * Turn on DQ3.
+ */
+static inline void assert_dq3(PFlashCFI02 *pfl)
+{
+pfl->status |= pfl->interleave_multiplier * 0x08;
+}
+
+/*
+ * Turn off DQ3.
+ */
+static inline void reset_dq3(PFlashCFI02 *pfl)
+{
+pfl->status &= ~(pfl->interleave_multiplier * 0x08);
+}
+
 /*
  * Set up replicated mappings of the same region.
  */
@@ -159,11 +175,37 @@ static void pflash_register_memory(PFlashCFI02 *pfl, int 
rom_mode)
 pfl->rom_mode = rom_mode;
 }
 
-static void pflash_timer (void *opaque)
+static void pflash_timer(void *opaque)
 {
 PFlashCFI02 *pfl = opaque;
 
 trace_pflash_timer_expired(pfl->cmd);
+if (pfl->cmd == 0x30) {
+/*
+ * Sector erase. If DQ3 is 0 when the timer expires, then the 50
+ * us erase timeout has expired so we need to start the timer for the
+ * sector erase algorithm. Otherwise, the erase completed and we should
+ * go back to read array mode.
+ */
+if ((pfl->status & 0x08) == 0) {
+assert_dq3(pfl);
+/*
+ * CFI address 0x21 is "Typical timeout per individual block erase
+ * 2^N ms"
+ */
+uint64_t timeout = ((1ULL << pfl->cfi_table[0x21]) *
+pfl->sectors_to_erase) * 100;
+timer_mod(&pfl->timer,
+  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + timeout);
+DPRINTF("%s: erase timeout fired; erasing %d sectors\n",
+__func__, pfl->sectors_to_erase);
+return;
+}
+DPRINTF("%s: sector erase complete\n", __func__);
+pfl->sectors_to_erase = 0;
+reset_dq3(pfl);
+}
+
 /* Reset flash */
 toggle_dq7(pfl);
 if (pfl->bypass) {
@@ -307,13 +349,30 @@ static void pflash_update(PFlashCFI02 *pfl, int offset, 
int size)
 }
 }
 
+static void pflash_sector_erase(PFlashCFI02 *pfl, hwaddr offset)
+{
+uint64_t sector_len = pflash_sector_len(pfl, offset);
+offset &= ~(sector_len - 1);
+DPRINTF("%s: start sector erase at %0*" PRIx64 "-%0*" PRIx64 "\n",
+__func__, pfl->bank_width * 2, offset,
+pfl->bank_width * 2, offset + sector_len - 1);
+if (!pfl->ro) {
+uint8_t *p = pfl->storage;
+memset(p + offset, 0xFF, sector_len);
+pflash_update(pfl, offset, sector_len);
+}
+set_dq7(pfl, 0x00);
+++pfl->sectors_to_erase;
+/* Set (or reset) the 50 us timer for additional erase commands.  */
+timer_mod(&pfl->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 5);
+}
+
 static void pflash_write(void *opaque, hwaddr offset, uint64_t value,
  unsigned int width)
 {
 PFlashCFI02 *pfl = opaque;
 uint8_t *p;
 uint8_t cmd;
-uint32_t sector_len;
 
 cmd = value;
 if (pfl->cmd != 0xA0) {
@@ -486,20 +545,7 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 break;
 case 0x30:
 /* Sector erase */
-p = pfl->storage;
-sector_len = pflash_sector_len(pfl, offset);
-offset &= ~(sector_len - 1);
-DPRINTF("%s: start sector erase at %0*" PRIx64 "-%0*" PRIx64 "\n",
-__func__, pfl->bank_width * 2, offset,
-pfl->bank_width * 2, offset + sector_len - 1);
-if (!pfl->ro) {
-memset(p + offset, 0xFF, sector_len);
-pflash_update(pfl, offset, sector_len);
-}
-set_dq7(pfl, 0x00);
-/* Let's wait 1/2 second before sector erase is done */
-timer_mod(&pfl->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
-  (NANOSECONDS_PER_SECOND / 2));
+

[Qemu-block] [PATCH v3 05/10] block/pflash_cfi02: Implement nonuniform sector sizes

2019-04-17 Thread Stephen Checkoway

Some flash chips support sectors of different sizes. For example, the
AMD AM29LV160DT has 31 64 kB sectors, one 32 kB sector, two 8 kB
sectors, and a 16 kB sector, in that order. The AM29LV160DB has those in
the reverse order.

The `num-blocks` and `sector-length` properties work exactly as they did
before: a flash device with uniform sector lengths. To get non-uniform
sector lengths for up to four regions, the following properties may be
set
- region 0. `num-blocks0` and `sector-length0`;
- region 1. `num-blocks1` and `sector-length1`;
- region 2. `num-blocks2` and `sector-length2`; and
- region 3. `num-blocks3` and `sector-length3`.

If the uniform and nonuniform properties are set, then both must specify
a flash device with the same total size. It would be better to disallow
both being set, or make `num-blocks0` and `sector-length0` alias
`num-blocks` and `sector-length`, but that would make testing currently
impossible.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   | 177 +---
 tests/pflash-cfi02-test.c | 185 --
 2 files changed, 265 insertions(+), 97 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 101628b4ec..c4efbe8cdf 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -28,7 +28,6 @@
  * - unlock bypass command
  * - CFI queries
  *
- * It does not implement boot blocs with reduced size
  * It does not implement software data protection as found in many real chips
  * It does not implement erase suspend/resume commands
  * It does not implement multiple sectors erase
@@ -55,6 +54,13 @@ do {   \
 
 #define PFLASH_LAZY_ROMD_THRESHOLD 42
 
+/*
+ * The size of the cfi_table indirectly depends on this and the start of the
+ * PRI table directly depends on it. 4 is the maximum size (and also what
+ * seems common) without changing the PRT table address.
+ */
+#define PFLASH_MAX_ERASE_REGIONS 4
+
 /* Special write cycle for CFI queries. */
 #define WCYCLE_CFI 7
 
@@ -64,8 +70,10 @@ struct PFlashCFI02 {
 /*< public >*/
 
 BlockBackend *blk;
-uint32_t sector_len;
-uint32_t nb_blocs;
+uint32_t uniform_nb_blocs;
+uint32_t uniform_sector_len;
+uint32_t nb_blocs[PFLASH_MAX_ERASE_REGIONS];
+uint32_t sector_len[PFLASH_MAX_ERASE_REGIONS];
 uint64_t total_len;
 uint64_t interleave_multiplier;
 uint8_t mappings;
@@ -86,7 +94,7 @@ struct PFlashCFI02 {
 uint16_t ident3;
 uint16_t unlock_addr0;
 uint16_t unlock_addr1;
-uint8_t cfi_table[0x52];
+uint8_t cfi_table[0x4D];
 QEMUTimer timer;
 /* The device replicates the flash memory across its memory space.  Emulate
  * that by having a container (.mem) filled with an array of aliases
@@ -189,6 +197,25 @@ static uint64_t pflash_data_read(PFlashCFI02 *pfl, hwaddr 
offset,
 return ret;
 }
 
+/*
+ * offset should be a byte offset of the QEMU device and _not_ a device
+ * offset.
+ */
+static uint32_t pflash_sector_len(PFlashCFI02 *pfl, hwaddr offset)
+{
+assert(offset < pfl->total_len);
+int nb_regions = pfl->cfi_table[0x2C];
+hwaddr addr = 0;
+for (int i = 0; i < nb_regions; ++i) {
+uint64_t region_size = (uint64_t)pfl->nb_blocs[i] * pfl->sector_len[i];
+if (addr <= offset && offset < addr + region_size) {
+return pfl->sector_len[i];
+}
+addr += region_size;
+}
+abort();
+}
+
 static uint64_t pflash_read(void *opaque, hwaddr offset, unsigned int width)
 {
 PFlashCFI02 *pfl = opaque;
@@ -285,6 +312,7 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 PFlashCFI02 *pfl = opaque;
 uint8_t *p;
 uint8_t cmd;
+uint32_t sector_len;
 
 cmd = value;
 if (pfl->cmd != 0xA0) {
@@ -446,12 +474,14 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 case 0x30:
 /* Sector erase */
 p = pfl->storage;
-offset &= ~(pfl->sector_len - 1);
-DPRINTF("%s: start sector erase at " TARGET_FMT_plx "\n", __func__,
-offset);
+sector_len = pflash_sector_len(pfl, offset);
+offset &= ~(sector_len - 1);
+DPRINTF("%s: start sector erase at %0*" PRIx64 "-%0*" PRIx64 "\n",
+__func__, pfl->bank_width * 2, offset,
+pfl->bank_width * 2, offset + sector_len - 1);
 if (!pfl->ro) {
-memset(p + offset, 0xFF, pfl->sector_len);
-pflash_update(pfl, offset, pfl->sector_len);
+memset(p + offset, 0xFF, sector_len);
+pflash_update(pfl, offset, sector_len);
 }
 set_dq7(pfl, 0x00);
 /* Let's wait 1/2 second before sector erase is done */
@@ -515,15 +545,14 @@ static const MemoryRegionOps pflash_cfi02_ops = {
 static void pflash_cfi02_realize(DeviceState *dev, Err

[Qemu-block] [PATCH v3 04/10] block/pflash_cfi02: Implement intereleaved flash devices

2019-04-17 Thread Stephen Checkoway

It's common for multiple narrow flash chips to be hooked up in parallel
to support wider buses. For example, four 8-bit wide flash chips (x8)
may be combined in parallel to produce a 32-bit wide device. Similarly,
two 16-bit wide chips (x16) may be combined.

This commit introduces `device-width` and `max-device-width` properties,
similar to pflash_cfi01, with the following meanings:
- `width`: The width of the logical, qemu device (same as before);
- `device-width`: The width of an individual flash chip, defaulting to
  `width`; and
- `max-device-width`: The maximum width of an individual flash chip,
  defaulting to `device-width`.

Nothing needs to change to support reading such interleaved devices but
commands (e.g., erase and programming) must be sent to all devices at
the same time or else the various chips will be in different states.

For example, a 4-byte wide logical device can be composed of four x8/x16
devices in x8 mode. That is, each device supports both x8 or x16 and
they're being used in the byte, rather than word, mode. This
configuration would have `width=4`, `device-width=1`, and
`max-device-width=2`.

In addition to commands being sent to all devices, guest firmware
expects the status and CFI queries to be replicated for each device.
(The one exception to the response replication is that each device gets
to report its own status bit DQ7 while programming because its value
depends on the value being programmed which will usually differ for each
device.)

Testing is limited to 16-bit wide devices due to the current inability
to override the properties set by `pflash_cfi02_register`, but multiple
configurations are tested.

Stop using global_qtest. Instead, package the qtest variable inside the
FlashConfig structure.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   | 270 +++--
 tests/pflash-cfi02-test.c | 477 ++
 2 files changed, 577 insertions(+), 170 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index e4bff0c8f8..101628b4ec 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -28,7 +28,6 @@
  * - unlock bypass command
  * - CFI queries
  *
- * It does not support flash interleaving.
  * It does not implement boot blocs with reduced size
  * It does not implement software data protection as found in many real chips
  * It does not implement erase suspend/resume commands
@@ -67,15 +66,19 @@ struct PFlashCFI02 {
 BlockBackend *blk;
 uint32_t sector_len;
 uint32_t nb_blocs;
-uint32_t chip_len;
+uint64_t total_len;
+uint64_t interleave_multiplier;
 uint8_t mappings;
-uint8_t width;
+uint8_t bank_width; /* Width of the QEMU device in bytes. */
+uint8_t device_width; /* Width of individual pflash chip. */
+uint8_t max_device_width; /* Maximum width of individual pflash chip. */
 uint8_t be;
+int device_shift; /* Amount to shift an offset to get a device address. */
 int wcycle; /* if 0, the flash is read normally */
 int bypass;
 int ro;
 uint8_t cmd;
-uint8_t status;
+uint64_t status;
 /* FIXME: implement array device properties */
 uint16_t ident0;
 uint16_t ident1;
@@ -103,16 +106,17 @@ struct PFlashCFI02 {
  */
 static inline void toggle_dq7(PFlashCFI02 *pfl)
 {
-pfl->status ^= 0x80;
+pfl->status ^= pfl->interleave_multiplier * 0x80;
 }
 
 /*
  * Set status bit DQ7 to bit 7 of value.
  */
-static inline void set_dq7(PFlashCFI02 *pfl, uint8_t value)
+static inline void set_dq7(PFlashCFI02 *pfl, uint64_t value)
 {
-pfl->status &= 0x7F;
-pfl->status |= value & 0x80;
+uint64_t mask = pfl->interleave_multiplier * 0x80;
+pfl->status &= ~mask;
+pfl->status |= value & mask;
 }
 
 /*
@@ -120,7 +124,7 @@ static inline void set_dq7(PFlashCFI02 *pfl, uint8_t value)
  */
 static inline void toggle_dq6(PFlashCFI02 *pfl)
 {
-pfl->status ^= 0x40;
+pfl->status ^= pfl->interleave_multiplier * 0x40;
 }
 
 /*
@@ -188,7 +192,6 @@ static uint64_t pflash_data_read(PFlashCFI02 *pfl, hwaddr 
offset,
 static uint64_t pflash_read(void *opaque, hwaddr offset, unsigned int width)
 {
 PFlashCFI02 *pfl = opaque;
-hwaddr boff;
 uint64_t ret;
 
 ret = -1;
@@ -198,12 +201,10 @@ static uint64_t pflash_read(void *opaque, hwaddr offset, 
unsigned int width)
 ++pfl->read_counter > PFLASH_LAZY_ROMD_THRESHOLD) {
 pflash_register_memory(pfl, 1);
 }
-offset &= pfl->chip_len - 1;
-boff = offset & 0xFF;
-if (pfl->width == 2)
-boff = boff >> 1;
-else if (pfl->width == 4)
-boff = boff >> 2;
+/* Mask by the total length of the chip to account for alias mappings. */
+offset &= pfl->total_len - 1;
+hwaddr device_addr = offset >> pfl->device_shift;
+
 switch (pfl->cmd) {
 default:
 /* This should never happen : reset state & treat it as a read*/
@@ -215,29 +216,32 @@ static uint64_t pflash_read(void *opaque, h

[Qemu-block] [PATCH v3 03/10] block/pflash_cfi02: Fix command address comparison

2019-04-17 Thread Stephen Checkoway

Most AMD commands only examine 11 bits of the address. This masks the
addresses used in the comparison to 11 bits. The exceptions are word or
sector addresses which use offset directly rather than the shifted
offset, boff.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   |  8 +++-
 tests/pflash-cfi02-test.c | 12 ++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 4b7af71806..e4bff0c8f8 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -296,11 +296,13 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 
 DPRINTF("%s: offset " TARGET_FMT_plx " %08" PRIx64 " %d\n", __func__,
 offset, value, width);
-boff = offset & (pfl->sector_len - 1);
+boff = offset;
 if (pfl->width == 2)
 boff = boff >> 1;
 else if (pfl->width == 4)
 boff = boff >> 2;
+/* Only the least-significant 11 bits are used in most cases. */
+boff &= 0x7FF;
 switch (pfl->wcycle) {
 case 0:
 /* Set the device in I/O access mode if required */
@@ -519,6 +521,10 @@ static void pflash_cfi02_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
+/* Only 11 bits are used in the comparison. */
+pfl->unlock_addr0 &= 0x7FF;
+pfl->unlock_addr1 &= 0x7FF;
+
 chip_len = pfl->sector_len * pfl->nb_blocs;
 
 memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
diff --git a/tests/pflash-cfi02-test.c b/tests/pflash-cfi02-test.c
index b113fca5af..b91bb66a79 100644
--- a/tests/pflash-cfi02-test.c
+++ b/tests/pflash-cfi02-test.c
@@ -23,8 +23,8 @@
 
 #define FLASH_WIDTH 2
 #define CFI_ADDR (FLASH_WIDTH * 0x55)
-#define UNLOCK0_ADDR (FLASH_WIDTH * 0x)
-#define UNLOCK1_ADDR (FLASH_WIDTH * 0x2AAA)
+#define UNLOCK0_ADDR (FLASH_WIDTH * 0x555)
+#define UNLOCK1_ADDR (FLASH_WIDTH * 0x2AA)
 
 #define CFI_CMD 0x98
 #define UNLOCK0_CMD 0xAA
@@ -192,6 +192,14 @@ static void test_flash(void)
 g_assert_cmpint(flash_read(6), ==, 0xCDEF);
 g_assert_cmpint(flash_read(8), ==, 0x);
 
+/* Test ignored high order bits of address. */
+flash_write(FLASH_WIDTH * 0x, UNLOCK0_CMD);
+flash_write(FLASH_WIDTH * 0x2AAA, UNLOCK1_CMD);
+flash_write(FLASH_WIDTH * 0x, AUTOSELECT_CMD);
+g_assert_cmpint(flash_read(FLASH_WIDTH * 0x), ==, 0x00BF);
+g_assert_cmpint(flash_read(FLASH_WIDTH * 0x0001), ==, 0x236D);
+reset();
+
 qtest_quit(global_qtest);
 }
 
-- 
2.20.1 (Apple Git-117)

[Qemu-block] [PATCH v3 02/10] block/pflash_cfi02: Refactor, NFC intended

2019-04-17 Thread Stephen Checkoway

Simplify and refactor for upcoming commits. In particular, pull out all
of the code to modify the status into simple helper functions. Status
handling becomes more complex once multiple chips are interleaved to
produce a single device.

No change in functionality is intended with this commit.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c | 221 +---
 1 file changed, 95 insertions(+), 126 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index f2c6201f81..4b7af71806 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -46,18 +46,19 @@
 #include "hw/sysbus.h"
 #include "trace.h"
 
-//#define PFLASH_DEBUG
-#ifdef PFLASH_DEBUG
+#define PFLASH_DEBUG false
 #define DPRINTF(fmt, ...)  \
 do {   \
-fprintf(stderr, "PFLASH: " fmt , ## __VA_ARGS__);   \
+if (PFLASH_DEBUG) {\
+fprintf(stderr, "PFLASH: " fmt, ## __VA_ARGS__);   \
+}  \
 } while (0)
-#else
-#define DPRINTF(fmt, ...) do { } while (0)
-#endif
 
 #define PFLASH_LAZY_ROMD_THRESHOLD 42
 
+/* Special write cycle for CFI queries. */
+#define WCYCLE_CFI 7
+
 struct PFlashCFI02 {
 /*< private >*/
 SysBusDevice parent_obj;
@@ -97,6 +98,31 @@ struct PFlashCFI02 {
 void *storage;
 };
 
+/*
+ * Toggle status bit DQ7.
+ */
+static inline void toggle_dq7(PFlashCFI02 *pfl)
+{
+pfl->status ^= 0x80;
+}
+
+/*
+ * Set status bit DQ7 to bit 7 of value.
+ */
+static inline void set_dq7(PFlashCFI02 *pfl, uint8_t value)
+{
+pfl->status &= 0x7F;
+pfl->status |= value & 0x80;
+}
+
+/*
+ * Toggle status bit DQ6.
+ */
+static inline void toggle_dq6(PFlashCFI02 *pfl)
+{
+pfl->status ^= 0x40;
+}
+
 /*
  * Set up replicated mappings of the same region.
  */
@@ -126,7 +152,7 @@ static void pflash_timer (void *opaque)
 
 trace_pflash_timer_expired(pfl->cmd);
 /* Reset flash */
-pfl->status ^= 0x80;
+toggle_dq7(pfl);
 if (pfl->bypass) {
 pfl->wcycle = 2;
 } else {
@@ -136,12 +162,34 @@ static void pflash_timer (void *opaque)
 pfl->cmd = 0;
 }
 
-static uint32_t pflash_read(PFlashCFI02 *pfl, hwaddr offset,
-int width, int be)
+/*
+ * Read data from flash.
+ */
+static uint64_t pflash_data_read(PFlashCFI02 *pfl, hwaddr offset,
+ unsigned int width)
 {
+uint8_t *p = (uint8_t *)pfl->storage + offset;
+uint64_t ret = pfl->be ? ldn_be_p(p, width) : ldn_le_p(p, width);
+/* XXX: Need a trace_pflash_data_read(offset, ret, width) */
+switch (width) {
+case 1:
+trace_pflash_data_read8(offset, ret);
+break;
+case 2:
+trace_pflash_data_read16(offset, ret);
+break;
+case 4:
+trace_pflash_data_read32(offset, ret);
+break;
+}
+return ret;
+}
+
+static uint64_t pflash_read(void *opaque, hwaddr offset, unsigned int width)
+{
+PFlashCFI02 *pfl = opaque;
 hwaddr boff;
-uint32_t ret;
-uint8_t *p;
+uint64_t ret;
 
 ret = -1;
 trace_pflash_read(offset, pfl->cmd, width, pfl->wcycle);
@@ -166,39 +214,8 @@ static uint32_t pflash_read(PFlashCFI02 *pfl, hwaddr 
offset,
 case 0x80:
 /* We accept reads during second unlock sequence... */
 case 0x00:
-flash_read:
 /* Flash area read */
-p = pfl->storage;
-switch (width) {
-case 1:
-ret = p[offset];
-trace_pflash_data_read8(offset, ret);
-break;
-case 2:
-if (be) {
-ret = p[offset] << 8;
-ret |= p[offset + 1];
-} else {
-ret = p[offset];
-ret |= p[offset + 1] << 8;
-}
-trace_pflash_data_read16(offset, ret);
-break;
-case 4:
-if (be) {
-ret = p[offset] << 24;
-ret |= p[offset + 1] << 16;
-ret |= p[offset + 2] << 8;
-ret |= p[offset + 3];
-} else {
-ret = p[offset];
-ret |= p[offset + 1] << 8;
-ret |= p[offset + 2] << 16;
-ret |= p[offset + 3] << 24;
-}
-trace_pflash_data_read32(offset, ret);
-break;
-}
+ret = pflash_data_read(pfl, offset, width);
 break;
 case 0x90:
 /* flash ID read */
@@ -213,23 +230,23 @@ static uint32_t pflash_read(PFlashCFI02 *pfl, hwaddr 
offset,
 case 0x0E:
 case 0x0F:
 ret = boff & 0x01 ? pfl->ident3 : pfl->ident2;
-if (ret == (uint8_t)-1) {
-goto flash_read;
+if (ret != (uint8_t)-1) {
+break;
 }
-break;
+/* Fall through to data read. */
 default:
-

[Qemu-block] [PATCH v3 09/10] block/pflash_cfi02: Implement erase suspend/resume

2019-04-17 Thread Stephen Checkoway

During a sector erase (but not a chip erase), the embeded erase program
can be suspended. Once suspended, the sectors not selected for erasure
may be read and programmed. Autoselect mode is allowed during erase
suspend mode. Presumably, CFI queries are similarly allowed so this
commit allows them as well.

Since guest firmware can use status bits DQ7, DQ6, DQ3, and DQ2 to
determine the current state of sector erasure, these bits are properly
implemented.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c   | 153 ++
 tests/pflash-cfi02-test.c | 112 
 2 files changed, 251 insertions(+), 14 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 21ceb0823b..d9087cafff 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -29,7 +29,6 @@
  * - CFI queries
  *
  * It does not implement software data protection as found in many real chips
- * It does not implement erase suspend/resume commands
  */
 
 #include "qemu/osdep.h"
@@ -37,6 +36,7 @@
 #include "hw/block/block.h"
 #include "hw/block/flash.h"
 #include "qapi/error.h"
+#include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "sysemu/block-backend.h"
 #include "qemu/host-utils.h"
@@ -72,6 +72,7 @@ struct PFlashCFI02 {
 BlockBackend *blk;
 uint32_t uniform_nb_blocs;
 uint32_t uniform_sector_len;
+uint32_t total_sectors;
 uint32_t nb_blocs[PFLASH_MAX_ERASE_REGIONS];
 uint32_t sector_len[PFLASH_MAX_ERASE_REGIONS];
 uint64_t total_len;
@@ -106,6 +107,8 @@ struct PFlashCFI02 {
 int rom_mode;
 int read_counter; /* used for lazy switch-back to rom mode */
 int sectors_to_erase;
+uint64_t erase_time_remaining;
+unsigned long *sector_erase_map;
 char *name;
 void *storage;
 };
@@ -152,6 +155,14 @@ static inline void reset_dq3(PFlashCFI02 *pfl)
 pfl->status &= ~(pfl->interleave_multiplier * 0x08);
 }
 
+/*
+ * Toggle status bit DQ2.
+ */
+static inline void toggle_dq2(PFlashCFI02 *pfl)
+{
+pfl->status ^= pfl->interleave_multiplier * 0x04;
+}
+
 /*
  * Set up replicated mappings of the same region.
  */
@@ -175,6 +186,29 @@ static void pflash_register_memory(PFlashCFI02 *pfl, int 
rom_mode)
 pfl->rom_mode = rom_mode;
 }
 
+/*
+ * Returns the time it takes to erase the number of sectors scheduled for
+ * erasure based on CFI address 0x21 which is "Typical timeout per individual
+ * block erase 2^N ms."
+ */
+static uint64_t pflash_erase_time(PFlashCFI02 *pfl)
+{
+/*
+ * If there are no sectors to erase (which can happen if all of the sectors
+ * to be erased are protected), then erase takes 100 us. Protected sectors
+ * aren't supported so this should never happen.
+ */
+return ((1ULL << pfl->cfi_table[0x21]) * pfl->sectors_to_erase) * SCALE_US;
+}
+
+/*
+ * Returns true if the device is currently in erase suspend mode.
+ */
+static inline bool pflash_erase_suspend_mode(PFlashCFI02 *pfl)
+{
+return pfl->erase_time_remaining > 0;
+}
+
 static void pflash_timer(void *opaque)
 {
 PFlashCFI02 *pfl = opaque;
@@ -189,12 +223,7 @@ static void pflash_timer(void *opaque)
  */
 if ((pfl->status & 0x08) == 0) {
 assert_dq3(pfl);
-/*
- * CFI address 0x21 is "Typical timeout per individual block erase
- * 2^N ms"
- */
-uint64_t timeout = ((1ULL << pfl->cfi_table[0x21]) *
-pfl->sectors_to_erase) * 100;
+uint64_t timeout = pflash_erase_time(pfl);
 timer_mod(&pfl->timer,
   qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + timeout);
 DPRINTF("%s: erase timeout fired; erasing %d sectors\n",
@@ -202,6 +231,7 @@ static void pflash_timer(void *opaque)
 return;
 }
 DPRINTF("%s: sector erase complete\n", __func__);
+bitmap_zero(pfl->sector_erase_map, pfl->total_sectors);
 pfl->sectors_to_erase = 0;
 reset_dq3(pfl);
 }
@@ -240,25 +270,45 @@ static uint64_t pflash_data_read(PFlashCFI02 *pfl, hwaddr 
offset,
 return ret;
 }
 
+typedef struct {
+uint32_t len;
+uint32_t num;
+} SectorInfo;
+
 /*
  * offset should be a byte offset of the QEMU device and _not_ a device
  * offset.
  */
-static uint32_t pflash_sector_len(PFlashCFI02 *pfl, hwaddr offset)
+static SectorInfo pflash_sector_info(PFlashCFI02 *pfl, hwaddr offset)
 {
 assert(offset < pfl->total_len);
 int nb_regions = pfl->cfi_table[0x2C];
 hwaddr addr = 0;
+uint32_t sector_num = 0;
 for (int i = 0; i < nb_regions; ++i) {
 uint64_t region_size = (uint64_t)pfl->nb_blocs[i] * pfl->sector_len[i];
 if (addr <= offset && offset < addr + region_size) {
-return pfl->sector_len[i];
+return (SectorInfo) {
+.len = pfl->sector_len[i],
+.num = sector_num + (offset - addr) / pfl->sector_len[i],
+};

[Qemu-block] [PATCH v3 10/10] block/pflash_cfi02: Use the chip erase time specified in the CFI table

2019-04-17 Thread Stephen Checkoway

When erasing the chip, use the typical time specified in the CFI table
rather than arbitrarily selecting 5 seconds.

Since the currently unconfigurable value set in the table is 12, this
means a chip erase takes 4096 ms so this isn't a big change in behavior.

Signed-off-by: Stephen Checkoway 
---
 hw/block/pflash_cfi02.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index d9087cafff..76c8af4365 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -633,9 +633,9 @@ static void pflash_write(void *opaque, hwaddr offset, 
uint64_t value,
 pflash_update(pfl, 0, pfl->total_len);
 }
 set_dq7(pfl, 0x00);
-/* Let's wait 5 seconds before chip erase is done */
+/* Wait the time specified at CFI address 0x22. */
 timer_mod(&pfl->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
-  (NANOSECONDS_PER_SECOND * 5));
+  (1ULL << pfl->cfi_table[0x22]) * SCALE_MS);
 break;
 case 0x30:
 /* Sector erase */
-- 
2.20.1 (Apple Git-117)

[Qemu-block] [PATCH] docs/interop/bitmaps: rewrite and modernize doc

2019-04-17 Thread John Snow

This just about rewrites the entirety of the bitmaps.rst document to
make it consistent with the 4.0 release. I have added new features seen
in the 4.0 release, as well as tried to clarify some points that keep
coming up when discussing this feature both in-house and upstream.

Yes, it's a lot longer, mostly due to examples. I get a bit chatty.
I could use a good editor to help reign in my chattiness.

It does not yet cover pull backups or migration details, but I intend to
keep extending this document to cover those cases.

Please try compiling it with sphinx and look at the rendered output, I
don't have hosting to share my copy at present. I think this new layout
reads nicer in the HTML format than the old one did, at the expense of
looking less readable in the source tree itself (though not completely
unmanagable. We did decide to convert it from Markdown to ReST, after
all, so I am going all-in on ReST.)

Signed-off-by: John Snow 
---
 docs/interop/bitmaps.rst | 1499 ++
 Makefile |2 +-
 2 files changed, 1192 insertions(+), 309 deletions(-)

diff --git a/docs/interop/bitmaps.rst b/docs/interop/bitmaps.rst
index 7bcfe7f461..a39d1fc871 100644
--- a/docs/interop/bitmaps.rst
+++ b/docs/interop/bitmaps.rst
@@ -9,128 +9,481 @@
 Dirty Bitmaps and Incremental Backup
 
 
--  Dirty Bitmaps are objects that track which data needs to be backed up
-   for the next incremental backup.
+Dirty Bitmaps are in-memory objects that track writes to block devices. They 
can
+be used in conjunction with various block job operations to perform incremental
+or differential backup regimens.
 
--  Dirty bitmaps can be created at any time and attached to any node
-   (not just complete drives).
+This document explains the conceptual mechanisms, as well as up-to-date,
+complete and comprehensive documentation on the API to manipulate them.
+(Hopefully, the "why", "what", and "how".)
+
+The intended audience for this document is developers who are adding QEMU 
backup
+features to management applications, or power users who run and administer QEMU
+directly via QMP.
 
 .. contents::
 
+Overview
+
+
+Bitmaps are bit vectors where each '1' bit in the vector indicates a modified
+("dirty") segment of the corresponding block device. The size of the segment
+that is tracked is the granularity of the bitmap. If the granularity of a 
bitmap
+is 64K, each '1' bit means that an entire 64K region changed in some way.
+
+Smaller granularities mean more accurate tracking of modified disk data, but
+requires more computational overhead and larger bitmap sizes. Larger
+granularities mean smaller bitmap sizes, but less targeted backups.
+
+The size of a bitmap (in bytes) can be computed as such:
+``size`` = ((``image_size`` / ``granularity``) / 8)
+
+e.g. the size of a 64KiB granularity bitmap on a 2TiB image is:
+``size`` = ((2147483648K / 64K) / 8)
+ = 4194304B = 4MiB.
+
+QEMU uses these bitmaps when making incremental backups to know which
+sections of the file to copy out. They are not enabled by default and
+must be explicitly added in order to begin tracking writes.
+
+Bitmaps can be created at any time and can be attached to any
+arbitrary block node in the storage graph, but are most useful
+conceptually when attached to the root node attached to the guest's
+storage device model.
+
+(Which is a really chatty way of saying: It's likely most useful to
+track the guest's writes to disk, but you could theoretically track
+things like qcow2 metadata changes by attaching the bitmap elsewhere
+in the storage graph.)
+
+QEMU supports persisting these bitmaps to disk via the qcow2 image format.
+Bitmaps which are stored or loaded in this way are called "persistent", whereas
+bitmaps that are not are called "transient".
+
+QEMU also supports the migration of both transient bitmaps (tracking any
+arbitrary image format) or persistent bitaps (qcow2) via live migration.
+
+Supported Image Formats
+---
+
+QEMU supports all documented features below on the qcow2 image format.
+
+However, qcow2 is only strictly necessary for the persistence feature, which
+writes bitmap data to disk upon close. If persistence is not required for a
+specific use case, all bitmap features excepting persistence are available
+for any arbitrary image format.
+
+For example, Dirty Bitmaps can be combined with the 'raw' image format,
+but any changes to the bitmap will be discarded upon exit.
+
+.. warning:: Transient bitmaps will not be saved on QEMU exit! Persistent
+ bitmaps are available only on qcow2 images.
+
 Dirty Bitmap Names
 --
 
--  A dirty bitmap's name is unique to the node, but bitmaps attached to
-   different nodes can share the same name.
+Bitmap objects need a method to reference them in the API. All API-created and
+managed bitmaps have a human-readable name chosen by the user at creation time.

Re: [Qemu-block] [PATCH v3 03/10] block/pflash_cfi02: Fix command address comparison

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> Most AMD commands only examine 11 bits of the address. This masks the
> addresses used in the comparison to 11 bits. The exceptions are word or
> sector addresses which use offset directly rather than the shifted
> offset, boff.
> 
> Signed-off-by: Stephen Checkoway 
> ---
>  hw/block/pflash_cfi02.c   |  8 +++-
>  tests/pflash-cfi02-test.c | 12 ++--
>  2 files changed, 17 insertions(+), 3 deletions(-)

Acked-by: Thomas Huth

Re: [Qemu-block] [PATCH v3 04/10] block/pflash_cfi02: Implement intereleaved flash devices

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> It's common for multiple narrow flash chips to be hooked up in parallel
> to support wider buses. For example, four 8-bit wide flash chips (x8)
> may be combined in parallel to produce a 32-bit wide device. Similarly,
> two 16-bit wide chips (x16) may be combined.
> 
> This commit introduces `device-width` and `max-device-width` properties,
> similar to pflash_cfi01, with the following meanings:
> - `width`: The width of the logical, qemu device (same as before);
> - `device-width`: The width of an individual flash chip, defaulting to
>   `width`; and
> - `max-device-width`: The maximum width of an individual flash chip,
>   defaulting to `device-width`.
> 
> Nothing needs to change to support reading such interleaved devices but
> commands (e.g., erase and programming) must be sent to all devices at
> the same time or else the various chips will be in different states.
> 
> For example, a 4-byte wide logical device can be composed of four x8/x16
> devices in x8 mode. That is, each device supports both x8 or x16 and
> they're being used in the byte, rather than word, mode. This
> configuration would have `width=4`, `device-width=1`, and
> `max-device-width=2`.
> 
> In addition to commands being sent to all devices, guest firmware
> expects the status and CFI queries to be replicated for each device.
> (The one exception to the response replication is that each device gets
> to report its own status bit DQ7 while programming because its value
> depends on the value being programmed which will usually differ for each
> device.)
> 
> Testing is limited to 16-bit wide devices due to the current inability
> to override the properties set by `pflash_cfi02_register`, but multiple
> configurations are tested.
> 
> Stop using global_qtest. Instead, package the qtest variable inside the
> FlashConfig structure.

Thanks for doing that change wrt global_qtest!

Acked-by: Thomas Huth

Re: [Qemu-block] [PATCH v3 05/10] block/pflash_cfi02: Implement nonuniform sector sizes

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> Some flash chips support sectors of different sizes. For example, the
> AMD AM29LV160DT has 31 64 kB sectors, one 32 kB sector, two 8 kB
> sectors, and a 16 kB sector, in that order. The AM29LV160DB has those in
> the reverse order.
> 
> The `num-blocks` and `sector-length` properties work exactly as they did
> before: a flash device with uniform sector lengths. To get non-uniform
> sector lengths for up to four regions, the following properties may be
> set
> - region 0. `num-blocks0` and `sector-length0`;
> - region 1. `num-blocks1` and `sector-length1`;
> - region 2. `num-blocks2` and `sector-length2`; and
> - region 3. `num-blocks3` and `sector-length3`.
> 
> If the uniform and nonuniform properties are set, then both must specify
> a flash device with the same total size. It would be better to disallow
> both being set, or make `num-blocks0` and `sector-length0` alias
> `num-blocks` and `sector-length`, but that would make testing currently
> impossible.
> 
> Signed-off-by: Stephen Checkoway 
> ---
>  hw/block/pflash_cfi02.c   | 177 +---
>  tests/pflash-cfi02-test.c | 185 --
>  2 files changed, 265 insertions(+), 97 deletions(-)

Acked-by: Thomas Huth

Re: [Qemu-block] [PATCH v3 06/10] block/pflash_cfi02: Fix CFI in autoselect mode

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> After a flash device enters CFI mode from autoselect mode, the reset
> command returns the device to autoselect mode. An additional reset
> command is necessary to return to read array mode.
> 
> Signed-off-by: Stephen Checkoway 
> ---
[...]
> diff --git a/tests/pflash-cfi02-test.c b/tests/pflash-cfi02-test.c
> index 82bc5695e1..4039647604 100644
> --- a/tests/pflash-cfi02-test.c
> +++ b/tests/pflash-cfi02-test.c
> @@ -480,6 +480,42 @@ static void test_geometry(const void *opaque)
>  qtest_quit(qtest);
>  }
>  
> +/*
> + * Test that
> + * 1. enter autoselect mode;
> + * 2. enter CFI mode; and then
> + * 3. exit CFI mode
> + * leaves the flash device in autoselect mode.
> + */
> +static void test_cfi_in_autoselect(const void *opaque)
> +{
> +const FlashConfig *config = opaque;
> +QTestState *qtest = qtest_initf("-M musicpal,accel=qtest"
> +" -drive if=pflash,file=%s,format=raw,"
> +"copy-on-read",
> +image_path);

Just a matter of taste, but I think here I'd declare the variable first,
and do the qtest_initf on a separate line, so you don't have to break
the string between "format=raw," and "copy-on-read".

Anyway, it's just a nit, so still:

Acked-by: Thomas Huth

Re: [Qemu-block] [PATCH v3 01/10] block/pflash_cfi02: Add test for supported commands

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> Test the AMD command set for parallel flash chips. This test uses an
> ARM musicpal board with a pflash drive to test the following list of
> currently-supported commands.
> - Autoselect
> - CFI
> - Sector erase
> - Chip erase
> - Program
> - Unlock bypass
> - Reset
> 
> Signed-off-by: Stephen Checkoway 
> ---
>  tests/Makefile.include|   2 +
>  tests/pflash-cfi02-test.c | 227 ++
>  2 files changed, 229 insertions(+)
>  create mode 100644 tests/pflash-cfi02-test.c
[...]
> new file mode 100644
> index 00..b113fca5af
> --- /dev/null
> +++ b/tests/pflash-cfi02-test.c
> @@ -0,0 +1,227 @@
> +/*
> + * QTest testcase for parallel flash with AMD command set
> + *
> + * Copyright (c) 2018 Stephen Checkoway

Do you maybe want to update that to 2019?

> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include 

We generally don't use err.h in QEMU ... could you please use the
standard functions from glib instead?

> +#include 

unistd.h is already provided by osdep.h, so you don't need to include it
here again. (and the scripts/clean-includes script will barf at this
later, so better fix it right from the start)

> +#include "libqtest.h"
> +
> +/*
> + * To test the pflash_cfi02 device, we run QEMU with the musicpal machine 
> with
> + * a pflash drive. This enables us to test some flash configurations, but not
> + * all. In particular, we're limited to a 16-bit wide flash device.
> + */
> +
> +#define MP_FLASH_SIZE_MAX (32 * 1024 * 1024)
> +#define BASE_ADDR (0x1ULL - MP_FLASH_SIZE_MAX)
> +
> +#define FLASH_WIDTH 2
> +#define CFI_ADDR (FLASH_WIDTH * 0x55)
> +#define UNLOCK0_ADDR (FLASH_WIDTH * 0x)
> +#define UNLOCK1_ADDR (FLASH_WIDTH * 0x2AAA)
> +
> +#define CFI_CMD 0x98
> +#define UNLOCK0_CMD 0xAA
> +#define UNLOCK1_CMD 0x55
> +#define AUTOSELECT_CMD 0x90
> +#define RESET_CMD 0xF0
> +#define PROGRAM_CMD 0xA0
> +#define SECTOR_ERASE_CMD 0x30
> +#define CHIP_ERASE_CMD 0x10
> +#define UNLOCK_BYPASS_CMD 0x20
> +#define UNLOCK_BYPASS_RESET_CMD 0x00
> +
> +static char image_path[] = "/tmp/qtest.XX";
> +
> +static inline void flash_write(uint64_t byte_addr, uint16_t data)
> +{
> +qtest_writew(global_qtest, BASE_ADDR + byte_addr, data);
> +}
> +
> +static inline uint16_t flash_read(uint64_t byte_addr)
> +{
> +return qtest_readw(global_qtest, BASE_ADDR + byte_addr);
> +}
> +
> +static void unlock(void)
> +{
> +flash_write(UNLOCK0_ADDR, UNLOCK0_CMD);
> +flash_write(UNLOCK1_ADDR, UNLOCK1_CMD);
> +}
> +
> +static void reset(void)
> +{
> +flash_write(0, RESET_CMD);
> +}
> +
> +static void sector_erase(uint64_t byte_addr)
> +{
> +unlock();
> +flash_write(UNLOCK0_ADDR, 0x80);
> +unlock();
> +flash_write(byte_addr, SECTOR_ERASE_CMD);
> +}
> +
> +static void wait_for_completion(uint64_t byte_addr)
> +{
> +/* If DQ6 is toggling, step the clock and ensure the toggle stops. */
> +if ((flash_read(byte_addr) & 0x40) ^ (flash_read(byte_addr) & 0x40)) {
> +/* Wait for erase or program to finish. */
> +clock_step_next();
> +/* Ensure that DQ6 has stopped toggling. */
> +g_assert_cmpint(flash_read(byte_addr), ==, flash_read(byte_addr));
> +}
> +}
> +
> +static void bypass_program(uint64_t byte_addr, uint16_t data)
> +{
> +flash_write(UNLOCK0_ADDR, PROGRAM_CMD);
> +flash_write(byte_addr, data);
> +/*
> + * Data isn't valid until DQ6 stops toggling. We don't model this as
> + * writes are immediate, but if this changes in the future, we can wait
> + * until the program is complete.
> + */
> +wait_for_completion(byte_addr);
> +}
> +
> +static void program(uint64_t byte_addr, uint16_t data)
> +{
> +unlock();
> +bypass_program(byte_addr, data);
> +}
> +
> +static void chip_erase(void)
> +{
> +unlock();
> +flash_write(UNLOCK0_ADDR, 0x80);
> +unlock();
> +flash_write(UNLOCK0_ADDR, SECTOR_ERASE_CMD);
> +}
> +
> +static void test_flash(void)
> +{
> +global_qtest = qtest_initf("-M musicpal,accel=qtest "
> +   "-drive 
> if=pflash,file=%s,format=raw,copy-on-read",
> +   image_path);
> +/* Check the IDs. */
> +unlock();
> +flash_write(UNLOCK0_ADDR, AUTOSELECT_CMD);
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x), ==, 0x00BF);
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x0001), ==, 0x236D);
> +reset();
> +
> +/* Check the erase blocks. */
> +flash_write(CFI_ADDR, CFI_CMD);
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x10), ==, 'Q');
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x11), ==, 'R');
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x12), ==, 'Y');
> +/* Num erase regions. */
> +g_assert_cmpint(flash_read(FLASH_WIDTH * 0x2C), >=, 1);
> +uint32_t nb_sectors = flash_

Re: [Qemu-block] [PATCH v3 08/10] block/pflash_cfi02: Implement multi-sector erase

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> After two unlock cycles and a sector erase command, the AMD flash chips
> start a 50 us erase time out. Any additional sector erase commands add a
> sector to be erased and restart the 50 us timeout. During the timeout,
> status bit DQ3 is cleared. After the time out, DQ3 is asserted during
> erasure.
> 
> Signed-off-by: Stephen Checkoway 
> ---
>  hw/block/pflash_cfi02.c   | 94 +++
>  tests/pflash-cfi02-test.c | 59 ++--
>  2 files changed, 131 insertions(+), 22 deletions(-)

Acked-by: Thomas Huth

Re: [Qemu-block] [PATCH v3 09/10] block/pflash_cfi02: Implement erase suspend/resume

2019-04-17 Thread Thomas Huth

On 18/04/2019 00.01, Stephen Checkoway wrote:
> During a sector erase (but not a chip erase), the embeded erase program
> can be suspended. Once suspended, the sectors not selected for erasure
> may be read and programmed. Autoselect mode is allowed during erase
> suspend mode. Presumably, CFI queries are similarly allowed so this
> commit allows them as well.
> 
> Since guest firmware can use status bits DQ7, DQ6, DQ3, and DQ2 to
> determine the current state of sector erasure, these bits are properly
> implemented.
> 
> Signed-off-by: Stephen Checkoway 
> ---
>  hw/block/pflash_cfi02.c   | 153 ++
>  tests/pflash-cfi02-test.c | 112 
>  2 files changed, 251 insertions(+), 14 deletions(-)

Acked-by: Thomas Huth

56 matches

Mail list logo