Re: [Qemu-block] [Qemu-devel] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-13 Thread Eric Blake
On 12/13/2016 06:18 AM, Wouter Verhelst wrote:
> On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
>> On 12.12.2016 at 19:12, Wouter Verhelst wrote:
>>> I'm not opposed to this proposal, per se, but there seems to be some
>>> disagreement (by Kevin, for instance) on whether this extension is at
>>> all useful.
>>
>> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
>> I wanted to ask the question so that we don't end up adding a feature
>> that no one actually uses. Ultimately it's your call as a maintainer
>> anyway how conservative you want to be with spec additions.
> 
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

No, it is NOT a "transmission flag", as it is a per-export property
(where we currently have 64 bits).

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [PATCH 18/21] qmp: add x-debug-block-dirty-bitmap-sha256

2016-12-13 Thread Max Reitz

On 2016-11-22 at 18:26, Vladimir Sementsov-Ogievskiy wrote:

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/dirty-bitmap.c |  5 +
 blockdev.c   | 33 +
 include/block/dirty-bitmap.h |  2 ++
 include/qemu/hbitmap.h   |  8 
 qapi/block-core.json | 26 ++
 tests/Makefile.include   |  2 +-
 util/hbitmap.c   | 11 +++
 7 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index fe34d48..775181c 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -589,3 +589,8 @@ BdrvDirtyBitmap *bdrv_dirty_bitmap_next(BlockDriverState *bs,
 return bitmap == NULL ? QLIST_FIRST(&bs->dirty_bitmaps) :
 QLIST_NEXT(bitmap, list);
 }
+
+char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap *bitmap, Error **errp)
+{
+return hbitmap_sha256(bitmap->bitmap, errp);
+}
diff --git a/blockdev.c b/blockdev.c
index 3891d86..f129ccf 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2819,6 +2819,39 @@ void qmp_block_dirty_bitmap_clear(const char *node, const char *name,
 aio_context_release(aio_context);
 }

+/**
+ * Completely clear a bitmap, for the purposes of synchronizing a bitmap
+ * immediately after a full backup operation.


This comment doesn't seem quite right.


+ */
+BlockDirtyBitmapSha256 *qmp_x_debug_block_dirty_bitmap_sha256(const char *node,
+  const char *name,
+  Error **errp)
+{
+AioContext *aio_context;
+BdrvDirtyBitmap *bitmap;
+BlockDriverState *bs;
+BlockDirtyBitmapSha256 *ret = NULL;
+char *sha256;
+
+bitmap = block_dirty_bitmap_lookup(node, name, &bs, &aio_context, errp);
+if (!bitmap || !bs) {
+return NULL;
+}
+
+sha256 = bdrv_dirty_bitmap_sha256(bitmap, errp);
+if (sha256 == NULL) {
+goto out;
+}
+
+ret = g_new(BlockDirtyBitmapSha256, 1);
+ret->sha256 = sha256;
+
+out:
+aio_context_release(aio_context);
+
+return ret;
+}
+
 void hmp_drive_del(Monitor *mon, const QDict *qdict)
 {
 const char *id = qdict_get_str(qdict, "id");
diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index d71edc4..b022b34 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -86,4 +86,6 @@ BdrvDirtyBitmap *bdrv_dirty_bitmap_next(BlockDriverState *bs,

 bool bdrv_has_persistent_bitmaps(BlockDriverState *bs);

+char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap *bitmap, Error **errp);
+
 #endif
diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 063ec0e..750346c 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -240,6 +240,14 @@ void hbitmap_deserialize_ones(HBitmap *hb, uint64_t start, uint64_t count,
 void hbitmap_deserialize_finish(HBitmap *hb);

 /**
+ * hbitmap_sha256:
+ * @bitmap: HBitmap to operate on.
+ *
+ * Returns SHA256 hash of the last level.
+ */
+char *hbitmap_sha256(const HBitmap *bitmap, Error **errp);
+
+/**
  * hbitmap_free:
  * @hb: HBitmap to operate on.
  *
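
The util/hbitmap.c hunk itself is only listed in the diffstat above and not
quoted here. As a rough sketch of what "hash of the last level" could look
like -- assuming QEMU's qcrypto_hash_digest() helper from crypto/hash.h, and
with illustrative names for HBitmap's internal level arrays; this is not the
actual patch code:

#include "crypto/hash.h"   /* qcrypto_hash_digest(), QCRYPTO_HASH_ALG_SHA256 */

/* Illustrative sketch only, not the patch's util/hbitmap.c hunk.  Assumes
 * the HBitmap keeps its finest-grained level in levels[HBITMAP_LEVELS - 1]
 * and its word count in sizes[HBITMAP_LEVELS - 1] (names are made up). */
char *hbitmap_sha256(const HBitmap *bitmap, Error **errp)
{
    size_t size = bitmap->sizes[HBITMAP_LEVELS - 1] * sizeof(unsigned long);
    char *data = (char *)bitmap->levels[HBITMAP_LEVELS - 1];
    char *hash = NULL;

    /* On failure this sets errp and leaves hash NULL, which is exactly
     * what the QMP wrapper in blockdev.c checks for. */
    qcrypto_hash_digest(QCRYPTO_HASH_ALG_SHA256, data, size, &hash, errp);
    return hash;
}
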
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 648f94a..76f6c60 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1289,6 +1289,32 @@
   'data': 'BlockDirtyBitmap' }

 ##
+# @BlockDirtyBitmapSha256:
+#
+# SHA256 hash of dirty bitmap data
+#
+# @sha256: bitmap SHA256 hash
+#
+# Since: 2.8


*2.9


+##
+  { 'struct': 'BlockDirtyBitmapSha256',
+'data': {'sha256': 'str'} }
+
+##
+# @x-debug-block-dirty-bitmap-sha256


Should end in a colon ("@x-debug-block-dirty-bitmap-sha256:").


+#
+# Get bitmap SHA256
+#
+# Returns: BlockDirtyBitmapSha256 on success
+#  If @node is not a valid block device, DeviceNotFound
+#  If @name is not found, GenericError with an explanation


Theoretically, hashing can also fail, but, well...


+#
+# Since 2.8


*2.9 and also "Since: 2.9" (with colon).

Looks good apart from these minor things.

Max


+##
+  { 'command': 'x-debug-block-dirty-bitmap-sha256',
+'data': 'BlockDirtyBitmap', 'returns': 'BlockDirtyBitmapSha256' }
+
+##
 # @blockdev-mirror
 #
 # Start mirroring a block device's writes to a new destination.
diff --git a/tests/Makefile.include b/tests/Makefile.include
index e98d3b6..b85ff78 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -496,7 +496,7 @@ tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
-tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
+tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) $(test-crypto-obj-y)
 

Re: [Qemu-block] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Alberto Garcia
On Tue 13 Dec 2016 02:44:26 PM CET, Max Reitz wrote:

 But leaving that aside, would that improve anything? I don't think
 the cache itself adds any significant overhead here, IIRC even in
 your presentation at KVM Forum 2015 qcow2 was comparable to raw as
 long as all L2 tables were cached in memory.
>>>
>>> I haven't compared CPU usage, though. That may have gone up quite a
>>> bit, I don't know. For large enough images, it may even become a
>>> bottleneck.
>>
>> I don't know, I don't have any data to suggest that there's a problem
>> there, particularly after the rewrite from last year I think the
>> cache is reasonably fast (as long as the cache is big enough; and if
>> it's not big enough the problem is going to be in the extra I/O).
>
> So the remaining question is whether we want a cache for the cache. If
> the cache is stored directly on an SSD, any access to it has to go
> there and back, and that seems like we may still want to have the
> option of having an in-RAM cache on top of it.

That would indeed be possible... but I wouldn't add one without data
showing that it's worth it.

> If we don't want to split the qcow2 file into separate metadata and
> data files, then we could still create a sparse temporary file at
> runtime which contains all L2 tables. Reading an L2 table then
> wouldn't go through the qcow2 file but through that temporary file and
> our existing metadata cache would work on top of it.

And if you modify a table you have to do it in both places.

The basic idea is the same, anyway. I just started with the assumption
that the L2 cache structures don't add significant overhead and went for
the simplest solution (that's why it's an RFC after all).

But implementation details aside, the main question is whether the best
way to go is to keep the existing metadata and store it in a faster disk
or, as Stefan suggested, think about replacing the current structures
with something else that scales better. I think that's at least worth
exploring.

Berto



[Qemu-block] [throwaway PATCH] add fio plugin

2016-12-13 Thread Paolo Bonzini
All of the infrastructure is a bit hackish, which is why I've marked
this as throwaway, but perhaps someone else would like to use it (or
even revive the libtool infrastructure so that it can be included
in QEMU).  I tested it with fio 2.8 and the master branch.

Support for driver options is minimal (only aio=native/threads :) so
if you want to use the null driver remember that the size should be
at most 1G (the default size of the null block device).

The plugin requires Stefan's polling patches, but it's easy to remove
the poll_max_ns option if desired.

Signed-off-by: Paolo Bonzini 
---
 configure |  15 +-
 contrib/fio/Makefile  |  62 +++
 contrib/fio/README|  13 ++
 contrib/fio/fio.c | 436 ++
 contrib/fio/qemu.fio  |  27 +++
 contrib/fio/uninclude.awk |  78 +
 6 files changed, 624 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index 3770d7c..2c844bc 100755
--- a/configure
+++ b/configure
@@ -1589,8 +1589,8 @@ static THREAD int tls_var;
 int main(void) { return tls_var; }
 
 EOF
-  if compile_prog "-fPIE -DPIE" "-pie"; then
-QEMU_CFLAGS="-fPIE -DPIE $QEMU_CFLAGS"
+  if compile_prog "-fPIC -DPIC" "-pie"; then
+QEMU_CFLAGS="-fPIC -DPIC $QEMU_CFLAGS"
 LDFLAGS="-pie $LDFLAGS"
 pie="yes"
 if compile_prog "" "-Wl,-z,relro -Wl,-z,now" ; then
@@ -1598,15 +1598,15 @@ EOF
 fi
   else
 if test "$pie" = "yes"; then
-  error_exit "PIE not available due to missing toolchain support"
+  error_exit "PIC not available due to missing toolchain support"
 else
-  echo "Disabling PIE due to missing toolchain support"
+  echo "Disabling PIC due to missing toolchain support"
   pie="no"
 fi
   fi
 
-  if compile_prog "-Werror -fno-pie" "-nopie"; then
-CFLAGS_NOPIE="-fno-pie"
+  if compile_prog "-Werror -fno-pic -fno-pie" "-nopie"; then
+CFLAGS_NOPIE="-fno-pic -fno-pie"
 LDFLAGS_NOPIE="-nopie"
   fi
 fi
@@ -6183,13 +6183,14 @@ fi
 
 # build tree in object directory in case the source is not in the current 
directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos 
tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests"
-DIRS="$DIRS fsdev"
+DIRS="$DIRS fsdev contrib/fio"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas pc-bios/s390-ccw"
 DIRS="$DIRS roms/seabios roms/vgabios"
 DIRS="$DIRS qapi-generated"
 FILES="Makefile tests/tcg/Makefile qdict-test-data.txt"
 FILES="$FILES tests/tcg/cris/Makefile tests/tcg/cris/.gdbinit"
 FILES="$FILES tests/tcg/lm32/Makefile tests/tcg/xtensa/Makefile po/Makefile"
+FILES="$FILES contrib/fio/Makefile"
 FILES="$FILES pc-bios/optionrom/Makefile pc-bios/keymaps"
 FILES="$FILES pc-bios/spapr-rtas/Makefile"
 FILES="$FILES pc-bios/s390-ccw/Makefile"
diff --git a/contrib/fio/Makefile b/contrib/fio/Makefile
new file mode 100644
index 000..3d47d5c
--- /dev/null
+++ b/contrib/fio/Makefile
@@ -0,0 +1,62 @@
+# -*- Mode: makefile -*-
+
+BUILD_DIR?=$(CURDIR)/../..
+
+include ../../config-host.mak
+include $(SRC_PATH)/rules.mak
+
+$(call set-vpath, $(SRC_PATH):$(SRC_PATH)/contrib/fio:$(CURDIR))
+
+PROGS=qemu.so
+
+all: $(PROGS)
+# Dummy command so that make thinks it has done something
+   @true
+
+QEMU_CFLAGS += -Wno-redundant-decls
+LIBS += $(LIBS_TOOLS)
+
+# We need two fio header files, but we don't want to include
+# fio's config-host.h because its #defined symbols conflict
+# with QEMU's own config-host.h.  In general we do not want to
+# have -I$(FIO_PATH) while compiling fio.o because the possible
+# conflicts are a mess.
+#
+# optgroup.h is tame and we can just copy it to the build directory,
+# but fio.h includes a lot of other header files, from both fio
+# itself and the system.  Therefore we preprocess it so that
+# fio's headers are merged into a single file, fio-qemu.h, while
+# system headers are left as #include directives.  Because the
+# preprocessing step removes all preprocessor conditionals,
+# there is no dependency left on fio's config-host.h file.
+#
+# While at it, we hack some symbols that conflict with QEMU's,
+# by prefixing them with a "FIO_" or "fio_" namespace.
+fio.o: fio-qemu.h fio-optgroup-qemu.h
+
+fio-optgroup-qemu.h:
+   cp $(FIO_PATH)/optgroup.h $@
+
+FIO_HACK_SYMBOLS=sed -e 's/\ $@
+
+include $(SRC_PATH)/Makefile.objs
+dummy := $(call unnest-vars,../.., \
+   block-obj-y \
+   block-obj-m \
+   crypto-obj-y \
+   qom-obj-y \
+   io-obj-y)
+obj-y := fio.o $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y)
+
+dummy := $(shell mkdir -p $(dir $(obj-y)))
+qemu.so: $(obj-y) ../../libqemuutil.a ../../libqemustub.a
+   $(call LINK, $^)
+
+clean: clean-target
+   rm -f *.a $(filter-out ../../%, $(obj-y))
+   rm -f $(shell find . -name '*.[od]')
diff --git a/contrib/fio/README b/contrib/fio/README
new file mode 100644
index 000..4aef2a0
--- /dev/null
+++ 

[Qemu-block] [PATCH] add fio plugin

2016-12-13 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini 
---
 configure |   3 +-
 contrib/fio/Makefile  |  62 +++
 contrib/fio/README|  13 ++
 contrib/fio/fio.c | 404 ++
 contrib/fio/qemu.fio  |  26 +++
 contrib/fio/uninclude.awk |  78 +
 6 files changed, 585 insertions(+), 1 deletion(-)
 create mode 100644 contrib/fio/Makefile
 create mode 100644 contrib/fio/README
 create mode 100644 contrib/fio/fio.c
 create mode 100644 contrib/fio/qemu.fio
 create mode 100644 contrib/fio/uninclude.awk

diff --git a/configure b/configure
index 7dec6cd..8687461 100755
--- a/configure
+++ b/configure
@@ -6181,13 +6181,14 @@ fi
 
 # build tree in object directory in case the source is not in the current 
directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos 
tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests"
-DIRS="$DIRS fsdev"
+DIRS="$DIRS fsdev contrib/fio"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas pc-bios/s390-ccw"
 DIRS="$DIRS roms/seabios roms/vgabios"
 DIRS="$DIRS qapi-generated"
 FILES="Makefile tests/tcg/Makefile qdict-test-data.txt"
 FILES="$FILES tests/tcg/cris/Makefile tests/tcg/cris/.gdbinit"
 FILES="$FILES tests/tcg/lm32/Makefile tests/tcg/xtensa/Makefile po/Makefile"
+FILES="$FILES contrib/fio/Makefile"
 FILES="$FILES pc-bios/optionrom/Makefile pc-bios/keymaps"
 FILES="$FILES pc-bios/spapr-rtas/Makefile"
 FILES="$FILES pc-bios/s390-ccw/Makefile"
diff --git a/contrib/fio/Makefile b/contrib/fio/Makefile
new file mode 100644
index 000..4a86b3f
--- /dev/null
+++ b/contrib/fio/Makefile
@@ -0,0 +1,62 @@
+# -*- Mode: makefile -*-
+
+BUILD_DIR?=$(CURDIR)/../..
+
+include ../../config-host.mak
+include $(SRC_PATH)/rules.mak
+
+$(call set-vpath, $(SRC_PATH):$(SRC_PATH)/contrib/fio:$(CURDIR))
+
+PROGS=qemu.so
+
+all: $(PROGS)
+# Dummy command so that make thinks it has done something
+   @true
+
+QEMU_CFLAGS += -I$(BUILD_DIR) -Wno-error -Wno-redundant-decls
+LIBS += $(LIBS_TOOLS)
+
+# We need two fio header files, but we don't want to include
+# fio's config-host.h because its #defined symbols conflict
+# with QEMU's own config-host.h.  In general we do not want to
+# have -I$(FIO_PATH) while compiling fio.o because the possible
+# conflicts are a mess.
+#
+# optgroup.h is tame and we can just copy it to the build directory,
+# but fio.h includes a lot of other header files, from both fio
+# itself and the system.  Therefore we preprocess it so that
+# fio's headers are merged into a single file, fio-qemu.h, while
+# system headers are left as #include directives.  Because the
+# preprocessing step removes all preprocessor conditionals,
+# there is no dependency left on fio's config-host.h file.
+#
+# While at it, we hack some symbols that conflict with QEMU's,
+# by prefixing them with a "FIO_" or "fio_" namespace.
+fio.o: fio-qemu.h fio-optgroup-qemu.h
+
+fio-optgroup-qemu.h:
+   cp $(FIO_PATH)/optgroup.h $@
+
+FIO_HACK_SYMBOLS=sed -e 's/\ $@
+
+include $(SRC_PATH)/Makefile.objs
+dummy := $(call unnest-vars,../.., \
+   block-obj-y \
+   block-obj-m \
+   crypto-obj-y \
+   qom-obj-y \
+   io-obj-y)
+obj-y := fio.o $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y)
+
+dummy := $(shell mkdir -p $(dir $(obj-y)))
+qemu.so: $(obj-y) ../../libqemuutil.a ../../libqemustub.a
+   $(call LINK, $^)
+
+clean: clean-target
+   rm -f *.a $(filter-out ../../%, $(obj-y))
+   rm -f $(shell find . -name '*.[od]')
diff --git a/contrib/fio/README b/contrib/fio/README
new file mode 100644
index 000..4aef2a0
--- /dev/null
+++ b/contrib/fio/README
@@ -0,0 +1,13 @@
+This is a plugin to test the QEMU block layer with fio,
+the flexible I/O tester.
+
+To build this plugin you have to:
+
+1) hack the configure file to use -fPIC instead of -fPIE
+
+2) configure qemu with --enable-pie
+
+3) build the plugin with
+
+   make qemu-img
+   make -C contrib/fio
diff --git a/contrib/fio/fio.c b/contrib/fio/fio.c
new file mode 100644
index 000..eedc3b9
--- /dev/null
+++ b/contrib/fio/fio.c
@@ -0,0 +1,404 @@
+/*
+ * QEMU engine for fio
+ *
+ */
+#include "fio-qemu.h"
+#include "fio-optgroup-qemu.h"
+#include "qemu/osdep.h"
+#include "qemu/iov.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "block/aio.h"
+#include "block/qapi.h"
+#include "crypto/init.h"
+#include "sysemu/block-backend.h"
+
+struct qemu_data {
+   AioContext *ctx;
+   unsigned int completed;
+   unsigned int to_submit;
+   struct io_u *aio_events[];
+};
+
+struct qemu_options {
+   void *pad;
+   const char *aio;
+   const char *format;
+   const char *driver;
+};
+
+static QemuMutex iothread_lock;
+
+static int str_aio_cb(void *data, const char *str)
+{
+   struct qemu_options *o = data;
+
+   if (!strcmp(str, "native") || !strcmp(str, "threads"))
+   o->aio = 

Re: [Qemu-block] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Max Reitz

On 2016-12-13 at 13:55, Alberto Garcia wrote:

On Tue 13 Dec 2016 09:02:34 AM CET, Max Reitz wrote:


I definitely like how simple your approach is, but from a design
standpoint it is not exactly optimal, because O(n) for a cluster
lookup is simply worse than O(1).


It's actually not O(n) anymore since I rewrote that code last year.

When I changed the algorithm to use LRU for cache eviction I also
optimized the lookup:

 1) qcow2_cache_put() simply divides the table offset by the cluster
size and it immediately obtains the index.

 2) qcow2_cache_get() hashes the offset to do the lookup. In my tests
doing random I/O and a 2048 entry cache the average number of
comparisons was 2.5 (430 before the rewrite).

The cache was ~13% faster after the rewrite, but the results in terms of
actual disk I/O improvements were negligible, because that's not where
the bottleneck is. We even discussed how the hash function that I was
using was weak, but we decided not to touch it because it was not worth
the effort.


13% is not really what O(1) vs. O(n) is about, but you are completely
right. For some reason I hadn't seen that. OK, so the code is indeed
O(1) in the best case (and probably in the average case, too), so that's
good.



But leaving that aside, would that improve anything? I don't think
the cache itself adds any significant overhead here, IIRC even in
your presentation at KVM Forum 2015 qcow2 was comparable to raw as
long as all L2 tables were cached in memory.


I haven't compared CPU usage, though. That may have gone up quite a
bit, I don't know. For large enough images, it may even become a
bottleneck.


I don't know, I don't have any data to suggest that there's a problem
there, particularly after the rewrite from last year I think the cache
is reasonably fast (as long as the cache is big enough; and if it's not
big enough the problem is going to be in the extra I/O).


So the remaining question is whether we want a cache for the cache. If 
the cache is stored directly on an SSD, any access to it has to go there 
and back, and that seems like we may still want to have the option of 
having an in-RAM cache on top of it.


If we don't want to split the qcow2 file into separate metadata and data 
files, then we could still create a sparse temporary file at runtime 
which contains all L2 tables. Reading an L2 table then wouldn't go 
through the qcow2 file but through that temporary file and our existing 
metadata cache would work on top of it.


Max



Re: [Qemu-block] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-13 Thread Alex Bligh

> On 13 Dec 2016, at 12:18, Wouter Verhelst  wrote:
> 
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

I did suggest a few non-Qemu uses (see below). I think it might be
an idea if the reference implementation supported it before
merging (which per below should be trivial).

-- 
Alex Bligh


> Begin forwarded message:
> 
> From: Alex Bligh 
> Subject: Re: [Nbd] [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES 
> extension
> Date: 6 December 2016 at 10:29:41 United Kingdom Time
> To: Kevin Wolf 
> Cc: Alex Bligh , Eric Blake , 
> "nbd-gene...@lists.sourceforge.net" , 
> xieying...@huawei.com, su...@huawei.com, qemu block , 
> "qemu-de...@nongnu.org" , Paolo Bonzini 
> 
> 
> 
>> On 6 Dec 2016, at 09:25, Kevin Wolf  wrote:
>> 
>> On 06.12.2016 at 00:42, Eric Blake wrote:
>>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>>> team discovered that it is useful if a server can advertise
>>> whether an export is in a known-all-zeroes state at the time
>>> the client connects.
>> 
>> Does a server usually have the information to set this flag, other than
>> querying the block status of all blocks at startup? 
> 
> The server may have other ways of knowing this, for instance
> that it has just created the file (*), or that it stat'd the file
> before opening it (not unlikely) and noticed it had 0 allocated
> size. The latter I suspect would be trivial to implement in nbd-server
> 
> (*) = e.g. I had one application where nbd use the export path
> to signify it wanted to open a temporary file, the path consisting
> of a UUID and an encoded length. If the file was not present already
> it created it with ftruncate(). That could trivially have used this.
> 
>> If so, the client could just query this by itself.
> 
> Well there's no currently mainlined extension to do that, but yes
> it could. On the other hand I see no issue passing complete
> zero status back to the client if it's so obvious from a stat().
> 
> -- 
> Alex Bligh
> 
> 
> 
> 
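
As a minimal sketch of the stat()-based detection described above -- purely
illustrative, assuming a POSIX server that has just opened a regular file as
the export backing store; the flag itself is only the proposal under
discussion:

#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: decide whether a freshly opened export could advertise the
 * proposed NBD_FLAG_INIT_ZEROES.  A regular file with no allocated
 * blocks has never had data written to it, so every read returns zeroes. */
static bool export_reads_as_zeroes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return false;           /* be conservative if we cannot tell */
    }
    return S_ISREG(st.st_mode) && st.st_blocks == 0;
}
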





Re: [Qemu-block] [Qemu-devel] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Fam Zheng
On Tue, 12/13 13:29, Kevin Wolf wrote:
> On 13.12.2016 at 11:16, Fam Zheng wrote:
> > On Tue, 12/13 09:02, Max Reitz wrote:
> > > > Yes but the use case is that the qcow2 image is stored in a slow disk,
> > > > so things will be faster if we avoid having to read it too often.
> > > > 
> > > > But the data is there and it needs to be read, so we have three options:
> > > > 
> > > >   1) Read it every time we need it. It slows things down.
> > > >   2) Keep (part of) it in memory. It can use a lot of memory.
> > > >   3) Keep it in a faster disk.
> > > > 
> > > > We're talking about 3) here, and this is not about creating new
> > > > structures, but about changing the storage backend of the existing L2
> > > > cache (disk rather than RAM).
> > > 
> > > I'm arguing that we already have an on-disk L2 structure and that is 
> > > called
> > > simply the L1-L2 structure in the qcow2 file. The cache only makes sense
> > > because it is in RAM.
> > 
> > I am looking at this problem from a different angle:
> > 
> > Assume a qcow2 image created by a known version of QEMU with the option
> > preallocation=falloc, it is theoretically possible to do a hacky optimization in
> > qcow2 I/O path that, instead of looking up in L1/L2, we can calculate file
> > offset for any given virtual address, by adding the file offset of LBA 0.
> > 
> > If we make this optimization a header extension and add it to the spec, it is no
> > longer a hack, and it should offer nearly the same performance as raw in all
> > cases - if we say raw is doing "identical mapping" from LBA to file offset, this
> > new operation mode of qcow2 file is basically doing a "shift by constant" mapping.
> 
> Is this a scenario that matters in practice, though? If you want
> performance and you don't need COW functionality, just use raw images to
> begin with.

The intention is actually to combine the advantages of both: external snapshot
can be supported by an allocation status bitmap generated from scanning the
whole L1/L2 table, or even from loading a new type of qcow2 bitmap specifically
for making that efficient.

I share the same doubt as Stefan regarding the on-disk format. Modern file
systems support sparse files quite well, so I wonder what advantages are still
there in using L1/L2 tables instead of a simple bitmap, if the user doesn't use
internal snapshots in the first place?

Fam



Re: [Qemu-block] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Alberto Garcia
On Tue 13 Dec 2016 09:02:34 AM CET, Max Reitz wrote:

> I definitely like how simple your approach is, but from a design
> standpoint it is not exactly optimal, because O(n) for a cluster
> lookup is simply worse than O(1).

It's actually not O(n) anymore since I rewrote that code last year.

When I changed the algorithm to use LRU for cache eviction I also
optimized the lookup:

 1) qcow2_cache_put() simply divides the table offset by the cluster
size and it immediately obtains the index.

 2) qcow2_cache_get() hashes the offset to do the lookup. In my tests
doing random I/O and a 2048 entry cache the average number of
comparisons was 2.5 (430 before the rewrite).

The cache was ~13% faster after the rewrite, but the results in terms of
actual disk I/O improvements were negligible, because that's not where
the bottleneck is. We even discussed how the hash function that I was
using was weak, but we decided not to touch it because it was not worth
the effort.
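
As a simplified sketch of the two lookups described above (illustrative only;
the real code lives in block/qcow2-cache.c and uses different names): the
cache keeps all tables in one contiguous buffer, so the index behind
qcow2_cache_put() falls out of pointer arithmetic, while qcow2_cache_get()
hashes the on-disk offset and probes from there:

#include <stdint.h>
#include <stddef.h>

/* Illustrative names only, not the fields in block/qcow2-cache.c. */
typedef struct Entry {
    uint64_t offset;            /* host offset of the cached table, 0 if unused */
} Entry;

typedef struct Cache {
    Entry   *entries;
    uint8_t *table_data;        /* one contiguous buffer: size * table_size bytes */
    size_t   size;              /* number of cached tables */
    size_t   table_size;        /* bytes per table (the cluster size) */
} Cache;

/* "put": the table pointer itself encodes the slot, so recovering the
 * index is a subtraction and a division -- no search at all. */
static size_t cache_index_of(Cache *c, void *table)
{
    return ((uint8_t *)table - c->table_data) / c->table_size;
}

/* "get": hash the on-disk offset to pick a starting slot, then probe
 * linearly.  With a reasonable hash the average number of comparisons
 * stays small (the ~2.5 quoted above for a 2048-entry cache). */
static int cache_lookup(Cache *c, uint64_t offset)
{
    size_t start = (offset / c->table_size * 4) % c->size;     /* toy hash */
    size_t i, probes;

    for (probes = 0, i = start; probes < c->size; probes++, i = (i + 1) % c->size) {
        if (c->entries[i].offset == offset) {
            return (int)i;      /* hit */
        }
    }
    return -1;                  /* miss: caller evicts the LRU entry and loads */
}
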

>> But leaving that aside, would that improve anything? I don't think
>> the cache itself adds any significant overhead here, IIRC even in
>> your presentation at KVM Forum 2015 qcow2 was comparable to raw as
>> long as all L2 tables were cached in memory.
>
> I haven't compared CPU usage, though. That may have gone up quite a
> bit, I don't know. For large enough images, it may even become a
> bottleneck.

I don't know, I don't have any data to suggest that there's a problem
there, particularly after the rewrite from last year I think the cache
is reasonably fast (as long as the cache is big enough; and if it's not
big enough the problem is going to be in the extra I/O).

Berto



Re: [Qemu-block] [Qemu-devel] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Kevin Wolf
On 13.12.2016 at 11:16, Fam Zheng wrote:
> On Tue, 12/13 09:02, Max Reitz wrote:
> > > Yes but the use case is that the qcow2 image is stored in a slow disk,
> > > so things will be faster if we avoid having to read it too often.
> > > 
> > > But the data is there and it needs to be read, so we have three options:
> > > 
> > >   1) Read it every time we need it. It slows things down.
> > >   2) Keep (part of) it in memory. It can use a lot of memory.
> > >   3) Keep it in a faster disk.
> > > 
> > > We're talking about 3) here, and this is not about creating new
> > > structures, but about changing the storage backend of the existing L2
> > > cache (disk rather than RAM).
> > 
> > I'm arguing that we already have an on-disk L2 structure and that is called
> > simply the L1-L2 structure in the qcow2 file. The cache only makes sense
> > because it is in RAM.
> 
> I am looking at this problem from a different angle:
> 
> Assume a qcow2 image created by a known version of QEMU with the option
> preallocation=falloc, it is theoretically possible to do a hacky optimization in
> qcow2 I/O path that, instead of looking up in L1/L2, we can calculate file
> offset for any given virtual address, by adding the file offset of LBA 0.
> 
> If we make this optimization a header extension and add it to the spec, it is no
> longer a hack, and it should offer nearly the same performance as raw in all
> cases - if we say raw is doing "identical mapping" from LBA to file offset, this
> new operation mode of qcow2 file is basically doing a "shift by constant" mapping.

Is this a scenario that matters in practice, though? If you want
performance and you don't need COW functionality, just use raw images to
begin with.

Kevin



Re: [Qemu-block] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-13 Thread Wouter Verhelst
On Tue, Dec 13, 2016 at 08:38:12AM +0100, Kevin Wolf wrote:
> On 12.12.2016 at 19:12, Wouter Verhelst wrote:
> > I'm not opposed to this proposal, per se, but there seems to be some
> > disagreement (by Kevin, for instance) on whether this extension is at
> > all useful.
> 
> FWIW, I'm not opposed to adding the flag. I don't think it can hurt, but
> I wanted to ask the question so that we don't end up adding a feature
> that no one actually uses. Ultimately it's your call as a maintainer
> anyway how conservative you want to be with spec additions.

I'm not opposed either, but I do agree with you that we shouldn't add
such a feature if it doesn't end up getting used. Especially so if it
burns a flag in the (16-bit) "transmission flags" field, where space is
at a premium.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-block] [Qemu-devel] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Fam Zheng
On Tue, 12/13 09:02, Max Reitz wrote:
> > Yes but the use case is that the qcow2 image is stored in a slow disk,
> > so things will be faster if we avoid having to read it too often.
> > 
> > But the data is there and it needs to be read, so we have three options:
> > 
> >   1) Read it every time we need it. It slows things down.
> >   2) Keep (part of) it in memory. It can use a lot of memory.
> >   3) Keep it in a faster disk.
> > 
> > We're talking about 3) here, and this is not about creating new
> > structures, but about changing the storage backend of the existing L2
> > cache (disk rather than RAM).
> 
> I'm arguing that we already have an on-disk L2 structure and that is called
> simply the L1-L2 structure in the qcow2 file. The cache only makes sense
> because it is in RAM.

I am looking at this problem from a different angle:

Assume a qcow2 image created by a known version of QEMU with the option
preallocation=falloc, it is theoretically possible to do a hacky optimization in
qcow2 I/O path that, instead of looking up in L1/L2, we can calculate file
offset for any given virtual address, by adding the file offset of LBA 0.

If we make this optimization a header extension and add it to the spec, it is no
longer a hack, and it should offer nearly the same performance as raw in all
cases - if we say raw is doing "identical mapping" from LBA to file offset, this
new operation mode of qcow2 file is basically doing a "shift by constant" 
mapping.
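
As a purely illustrative contrast (there is no such qcow2 header extension
today, and the names below are made up):

#include <stdint.h>

/* raw: "identical mapping" -- guest offset and file offset coincide. */
static uint64_t raw_map(uint64_t guest_offset)
{
    return guest_offset;
}

/* Hypothetical qcow2 fast path for a preallocation=falloc image whose data
 * clusters are contiguous and in guest order: "shift by constant", i.e.
 * just add the file offset where guest LBA 0 was allocated, instead of
 * walking the L1/L2 tables. */
static uint64_t shifted_map(uint64_t guest_offset, uint64_t data_start)
{
    return data_start + guest_offset;
}
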

Fam



Re: [Qemu-block] [PATCH RFC 0/1] Allow storing the qcow2 L2 cache in disk

2016-12-13 Thread Max Reitz

On 2016-12-12 at 15:13, Alberto Garcia wrote:

On Fri 09 Dec 2016 03:21:08 PM CET, Max Reitz wrote:


In some scenarios, however, there's a different alternative: if the
qcow2 image is stored in a slow backend (eg. HDD), we could save
memory by putting the L2 cache in a faster one (SSD) instead of in
RAM.



Well, from a full design standpoint, it doesn't make a lot of sense to
me:

We have a two-level on-disk structure for cluster mapping so as to not
waste memory for unused areas and so that we don't need to keep one
large continuous chunk of metadata. Accessing the disk is slow, so we
also have an in-memory cache which is just a single level fully
associative cache replicating the same data (but just a part of it).

Now you want to replicate all of it and store it on disk. My mind
tells me that is duplicate data: We already have all of the metadata
elsewhere on disk, namely in the qcow2 file, and even better, it is
not stored in a fully associative structure there but directly mapped,
making finding the correct entry much quicker.


Yes but the use case is that the qcow2 image is stored in a slow disk,
so things will be faster if we avoid having to read it too often.

But the data is there and it needs to be read, so we have three options:

  1) Read it every time we need it. It slows things down.
  2) Keep (part of) it in memory. It can use a lot of memory.
  3) Keep it in a faster disk.

We're talking about 3) here, and this is not about creating new
structures, but about changing the storage backend of the existing L2
cache (disk rather than RAM).


I'm arguing that we already have an on-disk L2 structure and that is 
called simply the L1-L2 structure in the qcow2 file. The cache only 
makes sense because it is in RAM.



However, the thing is that the existing structures also only exist in
the original qcow2 file and cannot be just placed anywhere else, as
opposed to our cache. In order to solve this, we would need to
(incompatibly) modify the qcow2 format to allow storing data
independently from metadata. I think this would be certainly doable,
but the question is whether it is worth the effort.


You mean split the qcow2 file in two: data and metadata? I don't think
it's worth the effort.


That's the thing. I don't know.

I definitely like how simple your approach is, but from a design 
standpoint it is not exactly optimal, because O(n) for a cluster lookup 
is simply worse than O(1).



Maybe we can at least make the cache directly mapped if it is supposed
to cover the whole image? That is, we would basically just load all of
the L2 tables into memory and bypass the existing cache.


I don't see how this addresses the original use case that I described.


It just fixes the issue that the cache is fully associative and then the 
only issue I would have with your approach is that we are keeping 
duplicate data.



But leaving that aside, would that improve anything? I don't think the
cache itself adds any significant overhead here, IIRC even in your
presentation at KVM Forum 2015 qcow2 was comparable to raw as long as
all L2 tables were cached in memory.


I haven't compared CPU usage, though. That may have gone up quite a bit, 
I don't know. For large enough images, it may even become a bottleneck.


Max