Re: [Qemu-block] [PATCH v19 00/10] Block replication for continuous checkpoints

2016-05-26 Thread Changlong Xie

Ping here : )

Hi fam, do you have time to help reviewing this patchset? Consider of we 
are in the same time zone what will speed up code reviewing process,

any feedback will be appreciated.

Thanks
-Xie

On 05/20/2016 03:36 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v19

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v18

TODO:
1. Continuous block replication. It will be started after basic functions






Re: [Qemu-block] [PATCH v19 09/10] tests: add unit test case for replication

2016-05-26 Thread Changlong Xie

On 05/20/2016 03:36 PM, Changlong Xie wrote:

+static void io_write(BlockDriverState *bs, long pattern, int64_t pattern_count,
+ int64_t offset, int64_t count, bool expect_failed)
+{
+void *pattern_buf;


Should initialize as NULL to avoid below warnning:

tests/test-replication.c:104:15: error: ‘pattern_buf’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

 g_free(pattern_buf);

Will fix in next version.


+int ret;
+
+/* 1. alloc pattern buffer */
+if (pattern) {
+pattern_buf = g_malloc(pattern_count);
+memset(pattern_buf, pattern, pattern_count);
+}
+
+/* 2. do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> 9, (uint8_t *)pattern_buf, count >> 9);
+} else {
+ret = bdrv_write_zeroes(bs, offset >> 9, count >> 9, 0);
+}
+
+/* 3. assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+g_free(pattern_buf);
+}
+}






Re: [Qemu-block] [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Fam Zheng
On Thu, 05/26 11:20, Paolo Bonzini wrote:
> 
> 
> On 26/05/2016 10:30, Fam Zheng wrote:
> >> > 
> >> > This doesn't look too wrong...  Should the right sequence of events be
> >> > head/after_head or head/after_tail?  It's probably simplest to just emit
> >> > all four events.
> > I've no idea. (That's why I leaned towards fixing the test case).
> 
> Well, fixing the testcase means knowing what events should be emitted.
> 
> QEMU with Peter's patch emits head/after_head.  If the right one is
> head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
> patch keeps the backwards-compatible route.

Yes, I mean I was not very convinced in tweaking the events at all: each pair
of them has been emitted around bdrv_aligned_preadv(), and the new branch
doesn't do it anymore. So I don't see a reason to add events here.

Fam



Re: [Qemu-block] [PATCH v3 5/5] qcow2: Catch more unaligned write_zero into zero cluster

2016-05-26 Thread Eric Blake
On 05/26/2016 07:41 AM, Denis V. Lunev wrote:
> On 05/26/2016 06:48 AM, Eric Blake wrote:
>> is_zero_cluster() and is_zero_cluster_top_locked() are used only
>> by qcow2_co_write_zeroes().  The former is too broad (we don't
>> care if the sectors we are about to overwrite are non-zero, only
>> that all other sectors in the cluster are zero), so it needs to
>> be called up to twice but with smaller limits - rename it along
>> with adding the neeeded parameter.  The latter can be inlined for
>> more compact code.
>>
>> The testsuite change shows that we now have a sparser top file
>> when an unaligned write_zeroes overwrites the only portion of
>> the backing file with data.
>>
>> Based on a patch proposal by Denis V. Lunev.
>>

>> -
>> -if (!is_zero_cluster(bs, sector_num)) {
>> +/* check whether remainder of cluster already reads as zero */
>> +if (!(is_zero_sectors(bs, cl_start, head) &&
>> +  is_zero_sectors(bs, sector_num + nb_sectors,
>> +  -tail & (s->cluster_sectors - 1 {
> 
> can we have cluster_sectors != 2^n?

No. cluster_sectors is an inherent property of the qcow2 file format:


 20 - 23:   cluster_bits
Number of bits that are used for addressing an offset
within a cluster (1 << cluster_bits is the cluster
size).
Must not be less than 9 (i.e. 512 byte clusters).


As the file format uses a bit shift value, you are guaranteed to have a
power of two amount of sectors within a cluster.

If you prefer, I could have written '-tail % s->cluster_sectors', but as
% on a negative signed integer gives different results than what you get
for an unsigned number, I felt that & was nicer than % for making it
more obvious that I'm grabbing particular bits.

If you can think of any cleaner expression that represents the number of
sectors occurring after the tail until the next cluster boundary, I'm
game; the hardest part is that when tail is 0, we want the number passed
to is_zero_sectors() to also be 0, not s->cluster_sectors (so the naive
's->cluster_sectors - tail' is wrong).

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PULL 00/31] Block layer patches

2016-05-26 Thread Peter Maydell
On 25 May 2016 at 18:39, Kevin Wolf  wrote:
> The following changes since commit 287db79df8af8e31f18e262feb5e05103a09e4d4:
>
>   Merge remote-tracking branch 'remotes/ehabkost/tags/x86-pull-request' into 
> staging (2016-05-24 13:06:33 +0100)
>
> are available in the git repository at:
>
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to b75536c9fa742f887304769d0608557bb8e3a27f:
>
>   blockjob: Remove BlockJob.bs (2016-05-25 19:04:21 +0200)
>
> 
> Block layer patches
>
> 

Applied, thanks.

-- PMM



Re: [Qemu-block] [PATCH v3 5/5] qcow2: Catch more unaligned write_zero into zero cluster

2016-05-26 Thread Denis V. Lunev

On 05/26/2016 06:48 AM, Eric Blake wrote:

is_zero_cluster() and is_zero_cluster_top_locked() are used only
by qcow2_co_write_zeroes().  The former is too broad (we don't
care if the sectors we are about to overwrite are non-zero, only
that all other sectors in the cluster are zero), so it needs to
be called up to twice but with smaller limits - rename it along
with adding the neeeded parameter.  The latter can be inlined for
more compact code.

The testsuite change shows that we now have a sparser top file
when an unaligned write_zeroes overwrites the only portion of
the backing file with data.

Based on a patch proposal by Denis V. Lunev.

CC: Denis V. Lunev 
Signed-off-by: Eric Blake 
---
  block/qcow2.c  | 47 +++---
  tests/qemu-iotests/154.out |  8 
  2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 105fd5e..ecac399 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2406,26 +2406,19 @@ finish:
  }


-static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
+static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
+uint32_t count)
  {
-BDRVQcow2State *s = bs->opaque;
  int nr;
  BlockDriverState *file;
-int64_t res = bdrv_get_block_status_above(bs, NULL, start,
-  s->cluster_sectors, , );
-return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == s->cluster_sectors;
-}
+int64_t res;

-static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
-{
-BDRVQcow2State *s = bs->opaque;
-int nr = s->cluster_sectors;
-uint64_t off;
-int ret;
-
-ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, , );
-assert(nr == s->cluster_sectors);
-return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
+if (!count) {
+return true;
+}
+res = bdrv_get_block_status_above(bs, NULL, start, count,
+  , );
+return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
  }

  static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
@@ -2434,27 +2427,33 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
  int ret;
  BDRVQcow2State *s = bs->opaque;

-int head = sector_num % s->cluster_sectors;
-int tail = (sector_num + nb_sectors) % s->cluster_sectors;
+uint32_t head = sector_num % s->cluster_sectors;
+uint32_t tail = (sector_num + nb_sectors) % s->cluster_sectors;

  trace_qcow2_write_zeroes_start_req(qemu_coroutine_self(), sector_num,
 nb_sectors);

-if (head != 0 || tail != 0) {
+if (head || tail) {
  int64_t cl_start = sector_num - head;
+uint64_t off;
+int nr;

  assert(cl_start + s->cluster_sectors >= sector_num + nb_sectors);

-sector_num = cl_start;
-nb_sectors = s->cluster_sectors;
-
-if (!is_zero_cluster(bs, sector_num)) {
+/* check whether remainder of cluster already reads as zero */
+if (!(is_zero_sectors(bs, cl_start, head) &&
+  is_zero_sectors(bs, sector_num + nb_sectors,
+  -tail & (s->cluster_sectors - 1 {


can we have cluster_sectors != 2^n?
In this case this bits logic will be broken.



[Qemu-block] [PATCH 0/2] block/mirror: Fix target backing BDS

2016-05-26 Thread Max Reitz
bdrv_replace_in_backing_chain() sometimes does what is advertised (if
the new node does not have a backing file yet and if it hasn't been in
the same backing chain already), but this is not what the mirror block
job (the sole user of that function) actually needs. In fact, it only
needs this behavior in 'top' sync mode.

In 'none' sync mode, we need to use the old BDS as the new backing BDS;
in 'full' sync mode, we should not set any backing BDS at all. And if we
need to set a backing BDS, we should always do so, regardless of whether
the new BDS already has one.

Therefore, bdrv_replace_in_backing_chain() should not attempt to find
out which backing BDS is the right one. Instead, we should leave that to
the mirror block job, namely mirror_exit().


Max Reitz (2):
  block/mirror: Fix target backing BDS
  iotests: Add test for post-mirror backing chains

 block.c|   8 --
 block/mirror.c |  13 +++
 tests/qemu-iotests/155 | 218 +
 tests/qemu-iotests/155.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 5 files changed, 237 insertions(+), 8 deletions(-)
 create mode 100755 tests/qemu-iotests/155
 create mode 100644 tests/qemu-iotests/155.out

-- 
2.8.3




[Qemu-block] [PATCH 1/2] block/mirror: Fix target backing BDS

2016-05-26 Thread Max Reitz
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain(). However, the conditions used
to decide when to move the backing BDS from source to target are not
really correct.

Basically, we do not have to set the target's backing BDS when doing a
commit operation (an active commit, to be specific) but we do have to
set it for every mirror operation (unless the target's backing BDS is
already the one it should have).

The decision when to adjust the target's backing BDS and what it should
be set to is something that the mirror code can do best, so let's do it
there.

Also, we do not need to drop the source's backing BDS. In most cases,
the source will be deleted after the mirroring operation anyway, and if
it is not, we probably want it to keep its own backing chain so it stays
valid.

Signed-off-by: Max Reitz 
---
 block.c|  8 
 block/mirror.c | 13 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 736432f..7ae2766 100644
--- a/block.c
+++ b/block.c
@@ -2275,14 +2275,6 @@ void bdrv_replace_in_backing_chain(BlockDriverState 
*old, BlockDriverState *new)
 
 change_parent_backing_link(old, new);
 
-/* Change backing files if a previously independent node is added to the
- * chain. For active commit, we replace top by its own (indirect) backing
- * file and don't do anything here so we don't build a loop. */
-if (new->backing == NULL && !bdrv_chain_contains(backing_bs(old), new)) {
-bdrv_set_backing_hd(new, backing_bs(old));
-bdrv_set_backing_hd(old, NULL);
-}
-
 bdrv_unref(old);
 }
 
diff --git a/block/mirror.c b/block/mirror.c
index 80fd3c7..bc65e54 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -482,6 +482,19 @@ static void mirror_exit(BlockJob *job, void *opaque)
 /* We just changed the BDS the job BB refers to */
 blk_remove_bs(job->blk);
 blk_insert_bs(job->blk, src);
+
+/* Now we need to adjust the target's backing BDS. This is not 
necessary
+ * when having performed a commit operation.
+ * Note that we cannot do this adjustment before the call to
+ * bdrv_replace_in_backing_chain(), because we may end up using
+ * to_replace as the target's BDS, which makes it impossible to replace
+ * to_replace by the target afterwards */
+if (!bdrv_chain_contains(backing_bs(to_replace), target_bs)) {
+BlockDriverState *backing = s->is_none_mode ? to_replace : s->base;
+if (backing_bs(target_bs) != backing) {
+bdrv_set_backing_hd(target_bs, backing);
+}
+}
 }
 if (s->to_replace) {
 bdrv_op_unblock_all(s->to_replace, s->replace_blocker);
-- 
2.8.3




[Qemu-block] [PATCH 2/2] iotests: Add test for post-mirror backing chains

2016-05-26 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/155 | 218 +
 tests/qemu-iotests/155.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 224 insertions(+)
 create mode 100755 tests/qemu-iotests/155
 create mode 100644 tests/qemu-iotests/155.out

diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
new file mode 100755
index 000..76fdd4f
--- /dev/null
+++ b/tests/qemu-iotests/155
@@ -0,0 +1,218 @@
+#!/usr/bin/env python
+#
+# Test whether the backing BDSs are correct after completion of a
+# mirror block job
+#
+# Copyright (C) 2016 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import stat
+import time
+import iotests
+from iotests import qemu_img
+
+back0_img = os.path.join(iotests.test_dir, 'back0.' + iotests.imgfmt)
+back1_img = os.path.join(iotests.test_dir, 'back1.' + iotests.imgfmt)
+back2_img = os.path.join(iotests.test_dir, 'back2.' + iotests.imgfmt)
+source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
+target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
+
+class BaseClass(iotests.QMPTestCase):
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, back0_img, '1M')
+qemu_img('create', '-f', iotests.imgfmt, '-b', back0_img, back1_img)
+qemu_img('create', '-f', iotests.imgfmt, '-b', back1_img, back2_img)
+qemu_img('create', '-f', iotests.imgfmt, '-b', back2_img, source_img)
+
+self.vm = iotests.VM()
+self.vm.add_drive(None, '', 'none')
+self.vm.launch()
+
+# Add the BDS via blockdev-add so it stays around after the mirror 
block
+# job has been completed
+result = self.vm.qmp('blockdev-add',
+ options={'node-name': 'source',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': source_img}})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('x-blockdev-insert-medium',
+ device='drive0', node_name='source')
+self.assert_qmp(result, 'return', {})
+
+self.assertIntactSourceBackingChain()
+
+if self.existing:
+if self.target_backing:
+qemu_img('create', '-f', iotests.imgfmt,
+ '-b', self.target_backing, target_img, '1M')
+else:
+qemu_img('create', '-f', iotests.imgfmt, target_img, '1M')
+
+if self.cmd == 'blockdev-mirror':
+result = self.vm.qmp('blockdev-add',
+ options={'node-name': 'target',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': 
target_img}})
+self.assert_qmp(result, 'return', {})
+
+def tearDown(self):
+self.vm.shutdown()
+os.remove(source_img)
+os.remove(back2_img)
+os.remove(back1_img)
+os.remove(back0_img)
+try:
+os.remove(target_img)
+except OSError:
+pass
+
+def findBlockNode(self, node_name, id=None):
+if id:
+result = self.vm.qmp('query-block')
+for device in result['return']:
+if device['device'] == id:
+if node_name:
+self.assert_qmp(device, 'inserted/node-name', 
node_name)
+return device['inserted']
+else:
+result = self.vm.qmp('query-named-block-nodes')
+for node in result['return']:
+if node['node-name'] == node_name:
+return node
+
+self.fail('Cannot find node %s/%s' % (id, node_name))
+
+def assertIntactSourceBackingChain(self):
+node = self.findBlockNode('source')
+
+self.assert_qmp(node, 'image' + '/backing-image' * 0 + '/filename',
+source_img)
+self.assert_qmp(node, 'image' + '/backing-image' * 1 + '/filename',
+back2_img)
+self.assert_qmp(node, 'image' + '/backing-image' * 2 + '/filename',
+

Re: [Qemu-block] [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Paolo Bonzini


On 26/05/2016 10:30, Fam Zheng wrote:
>> > 
>> > This doesn't look too wrong...  Should the right sequence of events be
>> > head/after_head or head/after_tail?  It's probably simplest to just emit
>> > all four events.
> I've no idea. (That's why I leaned towards fixing the test case).

Well, fixing the testcase means knowing what events should be emitted.

QEMU with Peter's patch emits head/after_head.  If the right one is
head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
patch keeps the backwards-compatible route.

Thanks,

Paolo



Re: [Qemu-block] [PATCH v3 1/5] block: split write_zeroes always

2016-05-26 Thread Fam Zheng
On Wed, 05/25 21:48, Eric Blake wrote:
> From: "Denis V. Lunev" 
> 
> We should split requests even if they are less than write_zeroes_alignment.
> For example we can have the following request:
>   offset 62k
>   size   4k
>   write_zeroes_alignment 64k
> The original code sent 1 request covering 2 qcow2 clusters, and resulted
> in both clusters being allocated. But by splitting the request, we can
> cater to the case where one of the two clusters can be zeroed as a
> whole, for only 1 cluster allocated after the operation.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Eric Blake 
> CC: Kevin Wolf 
> Message-Id: <1463476543-3087-2-git-send-email-...@openvz.org>
> 
> [eblake: Avoid exceeding nb_sectors, hoist alignment checks out of
> loop, and update testsuite to show that patch works]
> 
> Signed-off-by: Eric Blake 
> ---
>  block/io.c | 30 +-
>  tests/qemu-iotests/154.out | 18 --
>  2 files changed, 29 insertions(+), 19 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index f474b9a..26b5845 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1118,28 +1118,32 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  struct iovec iov = {0};
>  int ret = 0;
>  bool need_flush = false;
> +int head = 0;
> +int tail = 0;
> 
>  int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_write_zeroes,
>  BDRV_REQUEST_MAX_SECTORS);
> +if (bs->bl.write_zeroes_alignment) {
> +assert(is_power_of_2(bs->bl.write_zeroes_alignment));
> +head = sector_num & (bs->bl.write_zeroes_alignment - 1);
> +tail = (sector_num + nb_sectors) & (bs->bl.write_zeroes_alignment - 
> 1);
> +max_write_zeroes &= ~(bs->bl.write_zeroes_alignment - 1);
> +}
> 
>  while (nb_sectors > 0 && !ret) {
>  int num = nb_sectors;
> 
>  /* Align request.  Block drivers can expect the "bulk" of the request
> - * to be aligned.
> + * to be aligned, and that unaligned requests do not cross cluster
> + * boundaries.
>   */
> -if (bs->bl.write_zeroes_alignment
> -&& num > bs->bl.write_zeroes_alignment) {
> -if (sector_num % bs->bl.write_zeroes_alignment != 0) {
> -/* Make a small request up to the first aligned sector.  */
> -num = bs->bl.write_zeroes_alignment;
> -num -= sector_num % bs->bl.write_zeroes_alignment;
> -} else if ((sector_num + num) % bs->bl.write_zeroes_alignment != 
> 0) {
> -/* Shorten the request to the last aligned sector.  num 
> cannot
> - * underflow because num > bs->bl.write_zeroes_alignment.
> - */
> -num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
> -}
> +if (head) {
> +/* Make a small request up to the first aligned sector.  */
> +num = MIN(nb_sectors, bs->bl.write_zeroes_alignment - head);
> +head = 0;
> +} else if (tail && num > bs->bl.write_zeroes_alignment) {
> +/* Shorten the request to the last aligned sector.  */
> +num -= tail;
>  }
> 
>  /* limit request size */
> diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
> index 8946b73..b9d27c5 100644
> --- a/tests/qemu-iotests/154.out
> +++ b/tests/qemu-iotests/154.out
> @@ -106,11 +106,14 @@ read 1024/1024 bytes at offset 67584
>  read 5120/5120 bytes at offset 68608
>  5 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  [{ "start": 0, "length": 32768, "depth": 1, "zero": true, "data": false},
> -{ "start": 32768, "length": 8192, "depth": 0, "zero": false, "data": true, 
> "offset": 20480},
> +{ "start": 32768, "length": 4096, "depth": 0, "zero": false, "data": true, 
> "offset": 20480},
> +{ "start": 36864, "length": 4096, "depth": 0, "zero": true, "data": false},
>  { "start": 40960, "length": 8192, "depth": 1, "zero": true, "data": false},
> -{ "start": 49152, "length": 8192, "depth": 0, "zero": false, "data": true, 
> "offset": 28672},
> +{ "start": 49152, "length": 4096, "depth": 0, "zero": false, "data": true, 
> "offset": 24576},
> +{ "start": 53248, "length": 4096, "depth": 0, "zero": true, "data": false},
>  { "start": 57344, "length": 8192, "depth": 1, "zero": true, "data": false},
> -{ "start": 65536, "length": 8192, "depth": 0, "zero": false, "data": true, 
> "offset": 36864},
> +{ "start": 65536, "length": 4096, "depth": 0, "zero": false, "data": true, 
> "offset": 28672},
> +{ "start": 69632, "length": 4096, "depth": 0, "zero": true, "data": false},
>  { "start": 73728, "length": 134144000, "depth": 1, "zero": true, "data": 
> false}]
> 
>  == spanning two clusters, non-zero after request ==
> @@ -145,11 +148,14 @@ read 7168/7168 bytes at offset 65536
>  read 

Re: [Qemu-block] [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Fam Zheng
On Thu, 05/26 09:55, Paolo Bonzini wrote:
> 
> 
> On 26/05/2016 09:10, Fam Zheng wrote:
> > 
> > diff --git a/block/io.c b/block/io.c
> > index d480097..a6523cf 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -1435,8 +1435,10 @@ int coroutine_fn bdrv_co_pwritev(BlockDriverState 
> > *bs,
> >   * than one aligned block.
> >   */
> >  if (bytes < align) {
> > +bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
> >  qemu_iovec_add(_qiov, head_buf + bytes, align - bytes);
> >  bytes = align;
> > +bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
> >  }
> >  }
> 
> This doesn't look too wrong...  Should the right sequence of events be
> head/after_head or head/after_tail?  It's probably simplest to just emit
> all four events.

I've no idea. (That's why I leaned towards fixing the test case).  But if Kevin
can ack, I'd be happy with this way.

Fam



Re: [Qemu-block] [PATCH 2/2] scsi-generic: Merge block max xfer len in INQUIRY response

2016-05-26 Thread Paolo Bonzini


On 26/05/2016 08:15, Fam Zheng wrote:
> The rationale is similar to the above mode sense response interception:
> this is practically the only channel to communicate restraints from
> elsewhere such as host and block driver.
> 
> The scsi bus we attach onto can have a larger max xfer len than what is
> accepted by the host file system (guarding between the host scsi LUN and
> QEMU), in which case the SG_IO we generate would get -EINVAL.
> 
> Signed-off-by: Fam Zheng 

The two patches are pretty much separate, so I'm queuing this myself.

Thanks,

Paolo

> ---
>  hw/scsi/scsi-generic.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> index 7459465..71372a8 100644
> --- a/hw/scsi/scsi-generic.c
> +++ b/hw/scsi/scsi-generic.c
> @@ -222,6 +222,18 @@ static void scsi_read_complete(void * opaque, int ret)
>  r->buf[3] |= 0x80;
>  }
>  }
> +if (s->type == TYPE_DISK &&
> +r->req.cmd.buf[0] == INQUIRY &&
> +r->req.cmd.buf[2] == 0xb0) {
> +uint32_t max_xfer_len = blk_get_max_transfer_length(s->conf.blk);
> +if (max_xfer_len) {
> +stl_be_p(>buf[8], max_xfer_len);
> +/* Also take care of the opt xfer len. */
> +if (ldl_be_p(>buf[12]) > max_xfer_len) {
> +stl_be_p(>buf[12], max_xfer_len);
> +}
> +}
> +}
>  scsi_req_data(>req, len);
>  scsi_req_unref(>req);
>  }
> 



Re: [Qemu-block] [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Paolo Bonzini


On 26/05/2016 09:10, Fam Zheng wrote:
> 
> diff --git a/block/io.c b/block/io.c
> index d480097..a6523cf 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1435,8 +1435,10 @@ int coroutine_fn bdrv_co_pwritev(BlockDriverState *bs,
>   * than one aligned block.
>   */
>  if (bytes < align) {
> +bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
>  qemu_iovec_add(_qiov, head_buf + bytes, align - bytes);
>  bytes = align;
> +bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
>  }
>  }

This doesn't look too wrong...  Should the right sequence of events be
head/after_head or head/after_tail?  It's probably simplest to just emit
all four events.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Fam Zheng
On Thu, 05/26 14:50, Fam Zheng wrote:
> On Tue, 05/24 16:30, Peter Lieven wrote:
> > in a read-modify-write cycle a small request might cause
> > head and tail to fall into the same aligned block. Currently
> > QEMU reads the same block twice in this case which is
> > not necessary.
> > 
> > Signed-off-by: Peter Lieven 
> 
> Thanks, applied to my block branch:
> 
> https://github.com/famz/qemu/tree/block
> 

Looks like this breaks iotests 077 (hang), the blkdebug break points expected
by the script are not hit now. While squashing in below patch fixes the case, I
think it is more appropriate to keep the patch as is and fix the case itself.

Dropped from my queue, please send another version with test case update so I
can apply together.

diff --git a/block/io.c b/block/io.c
index d480097..a6523cf 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,8 +1435,10 @@ int coroutine_fn bdrv_co_pwritev(BlockDriverState *bs,
  * than one aligned block.
  */
 if (bytes < align) {
+bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
 qemu_iovec_add(_qiov, head_buf + bytes, align - bytes);
 bytes = align;
+bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
 }
 }



Re: [Qemu-block] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-26 Thread Fam Zheng
On Tue, 05/24 16:30, Peter Lieven wrote:
> in a read-modify-write cycle a small request might cause
> head and tail to fall into the same aligned block. Currently
> QEMU reads the same block twice in this case which is
> not necessary.
> 
> Signed-off-by: Peter Lieven 

Thanks, applied to my block branch:

https://github.com/famz/qemu/tree/block



[Qemu-block] [PATCH 0/2] block: Expose host block dev I/O size limit in scsi-block/scsi-generic

2016-05-26 Thread Fam Zheng
Host devices passed through as scsi-block or scsi-generic may have a compound
maximum I/O limit (out of physical hardware limit, driver quirks and file
system configuration, etc) that is presented in the sysfs entry. SG_IOs we
issue should respect this. However the value is currently not discoverable by
guest, because the vHBA (virtio-scsi) would present an fixed emulated limit,
while the INQUIRY (vpd=0xb0, block limits page) response solely speaks for the
LUN itself, not the host kernel. The issue is observed on scsi passthrough of
host usb or dm-multipath disks, and it is not specific to certain device types.

The proposed solution is collecting the host sysfs limit in raw-posix driver
when a block device is used, and intercepting INQUIRY response to clip the
max xfer len field.

This fixes booting a SanDisk usb-key with an upstream kernel.  The usb disk
reports a nonsense large value in INQUIRY, while the host (usb quirk?) only
allows 120KB.


Fam Zheng (2):
  raw-posix: Fetch max sectors for host block device from sysfs
  scsi-generic: Merge block max xfer len in INQUIRY response

 block/raw-posix.c  | 47 +++
 hw/scsi/scsi-generic.c | 12 
 2 files changed, 59 insertions(+)

-- 
2.8.2




[Qemu-block] [PATCH 1/2] raw-posix: Fetch max sectors for host block device from sysfs

2016-05-26 Thread Fam Zheng
This is sometimes a useful value we should count in.

Signed-off-by: Fam Zheng 
---
 block/raw-posix.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a4f5a1b..d3796ad 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -729,9 +729,56 @@ static void raw_reopen_abort(BDRVReopenState *state)
 state->opaque = NULL;
 }
 
+static int hdev_get_max_transfer_length(dev_t dev)
+{
+int ret;
+int fd;
+char *path;
+const char *end;
+char buf[32];
+long len;
+
+path = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_sectors_kb",
+   major(dev), minor(dev));
+fd = open(path, O_RDONLY);
+if (fd < 0) {
+ret = -errno;
+goto out;
+}
+ret = read(fd, buf, sizeof(buf));
+if (ret < 0) {
+ret = -errno;
+goto out;
+} else if (ret == 0) {
+ret = -EIO;
+goto out;
+}
+buf[ret] = 0;
+/* The file is ended with '\n', pass 'end' to accept that. */
+ret = qemu_strtol(buf, , 10, );
+if (ret == 0 && end && *end == '\n') {
+ret = len * 1024 / BDRV_SECTOR_SIZE;
+}
+
+close(fd);
+out:
+g_free(path);
+return ret;
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
+struct stat st;
+
+if (!fstat(s->fd, )) {
+if (S_ISBLK(st.st_mode)) {
+int ret = hdev_get_max_transfer_length(st.st_rdev);
+if (ret >= 0) {
+bs->bl.max_transfer_length = ret;
+}
+}
+}
 
 raw_probe_alignment(bs, s->fd, errp);
 bs->bl.min_mem_alignment = s->buf_align;
-- 
2.8.2




[Qemu-block] [PATCH 2/2] scsi-generic: Merge block max xfer len in INQUIRY response

2016-05-26 Thread Fam Zheng
The rationale is similar to the above mode sense response interception:
this is practically the only channel to communicate restraints from
elsewhere such as host and block driver.

The scsi bus we attach onto can have a larger max xfer len than what is
accepted by the host file system (guarding between the host scsi LUN and
QEMU), in which case the SG_IO we generate would get -EINVAL.

Signed-off-by: Fam Zheng 
---
 hw/scsi/scsi-generic.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 7459465..71372a8 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -222,6 +222,18 @@ static void scsi_read_complete(void * opaque, int ret)
 r->buf[3] |= 0x80;
 }
 }
+if (s->type == TYPE_DISK &&
+r->req.cmd.buf[0] == INQUIRY &&
+r->req.cmd.buf[2] == 0xb0) {
+uint32_t max_xfer_len = blk_get_max_transfer_length(s->conf.blk);
+if (max_xfer_len) {
+stl_be_p(>buf[8], max_xfer_len);
+/* Also take care of the opt xfer len. */
+if (ldl_be_p(>buf[12]) > max_xfer_len) {
+stl_be_p(>buf[12], max_xfer_len);
+}
+}
+}
 scsi_req_data(>req, len);
 scsi_req_unref(>req);
 }
-- 
2.8.2