Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data

2018-11-17 Thread Nir Soffer
On Sat, Nov 17, 2018 at 11:13 PM Richard W.M. Jones wrote:

> On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> > On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf  wrote:
> >
> > > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer wrote:
> > > > >
> > > > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer wrote:
> > > > > >
> > > > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf wrote:
> > > > >>
> > > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjo...@redhat.com> wrote:
> > > > >>> >
> > > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
> > > > >>> > > doesn't advertise zero support to the client:
> > > > >>> > >
> > > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> > > > >>> > >   --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> > > > >>> > >2154 Write
> > > > >>> > >
> > > > >>> > > Not surprisingly no zero commands are issued.  The size of the write
> > > > >>> > > commands is very uneven -- it appears to send one command per block
> > > > >>> > > of zeroes or data.
> > > > >>> > >
> > > > >>> > > Nir: If we could get information from imageio about whether zeroing is
> > > > >>> > > implemented efficiently or not by the backend, we could change
> > > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > > >>> >
> > > > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > > > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > > > >>> >
> > > > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > > > >>> > slower, and it looks like qemu sends even more requests in this case
> > > > >>> > (2154 vs ~1300).
> > > > >>> >
> > > > >>> > Looks like this optimization on the qemu side leads to worse performance,
> > > > >>> > so it should not be enabled by default.
> > > > >>>
> > > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > > >>> support efficient zero writes (which file systems, the most common case,
> > > > >>> generally do), doing one big write_zeroes request at the start can
> > > > >>> improve performance quite a bit.
> > > > >>>
> > > > >>> It seems the problem is that we can't really know whether the operation
> > > > >>> will be efficient because the backends generally don't tell us. Maybe
> > > > >>> NBD could introduce a flag for this, but in the general case it appears
> > > > >>> to me that we'll have to have a command line option.
> > > > >>>
> > > > >>> However, I'm curious what your exact use case and the backend used in it
> > > > >>> is? Can something be improved there to actually get efficient zero
> > > > >>> writes and get even better performance than by just disabling the big
> > > > >>> zero write?
> > > > >>
> > > > >>
> > > > >> The backend is some NetApp storage connected via FC. I don't have
> > > > >> more info on this. We get zero rate of about 1G/s on this storage, which
> > > > >> is quite slow compared with other storage we tested.
> > > > >>
> > > > >> One option we are checking now is whether this is the kernel's silent fallback to
> > > > >> manual zeroing when the server advertises a wrong value of write_same_max_bytes.
> > > > >>
> > > > >
> > > > > We eliminated this using blkdiscard. This is what we get on this
> > > > > storage when zeroing a 100G LV:
> > > > >
> > > > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > > > > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > > > done
> > > > >
> > > > > real 4m50.851s
> > > > > user 0m0.065s
> > > > > sys 0m1.482s
> > > > >
> > > > > real 4m30.504s
> > > > > user 0m0.047s
> > > > > sys 0m0.870s
> > > > >
> > > > > real 4m19.443s
> > > > > user 0m0.029s
> > > > > sys 0m0.508s
> > > > >
> > > > > real 4m13.016s
> > > > > user 0m0.020s
> > > > > sys 0m0.284s
> > > > >
> > > > > real 2m45.888s
> > > > > user 0m0.011s
> > > > > sys 0m0.162s
> > > > >
> > > > > real 2m10.153s
> > > > > user 0m0.003s
> > > > > sys 0m0.100s
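
(As a rough cross-check, not part of the original mail: the fastest run above, 2m10s for 100G with -p 32m, is broadly consistent with the ~1G/s rate mentioned earlier in the thread.)

  $ awk 'BEGIN { printf "%.2f GiB/s\n", 100 / (2 * 60 + 10.153) }'
  0.77 GiB/s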
> > > > >
> > > > > We are investigating why we get low throughput on this server, and also
> > > > > will check several other servers.
> > > > >
> > > > > Having a command line option to control this behavior sounds good. I don't
> > > > >> have enough data to tell what should be the default, but I think the safe
> > > > >> way would be to keep the old behavior.
> > > > >>
> > > > >
> > > > > We filed this bug:
> > > > > https://bugzilla.redhat.com/1648622
> > > > >
> > > >
> > > > More data from even slower storage - zeroing 10G lv on Kaminario K2
> > > >
> > > 

Re: [Qemu-block] [PATCH 3/6] tests: pull qemu-nbd iotest helpers into common.nbd file

2018-11-17 Thread Eric Blake

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

The helpers for starting/stopping qemu-nbd in 058 will be useful in
other test cases, so move them into a common.nbd file.

Signed-off-by: Daniel P. Berrangé 
---



+function nbd_server_start_unix_socket()
+{
+nbd_server_stop
+$QEMU_NBD -v -t -k "$nbd_unix_socket" $@ &


Needs to be "$@" to properly preserve whitespace and/or empty arguments 
(the latter if someone passes -x '' for a default-named export).
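
A quick way to see the difference (a sketch in bash; the size=1g argument is just illustrative):

  set -- -x '' size=1g       # simulate the arguments handed to the helper
  printf '<%s>\n' $@         # unquoted: the empty export name is dropped
  printf '<%s>\n' "$@"       # quoted: all three arguments survive intact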


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 0/6] Misc fixes to NBD

2018-11-17 Thread Eric Blake

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

This makes two minor fixes to the NBD code and adds significant test coverage
of the NBD TLS support to detect future problems.

The first two patches should be for 3.1.

The tests can wait till 4.0 if desired.

Daniel P. Berrangé (6):
   nbd: fix whitespace in server error message
   nbd: stop waiting for a NBD response with NBD_CMD_DISC
   tests: pull qemu-nbd iotest helpers into common.nbd file
   tests: check if qemu-nbd is still alive before waiting
   tests: add iotests helpers for dealing with TLS certificates
   tests: exercise NBD server in TLS mode



I'm still missing your S-o-b on 6. I've posted a preliminary version of 
your series with my touchups incorporated, if you'd like to double check 
it, at:


https://repo.or.cz/qemu/ericb.git/shortlog/refs/heads/nbd


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 2/6 for-3.1] nbd: stop waiting for a NBD response with NBD_CMD_DISC

2018-11-17 Thread Eric Blake

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

When sending a NBD_CMD_DISC message no reply is expected; however, the
nbd_read_eof() coroutine is still waiting for one. On a plain NBD
connection this doesn't matter, as it will just get an EOF, but on a TLS
connection it will get an interrupted TLS data packet. nbd_read_eof()
will then print an error message on the console due to the missing
reply to NBD_CMD_DISC.

This can be seen with qemu-img

   $ qemu-img info \
   --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
   --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
   qemu-img: Cannot read from TLS channel: Input/output error
   image: nbd://127.0.0.1:9000
   file format: nbd
   virtual size: 10M (10485760 bytes)
   disk size: unavailable

Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
get the coroutine to stop waiting for a reply and thus suppress the
error message.


Actually, it's not quite enough - once you actually start performing 
I/O, enough coroutines are kicked off that the error still happens:


$  qemu-io -c 'r 1m 1m' -c 'w -P 0x22 1m 1m' --image-opts \
--object tls-creds-x509,dir=scratch/tls/client1,endpoint=client,id=tls0\
 driver=nbd,host=localhost,port=10809,tls-creds=tls0
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0430 sec (23.204 MiB/sec and 23.2040 ops/sec)
wrote 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0152 sec (65.479 MiB/sec and 65.4793 ops/sec)
Cannot read from TLS channel: Input/output error

Squashing this in on top of your patch helps, though:

diff --git i/block/nbd-client.c w/block/nbd-client.c
index 5f63e4b8f15..e7916c78996 100644
--- i/block/nbd-client.c
+++ w/block/nbd-client.c
@@ -79,7 +79,14 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)

 assert(s->reply.handle == 0);
  ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
 if (local_err) {
-error_report_err(local_err);
+/* If we are already quitting, either another error has
+ * already been reported, or we requested NBD_CMD_DISC and
+ * don't need to report anything further.  */
+if (!s->quit) {
+error_report_err(local_err);
+} else {
+error_free(local_err);
+}
 }
 if (ret <= 0) {
 break;

But I want to do more testing to make sure I'm not missing out on 
reporting an actual error if I add that.
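
One way to double-check that (a sketch reusing the reproducer setup above; the kill timing is illustrative) is to make the server fail mid-request, where the client did not ask for NBD_CMD_DISC and the error must still be reported:

  $ qemu-io -c 'r 0 64m' --image-opts \
      --object tls-creds-x509,dir=scratch/tls/client1,endpoint=client,id=tls0 \
      driver=nbd,host=localhost,port=10809,tls-creds=tls0 &
  $ kill $(pidof qemu-nbd)    # aborting the server mid-read should still
                              # surface an I/O error on the client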


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



[Qemu-block] [PATCH 7/6] iotests: Also test I/O over NBD TLS

2018-11-17 Thread Eric Blake
Enhance test 233 to also perform I/O beyond the initial handshake.

Signed-off-by: Eric Blake 
---

Depends on my tweak to 2/6 to suppress an EIO error message
on a failed read after NBD_CMD_DISC.

 tests/qemu-iotests/233 | 12 +++-
 tests/qemu-iotests/233.out | 10 ++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/233 b/tests/qemu-iotests/233
index a1ba8c09c06..5b6982be6ad 100755
--- a/tests/qemu-iotests/233
+++ b/tests/qemu-iotests/233
@@ -61,7 +61,7 @@ tls_x509_create_client "ca2" "client2"
 echo
 echo "== preparing image =="
 _make_test_img 64M
-
+$QEMU_IO -c 'w -P 0x11 1m 1m' "$TEST_IMG" | _filter_qemu_io

 echo
 echo "== check TLS client to plain server fails =="
@@ -95,6 +95,16 @@ $QEMU_IMG info --image-opts \
 driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \
 2>&1 | sed "s/$nbd_tcp_port/PORT/g"

+echo
+echo "== perform I/O over TLS =="
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+$QEMU_IO -c 'r -P 0x11 1m 1m' -c 'w -P 0x22 1m 1m' --image-opts \
+--object tls-creds-x509,dir=${tls_dir}/client1,endpoint=client,id=tls0 \
+driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \
+2>&1 | sed "s/$nbd_tcp_port/PORT/g" | _filter_qemu_io
+
+$QEMU_IO -f qcow2 -r -U -c 'r -P 0x22 1m 1m' "$TEST_IMG" | _filter_qemu_io
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/233.out b/tests/qemu-iotests/233.out
index 616e9238c89..94acd9b9479 100644
--- a/tests/qemu-iotests/233.out
+++ b/tests/qemu-iotests/233.out
@@ -9,6 +9,8 @@ Generating a signed certificate...

 == preparing image ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

 == check TLS client to plain server fails ==
 qemu-img: Could not open 'driver=nbd,host=127.0.0.1,port=PORT,tls-creds=tls0': Denied by server for option 5 (starttls)
@@ -27,4 +29,12 @@ disk size: unavailable
 == check TLS with different CA fails ==
 option negotiation failed: Verify failed: No certificate was found.
 qemu-img: Could not open 'driver=nbd,host=127.0.0.1,port=PORT,tls-creds=tls0': The certificate hasn't got a known issuer
+
+== perform I/O over TLS ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.17.2




[Qemu-block] [PATCH 1.5/6] nbd/server: Ignore write errors when replying to NBD_OPT_ABORT

2018-11-17 Thread Eric Blake
Commit 37ec36f6 intentionally ignores errors when trying to reply
to an NBD_OPT_ABORT request from plaintext clients, but did not make
the same change for a TLS server.  Since NBD_OPT_ABORT is
documented as a likely source of EPIPE (the client may hang up
without waiting for our reply), we don't need to pollute the
server's output with that failure.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 056cfa5ad47..dc04513de70 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1134,12 +1134,16 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags,
 return -EINVAL;

 default:
-ret = nbd_opt_drop(client, NBD_REP_ERR_TLS_REQD, errp,
+/* Let the client keep trying, unless they asked to
+ * quit. Always try to give an error back to the
+ * client; but when replying to OPT_ABORT, be aware
+ * that the client may hang up before receiving the
+ * error, in which case we are fine ignoring the
+ * resulting EPIPE. */
+ret = nbd_opt_drop(client, NBD_REP_ERR_TLS_REQD,
+   option == NBD_OPT_ABORT ? NULL : errp,
"Option 0x%" PRIx32
" not permitted before TLS", option);
-/* Let the client keep trying, unless they asked to
- * quit. In this mode, we've already sent an error, so
- * we can't ack the abort.  */
 if (option == NBD_OPT_ABORT) {
 return 1;
 }
-- 
2.17.2




Re: [Qemu-block] [PATCH 6/6] tests: exercise NBD server in TLS mode

2018-11-17 Thread Eric Blake

On 11/17/18 2:49 PM, Eric Blake wrote:

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

Add tests that validate it is possible to connect to an NBD server
running in TLS mode. Also test that mismatched TLS vs non-TLS
connections correctly fail.
---


Missing your Signed-off-by. Can you please supply that, so I can include 
this in my pull request?


Also, I'm getting failures when trying to test it:

@@ -17,9 +17,9 @@

  == check plain client to TLS server fails ==
  option negotiation failed: read failed: Unexpected end-of-file before all bytes were read

+write failed (error message): Unable to write to socket: Broken pipe
  qemu-img: Could not open 'nbd://localhost:10809': TLS negotiation required before option 8 (structured reply)

  server reported: Option 0x8 not permitted before TLS
-write failed (error message): Unable to write to socket: Broken pipe


which looks like an output race. :(


Found and squashed it - commit 37ec36f6 fixed plaintext servers to not 
be noisy for NBD_OPT_ABORT, but did not give equal treatment to TLS 
servers. Patch coming up separately.


So, with my fixes, I can add:

Tested-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [PATCH 6/6] tests: exercise NBD server in TLS mode

2018-11-17 Thread Eric Blake

On 11/16/18 11:20 AM, Eric Blake wrote:

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

Add tests that validate it is possible to connect to an NBD server
running in TLS mode. Also test that mismatched TLS vs non-TLS
connections correctly fail.
---



+== check TLS client to plain server fails ==
+option negotiation failed: read failed: Unexpected end-of-file before all bytes were read


Annoying message; I wonder if we can clean that up. But not this patch's 
problem.




Actually, I tracked this message down to the use of socat (which 
connects and then abruptly exits) when probing whether the socket is up 
and listening.  That is, the message is produced as a side effect 
of nbd_server_wait_for_tcp_socket rather than during the actual 
$QEMU_IMG command we are interested in testing.
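
A probe that only inspects the listening state avoids generating any client traffic on the NBD port at all (a sketch; the port number is the test's default):

  ss -tln | grep -sqE ':10809\b' && echo "server is listening"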




  nbd_pid_file="${TEST_DIR}/qemu-nbd.pid"
  function nbd_server_stop()
@@ -62,3 +63,49 @@ function nbd_server_start_unix_socket()
  $QEMU_NBD -v -t -k "$nbd_unix_socket" $@ &
  nbd_server_wait_for_unix_socket $!
  }
+
+function nbd_server_set_tcp_port()
+{
+    for port in `seq 10809 10909`
+    do
+    socat TCP:$nbd_tcp_addr:$port STDIO < /dev/null 1>/dev/null 2>&1


This is the first use of socat in iotests.  Might not be the most 
portable, but I don't know if I have better ideas. 
nbdkit.git/tests/test-ip.sh greps the output of 'ss -ltn' to locate free 
ports, but I don't know if ss is any better than socat.


So, I'm planning to squash this in, to use ss instead of socat, as follows:

diff --git i/tests/qemu-iotests/common.nbd w/tests/qemu-iotests/common.nbd
index 0483ea7c55a..d73af285abd 100644
--- i/tests/qemu-iotests/common.nbd
+++ w/tests/qemu-iotests/common.nbd
@@ -66,12 +66,12 @@ function nbd_server_start_unix_socket()

 function nbd_server_set_tcp_port()
 {
-for port in `seq 10809 10909`
+(ss --help) >/dev/null 2>&1 || _notrun "ss utility not found, skipping test"
+
+for ((port = 10809; port <= 10909; port++))
 do
-   socat TCP:$nbd_tcp_addr:$port STDIO < /dev/null 1>/dev/null 2>&1
-if test $? != 0
-   then
-   nbd_tcp_port=$port
+if ! ss -tln | grep -sqE ":$port\b"; then
+nbd_tcp_port=$port
 return
 fi
 done
@@ -86,9 +86,7 @@ function nbd_server_wait_for_tcp_socket()

 for ((i = 0; i < 300; i++))
 do
-socat TCP:localhost:$nbd_tcp_port STDIO < /dev/null 1>/dev/null 2>&1
-if test $? == 0
-   then
+if ss -tln | grep -sqE ":$nbd_tcp_port\b"; then
 return
 fi
 kill -s 0 $pid 2>/dev/null
diff --git i/tests/qemu-iotests/233.out w/tests/qemu-iotests/233.out
index eaa410c2703..eb4077f9fd7 100644
--- i/tests/qemu-iotests/233.out
+++ w/tests/qemu-iotests/233.out
@@ -11,12 +11,10 @@ Generating a signed certificate...
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864

 == check TLS client to plain server fails ==
-option negotiation failed: read failed: Unexpected end-of-file before all bytes were read
 qemu-img: Could not open 'driver=nbd,host=127.0.0.1,port=10809,tls-creds=tls0': Denied by server for option 5 (starttls)

 server reported: TLS not configured

 == check plain client to TLS server fails ==
-option negotiation failed: read failed: Unexpected end-of-file before all bytes were read
 qemu-img: Could not open 'nbd://localhost:10809': TLS negotiation required before option 8 (structured reply)

 server reported: Option 0x8 not permitted before TLS
 write failed (error message): Unable to write to socket: Broken pipe


Also, you have to sanitize 233.out to change 10809 into PORT, so the 
test can still pass when it picked a different port.
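
That is, filter the live output through the same sed the other checks in the test already use, so the reference file can carry PORT instead of a literal number (a sketch, using the variables defined in common.nbd and 233):

  $QEMU_IMG info --image-opts \
      driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \
      2>&1 | sed "s/$nbd_tcp_port/PORT/g"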


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 6/6] tests: exercise NBD server in TLS mode

2018-11-17 Thread Eric Blake

On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:

Add tests that validate it is possible to connect to an NBD server
running in TLS mode. Also test that mismatched TLS vs non-TLS
connections correctly fail.
---


Missing your Signed-off-by. Can you please supply that, so I can include 
this in my pull request?


Also, I'm getting failures when trying to test it:

@@ -17,9 +17,9 @@

 == check plain client to TLS server fails ==
 option negotiation failed: read failed: Unexpected end-of-file before all bytes were read

+write failed (error message): Unable to write to socket: Broken pipe
 qemu-img: Could not open 'nbd://localhost:10809': TLS negotiation required before option 8 (structured reply)

 server reported: Option 0x8 not permitted before TLS
-write failed (error message): Unable to write to socket: Broken pipe


which looks like an output race. :(

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data

2018-11-17 Thread Richard W.M. Jones
On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf  wrote:
> 
> > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer  wrote:
> > >
> > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer  wrote:
> > > >
> > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf  wrote:
> > > >>
> > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones 
> > > >>> wrote:
> > > >>> >
> > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so
> > that
> > > >>> it
> > > >>> > > doesn't advertise zero support to the client:
> > > >>> > >
> > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> > > >>> logfile=/tmp/log \
> > > >>> > >   --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' |
> > uniq
> > > >>> -c
> > > >>> > >2154 Write
> > > >>> > >
> > > >>> > > Not surprisingly no zero commands are issued.  The size of the
> > write
> > > >>> > > commands is very uneven -- it appears to send one command per
> > > >>> block
> > > >>> > > of zeroes or data.
> > > >>> > >
> > > >>> > > Nir: If we could get information from imageio about whether
> > zeroing
> > > >>> is
> > > >>> > > implemented efficiently or not by the backend, we could change
> > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > >>> >
> > > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > > >>> >
> > > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > > >>> > slower, and it looks like qemu sends even more requests in this case
> > > >>> > (2154 vs ~1300).
> > > >>> >
> > > >>> > Looks like this optimization on the qemu side leads to worse
> > performance,
> > > >>> > so it should not be enabled by default.
> > > >>>
> > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > >>> support efficient zero writes (which file systems, the most common
> > case,
> > > >>> generally do), doing one big write_zeroes request at the start can
> > > >>> improve performance quite a bit.
> > > >>>
> > > >>> It seems the problem is that we can't really know whether the
> > operation
> > > >>> will be efficient because the backends generally don't tell us. Maybe
> > > >>> NBD could introduce a flag for this, but in the general case it
> > appears
> > > >>> to me that we'll have to have a command line option.
> > > >>>
> > > >>> However, I'm curious what your exact use case and the backend used
> > in it
> > > >>> is? Can something be improved there to actually get efficient zero
> > > >>> writes and get even better performance than by just disabling the big
> > > >>> zero write?
> > > >>
> > > >>
> > > >> The backend is some NetApp storage connected via FC. I don't have
> > > >> more info on this. We get zero rate of about 1G/s on this storage,
> > which
> > > >> is quite slow compared with other storage we tested.
> > > >>
> > > >> One option we are checking now is whether this is the kernel's silent fallback to
> > > >> manual zeroing when the server advertises a wrong value of write_same_max_bytes.
> > > >>
> > > >
> > > > We eliminated this using blkdiscard. This is what we get on this
> > > > storage when zeroing a 100G LV:
> > > >
> > > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > > >
> > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > > done
> > > >
> > > > real 4m50.851s
> > > > user 0m0.065s
> > > > sys 0m1.482s
> > > >
> > > > real 4m30.504s
> > > > user 0m0.047s
> > > > sys 0m0.870s
> > > >
> > > > real 4m19.443s
> > > > user 0m0.029s
> > > > sys 0m0.508s
> > > >
> > > > real 4m13.016s
> > > > user 0m0.020s
> > > > sys 0m0.284s
> > > >
> > > > real 2m45.888s
> > > > user 0m0.011s
> > > > sys 0m0.162s
> > > >
> > > > real 2m10.153s
> > > > user 0m0.003s
> > > > sys 0m0.100s
> > > >
> > > > We are investigating why we get low throughput on this server, and also
> > > > will check
> > > > several other servers.
> > > >
> > > > Having a command line option to control this behavior sounds good. I don't
> > > >> have enough data to tell what should be the default, but I think the safe
> > > >> way would be to keep the old behavior.
> > > >>
> > > >
> > > > We filed this bug:
> > > > https://bugzilla.redhat.com/1648622
> > > >
> > >
> > > More data from even slower storage - zeroing 10G lv on Kaminario K2
> > >
> > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > >
> > > real50m12.425s
> > > user0m0.018s
> > > sys 2m6.785s
> > >
> > > Maybe something is wrong with this storage, since we see this:
> > >
> > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > >
> > > Since BLKZEROOUT always falls back to manual 

Re: [Qemu-block] Change in qemu 2.12 causes qemu-img convert to NBD to write more data

2018-11-17 Thread Nir Soffer
On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf  wrote:

> Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer  wrote:
> >
> > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer  wrote:
> > >
> > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf  wrote:
> > >>
> > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones 
> > >>> wrote:
> > >>> >
> > >>> > > Another thing I tried was to change the NBD server (nbdkit) so
> that
> > >>> it
> > >>> > > doesn't advertise zero support to the client:
> > >>> > >
> > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G
> > >>> logfile=/tmp/log \
> > >>> > >   --run './qemu-img convert ./fedora-28.img -n $nbd'
> > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' |
> uniq
> > >>> -c
> > >>> > >2154 Write
> > >>> > >
> > >>> > > Not surprisingly no zero commands are issued.  The size of the
> write
> > >>> > > commands is very uneven -- it appears to send one command per
> > >>> block
> > >>> > > of zeroes or data.
> > >>> > >
> > >>> > > Nir: If we could get information from imageio about whether
> zeroing
> > >>> is
> > >>> > > implemented efficiently or not by the backend, we could change
> > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > >>> >
> > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > >>> > succeeds, falling back to manual zeroing in the kernel silently
> > >>> >
> > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > >>> > slower, and it looks like qemu sends even more requests in this case
> > >>> > (2154 vs ~1300).
> > >>> >
> > >>> > Looks like this optimization on the qemu side leads to worse
> performance,
> > >>> > so it should not be enabled by default.
> > >>>
> > >>> Well, that's overgeneralising your case a bit. If the backend does
> > >>> support efficient zero writes (which file systems, the most common
> case,
> > >>> generally do), doing one big write_zeroes request at the start can
> > >>> improve performance quite a bit.
> > >>>
> > >>> It seems the problem is that we can't really know whether the
> operation
> > >>> will be efficient because the backends generally don't tell us. Maybe
> > >>> NBD could introduce a flag for this, but in the general case it
> appears
> > >>> to me that we'll have to have a command line option.
> > >>>
> > >>> However, I'm curious what your exact use case and the backend used
> in it
> > >>> is? Can something be improved there to actually get efficient zero
> > >>> writes and get even better performance than by just disabling the big
> > >>> zero write?
> > >>
> > >>
> > >> The backend is some NetApp storage connected via FC. I don't have
> > >> more info on this. We get zero rate of about 1G/s on this storage,
> which
> > >> is quite slow compared with other storage we tested.
> > >>
> > >> One option we are checking now is whether this is the kernel's silent fallback to
> > >> manual zeroing when the server advertises a wrong value of write_same_max_bytes.
> > >>
> > >
> > > We eliminated this using blkdiscard. This is what we get on this
> > > storage when zeroing a 100G LV:
> > >
> > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > >
> /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > done
> > >
> > > real 4m50.851s
> > > user 0m0.065s
> > > sys 0m1.482s
> > >
> > > real 4m30.504s
> > > user 0m0.047s
> > > sys 0m0.870s
> > >
> > > real 4m19.443s
> > > user 0m0.029s
> > > sys 0m0.508s
> > >
> > > real 4m13.016s
> > > user 0m0.020s
> > > sys 0m0.284s
> > >
> > > real 2m45.888s
> > > user 0m0.011s
> > > sys 0m0.162s
> > >
> > > real 2m10.153s
> > > user 0m0.003s
> > > sys 0m0.100s
> > >
> > > We are investigating why we get low throughput on this server, and also
> > > will check
> > > several other servers.
> > >
> > > Having a command line option to control this behavior sounds good. I don't
> > >> have enough data to tell what should be the default, but I think the safe
> > >> way would be to keep the old behavior.
> > >>
> > >
> > > We filed this bug:
> > > https://bugzilla.redhat.com/1648622
> > >
> >
> > More data from even slower storage - zeroing 10G lv on Kaminario K2
> >
> > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> >
> > real50m12.425s
> > user0m0.018s
> > sys 2m6.785s
> >
> > Maybe something is wrong with this storage, since we see this:
> >
> > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > /sys/block/dm-29/queue/write_same_max_bytes:512
> >
> > Since BLKZEROOUT always falls back to manual slow zeroing silently,
> > maybe we can disable the aggressive pre-zero of the entire device
> > for block devices, and keep this optimization for files when fallocate()
> > is supported?
>
> I'm not sure what the detour through NBD changes, but qemu-img directly
> on a block device doesn't use BLKZEROOUT first, but
> FALLOC_FL_PUNCH_HOLE.
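
As a rough illustration of the distinction discussed in this thread (a sketch, not from the original mails; the device path is the one used above and the 512-byte threshold is only a heuristic), the sysfs limit can serve as a hint before bulk pre-zeroing a block device:

  dev=/dev/test_vg/test_lv2
  queue=/sys/block/$(basename "$(readlink -f "$dev")")/queue
  if [ "$(cat "$queue/write_same_max_bytes")" -gt 512 ]; then
      blkdiscard -z -p 32m "$dev"   # zero offload looks usable; pre-zero the LV
  else
      echo "no efficient zero offload advertised; skipping bulk pre-zeroing"
  fi

For regular files the question does not arise in the same way, since, as noted above, qemu-img uses FALLOC_FL_PUNCH_HOLE rather than BLKZEROOUT.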