Re: [PATCH v3 0/7] Switch iotests to using Async QMP

2021-10-13 Thread John Snow
On Wed, Oct 13, 2021 at 10:49 AM Hanna Reitz  wrote:

> On 13.10.21 16:00, John Snow wrote:
> >
> >
> > On Wed, Oct 13, 2021 at 8:51 AM John Snow  wrote:
> >
> >
> >
> > On Wed, Oct 13, 2021 at 4:45 AM Hanna Reitz 
> wrote:
> >
> > On 13.10.21 00:34, John Snow wrote:
> > > Based-on: <20211012214152.802483-1-js...@redhat.com>
> > >[PULL 00/10] Python patches
> > > GitLab:
> >
> https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
> > > CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591
> > >
> > > Hiya,
> > >
> > > This series continues where the last two AQMP series left
> > off and adds a
> > > synchronous 'legacy' wrapper around the new AQMP interface,
> > then drops
> > > it straight into iotests to prove that AQMP is functional
> > and totally
> > > cool and fine. The disruption and churn to iotests is pretty
> > minimal.
> > >
> > > In the event that a regression happens and I am not
> > physically proximate
> > > to inflict damage upon, one may set the
> > QEMU_PYTHON_LEGACY_QMP variable
> > > to any non-empty string as it pleases you to engage the QMP
> > machinery
> > > you are used to.
> > >
> > > I'd like to try and get this committed early in the 6.2
> > development
> > > cycle to give ample time to smooth over any possible
> > regressions. I've
> > > tested it locally and via gitlab CI, across Python versions
> > 3.6 through
> > > 3.10, and "worksforme". If something bad happens, we can
> > revert the
> > > actual switch-flip very trivially.
> >
> > So running iotests locally, I got one failure:
> >
> > $ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
> > [...]
> > 300 fail   [10:28:06] [10:28:11]
> > 5.1s output mismatch (see 300.out.bad)
> > --- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
> > +++ 300.out.bad
> > @@ -1,4 +1,5 @@
> > -...
> >
>  +..ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader:
> >
> > ConnectionResetError: [Errno 104] Connection reset by peer
> > +.
> >
>--
> >   Ran 39 tests
> > [...]
> >
> >
> > Oh, unfortunate.
> >
> >
> > I’m afraid I can’t really give a reproducer or anything.  It
> > feels like
> >
> >
> > Thank you for the report!
> >
> > just some random spurious timing-related error. Although then
> > again,
> > 300 does have an `except machine.AbnormalShutdown` clause at one
> > point...  So perhaps that’s the culprit, and we need to
> > disable logging
> > there.
> >
> >
> > I'll investigate!
> >
> >
> > Unfortunately, even in a loop some 150 times I couldn't reproduce this
> > one. As you point out, it appears to be just a failure caused by
> > logging. The test logic itself completes as expected.
> >
> > Still, I would expect, on a "clean" shutdown of the destination host
> > (where the destination process fails to load the migration stream and
> > voluntarily exits with an error code) to end with a FIN/ACK for TCP or
> > ... uh, whatever happens for a UNIX socket. Where's the Connection
> > Reset coming from? Did the destination VM process *crash*?
> >
> > I'm not so sure that I *should* silence this error, but I also can't
> > reproduce it at all to answer these questions, so uh. uhhh. I guess I
> > will just hammer it on a loop a few hundred times more and see if I
> > get lucky.
>
> I could reproduce it, by running 20 instances concurrently.  (Needs a
> change to testrunner.py, so that the reference outputs don’t collide:
>
> diff --git a/tests/qemu-iotests/testrunner.py
> b/tests/qemu-iotests/testrunner.py
> index a56b6da396..fd0a3a1eeb 100644
> --- a/tests/qemu-iotests/testrunner.py
> +++ b/tests/qemu-iotests/testrunner.py
> @@ -221,7 +221,7 @@ def find_reference(self, test: str) -> str:
>
>   def do_run_test(self, test: str) -> TestResult:
>   f_test = Path(test)
> -f_bad = Path(f_test.name + '.out.bad')
> +f_bad = Path(f'{os.getpid()}-{f_test.name}.out.bad')
>   f_notrun = Path(f_test.name + '.notrun')
>   f_casenotrun = Path(f_test.name + '.casenotrun')
>   f_reference = Path(self.find_reference(test))
>
> )
>
> And then:
>
> $ while TEST_DIR=/tmp/vdi-$$ ./check -vdi 300; do; done
>
> Which pretty quickly shows the error in at least one of those loops
> (under a minute).
>
> As far as I can tell, changing the log level in 300 does indeed fix it:
>
> diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
> in

Re: [PATCH v3 0/7] Switch iotests to using Async QMP

2021-10-13 Thread Hanna Reitz

On 13.10.21 16:00, John Snow wrote:



On Wed, Oct 13, 2021 at 8:51 AM John Snow  wrote:



On Wed, Oct 13, 2021 at 4:45 AM Hanna Reitz  wrote:

On 13.10.21 00:34, John Snow wrote:
> Based-on: <20211012214152.802483-1-js...@redhat.com>
>            [PULL 00/10] Python patches
> GitLab:
https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
> CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591
>
> Hiya,
>
> This series continues where the last two AQMP series left
off and adds a
> synchronous 'legacy' wrapper around the new AQMP interface,
then drops
> it straight into iotests to prove that AQMP is functional
and totally
> cool and fine. The disruption and churn to iotests is pretty
minimal.
>
> In the event that a regression happens and I am not
physically proximate
> to inflict damage upon, one may set the
QEMU_PYTHON_LEGACY_QMP variable
> to any non-empty string as it pleases you to engage the QMP
machinery
> you are used to.
>
> I'd like to try and get this committed early in the 6.2
development
> cycle to give ample time to smooth over any possible
regressions. I've
> tested it locally and via gitlab CI, across Python versions
3.6 through
> 3.10, and "worksforme". If something bad happens, we can
revert the
> actual switch-flip very trivially.

So running iotests locally, I got one failure:

$ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
[...]
300 fail   [10:28:06] [10:28:11]
5.1s output mismatch (see 300.out.bad)
--- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
+++ 300.out.bad
@@ -1,4 +1,5 @@
-...
+..ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader:

ConnectionResetError: [Errno 104] Connection reset by peer
+.
  --
  Ran 39 tests
[...]


Oh, unfortunate.


I’m afraid I can’t really give a reproducer or anything.  It
feels like


Thank you for the report!

just some random spurious timing-related error. Although then
again,
300 does have an `except machine.AbnormalShutdown` clause at one
point...  So perhaps that’s the culprit, and we need to
disable logging
there.


I'll investigate!


Unfortunately, even in a loop some 150 times I couldn't reproduce this 
one. As you point out, it appears to be just a failure caused by 
logging. The test logic itself completes as expected.


Still, I would expect, on a "clean" shutdown of the destination host 
(where the destination process fails to load the migration stream and 
voluntarily exits with an error code) to end with a FIN/ACK for TCP or 
... uh, whatever happens for a UNIX socket. Where's the Connection 
Reset coming from? Did the destination VM process *crash*?


I'm not so sure that I *should* silence this error, but I also can't 
reproduce it at all to answer these questions, so uh. uhhh. I guess I 
will just hammer it on a loop a few hundred times more and see if I 
get lucky.


I could reproduce it, by running 20 instances concurrently.  (Needs a 
change to testrunner.py, so that the reference outputs don’t collide:


diff --git a/tests/qemu-iotests/testrunner.py 
b/tests/qemu-iotests/testrunner.py

index a56b6da396..fd0a3a1eeb 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -221,7 +221,7 @@ def find_reference(self, test: str) -> str:

 def do_run_test(self, test: str) -> TestResult:
 f_test = Path(test)
-    f_bad = Path(f_test.name + '.out.bad')
+    f_bad = Path(f'{os.getpid()}-{f_test.name}.out.bad')
 f_notrun = Path(f_test.name + '.notrun')
 f_casenotrun = Path(f_test.name + '.casenotrun')
 f_reference = Path(self.find_reference(test))

)

And then:

$ while TEST_DIR=/tmp/vdi-$$ ./check -vdi 300; do; done

Which pretty quickly shows the error in at least one of those loops 
(under a minute).


As far as I can tell, changing the log level in 300 does indeed fix it:

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index 10f9f2a8da..096f5dabf0 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -27,6 +27,7 @@ from typing import Dict, List, Optional
 from qemu.machine import machine

 import iotests
+from iotests import change_log_level


 BlockBitmapMapping = List[Dict[str, object]]
@@ -464,7 +465,8 @@ class 
TestBlockBitmapMappingErrors(TestDirtyBitmapMigration):

 # Expect abnormal shutdown of the destination VM because of

Re: [PATCH v3 0/7] Switch iotests to using Async QMP

2021-10-13 Thread John Snow
On Wed, Oct 13, 2021 at 8:51 AM John Snow  wrote:

>
>
> On Wed, Oct 13, 2021 at 4:45 AM Hanna Reitz  wrote:
>
>> On 13.10.21 00:34, John Snow wrote:
>> > Based-on: <20211012214152.802483-1-js...@redhat.com>
>> >[PULL 00/10] Python patches
>> > GitLab:
>> https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
>> > CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591
>> >
>> > Hiya,
>> >
>> > This series continues where the last two AQMP series left off and adds a
>> > synchronous 'legacy' wrapper around the new AQMP interface, then drops
>> > it straight into iotests to prove that AQMP is functional and totally
>> > cool and fine. The disruption and churn to iotests is pretty minimal.
>> >
>> > In the event that a regression happens and I am not physically proximate
>> > to inflict damage upon, one may set the QEMU_PYTHON_LEGACY_QMP variable
>> > to any non-empty string as it pleases you to engage the QMP machinery
>> > you are used to.
>> >
>> > I'd like to try and get this committed early in the 6.2 development
>> > cycle to give ample time to smooth over any possible regressions. I've
>> > tested it locally and via gitlab CI, across Python versions 3.6 through
>> > 3.10, and "worksforme". If something bad happens, we can revert the
>> > actual switch-flip very trivially.
>>
>> So running iotests locally, I got one failure:
>>
>> $ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
>> [...]
>> 300 fail   [10:28:06] [10:28:11]
>> 5.1s output mismatch (see 300.out.bad)
>> --- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
>> +++ 300.out.bad
>> @@ -1,4 +1,5 @@
>> -...
>> +..ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader:
>> ConnectionResetError: [Errno 104] Connection reset by peer
>> +.
>>   --
>>   Ran 39 tests
>> [...]
>>
>>
> Oh, unfortunate.
>
>
>>
>> I’m afraid I can’t really give a reproducer or anything.  It feels like
>>
>
> Thank you for the report!
>
>
>> just some random spurious timing-related error.  Although then again,
>> 300 does have an `except machine.AbnormalShutdown` clause at one
>> point...  So perhaps that’s the culprit, and we need to disable logging
>> there.
>>
>>
> I'll investigate!
>

Unfortunately, even in a loop some 150 times I couldn't reproduce this one.
As you point out, it appears to be just a failure caused by logging. The
test logic itself completes as expected.

Still, I would expect, on a "clean" shutdown of the destination host (where
the destination process fails to load the migration stream and voluntarily
exits with an error code) to end with a FIN/ACK for TCP or ... uh, whatever
happens for a UNIX socket. Where's the Connection Reset coming from? Did
the destination VM process *crash*?

I'm not so sure that I *should* silence this error, but I also can't
reproduce it at all to answer these questions, so uh. uhhh. I guess I will
just hammer it on a loop a few hundred times more and see if I get lucky.


Re: [PATCH v3 0/7] Switch iotests to using Async QMP

2021-10-13 Thread John Snow
On Wed, Oct 13, 2021 at 4:45 AM Hanna Reitz  wrote:

> On 13.10.21 00:34, John Snow wrote:
> > Based-on: <20211012214152.802483-1-js...@redhat.com>
> >[PULL 00/10] Python patches
> > GitLab:
> https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
> > CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591
> >
> > Hiya,
> >
> > This series continues where the last two AQMP series left off and adds a
> > synchronous 'legacy' wrapper around the new AQMP interface, then drops
> > it straight into iotests to prove that AQMP is functional and totally
> > cool and fine. The disruption and churn to iotests is pretty minimal.
> >
> > In the event that a regression happens and I am not physically proximate
> > to inflict damage upon, one may set the QEMU_PYTHON_LEGACY_QMP variable
> > to any non-empty string as it pleases you to engage the QMP machinery
> > you are used to.
> >
> > I'd like to try and get this committed early in the 6.2 development
> > cycle to give ample time to smooth over any possible regressions. I've
> > tested it locally and via gitlab CI, across Python versions 3.6 through
> > 3.10, and "worksforme". If something bad happens, we can revert the
> > actual switch-flip very trivially.
>
> So running iotests locally, I got one failure:
>
> $ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
> [...]
> 300 fail   [10:28:06] [10:28:11]
> 5.1s output mismatch (see 300.out.bad)
> --- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
> +++ 300.out.bad
> @@ -1,4 +1,5 @@
> -...
> +..ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader:
> ConnectionResetError: [Errno 104] Connection reset by peer
> +.
>   --
>   Ran 39 tests
> [...]
>
>
Oh, unfortunate.


>
> I’m afraid I can’t really give a reproducer or anything.  It feels like
>

Thank you for the report!


> just some random spurious timing-related error.  Although then again,
> 300 does have an `except machine.AbnormalShutdown` clause at one
> point...  So perhaps that’s the culprit, and we need to disable logging
> there.
>
>
I'll investigate!


Re: [PATCH v3 0/7] Switch iotests to using Async QMP

2021-10-13 Thread Hanna Reitz

On 13.10.21 00:34, John Snow wrote:

Based-on: <20211012214152.802483-1-js...@redhat.com>
   [PULL 00/10] Python patches
GitLab: https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591

Hiya,

This series continues where the last two AQMP series left off and adds a
synchronous 'legacy' wrapper around the new AQMP interface, then drops
it straight into iotests to prove that AQMP is functional and totally
cool and fine. The disruption and churn to iotests is pretty minimal.

In the event that a regression happens and I am not physically proximate
to inflict damage upon, one may set the QEMU_PYTHON_LEGACY_QMP variable
to any non-empty string as it pleases you to engage the QMP machinery
you are used to.

I'd like to try and get this committed early in the 6.2 development
cycle to give ample time to smooth over any possible regressions. I've
tested it locally and via gitlab CI, across Python versions 3.6 through
3.10, and "worksforme". If something bad happens, we can revert the
actual switch-flip very trivially.


So running iotests locally, I got one failure:

$ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
[...]
300 fail   [10:28:06] [10:28:11] 
5.1s output mismatch (see 300.out.bad)

--- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
+++ 300.out.bad
@@ -1,4 +1,5 @@
-...
+..ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader: 
ConnectionResetError: [Errno 104] Connection reset by peer

+.
 --
 Ran 39 tests
[...]


I’m afraid I can’t really give a reproducer or anything.  It feels like 
just some random spurious timing-related error.  Although then again, 
300 does have an `except machine.AbnormalShutdown` clause at one 
point...  So perhaps that’s the culprit, and we need to disable logging 
there.


Hanna