Small update on this,

On 7/15/20 1:10 PM, Claudio Fontana wrote:
> Hi Thomas,
> 
> On 7/14/20 4:35 PM, Thomas Huth wrote:
>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>> Hello,
>>>
>>> I have made some tiny progress in narrowing down this issue. It is
>>> possibly a qcow2 issue, still unclear, so I am involving Kevin Wolf
>>> and Max Reitz.
>>>
>>> The reproducer again:
>>>
>>>> --------------------------------------------cut-------------------------------------------
>>>> diff --git a/cpus.c b/cpus.c
>>>> index 41d1c5099f..443b88697a 100644
>>>> --- a/cpus.c
>>>> +++ b/cpus.c
>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>  
>>>>  static bool icount_state_needed(void *opaque)
>>>>  {
>>>> -    return use_icount;
>>>> +    return 0;
>>>>  }
>>>>  
>>>>  static bool warp_timer_state_needed(void *opaque)
>>>> --------------------------------------------cut-------------------------------------------
>>>
>>> For now, this issue appears on s390 only:
>>>
>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2
>>> backing file part, with broken migration stream data in the s390-skeys
>>> vmsave (old style).
>> [...]
>>> If someone has a good idea, let me know - first attempts to reproduce
>>> on x86 failed, but maybe more work could lead to it.
>>
> 
> Small update: in the GOOD case (enough padding added), a qcow_merge() is
> triggered for the last write of 16202 bytes.
> In the BAD case (not enough padding added), a qcow_merge() is not
> triggered for the last write of 16201 bytes.
> 
> Note: manually flushing with qemu_fflush in the s390-skeys vmsave also
> works (maybe this got lost in the noise; see the sketch at the end of
> this mail).
> 
>> Two questions:
>>
>> 1) Can you also reproduce the issue manually, without running iotest
>> 267? ... I tried, but so far I failed.
> 
> Thanks for the suggestion, will try.
Currently I am trying to manually reproduce an environment similar to
that of the test; at the moment I am not able to reproduce the issue
manually. Not being very familiar with s390, I have been running:

export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
$QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME

where FILENAME is the qcow2 produced by the test. Let me know if you
have a suggestion on how to set up something simple properly.

> 
>>
>> 2) Since all the information so far sounds like the problem could be
>> elsewhere in the code, and the skeys just catch it by accident ... have
>> you tried running with valgrind? Maybe it catches something useful?
> 
> Nothing yet, but will fiddle with the options a bit more.

The only thing I have seen so far:

+==33321== 
+==33321== Warning: client switching stacks?  SP change: 0x1ffeffe5e8 --> 0x5d9cf60
+==33321==          to suppress, use: --max-stackframe=137324009096 or greater
+==33321== Warning: client switching stacks?  SP change: 0x5d9cd18 --> 0x1ffeffe5e8
+==33321==          to suppress, use: --max-stackframe=137324009680 or greater
+==33321== Warning: client switching stacks?  SP change: 0x1ffeffe8b8 --> 0x5d9ce58
+==33321==          to suppress, use: --max-stackframe=137324010080 or greater
+==33321==          further instances of this message will not be shown.
+==33321== Thread 4:
+==33321== Conditional jump or move depends on uninitialised value(s)
+==33321==    at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
+==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
+==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
+==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
+==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
+==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
+==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
+==33321== 
+==33321== Conditional jump or move depends on uninitialised value(s)
+==33321==    at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
+==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
+==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
+==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
+==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
+==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
+==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
+==33321== 
+==33321== 
+==33321== HEAP SUMMARY:
+==33321==     in use at exit: 2,138,442 bytes in 13,935 blocks
+==33321==   total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes allocated
+==33321== 
+==33321== LEAK SUMMARY:
+==33321==    definitely lost: 0 bytes in 0 blocks
+==33321==    indirectly lost: 0 bytes in 0 blocks
+==33321==      possibly lost: 7,150 bytes in 111 blocks
+==33321==    still reachable: 2,131,292 bytes in 13,824 blocks
+==33321==         suppressed: 0 bytes in 0 blocks
+==33321== Rerun with --leak-check=full to see details of leaked memory

>>
>> Thomas
>>
> 
> Ciao,
> 
> Claudio

A more interesting update is what follows, I think. I was able to "fix"
the problem shown by the reproducer:

@@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
 
 static bool icount_state_needed(void *opaque)
 {
-    return use_icount;
+    return 0;
 }

by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
without changing any fields between runs (other than the icount field
removal from the reproducer), as sketched below.
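For reference, the slowdown was of this shape (a sketch only: the helper
name qcow2_debug_delay and the loop bound are illustrative, not the exact
code used; the point is merely to delay the qcow2 write coroutine without
modifying any data):

--------------------------------------------cut-------------------------------------------
/* Sketch of the tight-loop delay: called at the start of
 * qcow2_co_pwritev_task_entry() in block/qcow2.c, before the actual
 * write is carried out. */
static void qcow2_debug_delay(void)
{
    /* volatile so the compiler cannot optimize the busy-wait away */
    volatile unsigned long spin;

    for (spin = 0; spin < 100000000UL; spin++) {
        /* burn cycles only; no state is touched */
    }
}
--------------------------------------------cut-------------------------------------------

Delaying the write like this seems to make the test pass even with the
icount field removed, which again points at a timing/buffering dependency
rather than at the skeys data itself.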
I tried to insert the same slowdown just in savevm.c, at the end of
save_snapshot, but that does not work; it needs to be in the coroutine.

Thanks,

Claudio
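P.S.: for completeness, the manual flush in the s390-skeys vmsave
mentioned earlier is of this shape (also a sketch: the body of the
old-style save handler in hw/s390x/s390-skeys.c is elided here, and the
only change is the final qemu_fflush() call):

--------------------------------------------cut-------------------------------------------
static void s390_storage_keys_save(QEMUFile *f, void *opaque)
{
    /* ... existing storage key serialization, unchanged ... */

    /* Push the buffered migration stream out right away instead of
     * leaving the skeys data sitting in the QEMUFile buffer; this
     * also makes the problem disappear. */
    qemu_fflush(f);
}
--------------------------------------------cut-------------------------------------------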