Small update on this,

On 7/15/20 1:10 PM, Claudio Fontana wrote:
> Hi Thomas,
> 
> On 7/14/20 4:35 PM, Thomas Huth wrote:
>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>> Hello,
>>>
>>> I have made some tiny progress in narrowing down this issue. It is
>>> possibly a qcow2 issue, still unclear, so I am involving Kevin Wolf
>>> and Max Reitz.
>>>
>>> The reproducer again:
>>>
>>>> --------------------------------------------cut-------------------------------------------
>>>> diff --git a/cpus.c b/cpus.c
>>>> index 41d1c5099f..443b88697a 100644
>>>> --- a/cpus.c
>>>> +++ b/cpus.c
>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>  
>>>>  static bool icount_state_needed(void *opaque)
>>>>  {
>>>> -    return use_icount;
>>>> +    return 0;
>>>>  }
>>>>  
>>>>  static bool warp_timer_state_needed(void *opaque)
>>>> --------------------------------------------cut-------------------------------------------
>>>
>>> For now, this issue appears on s390 only:
>>>
>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2
>>> backing file part, with broken migration stream data in the s390-skeys
>>> vmsave (old style).
>> [...]
>>> If someone has a good idea, let me know - first attempts to reproduce
>>> on x86 failed, but maybe more work could lead to it.
>>
> 
> Small update: in the GOOD case (enough padding added), a qcow_merge() is
> triggered for the last write of 16202 bytes.
> In the BAD case (not enough padding added), a qcow_merge() is not
> triggered for the last write of 16201 bytes.
> 
> Note: manually flushing with qemu_fflush in the s390-skeys vmsave also
> works (maybe this got lost in the noise; see the sketch at the end of
> this mail).
> 
>> Two questions:
>>
>> 1) Can you also reproduce the issue manually, without running iotest
>> 267? ... I tried, but so far I failed.
> 
> Thanks for the suggestion, will try.
Currently I am trying to manually reproduce an environment similar to
that of the test; at the moment I am not able to reproduce the issue
manually. Not being very familiar with s390, I have been running:

export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
$QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME

where FILENAME is the qcow2 produced by the test. Let me know if you
have a suggestion on how to set up something simple properly.

> 
>>
>> 2) Since all the information so far sounds like the problem could be
>> elsewhere in the code, and the skeys just catch it by accident ... have
>> you tried running with valgrind? Maybe it catches something useful?
> 
> Nothing yet, but will fiddle with the options a bit more.

The only thing I have seen so far:

+==33321== 
+==33321== Warning: client switching stacks?  SP change: 0x1ffeffe5e8 --> 0x5d9cf60
+==33321==          to suppress, use: --max-stackframe=137324009096 or greater
+==33321== Warning: client switching stacks?  SP change: 0x5d9cd18 --> 0x1ffeffe5e8
+==33321==          to suppress, use: --max-stackframe=137324009680 or greater
+==33321== Warning: client switching stacks?  SP change: 0x1ffeffe8b8 --> 0x5d9ce58
+==33321==          to suppress, use: --max-stackframe=137324010080 or greater
+==33321==          further instances of this message will not be shown.
+==33321== Thread 4:
+==33321== Conditional jump or move depends on uninitialised value(s)
+==33321==    at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
+==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
+==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
+==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
+==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
+==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
+==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
+==33321== 
+==33321== Conditional jump or move depends on uninitialised value(s)
+==33321==    at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
+==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
+==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
+==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
+==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
+==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
+==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
+==33321== 
+==33321== 
+==33321== HEAP SUMMARY:
+==33321==     in use at exit: 2,138,442 bytes in 13,935 blocks
+==33321==   total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes allocated
+==33321== 
+==33321== LEAK SUMMARY:
+==33321==    definitely lost: 0 bytes in 0 blocks
+==33321==    indirectly lost: 0 bytes in 0 blocks
+==33321==      possibly lost: 7,150 bytes in 111 blocks
+==33321==    still reachable: 2,131,292 bytes in 13,824 blocks
+==33321==         suppressed: 0 bytes in 0 blocks
+==33321== Rerun with --leak-check=full to see details of leaked memory

>>
>> Thomas
>>
> 
> Ciao,
> 
> Claudio

A more interesting update is what follows, I think. I was able to "fix"
the problem shown by the reproducer:

@@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
 
 static bool icount_state_needed(void *opaque)
 {
-    return use_icount;
+    return 0;
 }

by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
without changing any fields between runs (other than the icount field
removal from the reproducer), as sketched below.
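For reference, the slowdown was of this shape (a sketch only: the helper
name qcow2_debug_delay and the loop bound are illustrative, not the exact
code used; the point is merely to delay the qcow2 write coroutine without
modifying any data):

--------------------------------------------cut-------------------------------------------
/* Sketch of the tight-loop delay: called at the start of
 * qcow2_co_pwritev_task_entry() in block/qcow2.c, before the actual
 * write is carried out. */
static void qcow2_debug_delay(void)
{
    /* volatile so the compiler cannot optimize the busy-wait away */
    volatile unsigned long spin;

    for (spin = 0; spin < 100000000UL; spin++) {
        /* burn cycles only; no state is touched */
    }
}
--------------------------------------------cut-------------------------------------------

Delaying the write like this seems to make the test pass even with the
icount field removed, which again points at a timing/buffering dependency
rather than at the skeys data itself.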
I tried to insert the same slowdown just in savevm.c, at the end of
save_snapshot, but that does not work; it needs to be in the coroutine.

Thanks,

Claudio
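P.S.: for completeness, the manual flush in the s390-skeys vmsave
mentioned earlier is of this shape (also a sketch: the body of the
old-style save handler in hw/s390x/s390-skeys.c is elided here, and the
only change is the final qemu_fflush() call):

--------------------------------------------cut-------------------------------------------
static void s390_storage_keys_save(QEMUFile *f, void *opaque)
{
    /* ... existing storage key serialization, unchanged ... */

    /* Push the buffered migration stream out right away instead of
     * leaving the skeys data sitting in the QEMUFile buffer; this
     * also makes the problem disappear. */
    qemu_fflush(f);
}
--------------------------------------------cut-------------------------------------------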