[Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-05-30 Thread osstest service owner
flight 123379 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/123379/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2   fail REGR. vs. 123323
 test-armhf-armhf-xl-arndale   5 host-ping-check-native   fail REGR. vs. 123323

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop   fail like 123323
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop   fail like 123323
 test-armhf-armhf-libvirt 14 saverestore-support-check   fail like 123323
 test-armhf-armhf-libvirt-xsm 14 saverestore-support-check   fail like 123323
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop   fail like 123323
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop   fail like 123323
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop   fail like 123323
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check   fail like 123323
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop   fail like 123323
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop   fail like 123323
 test-amd64-i386-xl-pvshim 12 guest-start   fail never pass
 test-amd64-amd64-libvirt 13 migrate-support-check   fail never pass
 test-amd64-i386-libvirt 13 migrate-support-check   fail never pass
 test-arm64-arm64-xl-credit2 13 migrate-support-check   fail never pass
 test-arm64-arm64-xl-credit2 14 saverestore-support-check   fail never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-check   fail never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-check   fail never pass
 test-amd64-i386-libvirt-xsm 13 migrate-support-check   fail never pass
 test-arm64-arm64-xl-xsm 13 migrate-support-check   fail never pass
 test-arm64-arm64-xl-xsm 14 saverestore-support-check   fail never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check   fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check   fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check   fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2   fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check   fail never pass
 test-armhf-armhf-xl-credit2 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-credit2 14 saverestore-support-check   fail never pass
 test-armhf-armhf-xl-xsm 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-xsm 14 saverestore-support-check   fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check   fail never pass
 test-armhf-armhf-xl 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl 14 saverestore-support-check   fail never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-check   fail never pass
 test-armhf-armhf-libvirt 13 migrate-support-check   fail never pass
 test-armhf-armhf-libvirt-xsm 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check   fail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check   fail never pass
 test-arm64-arm64-xl 13 migrate-support-check   fail never pass
 test-arm64-arm64-xl 14 saverestore-support-check   fail never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check   fail never pass
 test-armhf-armhf-xl-vhd 12 migrate-support-check   fail never pass
 test-armhf-armhf-xl-vhd 13 saverestore-support-check   fail never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop   fail never pass
 test-amd64-i386-xl-qemuu-win10-i386 10 windows-install   fail never pass
 test-amd64-amd64-xl-qemuu-win10-i386 10 windows-install   fail never pass
 test-amd64-amd64-xl-qemut-win10-i386 10 windows-install   fail never pass
 test-amd64-i386-xl-qemut-win10-i386 10 windows-install   fail never pass

version targeted for testing:
 xen  06f542f8f2e446c01bd0edab51e9450af7f6e05b
baseline version:
 xen  fc5805daef091240cd5fc06634a8bcdb2f3bb843

Last test of basis   123323  2018-05-28 23:34:10 Z    2 days
Testing same since   123379  2018-05-29 21:42:20 Z    1 days    1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Ian Jackson 
  Jan Beulich 
  Juergen Gross 
  Lars Kurth 
  Marek Marczykowski-Górecki 
  Tim Deegan 
  Wei Liu 

jobs:
 build-amd64-xsm  pass  

[Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-12 Thread Juergen Gross
On 08/06/18 12:12, Juergen Gross wrote:
> On 07/06/18 13:30, Juergen Gross wrote:
>> On 06/06/18 11:40, Juergen Gross wrote:
>>> On 06/06/18 11:35, Jan Beulich wrote:
>>> On 05.06.18 at 18:19,  wrote:
>>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 
>>> guest-saverestore.2 
>
> I thought I would reply again with the key point from my earlier mail
> highlighted, and go a bit further.  The first thing to go wrong in
> this was:
>
> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = 
> Bad address): Internal error
> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): 
> Internal error
> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad 
> address
>
> You can see similar messages in the other logfile:
>
> 2018-05-30 22:12:49.650+0000: libxl: 
> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
> domain: domain responded to suspend request: Bad address
>
> All of these are reports of the same thing: xc_get_pfn_type_batch at
> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>
> There is no corresponding message in the host's serial log nor the
> dom0 kernel log.

 I vaguely recall from the time when I had looked at the similar Windows
 migration issues that the guest is already in the process of being cleaned
 up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
 warning") intentionally suppressed a log message here, and the
 immediately following debugging code (933f966bcd x86/mm: add
 temporary debugging code to get_page_from_gfn_p2m()) was reverted
 a little over a month later. This wasn't as a follow-up to another patch
 (fix), but following the discussion rooted at
 https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
>>>
>>> That was -ESRCH, not -EFAULT.
>>
>> I've looked a little bit more into this.
>>
>> As we are seeing EFAULT being returned by the hypervisor this either
>> means the tools are specifying an invalid address (quite unlikely)
>> or the buffers are not as MAP_LOCKED as we wish them to be.
>>
>> Is there a way to see whether the host was experiencing some memory
>> shortage, so the buffers might have been swapped out?
>>
>> man mmap tells me: "This implementation will try to populate (prefault)
>> the whole range but the mmap call doesn't fail with ENOMEM if this
>> fails. Therefore major faults might happen later on."
>>
>> And: "One should use mmap(2) plus mlock(2) when major faults are not
>> acceptable after the initialization of the mapping."
>>
>> With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
>> hypercall buffer pages before doing the hypercall I'm not sure this
>> could be an issue.
>>
>> Any thoughts on that?
> 
> Ian, is there a chance to dedicate a machine to a specific test trying
> to reproduce the problem? In case we manage to get this failure in a
> reasonable time frame I guess the most promising approach would be to
> use a test hypervisor producing more debug data. If you think this is
> worth doing I can write a patch.

Trying to reproduce the problem in a limited test environment finally
worked: doing a loop of "xl save -c" produced the problem after 198
iterations.

I have asked a SUSE engineer who works on kernel memory management
whether he could think of something. His idea is that some kthread could be
the reason for our problem, e.g. trying page migration or compaction
(at least on the test machine I've looked at compaction of mlocked
pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).

In order to be really sure nothing in the kernel can temporarily
switch hypercall buffer pages read-only or invalid for the hypervisor
we'll have to modify the privcmd driver interface: it will have to
gain knowledge which pages are handed over to the hypervisor as buffers
in order to be able to lock them accordingly via get_user_pages().
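
As an illustration only, the kernel-side pinning could look roughly like
this (all names here are invented, the real privcmd patch would differ,
and the exact get_user_pages_fast() signature varies across kernel
versions):

/* Hypothetical sketch: pin a user-space hypercall buffer so its PTEs
 * stay valid and writable while the hypervisor accesses it.  Pinned
 * pages are also left alone by migration/compaction. */
static int privcmd_lock_hcall_buf(unsigned long start, unsigned long len,
                                  struct page ***pages_out, int *nr_out)
{
        int nr = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        struct page **pages;
        int pinned;

        pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        pinned = get_user_pages_fast(start, nr, 1 /* write */, pages);
        if (pinned != nr) {
                while (pinned > 0)              /* undo partial pinning */
                        put_page(pages[--pinned]);
                kfree(pages);
                return -EFAULT;
        }

        *pages_out = pages;
        *nr_out = nr;
        return 0;
}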

While this is a possible explanation of the fault we are seeing, the
fault might also have another cause. So I'm going to apply some
modifications to the hypervisor to gather more diagnostics in order to
verify that the suspected kernel behavior really is what makes the
hypervisor return EFAULT.


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-05-31 Thread Juergen Gross
On 31/05/18 08:00, osstest service owner wrote:
> flight 123379 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/123379/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
> fail REGR. vs. 123323

AFAICS this seems to be the suspected Windows reboot again?

>  test-armhf-armhf-xl-arndale   5 host-ping-check-native   fail REGR. vs. 
> 123323

Flaky hardware again?


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-05-31 Thread Juergen Gross
On 31/05/18 10:32, Juergen Gross wrote:
> On 31/05/18 08:00, osstest service owner wrote:
>> flight 123379 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/123379/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
>> fail REGR. vs. 123323
> 
> AFAICS this seems to be the suspected Windows reboot again?

Hmm, thinking more about it: xl save is done with the domU paused,
so the guest rebooting concurrently is rather improbable.

As this is an issue that occurs only sporadically, and not just during
the 4.11 development phase, I don't think it should be a blocker.

Thoughts?


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-01 Thread Jan Beulich
>>> On 31.05.18 at 11:14,  wrote:
> On 31/05/18 10:32, Juergen Gross wrote:
>> On 31/05/18 08:00, osstest service owner wrote:
>>> flight 123379 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/123379/ 
>>>
>>> Regressions :-(
>>>
>>> Tests which did not succeed and are blocking,
>>> including tests which could not be run:
>>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
> fail REGR. vs. 123323
>> 
>> AFAICS this seems to be the suspected Windows reboot again?
> 
> Hmm, thinking more about it: xl save is done with the domU paused,
> so the guest rebooting concurrently is rather improbable.

Not sure, considering e.g.

libxl: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
domain: domain responded to suspend request: Bad address

When looking into the Windows reboot issue (note that we're not dealing
with Windows here), I had noticed that there was a problem with trying
to save the guest at the "wrong" time. Generally, as explained back then,
I think the tool stack should honor the guest trying to reboot when it is
already in the process of being migrated/saved, and migration/save
should not even be attempted when the guest has already signaled
reboot (iirc it's only the former that is an actual issue). Otherwise the
tool stack will internally try to drive the same guest into two distinct new
states at the same time. Giving reboot (or shutdown) higher priority than
migration/save seems natural to me: A rebooting guest can be moved to
the new host with no migration cost at all, and a shut down guest doesn't
need (live) moving in the first place.

> As this is an issue occurring sporadically not only during 4.11
> development phase I don't think this should be a blocker.

Yes and no: Yes, it's not a regression. But as long as we don't make this
a blocker, I don't think the issue will be addressed, considering for how
long it has been there already.

Jan




Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-01 Thread Juergen Gross
On 01/06/18 10:10, Jan Beulich wrote:
 On 31.05.18 at 11:14,  wrote:
>> On 31/05/18 10:32, Juergen Gross wrote:
>>> On 31/05/18 08:00, osstest service owner wrote:
 flight 123379 xen-unstable real [real]
 http://logs.test-lab.xenproject.org/osstest/logs/123379/ 

 Regressions :-(

 Tests which did not succeed and are blocking,
 including tests which could not be run:
  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
>> fail REGR. vs. 123323
>>>
>>> AFAICS this seems to be the suspected Windows reboot again?
>>
>> Hmm, thinking more about it: xl save is done with the domU paused,
>> so the guest rebooting concurrently is rather improbable.
> 
> Not sure, considering e.g.
> 
> libxl: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
> domain: domain responded to suspend request: Bad address

That was at 2018-05-30 22:12:49.650+0000

Before that there was:

2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14
= Bad address): Internal error

But looking at the messages issued some seconds before that I see some
xenstore watch related messages in:

http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-libxl-driver.log

which make me wonder whether the libxl watch handling is really
correct: e.g. libxl__ev_xswatch_register() first registers the watch
with xenstore and only then writes the data needed for processing the
watch in the related structure. Could it be that the real suspend watch
event was interpreted as a @releaseDomain event?
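
Schematically, the suspected ordering would be something like this
(illustrative pseudo-code only, not the actual libxl implementation;
struct watch and the callback type are invented for the sketch, while
xs_watch() is the normal libxenstore call):

struct watch {
        const char *token;
        const char *path;
        void (*callback)(struct watch *w);
};

/* Suspected hazard: the watch goes live in xenstored at (1), so it
 * could fire before the bookkeeping at (2) is done, and the event
 * might then be matched against stale slot data. */
int ev_xswatch_register(struct xs_handle *xsh, struct watch *w,
                        const char *path, void (*fn)(struct watch *))
{
        if (!xs_watch(xsh, path, w->token))     /* (1) watch is live */
                return -1;
        w->path = path;                         /* (2) filled only now */
        w->callback = fn;
        return 0;
}

Ian's reply below concludes this is a red herring: libxl's per-slot
counter and path check make it discard such stale events.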


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-05 Thread Ian Jackson
Juergen Gross writes ("Re: [Xen-devel] [xen-unstable test] 123379: regressions 
- FAIL"):
> Before that there was:
> 
> 2018-05-30 22:12:49.320+: xc: Failed to get types for pfn batch (14
> = Bad address): Internal error

This seems to be the only message about the root cause.

> But looking at the messages issued some seconds before that I see some
> xenstore watch related messages in:
> 
> http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-libxl-driver.log

I think this is all a red herring.

What I see happening is:

2018-05-30 22:12:44.695+0000: libxl: 
libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005e8 
wpath=/local/domain/3/control/shutdown token=2/b: register slotnum=2

libxl starts watching the domain's shutdown control node.  I think
this is done from near libxl_dom_suspend.c:202.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb40005e8 wpath=/local/domain/3/control/shutdown token=2/b: event 
epath=/local/domain/3/control/shutdown

The watch we just set triggers.  This happens with every xenstore
watch, after it is set up - ie, it does not mean that anything had
written the node.

2018-05-30 22:12:44.696+0000: libxl: 
libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0xb40005e8 
wpath=/local/domain/3/control/shutdown token=2/b: deregister slotnum=2

libxl stops watching the domain's shutdown control node.  This is
done, I think, by domain_suspend_common_pvcontrol_suspending
(libxl_dom_suspend.c:222).

We can conclude that
  if (!rc && !domain_suspend_pvcontrol_acked(state))
was not taken.  It seems unlikely that rc!=0, because the
node is read in xswait_xswatch_callback using libxl__xs_read_checked
which I think would log a message.  So probably
/local/domain/3/control/shutdown was `suspend', meaning the domain had
indeed acked the suspend request.

2018-05-30 22:12:44.696+0000: libxl: 
libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005f8 
wpath=@releaseDomain token=2/c: register slotnum=2

This is the watch registration in domain_suspend_common_wait_guest.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:548:watchfd_callback: watch 
w=0xb40005f8 epath=/local/domain/3/control/shutdown token=2/b: counter != c

This is a watch event for the watch we set up at 2018-05-30
22:12:44.696+0000.  You can tell because the token is the same.  But
that watch was cancelled within libxl at 2018-05-30
22:12:44.696+0000.  libxl's watch handling machinery knows this, and
discards the event.  "counter != c", libxl_event.c:547.

It does indeed use the same slot in the libxl xswatch data structure,
but libxl can distinguish it by the counter and the event path.  (In
any case xs watch handlers should tolerate spurious events and be
idempotent, although that does not matter here.)

I think this must be the watch event from the guest actually writing
its acknowledgement to the control node - we would indeed expect two
such events, one generated by the watch setup, and one from the
guest's write.  The timing meant that here we processed the guest's
written value as a result of the first watch event.  This is fine.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb40005f8 wpath=@releaseDomain token=2/c: event epath=@releaseDomain

This is the immediate-auto-firing of the @releaseDomain event set up
at 2018-05-30 22:12:44.696+0000.  libxl's xswatch machinery looks this
up in slot 2 and finds that the counter and paths are right, so it
will dispatch it to suspend_common_wait_guest_watch which is a
frontend for suspend_common_wait_guest_check.

In the absence of log messages from that function we can conclude that
  !(info.flags & XEN_DOMINF_shutdown)
ie the guest has not shut down yet.

2018-05-30 22:12:44.720+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb2a26708 wpath=@releaseDomain token=3/0: event epath=@releaseDomain

This is a watch event which was set up much earlier at 2018-05-30
21:58:02.182+0000.  The surrounding context there (references to
domain_death_xswatch_callback) makes it clear that this is pursuant to
libxl_evenable_domain_death - ie, libvirt asked libxl to monitor for
the death of the domain.

2018-05-30 22:12:44.724+0000: libxl: 
libxl_domain.c:816:domain_death_xswatch_callback:  shutdown reporting

The output here is a bit perplexing.  I don't understand how we can
have the message "shutdown reporting" without any previous message
"Exists shutdown_reported=%d" or "[evg=%p] nentries=%d rc=%d %ld..%ld"
both of which seem to precede the "shutdown reporting" message in
domain_death_xswatch_callback.

However, we can conclude that, at this point, libxl finds that
  got->flags & XEN_DOMINF_shutdown
and it decides to inform libvirt that the domain has shut down.

Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-05 Thread Ian Jackson
>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 

I thought I would reply again with the key point from my earlier mail
highlighted, and go a bit further.  The first thing to go wrong in
this was:

2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad 
address): Internal error
2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal error
2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address

You can see similar messages in the other logfile:

2018-05-30 22:12:49.650+0000: libxl: 
libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: 
domain responded to suspend request: Bad address

All of these are reports of the same thing: xc_get_pfn_type_batch at
xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.

There is no corresponding message in the host's serial log nor the
dom0 kernel log.

Ian.


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-06 Thread Juergen Gross
On 05/06/18 18:16, Ian Jackson wrote:
> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad 
> address): Internal error

This is worrying me.

The message is issued as a result of xc_get_pfn_type_batch() failing.
I see no other possibility for the failure with errno being 14 (EFAULT)
than the hypervisor failing a copy from/to guest for either struct
xen_domctl or the pfn array passed via struct xen_domctl (op
XEN_DOMCTL_getpageframeinfo3). Both should be accessible as they have
been correctly declared via DECLARE_HYPERCALL_BOUNCE() in xc_private.c.
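
For reference, the bounce pattern being referred to looks roughly like
this (a simplified sketch from memory, not the exact xc_private.c code;
error handling trimmed):

/* Sketch: how libxc bounces the pfn array for
 * XEN_DOMCTL_getpageframeinfo3.  The bounce buffer lives in
 * hypercall-safe (locked) memory; EFAULT from do_domctl() means the
 * hypervisor could not access it regardless. */
static int get_pfn_types(xc_interface *xch, uint32_t domid,
                         unsigned int num, xen_pfn_t *arr)
{
    int rc;
    DECLARE_DOMCTL;
    DECLARE_HYPERCALL_BOUNCE(arr, num * sizeof(*arr),
                             XC_HYPERCALL_BUFFER_BOUNCE_BOTH);

    if ( xc_hypercall_bounce_pre(xch, arr) )
        return -1;                          /* bounce setup failed */

    domctl.cmd = XEN_DOMCTL_getpageframeinfo3;
    domctl.domain = domid;
    domctl.u.getpageframeinfo3.num = num;
    set_xen_guest_handle(domctl.u.getpageframeinfo3.array, arr);

    rc = do_domctl(xch, &domctl);           /* EFAULT surfaces from here */

    xc_hypercall_bounce_post(xch, arr);
    return rc;
}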

Any ideas how that could have happened?


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-06 Thread Jan Beulich
>>> On 05.06.18 at 18:19,  wrote:
>> >  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
> 
> I thought I would reply again with the key point from my earlier mail
> highlighted, and go a bit further.  The first thing to go wrong in
> this was:
> 
> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad 
> address): Internal error
> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal 
> error
> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
> 
> You can see similar messages in the other logfile:
> 
> 2018-05-30 22:12:49.650+0000: libxl: 
> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: 
> domain responded to suspend request: Bad address
> 
> All of these are reports of the same thing: xc_get_pfn_type_batch at
> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
> 
> There is no corresponding message in the host's serial log nor the
> dom0 kernel log.

I vaguely recall from the time when I had looked at the similar Windows
migration issues that the guest is already in the process of being cleaned
up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
warning") intentionally suppressed a log message here, and the
immediately following debugging code (933f966bcd x86/mm: add
temporary debugging code to get_page_from_gfn_p2m()) was reverted
a little over a month later. That revert wasn't a follow-up to another
patch (fix), but followed the discussion rooted at
https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html

Jan




Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-06 Thread Juergen Gross
On 06/06/18 11:35, Jan Beulich wrote:
 On 05.06.18 at 18:19,  wrote:
  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
>>
>> I thought I would reply again with the key point from my earlier mail
>> highlighted, and go a bit further.  The first thing to go wrong in
>> this was:
>>
>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = 
>> Bad address): Internal error
>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal 
>> error
>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
>>
>> You can see similar messages in the other logfile:
>>
>> 2018-05-30 22:12:49.650+0000: libxl: 
>> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: 
>> domain responded to suspend request: Bad address
>>
>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>>
>> There is no corresponding message in the host's serial log nor the
>> dom0 kernel log.
> 
> I vaguely recall from the time when I had looked at the similar Windows
> migration issues that the guest is already in the process of being cleaned
> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
> warning") intentionally suppressed a log message here, and the
> immediately following debugging code (933f966bcd x86/mm: add
> temporary debugging code to get_page_from_gfn_p2m()) was reverted
> a little over a month later. This wasn't as a follow-up to another patch
> (fix), but following the discussion rooted at
> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html

That was -ESRCH, not -EFAULT.


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-07 Thread Juergen Gross
On 06/06/18 11:40, Juergen Gross wrote:
> On 06/06/18 11:35, Jan Beulich wrote:
> On 05.06.18 at 18:19,  wrote:
>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
>>>
>>> I thought I would reply again with the key point from my earlier mail
>>> highlighted, and go a bit further.  The first thing to go wrong in
>>> this was:
>>>
>>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = 
>>> Bad address): Internal error
>>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal 
>>> error
>>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
>>>
>>> You can see similar messages in the other logfile:
>>>
>>> 2018-05-30 22:12:49.650+0000: libxl: 
>>> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
>>> domain: domain responded to suspend request: Bad address
>>>
>>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>>> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>>>
>>> There is no corresponding message in the host's serial log nor the
>>> dom0 kernel log.
>>
>> I vaguely recall from the time when I had looked at the similar Windows
>> migration issues that the guest is already in the process of being cleaned
>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>> warning") intentionally suppressed a log message here, and the
>> immediately following debugging code (933f966bcd x86/mm: add
>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>> a little over a month later. This wasn't as a follow-up to another patch
>> (fix), but following the discussion rooted at
>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
> 
> That was -ESRCH, not -EFAULT.

I've looked a little bit more into this.

As we are seeing EFAULT being returned by the hypervisor this either
means the tools are specifying an invalid address (quite unlikely)
or the buffers are not as MAP_LOCKED as we wish them to be.

Is there a way to see whether the host was experiencing some memory
shortage, so the buffers might have been swapped out?

man mmap tells me: "This implementation will try to populate (prefault)
the whole range but the mmap call doesn't fail with ENOMEM if this
fails. Therefore major faults might happen later on."

And: "One should use mmap(2) plus mlock(2) when major faults are not
acceptable after the initialization of the mapping."

With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
hypercall buffer pages before doing the hypercall I'm not sure this
could be an issue.
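
(For reference, the allocation path being referred to does roughly the
following; a simplified sketch, not the verbatim tools/libs/call/linux.c
code:)

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: allocate locked hypercall buffer pages and touch each page
 * so it is faulted in before any hypercall uses it.  MAP_LOCKED alone
 * doesn't guarantee the prefault succeeded (see the man page quotes
 * above). */
static void *alloc_hcall_pages(size_t npages)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, npages * pagesz,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED,
                            -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    for (size_t i = 0; i < npages; i++)
        p[i * pagesz] = 0;              /* fault the page in now */

    return p;
}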

Any thoughts on that?


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-08 Thread Juergen Gross
On 07/06/18 13:30, Juergen Gross wrote:
> On 06/06/18 11:40, Juergen Gross wrote:
>> On 06/06/18 11:35, Jan Beulich wrote:
>> On 05.06.18 at 18:19,  wrote:
>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 
>> guest-saverestore.2 

 I thought I would reply again with the key point from my earlier mail
 highlighted, and go a bit further.  The first thing to go wrong in
 this was:

 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = 
 Bad address): Internal error
 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal 
 error
 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address

 You can see similar messages in the other logfile:

 2018-05-30 22:12:49.650+0000: libxl: 
 libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
 domain: domain responded to suspend request: Bad address

 All of these are reports of the same thing: xc_get_pfn_type_batch at
 xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.

 There is no corresponding message in the host's serial log nor the
 dom0 kernel log.
>>>
>>> I vaguely recall from the time when I had looked at the similar Windows
>>> migration issues that the guest is already in the process of being cleaned
>>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>>> warning") intentionally suppressed a log message here, and the
>>> immediately following debugging code (933f966bcd x86/mm: add
>>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>>> a little over a month later. This wasn't as a follow-up to another patch
>>> (fix), but following the discussion rooted at
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
>>
>> That was -ESRCH, not -EFAULT.
> 
> I've looked a little bit more into this.
> 
> As we are seeing EFAULT being returned by the hypervisor this either
> means the tools are specifying an invalid address (quite unlikely)
> or the buffers are not as MAP_LOCKED as we wish them to be.
> 
> Is there a way to see whether the host was experiencing some memory
> shortage, so the buffers might have been swapped out?
> 
> man mmap tells me: "This implementation will try to populate (prefault)
> the whole range but the mmap call doesn't fail with ENOMEM if this
> fails. Therefore major faults might happen later on."
> 
> And: "One should use mmap(2) plus mlock(2) when major faults are not
> acceptable after the initialization of the mapping."
> 
> With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
> hypercall buffer pages before doing the hypercall I'm not sure this
> could be an issue.
> 
> Any thoughts on that?

Ian, is there a chance to dedicate a machine to a specific test trying
to reproduce the problem? In case we manage to get this failure in a
reasonable time frame I guess the most promising approach would be to
use a test hypervisor producing more debug data. If you think this is
worth doing I can write a patch.


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-12 Thread Jan Beulich
>>> On 12.06.18 at 17:58,  wrote:
> Trying to reproduce the problem in a limited test environment finally
> worked: doing a loop of "xl save -c" produced the problem after 198
> iterations.
> 
> I have asked a SUSE engineer doing kernel memory management if he
> could think of something. His idea is that maybe some kthread could be
> the reason for our problem, e.g. trying page migration or compaction
> (at least on the test machine I've looked at compaction of mlocked
> pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).

Iirc the primary goal of compaction is to make contiguous memory
available for huge page allocations. PV not using huge pages, this is
of no interest here. The secondary consideration of physically
contiguous I/O buffer is an illusion only under PV, so perhaps not
much more of an interest (albeit I can see drivers wanting to allocate
physically contiguous buffers nevertheless now and then, but I'd
expect this to be mostly limited to driver initialization and device hot
add).

So it is perhaps at least worth considering whether to turn off
compaction/migration when running PV. But the problem would still
need addressing then mid-term, as PVH Dom0 would have the same
issue (and of course DomU, i.e. including HVM, can make hypercalls
too, and hence would be affected as well, just perhaps not as
visibly).

> In order to be really sure nothing in the kernel can temporarily
> switch hypercall buffer pages read-only or invalid for the hypervisor
> we'll have to modify the privcmd driver interface: it will have to
> gain knowledge which pages are handed over to the hypervisor as buffers
> in order to be able to lock them accordingly via get_user_pages().

So are you / is he saying that mlock() doesn't protect against such
playing with process memory? Teaching the privcmd driver of all
the indirections in hypercall request structures doesn't look very
attractive (or maintainable). Or are you thinking of the caller
providing sideband information describing the buffers involved,
perhaps along the lines of how dm_op was designed?

There's another option, but that has potentially severe drawbacks
too: Instead of returning -EFAULT on buffer access issues, we
could raise #PF on the very hypercall insn. Maybe something to
consider as an opt-in for PV/PVH, and as default for HVM.

Jan




Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-12 Thread Juergen Gross
On 13/06/18 08:11, Jan Beulich wrote:
 On 12.06.18 at 17:58,  wrote:
>> Trying to reproduce the problem in a limited test environment finally
>> worked: doing a loop of "xl save -c" produced the problem after 198
>> iterations.
>>
>> I have asked a SUSE engineer doing kernel memory management if he
>> could think of something. His idea is that maybe some kthread could be
>> the reason for our problem, e.g. trying page migration or compaction
>> (at least on the test machine I've looked at compaction of mlocked
>> pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).
> 
> Iirc the primary goal of compaction is to make contiguous memory
> available for huge page allocations. PV not using huge pages, this is
> of no interest here. The secondary consideration of physically
> contiguous I/O buffer is an illusion only under PV, so perhaps not
> much more of an interest (albeit I can see drivers wanting to allocate
> physically contiguous buffers nevertheless now and then, but I'd
> expect this to be mostly limited to driver initialization and device hot
> add).
> 
> So it is perhaps at least worth considering whether to turn off
> compaction/migration when running PV. But the problem would still
> need addressing then mid-term, as PVH Dom0 would have the same
> issue (and of course DomU, i.e. including HVM, can make hypercalls
> too, and hence would be affected as well, just perhaps not as
> visibly).

I think we should try to solve the problem by being aware of such
possibilities. Another potential source would be NUMA memory
migration (not now in pv, of course). And who knows what will come
in the next years.

> 
>> In order to be really sure nothing in the kernel can temporarily
>> switch hypercall buffer pages read-only or invalid for the hypervisor
>> we'll have to modify the privcmd driver interface: it will have to
>> gain knowledge which pages are handed over to the hypervisor as buffers
>> in order to be able to lock them accordingly via get_user_pages().
> 
> So are you / is he saying that mlock() doesn't protect against such
> playing with process memory?

Right. Due to proper locking in the kernel this is just a guarantee you
won't ever see a fault for such a page in user mode.

> Teaching the privcmd driver of all
> the indirections in hypercall request structures doesn't look very
> attractive (or maintainable). Or are you thinking of the caller
> providing sideband information describing the buffers involved,
> perhaps along the lines of how dm_op was designed?

I thought about that, yes. libxencall already has all the needed data
for that. Another possibility would be a dedicated ioctl for registering
a hypercall buffer (or some of them).
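
(Purely as a hypothetical sketch of such an interface; none of these
names exist in the privcmd driver today, and the ioctl numbers are made
up:)

#include <sys/ioctl.h>

/* Hypothetical ioctls to (un)register a user buffer as a hypercall
 * buffer, so privcmd could pin it via get_user_pages() for the
 * lifetime of the registration. */
struct privcmd_hcall_buf {
    void   *addr;   /* user virtual address of the buffer */
    size_t  len;    /* length in bytes */
};

#define IOCTL_PRIVCMD_HCALL_BUF_REGISTER \
    _IOW('P', 100, struct privcmd_hcall_buf)
#define IOCTL_PRIVCMD_HCALL_BUF_UNREGISTER \
    _IOW('P', 101, struct privcmd_hcall_buf)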

> There's another option, but that has potentially severe drawbacks
> too: Instead of returning -EFAULT on buffer access issues, we
> could raise #PF on the very hypercall insn. Maybe something to
> consider as an opt-in for PV/PVH, and as default for HVM.

Hmm, I'm not sure this will solve any problem. I'm not aware that it
is considered good practice to just access a user buffer from kernel
without using copyin()/copyout() when you haven't locked the page(s)
via get_user_pages(), even if the buffer was mlock()ed. Returning
-EFAULT is the right thing to do, I believe.


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-13 Thread Jan Beulich
>>> On 13.06.18 at 08:50,  wrote:
> On 13/06/18 08:11, Jan Beulich wrote:
>> Teaching the privcmd driver of all
>> the indirections in hypercall request structures doesn't look very
>> attractive (or maintainable). Or are you thinking of the caller
>> providing sideband information describing the buffers involved,
>> perhaps along the lines of how dm_op was designed?
> 
> I thought about that, yes. libxencall already has all the needed data
> for that. Another possibility would be a dedicated ioctl for registering
> a hypercall buffer (or some of them).

I'm not sure that's an option: Is it legitimate (secure) to retain the
effects of get_user_pages() across system calls?

>> There's another option, but that has potentially severe drawbacks
>> too: Instead of returning -EFAULT on buffer access issues, we
>> could raise #PF on the very hypercall insn. Maybe something to
>> consider as an opt-in for PV/PVH, and as default for HVM.
> 
> Hmm, I'm not sure this will solve any problem. I'm not aware that it
> is considered good practice to just access a user buffer from kernel
> without using copyin()/copyout() when you haven't locked the page(s)
> via get_user_pages(), even if the buffer was mlock()ed. Returning
> -EFAULT is the right thing to do, I believe.

But we're talking about the very copyin()/copyout(), just that here
it's being amortized by doing the operation just once (in the
hypervisor). A #PF would arise from syscall buffer copyin()/copyout(),
and the suggestion was to produce the same effect for the squashed
operation. Perhaps we wouldn't want #PF to come back from ordinary
(kernel invoked) hypercalls, but ones relayed by privcmd are different
in many ways anyway (see the stac()/clac() pair around the actual
call, for example).

Jan




Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-13 Thread Juergen Gross
On 13/06/18 09:21, Jan Beulich wrote:
 On 13.06.18 at 08:50,  wrote:
>> On 13/06/18 08:11, Jan Beulich wrote:
>>> Teaching the privcmd driver of all
>>> the indirections in hypercall request structures doesn't look very
>>> attractive (or maintainable). Or are you thinking of the caller
>>> providing sideband information describing the buffers involved,
>>> perhaps along the lines of how dm_op was designed?
>>
>> I thought about that, yes. libxencall already has all the needed data
>> for that. Another possibility would be a dedicated ioctl for registering
>> a hypercall buffer (or some of them).
> 
> I'm not sure that's an option: Is it legitimate (secure) to retain the
> effects of get_user_pages() across system calls?

I have to check that.

>>> There's another option, but that has potentially severe drawbacks
>>> too: Instead of returning -EFAULT on buffer access issues, we
>>> could raise #PF on the very hypercall insn. Maybe something to
>>> consider as an opt-in for PV/PVH, and as default for HVM.
>>
>> Hmm, I'm not sure this will solve any problem. I'm not aware that it
>> is considered good practice to just access a user buffer from kernel
>> without using copyin()/copyout() when you haven't locked the page(s)
>> via get_user_pages(), even if the buffer was mlock()ed. Returning
>> -EFAULT is the right thing to do, I believe.
> 
> But we're talking about the very copyin()/copyout(), just that here
> it's being amortized by doing the operation just once (in the
> hypervisor). A #PF would arise from syscall buffer copyin()/copyout(),
> and the suggestion was to produce the same effect for the squashed
> operation. Perhaps we wouldn't want #PF to come back from ordinary
> (kernel invoked) hypercalls, but ones relayed by privcmd are different
> in many ways anyway (see the stac()/clac() pair around the actual
> call, for example).

Aah, okay. This is an option, but it would require some kind of
interface to tell the hypervisor it should raise the #PF instead of
returning -EFAULT, of course, as the kernel has to be prepared for
that.

I like that idea very much!


Juergen


Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-13 Thread Juergen Gross
On 12/06/18 17:58, Juergen Gross wrote:
> On 08/06/18 12:12, Juergen Gross wrote:
>> On 07/06/18 13:30, Juergen Gross wrote:
>>> On 06/06/18 11:40, Juergen Gross wrote:
 On 06/06/18 11:35, Jan Beulich wrote:
 On 05.06.18 at 18:19,  wrote:
  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 
 guest-saverestore.2 
>>
>> I thought I would reply again with the key point from my earlier mail
>> highlighted, and go a bit further.  The first thing to go wrong in
>> this was:
>>
>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 
>> = Bad address): Internal error
>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): 
>> Internal error
>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad 
>> address
>>
>> You can see similar messages in the other logfile:
>>
>> 2018-05-30 22:12:49.650+0000: libxl: 
>> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
>> domain: domain responded to suspend request: Bad address
>>
>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>>
>> There is no corresponding message in the host's serial log nor the
>> dom0 kernel log.
>
> I vaguely recall from the time when I had looked at the similar Windows
> migration issues that the guest is already in the process of being cleaned
> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
> warning") intentionally suppressed a log message here, and the
> immediately following debugging code (933f966bcd x86/mm: add
> temporary debugging code to get_page_from_gfn_p2m()) was reverted
> a little over a month later. This wasn't as a follow-up to another patch
> (fix), but following the discussion rooted at
> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html

 That was -ESRCH, not -EFAULT.
>>>
>>> I've looked a little bit more into this.
>>>
>>> As we are seeing EFAULT being returned by the hypervisor this either
>>> means the tools are specifying an invalid address (quite unlikely)
>>> or the buffers are not as MAP_LOCKED as we wish them to be.
>>>
>>> Is there a way to see whether the host was experiencing some memory
>>> shortage, so the buffers might have been swapped out?
>>>
>>> man mmap tells me: "This implementation will try to populate (prefault)
>>> the whole range but the mmap call doesn't fail with ENOMEM if this
>>> fails. Therefore major faults might happen later on."
>>>
>>> And: "One should use mmap(2) plus mlock(2) when major faults are not
>>> acceptable after the initialization of the mapping."
>>>
>>> With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
>>> hypercall buffer pages before doing the hypercall I'm not sure this
>>> could be an issue.
>>>
>>> Any thoughts on that?
>>
>> Ian, is there a chance to dedicate a machine to a specific test trying
>> to reproduce the problem? In case we manage to get this failure in a
>> reasonable time frame I guess the most promising approach would be to
>> use a test hypervisor producing more debug data. If you think this is
>> worth doing I can write a patch.
> 
> Trying to reproduce the problem in a limited test environment finally
> worked: doing a loop of "xl save -c" produced the problem after 198
> iterations.
> 
> I have asked a SUSE engineer doing kernel memory management if he
> could think of something. His idea is that maybe some kthread could be
> the reason for our problem, e.g. trying page migration or compaction
> (at least on the test machine I've looked at compaction of mlocked
> pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).
> 
> In order to be really sure nothing in the kernel can temporarily
> switch hypercall buffer pages read-only or invalid for the hypervisor
> we'll have to modify the privcmd driver interface: it will have to
> gain knowledge which pages are handed over to the hypervisor as buffers
> in order to be able to lock them accordingly via get_user_pages().
> 
> While this is a possible explanation of the fault we are seeing it might
> be related to another reason. So I'm going to apply some modifications
> to the hypervisor to get some more diagnostics in order to verify the
> suspected kernel behavior is really the reason for the hypervisor to
> return EFAULT.

I was lucky. Took only 39 iterations this time.

The debug data confirms the theory that the kernel is setting the PTE to
invalid or read-only for a short amount of time:

(XEN) fixup for address 7ffb9904fe44, error_code 0002:
(XEN) Pagetable walk from 7ffb9904fe44:
(XEN)  L4[0x0ff] = 000458da6067 00019190
(XEN)  L3[0x1ee] = 000457d26067 00018210
(XEN)  L2[0x0c8] = 000445ab3067 6083
(XEN)  L1[0x04f] = 800458cdc107 0001925a
(XEN) Xen ca
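
For reference, per the architectural x86 #PF error code bits,
error_code 0002 in the walk above is a supervisor-mode write to a
not-present mapping, i.e. the PTE really was invalid at that instant;
in C terms:

/* x86 page-fault error code bits (architecturally defined): */
#define PF_PRESENT  (1u << 0)   /* 0 here: page not present */
#define PF_WRITE    (1u << 1)   /* 1 here: fault on a write */
#define PF_USER     (1u << 2)   /* 0 here: CPU in supervisor mode */
#define PF_RSVD     (1u << 3)   /* reserved bit set in paging entry */
#define PF_INSN     (1u << 4)   /* instruction fetch */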

Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-13 Thread Andrew Cooper
On 13/06/18 09:52, Juergen Gross wrote:
> On 12/06/18 17:58, Juergen Gross wrote:
>> On 08/06/18 12:12, Juergen Gross wrote:
>>> On 07/06/18 13:30, Juergen Gross wrote:
 On 06/06/18 11:40, Juergen Gross wrote:
> On 06/06/18 11:35, Jan Beulich wrote:
> On 05.06.18 at 18:19,  wrote:
>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 
> guest-saverestore.2 
>>> I thought I would reply again with the key point from my earlier mail
>>> highlighted, and go a bit further.  The first thing to go wrong in
>>> this was:
>>>
>>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 
>>> = Bad address): Internal error
>>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): 
>>> Internal error
>>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad 
>>> address
>>>
>>> You can see similar messages in the other logfile:
>>>
>>> 2018-05-30 22:12:49.650+0000: libxl: 
>>> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
>>> domain: domain responded to suspend request: Bad address
>>>
>>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>>> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>>>
>>> There is no corresponding message in the host's serial log nor the
>>> dom0 kernel log.
>> I vaguely recall from the time when I had looked at the similar Windows
>> migration issues that the guest is already in the process of being 
>> cleaned
>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>> warning") intentionally suppressed a log message here, and the
>> immediately following debugging code (933f966bcd x86/mm: add
>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>> a little over a month later. This wasn't as a follow-up to another patch
>> (fix), but following the discussion rooted at
>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
> That was -ESRCH, not -EFAULT.
 I've looked a little bit more into this.

 As we are seeing EFAULT being returned by the hypervisor this either
 means the tools are specifying an invalid address (quite unlikely)
 or the buffers are not as MAP_LOCKED as we wish them to be.

 Is there a way to see whether the host was experiencing some memory
 shortage, so the buffers might have been swapped out?

 man mmap tells me: "This implementation will try to populate (prefault)
 the whole range but the mmap call doesn't fail with ENOMEM if this
 fails. Therefore major faults might happen later on."

 And: "One should use mmap(2) plus mlock(2) when major faults are not
 acceptable after the initialization of the mapping."

 With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
 hypercall buffer pages before doing the hypercall I'm not sure this
 could be an issue.

 Any thoughts on that?
>>> Ian, is there a chance to dedicate a machine to a specific test trying
>>> to reproduce the problem? In case we manage to get this failure in a
>>> reasonable time frame I guess the most promising approach would be to
>>> use a test hypervisor producing more debug data. If you think this is
>>> worth doing I can write a patch.
>> Trying to reproduce the problem in a limited test environment finally
>> worked: doing a loop of "xl save -c" produced the problem after 198
>> iterations.
>>
>> I have asked a SUSE engineer doing kernel memory management if he
>> could think of something. His idea is that maybe some kthread could be
>> the reason for our problem, e.g. trying page migration or compaction
>> (at least on the test machine I've looked at compaction of mlocked
>> pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).
>>
>> In order to be really sure nothing in the kernel can temporarily
>> switch hypercall buffer pages read-only or invalid for the hypervisor
>> we'll have to modify the privcmd driver interface: it will have to
>> gain knowledge which pages are handed over to the hypervisor as buffers
>> in order to be able to lock them accordingly via get_user_pages().
>>
>> While this is a possible explanation of the fault we are seeing it might
>> be related to another reason. So I'm going to apply some modifications
>> to the hypervisor to get some more diagnostics in order to verify the
>> suspected kernel behavior is really the reason for the hypervisor to
>> return EFAULT.
> I was lucky. Took only 39 iterations this time.
>
> The debug data confirms the theory that the kernel is setting the PTE to
> invalid or read only for a short amount of time:
>
> (XEN) fixup for address 7ffb9904fe44, error_code 0002:
> (XEN) Pagetable walk from 7ffb9904fe44:
> (XEN)  L4[0x0ff] = 000458da6067 00019190
> (XEN)  L3[0x1ee] = 000457d26067 0001821

Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL

2018-06-13 Thread Juergen Gross
On 13/06/18 10:58, Andrew Cooper wrote:
> On 13/06/18 09:52, Juergen Gross wrote:
>> On 12/06/18 17:58, Juergen Gross wrote:
>>> On 08/06/18 12:12, Juergen Gross wrote:
 On 07/06/18 13:30, Juergen Gross wrote:
> On 06/06/18 11:40, Juergen Gross wrote:
>> On 06/06/18 11:35, Jan Beulich wrote:
>> On 05.06.18 at 18:19,  wrote:
>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 
>> guest-saverestore.2 
 I thought I would reply again with the key point from my earlier mail
 highlighted, and go a bit further.  The first thing to go wrong in
 this was:

 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch 
 (14 = Bad address): Internal error
 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): 
 Internal error
 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad 
 address

 You can see similar messages in the other logfile:

 2018-05-30 22:12:49.650+0000: libxl: 
 libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving 
 domain: domain responded to suspend request: Bad address

 All of these are reports of the same thing: xc_get_pfn_type_batch at
 xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.

 There is no corresponding message in the host's serial log nor the
 dom0 kernel log.
>>> I vaguely recall from the time when I had looked at the similar Windows
>>> migration issues that the guest is already in the process of being 
>>> cleaned
>>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>>> warning") intentionally suppressed a log message here, and the
>>> immediately following debugging code (933f966bcd x86/mm: add
>>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>>> a little over a month later. This wasn't as a follow-up to another patch
>>> (fix), but following the discussion rooted at
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
>> That was -ESRCH, not -EFAULT.
> I've looked a little bit more into this.
>
> As we are seeing EFAULT being returned by the hypervisor this either
> means the tools are specifying an invalid address (quite unlikely)
> or the buffers are not as MAP_LOCKED as we wish them to be.
>
> Is there a way to see whether the host was experiencing some memory
> shortage, so the buffers might have been swapped out?
>
> man mmap tells me: "This implementation will try to populate (prefault)
> the whole range but the mmap call doesn't fail with ENOMEM if this
> fails. Therefore major faults might happen later on."
>
> And: "One should use mmap(2) plus mlock(2) when major faults are not
> acceptable after the initialization of the mapping."
>
> With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
> hypercall buffer pages before doing the hypercall I'm not sure this
> could be an issue.
>
> Any thoughts on that?
 Ian, is there a chance to dedicate a machine to a specific test trying
 to reproduce the problem? In case we manage to get this failure in a
 reasonable time frame I guess the most promising approach would be to
 use a test hypervisor producing more debug data. If you think this is
 worth doing I can write a patch.
>>> Trying to reproduce the problem in a limited test environment finally
>>> worked: doing a loop of "xl save -c" produced the problem after 198
>>> iterations.
>>>
>>> I have asked a SUSE engineer doing kernel memory management if he
>>> could think of something. His idea is that maybe some kthread could be
>>> the reason for our problem, e.g. trying page migration or compaction
>>> (at least on the test machine I've looked at compaction of mlocked
>>> pages is allowed: /proc/sys/vm/compact_unevictable_allowed is 1).
>>>
>>> In order to be really sure nothing in the kernel can temporarily
>>> switch hypercall buffer pages read-only or invalid for the hypervisor
>>> we'll have to modify the privcmd driver interface: it will have to
>>> gain knowledge which pages are handed over to the hypervisor as buffers
>>> in order to be able to lock them accordingly via get_user_pages().
>>>
>>> While this is a possible explanation of the fault we are seeing it might
>>> be related to another reason. So I'm going to apply some modifications
>>> to the hypervisor to get some more diagnostics in order to verify the
>>> suspected kernel behavior is really the reason for the hypervisor to
>>> return EFAULT.
>> I was lucky. Took only 39 iterations this time.
>>
>> The debug data confirms the theory that the kernel is setting the PTE to
>> invalid or read only for a short amount of time:
>>
>> (XEN) fixup for address 7ffb9904fe44, error_code 0002:
>> (XEN) Pagetable