[libvirt test] 159437: regressions - FAIL

2021-02-17 Thread osstest service owner
flight 159437 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159437/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-armhf-libvirt   6 libvirt-build    fail REGR. vs. 151777
 build-amd64-libvirt   6 libvirt-build    fail REGR. vs. 151777
 build-i386-libvirt    6 libvirt-build    fail REGR. vs. 151777
 build-arm64-libvirt   6 libvirt-build    fail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a

version targeted for testing:
 libvirt  040a5bc307eb8cd304a4330b7538370315db3774
baseline version:
 libvirt  2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  223 days
Failing since        151818  2020-07-11 04:18:52 Z  222 days  214 attempts
Testing same since   159437  2021-02-17 04:18:55 Z    1 days    1 attempts


People who touched revisions under test:
  Adolfo Jayme Barrientos 
  Aleksandr Alekseev 
  Andika Triwidada 
  Andrea Bolognani 
  Balázs Meskó 
  Barrett Schonefeld 
  Bastien Orivel 
  Bihong Yu 
  Binfeng Wu 
  Boris Fiuczynski 
  Brian Turek 
  Christian Ehrhardt 
  Christian Schoenebeck 
  Cole Robinson 
  Collin Walling 
  Cornelia Huck 
  Cédric Bosdonnat 
  Côme Borsoi 
  Daniel Henrique Barboza 
  Daniel Letai 
  Daniel P. Berrange 
  Daniel P. Berrangé 
  Dmytro Linkin 
  Eiichi Tsukata 
  Erik Skultety 
  Fabian Affolter 
  Fabian Freyer 
  Fangge Jin 
  Farhan Ali 
  Fedora Weblate Translation 
  gongwei 
  Guoyi Tu
  Göran Uddeborg 
  Halil Pasic 
  Han Han 
  Hao Wang 
  Helmut Grohne 
  Ian Wienand 
  Jakob Meng 
  Jamie Strandboge 
  Jamie Strandboge 
  Jan Kuparinen 
  Jean-Baptiste Holcroft 
  Jianan Gao 
  Jim Fehlig 
  Jin Yan 
  Jiri Denemark 
  John Ferlan 
  Jonathan Watt 
  Jonathon Jongsma 
  Julio Faracco 
  Ján Tomko 
  Kashyap Chamarthy 
  Kevin Locke 
  Kristina Hanicova 
  Laine Stump 
  Laszlo Ersek 
  Liao Pingfang 
  Lin Ma 
  Lin Ma 
  Lin Ma 
  Marc Hartmayer 
  Marc-André Lureau 
  Marek Marczykowski-Górecki 
  Markus Schade 
  Martin Kletzander 
  Masayoshi Mizuma 
  Matt Coleman 
  Matt Coleman 
  Mauro Matteo Cascella 
  Meina Li 
  Michal Privoznik 
  Michał Smyk 
  Milo Casagrande 
  Moshe Levi 
  Muha Aliss 
  Neal Gompa 
  Nick Shyrokovskiy 
  Nickys Music Group 
  Nico Pache 
  Nikolay Shirokovskiy 
  Olaf Hering 
  Olesya Gerasimenko 
  Orion Poplawski 
  Patrick Magauran 
  Paulo de Rezende Pinatti 
  Pavel Hrdina 
  Peter Krempa 
  Pino Toscano 
  Pino Toscano 
  Piotr Drąg 
  Prathamesh Chavan 
  Ricky Tigg 
  Roman Bogorodskiy 
  Roman Bolshakov 
  Ryan Gahagan 
  Ryan Schmidt 
  Sam Hartman 
  Scott Shambarger 
  Sebastian Mitterle 
  Shalini Chellathurai Saroja 
  Shaojun Yang 
  Shi Lei 
  Simon Gaiser 
  Stefan Bader 
  Stefan Berger 
  Szymon Scholz 
  Thomas Huth 
  Tim Wiederhake 
  Tomáš Golembiovský 
  Tomáš Janoušek 
  Tuguoyi 
  Ville Skyttä 
  Wang Xin 
  Weblate 
  Yalei Li <274268...@qq.com>
  Yalei Li 
  Yang Hang 
  Yanqiu Zhang 
  Yi Li 
  Yi Wang 
  Yuri Chornoivan 
  Zheng Chuan 
  zhenwei pi 
  Zhenyu Zheng 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-arm64-libvirt  fail
 build-armhf-libvirt 

[xen-unstable test] 159424: tolerable FAIL - PUSHED

2021-02-17 Thread osstest service owner
flight 159424 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159424/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeat fail  like 159335
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail like 159362
 test-armhf-armhf-libvirt 16 saverestore-support-check fail  like 159396
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 159396
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 159396
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 159396
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 159396
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail  like 159396
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 159396
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 159396
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 159396
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 159396
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 159396
 test-amd64-i386-xl-pvshim    14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl  16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail   never pass

version targeted for testing:
 xen  3b1cc15f1931ba56d0ee256fe9bfe65509733b27
baseline version:
 xen  04085ec1ac05a362812e9b0c6b5a8713d7dc88ad

Last test of basis   159396  2021-02-16 01:52:34 Z    2 days
Testing same since   159424  2021-02-16 17:37:44 Z    1 days    1 attempts


People who touched revisions under test:
  Jan Beulich 
  Julien Grall 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm

Re: [RFC PATCH 29/34] power/swap: use bio_new in hib_submit_io

2021-02-17 Thread Chaitanya Kulkarni
On 2/17/21 14:03, Pavel Machek wrote:
> Hi!
>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>> index c73f2e295167..e92e36c053a6 100644
>> --- a/kernel/power/swap.c
>> +++ b/kernel/power/swap.c
>> @@ -271,13 +271,12 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
>>  			 struct hib_bio_batch *hb)
>>  {
>>  	struct page *page = virt_to_page(addr);
>> +	sector_t sect = page_off * (PAGE_SIZE >> 9);
>>  	struct bio *bio;
>>  	int error = 0;
>>  
>> -	bio = bio_alloc(GFP_NOIO | __GFP_HIGH, 1);
>> -	bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
>> -	bio_set_dev(bio, hib_resume_bdev);
>> -	bio_set_op_attrs(bio, op, op_flags);
>> +	bio = bio_new(hib_resume_bdev, sect, op, op_flags, 1,
>> +		      GFP_NOIO | __GFP_HIGH);
>>  
> C function with 6 arguments... dunno. Old version looks comparable or
> even more readable...
>
> Best regards,
>   Pavel
The library functions in the kernel tree that are used in different
file-systems and fabrics drivers do take 6 arguments.

Plus, what is the point of duplicating code for the mandatory
parameters all over the kernel?
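
For illustration, a bio_new() consistent with the call site quoted above
could be little more than a bundling of the mandatory setup steps. A minimal
sketch, written for this discussion rather than taken from the RFC series
(argument order and error handling are assumptions):

static inline struct bio *bio_new(struct block_device *bdev, sector_t sector,
                                  unsigned int op, unsigned int op_flags,
                                  unsigned int nr_vecs, gfp_t gfp)
{
        /* Allocate and initialise the mandatory fields in one place. */
        struct bio *bio = bio_alloc(gfp, nr_vecs);

        if (!bio)
                return NULL;

        bio_set_dev(bio, bdev);
        bio->bi_iter.bi_sector = sector;
        bio_set_op_attrs(bio, op, op_flags);

        return bio;
}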




Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Stefano Stabellini
On Wed, 17 Feb 2021, Julien Grall wrote:
> On 17/02/2021 02:00, Stefano Stabellini wrote:
> > Hi all,
> > 
> > Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
> > translate addresses for DMA operations in Dom0. Specifically,
> > swiotlb-xen is used to translate the address of a foreign page (a page
> > belonging to a domU) mapped into Dom0 before using it for DMA.
> > 
> > This is important because although Dom0 is 1:1 mapped, DomUs are not. On
> > systems without an IOMMU swiotlb-xen enables PV drivers to work as long
> > as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
> > ends up using the MFN, rather than the GFN.
> > 
> > 
> > On systems with an IOMMU, this is not necessary: when a foreign page is
> > mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
> > is established for both MMU and SMMU. Dom0 could safely use the GFN
> > address (instead of the MFN) for DMA operations and they would work. It
> > would be more efficient than using swiotlb-xen.
> > 
> > If you recall my presentation from Xen Summit 2020, Xilinx is working on
> > cache coloring. With cache coloring, no domain is 1:1 mapped, not even
> > Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
> > work as intended.
> > 
> > 
> > The suggested solution for both these issues is to add a new feature
> > flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
> > swiotlb-xen because IOMMU translations are available for Dom0. If
> > XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
> > initialization. I have tested this scheme with and without cache
> > coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
> > works as expected: DMA operations succeed.
> > 
> > 
> > What about systems where an IOMMU is present but not all devices are
> > protected?
> > 
> > There is no way for Xen to know which devices are protected and which
> > ones are not: devices that do not have the "iommus" property could or
> > could not be DMA masters.
> > 
> > Perhaps Xen could populate a whitelist of devices protected by the IOMMU
> > based on the "iommus" property. It would require some added complexity
> > in Xen and especially in the swiotlb-xen driver in Linux to use it,
> > which is not ideal.
> 
> You are trading a bit more complexity in Xen and Linux with a user who may
> not be able to use the hypervisor on his/her platform without a quirk in Xen
> (see more below).
> 
> > However, this approach would not work for cache
> > coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
> > used either way
> 
> Not all Dom0 Linux kernels will be able to work with cache colouring. So
> you will need a way for the kernel to say "Hey, I can avoid using swiotlb".

An unmodified dom0 Linux kernel can work with Xen cache coloring enabled.
The swiotlb-xen is unneeded, and the pfn_valid() checks used to detect
whether a page is local or foreign don't work as intended. However, it still
works on systems where the IOMMU can be relied upon. That's because the
IOMMU does the GFN->MFN translations, and also because both swiotlb-xen
code paths (cache flush local and cache flush via hypercall) work.

Also considering that somebody that enables cache coloring has to know
the entire system well, I don't think we need a runtime flag from Linux
to say "Hey I can avoid using swiotlb".

That said, I think it is important to fix Linux because the code might
work today, but it is more by accident than by design.


> > For these reasons, I would like to propose a single flag
> > XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
> > DMA translations. In situations where a DMA master is not SMMU
> > protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
> > platform where an IOMMU is present and protects most DMA masters but it
> > is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
> > not be set (because PV block is not going to work without swiotlb-xen.)
> > This also means that cache coloring won't be usable on such a system (at
> > least not usable with the MMC controller so the system integrator should
> > pay special care to setup the system).
> > 
> > It is worth noting that if we wanted to extend the interface to add a
> > list of protected devices in the future, it would still be possible. It
> > would be compatible with XENFEAT_ARM_dom0_iommu.
> 
> I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and
> instead the device-tree list would be used. Is that correct?

If XENFEAT_ARM_dom0_iommu is set, the device list is redundant.
If XENFEAT_ARM_dom0_iommu is not set, the device list is useful.

The device list and XENFEAT_ARM_dom0_iommu serve different purposes. The
device list is meant to skip the swiotlb-xen for some devices.

XENFEAT_ARM_dom0_iommu is meant to skip the swiotlb-xen for all devices,
which is especially useful when dom0 is not 1:1 mapped, 
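
For reference, the Linux-side change being proposed here would be small. A
hypothetical sketch (XENFEAT_ARM_dom0_iommu is only proposed at this point;
the exact hook point in arch/arm/xen/mm.c and the swiotlb-xen init call are
assumptions for illustration):

#include <xen/features.h>

static int __init xen_mm_init(void)
{
	if (!xen_initial_domain())
		return 0;

	/* Proposed: the IOMMU covers Dom0, so GFN-based DMA is safe. */
	if (xen_feature(XENFEAT_ARM_dom0_iommu))
		return 0;

	xen_swiotlb_init();	/* assumed existing swiotlb-xen setup */
	return 0;
}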

Re: [PATCH] xen/arm : smmuv3: Fix to handle multiple StreamIds per device.

2021-02-17 Thread Stefano Stabellini
+IanJ

On Wed, 17 Feb 2021, Bertrand Marquis wrote:
> Hi Rahul,
> 
> 
> > On 17 Feb 2021, at 10:05, Rahul Singh  wrote:
> > 
> > The SMMUv3 driver does not handle multiple StreamIDs if the master device
> > supports more than one StreamID.
> > 
> > This bug was introduced when the driver was ported from Linux to Xen:
> > dt_device_set_protected(..) should be called from add_device(..), not
> > from dt_xlate(..).
> > 
> > Move dt_device_set_protected(..) from dt_xlate(..) to add_device().
> > 
> > Signed-off-by: Rahul Singh 
> Reviewed-by: Bertrand Marquis 
> 
> Thanks a lot, this is fixing issues with multiple stream ids for one device 
> :-)

Reviewed-by: Stefano Stabellini 


> > ---
> > This patch is a candidate for 4.15, as without it there is no way to
> > assign multiple StreamIDs to the same device when the device is protected
> > behind an SMMUv3.

I agree especially considering that the impact is limited to smmu-v3.c.


> > ---
> > xen/drivers/passthrough/arm/smmu-v3.c | 29 ++-
> > 1 file changed, 11 insertions(+), 18 deletions(-)
> > 
> > diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
> > index 914cdc1cf4..53d150cdb6 100644
> > --- a/xen/drivers/passthrough/arm/smmu-v3.c
> > +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> > @@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device *dev)
> >  */
> > arm_smmu_enable_pasid(master);
> > 
> > -   return 0;
> > -
> > -err_free_master:
> > -   xfree(master);
> > -   dev_iommu_priv_set(dev, NULL);
> > -   return ret;
> > -}
> > -
> > -static int arm_smmu_dt_xlate(struct device *dev,
> > -   const struct dt_phandle_args *args)
> > -{
> > -   int ret;
> > -   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > -
> > -   ret = iommu_fwspec_add_ids(dev, args->args, 1);
> > -   if (ret)
> > -   return ret;
> > -
> > if (dt_device_is_protected(dev_to_dt(dev))) {
> > dev_err(dev, "Already added to SMMUv3\n");
> > return -EEXIST;
> > @@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev,
> > dev_name(fwspec->iommu_dev), fwspec->num_ids);
> > 
> > return 0;
> > +
> > +err_free_master:
> > +   xfree(master);
> > +   dev_iommu_priv_set(dev, NULL);
> > +   return ret;
> > +}
> > +
> > +static int arm_smmu_dt_xlate(struct device *dev,
> > +   const struct dt_phandle_args *args)
> > +{
> > +   return iommu_fwspec_add_ids(dev, args->args, 1);
> > }
> > 
> > /* Probing and initialisation functions */
> > -- 
> > 2.17.1
> > 
> 
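
For context, a rough illustration of why the old placement broke masters with
multiple StreamIDs. The call ordering below is an assumption based on Xen's
device-tree passthrough code, not part of the patch itself:

/*
 * For a master with two stream IDs, the DT handling code invokes the
 * driver roughly as:
 *
 *     arm_smmu_dt_xlate(dev, sid0);      // once per "iommus" entry
 *     arm_smmu_dt_xlate(dev, sid1);
 *     arm_smmu_add_device(devfn, dev);   // once per device
 *
 * With dt_device_set_protected() in dt_xlate(), the second xlate call
 * found the device already marked protected and failed with -EEXIST.
 * After the move, dt_xlate() only registers the extra stream ID via
 * iommu_fwspec_add_ids().
 */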



Re: [RFC PATCH 29/34] power/swap: use bio_new in hib_submit_io

2021-02-17 Thread Pavel Machek
Hi!
> 
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index c73f2e295167..e92e36c053a6 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -271,13 +271,12 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
> 			 struct hib_bio_batch *hb)
>  {
>   struct page *page = virt_to_page(addr);
> + sector_t sect = page_off * (PAGE_SIZE >> 9);
>   struct bio *bio;
>   int error = 0;
>  
> - bio = bio_alloc(GFP_NOIO | __GFP_HIGH, 1);
> - bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
> - bio_set_dev(bio, hib_resume_bdev);
> - bio_set_op_attrs(bio, op, op_flags);
> + bio = bio_new(hib_resume_bdev, sect, op, op_flags, 1,
> +   GFP_NOIO | __GFP_HIGH);
>  

C function with 6 arguments... dunno. Old version looks comparable or
even more readable...

Best regards,
Pavel

-- 
http://www.livejournal.com/~pavelmachek




[xen-4.14-testing test] 159420: tolerable FAIL - PUSHED

2021-02-17 Thread osstest service owner
flight 159420 xen-4.14-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159420/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-check fail  like 158558
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 158558
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 158558
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 158558
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 158558
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 158558
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 158558
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail  like 158558
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 158558
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 158558
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 158558
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl  16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-amd64-i386-xl-pvshim    14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-check fail   never pass

version targeted for testing:
 xen  9f357fe3e4593fc1109962b76d4db73d589ebef5
baseline version:
 xen  4170218cb96546426664e5c1d00c5a848a26ae9e

Last test of basis   158558  2021-01-21 15:37:26 Z   27 days
Testing same since   159420  2021-02-16 15:06:34 Z    1 days    1 attempts


People who touched revisions under test:
  Julien Grall 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64-xtf   

[xen-unstable-smoke test] 159445: tolerable all pass - PUSHED

2021-02-17 Thread osstest service owner
flight 159445 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159445/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass

version targeted for testing:
 xen  7a4133feaf42000923eb9d84badb6b171625f137
baseline version:
 xen  d670ef3401b91d04c58d72cd8ce5579b4fa900d8

Last test of basis   159443  2021-02-17 12:00:28 Z    0 days
Testing same since   159445  2021-02-17 16:00:31 Z    0 days    1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Julien Grall 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   d670ef3401..7a4133feaf  7a4133feaf42000923eb9d84badb6b171625f137 -> smoke



Re: [BUG] Linux pvh vm not getting destroyed on shutdown

2021-02-17 Thread Maximilian Engelhardt
On Saturday, 13 February 2021 16:36:24 CET Maximilian Engelhardt wrote:
> Here are some things I noticed while trying to debug this issue:
> 
> * It happens on a Debian buster dom0 as well as on a bullseye dom0
> 
> * It seems to only affect pvh vms.
> 
> * shutdown from the pvgrub menu ("c" -> "halt") does work
> 
> * the vm seems to shut down normally, the last lines in the console are:
[...]
> 
> * issuing a reboot instead of a shutdown does work fine.
> 
> * The issue started with Debian kernel 5.8.3+1~exp1 running in the vm,
> Debian kernel 5.7.17-1 does not show the issue.
> 
> * setting vcpus equal to maxvcpus does *not* show the hang.

One thing I just realized I totally forgot to mention in my initial report is 
that this issue is also present for us on a modern kernel. We tested with 
Debian kernel 5.10.13-1.



Re: [PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()

2021-02-17 Thread Andrew Cooper
On 17/02/2021 17:50, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH 3/3] tools/libxl: Rework invariants in 
> libxl__domain_get_device_model_uid()"):
>> Various versions of gcc, when compiling with -Og, complain:
>>
>>   libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
>>   libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized in this 
>> function [-Werror=maybe-uninitialized]
>> 256 | if (kill_by_uid)
>> |^
> Thanks for working on this.  I have reviewed your changes and I see
> where you are coming from.  The situation is not very nice, mostly
> because we don't have proper sum types in C.
>
> I'm sorry to say that with my release manager hat on I think it is too
> late for this kind of reorganisation for 4.15, especially just to work
> around an overzealous compiler warning.
>
> I think we can fix the compiler warning simply by setting the
> `kill_by_uid` variable on more of the exit paths.  This approach was
> already taken in this code for one of the paths.
>
> I would prefer that approach at this stage of the release.

Well - I have explained why I'm not happy with that approach, but you
are the maintainer and RM.

I will trade you a minimal patch for formal R-b's so the time invested
so far fixing this mess isn't wasted when 4.16 opens.

~Andrew



[xen-4.13-testing test] 159419: tolerable FAIL - PUSHED

2021-02-17 Thread osstest service owner
flight 159419 xen-4.13-testing real [real]
flight 159446 xen-4.13-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/159419/
http://logs.test-lab.xenproject.org/osstest/logs/159446/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 159446-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 158557
 test-armhf-armhf-libvirt 16 saverestore-support-check fail  like 158557
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 158557
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 158557
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail  like 158557
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 158557
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeat fail  like 158557
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 158557
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 158557
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 158557
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 158557
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 158557
 test-arm64-arm64-xl  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-i386-xl-pvshim    14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-check fail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail   never pass

version targeted for testing:
 xen  ab995b6af9ab723b0b52e5ea0e342b612f1a7b89
baseline version:
 xen  e4161938b315f3b9c6a13ade30d16c11504a2d16

Last test of basis   158557  2021-01-21 15:37:26 Z   27 days
Testing same since   159419  2021-02-16 15:06:29 Z    1 days    1 attempts


Re: [PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()

2021-02-17 Thread Ian Jackson
Andrew Cooper writes ("[PATCH 3/3] tools/libxl: Rework invariants in 
libxl__domain_get_device_model_uid()"):
> Various versions of gcc, when compiling with -Og, complain:
> 
>   libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
>   libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
> 256 | if (kill_by_uid)
> |^

Thanks for working on this.  I have reviewed your changes and I see
where you are coming from.  The situation is not very nice, mostly
because we don't have proper sum types in C.

I'm sorry to say that with my release manager hat on I think it is too
late for this kind of reorganisation for 4.15, especially just to work
around an overzealous compiler warning.

I think we can fix the compiler warning simply by setting the
`kill_by_uid` variable on more of the exit paths.  This approach was
already taken in this code for one of the paths.

I would prefer that approach at this stage of the release.

Sorry,
Ian.
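
A sketch of that minimal approach, mirroring what one exit path already does
(illustrative only; the exact set of paths needing the assignment is not
reproduced here):

    /* ... on each path that can reach the `out` label ... */
    user = NULL;         /* no -runas user on this path */
    kill_by_uid = false; /* assigned explicitly, so -Og sees it set */
    rc = 0;
    goto out;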



Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Jan Beulich
On 17.02.2021 16:34, Bertrand Marquis wrote:
>> On 17 Feb 2021, at 02:00, Stefano Stabellini  wrote:
>> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
>> translate addresses for DMA operations in Dom0. Specifically,
>> swiotlb-xen is used to translate the address of a foreign page (a page
>> belonging to a domU) mapped into Dom0 before using it for DMA.
>>
>> This is important because although Dom0 is 1:1 mapped, DomUs are not. On
>> systems without an IOMMU swiotlb-xen enables PV drivers to work as long
>> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
>> ends up using the MFN, rather than the GFN.
>>
>>
>> On systems with an IOMMU, this is not necessary: when a foreign page is
>> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
>> is established for both MMU and SMMU. Dom0 could safely use the GFN
>> address (instead of the MFN) for DMA operations and they would work. It
>> would be more efficient than using swiotlb-xen.
>>
>> If you recall my presentation from Xen Summit 2020, Xilinx is working on
>> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
>> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
>> work as intended.
>>
>>
>> The suggested solution for both these issues is to add a new feature
>> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
>> swiotlb-xen because IOMMU translations are available for Dom0. If
>> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
>> initialization. I have tested this scheme with and without cache
>> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
>> works as expected: DMA operations succeed.
> 
> Wouldn’t it be easier to name the flag XENFEAT_ARM_swiotlb_needed?

Except that "swiotlb" is Linux terminology, which I don't think a
Xen feature should be named after. Names should be generic, except
maybe when they're really targeting exactly one kind of guest
(which imo would better never be the case).

Jan



Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Julien Grall

On 17/02/2021 16:47, Bertrand Marquis wrote:

Hi Julien,


On 17 Feb 2021, at 16:41, Julien Grall  wrote:



On 17/02/2021 15:37, Bertrand Marquis wrote:

Hi Julien,


Hi Bertrand,


On 17 Feb 2021, at 08:50, Julien Grall  wrote:



On 17/02/2021 02:00, Stefano Stabellini wrote:

Hi all,
Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
translate addresses for DMA operations in Dom0. Specifically,
swiotlb-xen is used to translate the address of a foreign page (a page
belonging to a domU) mapped into Dom0 before using it for DMA.
This is important because although Dom0 is 1:1 mapped, DomUs are not. On
systems without an IOMMU swiotlb-xen enables PV drivers to work as long
as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
ends up using the MFN, rather than the GFN.
On systems with an IOMMU, this is not necessary: when a foreign page is
mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
is established for both MMU and SMMU. Dom0 could safely use the GFN
address (instead of the MFN) for DMA operations and they would work. It
would be more efficient than using swiotlb-xen.
If you recall my presentation from Xen Summit 2020, Xilinx is working on
cache coloring. With cache coloring, no domain is 1:1 mapped, not even
Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
work as intended.
The suggested solution for both these issues is to add a new feature
flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
swiotlb-xen because IOMMU translations are available for Dom0. If
XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
initialization. I have tested this scheme with and without cache
coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
works as expected: DMA operations succeed.
What about systems where an IOMMU is present but not all devices are
protected?
There is no way for Xen to know which devices are protected and which
ones are not: devices that do not have the "iommus" property could or
could not be DMA masters.
Perhaps Xen could populate a whitelist of devices protected by the IOMMU
based on the "iommus" property. It would require some added complexity
in Xen and especially in the swiotlb-xen driver in Linux to use it,
which is not ideal.


You are trading a bit more complexity in Xen and Linux with a user who may not 
be able to use the hypervisor on his/her platform without a quirk in Xen (see 
more below).


However, this approach would not work for cache
coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
used either way


Not all Dom0 Linux kernels will be able to work with cache colouring. So you will need 
a way for the kernel to say "Hey, I can avoid using swiotlb".

I fully agree and from my current understanding the condition is “having an 
iommu”.



For these reasons, I would like to propose a single flag
XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
DMA translations. In situations where a DMA master is not SMMU
protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
platform where an IOMMU is present and protects most DMA masters but it
is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
not be set (because PV block is not going to work without swiotlb-xen.)
This also means that cache coloring won't be usable on such a system (at
least not usable with the MMC controller so the system integrator should
pay special care to setup the system).
It is worth noting that if we wanted to extend the interface to add a
list of protected devices in the future, it would still be possible. It
would be compatible with XENFEAT_ARM_dom0_iommu.


I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and 
instead the device-tree list would be used. Is that correct?

What do you mean by device tree list here ?


Sorry I meant "device list". I was referring to Stefano's suggestion to 
describe the list of devices protected in the device-tree.


Ok you mean adding to the device tree some kind of device list for which 
swiotlb should be used (or maybe the opposite list in fact).


I think the list of protected devices is better because we could create 
a Xen IOMMU node.

How to set XENFEAT_ARM_dom0_iommu?
We could set XENFEAT_ARM_dom0_iommu automatically when
is_iommu_enabled(d) for Dom0. We could also have a platform specific
(xen/arch/arm/platforms/) override so that a specific platform can
disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
users, it would also be useful to be able to override it via a Xen
command line parameter.

Platform quirks should be limited to a small set of platforms.

In this case, this would not be only per-platform but also per-firmware table 
as a developer can decide to remove/add IOMMU nodes in the DT at any time.

In addition to that, it means we are introducing a regression for those users 
as Xen 4.14 would have 

Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Julien Grall

On 17/02/2021 15:17, Jan Beulich wrote:

On 17.02.2021 16:00, Julien Grall wrote:

Hi Jan,

On 17/02/2021 14:54, Jan Beulich wrote:

On 17.02.2021 15:24, Julien Grall wrote:

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
     struct page_info *pg;
     unsigned int done = 0;
 
+    if ( !is_iommu_enabled(d) )
+        return 0;
+
+    /*
+     * Pages will be moved to the free list below. So we want to
+     * clear the root page-table to avoid any potential use after-free.
+     */
+    hd->platform_ops->clear_root_pgtable(d);


Taking amd_iommu_alloc_root() as example, is this really correct
prior to what is now patch 2?


Yes, there is no more use-after-free...


And this is because of ...? The necessary lock isn't being held
here, so on another CPU allocation of a new root and then of new
page tables could happen before you make enough progress here,
and hence it looks to me as if there might then still be pages
which get freed while present in the page tables (and hence
accessible by devices).
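
For illustration, the interleaving described above is roughly the following
(hypothetical timeline, not code from the series):

/*
 *   CPU0: iommu_free_pgtables(d)      CPU1: (re)allocation path
 *   ----------------------------      -------------------------
 *   clear_root_pgtable(d);
 *                                     amd_iommu_alloc_root(d);
 *                                     ... map new page tables ...
 *   free pages on d's list;           // may free pages now reachable
 *                                     // from the new root, i.e. still
 *                                     // accessible by devices
 */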


Ah yes. I forgot that patch #3 is not first anymore. I can move 
patch #3 back to first, although I know you dislike the approach taken 
there...


Cheers,

--
Julien Grall



Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Bertrand Marquis
Hi Julien,

> On 17 Feb 2021, at 16:41, Julien Grall  wrote:
> 
> 
> 
> On 17/02/2021 15:37, Bertrand Marquis wrote:
>> Hi Julien,
> 
> Hi Bertrand,
> 
>>> On 17 Feb 2021, at 08:50, Julien Grall  wrote:
>>> 
>>> 
>>> 
>>> On 17/02/2021 02:00, Stefano Stabellini wrote:
 Hi all,
 Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
 translate addresses for DMA operations in Dom0. Specifically,
 swiotlb-xen is used to translate the address of a foreign page (a page
 belonging to a domU) mapped into Dom0 before using it for DMA.
 This is important because although Dom0 is 1:1 mapped, DomUs are not. On
 systems without an IOMMU swiotlb-xen enables PV drivers to work as long
 as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
 ends up using the MFN, rather than the GFN.
 On systems with an IOMMU, this is not necessary: when a foreign page is
 mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
 is established for both MMU and SMMU. Dom0 could safely use the GFN
 address (instead of the MFN) for DMA operations and they would work. It
 would be more efficient than using swiotlb-xen.
 If you recall my presentation from Xen Summit 2020, Xilinx is working on
 cache coloring. With cache coloring, no domain is 1:1 mapped, not even
 Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
 work as intended.
 The suggested solution for both these issues is to add a new feature
 flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
 swiotlb-xen because IOMMU translations are available for Dom0. If
 XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
 initialization. I have tested this scheme with and without cache
 coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
 works as expected: DMA operations succeed.
 What about systems where an IOMMU is present but not all devices are
 protected?
 There is no way for Xen to know which devices are protected and which
 ones are not: devices that do not have the "iommus" property could or
 could not be DMA masters.
 Perhaps Xen could populate a whitelist of devices protected by the IOMMU
 based on the "iommus" property. It would require some added complexity
 in Xen and especially in the swiotlb-xen driver in Linux to use it,
 which is not ideal.
>>> 
>>> You are trading a bit more complexity in Xen and Linux with a user who may 
>>> not be able to use the hypervisor on his/her platform without a quirk in 
>>> Xen (see more below).
>>> 
 However, this approach would not work for cache
 coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
 used either way
>>> 
>>> Not all Dom0 Linux kernels will be able to work with cache colouring. So 
>>> you will need a way for the kernel to say "Hey, I can avoid using 
>>> swiotlb".
>> I fully agree and from my current understanding the condition is “having an 
>> iommu”.
>>> 
 For these reasons, I would like to propose a single flag
 XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
 DMA translations. In situations where a DMA master is not SMMU
 protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
 platform where an IOMMU is present and protects most DMA masters but it
 is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
 not be set (because PV block is not going to work without swiotlb-xen.)
 This also means that cache coloring won't be usable on such a system (at
 least not usable with the MMC controller so the system integrator should
 pay special care to setup the system).
 It is worth noting that if we wanted to extend the interface to add a
 list of protected devices in the future, it would still be possible. It
 would be compatible with XENFEAT_ARM_dom0_iommu.
>>> 
>>> I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared 
>>> and instead the device-tree list would be used. Is that correct?
>> What do you mean by device tree list here ?
> 
> Sorry I meant "device list". I was referring to Stefano's suggestion to 
> describe the list of devices protected in the device-tree.

Ok you mean adding to the device tree some kind of device list for which 
swiotlb should be used (or maybe the opposite list in fact).

> 
>>> 
 How to set XENFEAT_ARM_dom0_iommu?
 We could set XENFEAT_ARM_dom0_iommu automatically when
 is_iommu_enabled(d) for Dom0. We could also have a platform specific
 (xen/arch/arm/platforms/) override so that a specific platform can
 disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
 users, it would also be useful to be able to override it via a Xen
 command line parameter.
>>> Platform quirks should be limited to a small set of platforms.
>>> 
>>> In this case, 

[PATCH 2/3] tools/libxl: Simplify the out path in libxl__domain_get_device_model_uid()

2021-02-17 Thread Andrew Cooper
All paths heading towards `out` have rc = 0.  Assert this property.

The intended_uid check is an error path.  Use the err label rather than
falling into subsequent success logic.

With the above two changes, the two `if (!rc)` checks can be dropped.

Now, both remaining tests start with `if (user ...)`.  Combine the two blocks.

No functional change, but far simpler logic to follow.

Signed-off-by: Andrew Cooper 
---
CC: Ian Jackson 
CC: Wei Liu 
CC: Anthony PERARD 
---
 tools/libs/light/libxl_dm.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index 30b3242e57..7843c283ca 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -243,16 +243,17 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         goto err;
 
 out:
-    /* First, do a root check if appropriate */
-    if (!rc) {
-        if (user && intended_uid == 0) {
+    assert(rc == 0);
+
+    if (user) {
+        /* First, do a root check if appropriate */
+        if (intended_uid == 0) {
             LOGD(ERROR, guest_domid, "intended_uid is 0 (root)!");
             rc = ERROR_INVAL;
+            goto err;
         }
-    }
 
-    /* Then do the final set, if still appropriate */
-    if (!rc && user) {
+        /* Then do the final set. */
         state->dm_runas = user;
         if (kill_by_uid)
             state->dm_kill_uid = GCSPRINTF("%ld", (long)intended_uid);
-- 
2.11.0




[PATCH for-4.15 0/3] tools/libxl: -Og fixes for libxl__domain_get_device_model_uid()

2021-02-17 Thread Andrew Cooper
Split out of previous series, and split up for clarity.

Andrew Cooper (3):
  tools/libxl: Split out and err paths in libxl__domain_get_device_model_uid()
  tools/libxl: Simplify the out path in libxl__domain_get_device_model_uid()
  tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()

 tools/libs/light/libxl_dm.c | 58 -
 1 file changed, 31 insertions(+), 27 deletions(-)

-- 
2.11.0




[PATCH 1/3] tools/libxl: Split out and err paths in libxl__domain_get_device_model_uid()

2021-02-17 Thread Andrew Cooper
All paths with nonzero rc head to err.  All paths with a zero rc stay heading
towards out.

The comment discussing invariants is now arguably stale - it will be fixed in
one coherent manner when further fixes are in place.

No functional change.

Signed-off-by: Andrew Cooper 
---
CC: Ian Jackson 
CC: Wei Liu 
CC: Anthony PERARD 
---
 tools/libs/light/libxl_dm.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index db4cec6a76..30b3242e57 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -153,13 +153,13 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     if (user) {
         rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
         if (rc)
-            goto out;
+            goto err;
 
         if (!user_base) {
             LOGD(ERROR, guest_domid, "Couldn't find device_model_user %s",
                  user);
             rc = ERROR_INVAL;
-            goto out;
+            goto err;
         }
 
         intended_uid = user_base->pw_uid;
@@ -188,7 +188,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     rc = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_RANGE_BASE,
                                     &user_pwbuf, &user_base);
     if (rc)
-        goto out;
+        goto err;
     if (user_base) {
         struct passwd *user_clash, user_clash_pwbuf;
 
@@ -196,14 +196,14 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         rc = userlookup_helper_getpwuid(gc, intended_uid,
                                         &user_clash_pwbuf, &user_clash);
         if (rc)
-            goto out;
+            goto err;
         if (user_clash) {
             LOGD(ERROR, guest_domid,
                  "wanted to use uid %ld (%s + %d) but that is user %s !",
                  (long)intended_uid, LIBXL_QEMU_USER_RANGE_BASE,
                  guest_domid, user_clash->pw_name);
             rc = ERROR_INVAL;
-            goto out;
+            goto err;
         }
 
         LOGD(DEBUG, guest_domid, "using uid %ld", (long)intended_uid);
@@ -222,7 +222,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     user = LIBXL_QEMU_USER_SHARED;
     rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
     if (rc)
-        goto out;
+        goto err;
     if (user_base) {
         LOGD(WARN, guest_domid, "Could not find user %s, falling back to %s",
              LIBXL_QEMU_USER_RANGE_BASE, LIBXL_QEMU_USER_SHARED);
@@ -240,6 +240,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
              "Could not find user %s or range base pseudo-user %s, cannot restrict",
              LIBXL_QEMU_USER_SHARED, LIBXL_QEMU_USER_RANGE_BASE);
         rc = ERROR_INVAL;
+        goto err;
 
 out:
     /* First, do a root check if appropriate */
@@ -257,6 +258,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         state->dm_kill_uid = GCSPRINTF("%ld", (long)intended_uid);
     }
 
+ err:
     return rc;
 }
 
-- 
2.11.0




[PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()

2021-02-17 Thread Andrew Cooper
Various versions of gcc, when compiling with -Og, complain:

  libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
  libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
256 | if (kill_by_uid)
|^

The logic is very tangled.

Two paths unconditionally set user before checking for associated errors.
This interferes with the expected use of uninitialised-variable heuristics to
force compile failures for ill-formed exit paths.

Use b_info->device_model_user and LIBXL_QEMU_USER_SHARED as applicable, and
only set user on the goto out paths.

All goto out paths are now of the form:
  user = NULL;
  rc = 0;

or:
  user = non-NULL;
  intended_uid = ...;
  kill_by_uid = ...;
  rc = 0;

As a consequence, intended_uid can drop its default of -1, and the dm_restrict
path can drop its now-stale "just in case" comment and the redundant setting of
kill_by_uid that worked around this issue at other optimisation levels.

Finally, rewrite the comment about invariants, indicating the split between
the out and err labels, and the associated rc values.  Additionally, reword the
"is {not,} set" terminology to "is {non-,}NULL" to be more precise.

No functional change.

Signed-off-by: Andrew Cooper 
---
CC: Ian Jackson 
CC: Wei Liu 
CC: Anthony PERARD 
---
 tools/libs/light/libxl_dm.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index 7843c283ca..8a7e084d89 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -127,7 +127,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     struct passwd *user_base, user_pwbuf;
     int rc;
     char *user;
-    uid_t intended_uid = -1;
+    uid_t intended_uid;
     bool kill_by_uid;
 
     /* Only qemu-upstream can run as a different uid */
@@ -135,33 +135,34 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         return 0;
 
     /*
-     * From this point onward, all paths should go through the `out`
-     * label.  The invariants should be:
+     * From this point onward, all paths should go through the `out` (success,
+     * rc = 0) or `err` (failure, rc != 0) labels.  The invariants should be:
      * - rc may be 0, or an error code.
-     * - if rc is an error code, user and intended_uid are ignored.
-     * - if rc is 0, user may be set or not set.
-     * - if user is set, then intended_uid must be set to a UID matching
+     * - if rc is an error code, all settings are ignored.
+     * - if rc is 0, user may be NULL or non-NULL.
+     * - if user is non-NULL, then intended_uid must be set to a UID matching
      *   the username `user`, and kill_by_uid must be set to the appropriate
      *   value.  intended_uid will be checked for root (0).
      */
-
+
     /*
      * If device_model_user is present, set `-runas` even if
     * dm_restrict isn't in use
      */
-    user = b_info->device_model_user;
-    if (user) {
-        rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
+    if (b_info->device_model_user) {
+        rc = userlookup_helper_getpwnam(gc, b_info->device_model_user,
+                                        &user_pwbuf, &user_base);
         if (rc)
             goto err;
 
         if (!user_base) {
             LOGD(ERROR, guest_domid, "Couldn't find device_model_user %s",
-                 user);
+                 b_info->device_model_user);
             rc = ERROR_INVAL;
             goto err;
         }
 
+        user = b_info->device_model_user;
         intended_uid = user_base->pw_uid;
         kill_by_uid = true;
         rc = 0;
@@ -175,8 +176,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     if (!libxl_defbool_val(b_info->dm_restrict)) {
         LOGD(DEBUG, guest_domid,
              "dm_restrict disabled, starting QEMU as root");
-        user = NULL; /* Should already be null, but just in case */
-        kill_by_uid = false; /* Keep older versions of gcc happy */
+        user = NULL;
         rc = 0;
         goto out;
     }
@@ -219,13 +219,14 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
      * QEMU_USER_SHARED.  NB for QEMU_USER_SHARED, all QEMU will run
      * as the same UID, we can't kill by uid; therefore don't set uid.
      */
-    user = LIBXL_QEMU_USER_SHARED;
-    rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
+    rc = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_SHARED,
+                                    &user_pwbuf, &user_base);
     if (rc)
         goto err;
     if (user_base) {
         LOGD(WARN, guest_domid, "Could not find user %s, falling back to %s",
              LIBXL_QEMU_USER_RANGE_BASE, LIBXL_QEMU_USER_SHARED);
+        user = LIBXL_QEMU_USER_SHARED;
         intended_uid = user_base->pw_uid;
         kill_by_uid = false;
         rc = 0;
-- 
2.11.0




Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Julien Grall

On 17/02/2021 15:37, Bertrand Marquis wrote:

Hi Julien,


Hi Bertrand,


On 17 Feb 2021, at 08:50, Julien Grall  wrote:



On 17/02/2021 02:00, Stefano Stabellini wrote:

Hi all,
Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
translate addresses for DMA operations in Dom0. Specifically,
swiotlb-xen is used to translate the address of a foreign page (a page
belonging to a domU) mapped into Dom0 before using it for DMA.
This is important because although Dom0 is 1:1 mapped, DomUs are not. On
systems without an IOMMU swiotlb-xen enables PV drivers to work as long
as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
ends up using the MFN, rather than the GFN.
On systems with an IOMMU, this is not necessary: when a foreign page is
mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
is established for both MMU and SMMU. Dom0 could safely use the GFN
address (instead of the MFN) for DMA operations and they would work. It
would be more efficient than using swiotlb-xen.
If you recall my presentation from Xen Summit 2020, Xilinx is working on
cache coloring. With cache coloring, no domain is 1:1 mapped, not even
Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
work as intended.
The suggested solution for both these issues is to add a new feature
flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
swiotlb-xen because IOMMU translations are available for Dom0. If
XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
initialization. I have tested this scheme with and without cache
coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
works as expected: DMA operations succeed.
What about systems where an IOMMU is present but not all devices are
protected?
There is no way for Xen to know which devices are protected and which
ones are not: devices that do not have the "iommus" property could or
could not be DMA masters.
Perhaps Xen could populate a whitelist of devices protected by the IOMMU
based on the "iommus" property. It would require some added complexity
in Xen and especially in the swiotlb-xen driver in Linux to use it,
which is not ideal.


You are trading a bit more complexity in Xen and Linux against the risk that a
user may not be able to use the hypervisor on his/her platform without a quirk
in Xen (see more below).


However, this approach would not work for cache
coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
used either way


Not all Dom0 Linux kernels will be able to work with cache colouring. So you
will need a way for the kernel to say "Hey, I can avoid using swiotlb".


I fully agree and from my current understanding the condition is “having an 
iommu”.




For these reasons, I would like to propose a single flag
XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
DMA translations. In situations where a DMA master is not SMMU
protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
platform where an IOMMU is present and protects most DMA masters but it
is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
not be set (because PV block is not going to work without swiotlb-xen.)
This also means that cache coloring won't be usable on such a system (at
least not usable with the MMC controller so the system integrator should
pay special care to setup the system).
It is worth noting that if we wanted to extend the interface to add a
list of protected devices in the future, it would still be possible. It
would be compatible with XENFEAT_ARM_dom0_iommu.


I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and 
instead the device-tree list would be used. Is that correct?


What do you mean by device tree list here ?


Sorry I meant "device list". I was referring to Stefano's suggestion to 
describe the list of devices protected in the device-tree.







How to set XENFEAT_ARM_dom0_iommu?
We could set XENFEAT_ARM_dom0_iommu automatically when
is_iommu_enabled(d) for Dom0. We could also have a platform specific
(xen/arch/arm/platforms/) override so that a specific platform can
disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
users, it would also be useful to be able to override it via a Xen
command line parameter.

Platform quirks should be limited to a small set of platforms.

In this case, this would not be only per-platform but also per-firmware table 
as a developer can decide to remove/add IOMMU nodes in the DT at any time.

In addition to that, it means we are introducing a regression for those users,
as Xen 4.14 would have worked on their platform but newer versions would not.
They would need to go through all the nodes and find out which one is not protected.


I am not sure I understand your point here, as we cannot detect if a device is
protected or not by an IOMMU: we do not know which devices require one.


That's correct...


Could you explain what use case 

Re: [for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 15:13, Jan Beulich wrote:

On 17.02.2021 15:24, Julien Grall wrote:

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+/*
+ * There should be not page-tables left allocated by the time the
Nit: s/not/no/ ?


+ * domain is destroyed. Note that arch_iommu_domain_destroy() is
+ * called unconditionally, so pgtables may be uninitialized.
+ */
+ASSERT(dom_iommu(d)->platform_ops == NULL ||
+   page_list_empty(&dom_iommu(d)->arch.pgtables.list));
  }
  
  static bool __hwdom_init hwdom_iommu_map(const struct domain *d,

@@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
   */
  hd->platform_ops->clear_root_pgtable(d);
  
+/* After this barrier no new page allocations can occur. */

+spin_barrier(&hd->arch.pgtables.lock);


Didn't patch 2 utilize the call to ->clear_root_pgtable() itself as
the barrier? Why introduce another one (with a similar comment)
explicitly now?
The barriers act differently: one will gate against any IOMMU page-table
modification, the other one will gate against allocation.


There is no guarantee that the former will prevent the latter.




@@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
  unmap_domain_page(p);
  
 spin_lock(&hd->arch.pgtables.lock);

-page_list_add(pg, &hd->arch.pgtables.list);
+/*
+ * The IOMMU page-tables are freed when relinquishing the domain, but
+ * nothing prevents allocations from happening afterwards. There is no
+ * valid reason to continue to update the IOMMU page-tables while the
+ * domain is dying.
+ *
+ * So prevent page-table allocation when the domain is dying.
+ *
+ * We are relying on hd->arch.pgtables.lock to synchronize d->is_dying.
+ */
+if ( likely(!d->is_dying) )
+{
+alive = true;
+page_list_add(pg, &hd->arch.pgtables.list);
+}
 spin_unlock(&hd->arch.pgtables.lock);
  
+if ( unlikely(!alive) )

+{
+free_domheap_page(pg);
+pg = NULL;
+}
+
  return pg;
  }


As before I'm concerned of this forcing error paths to be taken
elsewhere, in case an allocation still happens (e.g. from unmap
once super page mappings are supported). Considering some of the
error handling in the IOMMU code is to invoke domain_crash(), it
would be quite unfortunate if we ended up crashing a domain
while it is being cleaned up after.


It is unfortunate, but I think this is better than having to leak page 
tables.




Additionally, the (at present still hypothetical) unmap case, if
failing because of the change here, would then again chance to
leave mappings in place while the underlying pages get freed. As
this would likely require an XSA, the change doesn't feel like
"hardening" to me.


I would agree with this if memory allocations could never fail. That's
not the case, and it will become worse as we use the IOMMU pool.


Do you have callers in mind that don't check the return value of iommu_unmap()?

Cheers,


--
Julien Grall



Re: [for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 15:01, Jan Beulich wrote:

On 17.02.2021 15:24, Julien Grall wrote:

From: Julien Grall 

The new x86 IOMMU page-tables allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient
when the domain is dying because nothing prevents page-tables from being
allocated.

Currently page-table allocations can only happen from iommu_map(). As
the domain is dying, there is no good reason to continue to modify the
IOMMU page-tables.


While I agree this to be the case right now, I'm not sure it is a
good idea to build on it (in that you leave the unmap paths
untouched).


I don't build on that assumption. See next patch.


Imo there's a fair chance this would be overlooked at
the point super page mappings get introduced (which has been long
overdue), and I thought prior discussion had lead to a possible
approach without risking use-after-free due to squashed unmap
requests.


I know you suggested zapping the root page-tables... However, I don't
think this is 4.15 material, and you agree with this (you were the one
who pointed that out).



--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d)
  /*
   * Pages will be moved to the free list below. So we want to
   * clear the root page-table to avoid any potential use after-free.
+ *
+ * After this call, no more IOMMU mapping can happen.
+ *
   */
  hd->platform_ops->clear_root_pgtable(d);


I.e. you utilize the call in place of spin_barrier(). Maybe worth
saying in the comment?


Sure.



Also, nit: Stray blank comment line.


Cheers,

--
Julien Grall



[xen-unstable-smoke test] 159443: tolerable all pass - PUSHED

2021-02-17 Thread osstest service owner
flight 159443 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159443/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  d670ef3401b91d04c58d72cd8ce5579b4fa900d8
baseline version:
 xen  3b1cc15f1931ba56d0ee256fe9bfe65509733b27

Last test of basis   159416  2021-02-16 15:01:29 Z1 days
Testing same since   159443  2021-02-17 12:00:28 Z0 days1 attempts


People who touched revisions under test:
  Julien Grall 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   3b1cc15f19..d670ef3401  d670ef3401b91d04c58d72cd8ce5579b4fa900d8 -> smoke



Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Bertrand Marquis
Hi Julien,

> On 17 Feb 2021, at 08:50, Julien Grall  wrote:
> 
> 
> 
> On 17/02/2021 02:00, Stefano Stabellini wrote:
>> Hi all,
>> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
>> translate addresses for DMA operations in Dom0. Specifically,
>> swiotlb-xen is used to translate the address of a foreign page (a page
>> belonging to a domU) mapped into Dom0 before using it for DMA.
>> This is important because although Dom0 is 1:1 mapped, DomUs are not. On
>> systems without an IOMMU swiotlb-xen enables PV drivers to work as long
>> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
>> ends up using the MFN, rather than the GFN.
>> On systems with an IOMMU, this is not necessary: when a foreign page is
>> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
>> is established for both MMU and SMMU. Dom0 could safely use the GFN
>> address (instead of the MFN) for DMA operations and they would work. It
>> would be more efficient than using swiotlb-xen.
>> If you recall my presentation from Xen Summit 2020, Xilinx is working on
>> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
>> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
>> work as intended.
>> The suggested solution for both these issues is to add a new feature
>> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
>> swiotlb-xen because IOMMU translations are available for Dom0. If
>> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
>> initialization. I have tested this scheme with and without cache
>> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
>> works as expected: DMA operations succeed.
>> What about systems where an IOMMU is present but not all devices are
>> protected?
>> There is no way for Xen to know which devices are protected and which
>> ones are not: devices that do not have the "iommus" property could or
>> could not be DMA masters.
>> Perhaps Xen could populate a whitelist of devices protected by the IOMMU
>> based on the "iommus" property. It would require some added complexity
>> in Xen and especially in the swiotlb-xen driver in Linux to use it,
>> which is not ideal.
> 
> You are trading a bit more complexity in Xen and Linux against the risk that
> a user may not be able to use the hypervisor on his/her platform without a
> quirk in Xen (see more below).
> 
>> However, this approach would not work for cache
>> coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
>> used either way
> 
> Not all Dom0 Linux kernels will be able to work with cache colouring. So
> you will need a way for the kernel to say "Hey, I can avoid using swiotlb".

I fully agree and from my current understanding the condition is “having an 
iommu”.

> 
>> For these reasons, I would like to propose a single flag
>> XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
>> DMA translations. In situations where a DMA master is not SMMU
>> protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
>> platform where an IOMMU is present and protects most DMA masters but it
>> is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
>> not be set (because PV block is not going to work without swiotlb-xen.)
>> This also means that cache coloring won't be usable on such a system (at
>> least not usable with the MMC controller so the system integrator should
>> pay special care to setup the system).
>> It is worth noting that if we wanted to extend the interface to add a
>> list of protected devices in the future, it would still be possible. It
>> would be compatible with XENFEAT_ARM_dom0_iommu.
> 
> I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and 
> instead the device-tree list would be used. Is that correct?

What do you mean by device tree list here ?

> 
>> How to set XENFEAT_ARM_dom0_iommu?
>> We could set XENFEAT_ARM_dom0_iommu automatically when
>> is_iommu_enabled(d) for Dom0. We could also have a platform specific
>> (xen/arch/arm/platforms/) override so that a specific platform can
>> disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
>> users, it would also be useful to be able to override it via a Xen
>> command line parameter.
> Platform quirks should be limited to a small set of platforms.
> 
> In this case, this would not be only per-platform but also per-firmware table 
> as a developer can decide to remove/add IOMMU nodes in the DT at any time.
> 
> In addition to that, it means we are introducing a regression for those users,
> as Xen 4.14 would have worked on their platform but newer versions would not.
> They would need to go through all the nodes and find out which one is not protected.

I am not sure I understand your point here, as we cannot detect if a device is
protected or not by an IOMMU: we do not know which devices require one.
Could you explain what use case working before 

Re: Linux PV/PVH domU crash on (guest) resume from suspend

2021-02-17 Thread Jürgen Groß

On 17.02.21 14:48, Marek Marczykowski-Górecki wrote:

On Wed, Feb 17, 2021 at 07:51:42AM +0100, Jürgen Groß wrote:

On 17.02.21 06:12, Marek Marczykowski-Górecki wrote:

Hi,

I'm observing Linux PV/PVH guest crash when I resume it from sleep. I do
this with:

  virsh -c xen dompmsuspend  mem
  virsh -c xen dompmwakeup 

But it's possible to trigger it with plain xl too:

  xl save -c  

The same on HVM works fine.

This is on Xen 4.14.1, and with guest kernel 5.4.90, the same happens
with 5.4.98. Dom0 kernel is the same, but I'm not sure if that's
relevant here. I can reliably reproduce it.


This is already on my list of issues to look at.

The problem seems to be related to the XSA-332 patches. You could try
the patches I've sent out recently addressing other fallout from XSA-332
which _might_ fix this issue, too:

https://patchew.org/Xen/20210211101616.13788-1-jgr...@suse.com/


Thanks for the patches. Sadly it doesn't change anything - I get exactly
the same crash. I applied that on top of 5.11-rc7 (that's what I had
handy). If you think there may be a difference with the final 5.11 or
another branch, please let me know.



Okay, thanks for testing.

I hope to find some time to look into the issue soon.


Juergen




Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Bertrand Marquis
Hi Stefano,

> On 17 Feb 2021, at 02:00, Stefano Stabellini  wrote:
> 
> Hi all,
> 
> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
> translate addresses for DMA operations in Dom0. Specifically,
> swiotlb-xen is used to translate the address of a foreign page (a page
> belonging to a domU) mapped into Dom0 before using it for DMA.
> 
> This is important because although Dom0 is 1:1 mapped, DomUs are not. On
> systems without an IOMMU swiotlb-xen enables PV drivers to work as long
> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
> ends up using the MFN, rather than the GFN.
> 
> 
> On systems with an IOMMU, this is not necessary: when a foreign page is
> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
> is established for both MMU and SMMU. Dom0 could safely use the GFN
> address (instead of the MFN) for DMA operations and they would work. It
> would be more efficient than using swiotlb-xen.
> 
> If you recall my presentation from Xen Summit 2020, Xilinx is working on
> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
> work as intended.
> 
> 
> The suggested solution for both these issues is to add a new feature
> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
> swiotlb-xen because IOMMU translations are available for Dom0. If
> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
> initialization. I have tested this scheme with and without cache
> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
> works as expected: DMA operations succeed.

Wouldn’t it be easier to name the flag XENFEAT_ARM_swiotlb_needed ?

> 
> 
> What about systems where an IOMMU is present but not all devices are
> protected?
> 
> There is no way for Xen to know which devices are protected and which
> ones are not: devices that do not have the "iommus" property could or
> could not be DMA masters.
> 
> Perhaps Xen could populate a whitelist of devices protected by the IOMMU
> based on the "iommus" property. It would require some added complexity
> in Xen and especially in the swiotlb-xen driver in Linux to use it,
> which is not ideal. However, this approach would not work for cache
> coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
> used either way.

Would it be realistic to say that cache coloring cannot be used without an 
IOMMU ?

Having a flag for the swiotlb is a good idea because in the current state
the switch to use swiotlb or not is more or less implicit.
But somehow there are use cases where we should simply say “not supported” as
we do for example for passthrough right now. Maybe cache coloring is a case like
that.

> 
> 
> For these reasons, I would like to propose a single flag
> XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
> DMA translations. In situations where a DMA master is not SMMU
> protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
> platform where an IOMMU is present and protects most DMA masters but it
> is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
> not be set (because PV block is not going to work without swiotlb-xen.)
> This also means that cache coloring won't be usable on such a system (at
> least not usable with the MMC controller so the system integrator should
> pay special care to setup the system).

Any system where you have an IOMMU that is not covering all devices is
almost impossible to magically handle smoothly and will require some
careful configuration. Sadly, as you stated, we do not have a way to
auto-detect such a case.

> 
> It is worth noting that if we wanted to extend the interface to add a
> list of protected devices in the future, it would still be possible. It
> would be compatible with XENFEAT_ARM_dom0_iommu.
> 
> 
> How to set XENFEAT_ARM_dom0_iommu?
> 
> We could set XENFEAT_ARM_dom0_iommu automatically when
> is_iommu_enabled(d) for Dom0. We could also have a platform specific
> (xen/arch/arm/platforms/) override so that a specific platform can
> disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
> users, it would also be useful to be able to override it via a Xen
> command line parameter.
> 
> See appended patch as a reference.

I really think the naming of the flag will need to be modified.

Cheers
Bertrand

> 
> 
> Cheers,
> 
> Stefano
> 
> 
> diff --git a/xen/common/kernel.c b/xen/common/kernel.c
> index 7a345ae45e..4dbef48199 100644
> --- a/xen/common/kernel.c
> +++ b/xen/common/kernel.c
> @@ -16,6 +16,7 @@
> #include 
> #include 
> #include 
> +#include 
> #include 
> 
> #ifndef COMPAT
> @@ -549,6 +550,9 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> fi.submap |= 1U << XENFEAT_dom0;
> #ifdef CONFIG_ARM
> fi.submap |= (1U << XENFEAT_ARM_SMCCC_supported);
> +if ( 
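
On the Linux side, consuming such a flag could be as simple as the sketch
below. This is an editorial illustration only: XENFEAT_ARM_dom0_iommu is
merely proposed in this thread, the init hook is a simplified stand-in for
the real arm/xen setup code, and xen_swiotlb_init() is shown without its
actual parameters:

#include <xen/features.h>

static int __init xen_arm_dma_init(void)   /* hypothetical init hook */
{
    /*
     * Xen promises that IOMMU translations cover Dom0's DMA masters,
     * so bounce buffering through swiotlb-xen is unnecessary.
     */
    if (xen_feature(XENFEAT_ARM_dom0_iommu))
        return 0;

    /* Otherwise keep today's behaviour and set up swiotlb-xen. */
    return xen_swiotlb_init();             /* simplified call */
}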

Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Jan Beulich
On 17.02.2021 16:00, Julien Grall wrote:
> Hi Jan,
> 
> On 17/02/2021 14:54, Jan Beulich wrote:
>> On 17.02.2021 15:24, Julien Grall wrote:
>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>> @@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
>>>   struct page_info *pg;
>>>   unsigned int done = 0;
>>>   
>>> +if ( !is_iommu_enabled(d) )
>>> +return 0;
>>> +
>>> +/*
>>> + * Pages will be moved to the free list below. So we want to
>>> + * clear the root page-table to avoid any potential use after-free.
>>> + */
>>> +hd->platform_ops->clear_root_pgtable(d);
>>
>> Taking amd_iommu_alloc_root() as example, is this really correct
>> prior to what is now patch 2? 
> 
> Yes, there is no more use-after-free...

And this is because of ...? The necessary lock isn't being held
here, so on another CPU allocation of a new root and then of new
page tables could happen before you make enough progress here,
and hence it looks to me as if there might then still be pages
which get freed while present in the page tables (and hence
accessible by devices).

Jan

>> What guarantees a new root table
>> won't get allocated subsequently?
> 
> It doesn't prevent root table allocation. I view the two as distinct
> issues, hence the two patches.
> 
> Cheers,
> 
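
The race Jan describes can be summarised as follows (editorial sketch in
comment form; the function names are those used in this series):

/*
 * CPU A: domain_kill() path             CPU B: concurrent map request
 * --------------------------            -----------------------------
 * iommu_free_pgtables(d)
 *   clear_root_pgtable(d)               amd_iommu_alloc_root(d)
 *                                         installs a fresh root
 *                                       iommu_alloc_pgtable(d)
 *                                         links new pages below it
 *   page_list_remove_head(...)
 *   free_domheap_page(pg)     <- frees pages CPU B just made reachable
 *                                from the new root: a use-after-free
 *                                visible to devices.
 *
 * Patch 2 closes the window by making map_page() return early, under
 * hd->arch.mapping_lock, once d->is_dying is observed.
 */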




Re: [for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator

2021-02-17 Thread Jan Beulich
On 17.02.2021 15:24, Julien Grall wrote:
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
>  
>  void arch_iommu_domain_destroy(struct domain *d)
>  {
> +/*
> + * There should be not page-tables left allocated by the time the
Nit: s/not/no/ ?

> + * domain is destroyed. Note that arch_iommu_domain_destroy() is
> + * called unconditionally, so pgtables may be uninitialized.
> + */
> +ASSERT(dom_iommu(d)->platform_ops == NULL ||
> +   page_list_empty(&dom_iommu(d)->arch.pgtables.list));
>  }
>  
>  static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> @@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
>   */
>  hd->platform_ops->clear_root_pgtable(d);
>  
> +/* After this barrier no new page allocations can occur. */
> +spin_barrier(&hd->arch.pgtables.lock);

Didn't patch 2 utilize the call to ->clear_root_pgtable() itself as
the barrier? Why introduce another one (with a similar comment)
explicitly now?

> @@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>  unmap_domain_page(p);
>  
>  spin_lock(&hd->arch.pgtables.lock);
> -page_list_add(pg, &hd->arch.pgtables.list);
> +/*
> + * The IOMMU page-tables are freed when relinquishing the domain, but
> + * nothing prevents allocations from happening afterwards. There is no
> + * valid reason to continue to update the IOMMU page-tables while the
> + * domain is dying.
> + *
> + * So prevent page-table allocation when the domain is dying.
> + *
> + * We are relying on hd->arch.pgtables.lock to synchronize d->is_dying.
> + */
> +if ( likely(!d->is_dying) )
> +{
> +alive = true;
> +page_list_add(pg, &hd->arch.pgtables.list);
> +}
>  spin_unlock(&hd->arch.pgtables.lock);
>  
> +if ( unlikely(!alive) )
> +{
> +free_domheap_page(pg);
> +pg = NULL;
> +}
> +
>  return pg;
>  }

As before I'm concerned of this forcing error paths to be taken
elsewhere, in case an allocation still happens (e.g. from unmap
once super page mappings are supported). Considering some of the
error handling in the IOMMU code is to invoke domain_crash(), it
would be quite unfortunate if we ended up crashing a domain
while it is being cleaned up after.

Additionally, the (at present still hypothetical) unmap case, if
failing because of the change here, would then again chance to
leave mappings in place while the underlying pages get freed. As
this would likely require an XSA, the change doesn't feel like
"hardening" to me.

Jan



Re: [for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying

2021-02-17 Thread Jan Beulich
On 17.02.2021 15:24, Julien Grall wrote:
> From: Julien Grall 
> 
> The new x86 IOMMU page-tables allocator will release the pages when
> relinquishing the domain resources. However, this is not sufficient
> when the domain is dying because nothing prevents page-tables from being
> allocated.
> 
> Currently page-table allocations can only happen from iommu_map(). As
> the domain is dying, there is no good reason to continue to modify the
> IOMMU page-tables.

While I agree this to be the case right now, I'm not sure it is a
good idea to build on it (in that you leave the unmap paths
untouched). Imo there's a fair chance this would be overlooked at
the point super page mappings get introduced (which has been long
overdue), and I thought prior discussion had lead to a possible
approach without risking use-after-free due to squashed unmap
requests.

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d)
>  /*
>   * Pages will be moved to the free list below. So we want to
>   * clear the root page-table to avoid any potential use after-free.
> + *
> + * After this call, no more IOMMU mapping can happen.
> + *
>   */
>  hd->platform_ops->clear_root_pgtable(d);

I.e. you utilize the call in place of spin_barrier(). Maybe worth
saying in the comment?

Also, nit: Stray blank comment line.

Jan



Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 14:54, Jan Beulich wrote:

On 17.02.2021 15:24, Julien Grall wrote:

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
  struct page_info *pg;
  unsigned int done = 0;
  
+if ( !is_iommu_enabled(d) )

+return 0;
+
+/*
+ * Pages will be moved to the free list below. So we want to
+ * clear the root page-table to avoid any potential use after-free.
+ */
+hd->platform_ops->clear_root_pgtable(d);


Taking amd_iommu_alloc_root() as example, is this really correct
prior to what is now patch 2? 


Yes, there is no more use-after-free...


What guarantees a new root table
won't get allocated subsequently?


It doesn't prevent root table allocation. I view the two as distinct
issues, hence the two patches.


Cheers,

--
Julien Grall



Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Jan Beulich
On 17.02.2021 15:24, Julien Grall wrote:
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -381,9 +381,18 @@ static int amd_iommu_assign_device(struct domain *d, u8 devfn,
>  return reassign_device(pdev->domain, d, devfn, pdev);
>  }
>  
> +static void iommu_clear_root_pgtable(struct domain *d)

Nit: amd_iommu_ as a prefix would be okay here considering other
(static) functions also use it. Since it is a static function,
no prefix at all would also do (my personal preference). But
iommu_ as a prefix isn't helpful and results in needless re-use
of VT-d's name.

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
>  struct page_info *pg;
>  unsigned int done = 0;
>  
> +if ( !is_iommu_enabled(d) )
> +return 0;
> +
> +/*
> + * Pages will be moved to the free list below. So we want to
> + * clear the root page-table to avoid any potential use after-free.
> + */
> +hd->platform_ops->clear_root_pgtable(d);

Taking amd_iommu_alloc_root() as example, is this really correct
prior to what is now patch 2? What guarantees a new root table
won't get allocated subsequently?

Jan



Re: [PATCH v2 7/7] tests/avocado: add boot_xen tests

2021-02-17 Thread Philippe Mathieu-Daudé
On 2/11/21 6:19 PM, Alex Bennée wrote:
> These tests make sure we can boot the Xen hypervisor with a Dom0
> kernel using the guest-loader. We currently have to use a kernel I
> built myself because there are issues using the Debian kernel images.
> 
> Signed-off-by: Alex Bennée 
> ---
>  MAINTAINERS  |   1 +
>  tests/acceptance/boot_xen.py | 117 +++
>  2 files changed, 118 insertions(+)
>  create mode 100644 tests/acceptance/boot_xen.py

> diff --git a/tests/acceptance/boot_xen.py b/tests/acceptance/boot_xen.py
> new file mode 100644
> index 00..8c7e091d40
> --- /dev/null
> +++ b/tests/acceptance/boot_xen.py
> @@ -0,0 +1,117 @@
> +# Functional test that boots a Xen hypervisor with a domU kernel and
> +# checks the console output is vaguely sane.
> +#
> +# Copyright (c) 2020 Linaro

2021?

Otherwise:
Reviewed-by: Philippe Mathieu-Daudé 



Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 13:16, Jan Beulich wrote:

On 17.02.2021 12:41, Julien Grall wrote:

Hi Jan,

On 17/02/2021 11:38, Jan Beulich wrote:

On 17.02.2021 12:03, Julien Grall wrote:

On 17/02/2021 10:46, Jan Beulich wrote:

Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.

This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
- all x86 PV DomU-s, including driver domains.

Reported-by: Rahul Singh 
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1243,7 +1243,7 @@ map_grant_ref(
goto undo_out;
}

-need_iommu = gnttab_need_iommu_mapping(ld);

+need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);


AFAICT, the owner of the page may not always be rd. So do we want to
check against the owner instead?


For the DomIO case - specifically not. And the DomCOW case can't
happen when an IOMMU is in use. Did I overlook any other cases
where the page may not be owned by rd?


For the current code, it looks like not. But it feels to me this code is
fragile as we are assuming that other cases should never happen.

I think it would be worth explaining in a comment and the commit message
why checking rd rather than the page owner is sufficient.


Well, I've added

 /*
  * This is deliberately not checking the page's owner: get_paged_frame()
  * explicitly rejects foreign pages, and all success paths above yield
  * either owner == rd or owner == dom_io (the dom_cow case is irrelevant
  * as mem-sharing and IOMMU use are incompatible). The dom_io case would
  * need checking separately if we compared against owner here.
  */

to map_grant_ref(), and a reference to this comment to both
unmap_common() and the commit message. Will this do?


LGTM. With that, you can add:

Reviewed-by: Julien Grall 

Cheers,

--
Julien Grall



[for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Julien Grall
From: Julien Grall 

The new per-domain IOMMU page-table allocator will now free the
page-tables when the domain's resources are relinquished. However, the
per-domain IOMMU structure will still contain a dangling pointer to
the root page-table.

Xen may access the IOMMU page-tables afterwards at least in the case of
PV domain:

(XEN) Xen call trace:
(XEN)[] R iommu.c#addr_to_dma_page_maddr+0x12e/0x1d8
(XEN)[] F iommu.c#intel_iommu_unmap_page+0x5d/0xf8
(XEN)[] F iommu_unmap+0x9c/0x129
(XEN)[] F iommu_legacy_unmap+0x26/0x63
(XEN)[] F mm.c#cleanup_page_mappings+0x139/0x144
(XEN)[] F put_page+0x4b/0xb3
(XEN)[] F put_page_from_l1e+0x136/0x13b
(XEN)[] F devalidate_page+0x256/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l2e+0x8a/0xcf
(XEN)[] F devalidate_page+0x3a3/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l3e+0x8a/0xcf
(XEN)[] F devalidate_page+0x56c/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l4e+0x69/0x6d
(XEN)[] F devalidate_page+0x6a0/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F put_page_type_preemptible+0x13/0x15
(XEN)[] F domain.c#relinquish_memory+0x1ff/0x4e9
(XEN)[] F domain_relinquish_resources+0x2b6/0x36a
(XEN)[] F domain_kill+0xb8/0x141
(XEN)[] F do_domctl+0xb6f/0x18e5
(XEN)[] F pv_hypercall+0x2f0/0x55f
(XEN)[] F lstar_enter+0x112/0x120

This will result in a use-after-free and possibly a host crash or
memory corruption.

It would not be possible to free the page-tables further down in
domain_relinquish_resources() because cleanup_page_mappings() will only
be called when the last reference on the page is dropped. This may happen
much later if another domain still holds a reference.

After all the PCI devices have been de-assigned, nobody should use the
IOMMU page-tables and it is therefore pointless to try to modify them.

So we can simply clear any reference to the root page-table in the
per-domain IOMMU structure. This requires introducing a new callback;
the implementation of the method will depend on the IOMMU driver used.

Fixes: 3eef6d07d722 ("x86/iommu: convert VT-d code to use new page table 
allocator")
Signed-off-by: Julien Grall 

---
Changes in v3:
- Move the patch earlier in the series
- Reword the commit message

Changes in v2:
- Introduce clear_root_pgtable()
- Move the patch later in the series
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 12 +++-
 xen/drivers/passthrough/vtd/iommu.c | 12 +++-
 xen/drivers/passthrough/x86/iommu.c |  9 +
 xen/include/xen/iommu.h |  1 +
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 42b5a5a9bec4..81add0ba26b4 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -381,9 +381,18 @@ static int amd_iommu_assign_device(struct domain *d, u8 devfn,
 return reassign_device(pdev->domain, d, devfn, pdev);
 }
 
+static void iommu_clear_root_pgtable(struct domain *d)
+{
+struct domain_iommu *hd = dom_iommu(d);
+
+spin_lock(&hd->arch.mapping_lock);
+hd->arch.amd.root_table = NULL;
+spin_unlock(&hd->arch.mapping_lock);
+}
+
 static void amd_iommu_domain_destroy(struct domain *d)
 {
-dom_iommu(d)->arch.amd.root_table = NULL;
+ASSERT(!dom_iommu(d)->arch.amd.root_table);
 }
 
 static int amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
@@ -565,6 +574,7 @@ static const struct iommu_ops __initconstrel _iommu_ops = {
 .remove_device = amd_iommu_remove_device,
 .assign_device  = amd_iommu_assign_device,
 .teardown = amd_iommu_domain_destroy,
+.clear_root_pgtable = iommu_clear_root_pgtable,
 .map_page = amd_iommu_map_page,
 .unmap_page = amd_iommu_unmap_page,
 .iotlb_flush = amd_iommu_flush_iotlb_pages,
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index d136fe36883b..e1871f6c2bc1 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1726,6 +1726,15 @@ out:
 return ret;
 }
 
+static void iommu_clear_root_pgtable(struct domain *d)
+{
+struct domain_iommu *hd = dom_iommu(d);
+
+spin_lock(&hd->arch.mapping_lock);
+hd->arch.vtd.pgd_maddr = 0;
+spin_unlock(&hd->arch.mapping_lock);
+}
+
 static void iommu_domain_teardown(struct domain *d)
 {
 struct domain_iommu *hd = dom_iommu(d);
@@ -1740,7 +1749,7 @@ static void iommu_domain_teardown(struct domain *d)
 xfree(mrmrr);
 }
 
-hd->arch.vtd.pgd_maddr = 0;
+ASSERT(!hd->arch.vtd.pgd_maddr);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
@@ -2719,6 +2728,7 @@ 
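
The failure mode being fixed is easier to follow as a call graph; the sketch
below is an editorial condensation of the stack trace in the commit message
above:

/*
 * domain_kill(d)
 *   -> domain_relinquish_resources(d)
 *        -> iommu_free_pgtables(d)       // pages freed, but the root
 *                                        // pointer left dangling
 *        -> relinquish_memory(d)
 *             -> put_page() -> devalidate_page()
 *                  -> iommu_legacy_unmap()
 *                       -> addr_to_dma_page_maddr()
 *                            // walks the stale root: use-after-free
 *
 * With clear_root_pgtable() run first, the later unmap finds no root
 * page-table and bails out instead of walking freed memory.
 */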

[for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator

2021-02-17 Thread Julien Grall
From: Julien Grall 

At the moment, we are assuming that only iommu_map() can allocate
IOMMU page-table.

Given the complexity of the IOMMU framework, it would be sensible to
have a check closer to the IOMMU allocator. This would avoid leaking
IOMMU page-tables again in the future.

iommu_alloc_pgtable() is now checking if the domain is dying before
adding the page to the list. We are relying on hd->arch.pgtables.lock
to synchronize d->is_dying.

Take the opportunity to add an ASSERT() in arch_iommu_domain_destroy()
to check if we freed all the IOMMU page tables.

Signed-off-by: Julien Grall 

---

Changes in v3:
- Rename the patch. This was originally "xen/iommu: x86: Don't leak
the IOMMU page-tables"
- Rework the commit message
- Move the patch towards the end of the series

Changes in v2:
- Rework the approach
- Move the patch earlier in the series
---
 xen/drivers/passthrough/x86/iommu.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index faa0078db595..a67075f0045d 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+/*
+ * There should be not page-tables left allocated by the time the
+ * domain is destroyed. Note that arch_iommu_domain_destroy() is
+ * called unconditionally, so pgtables may be uninitialized.
+ */
+ASSERT(dom_iommu(d)->platform_ops == NULL ||
+   page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
  */
 hd->platform_ops->clear_root_pgtable(d);
 
+/* After this barrier no new page allocations can occur. */
+spin_barrier(&hd->arch.pgtables.lock);
+
while ( (pg = page_list_remove_head(&hd->arch.pgtables.list)) )
 {
 free_domheap_page(pg);
@@ -296,6 +306,7 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
 unsigned int memflags = 0;
 struct page_info *pg;
 void *p;
+bool alive = false;
 
 #ifdef CONFIG_NUMA
 if ( hd->node != NUMA_NO_NODE )
@@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
 unmap_domain_page(p);
 
 spin_lock(&hd->arch.pgtables.lock);
-page_list_add(pg, &hd->arch.pgtables.list);
+/*
+ * The IOMMU page-tables are freed when relinquishing the domain, but
+ * nothing prevents allocations from happening afterwards. There is no
+ * valid reason to continue to update the IOMMU page-tables while the
+ * domain is dying.
+ *
+ * So prevent page-table allocation when the domain is dying.
+ *
+ * We are relying on hd->arch.pgtables.lock to synchronize d->is_dying.
+ */
+if ( likely(!d->is_dying) )
+{
+alive = true;
+page_list_add(pg, &hd->arch.pgtables.list);
+}
 spin_unlock(&hd->arch.pgtables.lock);
 
+if ( unlikely(!alive) )
+{
+free_domheap_page(pg);
+pg = NULL;
+}
+
 return pg;
 }
 
-- 
2.17.1
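
The synchronisation this patch relies on is compact enough to spell out. An
editorial sketch of the protocol, condensed from the hunks above:

/* Teardown side (iommu_free_pgtables(), after d->is_dying is set): */
hd->platform_ops->clear_root_pgtable(d);
spin_barrier(&hd->arch.pgtables.lock);  /* drain current lock holders */
/* every page on hd->arch.pgtables.list may now be freed safely */

/* Allocation side (iommu_alloc_pgtable()): */
spin_lock(&hd->arch.pgtables.lock);
if ( likely(!d->is_dying) )             /* sections in flight at the
                                           barrier finish before the free
                                           loop runs; later acquirers
                                           must observe is_dying */
    page_list_add(pg, &hd->arch.pgtables.list);
spin_unlock(&hd->arch.pgtables.lock);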




[for-4.15][PATCH v3 0/3] xen/iommu: Collection of bug fixes for IOMMU teadorwn

2021-02-17 Thread Julien Grall
From: Julien Grall 

Hi all,

This series is a collection of bug fixes for the IOMMU teardown code.
All of them are candidate for 4.15 as they can either leak memory or
lead to host crash/host corruption.

This is sent directly on xen-devel because all the issues were either
introduced in 4.15 or happen in the domain creation code.

Major changes since v2:
- patch #1 "xen/x86: p2m: Don't map the special pages in the IOMMU
page-tables" has been removed. This requires Jan's patch [2] to
fully mitigate memory leaks.

Cheers,

[1] 
https://lore.kernel.org/xen-devel/1b6a411b-84e7-bfb1-647e-511a13df8...@suse.com

Julien Grall (3):
  xen/iommu: x86: Clear the root page-table before freeing the
page-tables
  xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying
  xen/iommu: x86: Harden the IOMMU page-table allocator

 xen/drivers/passthrough/amd/iommu_map.c | 13 ++
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 12 +-
 xen/drivers/passthrough/vtd/iommu.c | 25 +++-
 xen/drivers/passthrough/x86/iommu.c | 45 -
 xen/include/xen/iommu.h |  1 +
 5 files changed, 93 insertions(+), 3 deletions(-)

-- 
2.17.1




[for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying

2021-02-17 Thread Julien Grall
From: Julien Grall 

The new x86 IOMMU page-tables allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient
when the domain is dying because nothing prevents page-tables from being
allocated.

Currently page-table allocations can only happen from iommu_map(). As
the domain is dying, there is no good reason to continue to modify the
IOMMU page-tables.

In order to observe d->is_dying correctly, we need to rely on per-arch
locking, so the check to ignore IOMMU mapping requests is added in the
per-driver map_page() callback.

Signed-off-by: Julien Grall 

---

Changes in v3:
- Patch added. This is a replacement of "xen/iommu: iommu_map: Don't
crash the domain if it is dying"
---
 xen/drivers/passthrough/amd/iommu_map.c | 13 +
 xen/drivers/passthrough/vtd/iommu.c | 13 +
 xen/drivers/passthrough/x86/iommu.c |  3 +++
 3 files changed, 29 insertions(+)

diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index d3a8b1aec766..ed78a083ba12 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -285,6 +285,19 @@ int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
 
 spin_lock(&hd->arch.mapping_lock);
 
+/*
+ * IOMMU mapping request can be safely ignored when the domain is dying.
+ *
+ * hd->arch.mapping_lock guarantees that d->is_dying will be observed
+ * before any page tables are freed (see iommu_free_pgtables() and
+ * iommu_clear_root_pgtable()).
+ */
+if ( d->is_dying )
+{
+spin_unlock(&hd->arch.mapping_lock);
+return 0;
+}
+
 rc = amd_iommu_alloc_root(d);
 if ( rc )
 {
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index e1871f6c2bc1..239a63f74f64 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1771,6 +1771,19 @@ static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
 
 spin_lock(&hd->arch.mapping_lock);
 
+/*
+ * IOMMU mapping request can be safely ignored when the domain is dying.
+ *
+ * hd->arch.mapping_lock guarantees that d->is_dying will be observed
+ * before any page tables are freed (see iommu_free_pgtables() and
+ * iommu_clear_root_pgtable()).
+ */
+if ( d->is_dying )
+{
+spin_unlock(&hd->arch.mapping_lock);
+return 0;
+}
+
 pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
 if ( !pg_maddr )
 {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index f54fc8093f18..faa0078db595 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d)
 /*
  * Pages will be moved to the free list below. So we want to
  * clear the root page-table to avoid any potential use after-free.
+ *
+ * After this call, no more IOMMU mapping can happen.
+ *
  */
 hd->platform_ops->clear_root_pgtable(d);
 
-- 
2.17.1
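
The ordering argument in the comment added by both hunks deserves to be
spelled out (editorial sketch):

/*
 * d->is_dying is set before iommu_free_pgtables() runs, and
 * clear_root_pgtable() takes and releases hd->arch.mapping_lock.
 * Any map_page() that acquires mapping_lock after that release
 * therefore observes d->is_dying == true and returns before touching
 * the (about to be freed) page-tables; any map_page() that acquired
 * the lock earlier completed before the root was cleared and the
 * pages were freed.
 */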




Re: [for-4.15][PATCH v2 5/5] xen/iommu: x86: Clear the root page-table before freeing the page-tables

2021-02-17 Thread Julien Grall

Hi Kevin,

On 10/02/2021 02:21, Tian, Kevin wrote:

From: Julien Grall 
Sent: Tuesday, February 9, 2021 11:28 PM

From: Julien Grall 

The new per-domain IOMMU page-table allocator will now free the
page-tables when domain's resources are relinquished. However, the root
page-table (i.e. hd->arch.pg_maddr) will not be cleared.


It's the pointer, not the table itself.


I don't think the sentence is incorrect. One can view clearing as either 
clearing the page table itself or the pointer.






Xen may access the IOMMU page-tables afterwards at least in the case of
PV domain:

(XEN) Xen call trace:
(XEN)[] R
iommu.c#addr_to_dma_page_maddr+0x12e/0x1d8
(XEN)[] F
iommu.c#intel_iommu_unmap_page+0x5d/0xf8
(XEN)[] F iommu_unmap+0x9c/0x129
(XEN)[] F iommu_legacy_unmap+0x26/0x63
(XEN)[] F mm.c#cleanup_page_mappings+0x139/0x144
(XEN)[] F put_page+0x4b/0xb3
(XEN)[] F put_page_from_l1e+0x136/0x13b
(XEN)[] F devalidate_page+0x256/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l2e+0x8a/0xcf
(XEN)[] F devalidate_page+0x3a3/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l3e+0x8a/0xcf
(XEN)[] F devalidate_page+0x56c/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F mm.c#put_pt_page+0x6f/0x80
(XEN)[] F mm.c#put_page_from_l4e+0x69/0x6d
(XEN)[] F devalidate_page+0x6a0/0x8dc
(XEN)[] F mm.c#_put_page_type+0x236/0x47e
(XEN)[] F put_page_type_preemptible+0x13/0x15
(XEN)[] F domain.c#relinquish_memory+0x1ff/0x4e9
(XEN)[] F domain_relinquish_resources+0x2b6/0x36a
(XEN)[] F domain_kill+0xb8/0x141
(XEN)[] F do_domctl+0xb6f/0x18e5
(XEN)[] F pv_hypercall+0x2f0/0x55f
(XEN)[] F lstar_enter+0x112/0x120

This will result in a use-after-free and possibly a host crash or
memory corruption.

Freeing the page-tables further down in domain_relinquish_resources()
would not work because pages may not be released until later if another
domain holds a reference on them.

Once all the PCI devices have been de-assigned, it is actually pointless
to access or modify the IOMMU page-tables. So we can simply clear the root
page-table address.


Above two paragraphs are confusing to me. I don't know what exact change
is proposed until looking over the whole patch. Isn't it clearer to just say
"We should clear the root page table pointer in iommu_free_pgtables to
avoid use-after-free after all pgtables are moved to the free list"?


Your sentence explains the approach taken but not the rationale behind the
approach. Both are equally important for future reference.


I will try to reword it.

Cheers,

--
Julien Grall



Re: Linux PV/PVH domU crash on (guest) resume from suspend

2021-02-17 Thread Marek Marczykowski-Górecki
On Wed, Feb 17, 2021 at 07:51:42AM +0100, Jürgen Groß wrote:
> On 17.02.21 06:12, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > I'm observing Linux PV/PVH guest crash when I resume it from sleep. I do
> > this with:
> > 
> >  virsh -c xen dompmsuspend  mem
> >  virsh -c xen dompmwakeup 
> > 
> > But it's possible to trigger it with plain xl too:
> > 
> >  xl save -c  
> > 
> > The same on HVM works fine.
> > 
> > This is on Xen 4.14.1, and with guest kernel 5.4.90, the same happens
> > with 5.4.98. Dom0 kernel is the same, but I'm not sure if that's
> > relevant here. I can reliably reproduce it.
> 
> This is already on my list of issues to look at.
> 
> The problem seems to be related to the XSA-332 patches. You could try
> the patches I've sent out recently addressing other fallout from XSA-332
> which _might_ fix this issue, too:
> 
> https://patchew.org/Xen/20210211101616.13788-1-jgr...@suse.com/

Thanks for the patches. Sadly it doesn't change anything - I get exactly
the same crash. I applied that on top of 5.11-rc7 (that's what I had
handy). If you think there may be a difference with the final 5.11 or
another branch, please let me know.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v2 8/8] xen/evtchn: use READ/WRITE_ONCE() for accessing ring indices

2021-02-17 Thread Ross Lagerwall
On 2021-02-11 10:16, Juergen Gross wrote:
> For avoiding read- and write-tearing by the compiler use READ_ONCE()
> and WRITE_ONCE() for accessing the ring indices in evtchn.c.
> 
> Signed-off-by: Juergen Gross 
> ---
> V2:
> - modify all accesses (Julien Grall)
> ---
>  drivers/xen/evtchn.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
> index 421382c73d88..620008f89dbe 100644
> --- a/drivers/xen/evtchn.c
> +++ b/drivers/xen/evtchn.c
> @@ -162,6 +162,7 @@ static irqreturn_t evtchn_interrupt(int irq, void *data)
>  {
>   struct user_evtchn *evtchn = data;
>   struct per_user_data *u = evtchn->user;
> + unsigned int prod, cons;
>  
>   WARN(!evtchn->enabled,
>"Interrupt for port %u, but apparently not enabled; per-user %p\n",
> @@ -171,10 +172,14 @@ static irqreturn_t evtchn_interrupt(int irq, void *data)
>  
>   spin_lock(>ring_prod_lock);
>  
> - if ((u->ring_prod - u->ring_cons) < u->ring_size) {
> - *evtchn_ring_entry(u, u->ring_prod) = evtchn->port;
> + prod = READ_ONCE(u->ring_prod);
> + cons = READ_ONCE(u->ring_cons);
> +
> + if ((prod - cons) < u->ring_size) {
> + *evtchn_ring_entry(u, prod) = evtchn->port;
>   smp_wmb(); /* Ensure ring contents visible */
> - if (u->ring_cons == u->ring_prod++) {
> + if (cons == prod++) {
> + WRITE_ONCE(u->ring_prod, prod);
>   wake_up_interruptible(>evtchn_wait);
>   kill_fasync(>evtchn_async_queue,
>   SIGIO, POLL_IN);

This doesn't work correctly since now u->ring_prod is only updated if cons == 
prod++.

Ross
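
One way to repair this while keeping the READ_ONCE()/WRITE_ONCE() conversion
is to publish the producer index unconditionally and use the snapshot only
for the empty-ring test. An editorial sketch, not a posted patch:

	prod = READ_ONCE(u->ring_prod);
	cons = READ_ONCE(u->ring_cons);

	if ((prod - cons) < u->ring_size) {
		*evtchn_ring_entry(u, prod) = evtchn->port;
		smp_wmb(); /* Ensure ring contents visible */
		WRITE_ONCE(u->ring_prod, prod + 1);	/* always publish */
		if (cons == prod) {			/* ring was empty */
			wake_up_interruptible(&u->evtchn_wait);
			kill_fasync(&u->evtchn_async_queue,
				    SIGIO, POLL_IN);
		}
	}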



Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Jan Beulich
On 17.02.2021 12:41, Julien Grall wrote:
> Hi Jan,
> 
> On 17/02/2021 11:38, Jan Beulich wrote:
>> On 17.02.2021 12:03, Julien Grall wrote:
>>> On 17/02/2021 10:46, Jan Beulich wrote:
 Mappings for a domain's own pages should already be present in the
 IOMMU. While installing the same mapping again is merely redundant (and
 inefficient), removing the mapping when the grant mapping gets removed
 is outright wrong in this case: The mapping was there before the map, so
 should remain in place after unmapping.

 This affects
 - Arm Dom0 in the direct mapped case,
 - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
 - all x86 PV DomU-s, including driver domains.

 Reported-by: Rahul Singh 
 Signed-off-by: Jan Beulich 

 --- a/xen/common/grant_table.c
 +++ b/xen/common/grant_table.c
 @@ -1243,7 +1243,7 @@ map_grant_ref(
goto undo_out;
}

 -need_iommu = gnttab_need_iommu_mapping(ld);
 +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
>>>
>>> AFAICT, the owner of the page may not always be rd. So do we want to
>>> check against the owner instead?
>>
>> For the DomIO case - specifically not. And the DomCOW case can't
>> happen when an IOMMU is in use. Did I overlook any other cases
>> where the page may not be owned by rd?
> 
> For the current code, it looks like not. But it feels to me this code is 
> fragile as we are assuming that other cases should never happen.
> 
> I think it would be worth explaining in a comment and the commit message
> why checking rd rather than the page owner is sufficient.

Well, I've added

/*
 * This is deliberately not checking the page's owner: get_paged_frame()
 * explicitly rejects foreign pages, and all success paths above yield
 * either owner == rd or owner == dom_io (the dom_cow case is irrelevant
 * as mem-sharing and IOMMU use are incompatible). The dom_io case would
 * need checking separately if we compared against owner here.
 */

to map_grant_ref(), and a reference to this comment to both
unmap_common() and the commit message. Will this do?

Jan
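
The bug being fixed is compact enough to show as a timeline (editorial
sketch; assume ld == rd and gnttab_need_iommu_mapping(ld) is true):

/*
 * before map:     IOMMU already holds   dfn -> mfn  (ld's own page)
 * map_grant_ref:  iommu_map() re-installs  dfn -> mfn  (redundant)
 * unmap:          iommu_unmap() removes    dfn -> mfn  (wrong: the
 *                 mapping predated the grant and must survive it)
 *
 * Hence the fix: skip the IOMMU (un)map entirely when ld == rd.
 */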



Re: [for-4.15][PATCH v2 4/5] xen/iommu: x86: Don't leak the IOMMU page-tables

2021-02-17 Thread Jan Beulich
On 17.02.2021 12:49, Julien Grall wrote:
> On 10/02/2021 16:12, Jan Beulich wrote:
>> On 10.02.2021 16:04, Julien Grall wrote:
>>> Are you know saying that the following snipped would be fine:
>>>
>>> if ( d->is_dying )
>>> return 0;
>>
>> In {amd,intel}_iommu_map_page(), after the lock was acquired
>> and with it suitably released, yes. And if that's what you
>> suggested, then I'm sorry - I don't think I can see anything
>> fragile there anymore.
> 
> Duplicating the check sounds good to me.

The checks in said functions are mandatory, and I didn't really
have any duplication in mind. But yes, iommu_map() could have
an early (but racy) check, if so wanted.

 You could
 utilize the same lock, but you'd need to duplicate the
 checking in {amd,intel}_iommu_map_page().

 I'm not entirely certain in the case about unmap requests:
 It may be possible to also suppress/ignore them, but this
 may require some further thought.
>>>
>>> I think the unmap part is quite risky with regard to d->is_dying because the
>>> PCI devices may not be quiesced and may still be assigned to the domain.
>>
>> Hmm, yes, good point. Of course upon first unmap with is_dying
>> observed set we could zap the root page table, but I don't
>> suppose that's something we want to do for 4.15.
> 
> We would still need to zap the root page table in the relinquish path. 
> So I am not sure what benefits it would give us to zap the page tables 
> on the first iommu_unmap() after the domain dies.

I guess we think of different aspects of the zapping - what I'm
suggesting here is for the effect of no translations remaining
active anymore. I'm not after freeing the memory at this point;
that'll have to happen on the relinquish path, as you say. The
problem with allowing individual unmaps to proceed (unlike your
plan for maps) is that these, too, can trigger allocations (when
a large page needs shattering).

Jan
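
The allocation-on-unmap case mentioned above is worth illustrating, since it
is what makes ignoring unmaps risky (editorial sketch; superpage support is
still hypothetical at this point):

/*
 * Unmapping 4KiB out of a 2MiB superpage requires shattering it:
 *
 *   before:  PDE -> 2MiB superpage
 *   unmap(dfn):
 *     pg = iommu_alloc_pgtable(d);  // refused for a dying domain once
 *                                   // this series is applied
 *     // fill pg with 512 4KiB entries cloned from the superpage,
 *     // clear the entry covering dfn, then point the PDE at pg
 *
 * If the allocation is refused the unmap fails, and the caller may have
 * already freed the underlying page, leaving a live translation.
 */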



[xen-4.12-testing test] 159418: tolerable FAIL - PUSHED

2021-02-17 Thread osstest service owner
flight 159418 xen-4.12-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159418/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 159241
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 159241
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 159241
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 159241
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 159241
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 159241
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 159241
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 159241
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 159241
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 159241
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 159241
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  4cf5929606adc2fb1ab4e2921c14ba4b8046ecd1
baseline version:
 xen  8d26cdd3b66ab86d560dacd763d76ff3da95723e

Last test of basis   159241  2021-02-11 05:00:05 Z6 days
Testing same since   159418  2021-02-16 15:06:11 Z0 days1 attempts


People who touched revisions under test:
  Julien Grall 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64-xtf   

Re: [for-4.15][PATCH v2 4/5] xen/iommu: x86: Don't leak the IOMMU page-tables

2021-02-17 Thread Julien Grall

Hi Jan,

On 10/02/2021 16:12, Jan Beulich wrote:

On 10.02.2021 16:04, Julien Grall wrote:

On 10/02/2021 14:32, Jan Beulich wrote:

On 09.02.2021 16:28, Julien Grall wrote:

From: Julien Grall 

The new IOMMU page-tables allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient
when the domain is dying, because nothing prevents page-tables from
being allocated.

iommu_alloc_pgtable() is now checking if the domain is dying before
adding the page to the list. We are relying on d->arch.pgtables.lock
to synchronize d->is_dying.

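A sketch of the approach the commit message describes (the structure is
assumed from the description above; the actual hunk isn't quoted here):

struct page_info *iommu_alloc_pgtable(struct domain *d)
{
    struct page_info *pg = alloc_domheap_page(NULL, 0); /* illustrative */
    bool dying;

    if ( !pg )
        return NULL;

    spin_lock(&d->arch.pgtables.lock);
    dying = d->is_dying;       /* serialized by the pgtables lock */
    if ( !dying )
        page_list_add(pg, &d->arch.pgtables.list);
    spin_unlock(&d->arch.pgtables.lock);

    if ( dying )
    {
        /* Too late to track the page: undo the allocation and fail. */
        free_domheap_page(pg);
        return NULL;
    }

    return pg;
}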

As said in reply to an earlier patch, I think suppressing
(really: ignoring) new mappings would be better.


This is exactly what I suggested in v1 but you wrote:

"Ignoring requests there seems fragile to me. Paul - what are your
thoughts about bailing early from hvm_add_ioreq_gfn() when the
domain is dying?"


Was this on the thread of this patch? I didn't find such a
reply of mine. I need more context here because you name
hvm_add_ioreq_gfn() above, while I refer to iommu_map()
(and downwards the call stack).


See [1].




Are you now saying that the following snippet would be fine:

if ( d->is_dying )
return 0;


In {amd,intel}_iommu_map_page(), after the lock was acquired
and with it suitably released, yes. And if that's what you
suggested, then I'm sorry - I don't think I can see anything
fragile there anymore.


Duplicating the check sounds good to me.




You could
utilize the same lock, but you'd need to duplicate the
checking in {amd,intel}_iommu_map_page().

I'm not entirely certain about the case of unmap requests:
It may be possible to also suppress/ignore them, but this
may require some further thought.


I think the unmap part is quite risky to d->is_dying because the PCI
devices may not quiesced and still assigned to the domain.


Hmm, yes, good point. Of course upon first unmap with is_dying
observed set we could zap the root page table, but I don't
suppose that's something we want to do for 4.15.


We would still need to zap the root page table in the relinquish path. 
So I am not sure what benefits it would give us to zap the page tables 
on the first iommu_unmap() after the domain dies.


Cheers,

[1] 
https://lore.kernel.org/xen-devel/f21f1f61-5213-55a8-320c-43e5fe801...@suse.com/


--
Julien Grall



Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 11:38, Jan Beulich wrote:

On 17.02.2021 12:03, Julien Grall wrote:

On 17/02/2021 10:46, Jan Beulich wrote:

Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.

This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
- all x86 PV DomU-s, including driver domains.

Reported-by: Rahul Singh 
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1243,7 +1243,7 @@ map_grant_ref(
   goto undo_out;
   }
   
-need_iommu = gnttab_need_iommu_mapping(ld);

+need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);


AFAICT, the owner of the page may not always be rd. So do we want to
check against the owner instead?


For the DomIO case - specifically not. And the DomCOW case can't
happen when an IOMMU is in use. Did I overlook any other cases
where the page may not be owned by rd?


For the current code, it looks like not. But it feels to me that this code 
is fragile, as we are assuming that other cases can never happen.


I think it would be worth explaining in a comment and the commit message 
why checking rd rather than the page owner is sufficient.


Cheers,

--
Julien Grall
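
For what it's worth, a sketch of the kind of comment being asked for;
the wording below is only a guess based on the reasoning quoted above,
not Jan's actual text:

    /*
     * Checking ld != rd (rather than comparing against the page's
     * actual owner) is sufficient here: owner-based bypassing is
     * specifically not wanted for DomIO, and the DomCOW case cannot
     * arise while an IOMMU is in use.
     */
    need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);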



Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Jan Beulich
On 17.02.2021 12:03, Julien Grall wrote:
> On 17/02/2021 10:46, Jan Beulich wrote:
>> Mappings for a domain's own pages should already be present in the
>> IOMMU. While installing the same mapping again is merely redundant (and
>> inefficient), removing the mapping when the grant mapping gets removed
>> is outright wrong in this case: The mapping was there before the map, so
>> should remain in place after unmapping.
>>
>> This affects
>> - Arm Dom0 in the direct mapped case,
>> - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
>> - all x86 PV DomU-s, including driver domains.
>>
>> Reported-by: Rahul Singh 
>> Signed-off-by: Jan Beulich 
>>
>> --- a/xen/common/grant_table.c
>> +++ b/xen/common/grant_table.c
>> @@ -1243,7 +1243,7 @@ map_grant_ref(
>>   goto undo_out;
>>   }
>>   
>> -need_iommu = gnttab_need_iommu_mapping(ld);
>> +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
> 
> AFAICT, the owner of the page may not always be rd. So do we want to 
> check against the owner instead?

For the DomIO case - specifically not. And the DomCOW case can't
happen when an IOMMU is in use. Did I overlook any other cases
where the page may not be owned by rd?

Jan



Re: [PATCH] xen/arm : smmuv3: Fix to handle multiple StreamIds per device.

2021-02-17 Thread Bertrand Marquis
Hi Rahul,


> On 17 Feb 2021, at 10:05, Rahul Singh  wrote:
> 
> The SMMUv3 driver does not handle multiple StreamIDs if the master device
> supports more than one StreamID.
> 
> This bug was introduced when the driver was ported from Linux to XEN.
> dt_device_set_protected(..) should be called from add_device(..) not
> from the dt_xlate(..).
> 
> Move dt_device_set_protected(..) from dt_xlate(..) to add_device().
> 
> Signed-off-by: Rahul Singh 
Reviewed-by: Bertrand Marquis 

Thanks a lot, this is fixing issues with multiple stream ids for one device :-)

Cheers
Bertrand

> ---
> This patch is a candidate for 4.15 as without this patch it is not possible to
> assign multiple StreamIds to the same device when the device is protected behind
> SMMUv3.
> ---
> xen/drivers/passthrough/arm/smmu-v3.c | 29 ++-
> 1 file changed, 11 insertions(+), 18 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/arm/smmu-v3.c 
> b/xen/drivers/passthrough/arm/smmu-v3.c
> index 914cdc1cf4..53d150cdb6 100644
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device 
> *dev)
>*/
>   arm_smmu_enable_pasid(master);
> 
> - return 0;
> -
> -err_free_master:
> - xfree(master);
> - dev_iommu_priv_set(dev, NULL);
> - return ret;
> -}
> -
> -static int arm_smmu_dt_xlate(struct device *dev,
> - const struct dt_phandle_args *args)
> -{
> - int ret;
> - struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> -
> - ret = iommu_fwspec_add_ids(dev, args->args, 1);
> - if (ret)
> - return ret;
> -
>   if (dt_device_is_protected(dev_to_dt(dev))) {
>   dev_err(dev, "Already added to SMMUv3\n");
>   return -EEXIST;
> @@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev,
>   dev_name(fwspec->iommu_dev), fwspec->num_ids);
> 
>   return 0;
> +
> +err_free_master:
> + xfree(master);
> + dev_iommu_priv_set(dev, NULL);
> + return ret;
> +}
> +
> +static int arm_smmu_dt_xlate(struct device *dev,
> + const struct dt_phandle_args *args)
> +{
> + return iommu_fwspec_add_ids(dev, args->args, 1);
> }
> 
> /* Probing and initialisation functions */
> -- 
> 2.17.1
> 




Re: [for-4.15][PATCH v2 0/5] xen/iommu: Collection of bug fixes for IOMMU teardown

2021-02-17 Thread Julien Grall

Hi Ian,

On 09/02/2021 16:47, Ian Jackson wrote:

Julien Grall writes ("[for-4.15][PATCH v2 0/5] xen/iommu: Collection of bug fixes 
for IOMMU teardown"):

From: Julien Grall 

...

This series is a collection of bug fixes for the IOMMU teardown code.
All of them are candidates for 4.15 as they can either leak memory or
lead to host crash/host corruption.

This is sent directly on xen-devel because all the issues were either
introduced in 4.15 or happen in the domain creation code.


I think by current freeze rules this does not need a release-ack but
for the avoidance of doubt

Release-Acked-by: Ian Jackson 


Thanks!



assuming it's committed by the end of the week.


I saw you extended the freeze rules by a week. So I will assume that I 
have until the end of this week (19th February) to commit it.


Please let me know if I misunderstood the extension.

Cheers,

--
Julien Grall



Re: [for-4.15][PATCH v2 2/5] xen/iommu: Check if the IOMMU was initialized before tearing down

2021-02-17 Thread Julien Grall

Hi Paul,

On 09/02/2021 20:22, Paul Durrant wrote:

-Original Message-
From: Julien Grall 
Sent: 09 February 2021 15:28
To: xen-devel@lists.xenproject.org
Cc: hongy...@amazon.co.uk; i...@xenproject.org; Julien Grall 
; Jan Beulich
; Paul Durrant 
Subject: [for-4.15][PATCH v2 2/5] xen/iommu: Check if the IOMMU was initialized 
before tearing down

From: Julien Grall 

is_iommu_enabled() will return true even if the IOMMU has not been
initialized (e.g. the ops are not set).

In the case of an early failure in arch_domain_init(), the function
iommu_destroy_domain() will be called even if the IOMMU is not
initialized.

This will result in dereferencing the ops, which will be NULL, and hence
a host crash.

Fix the issue by checking that ops has been set before accessing it.

Fixes: 71e617a6b8f6 ("use is_iommu_enabled() where appropriate...")
Signed-off-by: Julien Grall 


Reviewed-by: Paul Durrant 


Thanks! Ian gave his Release-Acked-by so I will commit this patch now.

Cheers,

--
Julien Grall
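
For context, the shape of the fix the quoted commit message describes,
as a sketch only (the hunk itself isn't quoted in this sub-thread, and
the exact function body is assumed):

static void iommu_destroy_domain(struct domain *d)
{
    struct domain_iommu *hd = dom_iommu(d);

    /*
     * Bail if the IOMMU was never fully initialized for this domain,
     * e.g. after an early arch_domain_init() failure: the ops are
     * still NULL in that case and must not be dereferenced.
     */
    if ( !hd->platform_ops )
        return;

    /* ... existing teardown via hd->platform_ops ... */
}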



Re: [for-4.15][PATCH v2 1/5] xen/x86: p2m: Don't map the special pages in the IOMMU page-tables

2021-02-17 Thread Julien Grall




On 15/02/2021 13:14, Jan Beulich wrote:

On 15.02.2021 13:54, Julien Grall wrote:

On 15/02/2021 12:36, Jan Beulich wrote:

On 15.02.2021 12:38, Julien Grall wrote:

Given this series needs to go in 4.15 (we would introduce a 0-day
otherwise), could you clarify whether your patch [1] is intended to
replace this one in 4.15?


Yes, that or a cut down variant (simply moving the invocation of
set_mmio_p2m_entry()). All the more so since the controversy there
continued regarding the adjustment to p2m_get_iommu_flags(). I
did indicate there that I've dropped it for v2.


Do you have a link to v2? I would like to try with my series.


I didn't post it yet, as I didn't consider the v1 discussion
settled so far. The intermediate version I have at present is
below.


Thanks! I will drop patch #1 from my series and resend it.

Cheers,

--
Julien Grall



Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Julien Grall

Hi Jan,

On 17/02/2021 10:46, Jan Beulich wrote:

Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.

This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
- all x86 PV DomU-s, including driver domains.

Reported-by: Rahul Singh 
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1243,7 +1243,7 @@ map_grant_ref(
  goto undo_out;
  }
  
-need_iommu = gnttab_need_iommu_mapping(ld);

+need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);


AFAICT, the owner of the page may not always be rd. So do we want to 
check against the owner instead?


Cheers,

--
Julien Grall



Re: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request

2021-02-17 Thread Julien Grall




On 17/02/2021 10:02, Jan Beulich wrote:

On 02.02.2021 16:14, Jan Beulich wrote:

XENMEM_decrease_reservation isn't the only means by which pages can get
removed from a guest, yet all removals ought to be signaled to qemu. Put
setting of the flag into the central p2m_remove_page() underlying all
respective hypercalls as well as a few similar places, mainly in PoD
code.

Additionally there's no point sending the request for the local domain
when the domain acted upon is a different one. The latter domain's ioreq
server mapcaches need invalidating. We assume that domain to be paused
at the point the operation takes place, so sending the request in this
case happens from the hvm_do_resume() path, which as one of its first
steps calls handle_hvm_io_completion().

Even without the remote operation aspect a single domain-wide flag
doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
parallel. Each of them needs to issue an invalidation request in due
course, in particular because exiting to guest context should not happen
before the request was actually seen by (all) the emulator(s).

Signed-off-by: Jan Beulich 
---
v2: Preemption related adjustment split off. Make flag per-vCPU. More
 places to set the flag. Also handle acting on a remote domain.
 Re-base.


Can I get an Arm side ack (or otherwise) please?


This looks good to me.

Acked-by: Julien Grall 

Cheers,




Thanks, Jan


--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -759,10 +759,9 @@ static void p2m_free_entry(struct p2m_do
   * has failed (error case).
   * So, at worst, the spurious mapcache invalidation might be sent.
   */
-if ( (p2m->domain == current->domain) &&
-  domain_has_ioreq_server(p2m->domain) &&
-  p2m_is_ram(entry.p2m.type) )
-p2m->domain->mapcache_invalidate = true;
+if ( p2m_is_ram(entry.p2m.type) &&
+ domain_has_ioreq_server(p2m->domain) )
+ioreq_request_mapcache_invalidate(p2m->domain);
  #endif
  
  p2m->stats.mappings[level]--;

--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -1509,8 +1509,8 @@ static void do_trap_hypercall(struct cpu
   * Note that sending the invalidation request causes the vCPU to block
   * until all the IOREQ servers have acknowledged the invalidation.
   */
-if ( unlikely(curr->domain->mapcache_invalidate) &&
- test_and_clear_bool(curr->domain->mapcache_invalidate) )
+if ( unlikely(curr->mapcache_invalidate) &&
+ test_and_clear_bool(curr->mapcache_invalidate) )
  ioreq_signal_mapcache_invalidate();
  #endif
  }
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -32,7 +32,6 @@
  
  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)

  {
-const struct vcpu *curr = current;
  long rc;
  
  switch ( cmd & MEMOP_CMD_MASK )

@@ -42,14 +41,11 @@ static long hvm_memory_op(int cmd, XEN_G
  return -ENOSYS;
  }
  
-if ( !curr->hcall_compat )

+if ( !current->hcall_compat )
  rc = do_memory_op(cmd, arg);
  else
  rc = compat_memory_op(cmd, arg);
  
-if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )

-curr->domain->mapcache_invalidate = true;
-
  return rc;
  }
  
@@ -327,9 +323,11 @@ int hvm_hypercall(struct cpu_user_regs *
  
  HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
  
-if ( unlikely(currd->mapcache_invalidate) &&

- test_and_clear_bool(currd->mapcache_invalidate) )
+if ( unlikely(curr->mapcache_invalidate) )
+{
+curr->mapcache_invalidate = false;
  ioreq_signal_mapcache_invalidate();
+}
  
  return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;

  }
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include <xen/ioreq.h>
  #include 
  #include 
  #include 
@@ -815,6 +816,8 @@ p2m_remove_page(struct p2m_domain *p2m,
  }
  }
  
+ioreq_request_mapcache_invalidate(p2m->domain);

+
  return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
   p2m->default_access);
  }
@@ -1301,6 +1304,8 @@ static int set_typed_p2m_entry(struct do
  ASSERT(mfn_valid(mfn_add(omfn, i)));
  set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
  }
+
+ioreq_request_mapcache_invalidate(d);
  }
  
  P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn));

--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -20,6 +20,7 @@
   */
  
  #include 

+#include <xen/ioreq.h>
  #include 
  #include 
  #include 
@@ -647,6 +648,8 @@ p2m_pod_decrease_reservation(struct doma
  set_gpfn_from_mfn(mfn_x(mfn), INVALID_M2P_ENTRY);
  p2m_pod_cache_add(p2m, page, cur_order);
  
+ioreq_request_mapcache_invalidate(d);

+
  steal_for_cache =  ( 

[PATCH 3/3] gnttab: GTF_sub_page is a v2-only flag

2021-02-17 Thread Jan Beulich
Prior to its introduction, v1 entries weren't checked for this flag, and
the flag also has been meaningless for v1 entries. Therefore it also
shouldn't be checked. (The only consistent alternative would be to also
check for all currently undefined flags to be clear.)

Fixes: b545941b6638 ("Implement sub-page grant support")
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -762,13 +762,11 @@ static int _set_status_v1(const grant_en
   struct domain *rd,
   struct active_grant_entry *act,
   int readonly,
-  int mapflag,
-  domid_t  ldomid)
+  domid_t ldomid)
 {
 int rc = GNTST_okay;
 uint32_t *raw_shah = (uint32_t *)shah;
 union grant_combo scombo;
-uint16_t mask = GTF_type_mask;
 
 /*
  * We bound the number of times we retry CMPXCHG on memory locations that
@@ -780,11 +778,6 @@ static int _set_status_v1(const grant_en
  */
 int retries = 0;
 
-/* if this is a grant mapping operation we should ensure GTF_sub_page
-   is not set */
-if ( mapflag )
-mask |= GTF_sub_page;
-
 scombo.raw = ACCESS_ONCE(*raw_shah);
 
 /*
@@ -798,8 +791,9 @@ static int _set_status_v1(const grant_en
 union grant_combo prev, new;
 
 /* If not already pinned, check the grant domid and type. */
-if ( !act->pin && (((scombo.flags & mask) != GTF_permit_access) ||
-   (scombo.domid != ldomid)) )
+if ( !act->pin &&
+ (((scombo.flags & GTF_type_mask) != GTF_permit_access) ||
+  (scombo.domid != ldomid)) )
 PIN_FAIL(done, GNTST_general_error,
  "Bad flags (%x) or dom (%d); expected d%d\n",
  scombo.flags, scombo.domid, ldomid);
@@ -916,7 +910,7 @@ static int _set_status(const grant_entry
 {
 
 if ( evaluate_nospec(rgt_version == 1) )
-return _set_status_v1(shah, rd, act, readonly, mapflag, ldomid);
+return _set_status_v1(shah, rd, act, readonly, ldomid);
 else
 return _set_status_v2(shah, status, rd, act, readonly, mapflag, 
ldomid);
 }
--- a/xen/include/public/grant_table.h
+++ b/xen/include/public/grant_table.h
@@ -175,7 +175,7 @@ typedef struct grant_entry_v1 grant_entr
  * mappings of the grant [GST]
  *  GTF_sub_page: Grant access to only a subrange of the page.  @domid
  *will only be allowed to copy from the grant, and not
- *map it. [GST]
+ *map it. [GST, v2]
  */
 #define _GTF_readonly   (2)
 #define GTF_readonly(1U<<_GTF_readonly)




[PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant

2021-02-17 Thread Jan Beulich
Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.

This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
- all x86 PV DomU-s, including driver domains.

Reported-by: Rahul Singh 
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1243,7 +1243,7 @@ map_grant_ref(
 goto undo_out;
 }
 
-need_iommu = gnttab_need_iommu_mapping(ld);
+need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
 if ( need_iommu )
 {
 unsigned int kind;
@@ -1493,7 +1493,7 @@ unmap_common(
 if ( put_handle )
 put_maptrack_handle(lgt, op->handle);
 
-if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) )
+if ( rc == GNTST_okay && ld != rd && gnttab_need_iommu_mapping(ld) )
 {
 unsigned int kind;
 int err = 0;




[PATCH 1/3] gnttab: never permit mapping transitive grants

2021-02-17 Thread Jan Beulich
Transitive grants allow an intermediate domain I to grant a target
domain T access to a page which origin domain O did grant I access to.
As an implementation restriction, T is not allowed to map such a grant.
Enforcement of this restriction is currently attempted by marking active
entries resulting from transitive grants as is-sub-page; sub-page grants
for obvious reasons don't allow mapping. However, marking (and checking)
only active entries is insufficient, as a map attempt may also occur on
a grant not otherwise in use. When not presently in use (pin count zero)
the grant type itself needs checking. Otherwise T may be able to map an
unrelated page owned by I. This is because the "transitive" sub-
structure of the v2 union would end up being interpreted as "full_page"
sub-structure instead. The low 32 bits of the GFN used would match the
grant reference specified in I's transitive grant entry, while the upper
32 bits could be random (depending on how exactly I sets up its grant
table entries).
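
To see the aliasing, recall the v2 grant entry layout (abridged from
xen/include/public/grant_table.h; the sub_page variant is omitted):

union grant_entry_v2 {
    grant_entry_header_t hdr;          /* flags + domid */
    struct {
        grant_entry_header_t hdr;
        uint32_t pad0;
        uint64_t frame;                /* occupies bytes 8..15 */
    } full_page;
    struct {
        grant_entry_header_t hdr;
        domid_t trans_domid;
        uint16_t pad0;
        grant_ref_t gref;              /* occupies bytes 8..11 */
    } transitive;
    uint32_t __spacer[4];              /* pad to a power of two */
};

Misread as full_page, a transitive entry thus yields a frame whose low
32 bits are I's grant reference and whose upper 32 bits are whatever
happens to sit in bytes 12..15.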

Note that if one mapping already exists and the granting domain _then_
changes the grant to GTF_transitive (which the domain is not supposed to
do), the changed type will only be honored after the pin count has gone
back to zero. This is no different from e.g. GTF_readonly or
GTF_sub_page becoming set when a grant is already in use.

While adjusting the implementation, also adjust commentary in the public
header to better reflect reality.

Fixes: 3672ce675c93 ("Transitive grant support")
Signed-off-by: Jan Beulich 

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -851,9 +851,10 @@ static int _set_status_v2(const grant_en
 mask |= GTF_sub_page;
 
 /* If not already pinned, check the grant domid and type. */
-if ( !act->pin && ((((scombo.flags & mask) != GTF_permit_access) &&
-((scombo.flags & mask) != GTF_transitive)) ||
-   (scombo.domid != ldomid)) )
+if ( !act->pin &&
+ ((((scombo.flags & mask) != GTF_permit_access) &&
+   (mapflag || ((scombo.flags & mask) != GTF_transitive))) ||
+  (scombo.domid != ldomid)) )
 PIN_FAIL(done, GNTST_general_error,
  "Bad flags (%x) or dom (%d); expected d%d, flags %x\n",
  scombo.flags, scombo.domid, ldomid, mask);
@@ -879,7 +880,7 @@ static int _set_status_v2(const grant_en
 if ( !act->pin )
 {
 if ( (((scombo.flags & mask) != GTF_permit_access) &&
-  ((scombo.flags & mask) != GTF_transitive)) ||
+  (mapflag || ((scombo.flags & mask) != GTF_transitive))) ||
  (scombo.domid != ldomid) ||
  (!readonly && (scombo.flags & GTF_readonly)) )
 {
--- a/xen/include/public/grant_table.h
+++ b/xen/include/public/grant_table.h
@@ -166,11 +166,13 @@ typedef struct grant_entry_v1 grant_entr
 #define GTF_type_mask   (3U<<0)
 
 /*
- * Subflags for GTF_permit_access.
+ * Subflags for GTF_permit_access and GTF_transitive.
  *  GTF_readonly: Restrict @domid to read-only mappings and accesses. [GST]
  *  GTF_reading: Grant entry is currently mapped for reading by @domid. [XEN]
  *  GTF_writing: Grant entry is currently mapped for writing by @domid. [XEN]
- *  GTF_PAT, GTF_PWT, GTF_PCD: (x86) cache attribute flags for the grant [GST]
+ * Further subflags for GTF_permit_access only.
+ *  GTF_PAT, GTF_PWT, GTF_PCD: (x86) cache attribute flags to be used for
+ * mappings of the grant [GST]
  *  GTF_sub_page: Grant access to only a subrange of the page.  @domid
  *will only be allowed to copy from the grant, and not
  *map it. [GST]




[PATCH 0/3] gnttab: misc fixes

2021-02-17 Thread Jan Beulich
Patches 1 and 2 clearly are intended for 4.15; patch 3 is possibly
controversial (along the lines of similar relaxation proposed for
other (sub-)hypercalls), and hence unlikely to be a candidate as
well.

1: never permit mapping transitive grants
2: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
3: GTF_sub_page is a v2-only flag

Jan



[PATCH] xen/arm : smmuv3: Fix to handle multiple StreamIds per device.

2021-02-17 Thread Rahul Singh
The SMMUv3 driver does not handle multiple StreamIDs if the master device
supports more than one StreamID.

This bug was introduced when the driver was ported from Linux to XEN.
dt_device_set_protected(..) should be called from add_device(..) not
from the dt_xlate(..).

Move dt_device_set_protected(..) from dt_xlate(..) to add_device().

Signed-off-by: Rahul Singh 
---
This patch is a candidate for 4.15 as without this patch it is not possible to
assign multiple StreamIds to the same device when the device is protected behind
SMMUv3.
---
 xen/drivers/passthrough/arm/smmu-v3.c | 29 ++-
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c 
b/xen/drivers/passthrough/arm/smmu-v3.c
index 914cdc1cf4..53d150cdb6 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device 
*dev)
 */
arm_smmu_enable_pasid(master);
 
-   return 0;
-
-err_free_master:
-   xfree(master);
-   dev_iommu_priv_set(dev, NULL);
-   return ret;
-}
-
-static int arm_smmu_dt_xlate(struct device *dev,
-   const struct dt_phandle_args *args)
-{
-   int ret;
-   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-
-   ret = iommu_fwspec_add_ids(dev, args->args, 1);
-   if (ret)
-   return ret;
-
if (dt_device_is_protected(dev_to_dt(dev))) {
dev_err(dev, "Already added to SMMUv3\n");
return -EEXIST;
@@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev,
dev_name(fwspec->iommu_dev), fwspec->num_ids);
 
return 0;
+
+err_free_master:
+   xfree(master);
+   dev_iommu_priv_set(dev, NULL);
+   return ret;
+}
+
+static int arm_smmu_dt_xlate(struct device *dev,
+   const struct dt_phandle_args *args)
+{
+   return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
 /* Probing and initialisation functions */
-- 
2.17.1




Re: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request

2021-02-17 Thread Jan Beulich
On 02.02.2021 16:14, Jan Beulich wrote:
> XENMEM_decrease_reservation isn't the only means by which pages can get
> removed from a guest, yet all removals ought to be signaled to qemu. Put
> setting of the flag into the central p2m_remove_page() underlying all
> respective hypercalls as well as a few similar places, mainly in PoD
> code.
> 
> Additionally there's no point sending the request for the local domain
> when the domain acted upon is a different one. The latter domain's ioreq
> server mapcaches need invalidating. We assume that domain to be paused
> at the point the operation takes place, so sending the request in this
> case happens from the hvm_do_resume() path, which as one of its first
> steps calls handle_hvm_io_completion().
> 
> Even without the remote operation aspect a single domain-wide flag
> doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
> parallel. Each of them needs to issue an invalidation request in due
> course, in particular because exiting to guest context should not happen
> before the request was actually seen by (all) the emulator(s).
> 
> Signed-off-by: Jan Beulich 
> ---
> v2: Preemption related adjustment split off. Make flag per-vCPU. More
> places to set the flag. Also handle acting on a remote domain.
> Re-base.

Can I get an Arm side ack (or otherwise) please?

Thanks, Jan

> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -759,10 +759,9 @@ static void p2m_free_entry(struct p2m_do
>   * has failed (error case).
>   * So, at worst, the spurious mapcache invalidation might be sent.
>   */
> -if ( (p2m->domain == current->domain) &&
> -  domain_has_ioreq_server(p2m->domain) &&
> -  p2m_is_ram(entry.p2m.type) )
> -p2m->domain->mapcache_invalidate = true;
> +if ( p2m_is_ram(entry.p2m.type) &&
> + domain_has_ioreq_server(p2m->domain) )
> +ioreq_request_mapcache_invalidate(p2m->domain);
>  #endif
>  
>  p2m->stats.mappings[level]--;
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1509,8 +1509,8 @@ static void do_trap_hypercall(struct cpu
>   * Note that sending the invalidation request causes the vCPU to block
>   * until all the IOREQ servers have acknowledged the invalidation.
>   */
> -if ( unlikely(curr->domain->mapcache_invalidate) &&
> - test_and_clear_bool(curr->domain->mapcache_invalidate) )
> +if ( unlikely(curr->mapcache_invalidate) &&
> + test_and_clear_bool(curr->mapcache_invalidate) )
>  ioreq_signal_mapcache_invalidate();
>  #endif
>  }
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -32,7 +32,6 @@
>  
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
> -const struct vcpu *curr = current;
>  long rc;
>  
>  switch ( cmd & MEMOP_CMD_MASK )
> @@ -42,14 +41,11 @@ static long hvm_memory_op(int cmd, XEN_G
>  return -ENOSYS;
>  }
>  
> -if ( !curr->hcall_compat )
> +if ( !current->hcall_compat )
>  rc = do_memory_op(cmd, arg);
>  else
>  rc = compat_memory_op(cmd, arg);
>  
> -if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
> -curr->domain->mapcache_invalidate = true;
> -
>  return rc;
>  }
>  
> @@ -327,9 +323,11 @@ int hvm_hypercall(struct cpu_user_regs *
>  
>  HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
>  
> -if ( unlikely(currd->mapcache_invalidate) &&
> - test_and_clear_bool(currd->mapcache_invalidate) )
> +if ( unlikely(curr->mapcache_invalidate) )
> +{
> +curr->mapcache_invalidate = false;
>  ioreq_signal_mapcache_invalidate();
> +}
>  
>  return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
>  }
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include <xen/ioreq.h>
>  #include 
>  #include 
>  #include 
> @@ -815,6 +816,8 @@ p2m_remove_page(struct p2m_domain *p2m,
>  }
>  }
>  
> +ioreq_request_mapcache_invalidate(p2m->domain);
> +
>  return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
>   p2m->default_access);
>  }
> @@ -1301,6 +1304,8 @@ static int set_typed_p2m_entry(struct do
>  ASSERT(mfn_valid(mfn_add(omfn, i)));
>  set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
>  }
> +
> +ioreq_request_mapcache_invalidate(d);
>  }
>  
>  P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn));
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -20,6 +20,7 @@
>   */
>  
>  #include 
> +#include <xen/ioreq.h>
>  #include 
>  #include 
>  #include 
> @@ -647,6 +648,8 @@ p2m_pod_decrease_reservation(struct doma
>  set_gpfn_from_mfn(mfn_x(mfn), INVALID_M2P_ENTRY);
>  p2m_pod_cache_add(p2m, page, cur_order);
>  
> +

[xen-unstable-coverity test] 159441: all pass - PUSHED

2021-02-17 Thread osstest service owner
flight 159441 xen-unstable-coverity real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159441/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 xen  3b1cc15f1931ba56d0ee256fe9bfe65509733b27
baseline version:
 xen  04085ec1ac05a362812e9b0c6b5a8713d7dc88ad

Last test of basis   159344  2021-02-14 09:18:27 Z3 days
Testing same since   159441  2021-02-17 09:20:56 Z0 days1 attempts


People who touched revisions under test:
  Jan Beulich 
  Julien Grall 

jobs:
 coverity-amd64   pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   04085ec1ac..3b1cc15f19  3b1cc15f1931ba56d0ee256fe9bfe65509733b27 -> 
coverity-tested/smoke



Re: [PATCH] firmware: don't build hvmloader if it is not needed

2021-02-17 Thread Jan Beulich
On 16.02.2021 19:31, Stefano Stabellini wrote:
> On Mon, 15 Feb 2021, Jan Beulich wrote:
>> On 13.02.2021 14:50, Marek Marczykowski-Górecki wrote:
>>> On Fri, Feb 12, 2021 at 06:05:40PM -0800, Stefano Stabellini wrote:
 If rombios, seabios and ovmf are all disabled, don't attempt to build
 hvmloader.
>>>
>>> What if you choose to not build any of rombios, seabios, ovmf, but use
>>> system one instead? Wouldn't that exclude hvmloader too?
>>
>> Even further - one can disable all firmware and have every guest
>> config explicitly specify the firmware to use, afaict.
> 
> I didn't realize there was a valid reason for wanting to build hvmloader
> without rombios, seabios, and ovmf.
> 
> 
>>> This heuristic seems like a bit too much, maybe instead add an explicit
>>> option to skip hvmloader?
>>
>> +1 (If making this configurable is needed at all - is having
>> hvmloader without needing it really a problem?)
> 
> The hvmloader build fails on Alpine Linux x86:
> 
> https://gitlab.com/xen-project/xen/-/jobs/1033722465
> 
> 
> 
> I admit I was just trying to find the fastest way to make those Alpine
> Linux builds succeed to unblock patchew: although the Alpine Linux
> builds are marked as "allow_failure: true" in gitlab-ci, patchew will
> still report the whole battery of tests as "failure". As a consequence
> the notification emails from patchew after a build of a contributed
> patch series always say "job failed" today, making it kind of useless.
> See attached.
> 
> I would love if somebody else took over this fix as I am doing this
> after hours, but if you have a simple suggestion on how to fix the
> Alpine Linux hvmloader builds, or skip the build when appropriate, I can
> try to follow up.

There is an issue with the definition of uint64_t there. Initial
errors like

hvmloader.c: In function 'init_vm86_tss':
hvmloader.c:202:39: error: left shift count >= width of type 
[-Werror=shift-count-overflow]
  202 |   ((uint64_t)TSS_SIZE << 32) | virt_to_phys(tss));

already hint at this, but then

util.c: In function 'get_cpu_mhz':
util.c:824:15: error: conversion from 'long long unsigned int' to 'uint64_t' 
{aka 'long unsigned int'} changes value from '429496729600' to '0' 
[-Werror=overflow]
  824 | cpu_khz = 100ull << 32;

is quite explicit: "aka 'long unsigned int'"? This is a 32-bit
environment, after all. I suspect the build picks up headers
(stdint.h here in particular) intended for 64-bit builds only.
Can you check whether "gcc -m32" properly sets include paths
_different_ from those plain "gcc" sets, if the headers found in
the latter case aren't suitable for the former? Or alternatively
is the Alpine build environment set up incorrectly, in that it
lacks 32-bit devel packages?

As an aside I don't think it's really a good idea to have
hvmloader depend on any external headers. Just like the
hypervisor it's a free-standing binary, and hence ought to be
free of any dependencies on the build/host environment.

Jan



Re: Ping: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request

2021-02-17 Thread Paul Durrant

On 17/02/2021 08:30, Jan Beulich wrote:

Paul (or others), thoughts?

On 04.02.2021 12:24, Jan Beulich wrote:

On 04.02.2021 10:26, Paul Durrant wrote:

From: Jan Beulich 
Sent: 02 February 2021 15:15

XENMEM_decrease_reservation isn't the only means by which pages can get
removed from a guest, yet all removals ought to be signaled to qemu. Put
setting of the flag into the central p2m_remove_page() underlying all
respective hypercalls as well as a few similar places, mainly in PoD
code.

Additionally there's no point sending the request for the local domain
when the domain acted upon is a different one. The latter domain's ioreq
server mapcaches need invalidating. We assume that domain to be paused
at the point the operation takes place, so sending the request in this
case happens from the hvm_do_resume() path, which as one of its first
steps calls handle_hvm_io_completion().

Even without the remote operation aspect a single domain-wide flag
doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
parallel. Each of them needs to issue an invalidation request in due
course, in particular because exiting to guest context should not happen
before the request was actually seen by (all) the emulator(s).

Signed-off-by: Jan Beulich 
---
v2: Preemption related adjustment split off. Make flag per-vCPU. More
 places to set the flag. Also handle acting on a remote domain.
 Re-base.


I'm wondering if a per-vcpu flag is overkill actually. We just need
to make sure that we don't miss sending an invalidation where
multiple vcpus are in play. The mapcache in the emulator is global
so issuing an invalidate for every vcpu is going to cause an
unnecessary storm of ioreqs, isn't it?


The only time a truly unnecessary storm would occur is when for
an already running guest mapcache invalidation gets triggered
by a remote domain. This should be a pretty rare event, so I
think the storm in this case ought to be tolerable.


Could we get away with the per-domain atomic counter?


Possible, but quite involved afaict: The potential storm above
is the price to pay for the simplicity of the model. It is
important to note that while we don't need all of the vCPU-s
to send these ioreqs, we need all of them to wait for the
request(s) to be acked. And this waiting is what we get "for
free" using the approach here, whereas we'd need to introduce
new logic for this with an atomic counter (afaict at least).

Jan





Ok, let's take the patch as-is then.

Reviewed-by: Paul Durrant 
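
The helper's definition is not part of the quoted hunks; a sketch
consistent with the description (body assumed): the local domain flags
just the current vCPU, while a remote domain gets all of its vCPUs
flagged, to be acted upon from the hvm_do_resume() path.

void ioreq_request_mapcache_invalidate(const struct domain *d)
{
    struct vcpu *v = current;

    if ( d == v->domain )
        v->mapcache_invalidate = true;    /* local: this vCPU only */
    else if ( d->creation_finished )
        for_each_vcpu ( d, v )            /* remote: flag them all */
            v->mapcache_invalidate = true;
}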




Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu

2021-02-17 Thread Julien Grall




On 17/02/2021 02:00, Stefano Stabellini wrote:

Hi all,

Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
translate addresses for DMA operations in Dom0. Specifically,
swiotlb-xen is used to translate the address of a foreign page (a page
belonging to a domU) mapped into Dom0 before using it for DMA.

This is important because although Dom0 is 1:1 mapped, DomUs are not. On
systems without an IOMMU swiotlb-xen enables PV drivers to work as long
as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
ends up using the MFN, rather than the GFN.


On systems with an IOMMU, this is not necessary: when a foreign page is
mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
is established for both MMU and SMMU. Dom0 could safely use the GFN
address (instead of the MFN) for DMA operations and they would work. It
would be more efficient than using swiotlb-xen.

If you recall my presentation from Xen Summit 2020, Xilinx is working on
cache coloring. With cache coloring, no domain is 1:1 mapped, not even
Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
work as intended.


The suggested solution for both these issues is to add a new feature
flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
swiotlb-xen because IOMMU translations are available for Dom0. If
XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
initialization. I have tested this scheme with and without cache
coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
works as expected: DMA operations succeed.
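
A minimal sketch of the proposed Linux-side check; XENFEAT_ARM_dom0_iommu
is the flag proposed in this mail and exists in no header yet, and the
function itself is illustrative rather than actual swiotlb-xen code:

#include <linux/init.h>
#include <xen/features.h>
#include <xen/xen.h>

static bool __init xen_swiotlb_required(void)
{
	/* IOMMU translations cover Dom0's DMA: plain GFNs are safe. */
	if (xen_feature(XENFEAT_ARM_dom0_iommu))
		return false;

	/* Otherwise keep today's behaviour: Dom0 needs swiotlb-xen. */
	return xen_initial_domain();
}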


What about systems where an IOMMU is present but not all devices are
protected?

There is no way for Xen to know which devices are protected and which
ones are not: devices that do not have the "iommus" property could or
could not be DMA masters.

Perhaps Xen could populate a whitelist of devices protected by the IOMMU
based on the "iommus" property. It would require some added complexity
in Xen and especially in the swiotlb-xen driver in Linux to use it,
which is not ideal.


You are trading a bit more complexity in Xen and Linux against a user who 
may not be able to use the hypervisor on his/her platform without a 
quirk in Xen (see more below).



However, this approach would not work for cache
coloring, where dom0 is not 1:1 mapped, so swiotlb-xen should not be
used either way.


Not all Dom0 Linux kernels will be able to work with cache colouring. 
So you will need a way for the kernel to say "Hey, I can avoid using 
swiotlb".




For these reasons, I would like to propose a single flag
XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
DMA translations. In situations where a DMA master is not SMMU
protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
platform where an IOMMU is present and protects most DMA masters but it
is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
not be set (because PV block is not going to work without swiotlb-xen.)
This also means that cache coloring won't be usable on such a system (at
least not usable with the MMC controller so the system integrator should
pay special care to setup the system).

It is worth noting that if we wanted to extend the interface to add a
list of protected devices in the future, it would still be possible. It
would be compatible with XENFEAT_ARM_dom0_iommu.


I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be 
cleared and instead the device-tree list would be used. Is that correct?





How to set XENFEAT_ARM_dom0_iommu?

We could set XENFEAT_ARM_dom0_iommu automatically when
is_iommu_enabled(d) for Dom0. We could also have a platform specific
(xen/arch/arm/platforms/) override so that a specific platform can
disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
users, it would also be useful to be able to override it via a Xen
command line parameter.

Platform quirks should be limited to a small set of platforms.

In this case, this would not be only per-platform but also per-firmware 
table as a developer can decide to remove/add IOMMU nodes in the DT at 
any time.


In addition to that, it means we are introducing a regression for those 
users as Xen 4.14 would have worked on their platform but not anymore. 
They would need to go through all the nodes and find out which one is 
not protected.


This is a bit of a daunting task and we are going to end up having a lot 
of per-platform quirk in Xen.


So this approach selecting the flag is a no-go for me. FAOD, the 
inverted idea (i.e. only setting XENFEAT_ARM_dom0_iommu per-platform) is 
a no go as well.


I don't have a good idea of how to set the flag automatically. My 
requirement is that newer Xen should continue to work on all supported 
platforms without any additional per-platform effort.


Cheers,

--
Julien Grall



Ping: [PATCH] x86emul: de-duplicate scatters to the same linear address

2021-02-17 Thread Jan Beulich
On 05.02.2021 12:28, Jan Beulich wrote:
> On 05.02.2021 11:41, Andrew Cooper wrote:
>> On 10/11/2020 13:26, Jan Beulich wrote:
>>> The SDM specifically allows for earlier writes to fully overlapping
>>> ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
>>> would crash it if varying data was written to the same address. Detect
>>> overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
>>> be quite a bit more difficult.
>>
>> Are you saying that there is currently a bug if a guest does encode such
>> an instruction, and we emulate it?
> 
> That is my take on it, yes.
> 
>>> Note that due to cache slot use being linear address based, there's no
>>> similar issue with multiple writes to the same physical address (mapped
>>> through different linear addresses).
>>>
>>> Since this requires an adjustment to the EVEX Disp8 scaling test,
>>> correct a comment there at the same time.
>>>
>>> Signed-off-by: Jan Beulich 
>>> ---
>>> TBD: The SDM isn't entirely unambiguous about the faulting behavior in
>>>  this case: If a fault would need delivering on the earlier slot
>>>  despite the write getting squashed, we'd have to call ops->write()
>>>  with size set to zero for the earlier write(s). However,
>>>  hvm/emulate.c's handling of zero-byte accesses extends only to the
>>>  virtual-to-linear address conversions (and raising of involved
>>>  faults), so in order to also observe #PF changes to that logic
>>>  would then also be needed. Can we live with a possible misbehavior
>>>  here?
>>
>> Do you have a chapter/section reference?
> 
> The instruction pages. They say in particular
> 
> "If two or more destination indices completely overlap, the “earlier”
>  write(s) may be skipped."
> 
> and
> 
> "Faults are delivered in a right-to-left manner. That is, if a fault
>  is triggered by an element and delivered ..."
> 
> To me this may or may not mean the skipping of indices includes the
> skipping of faults (which a later element then would raise anyway).

Does the above address your concerns / questions? If not, what else
do I need to provide?

Jan
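
For reference, the de-duplication idea in miniature; the actual x86emul
patch is not quoted in this ping, so the loop below is an illustrative
fragment and all names in it are assumptions:

    /* Emulating a scatter of n elements: drop any store that a later
     * element fully overlaps, as the SDM permits. */
    for ( i = 0; i < n; ++i )
    {
        unsigned int j;

        for ( j = i + 1; j < n; ++j )
            if ( linear_addr[j] == linear_addr[i] )
                break;          /* a later element overwrites this one */

        if ( j < n )
            continue;           /* skip the earlier, overlapped write */

        rc = ops->write(seg, linear_addr[i], &data[i], elem_bytes, ctxt);
        if ( rc != X86EMUL_OKAY )
            return rc;
    }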



Ping: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request

2021-02-17 Thread Jan Beulich
Paul (or others), thoughts?

On 04.02.2021 12:24, Jan Beulich wrote:
> On 04.02.2021 10:26, Paul Durrant wrote:
>>> From: Jan Beulich 
>>> Sent: 02 February 2021 15:15
>>>
>>> XENMEM_decrease_reservation isn't the only means by which pages can get
>>> removed from a guest, yet all removals ought to be signaled to qemu. Put
>>> setting of the flag into the central p2m_remove_page() underlying all
>>> respective hypercalls as well as a few similar places, mainly in PoD
>>> code.
>>>
>>> Additionally there's no point sending the request for the local domain
>>> when the domain acted upon is a different one. The latter domain's ioreq
>>> server mapcaches need invalidating. We assume that domain to be paused
>>> at the point the operation takes place, so sending the request in this
>>> case happens from the hvm_do_resume() path, which as one of its first
>>> steps calls handle_hvm_io_completion().
>>>
>>> Even without the remote operation aspect a single domain-wide flag
>>> doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
>>> parallel. Each of them needs to issue an invalidation request in due
>>> course, in particular because exiting to guest context should not happen
>>> before the request was actually seen by (all) the emulator(s).
>>>
>>> Signed-off-by: Jan Beulich 
>>> ---
>>> v2: Preemption related adjustment split off. Make flag per-vCPU. More
>>> places to set the flag. Also handle acting on a remote domain.
>>> Re-base.
>>
>> I'm wondering if a per-vcpu flag is overkill actually. We just need
>> to make sure that we don't miss sending an invalidation where
>> multiple vcpus are in play. The mapcache in the emulator is global
>> so issuing an invalidate for every vcpu is going to cause an
>> unnecessary storm of ioreqs, isn't it?
> 
> The only time a truly unnecessary storm would occur is when for
> an already running guest mapcache invalidation gets triggered
> by a remote domain. This should be a pretty rare event, so I
> think the storm in this case ought to be tolerable.
> 
>> Could we get away with the per-domain atomic counter?
> 
> Possible, but quite involved afaict: The potential storm above
> is the price to pay for the simplicity of the model. It is
> important to note that while we don't need all of the vCPU-s
> to send these ioreqs, we need all of them to wait for the
> request(s) to be acked. And this waiting is what we get "for
> free" using the approach here, whereas we'd need to introduce
> new logic for this with an atomic counter (afaict at least).
> 
> Jan
> 




Re: Linux DomU freezes and dies under heavy memory shuffling

2021-02-17 Thread Jürgen Groß

On 17.02.21 09:12, Roman Shaposhnik wrote:

Hi Jürgen, thanks for taking a look at this. A few comments below:

On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß  wrote:


On 16.02.21 21:34, Stefano Stabellini wrote:

+ x86 maintainers

It looks like the tlbflush is getting stuck?


I have seen this case multiple times on customer systems now, but
reproducing it reliably seems to be very hard.


It is reliably reproducible under my workload but it takes a long time
(~3 days of the workload running in the lab).


This is by far the best reproduction rate I have seen up to now.

The next best reproducer seems to be a huge installation with several
hundred hosts and thousands of VMs with about 1 crash each week.




I suspected fifo events to be to blame, but just yesterday I've been
informed of another case with fifo events disabled in the guest.

One common pattern seems to be that up to now I have seen this effect
only on systems with Intel Gold cpus. Can it be confirmed to be true
in this case, too?


I am pretty sure mine isn't -- I can get you full CPU specs if that's useful.


Just the output of "grep model /proc/cpuinfo" should be enough.




In case anybody has a reproducer (either in a guest or dom0) with a
setup where a diagnostic kernel can be used, I'd be _very_ interested!


I can easily add things to Dom0 and DomU. Whether that will disrupt the
experiment is, of course, another matter. Still please let me know what
would be helpful to do.


Is there a chance to switch to an upstream kernel in the guest? I'd like
to add some diagnostic code to the kernel and creating the patches will
be easier this way.


Juergen




Ping: [PATCH v3 0/4] x86/time: calibration rendezvous adjustments

2021-02-17 Thread Jan Beulich
Ian?

On 09.02.2021 14:02, Jan Beulich wrote:
> On 09.02.2021 13:53, Jan Beulich wrote:
>> The middle two patches are meant to address a regression reported on
>> the list under "Problems with APIC on versions 4.9 and later (4.8
>> works)". In the course of analyzing output from a debugging patch I
>> ran into another anomaly again, which I thought I should finally try
>> to address. Hence patch 1. Patch 4 is new in v3 and RFC for now.
> 
> Of course this is the kind of change I'd prefer doing early in a
> release cycle. I don't think there are severe risks from patch 1, but
> I'm not going to claim patches 2 and 3 are risk free. They fix booting
> Xen on a system left in rather awkward state by the firmware. And
> they shouldn't affect well behaved modern systems at all (due to
> those using a different rendezvous function). While we've been having
> this issue for years, I also consider this set a backporting
> candidate. Hence I can see reasons pro and con inclusion in 4.15.
> 
> Jan
> 
>> 1: change initiation of the calibration timer
>> 2: adjust time recording time_calibration_tsc_rendezvous()
>> 3: don't move TSC backwards in time_calibration_tsc_rendezvous()
>> 4: re-arrange struct calibration_rendezvous
>>
>> Jan
>>
> 




[PATCH v2 8/8] x86/PV: use get_unsafe() instead of copy_from_unsafe()

2021-02-17 Thread Jan Beulich
The former expands to a single (memory accessing) insn, which the latter
does not guarantee. Yet we'd prefer to read consistent PTEs rather than
risking a split read racing with an update done elsewhere.

Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -41,9 +41,7 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
 return NULL;
 
 /* Find this l1e and its enclosing l1mfn in the linear map. */
-if ( copy_from_unsafe(&l2e,
-  &__linear_l2_table[l2_linear_offset(linear)],
-  sizeof(l2_pgentry_t)) )
+if ( get_unsafe(l2e, &__linear_l2_table[l2_linear_offset(linear)]) )
 return NULL;
 
 /* Check flags that it will be safe to read the l1e. */
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -22,9 +22,7 @@ static inline l1_pgentry_t guest_get_eff
 toggle_guest_pt(curr);
 
 if ( unlikely(!__addr_ok(linear)) ||
- copy_from_unsafe(&l1e,
-  &__linear_l1_table[l1_linear_offset(linear)],
-  sizeof(l1_pgentry_t)) )
+ get_unsafe(l1e, &__linear_l1_table[l1_linear_offset(linear)]) )
 l1e = l1e_empty();
 
 if ( user_mode )




[PATCH v2 7/8] x86: move stac()/clac() from {get,put}_unsafe_asm() ...

2021-02-17 Thread Jan Beulich
... to {get,put}_unsafe_size(). There's no need to have the macros
expanded once per case label in the latter. This also makes the former
well-formed single statements again. No change in generated code.

Signed-off-by: Jan Beulich 

--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -154,7 +154,6 @@ struct __large_struct { unsigned long bu
  * aliasing issues.
  */
 #define put_unsafe_asm(x, addr, GUARD, err, itype, rtype, ltype, errret) \
-   stac(); \
__asm__ __volatile__(   \
GUARD(  \
"   guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -169,11 +168,9 @@ struct __large_struct { unsigned long bu
: [ret] "+r" (err), [ptr] "=" (dummy_)\
  GUARD(, [scr1] "=" (dummy_), [scr2] "=" (dummy_)) \
: [val] ltype (x), "m" (__m(addr)), \
- "[ptr]" (addr), [errno] "i" (errret));\
-   clac()
+ "[ptr]" (addr), [errno] "i" (errret))
 
 #define get_unsafe_asm(x, addr, GUARD, err, rtype, ltype, errret)  \
-   stac(); \
__asm__ __volatile__(   \
GUARD(  \
"   guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -190,12 +187,12 @@ struct __large_struct { unsigned long bu
  [ptr] "=" (dummy_)  \
  GUARD(, [scr1] "=" (dummy_), [scr2] "=" (dummy_)) \
: "m" (__m(addr)), "[ptr]" (addr),  \
- [errno] "i" (errret));\
-   clac()
+ [errno] "i" (errret))
 
 #define put_unsafe_size(x, ptr, size, grd, retval, errret) \
 do {   \
 retval = 0;\
+stac();\
 switch ( size )\
 {  \
 long dummy_;   \
@@ -213,6 +210,7 @@ do {
 break; \
 default: __put_user_bad(); \
 }  \
+clac();\
 } while ( false )
 
 #define put_guest_size(x, ptr, size, retval, errret) \
@@ -221,6 +219,7 @@ do {
 #define get_unsafe_size(x, ptr, size, grd, retval, errret) \
 do {   \
 retval = 0;\
+stac();\
 switch ( size )\
 {  \
 long dummy_;   \
@@ -230,6 +229,7 @@ do {
 case 8: get_unsafe_asm(x, ptr, grd, retval,  "", "=r", errret); break; \
 default: __get_user_bad(); \
 }  \
+clac();\
 } while ( false )
 
 #define get_guest_size(x, ptr, size, retval, errret)   \



[PATCH v2 6/8] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()

2021-02-17 Thread Jan Beulich
Bring them (back) in line with __copy_{from,to}_guest_pv(). Since it
falls in the same group, also convert clear_user(). Instead of adjusting
__raw_clear_guest(), drop it - it's unused and would require a non-
checking __clear_guest_pv() which we don't have.

Add previously missing __user at some call sites and in the function
declarations.

Signed-off-by: Jan Beulich 
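
For reference, the renamed functions keep the bytes-not-copied return
convention (zero on success); a sketch, with val and guest_va being
hypothetical:

    unsigned int left = copy_from_guest_pv(&val, (void __user *)guest_va,
                                           sizeof(val));

    if ( left )
    {
        /* The first faulting byte is at guest_va + sizeof(val) - left,
         * which is what the call sites below feed into
         * pv_inject_page_fault(). */
    }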

--- a/xen/arch/x86/pv/emul-inv-op.c
+++ b/xen/arch/x86/pv/emul-inv-op.c
@@ -33,7 +33,7 @@ static int emulate_forced_invalid_op(str
 eip = regs->rip;
 
 /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+if ( (rc = copy_from_guest_pv(sig, (char __user *)eip, sizeof(sig))) != 0 )
 {
 pv_inject_page_fault(0, eip + sizeof(sig) - rc);
 return EXCRET_fault_fixed;
@@ -43,7 +43,8 @@ static int emulate_forced_invalid_op(str
 eip += sizeof(sig);
 
 /* We only emulate CPUID. */
-if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+if ( (rc = copy_from_guest_pv(instr, (char __user *)eip,
+  sizeof(instr))) != 0 )
 {
 pv_inject_page_fault(0, eip + sizeof(instr) - rc);
 return EXCRET_fault_fixed;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -54,8 +54,8 @@ unsigned long do_iret(void)
 struct iret_context iret_saved;
 struct vcpu *v = current;
 
-if ( unlikely(copy_from_user(&iret_saved, (void *)regs->rsp,
- sizeof(iret_saved))) )
+if ( unlikely(copy_from_guest_pv(&iret_saved, (void __user *)regs->rsp,
+ sizeof(iret_saved))) )
 {
 gprintk(XENLOG_ERR,
 "Fault while reading IRET context from guest stack\n");
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -90,7 +90,8 @@ static int ptwr_emulated_update(unsigned
 
 /* Align address; read full word. */
 addr &= ~(sizeof(full) - 1);
-if ( (rc = copy_from_user(&full, (void *)addr, sizeof(full))) != 0 )
+if ( (rc = copy_from_guest_pv(&full, (void __user *)addr,
+  sizeof(full))) != 0 )
 {
 x86_emul_pagefault(0, /* Read fault. */
addr + sizeof(full) - rc,
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -109,19 +109,17 @@ unsigned int copy_from_guest_ll(void *to
 #if GUARD(1) + 0
 
 /**
- * copy_to_user: - Copy a block of data into user space.
- * @to:   Destination address, in user space.
- * @from: Source address, in kernel space.
+ * copy_to_guest_pv: - Copy a block of data into guest space.
+ * @to:   Destination address, in guest space.
+ * @from: Source address, in hypervisor space.
  * @n:Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from kernel space to user space.
+ * Copy data from hypervisor space to guest space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
  */
-unsigned copy_to_user(void __user *to, const void *from, unsigned n)
+unsigned int copy_to_guest_pv(void __user *to, const void *from, unsigned int n)
 {
 if ( access_ok(to, n) )
 n = __copy_to_guest_pv(to, from, n);
@@ -129,16 +127,16 @@ unsigned copy_to_user(void __user *to, c
 }
 
 /**
- * clear_user: - Zero a block of memory in user space.
- * @to:   Destination address, in user space.
+ * clear_guest_pv: - Zero a block of memory in guest space.
+ * @to:   Destination address, in guest space.
  * @n:Number of bytes to zero.
  *
- * Zero a block of memory in user space.
+ * Zero a block of memory in guest space.
  *
  * Returns number of bytes that could not be cleared.
  * On success, this will be zero.
  */
-unsigned clear_user(void __user *to, unsigned n)
+unsigned int clear_guest_pv(void __user *to, unsigned int n)
 {
 if ( access_ok(to, n) )
 {
@@ -168,14 +166,12 @@ unsigned clear_user(void __user *to, uns
 }
 
 /**
- * copy_from_user: - Copy a block of data from user space.
- * @to:   Destination address, in kernel space.
- * @from: Source address, in user space.
+ * copy_from_guest_pv: - Copy a block of data from guest space.
+ * @to:   Destination address, in hypervisor space.
+ * @from: Source address, in guest space.
  * @n:Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from user space to kernel space.
+ * Copy data from guest space to hypervisor space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
@@ -183,7 +179,8 @@ unsigned clear_user(void __user *to, uns
  * If some data could not be copied, this function will pad the copied
  * data to the requested size using zero bytes.
  */
-unsigned copy_from_user(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_pv(void *to, const void __user *from,
+                                unsigned int n)

[PATCH v2 5/8] x86/gdbsx: convert "user" to "guest" accesses

2021-02-17 Thread Jan Beulich
Using copy_{from,to}_user(), this code assumed it would only be called
for PV guests. Use copy_{from,to}_guest() instead, transforming the
incoming structure field into a guest handle (the field should really
have been one in the first place). Also do not transform the debuggee
address into a pointer.

Signed-off-by: Jan Beulich 
---
v2: Re-base (bug fix side effect was taken care of already).
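
The shape of the conversion in isolation (a sketch; iop, va and pagecnt
are as in the diff below):

    XEN_GUEST_HANDLE_PARAM(void) buf = guest_handle_from_ptr(iop->uva, void);

    copy_to_guest(buf, va, pagecnt);        /* hypervisor -> guest copy */
    guest_handle_add_offset(buf, pagecnt);  /* advance within the buffer */

i.e. the guest virtual address travels as a handle with its own access
checks, instead of being cast to a raw pointer and advanced by hand.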

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -108,12 +108,11 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
 }
 
 /* Returns: number of bytes remaining to be copied */
-static unsigned int dbg_rw_guest_mem(struct domain *dp, void * __user gaddr,
- void * __user buf, unsigned int len,
- bool toaddr, uint64_t pgd3)
+static unsigned int dbg_rw_guest_mem(struct domain *dp, unsigned long addr,
+ XEN_GUEST_HANDLE_PARAM(void) buf,
+ unsigned int len, bool toaddr,
+ uint64_t pgd3)
 {
-unsigned long addr = (unsigned long)gaddr;
-
 while ( len > 0 )
 {
 char *va;
@@ -134,20 +133,18 @@ static unsigned int dbg_rw_guest_mem(str
 
 if ( toaddr )
 {
-copy_from_user(va, buf, pagecnt);/* va = buf */
+copy_from_guest(va, buf, pagecnt);
 paging_mark_dirty(dp, mfn);
 }
 else
-{
-copy_to_user(buf, va, pagecnt);/* buf = va */
-}
+copy_to_guest(buf, va, pagecnt);
 
 unmap_domain_page(va);
 if ( !gfn_eq(gfn, INVALID_GFN) )
 put_gfn(dp, gfn_x(gfn));
 
 addr += pagecnt;
-buf += pagecnt;
+guest_handle_add_offset(buf, pagecnt);
 len -= pagecnt;
 }
 
@@ -161,7 +158,7 @@ static unsigned int dbg_rw_guest_mem(str
  * pgd3: value of init_mm.pgd[3] in guest. see above.
  * Returns: number of bytes remaining to be copied.
  */
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
 unsigned int len, domid_t domid, bool toaddr,
 uint64_t pgd3)
 {
@@ -170,7 +167,7 @@ unsigned int dbg_rw_mem(void * __user ad
 if ( d )
 {
 if ( !d->is_dying )
-len = dbg_rw_guest_mem(d, addr, buf, len, toaddr, pgd3);
+len = dbg_rw_guest_mem(d, gva, buf, len, toaddr, pgd3);
 rcu_unlock_domain(d);
 }
 
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -40,10 +40,8 @@
 #ifdef CONFIG_GDBSX
static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
 {
-void * __user gva = (void *)iop->gva, * __user uva = (void *)iop->uva;
-
-iop->remain = dbg_rw_mem(gva, uva, iop->len, domid,
- !!iop->gwr, iop->pgd3val);
+iop->remain = dbg_rw_mem(iop->gva, guest_handle_from_ptr(iop->uva, void),
+ iop->len, domid, iop->gwr, iop->pgd3val);
 
 return iop->remain ? -EFAULT : 0;
 }
--- a/xen/include/asm-x86/debugger.h
+++ b/xen/include/asm-x86/debugger.h
@@ -93,9 +93,9 @@ static inline bool debugger_trap_entry(
 #endif
 
 #ifdef CONFIG_GDBSX
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
 unsigned int len, domid_t domid, bool toaddr,
-uint64_t pgd3);
+unsigned long pgd3);
 #endif
 
 #endif /* __X86_DEBUGGER_H__ */




[PATCH v2 4/8] x86: rename {get,put}_user() to {get,put}_guest()

2021-02-17 Thread Jan Beulich
Bring them (back) in line with __{get,put}_guest().

Signed-off-by: Jan Beulich 
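
The renamed helpers keep the 0 / -EFAULT return convention, which is
what lets the call sites below OR a whole series of stores together and
test once; a sketch mirroring load_segments():

    if ( put_guest(regs->ss,  esp - 1) |
         put_guest(regs->esp, esp - 2) )
    {
        /* At least one guest stack write faulted.  Note the bitwise |:
         * short-circuiting is deliberately avoided, so every store is
         * attempted. */
    }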

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1649,19 +1649,19 @@ static void load_segments(struct vcpu *n
 
 if ( !ring_1(regs) )
 {
-ret  = put_user(regs->ss,   esp-1);
-ret |= put_user(regs->esp,  esp-2);
+ret  = put_guest(regs->ss,  esp - 1);
+ret |= put_guest(regs->esp, esp - 2);
 esp -= 2;
 }
 
 if ( ret |
- put_user(rflags,  esp-1) |
- put_user(cs_and_mask, esp-2) |
- put_user(regs->eip,   esp-3) |
- put_user(uregs->gs,   esp-4) |
- put_user(uregs->fs,   esp-5) |
- put_user(uregs->es,   esp-6) |
- put_user(uregs->ds,   esp-7) )
+ put_guest(rflags,  esp - 1) |
+ put_guest(cs_and_mask, esp - 2) |
+ put_guest(regs->eip,   esp - 3) |
+ put_guest(uregs->gs,   esp - 4) |
+ put_guest(uregs->fs,   esp - 5) |
+ put_guest(uregs->es,   esp - 6) |
+ put_guest(uregs->ds,   esp - 7) )
 {
 gprintk(XENLOG_ERR,
 "error while creating compat failsafe callback 
frame\n");
@@ -1690,17 +1690,17 @@ static void load_segments(struct vcpu *n
 cs_and_mask = (unsigned long)regs->cs |
 ((unsigned long)vcpu_info(n, evtchn_upcall_mask) << 32);
 
-if ( put_user(regs->ss,rsp- 1) |
- put_user(regs->rsp,   rsp- 2) |
- put_user(rflags,  rsp- 3) |
- put_user(cs_and_mask, rsp- 4) |
- put_user(regs->rip,   rsp- 5) |
- put_user(uregs->gs,   rsp- 6) |
- put_user(uregs->fs,   rsp- 7) |
- put_user(uregs->es,   rsp- 8) |
- put_user(uregs->ds,   rsp- 9) |
- put_user(regs->r11,   rsp-10) |
- put_user(regs->rcx,   rsp-11) )
+if ( put_guest(regs->ss,rsp -  1) |
+ put_guest(regs->rsp,   rsp -  2) |
+ put_guest(rflags,  rsp -  3) |
+ put_guest(cs_and_mask, rsp -  4) |
+ put_guest(regs->rip,   rsp -  5) |
+ put_guest(uregs->gs,   rsp -  6) |
+ put_guest(uregs->fs,   rsp -  7) |
+ put_guest(uregs->es,   rsp -  8) |
+ put_guest(uregs->ds,   rsp -  9) |
+ put_guest(regs->r11,   rsp - 10) |
+ put_guest(regs->rcx,   rsp - 11) )
 {
 gprintk(XENLOG_ERR,
 "error while creating failsafe callback frame\n");
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -26,14 +26,12 @@ extern void __put_user_bad(void);
 #define UA_DROP(args...)
 
 /**
- * get_user: - Get a simple variable from user space.
+ * get_guest: - Get a simple variable from guest space.
  * @x:   Variable to store result.
- * @ptr: Source address, in user space.
- *
- * Context: User context only.  This function may sleep.
+ * @ptr: Source address, in guest space.
  *
- * This macro copies a single simple variable from user space to kernel
- * space.  It supports simple types like char and int, but not larger
+ * This macro loads a single simple variable from guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and the result of
@@ -42,18 +40,15 @@ extern void __put_user_bad(void);
  * Returns zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define get_user(x,ptr)\
-  __get_user_check((x),(ptr),sizeof(*(ptr)))
+#define get_guest(x, ptr) get_guest_check(x, ptr, sizeof(*(ptr)))
 
 /**
- * put_user: - Write a simple value into user space.
- * @x:   Value to copy to user space.
- * @ptr: Destination address, in user space.
- *
- * Context: User context only.  This function may sleep.
+ * put_guest: - Write a simple value into guest space.
+ * @x:   Value to store in guest space.
+ * @ptr: Destination address, in guest space.
  *
- * This macro copies a single simple value from kernel space to user
- * space.  It supports simple types like char and int, but not larger
+ * This macro stores a single simple value to guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and @x must be assignable
@@ -61,8 +56,8 @@ extern void __put_user_bad(void);
  *
  * Returns zero on success, or -EFAULT on error.
  */
-#define put_user(x,ptr)\
-  __put_user_check((x),(ptr),sizeof(*(ptr)))

[PATCH v2 3/8] x86/PV: harden guest memory accesses against speculative abuse

2021-02-17 Thread Jan Beulich
Inspired by
https://lore.kernel.org/lkml/f12e7d3cecf41b2c29734ea45a393be21d4a8058.1597848273.git.jpoim...@redhat.com/
and prior work in that area of x86 Linux, suppress speculation with
guest specified pointer values by suitably masking the addresses to
non-canonical space in case they fall into Xen's virtual address range.

Introduce a new Kconfig control.

Note that it is necessary in such code to avoid using "m" kind operands:
If we didn't, there would be no guarantee that the register passed to
guest_access_mask_ptr is also the (base) one used for the memory access.

As a minor unrelated change in get_unsafe_asm() the unnecessary "itype"
parameter gets dropped and the XOR on the fixup path gets changed to be
a 32-bit one in all cases: This way we avoid pointless REX.W or operand
size overrides, or writes to partial registers.

Requested-by: Andrew Cooper 
Signed-off-by: Jan Beulich 
Reviewed-by: Roger Pau Monné 
---
v2: Add comment to assembler macro.
---
The insn sequence chosen is certainly up for discussion; I've picked
this one despite the RCR because alternatives I could come up with,
like

mov $(HYPERVISOR_VIRT_END), %rax
mov $~0, %rdx
mov $0x7fffffffffffffff, %rcx
cmp %rax, %rdi
cmovb   %rcx, %rdx
and %rdx, %rdi

weren't necessarily better: Either, as above, they are longer and
require a 3rd scratch register, or they also utilize the carry flag in
some similar way.
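
In C terms the masking amounts to the following self-contained analogue
(illustrative only; the HYPERVISOR_VIRT_END value is a placeholder, not
the real layout constant):

    #include <stdint.h>
    #include <stdio.h>

    #define HYPERVISOR_VIRT_END UINT64_C(0xffff880000000000) /* placeholder */

    static uint64_t mask_guest_ptr(uint64_t addr)
    {
        /*
         * Addresses below HYPERVISOR_VIRT_END get bit 63 cleared: a
         * no-op for low-half guest addresses, but it turns a Xen-range
         * address non-canonical, so a speculatively issued access
         * faults instead of reaching Xen data.  Guest addresses above
         * the Xen slice keep all their bits.
         */
        uint64_t mask = addr < HYPERVISOR_VIRT_END
                        ? UINT64_C(0x7fffffffffffffff) : ~UINT64_C(0);

        return addr & mask;
    }

    int main(void)
    {
        printf("%#jx\n", (uintmax_t)mask_guest_ptr(UINT64_C(0x7f0000001000)));
        printf("%#jx\n", (uintmax_t)mask_guest_ptr(UINT64_C(0xffff820000000000)));
        return 0;
    }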
---
Judging from the comment ahead of put_unsafe_asm() we might as well not
tell gcc at all anymore about the memory access there, now that there's
no use of the operand anymore in the assembly code.

--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -10,12 +10,19 @@
 #include <xen/sched.h>
 #include <asm/uaccess.h>
 
-unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
+#ifndef GUARD
+# define GUARD UA_KEEP
+#endif
+
+unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
 {
 unsigned dummy;
 
 stac();
 asm volatile (
+GUARD(
+"guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
+)
 "cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
 "jbe  1f\n"
 "mov  %k[to], %[cnt]\n"
@@ -42,6 +49,7 @@ unsigned __copy_to_user_ll(void __user *
 _ASM_EXTABLE(1b, 2b)
 : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
   [aux] "=" (dummy)
+  GUARD(, [scratch1] "=" (dummy), [scratch2] "=" (dummy))
 : "[aux]" (n)
 : "memory" );
 clac();
@@ -49,12 +57,15 @@ unsigned __copy_to_user_ll(void __user *
 return n;
 }
 
-unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n)
 {
 unsigned dummy;
 
 stac();
 asm volatile (
+GUARD(
+"guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n"
+)
 "cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
 "jbe  1f\n"
 "mov  %k[to], %[cnt]\n"
@@ -87,6 +98,7 @@ unsigned __copy_from_user_ll(void *to, c
 _ASM_EXTABLE(1b, 6b)
 : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
   [aux] "=" (dummy)
+  GUARD(, [scratch1] "=" (dummy), [scratch2] "=" (dummy))
 : "[aux]" (n)
 : "memory" );
 clac();
@@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
 return n;
 }
 
+#if GUARD(1) + 0
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
@@ -128,8 +142,11 @@ unsigned clear_user(void __user *to, uns
 {
 if ( access_ok(to, n) )
 {
+long dummy;
+
 stac();
 asm volatile (
+"guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n"
 "0:  rep stos"__OS"\n"
 "mov  %[bytes], %[cnt]\n"
 "1:  rep stosb\n"
@@ -140,7 +157,8 @@ unsigned clear_user(void __user *to, uns
 ".previous\n"
 _ASM_EXTABLE(0b,3b)
 _ASM_EXTABLE(1b,2b)
-: [cnt] "=" (n), [to] "+D" (to)
+: [cnt] "=" (n), [to] "+D" (to), [scratch1] "=" (dummy),
+  [scratch2] "=" (dummy)
 : [bytes] "r" (n & (BYTES_PER_LONG - 1)),
   [longs] "0" (n / BYTES_PER_LONG), "a" (0) );
 clac();
@@ -174,6 +192,16 @@ unsigned copy_from_user(void *to, const
 return n;
 }
 
+# undef GUARD
+# define GUARD UA_DROP
+# define copy_to_guest_ll copy_to_unsafe_ll
+# define copy_from_guest_ll copy_from_unsafe_ll
+# undef __user
+# define __user
+# include __FILE__
+
+#endif /* GUARD(1) */
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -458,6 +458,8 @@ UNLIKELY_START(g, create_bounce_frame_ba
 jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(create_bounce_frame_bad_sp)
 
+guest_access_mask_ptr 

[PATCH v2 2/8] x86: split __copy_{from,to}_user() into "guest" and "unsafe" variants

2021-02-17 Thread Jan Beulich
The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are intended to be
used in order to access addresses not (directly) under guest control,
within Xen's part of virtual address space. Subsequently we will want
them to have distinct behavior, so as a first step identify which one is
which. For now, both groups of constructs alias one another.

Double underscore prefixes are retained only on
__copy_{from,to}_guest_pv(), to allow still distinguishing them from
their "checking" counterparts once they also get renamed (to
copy_{from,to}_guest_pv()).

Add previously missing __user at some call sites.

Signed-off-by: Jan Beulich 
Reviewed-by: Tim Deegan  [shadow]
Reviewed-by: Roger Pau Monné 
---
Instead of __copy_{from,to}_guest_pv(), perhaps name them just
__copy_{from,to}_pv()?
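
As a rule of thumb, a sketch of which variant goes where (variables and
addresses are hypothetical):

    /* Address (potentially) fully under guest control -> "guest": */
    rc = __copy_from_guest_pv(&val, guest_ptr, sizeof(val));

    /* Xen-controlled address within Xen's VA space, e.g. a linear
     * pagetable slot -> "unsafe": */
    rc = copy_from_unsafe(&l2e, &__linear_l2_table[slot], sizeof(l2e));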

--- a/xen/arch/x86/gdbstub.c
+++ b/xen/arch/x86/gdbstub.c
@@ -33,13 +33,13 @@ gdb_arch_signal_num(struct cpu_user_regs
 unsigned int
 gdb_arch_copy_from_user(void *dest, const void *src, unsigned len)
 {
-return __copy_from_user(dest, src, len);
+return copy_from_unsafe(dest, src, len);
 }
 
 unsigned int 
 gdb_arch_copy_to_user(void *dest, const void *src, unsigned len)
 {
-return __copy_to_user(dest, src, len);
+return copy_to_unsafe(dest, src, len);
 }
 
 void
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2614,7 +2614,7 @@ static int sh_page_fault(struct vcpu *v,
 {
 shadow_l2e_t sl2e;
 mfn_t gl1mfn;
-if ( (__copy_from_user(&sl2e,
+if ( (copy_from_unsafe(&sl2e,
(sh_linear_l2_table(v)
 + shadow_l2_linear_offset(va)),
sizeof(sl2e)) != 0)
@@ -2633,7 +2633,7 @@ static int sh_page_fault(struct vcpu *v,
 #endif /* SHOPT_OUT_OF_SYNC */
 /* The only reasons for reserved bits to be set in shadow entries
  * are the two "magic" shadow_l1e entries. */
-if ( likely((__copy_from_user(&sl1e,
+if ( likely((copy_from_unsafe(&sl1e,
   (sh_linear_l1_table(v)
+ shadow_l1_linear_offset(va)),
   sizeof(sl1e)) == 0)
@@ -3308,10 +3308,10 @@ static bool sh_invlpg(struct vcpu *v, un
sh_linear_l4_table(v)[shadow_l4_linear_offset(linear)])
& _PAGE_PRESENT) )
 return false;
-/* This must still be a copy-from-user because we don't have the
+/* This must still be a copy-from-unsafe because we don't have the
  * paging lock, and the higher-level shadows might disappear
  * under our feet. */
-if ( __copy_from_user(&sl3e, (sh_linear_l3_table(v)
+if ( copy_from_unsafe(&sl3e, (sh_linear_l3_table(v)
   + shadow_l3_linear_offset(linear)),
   sizeof (sl3e)) != 0 )
 {
@@ -3330,9 +3330,9 @@ static bool sh_invlpg(struct vcpu *v, un
 return false;
 #endif
 
-/* This must still be a copy-from-user because we don't have the shadow
+/* This must still be a copy-from-unsafe because we don't have the shadow
  * lock, and the higher-level shadows might disappear under our feet. */
-if ( __copy_from_user(&sl2e,
+if ( copy_from_unsafe(&sl2e,
   sh_linear_l2_table(v) + shadow_l2_linear_offset(linear),
   sizeof (sl2e)) != 0 )
 {
@@ -3371,11 +3371,11 @@ static bool sh_invlpg(struct vcpu *v, un
  * hold the paging lock yet.  Check again with the lock held. */
 paging_lock(d);
 
-/* This must still be a copy-from-user because we didn't
+/* This must still be a copy-from-unsafe because we didn't
  * have the paging lock last time we checked, and the
  * higher-level shadows might have disappeared under our
  * feet. */
-if ( __copy_from_user(&sl2e,
+if ( copy_from_unsafe(&sl2e,
   sh_linear_l2_table(v)
   + shadow_l2_linear_offset(linear),
   sizeof (sl2e)) != 0 )
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -149,12 +149,12 @@ static int read_mem(enum x86_segment seg
 
 addr = (uint32_t)addr;
 
-if ( (rc = __copy_from_user(p_data, (void *)addr, bytes)) )
+if ( (rc = __copy_from_guest_pv(p_data, (void __user *)addr, bytes)) )
 {
 /*
  * TODO: This should report PFEC_insn_fetch when goc->insn_fetch &&
  * cpu_has_nx, but we'd then need a "fetch" variant of
- * __copy_from_user() respecting NX, SMEP, and protection keys.
+ * __copy_from_guest_pv() respecting NX, SMEP, and protection keys.
  */
 x86_emul_pagefault(0, addr + bytes - rc, ctxt);
 return 

[PATCH v2 1/8] x86: split __{get,put}_user() into "guest" and "unsafe" variants

2021-02-17 Thread Jan Beulich
The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are intended to be
used in order to access addresses not (directly) under guest control,
within Xen's part of virtual address space. (For linear page table and
descriptor table accesses the low bits of the addresses may still be
guest controlled, but this still won't allow speculation to "escape"
into unwanted areas.) Subsequently we will want them to have distinct
behavior, so as first step identify which one is which. For now, both
groups of constructs alias one another.

Double underscore prefixes are retained only on __{get,put}_guest(), to
allow still distinguishing them from their "checking" counterparts once
they also get renamed (to {get,put}_guest()).

Since for them it's almost a full re-write, move what becomes
{get,put}_unsafe_size() into the "common" uaccess.h (x86_64/*.h should
disappear at some point anyway).

In __copy_to_user() one of the two casts in each put_guest_size()
invocation gets dropped. They're not needed and did break symmetry with
__copy_from_user().

Signed-off-by: Jan Beulich 
Reviewed-by: Tim Deegan  [shadow]
Reviewed-by: Roger Pau Monné 
---
v2: Use __get_guest() in {,compat_}show_guest_stack().
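
Once the full series is in place, the naming scheme works out as below
(a sketch; guest_ptr is hypothetical):

    /* Checking variant: validates the address range itself. */
    if ( get_guest(val, guest_ptr) )
        return -EFAULT;

    /* Non-checking, double-underscore variant: for use where the
     * address has already been validated, e.g. via __addr_ok(). */
    if ( __get_guest(val, guest_ptr) )
        return -EFAULT;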

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -776,9 +776,9 @@ shadow_write_entries(void *d, void *s, i
 /* Because we mirror access rights at all levels in the shadow, an
  * l2 (or higher) entry with the RW bit cleared will leave us with
  * no write access through the linear map.
- * We detect that by writing to the shadow with __put_user() and
+ * We detect that by writing to the shadow with put_unsafe() and
  * using map_domain_page() to get a writeable mapping if we need to. */
-if ( __put_user(*dst, dst) )
+if ( put_unsafe(*dst, dst) )
 {
 perfc_incr(shadow_linear_map_failed);
 map = map_domain_page(mfn);
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -40,7 +40,7 @@ static int read_gate_descriptor(unsigned
  ((gate_sel >> 3) + !is_pv_32bit_vcpu(v) >=
   (gate_sel & 4 ? v->arch.pv.ldt_ents
 : v->arch.pv.gdt_ents)) ||
- __get_user(desc, pdesc) )
+ get_unsafe(desc, pdesc) )
 return 0;
 
 *sel = (desc.a >> 16) & 0xfffc;
@@ -59,7 +59,7 @@ static int read_gate_descriptor(unsigned
 {
 if ( (*ar & 0x1f00) != 0x0c00 ||
  /* Limit check done above already. */
- __get_user(desc, pdesc + 1) ||
+ get_unsafe(desc, pdesc + 1) ||
  (desc.b & 0x1f00) )
 return 0;
 
@@ -294,7 +294,7 @@ void pv_emulate_gate_op(struct cpu_user_
 { \
 --stkp; \
 esp -= 4; \
-rc = __put_user(item, stkp); \
+rc = __put_guest(item, stkp); \
 if ( rc ) \
 { \
 pv_inject_page_fault(PFEC_write_access, \
@@ -362,7 +362,7 @@ void pv_emulate_gate_op(struct cpu_user_
 unsigned int parm;
 
 --ustkp;
-rc = __get_user(parm, ustkp);
+rc = __get_guest(parm, ustkp);
 if ( rc )
 {
pv_inject_page_fault(0, (unsigned long)(ustkp + 1) - rc);
--- a/xen/arch/x86/pv/emulate.c
+++ b/xen/arch/x86/pv/emulate.c
@@ -34,13 +34,13 @@ int pv_emul_read_descriptor(unsigned int
 if ( sel < 4 ||
  /*
   * Don't apply the GDT limit here, as the selector may be a Xen
-  * provided one. __get_user() will fail (without taking further
+  * provided one. get_unsafe() will fail (without taking further
   * action) for ones falling in the gap between guest populated
   * and Xen ones.
   */
  ((sel & 4) && (sel >> 3) >= v->arch.pv.ldt_ents) )
 desc.b = desc.a = 0;
-else if ( __get_user(desc, gdt_ldt_desc_ptr(sel)) )
+else if ( get_unsafe(desc, gdt_ldt_desc_ptr(sel)) )
 return 0;
 if ( !insn_fetch )
 desc.b &= ~_SEGMENT_L;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -114,15 +114,15 @@ unsigned int compat_iret(void)
 regs->rsp = (u32)regs->rsp;
 
 /* Restore EAX (clobbered by hypercall). */
-if ( unlikely(__get_user(regs->eax, (u32 *)regs->rsp)) )
+if ( unlikely(__get_guest(regs->eax, (u32 *)regs->rsp)) )
 {
 domain_crash(v->domain);
 return 0;
 }
 
 /* Restore CS and EIP. */
-if ( unlikely(__get_user(regs->eip, (u32 *)regs->rsp + 1)) ||
-unlikely(__get_user(regs->cs, (u32 *)regs->rsp + 2)) )
+if ( unlikely(__get_guest(regs->eip, (u32 *)regs->rsp + 1)) ||
+unlikely(__get_guest(regs->cs, (u32 *)regs->rsp + 2)) )
 {
 domain_crash(v->domain);
 return 0;
@@ -132,7 +132,7 @@ unsigned int compat_iret(void)
  * Fix up and restore EFLAGS. We 

[PATCH v2 0/8] x86/PV: avoid speculation abuse through guest accessors

2021-02-17 Thread Jan Beulich
Re-sending primarily for the purpose of getting a release ack, an
explicit release nak, or an indication of there not being a need,
all for at least the first three patches here (which are otherwise
ready to go in). I've dropped the shadow part of the series from
this re-submission, because it has all got reviewed by Tim already
and is intended for 4.16 only anyway. I'm re-including the follow-up
patches that get the code base back into consistent shape, as I
continue to think this consistency goal is at least worth a
consideration towards a freeze exception.

1: split __{get,put}_user() into "guest" and "unsafe" variants
2: split __copy_{from,to}_user() into "guest" and "unsafe" variants
3: PV: harden guest memory accesses against speculative abuse
4: rename {get,put}_user() to {get,put}_guest()
5: gdbsx: convert "user" to "guest" accesses
6: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()
7: move stac()/clac() from {get,put}_unsafe_asm() ...
8: PV: use get_unsafe() instead of copy_from_unsafe()

Jan



Re: Linux DomU freezes and dies under heavy memory shuffling

2021-02-17 Thread Roman Shaposhnik
Hi Jürgen, thanks for taking a look at this. A few comments below:

On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß  wrote:
>
> On 16.02.21 21:34, Stefano Stabellini wrote:
> > + x86 maintainers
> >
> > It looks like the tlbflush is getting stuck?
>
> I have seen this case multiple times on customer systems now, but
> reproducing it reliably seems to be very hard.

It is reliably reproducible under my workload, but it takes a long time
(~3 days of the workload running in the lab).

> I suspected fifo events to be blamed, but just yesterday I've been
> informed of another case with fifo events disabled in the guest.
>
> One common pattern seems to be that up to now I have seen this effect
> only on systems with Intel Gold cpus. Can it be confirmed to be true
> in this case, too?

I am pretty sure mine isn't -- I can get you full CPU specs if that's useful.

> In case anybody has a reproducer (either in a guest or dom0) with a
> setup where a diagnostic kernel can be used, I'd be _very_ interested!

I can easily add things to Dom0 and DomU. Whether that will disrupt the
experiment is, of course, another matter. Still, please let me know what
would be helpful to do.

Thanks,
Roman.