[libvirt test] 159437: regressions - FAIL
flight 159437 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159437/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-armhf-libvirt           6 libvirt-build   fail REGR. vs. 151777
 build-amd64-libvirt           6 libvirt-build   fail REGR. vs. 151777
 build-i386-libvirt            6 libvirt-build   fail REGR. vs. 151777
 build-arm64-libvirt           6 libvirt-build   fail REGR. vs. 151777

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt       1 build-check(1)  blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)  blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd   1 build-check(1)  blocked  n/a
 test-amd64-amd64-libvirt-xsm   1 build-check(1)  blocked  n/a
 test-amd64-i386-libvirt        1 build-check(1)  blocked  n/a
 test-amd64-i386-libvirt-pair   1 build-check(1)  blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-xsm    1 build-check(1)  blocked  n/a
 test-arm64-arm64-libvirt       1 build-check(1)  blocked  n/a
 test-arm64-arm64-libvirt-qcow2 1 build-check(1)  blocked  n/a
 test-arm64-arm64-libvirt-xsm   1 build-check(1)  blocked  n/a
 test-armhf-armhf-libvirt       1 build-check(1)  blocked  n/a
 test-armhf-armhf-libvirt-raw   1 build-check(1)  blocked  n/a

version targeted for testing:
 libvirt              040a5bc307eb8cd304a4330b7538370315db3774
baseline version:
 libvirt              2c846fa6bcc11929c9fb857a22430fb9945654ad

Last test of basis   151777  2020-07-10 04:19:19 Z  223 days
Failing since        151818  2020-07-11 04:18:52 Z  222 days  214 attempts
Testing same since   159437  2021-02-17 04:18:55 Z    1 days    1 attempts

People who touched revisions under test:
  Adolfo Jayme Barrientos, Aleksandr Alekseev, Andika Triwidada, Andrea
  Bolognani, Balázs Meskó, Barrett Schonefeld, Bastien Orivel, Bihong Yu,
  Binfeng Wu, Boris Fiuczynski, Brian Turek, Christian Ehrhardt, Christian
  Schoenebeck, Cole Robinson, Collin Walling, Cornelia Huck, Cédric
  Bosdonnat, Côme Borsoi, Daniel Henrique Barboza, Daniel Letai, Daniel P.
  Berrange, Daniel P. Berrangé, Dmytro Linkin, Eiichi Tsukata, Erik
  Skultety, Fabian Affolter, Fabian Freyer, Fangge Jin, Farhan Ali, Fedora
  Weblate Translation, gongwei, Guoyi Tu, Göran Uddeborg, Halil Pasic, Han
  Han, Hao Wang, Helmut Grohne, Ian Wienand, Jakob Meng, Jamie Strandboge,
  Jamie Strandboge, Jan Kuparinen, Jean-Baptiste Holcroft, Jianan Gao, Jim
  Fehlig, Jin Yan, Jiri Denemark, John Ferlan, Jonathan Watt, Jonathon
  Jongsma, Julio Faracco, Ján Tomko, Kashyap Chamarthy, Kevin Locke,
  Kristina Hanicova, Laine Stump, Laszlo Ersek, Liao Pingfang, Lin Ma, Lin
  Ma, Lin Ma, Marc Hartmayer, Marc-André Lureau, Marek
  Marczykowski-Górecki, Markus Schade, Martin Kletzander, Masayoshi Mizuma,
  Matt Coleman, Matt Coleman, Mauro Matteo Cascella, Meina Li, Michal
  Privoznik, Michał Smyk, Milo Casagrande, Moshe Levi, Muha Aliss, Neal
  Gompa, Nick Shyrokovskiy, Nickys Music Group, Nico Pache, Nikolay
  Shirokovskiy, Olaf Hering, Olesya Gerasimenko, Orion Poplawski, Patrick
  Magauran, Paulo de Rezende Pinatti, Pavel Hrdina, Peter Krempa, Pino
  Toscano, Pino Toscano, Piotr Drąg, Prathamesh Chavan, Ricky Tigg, Roman
  Bogorodskiy, Roman Bolshakov, Ryan Gahagan, Ryan Schmidt, Sam Hartman,
  Scott Shambarger, Sebastian Mitterle, Shalini Chellathurai Saroja,
  Shaojun Yang, Shi Lei, Simon Gaiser, Stefan Bader, Stefan Berger, Szymon
  Scholz, Thomas Huth, Tim Wiederhake, Tomáš Golembiovský, Tomáš Janoušek,
  Tuguoyi, Ville Skyttä, Wang Xin, Weblate, Yalei Li <274268...@qq.com>,
  Yalei Li, Yang Hang, Yanqiu Zhang, Yi Li, Yi Wang, Yuri Chornoivan,
  Zheng Chuan, zhenwei pi, Zhenyu Zheng

jobs:
 build-amd64-xsm              pass
 build-arm64-xsm              pass
 build-i386-xsm               pass
 build-amd64                  pass
 build-arm64                  pass
 build-armhf                  pass
 build-i386                   pass
 build-amd64-libvirt          fail
 build-arm64-libvirt          fail
 build-armhf-libvirt
[xen-unstable test] 159424: tolerable FAIL - PUSHED
flight 159424 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159424/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds     18 guest-start/debian.repeat  fail like 159335
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail like 159362
 test-armhf-armhf-libvirt     16 saverestore-support-check  fail like 159396
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop          fail like 159396
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop          fail like 159396
 test-amd64-i386-xl-qemut-ws16-amd64  19 guest-stop          fail like 159396
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop          fail like 159396
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check  fail like 159396
 test-amd64-i386-xl-qemut-win7-amd64  19 guest-stop          fail like 159396
 test-amd64-i386-xl-qemuu-win7-amd64  19 guest-stop          fail like 159396
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop          fail like 159396
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 159396
 test-amd64-i386-xl-qemuu-ws16-amd64  19 guest-stop          fail like 159396
 test-amd64-i386-xl-pvshim    14 guest-start                fail never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl          16 saverestore-support-check  fail never pass
 test-amd64-amd64-libvirt     15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt      15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check  fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check  fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check      fail never pass
 test-armhf-armhf-xl          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check     fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt     15 migrate-support-check      fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check  fail never pass

version targeted for testing:
 xen                  3b1cc15f1931ba56d0ee256fe9bfe65509733b27
baseline version:
 xen                  04085ec1ac05a362812e9b0c6b5a8713d7dc88ad

Last test of basis   159396  2021-02-16 01:52:34 Z  2 days
Testing same since   159424  2021-02-16 17:37:44 Z  1 days  1 attempts

People who touched revisions under test:
  Jan Beulich
  Julien Grall

jobs:
 build-amd64-xsm              pass
 build-arm64-xsm
Re: [RFC PATCH 29/34] power/swap: use bio_new in hib_submit_io
On 2/17/21 14:03, Pavel Machek wrote:
> Hi!
>> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
>> index c73f2e295167..e92e36c053a6 100644
>> --- a/kernel/power/swap.c
>> +++ b/kernel/power/swap.c
>> @@ -271,13 +271,12 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
>>  		struct hib_bio_batch *hb)
>>  {
>>  	struct page *page = virt_to_page(addr);
>> +	sector_t sect = page_off * (PAGE_SIZE >> 9);
>>  	struct bio *bio;
>>  	int error = 0;
>>
>> -	bio = bio_alloc(GFP_NOIO | __GFP_HIGH, 1);
>> -	bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
>> -	bio_set_dev(bio, hib_resume_bdev);
>> -	bio_set_op_attrs(bio, op, op_flags);
>> +	bio = bio_new(hib_resume_bdev, sect, op, op_flags, 1,
>> +		      GFP_NOIO | __GFP_HIGH);
>>
> C function with 6 arguments... dunno. Old version looks comparable or
> even more readable...
>
> Best regards,
> Pavel

The library functions in the kernel tree that are used in different
file-systems and fabrics drivers do take 6 arguments. Plus, what is the
point of duplicating code for mandatory parameters all over the kernel?
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
On Wed, 17 Feb 2021, Julien Grall wrote:
> On 17/02/2021 02:00, Stefano Stabellini wrote:
> > Hi all,
> >
> > Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
> > translate addresses for DMA operations in Dom0. Specifically,
> > swiotlb-xen is used to translate the address of a foreign page (a page
> > belonging to a domU) mapped into Dom0 before using it for DMA.
> >
> > This is important because although Dom0 is 1:1 mapped, DomUs are not. On
> > systems without an IOMMU swiotlb-xen enables PV drivers to work as long
> > as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
> > ends up using the MFN, rather than the GFN.
> >
> > On systems with an IOMMU, this is not necessary: when a foreign page is
> > mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
> > is established for both MMU and SMMU. Dom0 could safely use the GFN
> > address (instead of the MFN) for DMA operations and they would work. It
> > would be more efficient than using swiotlb-xen.
> >
> > If you recall my presentation from Xen Summit 2020, Xilinx is working on
> > cache coloring. With cache coloring, no domain is 1:1 mapped, not even
> > Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
> > work as intended.
> >
> > The suggested solution for both these issues is to add a new feature
> > flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
> > swiotlb-xen because IOMMU translations are available for Dom0. If
> > XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
> > initialization. I have tested this scheme with and without cache
> > coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
> > works as expected: DMA operations succeed.
> >
> > What about systems where an IOMMU is present but not all devices are
> > protected?
> > There is no way for Xen to know which devices are protected and which
> > ones are not: devices that do not have the "iommus" property could or
> > could not be DMA masters.
> >
> > Perhaps Xen could populate a whitelist of devices protected by the IOMMU
> > based on the "iommus" property. It would require some added complexity
> > in Xen and especially in the swiotlb-xen driver in Linux to use it,
> > which is not ideal.
>
> You are trading a bit more complexity in Xen and Linux against a user who
> may not be able to use the hypervisor on his/her platform without a quirk
> in Xen (see more below).
>
> > However, this approach would not work for cache
> > coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
> > used either way
>
> Not all Dom0 Linux kernels will be able to work with cache colouring. So
> you will need a way for the kernel to say "Hey, I can avoid using swiotlb".

An unmodified dom0 Linux kernel can work with Xen cache coloring enabled.
The swiotlb-xen is unneeded, and also all the pfn_valid() checks to detect
if a page is local or foreign don't work as intended. However, it still
works on systems where the IOMMU can be relied upon. That's because the
IOMMU does the GFN->MFN translations, and also because both swiotlb-xen
code paths (cache flush local and cache flush via hypercall) work.

Also considering that somebody who enables cache coloring has to know the
entire system well, I don't think we need a runtime flag from Linux to say
"Hey, I can avoid using swiotlb". That said, I think it is important to
fix Linux, because the code might work today, but it is more by accident
than by design.

> > For these reasons, I would like to propose a single flag
> > XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
> > DMA translations. In situations where a DMA master is not SMMU
> > protected, XENFEAT_ARM_dom0_iommu should not be set.
For example, on a > > platform where an IOMMU is present and protects most DMA masters but it > > is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should > > not be set (because PV block is not going to work without swiotlb-xen.) > > This also means that cache coloring won't be usable on such a system (at > > least not usable with the MMC controller so the system integrator should > > pay special care to setup the system). > > > > It is worth noting that if we wanted to extend the interface to add a > > list of protected devices in the future, it would still be possible. It > > would be compatible with XENFEAT_ARM_dom0_iommu. > > I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and > instead the device-tree list would be used. Is that correct? If XENFEAT_ARM_dom0_iommu is set, the device list is redundant. If XENFEAT_ARM_dom0_iommu is not set, the device list is useful. The device list and XENFEAT_ARM_dom0_iommu serve different purposes. The device list is meant to skip the swiotlb-xen for some devices. XENFEAT_ARM_dom0_iommu is meant to skip the swiotlb-xen for all devices, which is especially useful when dom0 is not 1:1 mapped,
Re: [PATCH] xen/arm : smmuv3: Fix to handle multiple StreamIds per device.
+IanJ

On Wed, 17 Feb 2021, Bertrand Marquis wrote:
> Hi Rahul,
>
> > On 17 Feb 2021, at 10:05, Rahul Singh wrote:
> >
> > SMMUv3 driver does not handle multiple StreamIds if the master device
> > supports more than one StreamID.
> >
> > This bug was introduced when the driver was ported from Linux to XEN.
> > dt_device_set_protected(..) should be called from add_device(..) not
> > from the dt_xlate(..).
> >
> > Move dt_device_set_protected(..) from dt_xlate(..) to add_device().
> >
> > Signed-off-by: Rahul Singh
> Reviewed-by: Bertrand Marquis
>
> Thanks a lot, this is fixing issues with multiple stream ids for one device
> :-)

Reviewed-by: Stefano Stabellini

> > ---
> > This patch is a candidate for 4.15 as without this patch it is not possible to
> > assign multiple StreamIds to the same device when device is protected behind
> > SMMUv3.

I agree, especially considering that the impact is limited to smmu-v3.c.

> > ---
> > xen/drivers/passthrough/arm/smmu-v3.c | 29 ++-
> > 1 file changed, 11 insertions(+), 18 deletions(-)
> >
> > diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
> > index 914cdc1cf4..53d150cdb6 100644
> > --- a/xen/drivers/passthrough/arm/smmu-v3.c
> > +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> > @@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device *dev)
> >      */
> >     arm_smmu_enable_pasid(master);
> >
> > -   return 0;
> > -
> > -err_free_master:
> > -   xfree(master);
> > -   dev_iommu_priv_set(dev, NULL);
> > -   return ret;
> > -}
> > -
> > -static int arm_smmu_dt_xlate(struct device *dev,
> > -                            const struct dt_phandle_args *args)
> > -{
> > -   int ret;
> > -   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > -
> > -   ret = iommu_fwspec_add_ids(dev, args->args, 1);
> > -   if (ret)
> > -           return ret;
> > -
> >     if (dt_device_is_protected(dev_to_dt(dev))) {
> >             dev_err(dev, "Already added to SMMUv3\n");
> >             return -EEXIST;
> > @@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev,
> >             dev_name(fwspec->iommu_dev), fwspec->num_ids);
> >
> >     return 0;
> > +
> > +err_free_master:
> > +   xfree(master);
> > +   dev_iommu_priv_set(dev, NULL);
> > +   return ret;
> > +}
> > +
> > +static int arm_smmu_dt_xlate(struct device *dev,
> > +                            const struct dt_phandle_args *args)
> > +{
> > +   return iommu_fwspec_add_ids(dev, args->args, 1);
> > }
> >
> > /* Probing and initialisation functions */
> > --
> > 2.17.1
Re: [RFC PATCH 29/34] power/swap: use bio_new in hib_submit_io
Hi!

> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index c73f2e295167..e92e36c053a6 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -271,13 +271,12 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
>  		struct hib_bio_batch *hb)
>  {
>  	struct page *page = virt_to_page(addr);
> +	sector_t sect = page_off * (PAGE_SIZE >> 9);
>  	struct bio *bio;
>  	int error = 0;
>
> -	bio = bio_alloc(GFP_NOIO | __GFP_HIGH, 1);
> -	bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
> -	bio_set_dev(bio, hib_resume_bdev);
> -	bio_set_op_attrs(bio, op, op_flags);
> +	bio = bio_new(hib_resume_bdev, sect, op, op_flags, 1,
> +		      GFP_NOIO | __GFP_HIGH);

C function with 6 arguments... dunno. Old version looks comparable or
even more readable...

Best regards,
Pavel

--
http://www.livejournal.com/~pavelmachek
[xen-4.14-testing test] 159420: tolerable FAIL - PUSHED
flight 159420 xen-4.14-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159420/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt     16 saverestore-support-check  fail like 158558
 test-amd64-i386-xl-qemuu-win7-amd64  19 guest-stop          fail like 158558
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 158558
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop          fail like 158558
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop          fail like 158558
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop          fail like 158558
 test-amd64-i386-xl-qemut-win7-amd64  19 guest-stop          fail like 158558
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check  fail like 158558
 test-amd64-i386-xl-qemut-ws16-amd64  19 guest-stop          fail like 158558
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop          fail like 158558
 test-amd64-i386-xl-qemuu-ws16-amd64  19 guest-stop          fail like 158558
 test-amd64-i386-libvirt-xsm  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl          16 saverestore-support-check  fail never pass
 test-amd64-amd64-libvirt     15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt      15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-amd64-i386-xl-pvshim    14 guest-start                fail never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check  fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check     fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt     15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check  fail never pass

version targeted for testing:
 xen                  9f357fe3e4593fc1109962b76d4db73d589ebef5
baseline version:
 xen                  4170218cb96546426664e5c1d00c5a848a26ae9e

Last test of basis   158558  2021-01-21 15:37:26 Z  27 days
Testing same since   159420  2021-02-16 15:06:34 Z   1 days  1 attempts

People who touched revisions under test:
  Julien Grall

jobs:
 build-amd64-xsm              pass
 build-arm64-xsm              pass
 build-i386-xsm               pass
 build-amd64-xtf
[xen-unstable-smoke test] 159445: tolerable all pass - PUSHED
flight 159445 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159445/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt     15 migrate-support-check      fail never pass
 test-armhf-armhf-xl          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail never pass

version targeted for testing:
 xen                  7a4133feaf42000923eb9d84badb6b171625f137
baseline version:
 xen                  d670ef3401b91d04c58d72cd8ce5579b4fa900d8

Last test of basis   159443  2021-02-17 12:00:28 Z  0 days
Testing same since   159445  2021-02-17 16:00:31 Z  0 days  1 attempts

People who touched revisions under test:
  Andrew Cooper
  Julien Grall

jobs:
 build-arm64-xsm                              pass
 build-amd64                                  pass
 build-armhf                                  pass
 build-amd64-libvirt                          pass
 test-armhf-armhf-xl                          pass
 test-arm64-arm64-xl-xsm                      pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64    pass
 test-amd64-amd64-libvirt                     pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   d670ef3401..7a4133feaf  7a4133feaf42000923eb9d84badb6b171625f137 -> smoke
Re: [BUG] Linux pvh vm not getting destroyed on shutdown
On Saturday, 13 February 2021 16:36:24 CET Maximilian Engelhardt wrote:
> Here are some things I noticed while trying to debug this issue:
>
> * It happens on a Debian buster dom0 as well as on a bullseye dom0
> * It seems to only affect pvh vms.
> * shutdown from the pvgrub menu ("c" -> "halt") does work
> * the vm seems to shut down normally, the last lines in the console are:
[...]
> * issuing a reboot instead of a shutdown does work fine.
> * The issue started with Debian kernel 5.8.3+1~exp1 running in the vm,
>   Debian kernel 5.7.17-1 does not show the issue.
> * setting vcpus equal to maxvcpus does *not* show the hang.

One thing I just realized I totally forgot to mention in my initial
report is that this issue is also present for us on a modern kernel. We
tested with Debian kernel 5.10.13-1.
Re: [PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()
On 17/02/2021 17:50, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH 3/3] tools/libxl: Rework invariants in
> libxl__domain_get_device_model_uid()"):
>> Various versions of gcc, when compiling with -Og, complain:
>>
>>   libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
>>   libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized
>>   in this function [-Werror=maybe-uninitialized]
>>     256 |         if (kill_by_uid)
>>         |            ^
> Thanks for working on this. I have reviewed your changes and I see
> where you are coming from. The situation is not very nice, mostly
> because we don't have proper sum types in C.
>
> I'm sorry to say that with my release manager hat on I think it is too
> late for this kind of reorganisation for 4.15, especially just to work
> around an overzealous compiler warning.
>
> I think we can fix the compiler warning simply by setting the
> `kill_by_uid` variable on more of the exit paths. This approach was
> already taken in this code for one of the paths.
>
> I would prefer that approach at this stage of the release.

Well - I have explained why I'm not happy with that approach, but you
are the maintainer and RM.

I will trade you a minimal patch for formal R-b's, so the time invested
so far fixing this mess isn't wasted when 4.16 opens.

~Andrew
[xen-4.13-testing test] 159419: tolerable FAIL - PUSHED
flight 159419 xen-4.13-testing real [real]
flight 159446 xen-4.13-testing real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/159419/
http://logs.test-lab.xenproject.org/osstest/logs/159446/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 159446-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemut-win7-amd64  19 guest-stop          fail like 158557
 test-armhf-armhf-libvirt     16 saverestore-support-check  fail like 158557
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop          fail like 158557
 test-amd64-i386-xl-qemuu-win7-amd64  19 guest-stop          fail like 158557
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check  fail like 158557
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 158557
 test-armhf-armhf-xl-rtds     18 guest-start/debian.repeat  fail like 158557
 test-amd64-i386-xl-qemut-ws16-amd64  19 guest-stop          fail like 158557
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop          fail like 158557
 test-amd64-i386-xl-qemuu-ws16-amd64  19 guest-stop          fail like 158557
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop          fail like 158557
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop          fail like 158557
 test-arm64-arm64-xl          15 migrate-support-check      fail never pass
 test-arm64-arm64-xl          16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check     fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt     15 migrate-support-check      fail never pass
 test-amd64-i386-xl-pvshim    14 guest-start                fail never pass
 test-amd64-i386-libvirt      15 migrate-support-check      fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check      fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check  fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt     15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl          15 migrate-support-check      fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check  fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check      fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check  fail never pass

version targeted for testing:
 xen                  ab995b6af9ab723b0b52e5ea0e342b612f1a7b89
baseline version:
 xen                  e4161938b315f3b9c6a13ade30d16c11504a2d16

Last test of basis   158557  2021-01-21 15:37:26 Z  27 days
Testing same since   159419  2021-02-16 15:06:29 Z   1 days  1 attempts
Re: [PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()
Andrew Cooper writes ("[PATCH 3/3] tools/libxl: Rework invariants in
libxl__domain_get_device_model_uid()"):
> Various versions of gcc, when compiling with -Og, complain:
>
>   libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
>   libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized
>   in this function [-Werror=maybe-uninitialized]
>     256 |         if (kill_by_uid)
>         |            ^

Thanks for working on this. I have reviewed your changes and I see
where you are coming from. The situation is not very nice, mostly
because we don't have proper sum types in C.

I'm sorry to say that with my release manager hat on I think it is too
late for this kind of reorganisation for 4.15, especially just to work
around an overzealous compiler warning.

I think we can fix the compiler warning simply by setting the
`kill_by_uid` variable on more of the exit paths. This approach was
already taken in this code for one of the paths.

I would prefer that approach at this stage of the release.

Sorry,
Ian.
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
On 17.02.2021 16:34, Bertrand Marquis wrote:
>> On 17 Feb 2021, at 02:00, Stefano Stabellini wrote:
>> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
>> translate addresses for DMA operations in Dom0. Specifically,
>> swiotlb-xen is used to translate the address of a foreign page (a page
>> belonging to a domU) mapped into Dom0 before using it for DMA.
>>
>> This is important because although Dom0 is 1:1 mapped, DomUs are not. On
>> systems without an IOMMU swiotlb-xen enables PV drivers to work as long
>> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
>> ends up using the MFN, rather than the GFN.
>>
>> On systems with an IOMMU, this is not necessary: when a foreign page is
>> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
>> is established for both MMU and SMMU. Dom0 could safely use the GFN
>> address (instead of the MFN) for DMA operations and they would work. It
>> would be more efficient than using swiotlb-xen.
>>
>> If you recall my presentation from Xen Summit 2020, Xilinx is working on
>> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
>> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
>> work as intended.
>>
>> The suggested solution for both these issues is to add a new feature
>> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
>> swiotlb-xen because IOMMU translations are available for Dom0. If
>> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
>> initialization. I have tested this scheme with and without cache
>> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
>> works as expected: DMA operations succeed.
>
> Wouldn't it be easier to name the flag XENFEAT_ARM_swiotlb_needed?

Except that "swiotlb" is Linux terminology, which I don't think a Xen
feature should be named after.
Names should be generic, except maybe when they're really targeting exactly one kind of guest (which imo would better never be the case). Jan
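Whatever the flag ends up being named, the guest-side consumption discussed in this thread is straightforward. The sketch below is a hypothetical model, not the real Xen/Linux ABI: the flag name comes from this RFC, the bit number is an assumption for illustration, and the bitmap stands in for what XENVER_get_features would return:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit number, purely for illustration; not an established ABI. */
#define XENFEAT_ARM_dom0_iommu 20U

/* Model of a guest-side feature-bit query on a XENVER_get_features bitmap. */
static bool xen_feature(uint32_t features, unsigned int bit)
{
    return (features >> bit) & 1U;
}

/*
 * Dom0 keeps swiotlb-xen unless Xen guarantees that IOMMU translation
 * covers Dom0's DMA masters, per the proposal in this thread.
 */
static bool need_swiotlb_xen(uint32_t features)
{
    return !xen_feature(features, XENFEAT_ARM_dom0_iommu);
}
```
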
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
On 17/02/2021 16:47, Bertrand Marquis wrote: Hi Julien, On 17 Feb 2021, at 16:41, Julien Grall wrote: On 17/02/2021 15:37, Bertrand Marquis wrote: Hi Julien, Hi Bertrand, On 17 Feb 2021, at 08:50, Julien Grall wrote: On 17/02/2021 02:00, Stefano Stabellini wrote: Hi all, Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to translate addresses for DMA operations in Dom0. Specifically, swiotlb-xen is used to translate the address of a foreign page (a page belonging to a domU) mapped into Dom0 before using it for DMA. This is important because although Dom0 is 1:1 mapped, DomUs are not. On systems without an IOMMU swiotlb-xen enables PV drivers to work as long as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation ends up using the MFN, rather than the GFN. On systems with an IOMMU, this is not necessary: when a foreign page is mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation is enstablished for both MMU and SMMU. Dom0 could safely use the GFN address (instead of the MFN) for DMA operations and they would work. It would be more efficient than using swiotlb-xen. If you recall my presentation from Xen Summit 2020, Xilinx is working on cache coloring. With cache coloring, no domain is 1:1 mapped, not even Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not work as intended. The suggested solution for both these issues is to add a new feature flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use swiotlb-xen because IOMMU translations are available for Dom0. If XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen initialization. I have tested this scheme with and without cache coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it works as expected: DMA operations succeed. What about systems where an IOMMU is present but not all devices are protected? 
There is no way for Xen to know which devices are protected and which ones are not: devices that do not have the "iommus" property could or could not be DMA masters. Perhaps Xen could populate a whitelist of devices protected by the IOMMU based on the "iommus" property. It would require some added complexity in Xen and especially in the swiotlb-xen driver in Linux to use it, which is not ideal. You are trading a bit more complexity in Xen and Linux against a user who may not be able to use the hypervisor on his/her platform without a quirk in Xen (see more below). However, this approach would not work for cache coloring where dom0 is not 1:1 mapped, so swiotlb-xen should not be used either way. Not all Dom0 Linux kernels will be able to work with cache colouring. So you will need a way for the kernel to say "Hey, I can avoid using swiotlb". I fully agree and from my current understanding the condition is “having an iommu”. For these reasons, I would like to propose a single flag XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for DMA translations. In situations where a DMA master is not SMMU protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a platform where an IOMMU is present and protects most DMA masters but leaves out the MMC controller, XENFEAT_ARM_dom0_iommu should not be set (because PV block is not going to work without swiotlb-xen.) This also means that cache coloring won't be usable on such a system (at least not usable with the MMC controller, so the system integrator should take special care when setting up the system). It is worth noting that if we wanted to extend the interface to add a list of protected devices in the future, it would still be possible. It would be compatible with XENFEAT_ARM_dom0_iommu. I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and instead the device-tree list would be used. Is that correct? What do you mean by device tree list here? 
Sorry, I meant "device list". I was referring to Stefano's suggestion to describe the list of devices protected in the device-tree. OK, you mean adding to the device tree some kind of device list for which swiotlb should be used (or maybe the opposite list, in fact). I think the list of protected devices is better because we could create a Xen IOMMU node. How to set XENFEAT_ARM_dom0_iommu? We could set XENFEAT_ARM_dom0_iommu automatically when is_iommu_enabled(d) for Dom0. We could also have a platform specific (xen/arch/arm/platforms/) override so that a specific platform can disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced users, it would also be useful to be able to override it via a Xen command line parameter. Platform quirks should be limited to a small set of platforms. In this case, this would not be only per-platform but also per-firmware table, as a developer can decide to remove/add IOMMU nodes in the DT at any time. In addition to that, it means we are introducing a regression for those users as Xen 4.14 would have
Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables
On 17/02/2021 15:17, Jan Beulich wrote: On 17.02.2021 16:00, Julien Grall wrote: Hi Jan, On 17/02/2021 14:54, Jan Beulich wrote: On 17.02.2021 15:24, Julien Grall wrote:

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
     struct page_info *pg;
     unsigned int done = 0;
 
+    if ( !is_iommu_enabled(d) )
+        return 0;
+
+    /*
+     * Pages will be moved to the free list below. So we want to
+     * clear the root page-table to avoid any potential use after-free.
+     */
+    hd->platform_ops->clear_root_pgtable(d);

Taking amd_iommu_alloc_root() as example, is this really correct prior to what is now patch 2?

Yes, there are no more use-after-free...

And this is because of ...? The necessary lock isn't being held here, so on another CPU allocation of a new root and then of new page tables could happen before you make enough progress here, and hence it looks to me as if there might then still be pages which get freed while present in the page tables (and hence accessible by devices).

Ah yes. I forgot that patch #3 is no longer first. I can move patch #3 back to the front, although I know you dislike the approach taken there...

Cheers,

--
Julien Grall
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
Hi Julien, > On 17 Feb 2021, at 16:41, Julien Grall wrote: > > > > On 17/02/2021 15:37, Bertrand Marquis wrote: >> Hi Julien, > > Hi Bertrand, > >>> On 17 Feb 2021, at 08:50, Julien Grall wrote: >>> >>> >>> >>> On 17/02/2021 02:00, Stefano Stabellini wrote: Hi all, Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to translate addresses for DMA operations in Dom0. Specifically, swiotlb-xen is used to translate the address of a foreign page (a page belonging to a domU) mapped into Dom0 before using it for DMA. This is important because although Dom0 is 1:1 mapped, DomUs are not. On systems without an IOMMU swiotlb-xen enables PV drivers to work as long as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation ends up using the MFN, rather than the GFN. On systems with an IOMMU, this is not necessary: when a foreign page is mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation is enstablished for both MMU and SMMU. Dom0 could safely use the GFN address (instead of the MFN) for DMA operations and they would work. It would be more efficient than using swiotlb-xen. If you recall my presentation from Xen Summit 2020, Xilinx is working on cache coloring. With cache coloring, no domain is 1:1 mapped, not even Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not work as intended. The suggested solution for both these issues is to add a new feature flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use swiotlb-xen because IOMMU translations are available for Dom0. If XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen initialization. I have tested this scheme with and without cache coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it works as expected: DMA operations succeed. What about systems where an IOMMU is present but not all devices are protected? 
There is no way for Xen to know which devices are protected and which ones are not: devices that do not have the "iommus" property could or could not be DMA masters. Perhaps Xen could populate a whitelist of devices protected by the IOMMU based on the "iommus" property. It would require some added complexity in Xen and especially in the swiotlb-xen driver in Linux to use it, which is not ideal. >>> >>> You are trading a bit more complexity in Xen and Linux with a user will may >>> not be able to use the hypervisor on his/her platform without a quirk in >>> Xen (see more below). >>> However, this approach would not work for cache coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be used either way >>> >>> Not all the Dom0 Linux kernel will be able to work with cache colouring. So >>> you will need a way for the kernel to say "Hey I am can avoid using >>> swiotlb". >> I fully agree and from my current understanding the condition is “having an >> iommu”. >>> For these reasons, I would like to propose a single flag XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for DMA translations. In situations where a DMA master is not SMMU protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a platform where an IOMMU is present and protects most DMA masters but it is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should not be set (because PV block is not going to work without swiotlb-xen.) This also means that cache coloring won't be usable on such a system (at least not usable with the MMC controller so the system integrator should pay special care to setup the system). It is worth noting that if we wanted to extend the interface to add a list of protected devices in the future, it would still be possible. It would be compatible with XENFEAT_ARM_dom0_iommu. >>> >>> I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared >>> and instead the device-tree list would be used. Is that correct? 
>> What do you mean by device tree list here ? > > Sorry I meant "device list". I was referring to Stefano's suggestion to > describe the list of devices protected in the device-tree. Ok you mean adding to the device tree some kind of device list for which swiotlb should be used (or maybe the opposite list in fact). > >>> How to set XENFEAT_ARM_dom0_iommu? We could set XENFEAT_ARM_dom0_iommu automatically when is_iommu_enabled(d) for Dom0. We could also have a platform specific (xen/arch/arm/platforms/) override so that a specific platform can disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced users, it would also be useful to be able to override it via a Xen command line parameter. >>> Platform quirks should be are limited to a small set of platforms. >>> >>> In this case,
[PATCH 2/3] tools/libxl: Simplify the out path in libxl__domain_get_device_model_uid()
All paths heading towards `out` have rc = 0. Assert this property.

The intended_uid check is an error path. Use the err label rather than falling into subsequent success logic.

With the above two changes, the two `if (!rc)` checks can be dropped.

Now, both remaining tests start with `if (user ...)`. Combine the two blocks.

No functional change, but far simpler logic to follow.

Signed-off-by: Andrew Cooper
---
CC: Ian Jackson
CC: Wei Liu
CC: Anthony PERARD
---
 tools/libs/light/libxl_dm.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index 30b3242e57..7843c283ca 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -243,16 +243,17 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         goto err;
 
 out:
-    /* First, do a root check if appropriate */
-    if (!rc) {
-        if (user && intended_uid == 0) {
+    assert(rc == 0);
+
+    if (user) {
+        /* First, do a root check if appropriate */
+        if (intended_uid == 0) {
             LOGD(ERROR, guest_domid, "intended_uid is 0 (root)!");
             rc = ERROR_INVAL;
+            goto err;
         }
-    }
 
-    /* Then do the final set, if still appropriate */
-    if (!rc && user) {
+        /* Then do the final set. */
         state->dm_runas = user;
         if (kill_by_uid)
             state->dm_kill_uid = GCSPRINTF("%ld", (long)intended_uid);
-- 
2.11.0
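The out/err discipline this patch establishes can be demonstrated in a standalone sketch. The function and names below are hypothetical (only ERROR_INVAL mirrors libxl's convention): every failure jumps to `err` carrying a nonzero rc, so the success path can assert rc == 0 before running success-only logic, exactly as the patch does:

```c
#include <assert.h>

#define ERROR_INVAL (-6)  /* mirrors libxl's error convention, for illustration */

/*
 * Hypothetical function showing the pattern: error paths goto err with
 * rc != 0; the success path asserts rc == 0 and then falls through to
 * the shared return.
 */
static int lookup_uid(int uid, int *out_uid)
{
    int rc;

    if (uid == 0) {          /* running the device model as root is an error */
        rc = ERROR_INVAL;
        goto err;
    }

    rc = 0;
    /* "out": every path reaching here carries rc == 0 */
    assert(rc == 0);
    *out_uid = uid;          /* success-only side effect */

 err:
    return rc;
}
```
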
[PATCH for-4.15 0/3] tools/libxl: -Og fixes for libxl__domain_get_device_model_uid()
Split out of previous series, and split up for clarity.

Andrew Cooper (3):
  tools/libxl: Split out and err paths in libxl__domain_get_device_model_uid()
  tools/libxl: Simplify the out path in libxl__domain_get_device_model_uid()
  tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()

 tools/libs/light/libxl_dm.c | 58 +++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

-- 
2.11.0
[PATCH 1/3] tools/libxl: Split out and err paths in libxl__domain_get_device_model_uid()
All paths with nonzero rc head to err. All paths with a zero rc stay heading towards out.

The comment discussing invariants is now arguably stale - it will be fixed in one coherent manner when further fixes are in place.

No functional change.

Signed-off-by: Andrew Cooper
---
CC: Ian Jackson
CC: Wei Liu
CC: Anthony PERARD
---
 tools/libs/light/libxl_dm.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index db4cec6a76..30b3242e57 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -153,13 +153,13 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     if (user) {
         rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
         if (rc)
-            goto out;
+            goto err;
 
         if (!user_base) {
             LOGD(ERROR, guest_domid, "Couldn't find device_model_user %s",
                  user);
             rc = ERROR_INVAL;
-            goto out;
+            goto err;
         }
 
         intended_uid = user_base->pw_uid;
@@ -188,7 +188,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     rc = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_RANGE_BASE,
                                     &user_pwbuf, &user_base);
     if (rc)
-        goto out;
+        goto err;
 
     if (user_base) {
         struct passwd *user_clash, user_clash_pwbuf;
@@ -196,14 +196,14 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         rc = userlookup_helper_getpwuid(gc, intended_uid,
                                         &user_clash_pwbuf, &user_clash);
         if (rc)
-            goto out;
+            goto err;
         if (user_clash) {
             LOGD(ERROR, guest_domid,
                  "wanted to use uid %ld (%s + %d) but that is user %s !",
                  (long)intended_uid, LIBXL_QEMU_USER_RANGE_BASE,
                  guest_domid, user_clash->pw_name);
             rc = ERROR_INVAL;
-            goto out;
+            goto err;
         }
 
         LOGD(DEBUG, guest_domid, "using uid %ld", (long)intended_uid);
@@ -222,7 +222,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     user = LIBXL_QEMU_USER_SHARED;
     rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
     if (rc)
-        goto out;
+        goto err;
     if (user_base) {
         LOGD(WARN, guest_domid, "Could not find user %s, falling back to %s",
              LIBXL_QEMU_USER_RANGE_BASE, LIBXL_QEMU_USER_SHARED);
@@ -240,6 +240,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
          "Could not find user %s or range base pseudo-user %s, cannot restrict",
          LIBXL_QEMU_USER_SHARED, LIBXL_QEMU_USER_RANGE_BASE);
     rc = ERROR_INVAL;
+    goto err;
 
 out:
     /* First, do a root check if appropriate */
@@ -257,6 +258,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         state->dm_kill_uid = GCSPRINTF("%ld", (long)intended_uid);
     }
 
+ err:
     return rc;
 }
-- 
2.11.0
[PATCH 3/3] tools/libxl: Rework invariants in libxl__domain_get_device_model_uid()
Various versions of gcc, when compiling with -Og, complain:

  libxl_dm.c: In function 'libxl__domain_get_device_model_uid':
  libxl_dm.c:256:12: error: 'kill_by_uid' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    256 |     if (kill_by_uid)
        |        ^

The logic is very tangled. Two paths unconditionally set user before checking for associated errors. This interferes with the expected use of uninitialised-variable heuristics to force compile failures for ill-formed exit paths.

Use b_info->device_model_user and LIBXL_QEMU_USER_SHARED as applicable, and only set user on the goto out paths. All goto out paths are now of the form:

  user = NULL;
  rc = 0;

or:

  user = non-NULL;
  intended_uid = ...;
  kill_by_uid = ...;
  rc = 0;

As a consequence, intended_uid can drop its default of -1, and the dm_restrict path can drop its now-stale "just in case" comment and the redundant setting of kill_by_uid that worked around this issue at other optimisation levels.

Finally, rewrite the comment about invariants, indicating the split between the out and err labels and the associated rc values. Additionally, reword the "is {not,} set" terminology to "is {non-,}NULL" to be more precise.

No functional change.

Signed-off-by: Andrew Cooper
---
CC: Ian Jackson
CC: Wei Liu
CC: Anthony PERARD
---
 tools/libs/light/libxl_dm.c | 31 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index 7843c283ca..8a7e084d89 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -127,7 +127,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     struct passwd *user_base, user_pwbuf;
     int rc;
     char *user;
-    uid_t intended_uid = -1;
+    uid_t intended_uid;
     bool kill_by_uid;
 
     /* Only qemu-upstream can run as a different uid */
@@ -135,33 +135,34 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
         return 0;
 
     /*
-     * From this point onward, all paths should go through the `out`
-     * label. The invariants should be:
+     * From this point onward, all paths should go through the `out` (success,
+     * rc = 0) or `err` (failure, rc != 0) labels. The invariants should be:
      * - rc may be 0, or an error code.
-     * - if rc is an error code, user and intended_uid are ignored.
-     * - if rc is 0, user may be set or not set.
-     * - if user is set, then intended_uid must be set to a UID matching
+     * - if rc is an error code, all settings are ignored.
+     * - if rc is 0, user may be NULL or non-NULL.
+     * - if user is non-NULL, then intended_uid must be set to a UID matching
      *   the username `user`, and kill_by_uid must be set to the appropriate
      *   value. intended_uid will be checked for root (0).
      */
-
+
     /*
      * If device_model_user is present, set `-runas` even if
      * dm_restrict isn't in use
      */
-    user = b_info->device_model_user;
-    if (user) {
-        rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
+    if (b_info->device_model_user) {
+        rc = userlookup_helper_getpwnam(gc, b_info->device_model_user,
+                                        &user_pwbuf, &user_base);
         if (rc)
             goto err;
 
         if (!user_base) {
             LOGD(ERROR, guest_domid, "Couldn't find device_model_user %s",
-                 user);
+                 b_info->device_model_user);
             rc = ERROR_INVAL;
             goto err;
         }
 
+        user = b_info->device_model_user;
         intended_uid = user_base->pw_uid;
         kill_by_uid = true;
         rc = 0;
@@ -175,8 +176,7 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
     if (!libxl_defbool_val(b_info->dm_restrict)) {
         LOGD(DEBUG, guest_domid,
              "dm_restrict disabled, starting QEMU as root");
-        user = NULL; /* Should already be null, but just in case */
-        kill_by_uid = false; /* Keep older versions of gcc happy */
+        user = NULL;
         rc = 0;
         goto out;
     }
@@ -219,13 +219,14 @@ static int libxl__domain_get_device_model_uid(libxl__gc *gc,
      * QEMU_USER_SHARED. NB for QEMU_USER_SHARED, all QEMU will run
      * as the same UID, we can't kill by uid; therefore don't set uid.
      */
-    user = LIBXL_QEMU_USER_SHARED;
-    rc = userlookup_helper_getpwnam(gc, user, &user_pwbuf, &user_base);
+    rc = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_SHARED,
+                                    &user_pwbuf, &user_base);
     if (rc)
         goto err;
     if (user_base) {
         LOGD(WARN, guest_domid, "Could not find user %s, falling back to %s",
              LIBXL_QEMU_USER_RANGE_BASE, LIBXL_QEMU_USER_SHARED);
+        user = LIBXL_QEMU_USER_SHARED;
         intended_uid = user_base->pw_uid;
         kill_by_uid = false;
         rc = 0;
-- 
2.11.0
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
On 17/02/2021 15:37, Bertrand Marquis wrote: Hi Julien, Hi Bertrand, On 17 Feb 2021, at 08:50, Julien Grall wrote: On 17/02/2021 02:00, Stefano Stabellini wrote: Hi all, Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to translate addresses for DMA operations in Dom0. Specifically, swiotlb-xen is used to translate the address of a foreign page (a page belonging to a domU) mapped into Dom0 before using it for DMA. This is important because although Dom0 is 1:1 mapped, DomUs are not. On systems without an IOMMU swiotlb-xen enables PV drivers to work as long as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation ends up using the MFN, rather than the GFN. On systems with an IOMMU, this is not necessary: when a foreign page is mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation is enstablished for both MMU and SMMU. Dom0 could safely use the GFN address (instead of the MFN) for DMA operations and they would work. It would be more efficient than using swiotlb-xen. If you recall my presentation from Xen Summit 2020, Xilinx is working on cache coloring. With cache coloring, no domain is 1:1 mapped, not even Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not work as intended. The suggested solution for both these issues is to add a new feature flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use swiotlb-xen because IOMMU translations are available for Dom0. If XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen initialization. I have tested this scheme with and without cache coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it works as expected: DMA operations succeed. What about systems where an IOMMU is present but not all devices are protected? There is no way for Xen to know which devices are protected and which ones are not: devices that do not have the "iommus" property could or could not be DMA masters. 
Perhaps Xen could populate a whitelist of devices protected by the IOMMU based on the "iommus" property. It would require some added complexity in Xen and especially in the swiotlb-xen driver in Linux to use it, which is not ideal. You are trading a bit more complexity in Xen and Linux against a user who may not be able to use the hypervisor on his/her platform without a quirk in Xen (see more below). However, this approach would not work for cache coloring where dom0 is not 1:1 mapped, so swiotlb-xen should not be used either way. Not all Dom0 Linux kernels will be able to work with cache colouring. So you will need a way for the kernel to say "Hey, I can avoid using swiotlb". I fully agree and from my current understanding the condition is “having an iommu”. For these reasons, I would like to propose a single flag XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for DMA translations. In situations where a DMA master is not SMMU protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a platform where an IOMMU is present and protects most DMA masters but leaves out the MMC controller, XENFEAT_ARM_dom0_iommu should not be set (because PV block is not going to work without swiotlb-xen.) This also means that cache coloring won't be usable on such a system (at least not usable with the MMC controller, so the system integrator should take special care when setting up the system). It is worth noting that if we wanted to extend the interface to add a list of protected devices in the future, it would still be possible. It would be compatible with XENFEAT_ARM_dom0_iommu. I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and instead the device-tree list would be used. Is that correct? What do you mean by device tree list here? Sorry, I meant "device list". I was referring to Stefano's suggestion to describe the list of devices protected in the device-tree. How to set XENFEAT_ARM_dom0_iommu? 
We could set XENFEAT_ARM_dom0_iommu automatically when is_iommu_enabled(d) for Dom0. We could also have a platform specific (xen/arch/arm/platforms/) override so that a specific platform can disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced users, it would also be useful to be able to override it via a Xen command line parameter. Platform quirks should be limited to a small set of platforms. In this case, this would not be only per-platform but also per-firmware table, as a developer can decide to remove/add IOMMU nodes in the DT at any time. In addition to that, it means we are introducing a regression for those users, as Xen 4.14 would have worked on their platform but not anymore. They would need to go through all the nodes and find out which one is not protected. I am not sure I understand your point here, as we cannot detect if a device is protected or not by an IOMMU because we do not know which device requires one. That's correct... Could you explain what use case
Re: [for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator
Hi Jan,

On 17/02/2021 15:13, Jan Beulich wrote: On 17.02.2021 15:24, Julien Grall wrote:

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+    /*
+     * There should be not page-tables left allocated by the time the

Nit: s/not/no/ ?

+     * domain is destroyed. Note that arch_iommu_domain_destroy() is
+     * called unconditionally, so pgtables may be unitialized.
+     */
+    ASSERT(dom_iommu(d)->platform_ops == NULL ||
+           page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
      */
     hd->platform_ops->clear_root_pgtable(d);
 
+    /* After this barrier no new page allocations can occur. */
+    spin_barrier(&hd->arch.pgtables.lock);

Didn't patch 2 utilize the call to ->clear_root_pgtable() itself as the barrier? Why introduce another one (with a similar comment) explicitly now?

The barriers act differently: one gates against any IOMMU page-table modification, the other gates against allocation. There is no guarantee that the former will prevent the latter.

@@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
     unmap_domain_page(p);
 
     spin_lock(&hd->arch.pgtables.lock);
-    page_list_add(pg, &hd->arch.pgtables.list);
+    /*
+     * The IOMMU page-tables are freed when relinquishing the domain, but
+     * nothing prevent allocation to happen afterwards. There is no valid
+     * reasons to continue to update the IOMMU page-tables while the
+     * domain is dying.
+     *
+     * So prevent page-table allocation when the domain is dying.
+     *
+     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
+     */
+    if ( likely(!d->is_dying) )
+    {
+        alive = true;
+        page_list_add(pg, &hd->arch.pgtables.list);
+    }
     spin_unlock(&hd->arch.pgtables.lock);
 
+    if ( unlikely(!alive) )
+    {
+        free_domheap_page(pg);
+        pg = NULL;
+    }
+
     return pg;
 }

As before I'm concerned of this forcing error paths to be taken elsewhere, in case an allocation still happens (e.g. from unmap once super page mappings are supported). Considering some of the error handling in the IOMMU code is to invoke domain_crash(), it would be quite unfortunate if we ended up crashing a domain while it is being cleaned up after.

It is unfortunate, but I think this is better than having to leak page tables.

Additionally, the (at present still hypothetical) unmap case, if failing because of the change here, would then again chance to leave mappings in place while the underlying pages get freed. As this would likely require an XSA, the change doesn't feel like "hardening" to me.

I would agree with this if memory allocations could never fail. That's not the case and will become worse as we use the IOMMU pool. Do you have callers in mind that don't check the return value of iommu_unmap()?

Cheers,

--
Julien Grall
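The allocation guard under discussion can be modelled in isolation. The types below are simplified stand-ins for Xen's (a counter replaces the page list, locking via hd->arch.pgtables.lock is elided), so this is a sketch of the pattern, not the real code: once the domain is dying, a freshly allocated page is released immediately rather than tracked, so nothing can appear after iommu_free_pgtables() has drained the list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for Xen's struct domain / pgtables list. */
struct domain {
    bool is_dying;
    unsigned int nr_pgtables;   /* stands in for hd->arch.pgtables.list */
};

/* Model of the hardened iommu_alloc_pgtable(): refuse to track (or hand
 * out) a page-table page once the domain is dying. */
static int *alloc_pgtable(struct domain *d)
{
    int *pg = calloc(1, sizeof(*pg));

    if (!pg)
        return NULL;

    if (d->is_dying) {
        /* Too late: free immediately instead of tracking a page that the
         * relinquish path would never see, which would be a leak. */
        free(pg);
        return NULL;
    }

    d->nr_pgtables++;           /* stands in for page_list_add() */
    return pg;
}
```
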
Re: [for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying
Hi Jan,

On 17/02/2021 15:01, Jan Beulich wrote: On 17.02.2021 15:24, Julien Grall wrote: From: Julien Grall

The new x86 IOMMU page-tables allocator will release the pages when relinquishing the domain resources. However, this is not sufficient when the domain is dying because nothing prevents page-tables from being allocated. Currently, page-table allocations can only happen from iommu_map(). As the domain is dying, there is no good reason to continue to modify the IOMMU page-tables.

While I agree this to be the case right now, I'm not sure it is a good idea to build on it (in that you leave the unmap paths untouched).

I don't build on that assumption. See the next patch.

Imo there's a fair chance this would be overlooked at the point super page mappings get introduced (which has been long overdue), and I thought prior discussion had led to a possible approach without risking use-after-free due to squashed unmap requests.

I know you suggested zapping the root page-tables... However, I don't think this is 4.15 material, and you agree with this (you were the one who pointed that out).

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d)
     /*
      * Pages will be moved to the free list below. So we want to
      * clear the root page-table to avoid any potential use after-free.
+     *
+     * After this call, no more IOMMU mapping can happen.
+     *
      */
     hd->platform_ops->clear_root_pgtable(d);

I.e. you utilize the call in place of spin_barrier(). Maybe worth saying in the comment?

Sure.

Also, nit: Stray blank comment line.

Cheers,

--
Julien Grall
[xen-unstable-smoke test] 159443: tolerable all pass - PUSHED
flight 159443 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159443/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt     15 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          15 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          16 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check    fail   never pass

version targeted for testing:
 xen                  d670ef3401b91d04c58d72cd8ce5579b4fa900d8
baseline version:
 xen                  3b1cc15f1931ba56d0ee256fe9bfe65509733b27

Last test of basis   159416  2021-02-16 15:01:29 Z  1 days
Testing same since   159443  2021-02-17 12:00:28 Z  0 days  1 attempts

People who touched revisions under test:
  Julien Grall

jobs:
 build-arm64-xsm                              pass
 build-amd64                                  pass
 build-armhf                                  pass
 build-amd64-libvirt                          pass
 test-armhf-armhf-xl                          pass
 test-arm64-arm64-xl-xsm                      pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64    pass
 test-amd64-amd64-libvirt                     pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Pushing revision:
 To xenbits.xen.org:/home/xen/git/xen.git
   3b1cc15f19..d670ef3401  d670ef3401b91d04c58d72cd8ce5579b4fa900d8 -> smoke
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
Hi Julien, > On 17 Feb 2021, at 08:50, Julien Grall wrote: > > > > On 17/02/2021 02:00, Stefano Stabellini wrote: >> Hi all, >> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to >> translate addresses for DMA operations in Dom0. Specifically, >> swiotlb-xen is used to translate the address of a foreign page (a page >> belonging to a domU) mapped into Dom0 before using it for DMA. >> This is important because although Dom0 is 1:1 mapped, DomUs are not. On >> systems without an IOMMU swiotlb-xen enables PV drivers to work as long >> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation >> ends up using the MFN, rather than the GFN. >> On systems with an IOMMU, this is not necessary: when a foreign page is >> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation >> is enstablished for both MMU and SMMU. Dom0 could safely use the GFN >> address (instead of the MFN) for DMA operations and they would work. It >> would be more efficient than using swiotlb-xen. >> If you recall my presentation from Xen Summit 2020, Xilinx is working on >> cache coloring. With cache coloring, no domain is 1:1 mapped, not even >> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not >> work as intended. >> The suggested solution for both these issues is to add a new feature >> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use >> swiotlb-xen because IOMMU translations are available for Dom0. If >> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen >> initialization. I have tested this scheme with and without cache >> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it >> works as expected: DMA operations succeed. >> What about systems where an IOMMU is present but not all devices are >> protected? 
>> There is no way for Xen to know which devices are protected and which
>> ones are not: devices that do not have the "iommus" property could or
>> could not be DMA masters.
>>
>> Perhaps Xen could populate a whitelist of devices protected by the IOMMU
>> based on the "iommus" property. It would require some added complexity
>> in Xen and especially in the swiotlb-xen driver in Linux to use it,
>> which is not ideal.
>
> You are trading a bit more complexity in Xen and Linux with a user who may
> not be able to use the hypervisor on his/her platform without a quirk in Xen
> (see more below).
>
>> However, this approach would not work for cache
>> coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
>> used either way.
>
> Not all Dom0 Linux kernels will be able to work with cache colouring. So
> you will need a way for the kernel to say "Hey, I can avoid using swiotlb".

I fully agree and from my current understanding the condition is “having an
iommu”.

>
>> For these reasons, I would like to propose a single flag
>> XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
>> DMA translations. In situations where a DMA master is not SMMU
>> protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
>> platform where an IOMMU is present and protects most DMA masters but it
>> is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
>> not be set (because PV block is not going to work without swiotlb-xen.)
>> This also means that cache coloring won't be usable on such a system (at
>> least not usable with the MMC controller so the system integrator should
>> pay special care to setup the system).
>>
>> It is worth noting that if we wanted to extend the interface to add a
>> list of protected devices in the future, it would still be possible. It
>> would be compatible with XENFEAT_ARM_dom0_iommu.
>
> I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be cleared and
> instead the device-tree list would be used. Is that correct?

What do you mean by device tree list here?

>
>> How to set XENFEAT_ARM_dom0_iommu?
>>
>> We could set XENFEAT_ARM_dom0_iommu automatically when
>> is_iommu_enabled(d) for Dom0. We could also have a platform specific
>> (xen/arch/arm/platforms/) override so that a specific platform can
>> disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
>> users, it would also be useful to be able to override it via a Xen
>> command line parameter.
>
> Platform quirks should be limited to a small set of platforms.
>
> In this case, this would not be only per-platform but also per-firmware table
> as a developer can decide to remove/add IOMMU nodes in the DT at any time.
>
> In addition to that, it means we are introducing a regression for those users
> as Xen 4.14 would have worked on their platform but not anymore. They would
> need to go through all the nodes and find out which one is not protected.

I am not sure I understand your point here as we cannot detect if a device is
protected or not by an IOMMU as we do not know which device requires one.
Could you explain what use case working before
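[Editorial note: the GFN/MFN distinction driving this whole thread can be modeled with a small sketch. This is illustrative Python with invented names, not Xen or Linux code; the "p2m" dictionary and `dma_address` helper are assumptions made up for the example.]

```python
# Toy model of the addressing problem swiotlb-xen solves.
# In a 1:1-mapped Dom0, GFN == MFN for Dom0's own pages, but a foreign
# page mapped from a domU occupies a Dom0 GFN whose MFN is arbitrary.

dom0_p2m = {gfn: gfn for gfn in range(4)}  # 1:1-mapped Dom0 pages
dom0_p2m[4] = 42                           # foreign domU page mapped at GFN 4

def dma_address(gfn, iommu_covers_device):
    """Address a device must be handed to DMA into the page at `gfn`."""
    if iommu_covers_device:
        # An IOMMU translates GFN->MFN in hardware, so the GFN is
        # directly usable as a DMA address.
        return gfn
    # No IOMMU: the device sees machine addresses, so swiotlb-xen must
    # substitute the MFN (or bounce through a machine-contiguous buffer).
    return dom0_p2m[gfn]
```

With an IOMMU the foreign page at GFN 4 can be addressed as 4; without one, the driver must use MFN 42, which is exactly the translation swiotlb-xen performs today.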
Re: Linux PV/PVH domU crash on (guest) resume from suspend
On 17.02.21 14:48, Marek Marczykowski-Górecki wrote: On Wed, Feb 17, 2021 at 07:51:42AM +0100, Jürgen Groß wrote: On 17.02.21 06:12, Marek Marczykowski-Górecki wrote: Hi, I'm observing Linux PV/PVH guest crash when I resume it from sleep. I do this with: virsh -c xen dompmsuspend mem virsh -c xen dompmwakeup But it's possible to trigger it with plain xl too: xl save -c The same on HVM works fine. This is on Xen 4.14.1, and with guest kernel 5.4.90, the same happens with 5.4.98. Dom0 kernel is the same, but I'm not sure if that's relevant here. I can reliably reproduce it. This is already on my list of issues to look at. The problem seems to be related to the XSA-332 patches. You could try the patches I've sent out recently addressing other fallout from XSA-332 which _might_ fix this issue, too: https://patchew.org/Xen/20210211101616.13788-1-jgr...@suse.com/ Thanks for the patches. Sadly it doesn't change anything - I get exactly the same crash. I applied that on top of 5.11-rc7 (that's what I had handy). If you think there may be a difference with the final 5.11 or another branch, please let me know. Okay, thanks for testing. I hope to find some time to look into the issue soon. Juergen
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
Hi Stefano,

> On 17 Feb 2021, at 02:00, Stefano Stabellini wrote:
>
> Hi all,
>
> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
> translate addresses for DMA operations in Dom0. Specifically,
> swiotlb-xen is used to translate the address of a foreign page (a page
> belonging to a domU) mapped into Dom0 before using it for DMA.
>
> This is important because although Dom0 is 1:1 mapped, DomUs are not. On
> systems without an IOMMU swiotlb-xen enables PV drivers to work as long
> as the backends are in Dom0. Thanks to swiotlb-xen, the DMA operation
> ends up using the MFN, rather than the GFN.
>
> On systems with an IOMMU, this is not necessary: when a foreign page is
> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
> is established for both MMU and SMMU. Dom0 could safely use the GFN
> address (instead of the MFN) for DMA operations and they would work. It
> would be more efficient than using swiotlb-xen.
>
> If you recall my presentation from Xen Summit 2020, Xilinx is working on
> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
> work as intended.
>
> The suggested solution for both these issues is to add a new feature
> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
> swiotlb-xen because IOMMU translations are available for Dom0. If
> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
> initialization. I have tested this scheme with and without cache
> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
> works as expected: DMA operations succeed.

Wouldn’t it be easier to name the flag XENFEAT_ARM_swiotlb_needed ?

>
> What about systems where an IOMMU is present but not all devices are
> protected?
>
> There is no way for Xen to know which devices are protected and which
> ones are not: devices that do not have the "iommus" property could or
> could not be DMA masters.
>
> Perhaps Xen could populate a whitelist of devices protected by the IOMMU
> based on the "iommus" property. It would require some added complexity
> in Xen and especially in the swiotlb-xen driver in Linux to use it,
> which is not ideal. However, this approach would not work for cache
> coloring where dom0 is not 1:1 mapped so the swiotlb-xen should not be
> used either way.

Would it be realistic to say that cache coloring cannot be used without an
IOMMU ?

Having a flag for the swiotlb is a good idea because in the current status the
switch to use or not swiotlb is more or less implicit. But somehow there are
use cases where we should simply say “not supported” as we do for example for
passthrough right now. Maybe cache coloring is a case like that.

>
> For these reasons, I would like to propose a single flag
> XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
> DMA translations. In situations where a DMA master is not SMMU
> protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
> platform where an IOMMU is present and protects most DMA masters but it
> is leaving out the MMC controller, then XENFEAT_ARM_dom0_iommu should
> not be set (because PV block is not going to work without swiotlb-xen.)
> This also means that cache coloring won't be usable on such a system (at
> least not usable with the MMC controller so the system integrator should
> pay special care to setup the system).

Any system where you have an IOMMU but not covering all devices is almost
impossible to magically handle smoothly and will require some careful
configuration. Sadly as you stated, we do not have a way to auto-detect such a
case.

>
> It is worth noting that if we wanted to extend the interface to add a
> list of protected devices in the future, it would still be possible.
It
> would be compatible with XENFEAT_ARM_dom0_iommu.
>
> How to set XENFEAT_ARM_dom0_iommu?
>
> We could set XENFEAT_ARM_dom0_iommu automatically when
> is_iommu_enabled(d) for Dom0. We could also have a platform specific
> (xen/arch/arm/platforms/) override so that a specific platform can
> disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
> users, it would also be useful to be able to override it via a Xen
> command line parameter.
>
> See appended patch as a reference.

I really think the naming of the flag will need to be modified.

Cheers
Bertrand

>
> Cheers,
>
> Stefano
>
> diff --git a/xen/common/kernel.c b/xen/common/kernel.c
> index 7a345ae45e..4dbef48199 100644
> --- a/xen/common/kernel.c
> +++ b/xen/common/kernel.c
> @@ -16,6 +16,7 @@
> #include
> #include
> #include
> +#include
> #include
>
> #ifndef COMPAT
> @@ -549,6 +550,9 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>     fi.submap |= 1U << XENFEAT_dom0;
> #ifdef CONFIG_ARM
>     fi.submap |= (1U << XENFEAT_ARM_SMCCC_supported);
> +    if (
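[Editorial note: the `fi.submap |= 1U << XENFEAT_...` pattern in the patch above is a plain bitmap. The sketch below models how a guest could test such a feature bit; the bit positions and the `need_swiotlb_xen` helper are invented for illustration, not the real XENFEAT_* values.]

```python
# Model of a XENFEAT_* feature submap as built by xen_version/XENVER_get_features.
# Bit numbers here are placeholders, not the real public-header values.
XENFEAT_dom0 = 11
XENFEAT_ARM_SMCCC_supported = 14
XENFEAT_ARM_dom0_iommu = 15   # hypothetical new flag proposed in this thread

def build_submap(features):
    """Hypervisor side: OR each advertised feature bit into the submap."""
    submap = 0
    for f in features:
        submap |= 1 << f
    return submap

def xen_feature(submap, feat):
    """Guest side: test one feature bit, like Linux's xen_feature()."""
    return bool(submap & (1 << feat))

def need_swiotlb_xen(submap):
    """A Dom0 kernel would skip swiotlb-xen setup when the flag is set."""
    return not xen_feature(submap, XENFEAT_ARM_dom0_iommu)
```

On a system advertising the new flag the kernel would keep the GFN-based DMA path; without it, swiotlb-xen initialization proceeds as today.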
Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables
On 17.02.2021 16:00, Julien Grall wrote:
> Hi Jan,
>
> On 17/02/2021 14:54, Jan Beulich wrote:
>> On 17.02.2021 15:24, Julien Grall wrote:
>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>> @@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
>>>     struct page_info *pg;
>>>     unsigned int done = 0;
>>>
>>> +    if ( !is_iommu_enabled(d) )
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * Pages will be moved to the free list below. So we want to
>>> +     * clear the root page-table to avoid any potential use after-free.
>>> +     */
>>> +    hd->platform_ops->clear_root_pgtable(d);
>>
>> Taking amd_iommu_alloc_root() as example, is this really correct
>> prior to what is now patch 2?
>
> Yes, there are no more use-after-free...

And this is because of ...? The necessary lock isn't being held here, so on
another CPU allocation of a new root and then of new page tables could happen
before you make enough progress here, and hence it looks to me as if there
might then still be pages which get freed while present in the page tables
(and hence accessible by devices).

Jan

>> What guarantees a new root table
>> won't get allocated subsequently?
>
> It doesn't prevent root table allocation. I view the two as distinct
> issues, hence the two patches.
>
> Cheers,
Re: [for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator
On 17.02.2021 15:24, Julien Grall wrote:
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
>
> void arch_iommu_domain_destroy(struct domain *d)
> {
> +    /*
> +     * There should be not page-tables left allocated by the time the

Nit: s/not/no/ ?

> +     * domain is destroyed. Note that arch_iommu_domain_destroy() is
> +     * called unconditionally, so pgtables may be unitialized.
> +     */
> +    ASSERT(dom_iommu(d)->platform_ops == NULL ||
> +           page_list_empty(&dom_iommu(d)->arch.pgtables.list));
> }
>
> static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> @@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
>      */
>     hd->platform_ops->clear_root_pgtable(d);
>
> +    /* After this barrier no new page allocations can occur. */
> +    spin_barrier(&hd->arch.pgtables.lock);

Didn't patch 2 utilize the call to ->clear_root_pgtable() itself as the
barrier? Why introduce another one (with a similar comment) explicitly now?

> @@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>     unmap_domain_page(p);
>
>     spin_lock(&hd->arch.pgtables.lock);
> -    page_list_add(pg, &hd->arch.pgtables.list);
> +    /*
> +     * The IOMMU page-tables are freed when relinquishing the domain, but
> +     * nothing prevent allocation to happen afterwards. There is no valid
> +     * reasons to continue to update the IOMMU page-tables while the
> +     * domain is dying.
> +     *
> +     * So prevent page-table allocation when the domain is dying.
> +     *
> +     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
> +     */
> +    if ( likely(!d->is_dying) )
> +    {
> +        alive = true;
> +        page_list_add(pg, &hd->arch.pgtables.list);
> +    }
>     spin_unlock(&hd->arch.pgtables.lock);
>
> +    if ( unlikely(!alive) )
> +    {
> +        free_domheap_page(pg);
> +        pg = NULL;
> +    }
> +
>     return pg;
> }

As before I'm concerned of this forcing error paths to be taken elsewhere, in
case an allocation still happens (e.g. from unmap once super page mappings are
supported).
Considering some of the error handling in the IOMMU code is to invoke domain_crash(), it would be quite unfortunate if we ended up crashing a domain while it is being cleaned up after. Additionally, the (at present still hypothetical) unmap case, if failing because of the change here, would then again chance to leave mappings in place while the underlying pages get freed. As this would likely require an XSA, the change doesn't feel like "hardening" to me. Jan
Re: [for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying
On 17.02.2021 15:24, Julien Grall wrote: > From: Julien Grall > > The new x86 IOMMU page-tables allocator will release the pages when > relinquishing the domain resources. However, this is not sufficient > when the domain is dying because nothing prevents page-table to be > allocated. > > Currently page-table allocations can only happen from iommu_map(). As > the domain is dying, there is no good reason to continue to modify the > IOMMU page-tables. While I agree this to be the case right now, I'm not sure it is a good idea to build on it (in that you leave the unmap paths untouched). Imo there's a fair chance this would be overlooked at the point super page mappings get introduced (which has been long overdue), and I thought prior discussion had lead to a possible approach without risking use-after-free due to squashed unmap requests. > --- a/xen/drivers/passthrough/x86/iommu.c > +++ b/xen/drivers/passthrough/x86/iommu.c > @@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d) > /* > * Pages will be moved to the free list below. So we want to > * clear the root page-table to avoid any potential use after-free. > + * > + * After this call, no more IOMMU mapping can happen. > + * > */ > hd->platform_ops->clear_root_pgtable(d); I.e. you utilize the call in place of spin_barrier(). Maybe worth saying in the comment? Also, nit: Stray blank comment line. Jan
Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables
Hi Jan,

On 17/02/2021 14:54, Jan Beulich wrote:
> On 17.02.2021 15:24, Julien Grall wrote:
>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d)
>>     struct page_info *pg;
>>     unsigned int done = 0;
>>
>> +    if ( !is_iommu_enabled(d) )
>> +        return 0;
>> +
>> +    /*
>> +     * Pages will be moved to the free list below. So we want to
>> +     * clear the root page-table to avoid any potential use after-free.
>> +     */
>> +    hd->platform_ops->clear_root_pgtable(d);
>
> Taking amd_iommu_alloc_root() as example, is this really correct
> prior to what is now patch 2?

Yes, there are no more use-after-free...

> What guarantees a new root table
> won't get allocated subsequently?

It doesn't prevent root table allocation. I view the two as distinct issues,
hence the two patches.

Cheers,

--
Julien Grall
Re: [for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables
On 17.02.2021 15:24, Julien Grall wrote: > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c > @@ -381,9 +381,18 @@ static int amd_iommu_assign_device(struct domain *d, u8 > devfn, > return reassign_device(pdev->domain, d, devfn, pdev); > } > > +static void iommu_clear_root_pgtable(struct domain *d) Nit: amd_iommu_ as a prefix would be okay here considering other (static) functions also use it. Since it is a static function, no prefix at all would also do (my personal preference). But iommu_ as a prefix isn't helpful and results in needless re-use of VT-d's name. > --- a/xen/drivers/passthrough/x86/iommu.c > +++ b/xen/drivers/passthrough/x86/iommu.c > @@ -267,6 +267,15 @@ int iommu_free_pgtables(struct domain *d) > struct page_info *pg; > unsigned int done = 0; > > +if ( !is_iommu_enabled(d) ) > +return 0; > + > +/* > + * Pages will be moved to the free list below. So we want to > + * clear the root page-table to avoid any potential use after-free. > + */ > +hd->platform_ops->clear_root_pgtable(d); Taking amd_iommu_alloc_root() as example, is this really correct prior to what is now patch 2? What guarantees a new root table won't get allocated subsequently? Jan
Re: [PATCH v2 7/7] tests/avocado: add boot_xen tests
On 2/11/21 6:19 PM, Alex Bennée wrote: > These tests make sure we can boot the Xen hypervisor with a Dom0 > kernel using the guest-loader. We currently have to use a kernel I > built myself because there are issues using the Debian kernel images. > > Signed-off-by: Alex Bennée > --- > MAINTAINERS | 1 + > tests/acceptance/boot_xen.py | 117 +++ > 2 files changed, 118 insertions(+) > create mode 100644 tests/acceptance/boot_xen.py > diff --git a/tests/acceptance/boot_xen.py b/tests/acceptance/boot_xen.py > new file mode 100644 > index 00..8c7e091d40 > --- /dev/null > +++ b/tests/acceptance/boot_xen.py > @@ -0,0 +1,117 @@ > +# Functional test that boots a Xen hypervisor with a domU kernel and > +# checks the console output is vaguely sane . > +# > +# Copyright (c) 2020 Linaro 2021? Otherwise: Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Hi Jan, On 17/02/2021 13:16, Jan Beulich wrote: On 17.02.2021 12:41, Julien Grall wrote: Hi Jan, On 17/02/2021 11:38, Jan Beulich wrote: On 17.02.2021 12:03, Julien Grall wrote: On 17/02/2021 10:46, Jan Beulich wrote: Mappings for a domain's own pages should already be present in the IOMMU. While installing the same mapping again is merely redundant (and inefficient), removing the mapping when the grant mapping gets removed is outright wrong in this case: The mapping was there before the map, so should remain in place after unmapping. This affects - Arm Dom0 in the direct mapped case, - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases, - all x86 PV DomU-s, including driver domains. Reported-by: Rahul Singh Signed-off-by: Jan Beulich --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1243,7 +1243,7 @@ map_grant_ref( goto undo_out; } -need_iommu = gnttab_need_iommu_mapping(ld); +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld); AFAICT, the owner of the page may not always be rd. So do we want to check against the owner instead? For the DomIO case - specifically not. And the DomCOW case can't happen when an IOMMU is in use. Did I overlook any other cases where the page may not be owned by rd? For the current code, it looks like not. But it feels to me this code is fragile as we are assuming that other cases should never happen. I think it would be worth explaining in a comment and the commit message why check rd rather than the page owner is sufficient. Well, I've added /* * This is deliberately not checking the page's owner: get_paged_frame() * explicitly rejects foreign pages, and all success paths above yield * either owner == rd or owner == dom_io (the dom_cow case is irrelevant * as mem-sharing and IOMMU use are incompatible). The dom_io case would * need checking separately if we compared against owner here. */ to map_grant_ref(), and a reference to this comment to both unmap_common() and the commit message. Will this do? 
LGTM. With that, you can add: Reviewed-by: Julien Grall Cheers, -- Julien Grall
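[Editorial note: the condition agreed in the thread above, `need_iommu = ld != rd && gnttab_need_iommu_mapping(ld)`, can be captured in a tiny model. This is a sketch with invented helper names, not the grant-table code itself.]

```python
# Toy model of the grant-map IOMMU decision: a domain's own pages are
# already present in its IOMMU page-tables, so only a grant coming from a
# *different* domain needs an extra IOMMU (un)map.

def need_iommu_for_grant(ld, rd, need_iommu_mapping):
    # Mirrors: need_iommu = ld != rd && gnttab_need_iommu_mapping(ld)
    return ld != rd and need_iommu_mapping(ld)

# Pretend only domain 0 runs in an "iommu strict" mode (assumption for
# the example; real predicates depend on dom0-strict / PV configuration).
strict_dom0 = lambda d: d == 0
```

Mapping its own grant (ld == rd) never touches the IOMMU, which is exactly the case the patch fixes: unmapping must not tear down a pre-existing mapping.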
[for-4.15][PATCH v3 1/3] xen/iommu: x86: Clear the root page-table before freeing the page-tables
From: Julien Grall

The new per-domain IOMMU page-table allocator will now free the
page-tables when domain's resources are relinquished. However, the
per-domain IOMMU structure will still contain a dangling pointer to the
root page-table.

Xen may access the IOMMU page-tables afterwards at least in the case of
PV domain:

(XEN) Xen call trace:
(XEN)    [] R iommu.c#addr_to_dma_page_maddr+0x12e/0x1d8
(XEN)    [] F iommu.c#intel_iommu_unmap_page+0x5d/0xf8
(XEN)    [] F iommu_unmap+0x9c/0x129
(XEN)    [] F iommu_legacy_unmap+0x26/0x63
(XEN)    [] F mm.c#cleanup_page_mappings+0x139/0x144
(XEN)    [] F put_page+0x4b/0xb3
(XEN)    [] F put_page_from_l1e+0x136/0x13b
(XEN)    [] F devalidate_page+0x256/0x8dc
(XEN)    [] F mm.c#_put_page_type+0x236/0x47e
(XEN)    [] F mm.c#put_pt_page+0x6f/0x80
(XEN)    [] F mm.c#put_page_from_l2e+0x8a/0xcf
(XEN)    [] F devalidate_page+0x3a3/0x8dc
(XEN)    [] F mm.c#_put_page_type+0x236/0x47e
(XEN)    [] F mm.c#put_pt_page+0x6f/0x80
(XEN)    [] F mm.c#put_page_from_l3e+0x8a/0xcf
(XEN)    [] F devalidate_page+0x56c/0x8dc
(XEN)    [] F mm.c#_put_page_type+0x236/0x47e
(XEN)    [] F mm.c#put_pt_page+0x6f/0x80
(XEN)    [] F mm.c#put_page_from_l4e+0x69/0x6d
(XEN)    [] F devalidate_page+0x6a0/0x8dc
(XEN)    [] F mm.c#_put_page_type+0x236/0x47e
(XEN)    [] F put_page_type_preemptible+0x13/0x15
(XEN)    [] F domain.c#relinquish_memory+0x1ff/0x4e9
(XEN)    [] F domain_relinquish_resources+0x2b6/0x36a
(XEN)    [] F domain_kill+0xb8/0x141
(XEN)    [] F do_domctl+0xb6f/0x18e5
(XEN)    [] F pv_hypercall+0x2f0/0x55f
(XEN)    [] F lstar_enter+0x112/0x120

This will result in a use-after-free and possibly a host crash or
memory corruption.

It would not be possible to free the page-tables further down in
domain_relinquish_resources() because cleanup_page_mappings() will only
be called when the last reference on the page is dropped. This may
happen much later if another domain still holds a reference.

After all the PCI devices have been de-assigned, nobody should use the
IOMMU page-tables and it is therefore pointless to try to modify them.
So we can simply clear any reference to the root page-table in the
per-domain IOMMU structure. This requires to introduce a new callback;
the implementation of the method will depend on the IOMMU driver used.

Fixes: 3eef6d07d722 ("x86/iommu: convert VT-d code to use new page table allocator")
Signed-off-by: Julien Grall

---
Changes in v3:
    - Move the patch earlier in the series
    - Reword the commit message

Changes in v2:
    - Introduce clear_root_pgtable()
    - Move the patch later in the series
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 12 +++-
 xen/drivers/passthrough/vtd/iommu.c         | 12 +++-
 xen/drivers/passthrough/x86/iommu.c         |  9 +
 xen/include/xen/iommu.h                     |  1 +
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 42b5a5a9bec4..81add0ba26b4 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -381,9 +381,18 @@ static int amd_iommu_assign_device(struct domain *d, u8 devfn,
     return reassign_device(pdev->domain, d, devfn, pdev);
 }

+static void iommu_clear_root_pgtable(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    spin_lock(&hd->arch.mapping_lock);
+    hd->arch.amd.root_table = NULL;
+    spin_unlock(&hd->arch.mapping_lock);
+}
+
 static void amd_iommu_domain_destroy(struct domain *d)
 {
-    dom_iommu(d)->arch.amd.root_table = NULL;
+    ASSERT(!dom_iommu(d)->arch.amd.root_table);
 }

 static int amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
@@ -565,6 +574,7 @@ static const struct iommu_ops __initconstrel _iommu_ops = {
     .remove_device = amd_iommu_remove_device,
     .assign_device = amd_iommu_assign_device,
     .teardown = amd_iommu_domain_destroy,
+    .clear_root_pgtable = iommu_clear_root_pgtable,
     .map_page = amd_iommu_map_page,
     .unmap_page = amd_iommu_unmap_page,
     .iotlb_flush = amd_iommu_flush_iotlb_pages,
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index d136fe36883b..e1871f6c2bc1 100644
---
a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1726,6 +1726,15 @@ out:
     return ret;
 }

+static void iommu_clear_root_pgtable(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    spin_lock(&hd->arch.mapping_lock);
+    hd->arch.vtd.pgd_maddr = 0;
+    spin_unlock(&hd->arch.mapping_lock);
+}
+
 static void iommu_domain_teardown(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -1740,7 +1749,7 @@ static void iommu_domain_teardown(struct domain *d)
         xfree(mrmrr);
     }

-    hd->arch.vtd.pgd_maddr = 0;
+    ASSERT(!hd->arch.vtd.pgd_maddr);
 }

 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
@@ -2719,6 +2728,7 @@
[for-4.15][PATCH v3 3/3] xen/iommu: x86: Harden the IOMMU page-table allocator
From: Julien Grall

At the moment, we are assuming that only iommu_map() can allocate
IOMMU page-table. Given the complexity of the IOMMU framework, it
would be sensible to have a check closer to the IOMMU allocator. This
would avoid to leak IOMMU page-tables again in the future.

iommu_alloc_pgtable() is now checking if the domain is dying before
adding the page in the list. We are relying on &hd->arch.pgtables.lock
to synchronize d->is_dying.

Take the opportunity to add an ASSERT() in arch_iommu_domain_destroy()
to check if we freed all the IOMMU page tables.

Signed-off-by: Julien Grall

---
Changes in v3:
    - Rename the patch. This was originally "xen/iommu: x86: Don't
      leak the IOMMU page-tables"
    - Rework the commit message
    - Move the patch towards the end of the series

Changes in v2:
    - Rework the approach
    - Move the patch earlier in the series
---
 xen/drivers/passthrough/x86/iommu.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index faa0078db595..a67075f0045d 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)

 void arch_iommu_domain_destroy(struct domain *d)
 {
+    /*
+     * There should be not page-tables left allocated by the time the
+     * domain is destroyed. Note that arch_iommu_domain_destroy() is
+     * called unconditionally, so pgtables may be unitialized.
+     */
+    ASSERT(dom_iommu(d)->platform_ops == NULL ||
+           page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }

 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -279,6 +286,9 @@ int iommu_free_pgtables(struct domain *d)
      */
     hd->platform_ops->clear_root_pgtable(d);

+    /* After this barrier no new page allocations can occur.
 */
+    spin_barrier(&hd->arch.pgtables.lock);
+
     while ( (pg = page_list_remove_head(&hd->arch.pgtables.list)) )
     {
         free_domheap_page(pg);
@@ -296,6 +306,7 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
+    bool alive = false;

 #ifdef CONFIG_NUMA
     if ( hd->node != NUMA_NO_NODE )
@@ -315,9 +326,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
     unmap_domain_page(p);

     spin_lock(&hd->arch.pgtables.lock);
-    page_list_add(pg, &hd->arch.pgtables.list);
+    /*
+     * The IOMMU page-tables are freed when relinquishing the domain, but
+     * nothing prevent allocation to happen afterwards. There is no valid
+     * reasons to continue to update the IOMMU page-tables while the
+     * domain is dying.
+     *
+     * So prevent page-table allocation when the domain is dying.
+     *
+     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
+     */
+    if ( likely(!d->is_dying) )
+    {
+        alive = true;
+        page_list_add(pg, &hd->arch.pgtables.list);
+    }
     spin_unlock(&hd->arch.pgtables.lock);

+    if ( unlikely(!alive) )
+    {
+        free_domheap_page(pg);
+        pg = NULL;
+    }
+
     return pg;
 }
--
2.17.1
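[Editorial note: the locking pattern in this patch — observe `d->is_dying` under the same lock the teardown path barriers on — can be sketched with a small threading model. Illustrative Python only; class and method names are invented and greatly simplified relative to `iommu_alloc_pgtable()`/`iommu_free_pgtables()`.]

```python
import threading

class PgTables:
    """Toy model of the dying-check / teardown ordering in the patch."""
    def __init__(self):
        self.lock = threading.Lock()   # stands in for hd->arch.pgtables.lock
        self.pages = []                # stands in for the pgtables list
        self.dying = False             # stands in for d->is_dying

    def alloc(self):
        pg = object()                  # stand-in for a freshly allocated page
        with self.lock:                # is_dying is only observed under the lock
            if not self.dying:
                self.pages.append(pg)
                return pg
        # Domain is dying: hand the page straight back, never track it,
        # mirroring the free_domheap_page()/pg = NULL path.
        return None

    def free_all(self):
        with self.lock:                # models spin_barrier(): any in-flight
            self.dying = True          # alloc has either enlisted its page or
        freed, self.pages = self.pages, []   # will see dying afterwards
        return len(freed)
```

After `free_all()` returns, no page can be added behind the teardown's back, which is exactly the property the `spin_barrier()` plus lock-protected `is_dying` check is meant to provide.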
[for-4.15][PATCH v3 0/3] xen/iommu: Collection of bug fixes for IOMMU teardown
From: Julien Grall Hi all, This series is a collection of bug fixes for the IOMMU teardown code. All of them are candidate for 4.15 as they can either leak memory or lead to host crash/host corruption. This is sent directly on xen-devel because all the issues were either introduced in 4.15 or happen in the domain creation code. Major changes since v2: - patch #1 "xen/x86: p2m: Don't map the special pages in the IOMMU page-tables" has been removed. This requires Jan's patch [2] to fully mitigate memory leaks. Cheers, [1] https://lore.kernel.org/xen-devel/1b6a411b-84e7-bfb1-647e-511a13df8...@suse.com Julien Grall (3): xen/iommu: x86: Clear the root page-table before freeing the page-tables xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying xen/iommu: x86: Harden the IOMMU page-table allocator xen/drivers/passthrough/amd/iommu_map.c | 13 ++ xen/drivers/passthrough/amd/pci_amd_iommu.c | 12 +- xen/drivers/passthrough/vtd/iommu.c | 25 +++- xen/drivers/passthrough/x86/iommu.c | 45 - xen/include/xen/iommu.h | 1 + 5 files changed, 93 insertions(+), 3 deletions(-) -- 2.17.1
[for-4.15][PATCH v3 2/3] xen/x86: iommu: Ignore IOMMU mapping requests when a domain is dying
From: Julien Grall

The new x86 IOMMU page-tables allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient
when the domain is dying because nothing prevents page-table to be
allocated.

Currently page-table allocations can only happen from iommu_map(). As
the domain is dying, there is no good reason to continue to modify the
IOMMU page-tables.

In order to observe d->is_dying correctly, we need to rely on per-arch
locking, so the check to ignore IOMMU mapping is added on the
per-driver map_page() callback.

Signed-off-by: Julien Grall

---
Changes in v3:
    - Patch added. This is a replacement of "xen/iommu: iommu_map:
      Don't crash the domain if it is dying"
---
 xen/drivers/passthrough/amd/iommu_map.c | 13 +
 xen/drivers/passthrough/vtd/iommu.c     | 13 +
 xen/drivers/passthrough/x86/iommu.c     |  3 +++
 3 files changed, 29 insertions(+)

diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index d3a8b1aec766..ed78a083ba12 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -285,6 +285,19 @@ int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,

     spin_lock(&hd->arch.mapping_lock);

+    /*
+     * IOMMU mapping request can be safely ignored when the domain is dying.
+     *
+     * hd->arch.mapping_lock guarantees that d->is_dying will be observed
+     * before any page tables are freed (see iommu_free_pgtables() and
+     * iommu_clear_root_pgtable()).
+     */
+    if ( d->is_dying )
+    {
+        spin_unlock(&hd->arch.mapping_lock);
+        return 0;
+    }
+
     rc = amd_iommu_alloc_root(d);
     if ( rc )
     {
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index e1871f6c2bc1..239a63f74f64 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1771,6 +1771,19 @@ static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,

     spin_lock(&hd->arch.mapping_lock);

+    /*
+     * IOMMU mapping request can be safely ignored when the domain is dying.
+     *
+     * hd->arch.mapping_lock guarantees that d->is_dying will be observed
+     * before any page tables are freed (see iommu_free_pgtables() and
+     * iommu_clear_root_pgtable()).
+     */
+    if ( d->is_dying )
+    {
+        spin_unlock(&hd->arch.mapping_lock);
+        return 0;
+    }
+
     pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
     if ( !pg_maddr )
     {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index f54fc8093f18..faa0078db595 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -273,6 +273,9 @@ int iommu_free_pgtables(struct domain *d)
     /*
      * Pages will be moved to the free list below. So we want to
      * clear the root page-table to avoid any potential use after-free.
+     *
+     * After this call, no more IOMMU mapping can happen.
+     *
      */
     hd->platform_ops->clear_root_pgtable(d);
--
2.17.1
Re: [for-4.15][PATCH v2 5/5] xen/iommu: x86: Clear the root page-table before freeing the page-tables
Hi Kevin, On 10/02/2021 02:21, Tian, Kevin wrote: From: Julien Grall Sent: Tuesday, February 9, 2021 11:28 PM From: Julien Grall The new per-domain IOMMU page-table allocator will now free the page-tables when the domain's resources are relinquished. However, the root page-table (i.e. hd->arch.pg_maddr) will not be cleared. It's the pointer, not the table itself. I don't think the sentence is incorrect. One can view clearing as either clearing the page table itself or the pointer. Xen may access the IOMMU page-tables afterwards, at least in the case of a PV domain: (XEN) Xen call trace: (XEN)[] R iommu.c#addr_to_dma_page_maddr+0x12e/0x1d8 (XEN)[] F iommu.c#intel_iommu_unmap_page+0x5d/0xf8 (XEN)[] F iommu_unmap+0x9c/0x129 (XEN)[] F iommu_legacy_unmap+0x26/0x63 (XEN)[] F mm.c#cleanup_page_mappings+0x139/0x144 (XEN)[] F put_page+0x4b/0xb3 (XEN)[] F put_page_from_l1e+0x136/0x13b (XEN)[] F devalidate_page+0x256/0x8dc (XEN)[] F mm.c#_put_page_type+0x236/0x47e (XEN)[] F mm.c#put_pt_page+0x6f/0x80 (XEN)[] F mm.c#put_page_from_l2e+0x8a/0xcf (XEN)[] F devalidate_page+0x3a3/0x8dc (XEN)[] F mm.c#_put_page_type+0x236/0x47e (XEN)[] F mm.c#put_pt_page+0x6f/0x80 (XEN)[] F mm.c#put_page_from_l3e+0x8a/0xcf (XEN)[] F devalidate_page+0x56c/0x8dc (XEN)[] F mm.c#_put_page_type+0x236/0x47e (XEN)[] F mm.c#put_pt_page+0x6f/0x80 (XEN)[] F mm.c#put_page_from_l4e+0x69/0x6d (XEN)[] F devalidate_page+0x6a0/0x8dc (XEN)[] F mm.c#_put_page_type+0x236/0x47e (XEN)[] F put_page_type_preemptible+0x13/0x15 (XEN)[] F domain.c#relinquish_memory+0x1ff/0x4e9 (XEN)[] F domain_relinquish_resources+0x2b6/0x36a (XEN)[] F domain_kill+0xb8/0x141 (XEN)[] F do_domctl+0xb6f/0x18e5 (XEN)[] F pv_hypercall+0x2f0/0x55f (XEN)[] F lstar_enter+0x112/0x120 This will result in a use-after-free and possibly a host crash or memory corruption. Freeing the page-tables further down in domain_relinquish_resources() would not work because pages may not be released until later if another domain holds a reference on them. 
Once all the PCI devices have been de-assigned, it is actually pointless to access/modify the IOMMU page-tables. So we can simply clear the root page-table address. Above two paragraphs are confusing to me. I don't know what exact change is proposed until looking over the whole patch. Isn't it clearer to just say "We should clear the root page table pointer in iommu_free_pgtables to avoid use-after-free after all pgtables are moved to the free list"? Your sentence explains the approach taken but not the rationale behind the approach. Both are equally important for future reference. I will try to reword it. Cheers, -- Julien Grall
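The ordering being discussed in this thread — clear the root pointer before the tables are released, so any late lookup fails cleanly instead of walking freed memory — can be sketched in isolation. All names below are hypothetical, and plain malloc/free stands in for Xen's page-table pool:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a root page-table pointer that lookups go through. */
struct demo_iommu {
    unsigned long *root_pgtable;    /* models hd->arch.pg_maddr */
};

/* A lookup that may still be reached (e.g. from a PV put_page() path,
 * as in the call trace above) while the domain is being torn down. */
int demo_lookup(struct demo_iommu *hd)
{
    if (!hd->root_pgtable)
        return -1;                  /* cleanly refused: no use-after-free */
    return (int)hd->root_pgtable[0];
}

/* Teardown: clear the root pointer FIRST, so any later lookup sees NULL
 * instead of dereferencing memory that is about to be freed. */
void demo_free_pgtables(struct demo_iommu *hd)
{
    unsigned long *root = hd->root_pgtable;

    hd->root_pgtable = NULL;        /* after this, lookups fail cleanly */
    free(root);                     /* now the pages can really be released */
}
```

Reversing the two statements in demo_free_pgtables() would reintroduce exactly the window the patch closes: a lookup arriving between the free and the pointer clear would read freed memory.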
Re: Linux PV/PVH domU crash on (guest) resume from suspend
On Wed, Feb 17, 2021 at 07:51:42AM +0100, Jürgen Groß wrote: > On 17.02.21 06:12, Marek Marczykowski-Górecki wrote: > > Hi, > > > > I'm observing Linux PV/PVH guest crash when I resume it from sleep. I do > > this with: > > > > virsh -c xen dompmsuspend mem > > virsh -c xen dompmwakeup > > > > But it's possible to trigger it with plain xl too: > > > > xl save -c > > > > The same on HVM works fine. > > > > This is on Xen 4.14.1, and with guest kernel 5.4.90, the same happens > > with 5.4.98. Dom0 kernel is the same, but I'm not sure if that's > > relevant here. I can reliably reproduce it. > > This is already on my list of issues to look at. > > The problem seems to be related to the XSA-332 patches. You could try > the patches I've sent out recently addressing other fallout from XSA-332 > which _might_ fix this issue, too: > > https://patchew.org/Xen/20210211101616.13788-1-jgr...@suse.com/ Thanks for the patches. Sadly it doesn't change anything - I get exactly the same crash. I applied that on top of 5.11-rc7 (that's what I had handy). If you think there may be a difference with the final 5.11 or another branch, please let me know. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v2 8/8] xen/evtchn: use READ/WRITE_ONCE() for accessing ring indices
On 2021-02-11 10:16, Juergen Gross wrote: > For avoiding read- and write-tearing by the compiler use READ_ONCE() > and WRITE_ONCE() for accessing the ring indices in evtchn.c. > > Signed-off-by: Juergen Gross > --- > V2: > - modify all accesses (Julien Grall) > --- > drivers/xen/evtchn.c | 25 - > 1 file changed, 16 insertions(+), 9 deletions(-) > > diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c > index 421382c73d88..620008f89dbe 100644 > --- a/drivers/xen/evtchn.c > +++ b/drivers/xen/evtchn.c > @@ -162,6 +162,7 @@ static irqreturn_t evtchn_interrupt(int irq, void *data) > { > struct user_evtchn *evtchn = data; > struct per_user_data *u = evtchn->user; > + unsigned int prod, cons; > > WARN(!evtchn->enabled, >"Interrupt for port %u, but apparently not enabled; per-user %p\n", > @@ -171,10 +172,14 @@ static irqreturn_t evtchn_interrupt(int irq, void *data) > > spin_lock(&u->ring_prod_lock); > > - if ((u->ring_prod - u->ring_cons) < u->ring_size) { > - *evtchn_ring_entry(u, u->ring_prod) = evtchn->port; > + prod = READ_ONCE(u->ring_prod); > + cons = READ_ONCE(u->ring_cons); > + > + if ((prod - cons) < u->ring_size) { > + *evtchn_ring_entry(u, prod) = evtchn->port; > smp_wmb(); /* Ensure ring contents visible */ > - if (u->ring_cons == u->ring_prod++) { > + if (cons == prod++) { > + WRITE_ONCE(u->ring_prod, prod); > wake_up_interruptible(&u->evtchn_wait); > kill_fasync(&u->evtchn_async_queue, > SIGIO, POLL_IN); This doesn't work correctly since now u->ring_prod is only updated if cons == prod++. Ross
Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
On 17.02.2021 12:41, Julien Grall wrote: > Hi Jan, > > On 17/02/2021 11:38, Jan Beulich wrote: >> On 17.02.2021 12:03, Julien Grall wrote: >>> On 17/02/2021 10:46, Jan Beulich wrote: Mappings for a domain's own pages should already be present in the IOMMU. While installing the same mapping again is merely redundant (and inefficient), removing the mapping when the grant mapping gets removed is outright wrong in this case: The mapping was there before the map, so should remain in place after unmapping. This affects - Arm Dom0 in the direct mapped case, - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases, - all x86 PV DomU-s, including driver domains. Reported-by: Rahul Singh Signed-off-by: Jan Beulich --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1243,7 +1243,7 @@ map_grant_ref( goto undo_out; } -need_iommu = gnttab_need_iommu_mapping(ld); +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld); >>> >>> AFAICT, the owner of the page may not always be rd. So do we want to >>> check against the owner instead? >> >> For the DomIO case - specifically not. And the DomCOW case can't >> happen when an IOMMU is in use. Did I overlook any other cases >> where the page may not be owned by rd? > > For the current code, it looks like not. But it feels to me this code is > fragile as we are assuming that other cases should never happen. > > I think it would be worth explaining in a comment and the commit message > why check rd rather than the page owner is sufficient. Well, I've added /* * This is deliberately not checking the page's owner: get_paged_frame() * explicitly rejects foreign pages, and all success paths above yield * either owner == rd or owner == dom_io (the dom_cow case is irrelevant * as mem-sharing and IOMMU use are incompatible). The dom_io case would * need checking separately if we compared against owner here. */ to map_grant_ref(), and a reference to this comment to both unmap_common() and the commit message. Will this do? 
Jan
Re: [for-4.15][PATCH v2 4/5] xen/iommu: x86: Don't leak the IOMMU page-tables
On 17.02.2021 12:49, Julien Grall wrote: > On 10/02/2021 16:12, Jan Beulich wrote: >> On 10.02.2021 16:04, Julien Grall wrote: >>> Are you now saying that the following snippet would be fine: >>> >>> if ( d->is_dying ) >>> return 0; >> >> In {amd,intel}_iommu_map_page(), after the lock was acquired >> and with it suitably released, yes. And if that's what you >> suggested, then I'm sorry - I don't think I can see anything >> fragile there anymore. > > Duplicating the check sounds good to me. The checks in said functions are mandatory, and I didn't really have any duplication in mind. But yes, iommu_map() could have an early (but racy) check, if so wanted. You could utilize the same lock, but you'd need to duplicate the checking in {amd,intel}_iommu_map_page(). I'm not entirely certain about the unmap case: It may be possible to also suppress/ignore such requests, but this may require some further thought. >>> >>> I think the unmap part is quite risky to d->is_dying because the PCI >>> devices may not be quiesced and may still be assigned to the domain. >> >> Hmm, yes, good point. Of course upon first unmap with is_dying >> observed set we could zap the root page table, but I don't >> suppose that's something we want to do for 4.15. > > We would still need to zap the root page table in the relinquish path. > So I am not sure what benefits it would give us to zap the page tables > on the first iommu_unmap() after the domain dies. I guess we think of different aspects of the zapping - what I'm suggesting here is for the effect of no translations remaining active anymore. I'm not after freeing the memory at this point; that'll have to happen on the relinquish path, as you say. The problem with allowing individual unmaps to proceed (unlike your plan for maps) is that these, too, can trigger allocations (when a large page needs shattering). Jan
[xen-4.12-testing test] 159418: tolerable FAIL - PUSHED
flight 159418 xen-4.12-testing real [real] http://logs.test-lab.xenproject.org/osstest/logs/159418/ Failures :-/ but no regressions. Tests which did not succeed, but are not blocking: test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 159241 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail like 159241 test-armhf-armhf-libvirt 16 saverestore-support-checkfail like 159241 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 159241 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 159241 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 159241 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 159241 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 159241 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 159241 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 159241 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 159241 test-amd64-i386-xl-pvshim14 guest-start fail never pass test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt 15 migrate-support-checkfail never pass test-amd64-i386-libvirt 15 migrate-support-checkfail never pass test-amd64-i386-libvirt-xsm 15 migrate-support-checkfail never pass test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass 
test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass test-armhf-armhf-libvirt 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass version targeted for testing: xen 4cf5929606adc2fb1ab4e2921c14ba4b8046ecd1 baseline version: xen 8d26cdd3b66ab86d560dacd763d76ff3da95723e Last test of basis 159241 2021-02-11 05:00:05 Z6 days Testing same since 159418 2021-02-16 15:06:11 Z0 days1 attempts People who touched 
revisions under test: Julien Grall jobs: build-amd64-xsm pass build-arm64-xsm pass build-i386-xsm pass build-amd64-xtf
Re: [for-4.15][PATCH v2 4/5] xen/iommu: x86: Don't leak the IOMMU page-tables
Hi Jan, On 10/02/2021 16:12, Jan Beulich wrote: On 10.02.2021 16:04, Julien Grall wrote: On 10/02/2021 14:32, Jan Beulich wrote: On 09.02.2021 16:28, Julien Grall wrote: From: Julien Grall The new IOMMU page-tables allocator will release the pages when relinquishing the domain resources. However, this is not sufficient when the domain is dying because nothing prevents page-tables from being allocated. iommu_alloc_pgtable() is now checking if the domain is dying before adding the page in the list. We are relying on hd->arch.pgtables.lock to synchronize d->is_dying. As said in reply to an earlier patch, I think suppressing (really: ignoring) new mappings would be better. This is exactly what I suggested in v1 but you wrote: "Ignoring requests there seems fragile to me. Paul - what are your thoughts about bailing early from hvm_add_ioreq_gfn() when the domain is dying?" Was this on the thread of this patch? I didn't find such a reply of mine. I need more context here because you name hvm_add_ioreq_gfn() above, while I refer to iommu_map() (and downwards the call stack). See [1]. Are you now saying that the following snippet would be fine: if ( d->is_dying ) return 0; In {amd,intel}_iommu_map_page(), after the lock was acquired and with it suitably released, yes. And if that's what you suggested, then I'm sorry - I don't think I can see anything fragile there anymore. Duplicating the check sounds good to me. You could utilize the same lock, but you'd need to duplicate the checking in {amd,intel}_iommu_map_page(). I'm not entirely certain about the unmap case: It may be possible to also suppress/ignore such requests, but this may require some further thought. I think the unmap part is quite risky to d->is_dying because the PCI devices may not be quiesced and may still be assigned to the domain. Hmm, yes, good point. Of course upon first unmap with is_dying observed set we could zap the root page table, but I don't suppose that's something we want to do for 4.15. 
We would still need to zap the root page table in the relinquish path. So I am not sure what benefits it would give us to zap the page tables on the first iommu_unmap() after the domain dies. Cheers, [1] https://lore.kernel.org/xen-devel/f21f1f61-5213-55a8-320c-43e5fe801...@suse.com/ -- Julien Grall
Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Hi Jan, On 17/02/2021 11:38, Jan Beulich wrote: On 17.02.2021 12:03, Julien Grall wrote: On 17/02/2021 10:46, Jan Beulich wrote: Mappings for a domain's own pages should already be present in the IOMMU. While installing the same mapping again is merely redundant (and inefficient), removing the mapping when the grant mapping gets removed is outright wrong in this case: The mapping was there before the map, so should remain in place after unmapping. This affects - Arm Dom0 in the direct mapped case, - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases, - all x86 PV DomU-s, including driver domains. Reported-by: Rahul Singh Signed-off-by: Jan Beulich --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1243,7 +1243,7 @@ map_grant_ref( goto undo_out; } -need_iommu = gnttab_need_iommu_mapping(ld); +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld); AFAICT, the owner of the page may not always be rd. So do we want to check against the owner instead? For the DomIO case - specifically not. And the DomCOW case can't happen when an IOMMU is in use. Did I overlook any other cases where the page may not be owned by rd? For the current code, it looks like not. But it feels to me this code is fragile as we are assuming that other cases should never happen. I think it would be worth explaining in a comment and the commit message why check rd rather than the page owner is sufficient. Cheers, -- Julien Grall
Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
On 17.02.2021 12:03, Julien Grall wrote: > On 17/02/2021 10:46, Jan Beulich wrote: >> Mappings for a domain's own pages should already be present in the >> IOMMU. While installing the same mapping again is merely redundant (and >> inefficient), removing the mapping when the grant mapping gets removed >> is outright wrong in this case: The mapping was there before the map, so >> should remain in place after unmapping. >> >> This affects >> - Arm Dom0 in the direct mapped case, >> - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases, >> - all x86 PV DomU-s, including driver domains. >> >> Reported-by: Rahul Singh >> Signed-off-by: Jan Beulich >> >> --- a/xen/common/grant_table.c >> +++ b/xen/common/grant_table.c >> @@ -1243,7 +1243,7 @@ map_grant_ref( >> goto undo_out; >> } >> >> -need_iommu = gnttab_need_iommu_mapping(ld); >> +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld); > > AFAICT, the owner of the page may not always be rd. So do we want to > check against the owner instead? For the DomIO case - specifically not. And the DomCOW case can't happen when an IOMMU is in use. Did I overlook any other cases where the page may not be owned by rd? Jan
Re: [PATCH] xen/arm : smmuv3: Fix to handle multiple StreamIds per device.
Hi Rahul, > On 17 Feb 2021, at 10:05, Rahul Singh wrote: > > SMMUv3 driver does not handle multiple StreamId if the master device > supports more than one StreamID. > > This bug was introduced when the driver was ported from Linux to XEN. > dt_device_set_protected(..) should be called from add_device(..) not > from the dt_xlate(..). > > Move dt_device_set_protected(..) from dt_xlate(..) to add_device(). > > Signed-off-by: Rahul Singh Reviewed-by: Bertrand Marquis Thanks a lot, this is fixing issues with multiple stream ids for one device :-) Cheers Bertrand > --- > This patch is a candidate for 4.15 as without this patch it is not possible to > assign multiple StreamIds to the same device when device is protected behind > SMMUv3. > --- > xen/drivers/passthrough/arm/smmu-v3.c | 29 ++- > 1 file changed, 11 insertions(+), 18 deletions(-) > > diff --git a/xen/drivers/passthrough/arm/smmu-v3.c > b/xen/drivers/passthrough/arm/smmu-v3.c > index 914cdc1cf4..53d150cdb6 100644 > --- a/xen/drivers/passthrough/arm/smmu-v3.c > +++ b/xen/drivers/passthrough/arm/smmu-v3.c > @@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device > *dev) >*/ > arm_smmu_enable_pasid(master); > > - return 0; > - > -err_free_master: > - xfree(master); > - dev_iommu_priv_set(dev, NULL); > - return ret; > -} > - > -static int arm_smmu_dt_xlate(struct device *dev, > - const struct dt_phandle_args *args) > -{ > - int ret; > - struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); > - > - ret = iommu_fwspec_add_ids(dev, args->args, 1); > - if (ret) > - return ret; > - > if (dt_device_is_protected(dev_to_dt(dev))) { > dev_err(dev, "Already added to SMMUv3\n"); > return -EEXIST; > @@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev, > dev_name(fwspec->iommu_dev), fwspec->num_ids); > > return 0; > + > +err_free_master: > + xfree(master); > + dev_iommu_priv_set(dev, NULL); > + return ret; > +} > + > +static int arm_smmu_dt_xlate(struct device *dev, > + const struct 
dt_phandle_args *args) > +{ > + return iommu_fwspec_add_ids(dev, args->args, 1); > } > > /* Probing and initialisation functions */ > -- > 2.17.1 >
Re: [for-4.15][PATCH v2 0/5] xen/iommu: Collection of bug fixes for IOMMU teardown
Hi Ian, On 09/02/2021 16:47, Ian Jackson wrote: Julien Grall writes ("[for-4.15][PATCH v2 0/5] xen/iommu: Collection of bug fixes for IOMMU teardown"): From: Julien Grall ... This series is a collection of bug fixes for the IOMMU teardown code. All of them are candidates for 4.15 as they can either leak memory or lead to host crash/host corruption. This is sent directly on xen-devel because all the issues were either introduced in 4.15 or happen in the domain creation code. I think by current freeze rules this does not need a release-ack but for the avoidance of doubt Release-Acked-by: Ian Jackson Thanks! assuming it's committed by the end of the week. I saw you extended the freeze rules by a week. So I will assume that I have until the end of this week (19th February) to commit it. Please let me know if I misunderstood the extension. Cheers, -- Julien Grall
Re: [for-4.15][PATCH v2 2/5] xen/iommu: Check if the IOMMU was initialized before tearing down
Hi Paul, On 09/02/2021 20:22, Paul Durrant wrote: -Original Message- From: Julien Grall Sent: 09 February 2021 15:28 To: xen-devel@lists.xenproject.org Cc: hongy...@amazon.co.uk; i...@xenproject.org; Julien Grall ; Jan Beulich ; Paul Durrant Subject: [for-4.15][PATCH v2 2/5] xen/iommu: Check if the IOMMU was initialized before tearing down From: Julien Grall is_iommu_enabled() will return true even if the IOMMU has not been initialized (e.g. the ops are not set). In the case of an early failure in arch_domain_init(), the function iommu_destroy_domain() will be called even if the IOMMU is not initialized. This will result in dereferencing the ops, which will be NULL, and a host crash. Fix the issue by checking that ops has been set before accessing it. Fixes: 71e617a6b8f6 ("use is_iommu_enabled() where appropriate...") Signed-off-by: Julien Grall Reviewed-by: Paul Durrant Thanks! Ian gave his Release-Acked-by so I will commit this patch now. Cheers, -- Julien Grall
Re: [for-4.15][PATCH v2 1/5] xen/x86: p2m: Don't map the special pages in the IOMMU page-tables
On 15/02/2021 13:14, Jan Beulich wrote: On 15.02.2021 13:54, Julien Grall wrote: On 15/02/2021 12:36, Jan Beulich wrote: On 15.02.2021 12:38, Julien Grall wrote: Given this series needs to go in 4.15 (we would introduce a 0-day otherwise), could you clarify whether your patch [1] is intended to replace this one in 4.15? Yes, that or a cut down variant (simply moving the invocation of set_mmio_p2m_entry()). The more that there the controversy continued regarding the adjustment to p2m_get_iommu_flags(). I did indicate there that I've dropped it for v2. Do you have a link to v2? I would like to try with my series. I didn't post it yet, as I didn't consider the v1 discussion settled so far. The intermediate version I have at present is below. Thanks! I will drop patch #1 from my series and resend it. Cheers, -- Julien Grall
Re: [PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Hi Jan, On 17/02/2021 10:46, Jan Beulich wrote: Mappings for a domain's own pages should already be present in the IOMMU. While installing the same mapping again is merely redundant (and inefficient), removing the mapping when the grant mapping gets removed is outright wrong in this case: The mapping was there before the map, so should remain in place after unmapping. This affects - Arm Dom0 in the direct mapped case, - x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases, - all x86 PV DomU-s, including driver domains. Reported-by: Rahul Singh Signed-off-by: Jan Beulich --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1243,7 +1243,7 @@ map_grant_ref( goto undo_out; } -need_iommu = gnttab_need_iommu_mapping(ld); +need_iommu = ld != rd && gnttab_need_iommu_mapping(ld); AFAICT, the owner of the page may not always be rd. So do we want to check against the owner instead? Cheers, -- Julien Grall
Re: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request
On 17/02/2021 10:02, Jan Beulich wrote: On 02.02.2021 16:14, Jan Beulich wrote: XENMEM_decrease_reservation isn't the only means by which pages can get removed from a guest, yet all removals ought to be signaled to qemu. Put setting of the flag into the central p2m_remove_page() underlying all respective hypercalls as well as a few similar places, mainly in PoD code. Additionally there's no point sending the request for the local domain when the domain acted upon is a different one. The latter domain's ioreq server mapcaches need invalidating. We assume that domain to be paused at the point the operation takes place, so sending the request in this case happens from the hvm_do_resume() path, which as one of its first steps calls handle_hvm_io_completion(). Even without the remote operation aspect a single domain-wide flag doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in parallel. Each of them needs to issue an invalidation request in due course, in particular because exiting to guest context should not happen before the request was actually seen by (all) the emulator(s). Signed-off-by: Jan Beulich --- v2: Preemption related adjustment split off. Make flag per-vCPU. More places to set the flag. Also handle acting on a remote domain. Re-base. Can I get an Arm side ack (or otherwise) please? This looks good for me. Acked-by: Julien Grall Cheers, Thanks, Jan --- a/xen/arch/arm/p2m.c +++ b/xen/arch/arm/p2m.c @@ -759,10 +759,9 @@ static void p2m_free_entry(struct p2m_do * has failed (error case). * So, at worst, the spurious mapcache invalidation might be sent. 
*/ -if ( (p2m->domain == current->domain) && - domain_has_ioreq_server(p2m->domain) && - p2m_is_ram(entry.p2m.type) ) -p2m->domain->mapcache_invalidate = true; +if ( p2m_is_ram(entry.p2m.type) && + domain_has_ioreq_server(p2m->domain) ) +ioreq_request_mapcache_invalidate(p2m->domain); #endif p2m->stats.mappings[level]--; --- a/xen/arch/arm/traps.c +++ b/xen/arch/arm/traps.c @@ -1509,8 +1509,8 @@ static void do_trap_hypercall(struct cpu * Note that sending the invalidation request causes the vCPU to block * until all the IOREQ servers have acknowledged the invalidation. */ -if ( unlikely(curr->domain->mapcache_invalidate) && - test_and_clear_bool(curr->domain->mapcache_invalidate) ) +if ( unlikely(curr->mapcache_invalidate) && + test_and_clear_bool(curr->mapcache_invalidate) ) ioreq_signal_mapcache_invalidate(); #endif } --- a/xen/arch/x86/hvm/hypercall.c +++ b/xen/arch/x86/hvm/hypercall.c @@ -32,7 +32,6 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) { -const struct vcpu *curr = current; long rc; switch ( cmd & MEMOP_CMD_MASK ) @@ -42,14 +41,11 @@ static long hvm_memory_op(int cmd, XEN_G return -ENOSYS; } -if ( !curr->hcall_compat ) +if ( !current->hcall_compat ) rc = do_memory_op(cmd, arg); else rc = compat_memory_op(cmd, arg); -if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation ) -curr->domain->mapcache_invalidate = true; - return rc; } @@ -327,9 +323,11 @@ int hvm_hypercall(struct cpu_user_regs * HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax); -if ( unlikely(currd->mapcache_invalidate) && - test_and_clear_bool(currd->mapcache_invalidate) ) +if ( unlikely(curr->mapcache_invalidate) ) +{ +curr->mapcache_invalidate = false; ioreq_signal_mapcache_invalidate(); +} return curr->hcall_preempted ? 
HVM_HCALL_preempted : HVM_HCALL_completed; } --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -815,6 +816,8 @@ p2m_remove_page(struct p2m_domain *p2m, } } +ioreq_request_mapcache_invalidate(p2m->domain); + return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid, p2m->default_access); } @@ -1301,6 +1304,8 @@ static int set_typed_p2m_entry(struct do ASSERT(mfn_valid(mfn_add(omfn, i))); set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY); } + +ioreq_request_mapcache_invalidate(d); } P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn)); --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -20,6 +20,7 @@ */ #include +#include #include #include #include @@ -647,6 +648,8 @@ p2m_pod_decrease_reservation(struct doma set_gpfn_from_mfn(mfn_x(mfn), INVALID_M2P_ENTRY); p2m_pod_cache_add(p2m, page, cur_order); +ioreq_request_mapcache_invalidate(d); + steal_for_cache = (
[PATCH 3/3] gnttab: GTF_sub_page is a v2-only flag
Prior to its introduction, v1 entries weren't checked for this flag, and the flag also has been meaningless for v1 entries. Therefore it also shouldn't be checked. (The only consistent alternative would be to also check for all currently undefined flags to be clear.) Fixes: b545941b6638 ("Implement sub-page grant support") Signed-off-by: Jan Beulich --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -762,13 +762,11 @@ static int _set_status_v1(const grant_en struct domain *rd, struct active_grant_entry *act, int readonly, - int mapflag, - domid_t ldomid) + domid_t ldomid) { int rc = GNTST_okay; uint32_t *raw_shah = (uint32_t *)shah; union grant_combo scombo; -uint16_t mask = GTF_type_mask; /* * We bound the number of times we retry CMPXCHG on memory locations that @@ -780,11 +778,6 @@ static int _set_status_v1(const grant_en */ int retries = 0; -/* if this is a grant mapping operation we should ensure GTF_sub_page - is not set */ -if ( mapflag ) -mask |= GTF_sub_page; - scombo.raw = ACCESS_ONCE(*raw_shah); /* @@ -798,8 +791,9 @@ static int _set_status_v1(const grant_en union grant_combo prev, new; /* If not already pinned, check the grant domid and type. 
*/ -if ( !act->pin && (((scombo.flags & mask) != GTF_permit_access) || - (scombo.domid != ldomid)) ) +if ( !act->pin && + (((scombo.flags & GTF_type_mask) != GTF_permit_access) || + (scombo.domid != ldomid)) ) PIN_FAIL(done, GNTST_general_error, "Bad flags (%x) or dom (%d); expected d%d\n", scombo.flags, scombo.domid, ldomid); @@ -916,7 +910,7 @@ static int _set_status(const grant_entry { if ( evaluate_nospec(rgt_version == 1) ) -return _set_status_v1(shah, rd, act, readonly, mapflag, ldomid); +return _set_status_v1(shah, rd, act, readonly, ldomid); else return _set_status_v2(shah, status, rd, act, readonly, mapflag, ldomid); } --- a/xen/include/public/grant_table.h +++ b/xen/include/public/grant_table.h @@ -175,7 +175,7 @@ typedef struct grant_entry_v1 grant_entr * mappings of the grant [GST] * GTF_sub_page: Grant access to only a subrange of the page. @domid *will only be allowed to copy from the grant, and not - *map it. [GST, v2] */ #define _GTF_readonly (2) #define GTF_readonly (1U<<_GTF_readonly)
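The mask logic before and after the fix can be illustrated with a small stand-alone sketch. The flag values mirror the public header for the type bits; the bit number used for GTF_sub_page here is an assumption for the sketch, not necessarily the real header value:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Type bits as in xen/include/public/grant_table.h. */
#define GTF_invalid        (0U)
#define GTF_permit_access  (1U)
#define GTF_type_mask      (3U)
/* Hypothetical bit number for the sketch only. */
#define _GTF_sub_page      (9)
#define GTF_sub_page       (1U << _GTF_sub_page)

/* v1 map check after the fix: only the grant type matters; GTF_sub_page
 * is ignored, because it was never defined for v1 entries. */
bool demo_v1_map_ok(uint16_t flags)
{
    return (flags & GTF_type_mask) == GTF_permit_access;
}

/* v2 map check: a sub-page grant may only be copied, never mapped, so
 * the mapping path also requires GTF_sub_page to be clear. */
bool demo_v2_map_ok(uint16_t flags)
{
    uint16_t mask = GTF_type_mask | GTF_sub_page;

    return (flags & mask) == GTF_permit_access;
}
```

The behavioural difference the patch restores: a v1 entry that happens to have the (v1-meaningless) sub-page bit set still maps fine, while the same bit on a v2 entry correctly blocks mapping.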
[PATCH 2/3] gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Mappings for a domain's own pages should already be present in the IOMMU. While installing the same mapping again is merely redundant (and inefficient), removing the mapping when the grant mapping gets removed is outright wrong in this case: The mapping was there before the map, so should remain in place after unmapping.

This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" cases,
- all x86 PV DomU-s, including driver domains.

Reported-by: Rahul Singh
Signed-off-by: Jan Beulich

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1243,7 +1243,7 @@ map_grant_ref(
         goto undo_out;
     }
 
-    need_iommu = gnttab_need_iommu_mapping(ld);
+    need_iommu = ld != rd && gnttab_need_iommu_mapping(ld);
     if ( need_iommu )
     {
         unsigned int kind;
@@ -1493,7 +1493,7 @@ unmap_common(
     if ( put_handle )
         put_maptrack_handle(lgt, op->handle);
 
-    if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) )
+    if ( rc == GNTST_okay && ld != rd && gnttab_need_iommu_mapping(ld) )
     {
         unsigned int kind;
         int err = 0;
[PATCH 1/3] gnttab: never permit mapping transitive grants
Transitive grants allow an intermediate domain I to grant a target domain T access to a page which origin domain O did grant I access to. As an implementation restriction, T is not allowed to map such a grant.

The current attempt at enforcing this restriction is to mark active entries resulting from transitive grants as is-sub-page; sub-page grants for obvious reasons don't allow mapping. However, marking (and checking) only active entries is insufficient, as a map attempt may also occur on a grant not otherwise in use. When not presently in use (pin count zero) the grant type itself needs checking. Otherwise T may be able to map an unrelated page owned by I. This is because the "transitive" sub-structure of the v2 union would end up being interpreted as the "full_page" sub-structure instead. The low 32 bits of the GFN used would match the grant reference specified in I's transitive grant entry, while the upper 32 bits could be random (depending on how exactly I sets up its grant table entries).

Note that if one mapping already exists and the granting domain _then_ changes the grant to GTF_transitive (which the domain is not supposed to do), the changed type will only be honored after the pin count has gone back to zero. This is no different from e.g. GTF_readonly or GTF_sub_page becoming set when a grant is already in use.

While adjusting the implementation, also adjust commentary in the public header to better reflect reality.

Fixes: 3672ce675c93 ("Transitive grant support")
Signed-off-by: Jan Beulich

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -851,9 +851,10 @@ static int _set_status_v2(const grant_en
         mask |= GTF_sub_page;
 
     /* If not already pinned, check the grant domid and type. */
-    if ( !act->pin && ((((scombo.flags & mask) != GTF_permit_access) &&
-                        ((scombo.flags & mask) != GTF_transitive)) ||
-                       (scombo.domid != ldomid)) )
+    if ( !act->pin &&
+         ((((scombo.flags & mask) != GTF_permit_access) &&
+           (mapflag || ((scombo.flags & mask) != GTF_transitive))) ||
+          (scombo.domid != ldomid)) )
         PIN_FAIL(done, GNTST_general_error,
                  "Bad flags (%x) or dom (%d); expected d%d, flags %x\n",
                  scombo.flags, scombo.domid, ldomid, mask);
@@ -879,7 +880,7 @@ static int _set_status_v2(const grant_en
     if ( !act->pin )
     {
         if ( (((scombo.flags & mask) != GTF_permit_access) &&
-              ((scombo.flags & mask) != GTF_transitive)) ||
+              (mapflag || ((scombo.flags & mask) != GTF_transitive))) ||
              (scombo.domid != ldomid) ||
              (!readonly && (scombo.flags & GTF_readonly)) )
         {
--- a/xen/include/public/grant_table.h
+++ b/xen/include/public/grant_table.h
@@ -166,11 +166,13 @@ typedef struct grant_entry_v1 grant_entr
 #define GTF_type_mask       (3U<<0)
 
 /*
- * Subflags for GTF_permit_access.
+ * Subflags for GTF_permit_access and GTF_transitive.
  * GTF_readonly: Restrict @domid to read-only mappings and accesses. [GST]
  * GTF_reading: Grant entry is currently mapped for reading by @domid. [XEN]
  * GTF_writing: Grant entry is currently mapped for writing by @domid. [XEN]
- * GTF_PAT, GTF_PWT, GTF_PCD: (x86) cache attribute flags for the grant [GST]
+ * Further subflags for GTF_permit_access only.
+ * GTF_PAT, GTF_PWT, GTF_PCD: (x86) cache attribute flags to be used for
+ *                            mappings of the grant [GST]
  * GTF_sub_page: Grant access to only a subrange of the page. @domid
  *               will only be allowed to copy from the grant, and not
  *               map it. [GST]
[PATCH 0/3] gnttab: misc fixes
Patches 1 and 2 are clearly intended for 4.15; patch 3 is possibly controversial (along the lines of similar relaxations proposed for other (sub-)hypercalls), and hence unlikely to be a 4.15 candidate.

1: never permit mapping transitive grants
2: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
3: GTF_sub_page is a v2-only flag

Jan
[PATCH] xen/arm: smmuv3: Fix to handle multiple StreamIDs per device.
The SMMUv3 driver does not handle multiple StreamIDs if the master device supports more than one StreamID. This bug was introduced when the driver was ported from Linux to Xen: dt_device_set_protected(..) should be called from add_device(..), not from dt_xlate(..).

Move dt_device_set_protected(..) from dt_xlate(..) to add_device(..).

Signed-off-by: Rahul Singh
---
This patch is a candidate for 4.15, as without it it is not possible to
assign multiple StreamIDs to the same device when the device is
protected behind an SMMUv3.
---
 xen/drivers/passthrough/arm/smmu-v3.c | 29 ++-
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/xen/drivers/passthrough/arm/smmu-v3.c b/xen/drivers/passthrough/arm/smmu-v3.c
index 914cdc1cf4..53d150cdb6 100644
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -2207,24 +2207,6 @@ static int arm_smmu_add_device(u8 devfn, struct device *dev)
 	 */
 	arm_smmu_enable_pasid(master);
 
-	return 0;
-
-err_free_master:
-	xfree(master);
-	dev_iommu_priv_set(dev, NULL);
-	return ret;
-}
-
-static int arm_smmu_dt_xlate(struct device *dev,
-			     const struct dt_phandle_args *args)
-{
-	int ret;
-	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-
-	ret = iommu_fwspec_add_ids(dev, args->args, 1);
-	if (ret)
-		return ret;
-
 	if (dt_device_is_protected(dev_to_dt(dev))) {
 		dev_err(dev, "Already added to SMMUv3\n");
 		return -EEXIST;
@@ -2237,6 +2219,17 @@ static int arm_smmu_dt_xlate(struct device *dev,
 		dev_name(fwspec->iommu_dev), fwspec->num_ids);
 
 	return 0;
+
+err_free_master:
+	xfree(master);
+	dev_iommu_priv_set(dev, NULL);
+	return ret;
+}
+
+static int arm_smmu_dt_xlate(struct device *dev,
+			     const struct dt_phandle_args *args)
+{
+	return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
 /* Probing and initialisation functions */
-- 
2.17.1
Re: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request
On 02.02.2021 16:14, Jan Beulich wrote:
> XENMEM_decrease_reservation isn't the only means by which pages can get
> removed from a guest, yet all removals ought to be signaled to qemu. Put
> setting of the flag into the central p2m_remove_page() underlying all
> respective hypercalls as well as a few similar places, mainly in PoD
> code.
>
> Additionally there's no point sending the request for the local domain
> when the domain acted upon is a different one. The latter domain's ioreq
> server mapcaches need invalidating. We assume that domain to be paused
> at the point the operation takes place, so sending the request in this
> case happens from the hvm_do_resume() path, which as one of its first
> steps calls handle_hvm_io_completion().
>
> Even without the remote operation aspect a single domain-wide flag
> doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
> parallel. Each of them needs to issue an invalidation request in due
> course, in particular because exiting to guest context should not happen
> before the request was actually seen by (all) the emulator(s).
>
> Signed-off-by: Jan Beulich
> ---
> v2: Preemption related adjustment split off. Make flag per-vCPU. More
>     places to set the flag. Also handle acting on a remote domain.
>     Re-base.

Can I get an Arm side ack (or otherwise) please?

Thanks, Jan

> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -759,10 +759,9 @@ static void p2m_free_entry(struct p2m_do
>      * has failed (error case).
>      * So, at worst, the spurious mapcache invalidation might be sent.
>      */
> -    if ( (p2m->domain == current->domain) &&
> -         domain_has_ioreq_server(p2m->domain) &&
> -         p2m_is_ram(entry.p2m.type) )
> -        p2m->domain->mapcache_invalidate = true;
> +    if ( p2m_is_ram(entry.p2m.type) &&
> +         domain_has_ioreq_server(p2m->domain) )
> +        ioreq_request_mapcache_invalidate(p2m->domain);
>  #endif
> 
>      p2m->stats.mappings[level]--;
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1509,8 +1509,8 @@ static void do_trap_hypercall(struct cpu
>      * Note that sending the invalidation request causes the vCPU to block
>      * until all the IOREQ servers have acknowledged the invalidation.
>      */
> -    if ( unlikely(curr->domain->mapcache_invalidate) &&
> -         test_and_clear_bool(curr->domain->mapcache_invalidate) )
> +    if ( unlikely(curr->mapcache_invalidate) &&
> +         test_and_clear_bool(curr->mapcache_invalidate) )
>          ioreq_signal_mapcache_invalidate();
>  #endif
>  }
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -32,7 +32,6 @@
> 
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
> -    const struct vcpu *curr = current;
>      long rc;
> 
>      switch ( cmd & MEMOP_CMD_MASK )
> @@ -42,14 +41,11 @@ static long hvm_memory_op(int cmd, XEN_G
>          return -ENOSYS;
>      }
> 
> -    if ( !curr->hcall_compat )
> +    if ( !current->hcall_compat )
>          rc = do_memory_op(cmd, arg);
>      else
>          rc = compat_memory_op(cmd, arg);
> 
> -    if ( (cmd & MEMOP_CMD_MASK) == XENMEM_decrease_reservation )
> -        curr->domain->mapcache_invalidate = true;
> -
>      return rc;
>  }
> 
> @@ -327,9 +323,11 @@ int hvm_hypercall(struct cpu_user_regs *
> 
>      HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
> 
> -    if ( unlikely(currd->mapcache_invalidate) &&
> -         test_and_clear_bool(currd->mapcache_invalidate) )
> +    if ( unlikely(curr->mapcache_invalidate) )
> +    {
> +        curr->mapcache_invalidate = false;
>          ioreq_signal_mapcache_invalidate();
> +    }
> 
>      return curr->hcall_preempted ? HVM_HCALL_preempted : HVM_HCALL_completed;
>  }
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -28,6 +28,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -815,6 +816,8 @@ p2m_remove_page(struct p2m_domain *p2m,
>          }
>      }
> 
> +    ioreq_request_mapcache_invalidate(p2m->domain);
> +
>      return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid,
>                           p2m->default_access);
>  }
> @@ -1301,6 +1304,8 @@ static int set_typed_p2m_entry(struct do
>              ASSERT(mfn_valid(mfn_add(omfn, i)));
>              set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
>          }
> +
> +        ioreq_request_mapcache_invalidate(d);
>      }
> 
>      P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn));
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -20,6 +20,7 @@
>   */
> 
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -647,6 +648,8 @@ p2m_pod_decrease_reservation(struct doma
>          set_gpfn_from_mfn(mfn_x(mfn), INVALID_M2P_ENTRY);
>          p2m_pod_cache_add(p2m, page, cur_order);
> 
> +
[xen-unstable-coverity test] 159441: all pass - PUSHED
flight 159441 xen-unstable-coverity real [real]
http://logs.test-lab.xenproject.org/osstest/logs/159441/

Perfect :-)
All tests in this flight passed as required

version targeted for testing:
 xen                  3b1cc15f1931ba56d0ee256fe9bfe65509733b27
baseline version:
 xen                  04085ec1ac05a362812e9b0c6b5a8713d7dc88ad

Last test of basis   159344  2021-02-14 09:18:27 Z    3 days
Testing same since   159441  2021-02-17 09:20:56 Z    0 days    1 attempts

People who touched revisions under test:
  Jan Beulich
  Julien Grall

jobs:
 coverity-amd64       pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   04085ec1ac..3b1cc15f19  3b1cc15f1931ba56d0ee256fe9bfe65509733b27 -> coverity-tested/smoke
Re: [PATCH] firmware: don't build hvmloader if it is not needed
On 16.02.2021 19:31, Stefano Stabellini wrote:
> On Mon, 15 Feb 2021, Jan Beulich wrote:
>> On 13.02.2021 14:50, Marek Marczykowski-Górecki wrote:
>>> On Fri, Feb 12, 2021 at 06:05:40PM -0800, Stefano Stabellini wrote:
>>>> If rombios, seabios and ovmf are all disabled, don't attempt to
>>>> build hvmloader.
>>>
>>> What if you choose to not build any of rombios, seabios, ovmf, but use
>>> system one instead? Wouldn't that exclude hvmloader too?
>>
>> Even further - one can disable all firmware and have every guest
>> config explicitly specify the firmware to use, afaict.
>
> I didn't realize there was a valid reason for wanting to build hvmloader
> without rombios, seabios, and ovmf.
>
>>> This heuristic seems like a bit too much, maybe instead add an explicit
>>> option to skip hvmloader?
>>
>> +1 (If making this configurable is needed at all - is having
>> hvmloader without needing it really a problem?)
>
> The hvmloader build fails on Alpine Linux x86:
>
> https://gitlab.com/xen-project/xen/-/jobs/1033722465
>
> I admit I was just trying to find the fastest way to make those Alpine
> Linux builds succeed to unblock patchew: although the Alpine Linux
> builds are marked as "allow_failure: true" in gitlab-ci, patchew will
> still report the whole battery of tests as "failure". As a consequence
> the notification emails from patchew after a build of a contributed
> patch series always say "job failed" today, making it kind of useless.
> See attached.
>
> I would love if somebody else took over this fix as I am doing this
> after hours, but if you have a simple suggestion on how to fix the
> Alpine Linux hvmloader builds, or skip the build when appropriate, I can
> try to follow up.

There is an issue with the definition of uint64_t there.
Initial errors like

hvmloader.c: In function 'init_vm86_tss':
hvmloader.c:202:39: error: left shift count >= width of type [-Werror=shift-count-overflow]
  202 |                  ((uint64_t)TSS_SIZE << 32) | virt_to_phys(tss));

already hint at this, but then

util.c: In function 'get_cpu_mhz':
util.c:824:15: error: conversion from 'long long unsigned int' to 'uint64_t' {aka 'long unsigned int'} changes value from '429496729600' to '0' [-Werror=overflow]
  824 |     cpu_khz = 100ull << 32;

is quite explicit: "aka 'long unsigned int'"? This is a 32-bit environment, after all. I suspect the build picks up headers (stdint.h here in particular) intended for 64-bit builds only. Can you check whether "gcc -m32" properly sets include paths _different_ from those plain "gcc" sets, if the headers found in the latter case aren't suitable for the former? Or alternatively is the Alpine build environment set up incorrectly, in that it lacks 32-bit devel packages?

As an aside I don't think it's really a good idea to have hvmloader depend on any external headers. Just like the hypervisor it's a free-standing binary, and hence ought to be free of any dependencies on the build/host environment.

Jan
Re: Ping: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request
On 17/02/2021 08:30, Jan Beulich wrote:
> Paul (or others), thoughts?
>
> On 04.02.2021 12:24, Jan Beulich wrote:
>> On 04.02.2021 10:26, Paul Durrant wrote:
>>>> From: Jan Beulich
>>>> Sent: 02 February 2021 15:15
>>>>
>>>> XENMEM_decrease_reservation isn't the only means by which pages can get
>>>> removed from a guest, yet all removals ought to be signaled to qemu. Put
>>>> setting of the flag into the central p2m_remove_page() underlying all
>>>> respective hypercalls as well as a few similar places, mainly in PoD
>>>> code.
>>>>
>>>> Additionally there's no point sending the request for the local domain
>>>> when the domain acted upon is a different one. The latter domain's ioreq
>>>> server mapcaches need invalidating. We assume that domain to be paused
>>>> at the point the operation takes place, so sending the request in this
>>>> case happens from the hvm_do_resume() path, which as one of its first
>>>> steps calls handle_hvm_io_completion().
>>>>
>>>> Even without the remote operation aspect a single domain-wide flag
>>>> doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
>>>> parallel. Each of them needs to issue an invalidation request in due
>>>> course, in particular because exiting to guest context should not happen
>>>> before the request was actually seen by (all) the emulator(s).
>>>>
>>>> Signed-off-by: Jan Beulich
>>>> ---
>>>> v2: Preemption related adjustment split off. Make flag per-vCPU. More
>>>>     places to set the flag. Also handle acting on a remote domain.
>>>>     Re-base.
>>>
>>> I'm wondering if a per-vcpu flag is overkill actually. We just need
>>> to make sure that we don't miss sending an invalidation where
>>> multiple vcpus are in play. The mapcache in the emulator is global
>>> so issuing an invalidate for every vcpu is going to cause an
>>> unnecessary storm of ioreqs, isn't it?
>>
>> The only time a truly unnecessary storm would occur is when for
>> an already running guest mapcache invalidation gets triggered
>> by a remote domain. This should be a pretty rare event, so I
>> think the storm in this case ought to be tolerable.
>>
>>> Could we get away with the per-domain atomic counter?
>>
>> Possible, but quite involved afaict: The potential storm above
>> is the price to pay for the simplicity of the model. It is
>> important to note that while we don't need all of the vCPU-s
>> to send these ioreqs, we need all of them to wait for the
>> request(s) to be acked. And this waiting is what we get "for
>> free" using the approach here, whereas we'd need to introduce
>> new logic for this with an atomic counter (afaict at least).
>>
>> Jan

Ok, let's take the patch as-is then.

Reviewed-by: Paul Durrant
Re: [RFC] xen/arm: introduce XENFEAT_ARM_dom0_iommu
On 17/02/2021 02:00, Stefano Stabellini wrote:
> Hi all,
>
> Today Linux uses the swiotlb-xen driver (drivers/xen/swiotlb-xen.c) to
> translate addresses for DMA operations in Dom0. Specifically,
> swiotlb-xen is used to translate the address of a foreign page (a page
> belonging to a domU) mapped into Dom0 before using it for DMA. This is
> important because although Dom0 is 1:1 mapped, DomUs are not.
>
> On systems without an IOMMU swiotlb-xen enables PV drivers to work as
> long as the backends are in Dom0. Thanks to swiotlb-xen, the DMA
> operation ends up using the MFN, rather than the GFN.
>
> On systems with an IOMMU, this is not necessary: when a foreign page is
> mapped in Dom0, it is added to the Dom0 p2m. A new GFN->MFN translation
> is established for both MMU and SMMU. Dom0 could safely use the GFN
> address (instead of the MFN) for DMA operations and they would work. It
> would be more efficient than using swiotlb-xen.
>
> If you recall my presentation from Xen Summit 2020, Xilinx is working on
> cache coloring. With cache coloring, no domain is 1:1 mapped, not even
> Dom0. In a scenario where Dom0 is not 1:1 mapped, swiotlb-xen does not
> work as intended.
>
> The suggested solution for both these issues is to add a new feature
> flag "XENFEAT_ARM_dom0_iommu" that tells Dom0 that it is safe not to use
> swiotlb-xen because IOMMU translations are available for Dom0. If
> XENFEAT_ARM_dom0_iommu is set, Linux should skip the swiotlb-xen
> initialization. I have tested this scheme with and without cache
> coloring (hence with and without 1:1 mapping of Dom0) on ZCU102 and it
> works as expected: DMA operations succeed.
>
> What about systems where an IOMMU is present but not all devices are
> protected? There is no way for Xen to know which devices are protected
> and which ones are not: devices that do not have the "iommus" property
> could or could not be DMA masters.
>
> Perhaps Xen could populate a whitelist of devices protected by the IOMMU
> based on the "iommus" property. It would require some added complexity
> in Xen and especially in the swiotlb-xen driver in Linux to use it,
> which is not ideal.

You are trading a bit more complexity in Xen and Linux for a user who
may not be able to use the hypervisor on his/her platform without a
quirk in Xen (see more below).

> However, this approach would not work for cache coloring where dom0 is
> not 1:1 mapped so the swiotlb-xen should not be used either way.

Not all Dom0 Linux kernels will be able to work with cache colouring. So
you will need a way for the kernel to say "Hey, I can avoid using
swiotlb".

> For these reasons, I would like to propose a single flag
> XENFEAT_ARM_dom0_iommu which says that the IOMMU can be relied upon for
> DMA translations. In situations where a DMA master is not SMMU
> protected, XENFEAT_ARM_dom0_iommu should not be set. For example, on a
> platform where an IOMMU is present and protects most DMA masters but is
> leaving out the MMC controller, XENFEAT_ARM_dom0_iommu should not be
> set (because PV block is not going to work without swiotlb-xen.) This
> also means that cache coloring won't be usable on such a system (at
> least not usable with the MMC controller, so the system integrator
> should pay special care to setup the system).
>
> It is worth noting that if we wanted to extend the interface to add a
> list of protected devices in the future, it would still be possible. It
> would be compatible with XENFEAT_ARM_dom0_iommu.

I imagine by compatible, you mean XENFEAT_ARM_dom0_iommu would be
cleared and instead the device-tree list would be used. Is that correct?

> How to set XENFEAT_ARM_dom0_iommu?
>
> We could set XENFEAT_ARM_dom0_iommu automatically when
> is_iommu_enabled(d) for Dom0. We could also have a platform specific
> (xen/arch/arm/platforms/) override so that a specific platform can
> disable XENFEAT_ARM_dom0_iommu. For debugging purposes and advanced
> users, it would also be useful to be able to override it via a Xen
> command line parameter.

Platform quirks should be limited to a small set of platforms. In this
case, this would not only be per-platform but also per-firmware table,
as a developer can decide to remove/add IOMMU nodes in the DT at any
time.

In addition to that, it means we are introducing a regression for those
users, as Xen 4.14 would have worked on their platform but not anymore.
They would need to go through all the nodes and find out which one is
not protected. This is a bit of a daunting task and we are going to end
up having a lot of per-platform quirks in Xen.

So this approach for selecting the flag is a no-go for me. FAOD, the
inverted idea (i.e. only setting XENFEAT_ARM_dom0_iommu per-platform) is
a no-go as well.

I don't have a good idea how to set the flag automatically. My
requirement is newer Xen should continue to work on all supported
platforms without any additional per-platform effort.

Cheers,

--
Julien Grall
Ping: [PATCH] x86emul: de-duplicate scatters to the same linear address
On 05.02.2021 12:28, Jan Beulich wrote:
> On 05.02.2021 11:41, Andrew Cooper wrote:
>> On 10/11/2020 13:26, Jan Beulich wrote:
>>> The SDM specifically allows for earlier writes to fully overlapping
>>> ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
>>> would crash it if varying data was written to the same address. Detect
>>> overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
>>> be quite a bit more difficult.
>>
>> Are you saying that there is currently a bug if a guest does encode such
>> an instruction, and we emulate it?
>
> That is my take on it, yes.
>
>>> Note that due to cache slot use being linear address based, there's no
>>> similar issue with multiple writes to the same physical address (mapped
>>> through different linear addresses).
>>>
>>> Since this requires an adjustment to the EVEX Disp8 scaling test,
>>> correct a comment there at the same time.
>>>
>>> Signed-off-by: Jan Beulich
>>> ---
>>> TBD: The SDM isn't entirely unambiguous about the faulting behavior in
>>>      this case: If a fault would need delivering on the earlier slot
>>>      despite the write getting squashed, we'd have to call ops->write()
>>>      with size set to zero for the earlier write(s). However,
>>>      hvm/emulate.c's handling of zero-byte accesses extends only to the
>>>      virtual-to-linear address conversions (and raising of involved
>>>      faults), so in order to also observe #PF changes to that logic
>>>      would then also be needed. Can we live with a possible misbehavior
>>>      here?
>>
>> Do you have a chapter/section reference?
>
> The instruction pages. They say in particular
>
> "If two or more destination indices completely overlap, the “earlier”
> write(s) may be skipped."
>
> and
>
> "Faults are delivered in a right-to-left manner. That is, if a fault
> is triggered by an element and delivered ..."
>
> To me this may or may not mean the skipping of indices includes the
> skipping of faults (which a later element then would raise anyway).
Does the above address your concerns / questions? If not, what else do I need to provide? Jan
Ping: [PATCH v2 2/2] IOREQ: refine when to send mapcache invalidation request
Paul (or others), thoughts?

On 04.02.2021 12:24, Jan Beulich wrote:
> On 04.02.2021 10:26, Paul Durrant wrote:
>>> From: Jan Beulich
>>> Sent: 02 February 2021 15:15
>>>
>>> XENMEM_decrease_reservation isn't the only means by which pages can get
>>> removed from a guest, yet all removals ought to be signaled to qemu. Put
>>> setting of the flag into the central p2m_remove_page() underlying all
>>> respective hypercalls as well as a few similar places, mainly in PoD
>>> code.
>>>
>>> Additionally there's no point sending the request for the local domain
>>> when the domain acted upon is a different one. The latter domain's ioreq
>>> server mapcaches need invalidating. We assume that domain to be paused
>>> at the point the operation takes place, so sending the request in this
>>> case happens from the hvm_do_resume() path, which as one of its first
>>> steps calls handle_hvm_io_completion().
>>>
>>> Even without the remote operation aspect a single domain-wide flag
>>> doesn't do: Guests may e.g. decrease-reservation on multiple vCPU-s in
>>> parallel. Each of them needs to issue an invalidation request in due
>>> course, in particular because exiting to guest context should not happen
>>> before the request was actually seen by (all) the emulator(s).
>>>
>>> Signed-off-by: Jan Beulich
>>> ---
>>> v2: Preemption related adjustment split off. Make flag per-vCPU. More
>>>     places to set the flag. Also handle acting on a remote domain.
>>>     Re-base.
>>
>> I'm wondering if a per-vcpu flag is overkill actually. We just need
>> to make sure that we don't miss sending an invalidation where
>> multiple vcpus are in play. The mapcache in the emulator is global
>> so issuing an invalidate for every vcpu is going to cause an
>> unnecessary storm of ioreqs, isn't it?
>
> The only time a truly unnecessary storm would occur is when for
> an already running guest mapcache invalidation gets triggered
> by a remote domain. This should be a pretty rare event, so I
> think the storm in this case ought to be tolerable.
>
>> Could we get away with the per-domain atomic counter?
>
> Possible, but quite involved afaict: The potential storm above
> is the price to pay for the simplicity of the model. It is
> important to note that while we don't need all of the vCPU-s
> to send these ioreqs, we need all of them to wait for the
> request(s) to be acked. And this waiting is what we get "for
> free" using the approach here, whereas we'd need to introduce
> new logic for this with an atomic counter (afaict at least).
>
> Jan
Re: Linux DomU freezes and dies under heavy memory shuffling
On 17.02.21 09:12, Roman Shaposhnik wrote:
> Hi Jürgen, thanks for taking a look at this. A few comments below:
>
> On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß wrote:
>> On 16.02.21 21:34, Stefano Stabellini wrote:
>>> + x86 maintainers
>>>
>>> It looks like the tlbflush is getting stuck?
>>
>> I have seen this case multiple times on customer systems now, but
>> reproducing it reliably seems to be very hard.
>
> It is reliably reproducible under my workload but it takes a long time
> (~3 days of the workload running in the lab).

This is by far the best reproduction rate I have seen up to now. The
next best reproducer seems to be a huge installation with several
hundred hosts and thousands of VMs with about 1 crash each week.

>> I suspected fifo events to be blamed, but just yesterday I've been
>> informed of another case with fifo events disabled in the guest.
>>
>> One common pattern seems to be that up to now I have seen this effect
>> only on systems with Intel Gold cpus. Can it be confirmed to be true
>> in this case, too?
>
> I am pretty sure mine isn't -- I can get you full CPU specs if that's
> useful.

Just the output of "grep model /proc/cpuinfo" should be enough.

>> In case anybody has a reproducer (either in a guest or dom0) with a
>> setup where a diagnostic kernel can be used, I'd be _very_ interested!
>
> I can easily add things to Dom0 and DomU. Whether that will disrupt the
> experiment is, of course, another matter. Still please let me know what
> would be helpful to do.

Is there a chance to switch to an upstream kernel in the guest? I'd like
to add some diagnostic code to the kernel and creating the patches will
be easier this way.

Juergen
Ping: [PATCH v3 0/4] x86/time: calibration rendezvous adjustments
Ian?

On 09.02.2021 14:02, Jan Beulich wrote:
> On 09.02.2021 13:53, Jan Beulich wrote:
>> The middle two patches are meant to address a regression reported on
>> the list under "Problems with APIC on versions 4.9 and later (4.8
>> works)". In the course of analyzing output from a debugging patch I
>> ran into another anomaly again, which I thought I should finally try
>> to address. Hence patch 1. Patch 4 is new in v3 and RFC for now.
>
> Of course this is the kind of change I'd prefer doing early in a
> release cycle. I don't think there are severe risks from patch 1, but
> I'm not going to claim patches 2 and 3 are risk free. They fix booting
> Xen on a system left in rather awkward state by the firmware. And
> they shouldn't affect well behaved modern systems at all (due to
> those using a different rendezvous function). While we've been having
> this issue for years, I also consider this set a backporting
> candidate. Hence I can see reasons pro and con inclusion in 4.15.
>
> Jan
>
>> 1: change initiation of the calibration timer
>> 2: adjust time recording time_calibration_tsc_rendezvous()
>> 3: don't move TSC backwards in time_calibration_tsc_rendezvous()
>> 4: re-arrange struct calibration_rendezvous
>>
>> Jan
[PATCH v2 8/8] x86/PV: use get_unsafe() instead of copy_from_unsafe()
The former expands to a single (memory accessing) insn, which the latter does not guarantee. Yet we'd prefer to read consistent PTEs rather than risking a split read racing with an update done elsewhere.

Signed-off-by: Jan Beulich

--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -41,9 +41,7 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
         return NULL;
 
     /* Find this l1e and its enclosing l1mfn in the linear map. */
-    if ( copy_from_unsafe(&l2e,
-                          &__linear_l2_table[l2_linear_offset(linear)],
-                          sizeof(l2_pgentry_t)) )
+    if ( get_unsafe(l2e, &__linear_l2_table[l2_linear_offset(linear)]) )
         return NULL;
 
     /* Check flags that it will be safe to read the l1e. */
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -22,9 +22,7 @@ static inline l1_pgentry_t guest_get_eff
         toggle_guest_pt(curr);
 
     if ( unlikely(!__addr_ok(linear)) ||
-         copy_from_unsafe(&l1e,
-                          &__linear_l1_table[l1_linear_offset(linear)],
-                          sizeof(l1_pgentry_t)) )
+         get_unsafe(l1e, &__linear_l1_table[l1_linear_offset(linear)]) )
         l1e = l1e_empty();
 
     if ( user_mode )
[PATCH v2 7/8] x86: move stac()/clac() from {get,put}_unsafe_asm() ...
... to {get,put}_unsafe_size(). There's no need to have the macros
expanded once per case label in the latter. This also makes the former
well-formed single statements again.

No change in generated code.

Signed-off-by: Jan Beulich

--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -154,7 +154,6 @@ struct __large_struct { unsigned long bu
  * aliasing issues.
  */
 #define put_unsafe_asm(x, addr, GUARD, err, itype, rtype, ltype, errret) \
-    stac(); \
     __asm__ __volatile__( \
         GUARD( \
         "    guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -169,11 +168,9 @@ struct __large_struct { unsigned long bu
         : [ret] "+r" (err), [ptr] "=&r" (dummy_) \
           GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_)) \
         : [val] ltype (x), "m" (__m(addr)), \
-          "[ptr]" (addr), [errno] "i" (errret)); \
-    clac()
+          "[ptr]" (addr), [errno] "i" (errret))
 
 #define get_unsafe_asm(x, addr, GUARD, err, rtype, ltype, errret) \
-    stac(); \
     __asm__ __volatile__( \
         GUARD( \
         "    guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -190,12 +187,12 @@ struct __large_struct { unsigned long bu
           [ptr] "=&r" (dummy_) \
           GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_)) \
         : "m" (__m(addr)), "[ptr]" (addr), \
-          [errno] "i" (errret)); \
-    clac()
+          [errno] "i" (errret))
 
 #define put_unsafe_size(x, ptr, size, grd, retval, errret) \
 do { \
     retval = 0; \
+    stac(); \
     switch ( size ) \
     { \
     long dummy_; \
@@ -213,6 +210,7 @@ do {
         break; \
     default: __put_user_bad(); \
     } \
+    clac(); \
 } while ( false )
 
 #define put_guest_size(x, ptr, size, retval, errret) \
@@ -221,6 +219,7 @@ do {
 #define get_unsafe_size(x, ptr, size, grd, retval, errret) \
 do { \
     retval = 0; \
+    stac(); \
     switch ( size ) \
     { \
     long dummy_; \
@@ -230,6 +229,7 @@ do {
     case 8: get_unsafe_asm(x, ptr, grd, retval, "", "=r", errret); break; \
     default: __get_user_bad(); \
     } \
+    clac(); \
 } while ( false )
 
 #define get_guest_size(x, ptr, size, retval, errret) \
[PATCH v2 6/8] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()
Bring them (back) in line with __copy_{from,to}_guest_pv(). Since it
falls in the same group, also convert clear_user(). Instead of adjusting
__raw_clear_guest(), drop it - it's unused and would require a
non-checking __clear_guest_pv() which we don't have.

Add previously missing __user at some call sites and in the function
declarations.

Signed-off-by: Jan Beulich

--- a/xen/arch/x86/pv/emul-inv-op.c
+++ b/xen/arch/x86/pv/emul-inv-op.c
@@ -33,7 +33,7 @@ static int emulate_forced_invalid_op(str
     eip = regs->rip;
 
     /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+    if ( (rc = copy_from_guest_pv(sig, (char __user *)eip, sizeof(sig))) != 0 )
     {
         pv_inject_page_fault(0, eip + sizeof(sig) - rc);
         return EXCRET_fault_fixed;
@@ -43,7 +43,8 @@ static int emulate_forced_invalid_op(str
     eip += sizeof(sig);
 
     /* We only emulate CPUID. */
-    if ( (rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+    if ( (rc = copy_from_guest_pv(instr, (char __user *)eip,
+                                  sizeof(instr))) != 0 )
     {
         pv_inject_page_fault(0, eip + sizeof(instr) - rc);
         return EXCRET_fault_fixed;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -54,8 +54,8 @@ unsigned long do_iret(void)
     struct iret_context iret_saved;
     struct vcpu *v = current;
 
-    if ( unlikely(copy_from_user(&iret_saved, (void *)regs->rsp,
-                                 sizeof(iret_saved))) )
+    if ( unlikely(copy_from_guest_pv(&iret_saved, (void __user *)regs->rsp,
+                                     sizeof(iret_saved))) )
     {
         gprintk(XENLOG_ERR,
                 "Fault while reading IRET context from guest stack\n");
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -90,7 +90,8 @@ static int ptwr_emulated_update(unsigned
 
     /* Align address; read full word. */
     addr &= ~(sizeof(full) - 1);
-    if ( (rc = copy_from_user(&full, (void *)addr, sizeof(full))) != 0 )
+    if ( (rc = copy_from_guest_pv(&full, (void __user *)addr,
+                                  sizeof(full))) != 0 )
     {
         x86_emul_pagefault(0, /* Read fault. */
                            addr + sizeof(full) - rc,
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -109,19 +109,17 @@ unsigned int copy_from_guest_ll(void *to
 #if GUARD(1) + 0
 
 /**
- * copy_to_user: - Copy a block of data into user space.
- * @to:   Destination address, in user space.
- * @from: Source address, in kernel space.
+ * copy_to_guest_pv: - Copy a block of data into guest space.
+ * @to:   Destination address, in guest space.
+ * @from: Source address, in hypervisor space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
- *
- * Copy data from kernel space to user space.
+ * Copy data from hypervisor space to guest space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
  */
-unsigned copy_to_user(void __user *to, const void *from, unsigned n)
+unsigned int copy_to_guest_pv(void __user *to, const void *from, unsigned int n)
 {
     if ( access_ok(to, n) )
         n = __copy_to_guest_pv(to, from, n);
@@ -129,16 +127,16 @@ unsigned copy_to_user(void __user *to, c
 }
 
 /**
- * clear_user: - Zero a block of memory in user space.
- * @to: Destination address, in user space.
+ * clear_guest_pv: - Zero a block of memory in guest space.
+ * @to: Destination address, in guest space.
  * @n:  Number of bytes to zero.
  *
- * Zero a block of memory in user space.
+ * Zero a block of memory in guest space.
  *
  * Returns number of bytes that could not be cleared.
  * On success, this will be zero.
  */
-unsigned clear_user(void __user *to, unsigned n)
+unsigned int clear_guest_pv(void __user *to, unsigned int n)
 {
     if ( access_ok(to, n) )
     {
@@ -168,14 +166,12 @@ unsigned clear_user(void __user *to, uns
 }
 
 /**
- * copy_from_user: - Copy a block of data from user space.
- * @to:   Destination address, in kernel space.
- * @from: Source address, in user space.
+ * copy_from_guest_pv: - Copy a block of data from guest space.
+ * @to:   Destination address, in hypervisor space.
+ * @from: Source address, in guest space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
- *
- * Copy data from user space to kernel space.
+ * Copy data from guest space to hypervisor space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
@@ -183,7 +179,8 @@ unsigned clear_user(void __user *to, uns
 * If some data could not be copied, this function will pad the copied
 * data to the requested size using zero bytes.
 */
-unsigned copy_from_user(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_pv(void *to, const void __user *from,
+
[PATCH v2 5/8] x86/gdbsx: convert "user" to "guest" accesses
Using copy_{from,to}_user(), this code was assuming to be called only by
PV guests. Use copy_{from,to}_guest() instead, transforming the incoming
structure field into a guest handle (the field should really have been
one in the first place). Also do not transform the debuggee address into
a pointer.

Signed-off-by: Jan Beulich
---
v2: Re-base (bug fix side effect was taken care of already).

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -108,12 +108,11 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
 }
 
 /* Returns: number of bytes remaining to be copied */
-static unsigned int dbg_rw_guest_mem(struct domain *dp, void * __user gaddr,
-                                     void * __user buf, unsigned int len,
-                                     bool toaddr, uint64_t pgd3)
+static unsigned int dbg_rw_guest_mem(struct domain *dp, unsigned long addr,
+                                     XEN_GUEST_HANDLE_PARAM(void) buf,
+                                     unsigned int len, bool toaddr,
+                                     uint64_t pgd3)
 {
-    unsigned long addr = (unsigned long)gaddr;
-
     while ( len > 0 )
     {
         char *va;
@@ -134,20 +133,18 @@ static unsigned int dbg_rw_guest_mem(str
 
         if ( toaddr )
         {
-            copy_from_user(va, buf, pagecnt);    /* va = buf */
+            copy_from_guest(va, buf, pagecnt);
             paging_mark_dirty(dp, mfn);
         }
         else
-        {
-            copy_to_user(buf, va, pagecnt);    /* buf = va */
-        }
+            copy_to_guest(buf, va, pagecnt);
 
         unmap_domain_page(va);
         if ( !gfn_eq(gfn, INVALID_GFN) )
             put_gfn(dp, gfn_x(gfn));
 
         addr += pagecnt;
-        buf += pagecnt;
+        guest_handle_add_offset(buf, pagecnt);
         len -= pagecnt;
     }
 
@@ -161,7 +158,7 @@ static unsigned int dbg_rw_guest_mem(str
  * pgd3: value of init_mm.pgd[3] in guest. see above.
  * Returns: number of bytes remaining to be copied.
  */
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
                         unsigned int len, domid_t domid, bool toaddr,
                         uint64_t pgd3)
 {
@@ -170,7 +167,7 @@ unsigned int dbg_rw_mem(void * __user ad
     if ( d )
     {
         if ( !d->is_dying )
-            len = dbg_rw_guest_mem(d, addr, buf, len, toaddr, pgd3);
+            len = dbg_rw_guest_mem(d, gva, buf, len, toaddr, pgd3);
         rcu_unlock_domain(d);
     }
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -40,10 +40,8 @@
 #ifdef CONFIG_GDBSX
 static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
 {
-    void * __user gva = (void *)iop->gva, * __user uva = (void *)iop->uva;
-
-    iop->remain = dbg_rw_mem(gva, uva, iop->len, domid,
-                             !!iop->gwr, iop->pgd3val);
+    iop->remain = dbg_rw_mem(iop->gva, guest_handle_from_ptr(iop->uva, void),
+                             iop->len, domid, iop->gwr, iop->pgd3val);
 
     return iop->remain ? -EFAULT : 0;
 }
--- a/xen/include/asm-x86/debugger.h
+++ b/xen/include/asm-x86/debugger.h
@@ -93,9 +93,9 @@ static inline bool debugger_trap_entry(
 #endif
 
 #ifdef CONFIG_GDBSX
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
                         unsigned int len, domid_t domid, bool toaddr,
-                        uint64_t pgd3);
+                        unsigned long pgd3);
 #endif
 
 #endif /* __X86_DEBUGGER_H__ */
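The loop in dbg_rw_guest_mem() advances in at-most-page-sized chunks so each map/copy/unmap step stays within a single page mapping. A minimal stand-alone sketch of that chunking logic (hypothetical names; a plain `memcpy()` stands in for the map + copy + unmap sequence):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (PAGE_SIZE - 1)

/* Copy 'len' bytes starting at (guest) address 'addr', never crossing a
 * page boundary within one step -- the same shape as dbg_rw_guest_mem().
 * Returns the number of bytes remaining (0 on success). */
static unsigned int copy_chunked(uint8_t *dst, const uint8_t *src,
                                 unsigned long addr, unsigned int len)
{
    while ( len > 0 )
    {
        /* Bytes left in the page containing 'addr'. */
        unsigned int pagecnt = PAGE_SIZE - (addr & PAGE_MASK);

        if ( pagecnt > len )
            pagecnt = len;

        memcpy(dst, src, pagecnt);   /* stands in for map + copy + unmap */

        dst  += pagecnt;
        src  += pagecnt;
        addr += pagecnt;
        len  -= pagecnt;
    }

    return len;
}
```

For example, a copy of 8000 bytes starting 100 bytes into a page takes two iterations: 3996 bytes up to the page boundary, then the remaining 4004.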
[PATCH v2 4/8] x86: rename {get,put}_user() to {get,put}_guest()
Bring them (back) in line with __{get,put}_guest().

Signed-off-by: Jan Beulich

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1649,19 +1649,19 @@ static void load_segments(struct vcpu *n
 
         if ( !ring_1(regs) )
         {
-            ret  = put_user(regs->ss,  esp-1);
-            ret |= put_user(regs->esp, esp-2);
+            ret  = put_guest(regs->ss,  esp - 1);
+            ret |= put_guest(regs->esp, esp - 2);
             esp -= 2;
         }
 
         if ( ret |
-             put_user(rflags,      esp-1) |
-             put_user(cs_and_mask, esp-2) |
-             put_user(regs->eip,   esp-3) |
-             put_user(uregs->gs,   esp-4) |
-             put_user(uregs->fs,   esp-5) |
-             put_user(uregs->es,   esp-6) |
-             put_user(uregs->ds,   esp-7) )
+             put_guest(rflags,      esp - 1) |
+             put_guest(cs_and_mask, esp - 2) |
+             put_guest(regs->eip,   esp - 3) |
+             put_guest(uregs->gs,   esp - 4) |
+             put_guest(uregs->fs,   esp - 5) |
+             put_guest(uregs->es,   esp - 6) |
+             put_guest(uregs->ds,   esp - 7) )
         {
             gprintk(XENLOG_ERR,
                     "error while creating compat failsafe callback frame\n");
@@ -1690,17 +1690,17 @@ static void load_segments(struct vcpu *n
         cs_and_mask = (unsigned long)regs->cs |
             ((unsigned long)vcpu_info(n, evtchn_upcall_mask) << 32);
 
-        if ( put_user(regs->ss,    rsp- 1) |
-             put_user(regs->rsp,   rsp- 2) |
-             put_user(rflags,      rsp- 3) |
-             put_user(cs_and_mask, rsp- 4) |
-             put_user(regs->rip,   rsp- 5) |
-             put_user(uregs->gs,   rsp- 6) |
-             put_user(uregs->fs,   rsp- 7) |
-             put_user(uregs->es,   rsp- 8) |
-             put_user(uregs->ds,   rsp- 9) |
-             put_user(regs->r11,   rsp-10) |
-             put_user(regs->rcx,   rsp-11) )
+        if ( put_guest(regs->ss,    rsp -  1) |
+             put_guest(regs->rsp,   rsp -  2) |
+             put_guest(rflags,      rsp -  3) |
+             put_guest(cs_and_mask, rsp -  4) |
+             put_guest(regs->rip,   rsp -  5) |
+             put_guest(uregs->gs,   rsp -  6) |
+             put_guest(uregs->fs,   rsp -  7) |
+             put_guest(uregs->es,   rsp -  8) |
+             put_guest(uregs->ds,   rsp -  9) |
+             put_guest(regs->r11,   rsp - 10) |
+             put_guest(regs->rcx,   rsp - 11) )
         {
             gprintk(XENLOG_ERR,
                     "error while creating failsafe callback frame\n");
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -26,14 +26,12 @@ extern void __put_user_bad(void);
 #define UA_DROP(args...)
 
 /**
- * get_user: - Get a simple variable from user space.
+ * get_guest: - Get a simple variable from guest space.
  * @x:   Variable to store result.
- * @ptr: Source address, in user space.
- *
- * Context: User context only. This function may sleep.
+ * @ptr: Source address, in guest space.
  *
- * This macro copies a single simple variable from user space to kernel
- * space. It supports simple types like char and int, but not larger
+ * This macro loads a single simple variable from guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and the result of
@@ -42,18 +40,15 @@ extern void __put_user_bad(void);
  * Returns zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define get_user(x,ptr) \
-    __get_user_check((x),(ptr),sizeof(*(ptr)))
+#define get_guest(x, ptr) get_guest_check(x, ptr, sizeof(*(ptr)))
 
 /**
- * put_user: - Write a simple value into user space.
- * @x:   Value to copy to user space.
- * @ptr: Destination address, in user space.
- *
- * Context: User context only. This function may sleep.
+ * put_guest: - Write a simple value into guest space.
+ * @x:   Value to store in guest space.
+ * @ptr: Destination address, in guest space.
  *
- * This macro copies a single simple value from kernel space to user
- * space. It supports simple types like char and int, but not larger
+ * This macro stores a single simple value to guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and @x must be assignable
@@ -61,8 +56,8 @@ extern void __put_user_bad(void);
  *
 * Returns zero on success, or -EFAULT on error.
 */
-#define put_user(x,ptr) \
-
[PATCH v2 3/8] x86/PV: harden guest memory accesses against speculative abuse
Inspired by
https://lore.kernel.org/lkml/f12e7d3cecf41b2c29734ea45a393be21d4a8058.1597848273.git.jpoim...@redhat.com/
and prior work in that area of x86 Linux, suppress speculation with
guest-specified pointer values by suitably masking the addresses to
non-canonical space in case they fall into Xen's virtual address range.

Introduce a new Kconfig control.

Note that it is necessary in such code to avoid using "m" kind operands:
If we didn't, there would be no guarantee that the register passed to
guest_access_mask_ptr is also the (base) one used for the memory access.

As a minor unrelated change in get_unsafe_asm() the unnecessary "itype"
parameter gets dropped and the XOR on the fixup path gets changed to be
a 32-bit one in all cases: This way we avoid pointless REX.W or operand
size overrides, or writes to partial registers.

Requested-by: Andrew Cooper
Signed-off-by: Jan Beulich
Reviewed-by: Roger Pau Monné
---
v2: Add comment to assembler macro.
---
The insn sequence chosen is certainly up for discussion; I've picked
this one despite the RCR because alternatives I could come up with, like

	mov $(HYPERVISOR_VIRT_END), %rax
	mov $~0, %rdx
	mov $0x7fffffffffffffff, %rcx
	cmp %rax, %rdi
	cmovb %rcx, %rdx
	and %rdx, %rdi

weren't necessarily better: Either, as above, they are longer and
require a 3rd scratch register, or they also utilize the carry flag in
some similar way.
---
Judging from the comment ahead of put_unsafe_asm() we might as well not
tell gcc at all anymore about the memory access there, now that there's
no use of the operand anymore in the assembly code.
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -10,12 +10,19 @@
 #include <xen/lib.h>
 #include <asm/uaccess.h>
 
-unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
+#ifndef GUARD
+# define GUARD UA_KEEP
+#endif
+
+unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
 {
     unsigned dummy;
 
     stac();
     asm volatile (
+        GUARD(
+        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
+        )
         "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
         "    jbe  1f\n"
         "    mov  %k[to], %[cnt]\n"
@@ -42,6 +49,7 @@ unsigned __copy_to_user_ll(void __user *
         _ASM_EXTABLE(1b, 2b)
         : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
           [aux] "=&r" (dummy)
+          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
         : "[aux]" (n)
         : "memory" );
     clac();
@@ -49,12 +57,15 @@ unsigned __copy_to_user_ll(void __user *
     return n;
 }
 
-unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n)
 {
     unsigned dummy;
 
     stac();
     asm volatile (
+        GUARD(
+        "    guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n"
+        )
         "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
         "    jbe  1f\n"
         "    mov  %k[to], %[cnt]\n"
@@ -87,6 +98,7 @@ unsigned __copy_from_user_ll(void *to, c
         _ASM_EXTABLE(1b, 6b)
         : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
           [aux] "=&r" (dummy)
+          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
         : "[aux]" (n)
         : "memory" );
     clac();
@@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
     return n;
 }
 
+#if GUARD(1) + 0
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
@@ -128,8 +142,11 @@ unsigned clear_user(void __user *to, uns
 {
     if ( access_ok(to, n) )
     {
+        long dummy;
+
         stac();
         asm volatile (
+            "    guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n"
            "0:  rep stos"__OS"\n"
            "    mov  %[bytes], %[cnt]\n"
            "1:  rep stosb\n"
@@ -140,7 +157,8 @@ unsigned clear_user(void __user *to, uns
            ".previous\n"
            _ASM_EXTABLE(0b,3b)
            _ASM_EXTABLE(1b,2b)
-            : [cnt] "=&c" (n), [to] "+D" (to)
+            : [cnt] "=&c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy),
+              [scratch2] "=&r" (dummy)
            : [bytes] "r" (n & (BYTES_PER_LONG - 1)),
              [longs] "0" (n / BYTES_PER_LONG), "a" (0) );
         clac();
@@ -174,6 +192,16 @@ unsigned copy_from_user(void *to, const
     return n;
 }
 
+# undef GUARD
+# define GUARD UA_DROP
+# define copy_to_guest_ll copy_to_unsafe_ll
+# define copy_from_guest_ll copy_from_unsafe_ll
+# undef __user
+# define __user
+# include __FILE__
+
+#endif /* GUARD(1) */
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -458,6 +458,8 @@ UNLIKELY_START(g, create_bounce_frame_ba
         jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(create_bounce_frame_bad_sp)
 
+        guest_access_mask_ptr
[PATCH v2 2/8] x86: split __copy_{from,to}_user() into "guest" and "unsafe" variants
The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are intended to be
used in order to access addresses not (directly) under guest control,
within Xen's part of the virtual address space. Subsequently we will
want them to have distinct behavior, so as a first step identify which
one is which. For now, both groups of constructs alias one another.

Double underscore prefixes are retained only on
__copy_{from,to}_guest_pv(), to allow still distinguishing them from
their "checking" counterparts once those also get renamed (to
copy_{from,to}_guest_pv()).

Add previously missing __user at some call sites.

Signed-off-by: Jan Beulich
Reviewed-by: Tim Deegan [shadow]
Reviewed-by: Roger Pau Monné
---
Instead of __copy_{from,to}_guest_pv(), perhaps name them just
__copy_{from,to}_pv()?

--- a/xen/arch/x86/gdbstub.c
+++ b/xen/arch/x86/gdbstub.c
@@ -33,13 +33,13 @@ gdb_arch_signal_num(struct cpu_user_regs
 unsigned int
 gdb_arch_copy_from_user(void *dest, const void *src, unsigned len)
 {
-    return __copy_from_user(dest, src, len);
+    return copy_from_unsafe(dest, src, len);
 }
 
 unsigned int
 gdb_arch_copy_to_user(void *dest, const void *src, unsigned len)
 {
-    return __copy_to_user(dest, src, len);
+    return copy_to_unsafe(dest, src, len);
 }
 
 void
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2614,7 +2614,7 @@ static int sh_page_fault(struct vcpu *v,
     {
         shadow_l2e_t sl2e;
         mfn_t gl1mfn;
-        if ( (__copy_from_user(&sl2e,
+        if ( (copy_from_unsafe(&sl2e,
                                (sh_linear_l2_table(v)
                                 + shadow_l2_linear_offset(va)),
                                sizeof(sl2e)) != 0)
@@ -2633,7 +2633,7 @@ static int sh_page_fault(struct vcpu *v,
 #endif /* SHOPT_OUT_OF_SYNC */
     /* The only reasons for reserved bits to be set in shadow entries
      * are the two "magic" shadow_l1e entries. */
-    if ( likely((__copy_from_user(&sl1e,
+    if ( likely((copy_from_unsafe(&sl1e,
                                   (sh_linear_l1_table(v)
                                    + shadow_l1_linear_offset(va)),
                                   sizeof(sl1e)) == 0)
@@ -3308,10 +3308,10 @@ static bool sh_invlpg(struct vcpu *v, un
                       sh_linear_l4_table(v)[shadow_l4_linear_offset(linear)])
           & _PAGE_PRESENT) )
         return false;
-    /* This must still be a copy-from-user because we don't have the
+    /* This must still be a copy-from-unsafe because we don't have the
      * paging lock, and the higher-level shadows might disappear
      * under our feet. */
-    if ( __copy_from_user(&sl3e, (sh_linear_l3_table(v)
+    if ( copy_from_unsafe(&sl3e, (sh_linear_l3_table(v)
                                   + shadow_l3_linear_offset(linear)),
                           sizeof (sl3e)) != 0 )
     {
@@ -3330,9 +3330,9 @@ static bool sh_invlpg(struct vcpu *v, un
         return false;
 #endif
 
-    /* This must still be a copy-from-user because we don't have the shadow
+    /* This must still be a copy-from-unsafe because we don't have the shadow
      * lock, and the higher-level shadows might disappear under our feet. */
-    if ( __copy_from_user(&sl2e,
+    if ( copy_from_unsafe(&sl2e,
                           sh_linear_l2_table(v) + shadow_l2_linear_offset(linear),
                           sizeof (sl2e)) != 0 )
     {
@@ -3371,11 +3371,11 @@ static bool sh_invlpg(struct vcpu *v, un
              * hold the paging lock yet. Check again with the lock held. */
             paging_lock(d);
 
-            /* This must still be a copy-from-user because we didn't
+            /* This must still be a copy-from-unsafe because we didn't
              * have the paging lock last time we checked, and the
              * higher-level shadows might have disappeared under our
              * feet. */
-            if ( __copy_from_user(&sl2e,
+            if ( copy_from_unsafe(&sl2e,
                                   sh_linear_l2_table(v)
                                   + shadow_l2_linear_offset(linear),
                                   sizeof (sl2e)) != 0 )
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -149,12 +149,12 @@ static int read_mem(enum x86_segment seg
         addr = (uint32_t)addr;
 
-    if ( (rc = __copy_from_user(p_data, (void *)addr, bytes)) )
+    if ( (rc = __copy_from_guest_pv(p_data, (void __user *)addr, bytes)) )
     {
         /*
          * TODO: This should report PFEC_insn_fetch when goc->insn_fetch &&
          *       cpu_has_nx, but we'd then need a "fetch" variant of
-         *       __copy_from_user() respecting NX, SMEP, and protection keys.
+         *       __copy_from_guest_pv() respecting NX, SMEP, and protection keys.
          */
         x86_emul_pagefault(0, addr + bytes - rc, ctxt);
         return
[PATCH v2 1/8] x86: split __{get,put}_user() into "guest" and "unsafe" variants
The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are intended to be
used in order to access addresses not (directly) under guest control,
within Xen's part of the virtual address space. (For linear page table
and descriptor table accesses the low bits of the addresses may still be
guest controlled, but this still won't allow speculation to "escape"
into unwanted areas.) Subsequently we will want them to have distinct
behavior, so as a first step identify which one is which. For now, both
groups of constructs alias one another.

Double underscore prefixes are retained only on __{get,put}_guest(), to
allow still distinguishing them from their "checking" counterparts once
those also get renamed (to {get,put}_guest()). Since for them it's
almost a full re-write, move what becomes {get,put}_unsafe_size() into
the "common" uaccess.h (x86_64/*.h should disappear at some point
anyway).

In __copy_to_user() one of the two casts in each put_guest_size()
invocation gets dropped. They're not needed and did break symmetry with
__copy_from_user().

Signed-off-by: Jan Beulich
Reviewed-by: Tim Deegan [shadow]
Reviewed-by: Roger Pau Monné
---
v2: Use __get_guest() in {,compat_}show_guest_stack().

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -776,9 +776,9 @@ shadow_write_entries(void *d, void *s, i
     /* Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
      * no write access through the linear map.
-     * We detect that by writing to the shadow with __put_user() and
+     * We detect that by writing to the shadow with put_unsafe() and
      * using map_domain_page() to get a writeable mapping if we need to. */
-    if ( __put_user(*dst, dst) )
+    if ( put_unsafe(*dst, dst) )
     {
         perfc_incr(shadow_linear_map_failed);
         map = map_domain_page(mfn);
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -40,7 +40,7 @@ static int read_gate_descriptor(unsigned
          ((gate_sel >> 3) + !is_pv_32bit_vcpu(v) >=
           (gate_sel & 4 ? v->arch.pv.ldt_ents
                         : v->arch.pv.gdt_ents)) ||
-         __get_user(desc, pdesc) )
+         get_unsafe(desc, pdesc) )
         return 0;
 
     *sel = (desc.a >> 16) & 0xfffc;
@@ -59,7 +59,7 @@ static int read_gate_descriptor(unsigned
     {
         if ( (*ar & 0x1f00) != 0x0c00 ||
              /* Limit check done above already. */
-             __get_user(desc, pdesc + 1) ||
+             get_unsafe(desc, pdesc + 1) ||
              (desc.b & 0x1f00) )
             return 0;
 
@@ -294,7 +294,7 @@ void pv_emulate_gate_op(struct cpu_user_
     {                                           \
         --stkp;                                 \
         esp -= 4;                               \
-        rc = __put_user(item, stkp);            \
+        rc = __put_guest(item, stkp);           \
         if ( rc )                               \
         {                                       \
             pv_inject_page_fault(PFEC_write_access, \
@@ -362,7 +362,7 @@ void pv_emulate_gate_op(struct cpu_user_
                 unsigned int parm;
 
                 --ustkp;
-                rc = __get_user(parm, ustkp);
+                rc = __get_guest(parm, ustkp);
                 if ( rc )
                 {
                     pv_inject_page_fault(0, (unsigned long)(ustkp + 1) - rc);
--- a/xen/arch/x86/pv/emulate.c
+++ b/xen/arch/x86/pv/emulate.c
@@ -34,13 +34,13 @@ int pv_emul_read_descriptor(unsigned int
     if ( sel < 4 ||
          /*
          * Don't apply the GDT limit here, as the selector may be a Xen
-          * provided one. __get_user() will fail (without taking further
+          * provided one. get_unsafe() will fail (without taking further
          * action) for ones falling in the gap between guest populated
          * and Xen ones.
          */
          ((sel & 4) && (sel >> 3) >= v->arch.pv.ldt_ents) )
         desc.b = desc.a = 0;
-    else if ( __get_user(desc, gdt_ldt_desc_ptr(sel)) )
+    else if ( get_unsafe(desc, gdt_ldt_desc_ptr(sel)) )
         return 0;
     if ( !insn_fetch )
         desc.b &= ~_SEGMENT_L;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -114,15 +114,15 @@ unsigned int compat_iret(void)
     regs->rsp = (u32)regs->rsp;
 
     /* Restore EAX (clobbered by hypercall). */
-    if ( unlikely(__get_user(regs->eax, (u32 *)regs->rsp)) )
+    if ( unlikely(__get_guest(regs->eax, (u32 *)regs->rsp)) )
     {
         domain_crash(v->domain);
         return 0;
     }
 
     /* Restore CS and EIP. */
-    if ( unlikely(__get_user(regs->eip, (u32 *)regs->rsp + 1)) ||
-         unlikely(__get_user(regs->cs, (u32 *)regs->rsp + 2)) )
+    if ( unlikely(__get_guest(regs->eip, (u32 *)regs->rsp + 1)) ||
+         unlikely(__get_guest(regs->cs, (u32 *)regs->rsp + 2)) )
     {
         domain_crash(v->domain);
         return 0;
@@ -132,7 +132,7 @@ unsigned int compat_iret(void)
     /*
      * Fix up and restore EFLAGS. We
[PATCH v2 0/8] x86/PV: avoid speculation abuse through guest accessors
Re-sending primarily for the purpose of getting a release ack, an
explicit release nak, or an indication of there not being a need, all
for at least the first three patches here (which are otherwise ready to
go in). I've dropped the shadow part of the series from this
re-submission, because it has all been reviewed by Tim already and is
intended for 4.16 only anyway. I'm re-including the follow-up patches
getting the code base into consistent shape again, as I continue to
think this consistency goal is at least worth a consideration towards a
freeze exception.

1: split __{get,put}_user() into "guest" and "unsafe" variants
2: split __copy_{from,to}_user() into "guest" and "unsafe" variants
3: PV: harden guest memory accesses against speculative abuse
4: rename {get,put}_user() to {get,put}_guest()
5: gdbsx: convert "user" to "guest" accesses
6: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()
7: move stac()/clac() from {get,put}_unsafe_asm() ...
8: PV: use get_unsafe() instead of copy_from_unsafe()

Jan
Re: Linux DomU freezes and dies under heavy memory shuffling
Hi Jürgen,

thanks for taking a look at this. A few comments below:

On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß wrote:
>
> On 16.02.21 21:34, Stefano Stabellini wrote:
> > + x86 maintainers
> >
> > It looks like the tlbflush is getting stuck?
>
> I have seen this case multiple times on customer systems now, but
> reproducing it reliably seems to be very hard.

It is reliably reproducible under my workload, but it takes a long time
(~3 days of the workload running in the lab).

> I suspected fifo events to be blamed, but just yesterday I've been
> informed of another case with fifo events disabled in the guest.
>
> One common pattern seems to be that up to now I have seen this effect
> only on systems with Intel Gold cpus. Can it be confirmed to be true
> in this case, too?

I am pretty sure mine isn't -- I can get you full CPU specs if that's
useful.

> In case anybody has a reproducer (either in a guest or dom0) with a
> setup where a diagnostic kernel can be used, I'd be _very_ interested!

I can easily add things to Dom0 and DomU. Whether that will disrupt the
experiment is, of course, another matter. Still, please let me know what
would be helpful to do.

Thanks,
Roman.