[ovmf test] 185941: all pass - PUSHED

2024-05-07 Thread osstest service owner
flight 185941 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185941/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 2727231b0a6fb4c043479d132df4d36cf9f751c2
baseline version:
 ovmf 987bea6525d70cd01649472c93d19f89d41d83a2

Last test of basis   185937  2024-05-07 07:43:05 Z    0 days
Testing same since   185941  2024-05-08 01:57:46 Z    0 days    1 attempts


People who touched revisions under test:
  Gerd Hoffmann 
  Jiaxin Wu 
  Jiewen Yao 
  Ray Ni 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvops pass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   987bea6525..2727231b0a  2727231b0a6fb4c043479d132df4d36cf9f751c2 -> xen-tested-master



[xen-unstable test] 185940: tolerable FAIL - PUSHED

2024-05-07 Thread osstest service owner
flight 185940 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185940/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 185936
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 185936
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 185936
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 185936
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail like 185936
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 185936
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185936
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-qcow2 15 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-raw 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-check fail never pass

version targeted for testing:
 xen  a2330b51df267e20e66bbba6c5bf08f0570ed58b
baseline version:
 xen  ebab808eb1bb8f24c7d0dd41b956e48cb1824b81

Last test of basis   185936  2024-05-07 05:53:32 Z    0 days
Testing same since   185940  2024-05-07 20:08:36 Z    0 days    1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Roger Pau Monné 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64-xtf  pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 

Re: [PATCH 05/15] tools/libs/light: Increase nr_spi to 160

2024-05-07 Thread Henry Wang

Hi Julien,

On 5/7/2024 10:35 PM, Julien Grall wrote:

Hi,

On 06/05/2024 06:17, Henry Wang wrote:

On 5/1/2024 9:58 PM, Anthony PERARD wrote:

On Wed, Apr 24, 2024 at 11:34:39AM +0800, Henry Wang wrote:
Increase number of spi to 160 i.e. gic_number_lines() for Xilinx 
ZynqMP - 32.

This was done to allocate and assign IRQs to a running domain.

Signed-off-by: Vikram Garhwal 
Signed-off-by: Stefano Stabellini 
Signed-off-by: Henry Wang 
---
  tools/libs/light/libxl_arm.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index dd5c9f4917..50dbd0f2a9 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -181,7 +181,8 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,

   LOG(DEBUG, "Configure the domain");
-    config->arch.nr_spis = nr_spis;
+    /* gic_number_lines() is 192 for Xilinx ZynqMP. min nr_spis = 192 - 32. */
+    config->arch.nr_spis = MAX(nr_spis, 160);

Is there a way that that Xen or libxl could find out what the minimum
number of SPI needs to be?


I am afraid currently there is none.


Are we going to have to increase that minimum
number every time a new platform comes along?

It doesn't appear that libxl is using that `nr_spis` value and it is
probably just given to Xen. So my guess is that Xen could simply take
care of the minimum value, gic_number_lines() seems to be a Xen
function.


Xen will take care of the value of nr_spis for dom0 in create_dom0():
dom0_cfg.arch.nr_spis = min(gic_number_lines(), (unsigned int) 992) - 32;

and also for dom0less domUs in create_domUs().

However, it looks like Xen will not take care of the minimum value 
for libxl guests; the value from config->arch.nr_spis in the guest config 
file will be directly passed to the domain_vgic_init() function from 
arch_domain_create().


I agree with you that we shouldn't just bump the number every time 
we have a new platform. Therefore, would it be a good idea to 
move the logic in this patch to arch_sanitise_domain_config()?


Xen domains are supposed to be platform agnostic and therefore the 
number of SPIs should not be based on the HW.


Furthermore, with your proposal we would end up allocating data 
structures for N SPIs when a domain may never need any SPIs (such as 
when passthrough is not in use). This is more likely for domains created 
by the toolstack than from Xen directly.


Agreed on both comments.

Instead, we should introduce a new XL configuration option to let the user 
decide the number of SPIs. I would suggest naming it "nr_spis" to match 
the DT bindings.


Sure, I will introduce a new xl config for this to replace this patch. 
Thank you for the suggestion.
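
For illustration, a hypothetical guest config fragment using such an 
option ("nr_spis" is only the name suggested above; the option does not 
exist yet, so this is an assumption about how it might look):

    # hypothetical xl cfg fragment, assuming the suggested option lands
    name = "domu-fpga"
    nr_spis = 160   # size the domain's vGIC for devices assigned after boot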


Kind regards,
Henry



Cheers,






Re: [PATCH v6 8/8] xen: allow up to 16383 cpus

2024-05-07 Thread Stefano Stabellini
On Tue, 7 May 2024, Julien Grall wrote:
> Hi Stefano,
> 
> On 03/05/2024 20:07, Stefano Stabellini wrote:
> > On Fri, 3 May 2024, Julien Grall wrote:
> 
> [...]
> 
> > > So are you saying that from Xen point of view, you are expecting no
> > > difference
> > > between 256 and 512. And therefore you would be happy if to backport
> > > patches
> > > if someone find differences (or even security issues) when using > 256
> > > pCPUs?
> > 
> > It is difficult to be sure about anything that is not regularly
> > tested. I am pretty sure someone in the community got Xen running on an
> > Ampere, so like you said 192 is a good number. However, that is not
> > regularly tested, so we don't have any regression checks in gitlab-ci or
> > OSSTest for it.
> > 
> > One approach would be to only support things regularly tested either by
> > OSSTest, Gitlab-ci, or also Xen community members. I am not sure what
> > would be the highest number with this way of thinking but likely no
> > more than 192, probably less. I don't know the CPU core count of the
> > biggest ARM machine in OSSTest.
> 
> This would be rochester* (Cavium Thunder-X). They have 96 pCPUs which, IIRC,
> are split across two numa nodes.
> 
> > 
> > Another approach is to support a "sensible" number: not something tested
> > but something we believe should work. No regular testing. (In safety,
> > they only believe in things that are actually tested, so this would not
> > be OK. But this is security, not safety, just FYI.) With this approach,
> > we could round up the number to a limit we think won't break. If 192
> > works, 256/512 should work? I don't know, but I couldn't think of anything
> > that would break going from 192 to 256.
> 
> It depends what you mean by work/break. Strictly speaking, Xen should run
> (i.e. not crash). However, it is unclear how well: as you increase the
> number of physical CPUs, you will increase contention and may hit some
> bottleneck.
> 
> I haven't done any performance testing with that many CPUs and I haven't seen
> any so far with Xen. But I have some areas of concerns.
> 
> * Xenstored: At least the C version is single-threaded. Technically the limit
> here is not based on the number of pCPUs, but as you increase it, you
> indirectly increase the number of domains that can run. I doubt it will behave
> well if you have 4096 domains running (I am thinking about the x86 limit...).
> 
> * Locking
>   * How Xen uses the locks: I don't think we have many places where we have
> global locks (one is the memory subsystem). If a lock is already taken, the
> others will spin. It is unclear whether they could become highly contended.
>   * How Xen implements the locks: At the moment, we are using LL/SC. My take
> from XSA-295 is that there is a lack of fairness with them. I am not sure what
> would happen if they get contended (as we support more pCPUs). It is also
> probably time to finally implement LSE atomics (see the sketch after this list).
> 
> * TLB flush: TLB flushes are broadcast. There are some suggestions on the
> Linux ML [1] that they don't perform well on some processors. The discussion
> seems to have gone nowhere in Linux. But I think it is probably worth taking
> into account when we decide to update the limit we (security) support.
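
As background for the LL/SC vs LSE point, a minimal self-contained sketch 
of the two approaches to an atomic increment on AArch64 (illustrative 
only, not Xen's implementation; the LSE variant assumes ARMv8.1 atomics, 
e.g. compiling with -march=armv8.1-a):

    /* LL/SC: a load-exclusive/store-exclusive retry loop.  Any intervening
     * write clears the exclusive monitor, so under heavy contention some
     * CPUs can retry indefinitely -- the fairness concern mentioned above. */
    static inline void atomic_inc_llsc(int *v)
    {
        int tmp, res;
        asm volatile("1: ldxr  %w0, [%2]\n"
                     "   add   %w0, %w0, #1\n"
                     "   stxr  %w1, %w0, [%2]\n"   /* may fail */
                     "   cbnz  %w1, 1b\n"          /* retry on failure */
                     : "=&r" (tmp), "=&r" (res)
                     : "r" (v)
                     : "memory");
    }

    /* LSE: a single far atomic; the interconnect arbitrates between CPUs,
     * which is typically fairer under contention. */
    static inline void atomic_inc_lse(int *v)
    {
        int one = 1;
        asm volatile("stadd %w0, [%1]" :: "r" (one), "r" (v) : "memory");
    }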
> 
> > 
> > It depends on how strict we want to be on testing requirements.
> From above, I am rather worried about claiming that Xen can support up to 256
> (and TBH even 192) without any proper testing. This could end up backfiring,
> as we may need to do (in a rush) and backport some rather large work (unless
> we decide to remove support after the fact).

I agree with everything you said and I would also add that it is not just
about backports: if we "support" something, it is supposed to mean that
we strongly believe it is working. I think we should only make that
claim if we regularly test that configuration/feature.


> I think I would prefer if we have a low number until someone can do some
> testing (including potentially malicious guests). If we want a
> power of two, I would go with 128 because this is closer to the HW we have in
> testing. If in the future someone can show some data on other platforms (e.g.
> Ampere), then we can raise the limit.

I am OK with that. I wonder if we could use QEMU to add a test for this.


> > I am not
> > sure what approach was taken by x86 so far.
> 
> It is unclear to me. I don't see how we can claim to support up to 4096 CPUs.
> But that's for the x86 folks to decide.

Until not long ago, many things were "supported" in many Open Source
projects (including Linux, QEMU, etc.) without any automated tests at
all. Maybe it is time to revisit this practice.



Re: [PATCH 02/15] xen/arm/gic: Enable interrupt assignment to running VM

2024-05-07 Thread Julien Grall

Hi Henry,

On 06/05/2024 09:32, Henry Wang wrote:

On 5/1/2024 4:13 AM, Julien Grall wrote:

Hi Henry,

On 30/04/2024 04:50, Henry Wang wrote:

On 4/25/2024 10:28 PM, Julien Grall wrote:
Thanks for your feedback. After checking the b8577547236f commit 
message I think I now understand your point. Do you have any 
suggestion about how I can properly add support to route/remove 
the IRQ to/from running domains? Thanks.


I spent some time going through the GIC/vGIC code and had some 
discussions with Stefano and Stewart during the last couple of days, 
let me see if I can describe the use case properly now to continue 
the discussion:


We have some use cases that require assigning devices to domains 
after domain boot time. For example, suppose there is an FPGA on the 
board which can simulate a device, and the bitstream for the FPGA is 
provided and programmed after domain boot. So we need a way to assign 
the device to the running domain. This series tries to implement this 
use case by using device tree overlays - users can first add the 
overlay to the Xen dtb, assign the device in the overlay to a domain by 
the xl command, then apply the overlay to Linux.


Thanks for the description! This helps to understand your goal :).


Thank you very much for spending your time on discussing this and 
providing these valuable comments!




I haven't really looked at that code in quite a while. I think we need 
to make sure that the virtual and physical IRQ states match at the 
time we do the routing.


I am undecided on whether we want to simply prevent the action to 
happen or try to reset the state.


There is also the question of what to do if the guest is enabling 
the vIRQ before it is routed.


Sorry to bother you, but would you mind elaborating a bit more on the 
two cases that you mentioned above? Commit b8577547236f ("xen/arm: 
Restrict when a physical IRQ can be routed/removed from/to a domain") 
only said there would be undesirable effects, so I am not sure I 
understand the concerns raised above and the consequences of these 
two use cases.


I will try to explain them below after I answer the rest.

I am probably wrong, but I think when we add the overlay, we are 
probably fine, as the interrupt was not being used before. 


What if the DT overlay is unloaded and then reloaded? Wouldn't the 
same interrupt be re-used? As a more generic case, this could also be 
a new bitstream for the FPGA.


But even if the interrupt is brand new every time for the DT overlay, 
you are effectively relaxing the check for every user (such as 
XEN_DOMCTL_bind_pt_irq). So the interrupt re-use case needs to be 
taken into account.


I agree. IIUC, with your explanation here and below, could we 
simplify the problem to how to properly handle the removal of the IRQ 
from a running guest? If we always properly remove and clean up the 
information when removing the IRQ from the guest, the IRQ can 
always be viewed as a brand new one when we add it back.


If we can make sure the virtual IRQ and physical IRQ are cleaned, then yes.

Then the only 
corner case that we need to take care of would be...


Can you clarify whether you say the "only corner case" because you 
looked at the code? Or is it just because I mentioned only one?




Also since we only load the device driver after the IRQ is routed to 
the guest, 


This is what a well-behaved guest will do. However, we need to think 
about what will happen if a guest misbehaves. I am not concerned about a 
guest only impacting itself; I am more concerned about the case where 
the rest of the system is impacted.



I am not sure the guest can enable the vIRQ before it is routed.


Xen allows the guest to enable a vIRQ even if there is no pIRQ 
assigned. Thankfully, it looks like vgic_connect_hw_irq(), in 
both the current and new vGIC, will return an error if we are trying 
to route a pIRQ to an already enabled vIRQ.
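
As a minimal self-contained model of that guard (stand-in types and field 
names; the real check lives in vgic_connect_hw_irq() and uses test_bit() 
on the pending_irq status word under the vGIC lock):

    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct irq_desc;                    /* opaque stand-in */

    struct pending_irq {                /* cut-down stand-in */
        struct irq_desc *desc;          /* non-NULL once a pIRQ is connected */
        bool guest_enabled;             /* stand-in for GIC_IRQ_GUEST_ENABLED */
    };

    /* Connecting a pIRQ fails if the vIRQ is already enabled or in use. */
    static int connect_hw_irq(struct pending_irq *p, struct irq_desc *desc)
    {
        if ( p->desc || p->guest_enabled )
            return -EBUSY;
        p->desc = desc;
        return 0;
    }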


But we need to investigate all the possible scenarios to make sure 
that any inconsistencies between the physical state and virtual state 
(including the LRs) will not result in bigger problems.


The one that comes to my mind is: The physical interrupt is 
de-assigned from the guest before it was EOIed. In this case, the 
interrupt will still be in the LR with the HW bit set. This would 
allow the guest to EOI the interrupt even if it is routed to someone 
else. It is unclear what would be the impact on the other guest.


...same as this case, i.e.
test_bit(_IRQ_INPROGRESS, &desc->status) || !test_bit(_IRQ_DISABLED, 
&desc->status), when we try to remove the IRQ from a running domain.


We already call ->shutdown() which will disable the IRQ. So don't we 
only need to take care of _IRQ_INPROGRESS?


[...]

we have 3 possible states which can be read from the LR for this case: 
active, pending, and pending-and-active.
- I don't think we can do anything about the active state, so we should 
return -EBUSY and reject the whole operation of removing the IRQ from 
running 

[xen-unstable-smoke test] 185939: tolerable all pass - PUSHED

2024-05-07 Thread osstest service owner
flight 185939 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185939/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass

version targeted for testing:
 xen  a2330b51df267e20e66bbba6c5bf08f0570ed58b
baseline version:
 xen  ebab808eb1bb8f24c7d0dd41b956e48cb1824b81

Last test of basis   185926  2024-05-06 13:02:07 Z    1 days
Testing same since   185939  2024-05-07 17:03:58 Z    0 days    1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Roger Pau Monné 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   ebab808eb1..a2330b51df  a2330b51df267e20e66bbba6c5bf08f0570ed58b -> smoke



Re: [PATCH v3 6/9] xen/arm64: bpi: Add missing code symbol annotations

2024-05-07 Thread Julien Grall




On 07/05/2024 17:55, Edgar E. Iglesias wrote:

On Tue, May 7, 2024 at 11:57 AM Julien Grall  wrote:
Hi Julien,


Hi Edgar,



The reason I chose FUNC for the start of the symbol is because these
symbols contain executable code (not only a table of pointers to code
somewhere else) and the ELF spec says that STT_FUNC means the symbol
contains functions or other executable code (not only callable
functions, IIUC):

"STT_FUNC The symbol is associated with a function or other executable code."
https://refspecs.linuxbase.org/elf/elf.pdf
(Symbol Table 1-20).


Thanks for the pointer. I originally did intend to suggest the change, 
but then I saw the use of LABEL in x86 (such as svm_stgi_label). There 
are a few other examples with LABEL_LOCAL.


AFAICT, this is also executable code, with the only difference being that 
it is not meant to be called by someone else. Furthermore, LABEL is using 
DO_CODE_ALIGN(...) for the alignment, which implies that it is intended to 
be used by executable code. So I thought the only difference was whether 
the label was intended to be used as a function.




I think using LABEL instead of GLOBAL for the _end labels of these
code sequences makes sense.
I'm happy to change the _start labels to LABEL too if you guys feel
that's better.


I have to admit I am a little confused about the difference between LABEL 
vs FUNC. I think I will need some guidance from Jan (he introduced 
linkage.h).


Cheers,

--
Julien Grall



Re: [PATCH v4 15/17] xen: mapcache: Remove assumption of RAMBlock with 0 offset

2024-05-07 Thread Edgar E. Iglesias
On Thu, May 2, 2024 at 10:02 PM Stefano Stabellini
 wrote:
>
> On Thu, 2 May 2024, Edgar E. Iglesias wrote:
> > On Thu, May 2, 2024 at 8:53 PM Stefano Stabellini
> >  wrote:
> > >
> > > +Xenia
> > >
> > > On Thu, 2 May 2024, Edgar E. Iglesias wrote:
> > > > On Wed, May 1, 2024 at 11:24 PM Stefano Stabellini
> > > >  wrote:
> > > > >
> > > > > On Tue, 30 Apr 2024, Edgar E. Iglesias wrote:
> > > > > > From: "Edgar E. Iglesias" 
> > > > > >
> > > > > > The current mapcache assumes that all memory is mapped
> > > > > > in a single RAM MR (the first one with offset 0). Remove
> > > > > > this assumption and propagate the offset to the mapcache
> > > > > > so it can do reverse mappings (from hostptr -> ram_addr).
> > > > > >
> > > > > > This is in preparation for adding grant mappings.
> > > > > >
> > > > > > Signed-off-by: Edgar E. Iglesias 
> > > > >
> > > > >
> > > > > Looking at xen_remap_bucket, it is only using address_index (without
> > > > > adding ram_offset) to map foreign memory. From xen_remap_bucket, I 
> > > > > would
> > > > > understand that address_index already includes the ram_offset.
> > > > >
> > > > > Meaning that if we want to map foreign mapping at address 0x5000, then
> > > > > address_index would be 0x5000, even if ram_offset is 0x1000.
> > > > >
> > > > > But then looking xen_ram_addr_from_mapcache_single ram_offset is added
> > > > > to paddr_index to calculate the physical address. So in that case we
> > > > > would want address_index to be 0x4000 and ram_offset to be 0x1000. But
> > > > > xen_remap_bucket would have to sum address_index and ram_offset to map
> > > > > foreign memory.
> > > > >
> > > > > So I am a bit confused, did I get it wrong? One more comment below.
> > > > >
> > > >
> > > > Thanks Stefano,
> > > >
> > > > I think the confusion is that this ram_addr_offset is not related to
> > > > guest address-space.
> > > > It's a QEMU internal thing and it shouldn't be included in the address
> > > > used to map foreign memory.
> > > > The mapcache can treat this ram_addr offset like a cookie that we keep
> > > > around to be able to do
> > > > reverse mappings from host pointers into ram_addr space
> > > > (xen_ram_addr_from_mapcache).
> > > >
> > > > The current mapcache implementation works because we've really only
> > > > been using foreign mappings on RAMBlocks with offset 0. We're also
> > > > creating RAMs such that the offset into the RAM is also the guest
> > > > physical address. For x86 this is natural since RAM starts at zero
> > > > (for lowmem), but for ARM we're creating larger-than-needed RAMs
> > > > (GUEST_RAM0_BASE + ram-size) to make this assumption true. Anyway,
> > > > in this series I'm not addressing this second assumption.
> > >
> > > Let's see if I understand correctly.
> > >
> > > The ram_addr space is an internal QEMU address space which is different
> > > from the guest physical address space and thus cannot and should not be
> > > used to do foreign mappings (foreign mapping hypercalls take a guest
> > > physical or a real physical address to map). Is that correct?
> > >
> > > If so, then I understand.
> > >
> >
> > Yes, that matches my understanding.
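
To make the "cookie" idea concrete, a minimal self-contained sketch of 
the hostptr -> ram_addr reverse lookup (stand-in types and an assumed 
bucket size; the real logic is xen_ram_addr_from_mapcache() in 
hw/xen/xen-mapcache.c):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t ram_addr_t;
    typedef uint64_t hwaddr;

    #define BUCKET_SHIFT 20             /* assumed bucket size, illustrative */

    typedef struct {                    /* cut-down MapCacheEntry stand-in */
        uint8_t   *vaddr_base;          /* host mapping of this bucket */
        hwaddr     paddr_index;         /* bucket index in the mapped space */
        ram_addr_t ram_offset;          /* RAMBlock offset, kept as a cookie */
    } Entry;

    /* hostptr -> ram_addr: bucket base plus offset within the bucket, plus
     * the RAMBlock offset stored when the mapping was created. */
    static ram_addr_t ram_addr_from_entry(const Entry *e, const uint8_t *p)
    {
        ptrdiff_t off = p - e->vaddr_base;
        return (e->paddr_index << BUCKET_SHIFT) + off + e->ram_offset;
    }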
> >
> > >
> > >
> > > > There's a second call in physmem.c to xen_map_cache using the
> > > > block->offset as an address.
> > > > I was considering removing that second call since I can't see how it 
> > > > can work
> > > > (except perhaps in some specific use-case by luck?). Anyway, for now
> > > > I've left it unmodified.
> > >
> > > Yes, that code was written with the assumption that block->offset is an
> > > offset in the guest physical address space and could be used as a guest
> > > physical address. Actually, you might have spotted a real bug.
> > >
> > > The intent was that for smaller regions (not the big RAM region; things
> > > like a ROM region, for instance) we could map them in full. So here we
> > > were trying to map the whole thing from start to finish using
> > > block->offset as start.
> > >
> > >
> > > > > > ---
> > > > > >  hw/xen/xen-mapcache.c | 25 ++---
> > > > > >  include/sysemu/xen-mapcache.h |  2 ++
> > > > > >  system/physmem.c  |  8 
> > > > > >  3 files changed, 24 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c
> > > > > > index 09b5f36d9c..1b32d0c003 100644
> > > > > > --- a/hw/xen/xen-mapcache.c
> > > > > > +++ b/hw/xen/xen-mapcache.c
> > > > > > @@ -43,6 +43,9 @@ typedef struct MapCacheEntry {
> > > > > >  #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
> > > > > >  uint8_t flags;
> > > > > >  hwaddr size;
> > > > > > +
> > > > > > +/* Keep ram_addr offset for reverse mappings (hostptr -> 
> > > > > > ram_addr).  */
> > > > > > +ram_addr_t ram_offset;
> > > > > >  struct MapCacheEntry *next;
> > > > > >  } MapCacheEntry;
> > > > > >
> > > > > > @@ -165,7 +168,8 @@ static void xen_remap_bucket(MapCache *mc,
> > > > 

Xen Security Advisory 457 v1 - Linux/xen-netback: Memory leak due to missing cleanup function

2024-05-07 Thread Xen . org security team
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Xen Security Advisory XSA-457

Linux/xen-netback: Memory leak due to missing cleanup function

ISSUE DESCRIPTION
=

In netback, xennet_alloc_one_rx_buffer() failed to call the
appropriate clean-up function, resulting in a memory leak.

IMPACT
==

A malicious guest userspace process can exhaust memory resources
within the guest kernel, potentially leading to a system crash (Denial
of Service). It is not known whether it can be triggered remotely.

VULNERABLE SYSTEMS
==

Systems with guests running Linux 5.9 and later with Xen PV network
devices are affected.

MITIGATION
==

For HVM guests, using emulated network devices will avoid this issue.

RESOLUTION
==

The following patch in Linux resolves the issue:

https://git.kernel.org/torvalds/c/037965402a010898d34f4e35327d22c0a95cd51f

A copy of which is attached.

xsa457.patch   Linux 5.9

$ sha256sum xsa457*
9d6ae3da27f1ff92f9f45c800822beecda603d6dea6726207cee6c768416114c  xsa457.patch
$


NOTE ON THE LACK OF EMBARGO
===

The issue was reported initially on a public bug tracker and fixed in
public before it was realized that there was a security aspect.
-BEGIN PGP SIGNATURE-

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmY6YN8MHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZq4kH/0BcaF/4dKqxQ/hYMMoLxcE1kzHn2kAdFPcvxcuu
Csk1yLugbvxHgwgp0lI9JjiqzSMt68pN8B9mWbcMBBvA7jGGsJ6Vjp25kQnUToLe
FPiAhW/TY+1YXOnhsfn9dHHk1Tv0W5D69QuUuj6zGUvRMdV+WPyA/mGPWnBrJgT+
5s6tKFxls1JiLdFxuJKqi8Ok8HrX1zE9unSWEUri8SNE2k3h5i29X2v+S8yBv2y0
XBnzr16kL9KKim0sNSErB1QU5BThnDBCFk+7FKAAYGAv5H6N3VLv66DLARCYfPhP
iXJU3/+yvAjwZjp5oYtbqHXzdd/m0b/IrF/0ZMLBaoDs0s4=
=vfs6
-END PGP SIGNATURE-


xsa457.patch
Description: Binary data


Xen Security Advisory 456 v3 (CVE-2024-2201) - x86: Native Branch History Injection

2024-05-07 Thread Xen . org security team
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Xen Security Advisory CVE-2024-2201 / XSA-456
  version 3

 x86: Native Branch History Injection

UPDATES IN VERSION 3


Issues were found with the original code changes.  See the bottom of the
Resolution section for how to obtain those.

ISSUE DESCRIPTION
=

In August 2022, researchers at VU Amsterdam disclosed Spectre-BHB.

Spectre-BHB was discussed in XSA-398.  At the time, the susceptibility
of Xen to Spectre-BHB was uncertain so no specific action was taken in
XSA-398.  However, various changes were made thereafter in upstream Xen
as a consequence; more on these later.

VU Amsterdam have subsequently adjusted the attack to be pulled off
entirely from userspace, without the aid of a managed runtime in the
victim context.

For more details, see:
  https://vusec.net/projects/native-bhi
  https://vusec.net/projects/bhi-spectre-bhb
  
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
  https://xenbits.xen.org/xsa/advisory-398.html

IMPACT
==

An attacker might be able to infer the contents of arbitrary host
memory, including memory assigned to other guests.

VULNERABLE SYSTEMS
==

Systems running all versions of Xen are affected.

Only Intel x86 CPUs are potentially affected.  CPUs from other
manufacturers are not known to be affected.

A wide range of Intel CPUs employ Branch History prediction techniques.
However for older CPUs existing Spectre-v2 mitigations (XSA-254) are
believed to be sufficient to mitigate Native-BHI.

Therefore, the rest of the discussion will be limited in scope to the
CPUs for which a change in behaviour is expected.  These are believed to
be all CPUs with eIBRS (Enhanced IBRS, a.k.a. IBRS_ALL or IBRS_ATT).
eIBRS signifies a hardware adjustment (mode-tagged indirect predictions)
designed to combat Spectre-v2, available in CPUs from 2019 onwards.

To determine if a system has eIBRS, run `xen-cpuid -v` in dom0, looking for
the string "eibrs" in the Dynamic Raw block of information.  e.g.

  # xen-cpuid -v
  ...
  Dynamic sets:
  Raw ...
...
[16] MSR_ARCH_CAPS.lo ... eibrs ...
...
  ...

Be aware that the Static sets are compile time information so will include the
string "eibrs" irrespective of hardware support.  If there is no row for "[16]
MSR_ARCH_CAPS.lo" then the fixes for XSA-435 are missing.

MITIGATION
==

There are no mitigations.

CREDITS
===

This issue was discovered by VU Amsterdam.

RESOLUTION
==

In Xen 4.17, in response to the original Spectre-BHB, CET-IBT support was
added to Xen to use on capable hardware.  It also came with work to remove
unnecessary function pointers, and to de-virtualise function pointers at boot,
as both a performance and hardening improvement.  This work has been steadily
continuing since, and every removed/de-virtualised function pointer reduces
the options available to an adversary trying to mount a Native-BHI attack.
All of this work has been backported to 4.17 and later for this advisory.

Beginning with the Intel Alder Lake (Client) and Sapphire Rapids (Server)
CPUs, a hardware control called BHI_DIS_S is available, which restricts
history-based predictions.  This control requires updated microcode on some
CPUs.  Look for "bhi-ctrl" in `xen-cpuid -v`, similar to eibrs above.

Xen has been updated to use this control when available, and to virtualise it
for guests to use.

For CPUs without BHI_DIS_S, BHB clearing sequences need using.  Out of an
abundance of caution, all sequences in the Intel whitepaper have been
implemented, although Xen will only use the "short" sequence by default.  The
others are available to opt in to.

The work to mitigate Native-BHI is extensive, and the backports are
more-extensive still.

Therefore, we have decided to produce new releases on all stable trees.
Please find fixes in the respective branches under the following release
tags:

  RELEASE-4.18.2
  RELEASE-4.17.4
  RELEASE-4.16.6
  RELEASE-4.15.6

Other release activities (tarballs, announcements, etc) will happen in
due course.

Issues were found in those subsequently.  To address those, the stable
branches need updating to newer commits, in particular:

stable-4.15 05653eb44314cb90f2e3e7b2d405e86b5657
stable-4.16 d0e8f8ffbb19b5df5f767328baeb54c069b08e6a
stable-4.17 effcf70f020ff12d34c80e2abde0ecb00ce92bda
stable-4.18 f0ff1d9cb96041a84a24857a6464628240deed4f

For 4.15, since we're closing the branch, RELEASE-4.15.7 was tagged in
addition; other release activities - as per above - will follow.

DEPLOYMENT DURING EMBARGO
=

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing 

Re: [PATCH v3 6/9] xen/arm64: bpi: Add missing code symbol annotations

2024-05-07 Thread Edgar E. Iglesias
On Tue, May 7, 2024 at 11:57 AM Julien Grall  wrote:
>
> Hi,
>
> On 06/05/2024 13:54, Edgar E. Iglesias wrote:
> > On Sat, May 4, 2024 at 2:14 AM Stefano Stabellini
> >  wrote:
> >>
> >> On Wed, 1 May 2024, Edgar E. Iglesias wrote:
> >>> From: "Edgar E. Iglesias" 
> >>>
> >>> Use the generic xen/linkage.h macros to annotate code symbols
> >>> and add missing annotations.
> >>>
> >>> Signed-off-by: Edgar E. Iglesias 
> >>> ---
> >>>   xen/arch/arm/arm64/bpi.S | 20 
> >>>   1 file changed, 12 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/xen/arch/arm/arm64/bpi.S b/xen/arch/arm/arm64/bpi.S
> >>> index 4e63825220..b16e4d1e29 100644
> >>> --- a/xen/arch/arm/arm64/bpi.S
> >>> +++ b/xen/arch/arm/arm64/bpi.S
> >>> @@ -52,14 +52,15 @@
> >>>* micro-architectures in a system.
> >>>*/
> >>>   .align   11
> >>> -ENTRY(__bp_harden_hyp_vecs_start)
> >>> +FUNC(__bp_harden_hyp_vecs_start)
> >>>   .rept 4
> >>>   vectors hyp_traps_vector
> >>>   .endr
> >>> -ENTRY(__bp_harden_hyp_vecs_end)
> >>> +GLOBAL(__bp_harden_hyp_vecs_end)
> >>> +END(__bp_harden_hyp_vecs_start)
> >>
> >> Shouldn't GLOBAL be changed to FUNC as well?
> >>
> >
> > I was a bit unsure but went for GLOBAL since the _end labels point to
> > addresses after and outside of the code sequence.
> > But I don't have a strong opinion and am happy to change them to FUNC
> > if you feel that's better.
>
> I don't think it should be FUNC as this is not meant to be called
> directly. I am also under the impression that we were planning to get rid
> of GLOBAL() as well.
>
> Furthermore, __bp_harden_hyp_vecs_start is not a function per se. It is
> a pointer to the vector table.
>
>  From the brief look, the same remarks would apply to the rest of bpi.S.
> So I think we want to switch all the ENTRY() to LABEL().

Hi Julien,

The reason I chose FUNC for the start of the symbol is because these
symbols contain executable code (not only a table of pointers to code
somewhere else) and the ELF spec says that STT_FUNC means the symbol
contains functions or other executable code (not only callable
functions, IIUC):

"STT_FUNC The symbol is associated with a function or other executable code."
https://refspecs.linuxbase.org/elf/elf.pdf
(Symbol Table 1-20).

I think using LABEL instead of GLOBAL for the _end labels of these
code sequences makes sense.
I'm happy to change the _start labels to LABEL too if you guys feel
that's better.

Cheers,
Edgar



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-07 Thread Sean Christopherson
On Tue, May 07, 2024, Mickaël Salaün wrote:
> > Actually, potential bad/crazy idea.  Why does the _host_ need to define 
> > policy?
> > Linux already knows what assets it wants to (un)protect and when.  What's 
> > missing
> > is a way for the guest kernel to effectively deprivilege and re-authenticate
> > itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> > support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, 
> > with a
> > bit of pKVM mixed in?
> > 
> > Borrowing VTL terminology, where VTL0 is the least privileged, userspace 
> > launches
> > the VM at VTL0.  At some point, the guest triggers the deprivileging 
> > sequence and
> > userspace creates VTL1.  Userpace also provides a way for VTL0 restrict 
> > access to
> > its memory, e.g. to effectively make the page tables for the kernel's 
> > direct map
> > writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 
> > could then
> > also completely remove its access to code that changes CR0/CR4.
> > 
> > It would obviously require a _lot_ more upfront work, e.g. to isolate the 
> > kernel
> > text that modifies CR0/CR4 so that it can be removed from VTL0, but that 
> > should
> > be doable with annotations, e.g. tag relevant functions with __magic or 
> > whatever,
> > throw them in a dedicated section, and then free/protect the section(s) at 
> > the
> > appropriate time.
> > 
> > KVM would likely need to provide the ability to switch VTLs (or whatever 
> > they get
> > called), and host userspace would need to provide a decent amount of the 
> > backend
> > mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn 
> > off?)
> > VTL1 on kexec(), etc.  But everything else could live in the guest kernel 
> > itself.
> > E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() 
> > code into
> > VTL1.  That way VTL1 can verify the kexec() target and tear itself down 
> > before
> > jumping into the new kernel. 
> > 
> > This is very off the cuff and hand-wavy, e.g. I don't have much of an idea 
> > what
> > it would take to harden kernel text patching, but keeping the policy in the 
> > guest
> > seems like it'd make everything more tractable than trying to define an ABI
> > between Linux and a VMM that is rich and flexible enough to support all the
> > fancy things Linux does (and will do in the future).
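
The CR pinning referenced above, conceptually (a self-contained 
illustrative sketch, not KVM or Heki code; the CR4 bit positions are 
from the Intel SDM):

    #include <stdint.h>
    #include <stdbool.h>

    #define CR4_SMEP (1ULL << 20)
    #define CR4_SMAP (1ULL << 21)

    /* Bits that, once set and pinned, must never be cleared again. */
    static uint64_t cr4_pinned_mask = CR4_SMEP | CR4_SMAP;

    /* A CR4 write that would clear a pinned, currently-set bit is rejected. */
    static bool cr4_write_allowed(uint64_t old_cr4, uint64_t new_cr4)
    {
        return ((old_cr4 & cr4_pinned_mask) & ~new_cr4) == 0;
    }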
> 
> Yes, we agree that the guest needs to manage its own policy.  That's why
> we implemented Heki for KVM this way, but without VTLs because KVM
> doesn't support them.
> 
> To sum up, is the VTL approach the only one that would be acceptable for
> KVM?  

Heh, that's not a question you want to be asking.  You're effectively asking me
to make an authorative, "final" decision on a topic which I am only passingly
familiar with.

But since you asked it... :-)  Probably?

I see a lot of advantages to a VTL/VSM-like approach:

 1. Provides Linux-as-a guest the flexibility it needs to meaningfully advance
its security, with the least amount of policy built into the guest/host ABI.

 2. Largely decouples guest policy from the host, i.e. should allow the guest to
evolve/update its policy without needing to coordinate changes with the 
host.

 3. The KVM implementation can be generic enough to be reusable for other 
features.

 4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
itself, and potentially for VMPL/SVSM support.

IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
multiple code bases and/or organizations and/or maintainers is a big win for
velocity, long term maintenance, and probably the very viability of HEKI.

Providing the guest with the tools to define and implement its own policy means
end users don't have to wait for some third party, e.g. CSPs, to deploy the
accompanying host-side changes, because there are no host-side changes.

And encapsulating everything in the guest drastically reduces the friction with
changes in the kernel that interact with hardening, both from a technical and a
social perspective.  I.e. giving the kernel (near) complete control over its
destiny minimizes the number of moving parts, and will be far, far easier to 
sell
to maintainers.  I would expect maintainers to react much more favorably to 
being
handed tools to harden the kernel, as opposed to being presented a set of APIs
that can be used to make the kernel compliant with _someone else's_ vision of
what kernel hardening should look like.

E.g. imagine a new feature comes along that requires overriding CR0/CR4 pinning
in a way that doesn't fit into existing policy.  If the VMM is involved in
defining/enforcing the CR pinning policy, then supporting said new feature would
require new guest/host ABI and an updated host VMM in order to make the new
feature compatible with HEKI.  Inevitably, even if everything goes smoothly from
an upstreaming perspective, that will result in guests that have to choose 

Re: [PATCH v2 (resend) 12/27] x86/mapcache: Initialise the mapcache for the idle domain

2024-05-07 Thread Elias El Yandouzi

On 20/02/2024 10:51, Jan Beulich wrote:

On 16.01.2024 20:25, Elias El Yandouzi wrote:

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -750,9 +750,16 @@ int arch_domain_create(struct domain *d,
  
   spin_lock_init(&d->arch.e820_lock);
  
+if ( (rc = mapcache_domain_init(d)) != 0 )
+{
+free_perdomain_mappings(d);
+return rc;
+}
+
  /* Minimal initialisation for the idle domain. */
  if ( unlikely(is_idle_domain(d)) )
  {
+struct page_info *pg = d->arch.perdomain_l3_pg;
  static const struct arch_csw idle_csw = {
  .from = paravirt_ctxt_switch_from,
  .to   = paravirt_ctxt_switch_to,
@@ -763,6 +770,9 @@ int arch_domain_create(struct domain *d,
  
  d->arch.cpu_policy = ZERO_BLOCK_PTR; /* Catch stray misuses. */
  
+idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)] =
+l4e_from_page(pg, __PAGE_HYPERVISOR_RW);
+
+
  return 0;
  }


Why not add another call to mapcache_domain_init() right here, allowing
a more specific panic() to be invoked in case of failure (compared to
the BUG_ON() upon failure of creation of the idle domain as a whole)?
Then the other mapcache_domain_init() call doesn't need moving a 2nd
time in close succession.



Sorry, but I don't get your point. Why call `mapcache_domain_init()` 
another time? What panic() are you referring to?


Elias



Re: [PATCH v2 (resend) 11/27] x86: Lift mapcache variable to the arch level

2024-05-07 Thread Elias El Yandouzi

This only lifts the mapcache variable up. Whether we populate the
mapcache for a domain is unchanged in this patch.


Is it? I wonder because of ...



I agree, the commit message doesn't completely reflect the changes below.


--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -843,6 +843,8 @@ int arch_domain_create(struct domain *d,
  
  psr_domain_init(d);
  
+mapcache_domain_init(d);
+
  if ( is_hvm_domain(d) )
  {
  if ( (rc = hvm_domain_initialise(d, config)) != 0 )
@@ -850,8 +852,6 @@ int arch_domain_create(struct domain *d,
  }
  else if ( is_pv_domain(d) )
  {
-mapcache_domain_init(d);
-
  if ( (rc = pv_domain_initialise(d)) != 0 )
  goto fail;
  }


... this and ...


--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -82,11 +82,11 @@ void *map_domain_page(mfn_t mfn)
  #endif
  
  v = mapcache_current_vcpu();

-if ( !v || !is_pv_vcpu(v) )
+if ( !v )
  return mfn_to_virt(mfn_x(mfn));


... this and yet more changes indicating otherwise.

Yet if which domains have a mapcache set up is already changed here, I
wonder whether the idle domain shouldn't be taken care of here as well.


Are you suggesting to fold in here the following patch, where the 
mapcache gets initialized for idle domains?



Elias



Re: [PATCH v2 (resend) 09/27] x86/pv: Rewrite how building PV dom0 handles domheap mappings

2024-05-07 Thread Elias El Yandouzi

> On 20/02/2024 10:28, Jan Beulich wrote:

On 16.01.2024 20:25, Elias El Yandouzi wrote:
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -382,6 +382,10 @@ int __init dom0_construct_pv(struct domain *d,
  l3_pgentry_t *l3tab = NULL, *l3start = NULL;
  l2_pgentry_t *l2tab = NULL, *l2start = NULL;
  l1_pgentry_t *l1tab = NULL, *l1start = NULL;
+mfn_t l4start_mfn = INVALID_MFN;
+mfn_t l3start_mfn = INVALID_MFN;
+mfn_t l2start_mfn = INVALID_MFN;
+mfn_t l1start_mfn = INVALID_MFN;


The reason initializers are needed here is, aiui, the overly large scope
of these variables. For example ...



Correct. Is it just an observation, or do you want me to do anything?


@@ -708,22 +712,32 @@ int __init dom0_construct_pv(struct domain *d,
  v->arch.pv.event_callback_cs= FLAT_COMPAT_KERNEL_CS;
  }
  
+#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
+do {\
+unmap_domain_page(virt_var);\
+mfn_var = maddr_to_mfn(maddr);  \
+maddr += PAGE_SIZE; \
+virt_var = map_domain_page(mfn_var);\
+} while ( false )
+
  if ( !compat )
  {
  maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
-l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+UNMAP_MAP_AND_ADVANCE(l4start_mfn, l4start, mpt_alloc);
+l4tab = l4start;
  clear_page(l4tab);
-init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-  d, INVALID_MFN, true);
-v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+init_xen_l4_slots(l4tab, l4start_mfn, d, INVALID_MFN, true);
+v->arch.guest_table = pagetable_from_mfn(l4start_mfn);


... looks to be required only here, while ...


  }
  else
  {
  /* Monitor table already created by switch_compat(). */
-l4start = l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+l4start_mfn = pagetable_get_mfn(v->arch.guest_table);
+l4start = l4tab = map_domain_page(l4start_mfn);


... in principle the use of the variable could be avoided here. Below
from here there's no further use of it.


@@ -781,30 +797,34 @@ int __init dom0_construct_pv(struct domain *d,
  
   if ( compat )
   {
-l2_pgentry_t *l2t;
-
  /* Ensure the first four L3 entries are all populated. */
  for ( i = 0, l3tab = l3start; i < 4; ++i, ++l3tab )
  {
  if ( !l3e_get_intpte(*l3tab) )
  {
   maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
-l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-clear_page(l2tab);
-*l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
+UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
+clear_page(l2start);
+*l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
  }


The updating of l2start is only conditional here, yet ...


  if ( i == 3 )
  l3e_get_page(*l3tab)->u.inuse.type_info |= PGT_pae_xen_l2;
  }
  
-l2t = map_l2t_from_l3e(l3start[3]);
-init_xen_pae_l2_slots(l2t, d);
-unmap_domain_page(l2t);
+init_xen_pae_l2_slots(l2start, d);


... here you assume it points at the page referenced by the 3rd L3 entry.


Hmm, I missed it when sending the revision and indeed it doesn't look 
correct.



Question is why the original code is being replaced here in the first
place: It was already suitably mapping the page in question.


The code was already suitably mapping the pages in question. This patch 
doesn't aim to make any functional change, just to rework how the 
domheap pages are used. The goal of the series is to remove the mappings 
from the directmap, which means those pages need to be mapped and 
unmapped when required.


This is all this patch does; see the `UNMAP_MAP_AND_ADVANCE()` macro.

Elias



[xen-unstable test] 185936: tolerable FAIL

2024-05-07 Thread osstest service owner
flight 185936 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185936/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 185929
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 185929
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 185929
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 185929
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail like 185929
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 185929
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185929
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-qcow2 15 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-raw 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-raw 15 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-check fail never pass

version targeted for testing:
 xen  ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
baseline version:
 xen  ebab808eb1bb8f24c7d0dd41b956e48cb1824b81

Last test of basis   185936  2024-05-07 05:53:32 Z    0 days
Testing same since  (not found) 0 attempts

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64-xtf  pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass

Re: [PATCH v2] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Andrew Cooper
On 07/05/2024 3:45 pm, Roger Pau Monné wrote:
> On Tue, May 07, 2024 at 03:31:19PM +0100, Andrew Cooper wrote:
>> On 07/05/2024 3:24 pm, Roger Pau Monné wrote:
>>> On Tue, May 07, 2024 at 02:45:40PM +0100, Andrew Cooper wrote:
 Ever since Xen 4.14, there has been a latent bug with migration.

>> While some toolstacks can level the features properly, they don't shrink
 feat.max_subleaf when all features have been dropped.  This is because
 we *still* have not completed the toolstack side work for full CPU Policy
 objects.

 As a consequence, even when properly feature levelled, VMs can't migrate
 "backwards" across hardware which reduces feat.max_subleaf.  One such 
 example
 is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).

 Extend the max policies feat.max_subleaf to the highest number Xen knows
 about, but leave the default policies matching the host.  This will allow 
 VMs
 with a higher feat.max_subleaf than strictly necessary to migrate in.

 Eventually we'll manage to teach the toolstack how to avoid creating such 
 VMs
 in the first place, but there's still more work to do there.

 Signed-off-by: Andrew Cooper 
>>> Acked-by: Roger Pau Monné 
>> Thanks.
>>
>>> Even if we have just found one glitch with PSFD and Ice Lake vs
>>> Cascade Lake, wouldn't it be safer to always extend the max policies
>>> max leafs and subleafs to match the known array sizes?
>> This is the final max leaf (containing feature information) to gain
>> custom handling, I think?
> Couldn't the same happen with extended leaves?  Some of the extended
> leaves contain features, and hence for policy leveling toolstack might
> decide to zero them, yet extd.max_leaf won't be adjusted.

Hmm.  Right now, extd max leaf is also the one with the bit that we
unconditionally advertise, and it's inherited all the way from the host
policy.

So yes, in principle, but anything that bumps this limit is going to
have other implications too, and I'd prefer not to second-guess them at
this point.

I hope we can get the toolstack side fixes before this becomes a real
problem...

~Andrew
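
For reference, a cut-down sketch of the shape of the fix described in the 
commit message above (stand-in struct; field names follow Xen's struct 
cpu_policy, but this is not the verbatim patch):

    #include <stdint.h>

    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    struct cpu_policy {                 /* cut-down stand-in */
        struct {
            uint32_t max_subleaf;       /* CPUID leaf 7 maximum subleaf */
            uint32_t raw[3];            /* assumed: one word per known subleaf */
        } feat;
    };

    static void max_policy_adjustments(struct cpu_policy *p)
    {
        /*
         * Advertise every feature subleaf Xen knows about in the *max*
         * policy, so a VM levelled on wider hardware (Ice Lake,
         * max_subleaf=2) can migrate to narrower hardware (Cascade Lake,
         * max_subleaf=0).  Default policies keep following the host.
         */
        p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
    }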



Re: [PATCH v2] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Roger Pau Monné
On Tue, May 07, 2024 at 03:31:19PM +0100, Andrew Cooper wrote:
> On 07/05/2024 3:24 pm, Roger Pau Monné wrote:
> > On Tue, May 07, 2024 at 02:45:40PM +0100, Andrew Cooper wrote:
> >> Ever since Xen 4.14, there has been a latent bug with migration.
> >>
> >> While some toolstacks can level the features properly, they don't shrink
> >> feat.max_subleaf when all features have been dropped.  This is because
> >> we *still* have not completed the toolstack side work for full CPU Policy
> >> objects.
> >>
> >> As a consequence, even when properly feature levelled, VMs can't migrate
> >> "backwards" across hardware which reduces feat.max_subleaf.  One such 
> >> example
> >> is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
> >>
> >> Extend the max policies feat.max_subleaf to the highest number Xen knows
> >> about, but leave the default policies matching the host.  This will allow 
> >> VMs
> >> with a higher feat.max_subleaf than strictly necessary to migrate in.
> >>
> >> Eventually we'll manage to teach the toolstack how to avoid creating such 
> >> VMs
> >> in the first place, but there's still more work to do there.
> >>
> >> Signed-off-by: Andrew Cooper 
> > Acked-by: Roger Pau Monné 
> 
> Thanks.
> 
> >
> > Even if we have just found one glitch with PSFD and Ice Lake vs
> > Cascade Lake, wouldn't it be safer to always extend the max policies
> > max leafs and subleafs to match the known array sizes?
> 
> This is the final max leaf (containing feature information) to gain
> custom handling, I think?

Couldn't the same happen with extended leaves?  Some of the extended
leaves contain features, and hence for policy leveling toolstack might
decide to zero them, yet extd.max_leaf won't be adjusted.

Thanks, Roger.



Re: [PATCH 05/15] tools/libs/light: Increase nr_spi to 160

2024-05-07 Thread Julien Grall

Hi,

On 06/05/2024 06:17, Henry Wang wrote:

On 5/1/2024 9:58 PM, Anthony PERARD wrote:

On Wed, Apr 24, 2024 at 11:34:39AM +0800, Henry Wang wrote:
Increase number of spi to 160 i.e. gic_number_lines() for Xilinx 
ZynqMP - 32.

This was done to allocate and assign IRQs to a running domain.

Signed-off-by: Vikram Garhwal 
Signed-off-by: Stefano Stabellini 
Signed-off-by: Henry Wang 
---
  tools/libs/light/libxl_arm.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index dd5c9f4917..50dbd0f2a9 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -181,7 +181,8 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
  LOG(DEBUG, "Configure the domain");
-    config->arch.nr_spis = nr_spis;
+    /* gic_number_lines() is 192 for Xilinx ZynqMP. min nr_spis = 192 - 32. */
+    config->arch.nr_spis = MAX(nr_spis, 160);

Is there a way that Xen or libxl could find out what the minimum
number of SPIs needs to be?


I am afraid currently there is none.


Are we going to have to increase that minimum
number every time a new platform comes along?

It doesn't appear that libxl is using that `nr_spis` value and it is
probably just given to Xen. So my guess is that Xen could simply take
care of the minimum value, gic_number_lines() seems to be a Xen
function.


Xen will take care of the value of nr_spis for dom0 in create_dom0():
dom0_cfg.arch.nr_spis = min(gic_number_lines(), (unsigned int) 992) - 32;
and also for dom0less domUs in create_domUs().

However, it looks like Xen will not take care of the minimum value for
libxl guests; the value from config->arch.nr_spis in the guest config file
will be directly passed to the domain_vgic_init() function from
arch_domain_create().


I agree with you that we shouldn't just bump the number every time
we have a new platform. Therefore, would it be a good idea to move the
logic in this patch to arch_sanitise_domain_config()?


Xen domains are supposed to be platform agnostic and therefore the
number of SPIs should not be based on the HW.


Furthermore, with your proposal we would end up allocating data
structures for N SPIs when a domain may never need any SPIs (such as when
passthrough is not in use). This is more likely for domains created by
the toolstack than from Xen directly.


Instead, we should introduce a new XL configuration option to let the user
decide the number of SPIs. I would suggest naming it "nr_spis" to match
the DT bindings.
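
A minimal sketch of what such a guest config fragment could look like
(hedged: the "nr_spis" xl option does not exist at this point; the name and
semantics are only the proposal above):

    # hypothetical xl guest config
    name = "domU"
    nr_spis = 160   # size the vGIC for up to 160 SPIs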


Cheers,

--
Julien Grall



Re: [PATCH v2] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Andrew Cooper
On 07/05/2024 3:24 pm, Roger Pau Monné wrote:
> On Tue, May 07, 2024 at 02:45:40PM +0100, Andrew Cooper wrote:
>> Ever since Xen 4.14, there has been a latent bug with migration.
>>
>> While some toolstacks can level the features properly, they don't shrink
>> feat.max_subleaf when all features have been dropped.  This is because
>> we *still* have not completed the toolstack side work for full CPU Policy
>> objects.
>>
>> As a consequence, even when properly feature levelled, VMs can't migrate
>> "backwards" across hardware which reduces feat.max_subleaf.  One such example
>> is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
>>
>> Extend the max policies' feat.max_subleaf to the highest number Xen knows
>> about, but leave the default policies matching the host.  This will allow VMs
>> with a higher feat.max_subleaf than strictly necessary to migrate in.
>>
>> Eventually we'll manage to teach the toolstack how to avoid creating such VMs
>> in the first place, but there's still more work to do there.
>>
>> Signed-off-by: Andrew Cooper 
> Acked-by: Roger Pau Monné 

Thanks.

>
> Even if we have just found one glitch with PSFD and Ice Lake vs
> Cascade Lake, wouldn't it be safer to always extend the max policies'
> max leaves and subleaves to match the known array sizes?

This is the final max leaf (containing feature information) to gain
custom handling, I think?

~Andrew



Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Marek Marczykowski-Górecki
On Tue, May 07, 2024 at 01:32:00PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
> > `xl devd` has been observed leaking /var/log/xldevd.log into children.
> > 
> > Link: https://github.com/QubesOS/qubes-issues/issues/8292
> > Reported-by: Demi Marie Obenour 
> > Signed-off-by: Andrew Cooper 
> > ---
> > CC: Anthony PERARD 
> > CC: Juergen Gross 
> > CC: Demi Marie Obenour 
> > CC: Marek Marczykowski-Górecki 
> > 
> > Also entirely speculative based on the QubesOS ticket.
> > ---
> >  tools/xl/xl_utils.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> > index 17489d182954..060186db3a59 100644
> > --- a/tools/xl/xl_utils.c
> > +++ b/tools/xl/xl_utils.c
> > @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
> >  exit(-1);
> >  }
> >  
> > -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> > +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
> 
> This one might not be enough, as the FD gets dup2()-ed to stdout/stderr
> just outside of the context here, and then inherited by various hotplug
> scripts. Just adding O_CLOEXEC here means the hotplug scripts will run
> with stdout/stderr closed. The scripts shipped with Xen do redirect
> stderr to a log quite early, but a) they don't do it for stdout, and b)
> custom hotplug scripts are a valid use case.
> Without that, I see at least a few potential issues:
> - some log messages may be lost (minor, but annoying)
> - something might simply fail on writing to a closed FD, breaking the
>   hotplug script
> - FD 1 will be used as the first free FD for any open() or similar call -
>   if a tool later tries writing something to stdout, it will get written
>   to that FD - the worst of the three

Wait, the above is wrong, dup does not copy the O_CLOEXEC flag over to
the new FD. So, maybe your patch is correct after all.

> What should be the behavior of hotplug scripts logging? Should they
> always take care of their own logging? If so, the hotplug calling part
> should redirect stdout/stderr to /dev/null IMO. But if `xl` should
> provide some default logging for them (like, the xldevd.log here?), then
> the O_CLOEXEC should be set only after duplicating logfile over stdout/err.
> 
> >  free(fullname);
> >  assert(logfile >= 3);
> >  
> > 
> > base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
> > prerequisite-patch-id: 212e50457e9b6bdfd06a97da545a5aa7155bb919
> 
> Which one is this? I don't see it in staging, nor in any of your
> branches on xenbits. Lore finds "tools/libxs: Open /dev/xen/xenbus fds
> as O_CLOEXEC" which I guess is correct, but I have no idea how it
> correlates, as this hash doesn't appear anywhere in the message, nor in
> its headers...
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Andrew Cooper
On 07/05/2024 3:23 pm, Marek Marczykowski-Górecki wrote:
> On Tue, May 07, 2024 at 03:15:48PM +0100, Andrew Cooper wrote:
>> On 07/05/2024 12:32 pm, Marek Marczykowski-Górecki wrote:
>>> On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
 `xl devd` has been observed leaking /var/log/xldevd.log into children.

 Link: https://github.com/QubesOS/qubes-issues/issues/8292
 Reported-by: Demi Marie Obenour 
 Signed-off-by: Andrew Cooper 
 ---
 CC: Anthony PERARD 
 CC: Juergen Gross 
 CC: Demi Marie Obenour 
 CC: Marek Marczykowski-Górecki 

 Also entirely speculative based on the QubesOS ticket.
 ---
  tools/xl/xl_utils.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
 index 17489d182954..060186db3a59 100644
 --- a/tools/xl/xl_utils.c
 +++ b/tools/xl/xl_utils.c
 @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
  exit(-1);
  }
  
 -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
 +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
>>> This one might be not enough, as the FD gets dup2()-ed to stdout/stderr
>>> just outside of the context here, and then inherited by various hotplug
>>> script. Just adding O_CLOEXEC here means the hotplug scripts will run
>>> with stdout/stderr closed.
>> Lovely :(  Yes - this won't work.  I guess what we want instead is:
>>
>> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
>> index 060186db3a59..a0ce7dd7fa21 100644
>> --- a/tools/xl/xl_utils.c
>> +++ b/tools/xl/xl_utils.c
>> @@ -282,6 +282,7 @@ int do_daemonize(const char *name, const char *pidfile)
>>  dup2(logfile, 2);
>>  
>>  close(nullfd);
>> +    close(logfile);
>>  
>>  CHK_SYSCALL(daemon(0, 1));
>>  
>> which at least means there's not a random extra fd attached to the logfile.
> But logfile is a global variable, and it looks to be used in dolog()...

Urgh, fine.  Let's go back to your suggestion of setting CLOEXEC after
dup()ing.

~Andrew
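
A minimal sketch of the approach agreed above: keep the log fd inheritable
long enough to populate stdout/stderr, then mark only the extra descriptor
close-on-exec. (Assumption: this only illustrates the idea, it is not the
actual follow-up patch; setup_logfile is an invented name.)

    #include <fcntl.h>
    #include <unistd.h>

    int setup_logfile(const char *fullname)
    {
        int logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND, 0644);

        if ( logfile < 0 )
            return -1;

        dup2(logfile, 1);   /* dup2() leaves FD_CLOEXEC clear on the copy, */
        dup2(logfile, 2);   /* so fds 1/2 stay open for hotplug scripts    */

        /* Only the original descriptor is closed on exec; it remains
         * usable in-process (e.g. by dolog()). */
        fcntl(logfile, F_SETFD, FD_CLOEXEC);

        return logfile;
    }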



Re: [PATCH v2] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Roger Pau Monné
On Tue, May 07, 2024 at 02:45:40PM +0100, Andrew Cooper wrote:
> Ever since Xen 4.14, there has been a latent bug with migration.
> 
> While some toolstacks can level the features properly, they don't shrink
> feat.max_subleaf when all features have been dropped.  This is because
> we *still* have not completed the toolstack side work for full CPU Policy
> objects.
> 
> As a consequence, even when properly feature levelled, VMs can't migrate
> "backwards" across hardware which reduces feat.max_subleaf.  One such example
> is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
> 
> Extend the max policies' feat.max_subleaf to the highest number Xen knows
> about, but leave the default policies matching the host.  This will allow VMs
> with a higher feat.max_subleaf than strictly necessary to migrate in.
> 
> Eventually we'll manage to teach the toolstack how to avoid creating such VMs
> in the first place, but there's still more work to do there.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Roger Pau Monné 

Even if we have just found one glitch with PSFD and Ice Lake vs
Cascade Lake, wouldn't it be safer to always extend the max policies'
max leaves and subleaves to match the known array sizes?

Thanks, Roger.



Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Marek Marczykowski-Górecki
On Tue, May 07, 2024 at 03:15:48PM +0100, Andrew Cooper wrote:
> On 07/05/2024 12:32 pm, Marek Marczykowski-Górecki wrote:
> > On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
> >> `xl devd` has been observed leaking /var/log/xldevd.log into children.
> >>
> >> Link: https://github.com/QubesOS/qubes-issues/issues/8292
> >> Reported-by: Demi Marie Obenour 
> >> Signed-off-by: Andrew Cooper 
> >> ---
> >> CC: Anthony PERARD 
> >> CC: Juergen Gross 
> >> CC: Demi Marie Obenour 
> >> CC: Marek Marczykowski-Górecki 
> >>
> >> Also entirely speculative based on the QubesOS ticket.
> >> ---
> >>  tools/xl/xl_utils.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> >> index 17489d182954..060186db3a59 100644
> >> --- a/tools/xl/xl_utils.c
> >> +++ b/tools/xl/xl_utils.c
> >> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
> >>  exit(-1);
> >>  }
> >>  
> >> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> >> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
> > This one might not be enough, as the FD gets dup2()-ed to stdout/stderr
> > just outside of the context here, and then inherited by various hotplug
> > scripts. Just adding O_CLOEXEC here means the hotplug scripts will run
> > with stdout/stderr closed.
> 
> Lovely :(  Yes - this won't work.  I guess what we want instead is:
> 
> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> index 060186db3a59..a0ce7dd7fa21 100644
> --- a/tools/xl/xl_utils.c
> +++ b/tools/xl/xl_utils.c
> @@ -282,6 +282,7 @@ int do_daemonize(const char *name, const char *pidfile)
>  dup2(logfile, 2);
>  
>  close(nullfd);
> +    close(logfile);
>  
>  CHK_SYSCALL(daemon(0, 1));
>  
> which at least means there's not a random extra fd attached to the logfile.

But logfile is a global variable, and it looks to be used in dolog()...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 2/7] xen/arm: Wrap shared memory mapping code in one function

2024-05-07 Thread Luca Fancellu
Hi Michal,


 
 int __init process_shm(struct domain *d, struct kernel_info *kinfo,
   const struct dt_device_node *node)
 {
 @@ -249,32 +290,10 @@ int __init process_shm(struct domain *d, struct kernel_info *kinfo,
if ( dt_property_read_string(shm_node, "role", &role_str) == 0 )
owner_dom_io = false;
>>> Looking at owner_dom_io, why don't you move parsing role and setting 
>>> owner_dom_io accordingly to handle_shared_mem_bank()?
>> 
>> I think I wanted to keep all dt_* functions on the same level inside 
>> process_shm, otherwise yes, I could
>> pass down shm_node and do the reading of role_str in handle_shared_mem_bank, 
>> or I could derive
>> owner_dom_io from role_str being passed or not, something like:
>> 
>> role_str = NULL;
>> dt_property_read_string(shm_node, "role", &role_str)
>> 
>> [inside handle_shared_mem_bank]:
>> if ( role_str )
>>     owner_dom_io = false;
>> 
>> And pass only role_str to handle_shared_mem_bank.
>> 
>> Is this comment to reduce the number of parameters passed? I guess it’s not 
>> for where we call
> In this series as well as the previous one you limit the number of arguments 
> passed to quite a few functions.
> So naturally I would expect it to be done here as well. owner_dom_io is used 
> only by handle_shared_mem_bank, so it makes more sense to move parsing to this
> function so that it is self-contained.

Ok I will, just to be on the same page here: you mean having
dt_property_read_string inside handle_shared_mem_bank?
Or would the above example work for you as well? That one would have
role_str passed instead of shm_node.

Cheers,
Luca



Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Andrew Cooper
On 07/05/2024 12:32 pm, Marek Marczykowski-Górecki wrote:
> On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
>> `xl devd` has been observed leaking /var/log/xldevd.log into children.
>>
>> Link: https://github.com/QubesOS/qubes-issues/issues/8292
>> Reported-by: Demi Marie Obenour 
>> Signed-off-by: Andrew Cooper 
>> ---
>> CC: Anthony PERARD 
>> CC: Juergen Gross 
>> CC: Demi Marie Obenour 
>> CC: Marek Marczykowski-Górecki 
>>
>> Also entirely speculative based on the QubesOS ticket.
>> ---
>>  tools/xl/xl_utils.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
>> index 17489d182954..060186db3a59 100644
>> --- a/tools/xl/xl_utils.c
>> +++ b/tools/xl/xl_utils.c
>> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
>>  exit(-1);
>>  }
>>  
>> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
>> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
> This one might not be enough, as the FD gets dup2()-ed to stdout/stderr
> just outside of the context here, and then inherited by various hotplug
> scripts. Just adding O_CLOEXEC here means the hotplug scripts will run
> with stdout/stderr closed.

Lovely :(  Yes - this won't work.  I guess what we want instead is:

diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
index 060186db3a59..a0ce7dd7fa21 100644
--- a/tools/xl/xl_utils.c
+++ b/tools/xl/xl_utils.c
@@ -282,6 +282,7 @@ int do_daemonize(const char *name, const char *pidfile)
 dup2(logfile, 2);
 
 close(nullfd);
+    close(logfile);
 
 CHK_SYSCALL(daemon(0, 1));
 
which at least means there's not a random extra fd attached to the logfile.
>>  free(fullname);
>>  assert(logfile >= 3);
>>  
>>
>> base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
>> prerequisite-patch-id: 212e50457e9b6bdfd06a97da545a5aa7155bb919
> Which one is this? I don't see it in staging, nor in any of your
> branches on xenbits. Lore finds "tools/libxs: Open /dev/xen/xenbus fds
> as O_CLOEXEC" which I guess is correct, but I have no idea how it
> correlates, as this hash doesn't appear anywhere in the message, nor in
> its headers...

It's the libxs patch, but rebased over yesterday's push to staging. 
auto-base doesn't work quite so well in cases like this.
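
For reference: a prerequisite-patch-id generated by --base is a stable
patch id, so, assuming a stock git, it can be matched against a local
commit with:

    git show <commit> | git patch-id --stable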

~Andrew



Re: [PATCH 1/7] xen/arm: Lookup bootinfo shm bank during the mapping

2024-05-07 Thread Luca Fancellu
Hi Michal,

>> 
 
 @@ -440,6 +431,26 @@ int __init process_shm_node(const void *fdt, int node, uint32_t address_cells,
device_tree_get_reg(&cell, address_cells, address_cells, &paddr, &gaddr);
size = dt_next_cell(size_cells, &cell);
 
 +if ( !IS_ALIGNED(paddr, PAGE_SIZE) )
 +{
 +printk("fdt: physical address 0x%"PRIpaddr" is not suitably 
 aligned.\n",
 +   paddr);
 +return -EINVAL;
 +}
 +
 +if ( !IS_ALIGNED(gaddr, PAGE_SIZE) )
 +{
 +printk("fdt: guest address 0x%"PRIpaddr" is not suitably 
 aligned.\n",
 +   gaddr);
 +return -EINVAL;
 +}
 +
 +if ( !IS_ALIGNED(size, PAGE_SIZE) )
>>> What sense does it make to check for size being aligned before checking for 
>>> size being 0? It would pass this check.
>> 
>> Yes, but in the end we are doing that to print a different error message,
>> so it would pass for 0 and that's totally fine, but in the end it will fail
>> afterwards. I don't see any functional disruption from having this one
>> before the other; what is the concern here?
> It does not cause a functional disruption. It is more about code
> readability and writing cleaner code.
> It makes more sense to first check for size being 0 rather than whether
> it's page aligned, since the latter can pass if the former is true, which
> doesn't make much sense.

Ok, then I will switch them and check that the size is different from 0
before the alignment check.

Cheers,
Luca
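
A minimal sketch of the agreed ordering (hedged: an illustration only, the
final patch may differ; the IS_ALIGNED/printk usage follows the quoted hunk,
the zero-size message is invented):

    if ( size == 0 )
    {
        printk("fdt: the size of the shared memory region can not be zero\n");
        return -EINVAL;
    }

    if ( !IS_ALIGNED(size, PAGE_SIZE) )
    {
        printk("fdt: size 0x%"PRIpaddr" is not suitably aligned.\n", size);
        return -EINVAL;
    }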



Re: [PATCH 2/7] xen/arm: Wrap shared memory mapping code in one function

2024-05-07 Thread Michal Orzel



On 07/05/2024 15:57, Luca Fancellu wrote:
> 
> 
> Hi Michal,
> 
>>>
>>> +static int __init handle_shared_mem_bank(struct domain *d, paddr_t gbase,
>>> + bool owner_dom_io,
>>> + const char *role_str,
>>> + const struct membank *shm_bank)
>>> +{
>>> +paddr_t pbase, psize;
>>> +int ret;
>>> +
>>> +BUG_ON(!shm_bank);
>> not needed
>>
>>> +
>>> +pbase = shm_bank->start;
>>> +psize = shm_bank->size;
>> please add empty line here
> 
> Will do
>>>
>>> int __init process_shm(struct domain *d, struct kernel_info *kinfo,
>>>const struct dt_device_node *node)
>>> {
>>> @@ -249,32 +290,10 @@ int __init process_shm(struct domain *d, struct kernel_info *kinfo,
>>> if ( dt_property_read_string(shm_node, "role", &role_str) == 0 )
>>> owner_dom_io = false;
>> Looking at owner_dom_io, why don't you move parsing role and setting 
>> owner_dom_io accordingly to handle_shared_mem_bank()?
> 
> I think I wanted to keep all dt_* functions on the same level inside 
> process_shm, otherwise yes, I could
> pass down shm_node and do the reading of role_str in handle_shared_mem_bank, 
> or I could derive
> owner_dom_io from role_str being passed or not, something like:
> 
> role_str = NULL;
> dt_property_read_string(shm_node, "role", &role_str)
> 
> [inside handle_shared_mem_bank]:
> if ( role_str )
>     owner_dom_io = false;
> 
> And pass only role_str to handle_shared_mem_bank.
> 
> Is this comment to reduce the number of parameters passed? I guess it’s not 
> for where we call
In this series as well as the previous one you limit the number of arguments 
passed to quite a few functions.
So naturally I would expect it to be done here as well. owner_dom_io is used 
only by handle_shared_mem_bank, so it makes more sense to move parsing to this
function so that it is self-contained.

~Michal



Re: [PATCH 1/7] xen/arm: Lookup bootinfo shm bank during the mapping

2024-05-07 Thread Michal Orzel



On 07/05/2024 15:44, Luca Fancellu wrote:
> 
> 
> Hi Michal,
> 
> Thanks for your review.
> 
>> On 6 May 2024, at 14:24, Michal Orzel  wrote:
>>
>> Hi Luca,
>>
>> On 23/04/2024 10:25, Luca Fancellu wrote:
>>>
>>>
>>> The current static shared memory code is using bootinfo banks when it
>>> needs to find the number of borrower, so every time assign_shared_memory
>> s/borrower/borrowers
> 
> Will fix
> 
>>
>>> is called, the bank is searched in the bootinfo.shmem structure.
>>>
>>> There is nothing wrong with it, however the bank can be used also to
>>> retrieve the start address and size and also to pass fewer arguments to
>>> assign_shared_memory. When retrieving the information from the bootinfo
>>> bank, it's also possible to move the checks on alignment to
>>> process_shm_node in the early stages.
>> Is this change really required for what you want to achieve? At the moment
>> the alignment checks are done before first use, which requires these values
>> to be aligned. The FDT processing part does not need it.
> 
> That's true, but it would better separate the parsing part; in the end,
> what is the point of failing later if, for example, some values are
> passed but not aligned?
> 
>>
>>>
>>> So create a new function find_shm() which takes a 'struct shared_meminfo'
>> Can we name it find_shm_bank() or find_shm_bank_by_id()?
>> I agree that it's better to use a unique ID rather than matching by 
>> address/size
> 
> Yes, either name is good for me; I would use find_shm_bank_by_id
> 
>>
>>> structure and the shared memory ID, to look for a bank with a matching ID,
>>> take the physical host address and size from the bank, pass the bank to
>>> assign_shared_memory() removing the now unnecessary arguments and finally
>>> remove the acquire_nr_borrower_domain() function since now the information
>>> can be extracted from the passed bank.
>>> Move the "xen,shm-id" parsing early in process_shm to bail out quickly in
>>> case of errors (unlikely), as said above, move the checks on alignment
>>> to process_shm_node.
>>>
>>> Drawback of this change is that now the bootinfo is used also when the
>>> bank doesn't need to be allocated, however it will be convenient later
>>> to use it as an argument for assign_shared_memory when dealing with
>>> the use case where the Host physical address is not supplied by the user.
>>>
>>> Signed-off-by: Luca Fancellu 
>>> ---
>>> xen/arch/arm/static-shmem.c | 105 
>>> 1 file changed, 58 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/xen/arch/arm/static-shmem.c b/xen/arch/arm/static-shmem.c
>>> index 09f474ec6050..f6cf74e58a83 100644
>>> --- a/xen/arch/arm/static-shmem.c
>>> +++ b/xen/arch/arm/static-shmem.c
>>> @@ -19,29 +19,24 @@ static void __init __maybe_unused build_assertions(void)
>>>  offsetof(struct shared_meminfo, bank)));
>>> }
>>>
>>> -static int __init acquire_nr_borrower_domain(struct domain *d,
>>> - paddr_t pbase, paddr_t psize,
>>> - unsigned long *nr_borrowers)
>>> +static const struct membank __init *find_shm(const struct membanks *shmem,
>>> + const char *shm_id)
>>> {
>>> -const struct membanks *shmem = bootinfo_get_shmem();
>>> unsigned int bank;
>>>
>>> -/* Iterate reserved memory to find requested shm bank. */
>>> +BUG_ON(!shmem || !shm_id);
>> Is it really necessary? For example, before calling find_shm(), strlen is 
>> used on shm_id
> 
> So, I guess I did that to have more robust code, in case someone changes
> the code in the future and perhaps removes something we rely on. If you
> object to them I will remove them though, here and at the other related
> points below.
> 
>>
>>> +
>>> for ( bank = 0 ; bank < shmem->nr_banks; bank++ )
>>> {
>>> -paddr_t bank_start = shmem->bank[bank].start;
>>> -paddr_t bank_size = shmem->bank[bank].size;
>>> -
>>> -if ( (pbase == bank_start) && (psize == bank_size) )
>>> +if ( strncmp(shm_id, shmem->bank[bank].shmem_extra->shm_id,
>>> + MAX_SHM_ID_LENGTH) == 0 )
>> Why not strcmp? AFAICS it's been validated many times already
>>
>>> break;
>>> }
>>>
>>> if ( bank == shmem->nr_banks )
>>> -return -ENOENT;
>>> -
>>> -*nr_borrowers = shmem->bank[bank].shmem_extra->nr_shm_borrowers;
>>> +return NULL;
>>>
>>> -return 0;
>>> +return &shmem->bank[bank];
>>> }
>>>
>>> /*
>>> @@ -103,14 +98,20 @@ static mfn_t __init acquire_shared_memory_bank(struct domain *d,
>>> return smfn;
>>> }
>>>
>>> -static int __init assign_shared_memory(struct domain *d,
>>> -   paddr_t pbase, paddr_t psize,
>>> -   paddr_t gbase)
>>> +static int __init assign_shared_memory(struct domain *d, paddr_t gbase,
>>> +   const struct membank 

Re: [PATCH 2/7] xen/arm: Wrap shared memory mapping code in one function

2024-05-07 Thread Luca Fancellu
Hi Michal,

>> 
>> +static int __init handle_shared_mem_bank(struct domain *d, paddr_t gbase,
>> + bool owner_dom_io,
>> + const char *role_str,
>> + const struct membank *shm_bank)
>> +{
>> +paddr_t pbase, psize;
>> +int ret;
>> +
>> +BUG_ON(!shm_bank);
> not needed
> 
>> +
>> +pbase = shm_bank->start;
>> +psize = shm_bank->size;
> please add empty line here

Will do
>> 
>> int __init process_shm(struct domain *d, struct kernel_info *kinfo,
>>const struct dt_device_node *node)
>> {
>> @@ -249,32 +290,10 @@ int __init process_shm(struct domain *d, struct kernel_info *kinfo,
>> if ( dt_property_read_string(shm_node, "role", &role_str) == 0 )
>> owner_dom_io = false;
> Looking at owner_dom_io, why don't you move parsing role and setting 
> owner_dom_io accordingly to handle_shared_mem_bank()?

I think I wanted to keep all dt_* functions on the same level inside 
process_shm, otherwise yes, I could
pass down shm_node and do the reading of role_str in handle_shared_mem_bank, or 
I could derive
owner_dom_io from role_str being passed or not, something like:

role_str = NULL;
dt_property_read_string(shm_node, "role", &role_str)

[inside handle_shared_mem_bank]:
if ( role_str )
    owner_dom_io = false;

And pass only role_str to handle_shared_mem_bank.

Is this comment to reduce the number of parameters passed? I guess it's not
for where we call dt_property_read_string, isn't it?

Cheers,
Luca



Re: [PATCH net] xen-netfront: Add missing skb_mark_for_recycle

2024-05-07 Thread Andrew Cooper
Hello,

Please could we request a CVE for "xen-netfront: Add missing
skb_mark_for_recycle" which is 037965402a010898d34f4e35327d22c0a95cd51f
in Linus' tree.

This is a kernel memory leak triggerable from unprivileged userspace.

I can't see any evidence of this fix having been assigned a CVE thus far
on the linux-cve-announce mailing list.

Thanks,

~Andrew


On 25/04/2024 4:13 pm, Greg KH wrote:
> On Thu, Apr 25, 2024 at 02:39:38PM +0100, George Dunlap wrote:
>> Greg,
>>
>> We're issuing an XSA for this; can you issue a CVE?
> To ask for a cve, please contact c...@kernel.org as per our
> documentation.  Please provide the git id of the commit you wish to have
> the cve assigned to.
>
> thanks,
>
> greg k-h




[PATCH v2] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Andrew Cooper
Ever since Xen 4.14, there has been a latent bug with migration.

While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped.  This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.

As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf.  One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).

Extend the max policies' feat.max_subleaf to the highest number Xen knows
about, but leave the default policies matching the host.  This will allow VMs
with a higher feat.max_subleaf than strictly necessary to migrate in.

Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 

v2:
 * Adjust max policies rather than the host policy.
---
 xen/arch/x86/cpu-policy.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 4b6d96276399..f7e2910c01b5 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -590,6 +590,13 @@ static void __init calculate_pv_max_policy(void)
 unsigned int i;
 
 *p = host_cpu_policy;
+
+/*
+ * Some VMs may have a larger-than-necessary feat max_subleaf.  Allow them to
+ * migrate in.
+ */
+p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 for ( i = 0; i < ARRAY_SIZE(fs); ++i )
@@ -630,6 +637,10 @@ static void __init calculate_pv_def_policy(void)
 unsigned int i;
 
 *p = pv_max_cpu_policy;
+
+/* Default to the same max_subleaf as the host. */
+p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 for ( i = 0; i < ARRAY_SIZE(fs); ++i )
@@ -666,6 +677,13 @@ static void __init calculate_hvm_max_policy(void)
 const uint32_t *mask;
 
 *p = host_cpu_policy;
+
+/*
+ * Some VMs may have a larger-than-necessary feat max_subleaf.  Allow them to
+ * migrate in.
+ */
+p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 mask = hvm_hap_supported() ?
@@ -783,6 +801,10 @@ static void __init calculate_hvm_def_policy(void)
 const uint32_t *mask;
 
 *p = hvm_max_cpu_policy;
+
+/* Default to the same max_subleaf as the host. */
+p->feat.max_subleaf = host_cpu_policy.feat.max_subleaf;
+
 x86_cpu_policy_to_featureset(p, fs);
 
 mask = hvm_hap_supported() ?

base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
-- 
2.30.2
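
A hedged illustration of the effect (not Xen code; the helper and parameter
names are invented for the example): a destination host admits an incoming
VM only if the VM's policy fits within the host's max policy, so widening
the max policy's feat.max_subleaf while leaving the default untouched is
what lets the over-wide Ice Lake guest land on Cascade Lake.

    #include <stdbool.h>

    /* Invented helper, sketching the admission check this patch relaxes. */
    bool can_migrate_in(unsigned int vm_feat_max_subleaf,
                        unsigned int max_policy_feat_max_subleaf)
    {
        /* Ice Lake guest: vm_feat_max_subleaf == 2 (INTEL_PSFD).
         * Cascade Lake max policy before the patch: 0 -> rejected.
         * After the patch: ARRAY_SIZE(feat.raw) - 1 -> accepted. */
        return vm_feat_max_subleaf <= max_policy_feat_max_subleaf;
    }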




Re: [PATCH 1/7] xen/arm: Lookup bootinfo shm bank during the mapping

2024-05-07 Thread Luca Fancellu
Hi Michal,

Thanks for your review.

> On 6 May 2024, at 14:24, Michal Orzel  wrote:
> 
> Hi Luca,
> 
> On 23/04/2024 10:25, Luca Fancellu wrote:
>> 
>> 
>> The current static shared memory code is using bootinfo banks when it
>> needs to find the number of borrower, so every time assign_shared_memory
> s/borrower/borrowers

Will fix

> 
>> is called, the bank is searched in the bootinfo.shmem structure.
>> 
>> There is nothing wrong with it, however the bank can be used also to
>> retrieve the start address and size and also to pass fewer arguments to
>> assign_shared_memory. When retrieving the information from the bootinfo
>> bank, it's also possible to move the checks on alignment to
>> process_shm_node in the early stages.
> Is this change really required for what you want to achieve? At the moment
> the alignment checks are done before first use, which requires these values
> to be aligned. The FDT processing part does not need it.

That's true, but it would better separate the parsing part; in the end,
what is the point of failing later if, for example, some values are passed
but not aligned?

> 
>> 
>> So create a new function find_shm() which takes a 'struct shared_meminfo'
> Can we name it find_shm_bank() or find_shm_bank_by_id()?
> I agree that it's better to use a unique ID rather than matching by 
> address/size

Yes, either name is good for me; I would use find_shm_bank_by_id

> 
>> structure and the shared memory ID, to look for a bank with a matching ID,
>> take the physical host address and size from the bank, pass the bank to
>> assign_shared_memory() removing the now unnecessary arguments and finally
>> remove the acquire_nr_borrower_domain() function since now the information
>> can be extracted from the passed bank.
>> Move the "xen,shm-id" parsing early in process_shm to bail out quickly in
>> case of errors (unlikely), as said above, move the checks on alignment
>> to process_shm_node.
>> 
>> Drawback of this change is that now the bootinfo is used also when the
>> bank doesn't need to be allocated, however it will be convenient later
>> to use it as an argument for assign_shared_memory when dealing with
>> the use case where the Host physical address is not supplied by the user.
>> 
>> Signed-off-by: Luca Fancellu 
>> ---
>> xen/arch/arm/static-shmem.c | 105 
>> 1 file changed, 58 insertions(+), 47 deletions(-)
>> 
>> diff --git a/xen/arch/arm/static-shmem.c b/xen/arch/arm/static-shmem.c
>> index 09f474ec6050..f6cf74e58a83 100644
>> --- a/xen/arch/arm/static-shmem.c
>> +++ b/xen/arch/arm/static-shmem.c
>> @@ -19,29 +19,24 @@ static void __init __maybe_unused build_assertions(void)
>>  offsetof(struct shared_meminfo, bank)));
>> }
>> 
>> -static int __init acquire_nr_borrower_domain(struct domain *d,
>> - paddr_t pbase, paddr_t psize,
>> - unsigned long *nr_borrowers)
>> +static const struct membank __init *find_shm(const struct membanks *shmem,
>> + const char *shm_id)
>> {
>> -const struct membanks *shmem = bootinfo_get_shmem();
>> unsigned int bank;
>> 
>> -/* Iterate reserved memory to find requested shm bank. */
>> +BUG_ON(!shmem || !shm_id);
> Is it really necessary? For example, before calling find_shm(), strlen is 
> used on shm_id

So, I guess I did that to have more robust code, in case someone changes
the code in the future and perhaps removes something we rely on. If you
object to them I will remove them though, here and at the other related
points below.

> 
>> +
>> for ( bank = 0 ; bank < shmem->nr_banks; bank++ )
>> {
>> -paddr_t bank_start = shmem->bank[bank].start;
>> -paddr_t bank_size = shmem->bank[bank].size;
>> -
>> -if ( (pbase == bank_start) && (psize == bank_size) )
>> +if ( strncmp(shm_id, shmem->bank[bank].shmem_extra->shm_id,
>> + MAX_SHM_ID_LENGTH) == 0 )
> Why not strcmp? AFAICS it's been validated many times already
> 
>> break;
>> }
>> 
>> if ( bank == shmem->nr_banks )
>> -return -ENOENT;
>> -
>> -*nr_borrowers = shmem->bank[bank].shmem_extra->nr_shm_borrowers;
>> +return NULL;
>> 
>> -return 0;
>> +return &shmem->bank[bank];
>> }
>> 
>> /*
>> @@ -103,14 +98,20 @@ static mfn_t __init acquire_shared_memory_bank(struct domain *d,
>> return smfn;
>> }
>> 
>> -static int __init assign_shared_memory(struct domain *d,
>> -   paddr_t pbase, paddr_t psize,
>> -   paddr_t gbase)
>> +static int __init assign_shared_memory(struct domain *d, paddr_t gbase,
>> +   const struct membank *shm_bank)
>> {
>> mfn_t smfn;
>> int ret = 0;
>> unsigned long nr_pages, nr_borrowers, i;
>> struct page_info *page;
>> +paddr_t pbase, psize;
>> +
>> +

Re: [PATCH 3/7] xen/p2m: put reference for superpage

2024-05-07 Thread Luca Fancellu
Hi Julien,

> On 7 May 2024, at 14:20, Julien Grall  wrote:
> 
> Hi Luca,
> 
> On 23/04/2024 09:25, Luca Fancellu wrote:
>> From: Penny Zheng 
>> We are doing foreign memory mapping for static shared memory, and
>> there is a great possibility that it could be super mapped.
> 
> Is this because we are mapping more than one page at the time? Can you point 
> me to the code?

So, to be honest this patch was originally in Penny's series; my knowledge
of this side of the codebase is very limited and so I pushed this one
basically untouched.

From what I can see in the series, the mappings are made in
handle_shared_mem_bank, and map_regions_p2mt is called for one page at a
time (allocated through allocate_domheap_memory, a new function introduced
in the series).

So is that the case that this patch is not needed?


> 
>> But today, p2m_put_l3_page could not handle superpages.
> 
> This was done on purpose. Xen is not preemptible and therefore we need to be 
> cautious how much work is done within the p2m code.
> 
> With the below proposal, for a 1GB mapping, we may end up calling
> put_page() up to 512 * 512 = 262144 times. put_page() can free memory.
> This could be a very long operation.
> 
> Have you benchmarked how long it would take?

I did not, since its purpose was unclear to me and was not commented on in
the last series from Penny.

Cheers,
Luca



Re: [PATCH 3/7] xen/p2m: put reference for superpage

2024-05-07 Thread Julien Grall

Hi Luca,

On 23/04/2024 09:25, Luca Fancellu wrote:

From: Penny Zheng 

We are doing foreign memory mapping for static shared memory, and
there is a great possibility that it could be super mapped.


Is this because we are mapping more than one page at the time? Can you 
point me to the code?



But today, p2m_put_l3_page could not handle superpages.


This was done on purpose. Xen is not preemptible and therefore we need 
to be cautious how much work is done within the p2m code.


With the below proposal, for a 1GB mapping, we may end up calling
put_page() up to 512 * 512 = 262144 times. put_page() can free memory.
This could be a very long operation.


Have you benchmarked how long it would take?

Cheers,

--
Julien Grall



Re: [PATCH v6 8/8] xen: allow up to 16383 cpus

2024-05-07 Thread Julien Grall

Hi Jan,

On 06/05/2024 07:42, Jan Beulich wrote:

Of course those calculations depend on
what people choose as actual values, but giving an upper bound that is a
power of 2 may at least serve as a hint to them.


This is rather a weak hint. If you want to encourage users to choose a
power-of-2 value, then let's spell it out in the description.


Cheers,

--
Julien Grall



Re: [PATCH v6 8/8] xen: allow up to 16383 cpus

2024-05-07 Thread Julien Grall

Hi Stefano,

On 03/05/2024 20:07, Stefano Stabellini wrote:

On Fri, 3 May 2024, Julien Grall wrote:


[...]


So are you saying that from Xen's point of view, you are expecting no
difference between 256 and 512? And therefore you would be happy to
backport patches if someone finds differences (or even security issues)
when using > 256 pCPUs?


It is difficult to be sure about anything that is not regularly
tested. I am pretty sure someone in the community got Xen running on an
Ampere, so like you said 192 is a good number. However, that is not
regularly tested, so we don't have any regression checks in gitlab-ci or
OSSTest for it.

One approach would be to only support things regularly tested either by
OSSTest, Gitlab-ci, or also Xen community members. I am not sure what
would be the highest number with this way of thinking but likely no
more than 192, probably less. I don't know the CPU core count of the
biggest ARM machine in OSSTest.


This would be rochester* (Cavium Thunder-X). They have 96 pCPUs which,
IIRC, are split across two NUMA nodes.




Another approach is to support a "sensible" number: not something tested
but something we believe should work. No regular testing. (In safety,
they only believe in things that are actually tested, so this would not
be OK. But this is security, not safety, just FYI.) With this approach,
we could round up the number to a limit we think it won't break. If 192
works, 256/512 should work? I don't know but couldn't think of something
that would break going from 192 to 256.


It depends what you mean by work/break. Strictly speaking, Xen should
run (i.e. not crash). However, it is unclear how well: as you increase
the number of physical CPUs, you will increase contention and may find
some bottlenecks.


I haven't done any performance testing with that many CPUs and I haven't
seen any so far with Xen. But I have some areas of concern.


* Xenstored: At least the C version is single-threaded. Technically the 
limit here is not based on the number of pCPUs, but as you increase it, 
you indirectly increase the number of domains that can run. I doubt it 
will behave well if you have 4096 domains running (I am thinking about 
the x86 limit...).


* Locking
  * How Xen uses the locks: I don't think we have many places where we
have global locks (one is the memory subsystem). If a lock is already
taken, the others will spin. It is unclear whether we could hit high
contention.
  * How Xen implements the locks: At the moment, we are using LL/SC. My
take from XSA-295 is that there is a lack of fairness with them. I am not
sure what would happen if they get contended (as we support more pCPUs).
It is also probably time to finally implement LSE atomics.


* TLB flush: The TLB flushes are broadcast. There are some suggestions
on the Linux ML [1] that they don't perform well on some processors. The
discussion seems to have gone nowhere in Linux. But I think it is
probably worth taking into account when we decide to update the limit
we (security) support.




It depends on how strict we want to be on testing requirements.
From the above, I am rather worried about claiming that Xen can support up
to 256 (and TBH even 192) without any proper testing. This could end up
backfiring, as we may need to do (in a rush) and backport some rather
large work (unless we decide to remove support after the fact).


I think I would prefer if we have a low number until someone can do some
testing (including potentially malicious guests). If we want a
power of two, I would go with 128 because this is closer to the HW we
have in testing. If in the future someone can show some data on other
platforms (e.g. Ampere), then we can up the limit.


> I am not
> sure what approach was taken by x86 so far.

It is unclear to me. I don't see how we can claim to support up to 4096 
CPUs. But that's for the x86 folks to decide.


Cheers,

[1] 
https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.ta...@jp.fujitsu.com/


--
Julien Grall



[PATCH v7 5/6] [DO NOT APPLY] switch to qemu fork

2024-05-07 Thread Marek Marczykowski-Górecki
This makes tests to use patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1
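
A note in passing: since Config.mk uses "?=" for these variables, the same
switch can be tried locally without editing the file, assuming the usual
make override semantics:

    make QEMU_UPSTREAM_URL=https://github.com/marmarek/qemu \
         QEMU_UPSTREAM_REVISION=origin/msix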



[PATCH v7 1/6] x86/msi: Extend per-domain/device warning mechanism

2024-05-07 Thread Marek Marczykowski-Górecki
The arch_msix struct had a single "warned" field with a domid for which a
warning was issued. An upcoming patch will need a similar mechanism for a
few more warnings, so change it to save a bit field of issued warnings.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v6:
- add MSIX_CHECK_WARN macro (Jan)
- drop struct name from warned_kind union (Jan)

New in v5
---
 xen/arch/x86/include/asm/msi.h | 17 -
 xen/arch/x86/msi.c |  5 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..bcfdfd35345d 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -208,6 +208,15 @@ struct msg_address {
PCI_MSIX_ENTRY_SIZE + \
(~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1)))
 
+#define MSIX_CHECK_WARN(msix, domid, which) ({ \
+if ( (msix)->warned_domid != (domid) ) \
+{ \
+(msix)->warned_domid = (domid); \
+(msix)->warned_kind.all = 0; \
+} \
+(msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \
+})
+
 struct arch_msix {
 unsigned int nr_entries, used_entries;
 struct {
@@ -217,7 +226,13 @@ struct arch_msix {
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
 bool host_maskall, guest_maskall;
-domid_t warned;
+domid_t warned_domid;
+union {
+uint8_t all;
+struct {
+bool maskall   : 1;
+};
+} warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..42c793426da3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 domid_t domid = pdev->domain->domain_id;
 
 maskall = true;
-if ( pdev->msix->warned != domid )
-{
-pdev->msix->warned = domid;
+if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
 printk(XENLOG_G_WARNING
"cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
   desc->irq, domid, &pdev->sbdf);
-}
 }
 pdev->msix->host_maskall = maskall;
 if ( maskall || pdev->msix->guest_maskall )
-- 
git-series 0.9.1
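
A short trace of the warn-once semantics the macro gives (hedged
illustration; the field and macro names come from the patch, the call
sequence is invented):

    /* 1st call, domid 1, .maskall clear -> sets .maskall, yields true    */
    /* 2nd call, domid 1, .maskall set   -> yields false (no log flood)   */
    /* a call for domid 2 -> warned_domid updated, .all reset, true again */
    if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
        printk(XENLOG_G_WARNING "cannot mask IRQ %d: ...\n", desc->irq);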



[PATCH v7 3/6] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2024-05-07 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment by removing the /dev/mem device node. A full
test for lockdown and stubdomain will come later, when all requirements
are in place.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Stefano Stabellini 
---
This can be applied only after the QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU won't have access to /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v7 6/6] [DO NOT APPLY] switch to alternative artifact repo

2024-05-07 Thread Marek Marczykowski-Górecki
For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 49d6265ad5b4..0d7e311417d8 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v7 4/6] automation: switch to a wifi card on ADL system

2024-05-07 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on an MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for the HVM test, because MSI-X adjacent writes are
not supported on PV.

This also requires including drivers and firmware in the test system.
Remove firmware unrelated to the test, to not increase the initrd size too
much (all firmware takes over 100MB compressed).
And finally adjust the test script to handle not only eth0 as a test
device, but also wlan0, and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Stefano Stabellini 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after the QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml  | 4 ++++
 automation/scripts/qubes-x86-64.sh  | 7 +++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index ad249fa0a5d9..6803cae116b5 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1



[PATCH v7 0/6] MSI-X support with qemu in stubdomain, and other related changes

2024-05-07 Thread Marek Marczykowski-Górecki
This series includes changes to make MSI-X work with a Linux stubdomain and
especially the Intel Wifi 6 AX210 card. This takes care of the remaining
reasons for QEMU to access /dev/mem, but also of the Intel Wifi card
violating the spec by putting some registers on the same page as the MSI-X
table.
Besides the stubdomain case (which I care more about), this is also necessary
for PCI passthrough to work with lockdown enabled in dom0 (when QEMU runs in
dom0).

See individual patches for details.

This series also includes tests for MSI-X using the new approach (by
preventing QEMU access to /dev/mem). But for it to work, it needs a QEMU
change that makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1280152273
(the build failures are due to issues with fetching or building newer QEMU
 discussed on Matrix)

v7:
 - "x86/msi: passthrough all MSI-X vector ctrl writes to device model" is 
already applied

Marek Marczykowski-Górecki (6):
  x86/msi: Extend per-domain/device warning mechanism
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk   |   4 +-
 automation/gitlab-ci/build.yaml |   4 +-
 automation/gitlab-ci/test.yaml  |   4 +-
 automation/scripts/qubes-x86-64.sh  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c | 205 -
 xen/arch/x86/include/asm/msi.h  |  22 +-
 xen/arch/x86/msi.c  |  47 ++-
 9 files changed, 285 insertions(+), 19 deletions(-)

base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
-- 
git-series 0.9.1



[PATCH v7 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-05-07 Thread Marek Marczykowski-Górecki
Some devices (notably the Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as the MSI-X table. The device model (especially one in a
stubdomain) cannot really handle those, as direct writes to that page are
refused (the page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this requires correlating the read/write location with the guest
MSI-X table address. Since QEMU doesn't map the MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. A similar feature
for PV would need to be done separately.

This will also be used to read the Pending Bit Array, if it lives on the
same page, making QEMU not need /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If the PBA lives on another page, QEMU will
map it to the guest directly.
If the PBA lives on the same page, discard writes and log a message.
Technically, writes outside of the PBA could be allowed, but at this moment
the precise location of the PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops needs the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map the first/last page of the MSI-X table (if necessary) and
save their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses an existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as the default ("not mapped")
value - which simplifies the code a bit.

Based on assumption that all MSI-X page accesses are handled by Xen, do
not forward adjacent accesses to other hypothetical ioreq servers, even
if the access wasn't handled for some reason (failure to map pages etc).
Relevant places log a message about that already.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v7:
- simplify logic based on assumption that all access to MSI-X pages are
  handled by Xen (Roger)
- move calling adjacent_handle() into adjacent_{read,write}() (Roger)
- move range check into msixtbl_addr_to_desc() (Roger)
- fix off-by-one when initializing adj_access_idx[ADJ_IDX_LAST] (Roger)
- no longer distinguish between unhandled write due to PBA nearby and
  other reasons
- add missing break after ASSERT_UNREACHABLE (Jan)
Changes in v6:
- use MSIX_CHECK_WARN macro
- extend assert on fixmap_idx
- add break in default label, after ASSERT_UNREACHABLE(), and move
  setting default there
- style fixes
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c| 205 --
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c |  42 +++-
 3 files changed, 242 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 17983789..f7b7b4998b5e 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
 return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
 struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
 struct domain *d = v->domain;
 
list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
-if ( addr >= entry->gtable &&
- addr < entry->gtable + entry->table_len )
+if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+ PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
 return entry;
 
 return NULL;
@@ -203,6 +207,10 @@ static struct msi_desc *msixtbl_addr_to_desc(
 if ( !entry || !entry->pdev )
 return NULL;
 
+if ( addr < entry->gtable ||
+ addr >= entry->gtable + entry->table_len )
+return NULL;
+
 nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
 
list_for_each_entry( desc, &entry->pdev->msi_list, list )
@@ -213,6 +221,152 @@ static struct 

Re: [PATCH 3/7] xen/p2m: put reference for superpage

2024-05-07 Thread Michal Orzel
Hi Luca,

On 23/04/2024 10:25, Luca Fancellu wrote:
> 
> 
> From: Penny Zheng 
> 
> We are doing foreign memory mapping for static shared memory, and
> there is a good chance that it could be super mapped.
> But today, p2m_put_l3_page cannot handle superpages.
> 
> This commit implements a new function, p2m_put_superpage, to handle superpages,
> specifically for helping put extra references for foreign superpages.
> 
> Signed-off-by: Penny Zheng 
> Signed-off-by: Luca Fancellu 
> ---
> v1:
>  - patch from 
> https://patchwork.kernel.org/project/xen-devel/patch/20231206090623.1932275-9-penny.zh...@arm.com/
> ---
>  xen/arch/arm/mmu/p2m.c | 58 +++---
>  1 file changed, 43 insertions(+), 15 deletions(-)
> 
> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> index 41fcca011cf4..479a80fbd4cf 100644
> --- a/xen/arch/arm/mmu/p2m.c
> +++ b/xen/arch/arm/mmu/p2m.c
> @@ -753,17 +753,9 @@ static int p2m_mem_access_radix_set(struct p2m_domain *p2m, gfn_t gfn,
>  return rc;
>  }
> 
> -/*
> - * Put any references on the single 4K page referenced by pte.
> - * TODO: Handle superpages, for now we only take special references for leaf
> - * pages (specifically foreign ones, which can't be super mapped today).
> - */
> -static void p2m_put_l3_page(const lpae_t pte)
> +/* Put any references on the single 4K page referenced by mfn. */
> +static void p2m_put_l3_page(mfn_t mfn, unsigned type)
Shouldn't type be of p2m_type_t?
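
i.e.:

    static void p2m_put_l3_page(mfn_t mfn, p2m_type_t type)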

>  {
> -mfn_t mfn = lpae_get_mfn(pte);
> -
> -ASSERT(p2m_is_valid(pte));
> -
>  /*
>   * TODO: Handle other p2m types
>   *
> @@ -771,16 +763,53 @@ static void p2m_put_l3_page(const lpae_t pte)
>   * flush the TLBs if the page is reallocated before the end of
>   * this loop.
>   */
> -if ( p2m_is_foreign(pte.p2m.type) )
> +if ( p2m_is_foreign(type) )
>  {
>  ASSERT(mfn_valid(mfn));
>  put_page(mfn_to_page(mfn));
>  }
>  /* Detect the xenheap page and mark the stored GFN as invalid. */
> -else if ( p2m_is_ram(pte.p2m.type) && is_xen_heap_mfn(mfn) )
> +else if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>  page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
>  }
> 
> +/* Put any references on the superpage referenced by mfn. */
> +static void p2m_put_superpage(mfn_t mfn, unsigned int next_level, unsigned type)
Shouldn't type be of p2m_type_t?

> +{
> +unsigned int i;
> +unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
> +
> +for ( i = 0; i < XEN_PT_LPAE_ENTRIES; i++ )
> +{
> +if ( next_level == 3 )
> +p2m_put_l3_page(mfn, type);
> +else
> +p2m_put_superpage(mfn, next_level + 1, type);
> +
> +mfn = mfn_add(mfn, 1 << level_order);
> +}
> +}
> +
> +/* Put any references on the page referenced by pte. */
> +static void p2m_put_page(const lpae_t pte, unsigned int level)
> +{
> +mfn_t mfn = lpae_get_mfn(pte);
> +
> +ASSERT(p2m_is_valid(pte));
> +
> +/*
> + * We either have a first-level 1G superpage or a
> + * second-level 2M superpage.
> + */
> +if ( p2m_is_superpage(pte, level) )
> +return p2m_put_superpage(mfn, level + 1, pte.p2m.type);
> +else
No need for this else
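
i.e. something like (sketch):

    if ( p2m_is_superpage(pte, level) )
        return p2m_put_superpage(mfn, level + 1, pte.p2m.type);

    ASSERT(level == 3);
    p2m_put_l3_page(mfn, pte.p2m.type);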

> +{
> +ASSERT(level == 3);
> +return p2m_put_l3_page(mfn, pte.p2m.type);
> +}
> +}
> +
>  /* Free lpae sub-tree behind an entry */
>  static void p2m_free_entry(struct p2m_domain *p2m,
> lpae_t entry, unsigned int level)
> @@ -809,9 +838,8 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>  #endif
> 
>  p2m->stats.mappings[level]--;
> -/* Nothing to do if the entry is a super-page. */
> -if ( level == 3 )
> -p2m_put_l3_page(entry);
> +p2m_put_page(entry, level);
> +
>  return;
>  }
> 
> --
> 2.34.1
> 

~Michal



Re: [PATCH] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Andrew Cooper
On 07/05/2024 12:50 pm, Roger Pau Monné wrote:
> On Tue, May 07, 2024 at 12:29:57PM +0100, Andrew Cooper wrote:
>> Ever since Xen 4.14, there has been a latent bug with migration.
>>
>> While some toolstacks can level the features properly, they don't shrink
>> feat.max_subleaf when all features have been dropped.  This is because
>> we *still* have not completed the toolstack side work for full CPU Policy
>> objects.
>>
>> As a consequence, even when properly feature levelled, VMs can't migrate
>> "backwards" across hardware which reduces feat.max_subleaf.  One such example
>> is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
>>
>> Extend the host policy's feat.max_subleaf to the highest number Xen knows
>> about, similarly to how we extend extd.max_leaf for LFENCE_DISPATCH.  This
>> will allow VMs with a higher feat.max_subleaf than strictly necessary to
>> migrate in.
> Seeing what we do for max_extd_leaf, shouldn't we switch to doing what
> you propose for feat.max_subleaf to max_extd_leaf also?
>
> To allow migration between hosts that have 0x80000021.eax and hosts
> that don't have such extended leaf.
>
> cpu_has_lfence_dispatch kind of does that, but if lfence cannot be
> made serializing then the max extended leaf is not expanded.  And we
> should also likely account for more feature leaves possibly appearing
> after 0x80000021?

On second thoughts, this adjustment ought to be in the max policies only.

It's slightly different to LFENCE_DISPATCH, in that we don't actually
have any set bits in those leaves.

I'll do a different patch.

~Andrew



Re: [PATCH] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Roger Pau Monné
On Tue, May 07, 2024 at 12:29:57PM +0100, Andrew Cooper wrote:
> Ever since Xen 4.14, there has been a latent bug with migration.
> 
> While some toolstacks can level the features properly, they don't shrink
> feat.max_subleaf when all features have been dropped.  This is because
> we *still* have not completed the toolstack side work for full CPU Policy
> objects.
> 
> As a consequence, even when properly feature levelled, VMs can't migrate
> "backwards" across hardware which reduces feat.max_subleaf.  One such example
> is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
> 
> Extend the host policy's feat.max_subleaf to the highest number Xen knows
> about, similarly to how we extend extd.max_leaf for LFENCE_DISPATCH.  This
> will allow VMs with a higher feat.max_subleaf than strictly necessary to
> migrate in.

Seeing what we do for max_extd_leaf, shouldn't we switch to doing what
you propose for feat.max_subleaf to max_extd_leaf also?

To allow migration between hosts that have 0x80000021.eax and hosts
that don't have such extended leaf.

cpu_has_lfence_dispatch kind of does that, but if lfence cannot be
made serializing then the max extended leaf is not expanded.  And we
should also likely account for more feature leaves possibly appearing
after 0x80000021?
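
Something along these lines, maybe (just a sketch mirroring what this patch
does for feat.max_subleaf; the exact bound and placement in
calculate_host_policy() would need checking):

    /*
     * Advertise the full range of extended leaves Xen knows about, so
     * that VMs coming from hosts which expose e.g. leaf 0x80000021 can
     * still migrate in even if the local hardware stops earlier.
     */
    p->extd.max_leaf = 0x80000000U | (ARRAY_SIZE(p->extd.raw) - 1);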

Thanks, Roger.



Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Marek Marczykowski-Górecki
On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
> `xl devd` has been observed leaking /var/log/xldevd.log into children.
> 
> Link: https://github.com/QubesOS/qubes-issues/issues/8292
> Reported-by: Demi Marie Obenour 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Anthony PERARD 
> CC: Juergen Gross 
> CC: Demi Marie Obenour 
> CC: Marek Marczykowski-Górecki 
> 
> Also entirely speculative based on the QubesOS ticket.
> ---
>  tools/xl/xl_utils.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> index 17489d182954..060186db3a59 100644
> --- a/tools/xl/xl_utils.c
> +++ b/tools/xl/xl_utils.c
> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
>  exit(-1);
>  }
>  
> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));

This one might not be enough, as the FD gets dup2()-ed to stdout/stderr
just outside of the context here, and is then inherited by various hotplug
scripts. Just adding O_CLOEXEC here means the hotplug scripts will run
with stdout/stderr closed. The scripts shipped with Xen do redirect
stderr to a log quite early, but a) they don't do it for stdout, and b)
custom hotplug scripts are a valid use case.
Without that, I see at least a few potential issues:
- some log messages may be lost (minor, but annoying)
- something might simply fail on writing to a closed FD, breaking the
  hotplug script
- FD 1 will be used as the first free FD for any open() or similar call - if
  a tool later tries writing something to stdout, it will get written
  to that FD - the worst of all three

What should the behavior of hotplug script logging be? Should they
always take care of their own logging? If so, the hotplug calling part
should redirect stdout/stderr to /dev/null IMO. But if `xl` should
provide some default logging for them (like the xldevd.log here?), then
the O_CLOEXEC should be set only after duplicating logfile over stdout/err.
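
For the latter, roughly (a sketch of the idea only, not actual code):

    CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND, 0644));
    ...
    dup2(logfile, STDOUT_FILENO);
    dup2(logfile, STDERR_FILENO);
    /* only now mark the original FD close-on-exec */
    fcntl(logfile, F_SETFD, FD_CLOEXEC);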

>  free(fullname);
>  assert(logfile >= 3);
>  
> 
> base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
> prerequisite-patch-id: 212e50457e9b6bdfd06a97da545a5aa7155bb919

Which one is this? I don't see it in staging, nor in any of your
branches on xenbits. Lore finds "tools/libxs: Open /dev/xen/xenbus fds
as O_CLOEXEC" which I guess is correct, but I have no idea how it
correlates it, as this hash doesn't appear anywhere in the message, nor
its headers...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH] x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

2024-05-07 Thread Andrew Cooper
Ever since Xen 4.14, there has been a latent bug with migration.

While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped.  This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.

As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf.  One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).

Extend the host policy's feat.max_subleaf to the highest number Xen knows
about, similarly to how we extend extd.max_leaf for LFENCE_DISPATCH.  This
will allow VMs with a higher feat.max_subleaf than strictly necessary to
migrate in.

Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
---
 xen/arch/x86/cpu-policy.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 4b6d96276399..a216fc8b886f 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -373,8 +373,13 @@ static void __init calculate_host_policy(void)
 
 p->basic.max_leaf =
 min_t(uint32_t, p->basic.max_leaf,   ARRAY_SIZE(p->basic.raw) - 1);
-p->feat.max_subleaf =
-min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
+
+/*
+ * p->feat is "just" featureset information.  We know about more than may
+ * be present in this hardware.  Also, VMs may have a higher max_subleaf
+ * than strictly necessary, and we can accept those too.
+ */
+p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
 
 max_extd_leaf = p->extd.max_leaf;
 

base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
-- 
2.30.2




Re: [XEN PATCH v4 5/5] xen/arm: ffa: support notification

2024-05-07 Thread Jens Wiklander
Hi Julien,

On Fri, May 3, 2024 at 4:25 PM Julien Grall  wrote:
>
> Hi Jens,
>
> On 03/05/2024 14:54, Jens Wiklander wrote:
> >>> +static int ffa_setup_irq_callback(struct notifier_block *nfb,
> >>> +  unsigned long action, void *hcpu)
> >>> +{
> >>> +unsigned int cpu = (unsigned long)hcpu;
> >>> +struct notif_irq_info irq_info = { };
> >>> +
> >>> +switch ( action )
> >>> +{
> >>> +case CPU_ONLINE:
> >>
> >> Can't you execute the notifier in CPU_STARTING? This will be called on
> >> the CPU directly, so you should be able to use request_irq(...).
> >
> > I tried that first but it failed with the ASSERT_ALLOC_CONTEXT() in 
> > _xmalloc().
> >
> > I've also tested a three-step solution with CPU_UP_PREPARE,
> > CPU_STARTING, and CPU_UP_CANCELED.
> > My approach here is more direct, but it still suffers from a weakness
> > in error handling even if it seems quite unlikely to run out of heap
> > or for setup_irq() to fail at this stage.
>
> Ah, I didn't notice that notify_cpu_starting() is called with IRQs
> disabled. I assumed they would be enabled.
>
> Then I would consider to do:
>
> diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> index 6efed876782e..db322672e508 100644
> --- a/xen/arch/arm/smpboot.c
> +++ b/xen/arch/arm/smpboot.c
> @@ -389,6 +389,7 @@ void asmlinkage start_secondary(void)
>*/
>   init_maintenance_interrupt();
>   init_timer_interrupt();
> +init_tee_interrupt();
>
>   local_abort_enable();
>
> And plumb through the TEE subsystem.

I'll use that in the next version, it should remove a lot of complex code.
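
Roughly like this, I suppose (sketch only - the init_interrupt hook is a
made-up name, cur_mediator being the pointer tee.c already keeps):

    /* xen/arch/arm/tee/tee.c */
    void init_tee_interrupt(void)
    {
        if ( cur_mediator && cur_mediator->ops->init_interrupt )
            cur_mediator->ops->init_interrupt();
    }

with the FF-A mediator doing its per-CPU notification IRQ setup from that
hook.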

Thanks,
Jens



[PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Andrew Cooper
`xl devd` has been observed leaking /var/log/xldevd.log into children.

Link: https://github.com/QubesOS/qubes-issues/issues/8292
Reported-by: Demi Marie Obenour 
Signed-off-by: Andrew Cooper 
---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: Demi Marie Obenour 
CC: Marek Marczykowski-Górecki 

Also entirely speculative based on the QubesOS ticket.
---
 tools/xl/xl_utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
index 17489d182954..060186db3a59 100644
--- a/tools/xl/xl_utils.c
+++ b/tools/xl/xl_utils.c
@@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
 exit(-1);
 }
 
-CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
+CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
 free(fullname);
 assert(logfile >= 3);
 

base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
prerequisite-patch-id: 212e50457e9b6bdfd06a97da545a5aa7155bb919
-- 
2.30.2




[libvirt test] 185934: tolerable all pass - PUSHED

2024-05-07 Thread osstest service owner
flight 185934 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185934/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185908
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-arm64-arm64-libvirt-qcow2 15 saverestore-support-checkfail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass

version targeted for testing:
 libvirt  da95bcb6b2d9b04958e0f2603202801dd29debb8
baseline version:
 libvirt  3146305fd3a610573963fe4858cc12ec1c4cf5c7

Last test of basis   185908  2024-05-03 04:20:40 Z4 days
Testing same since   185934  2024-05-07 04:20:24 Z0 days1 attempts


People who touched revisions under test:
  Adam Julis 
  Ján Tomko 
  Kristina Hanicova 
  Michal Privoznik 
  Oleg Sviridov 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-amd64-libvirt-xsm pass
 test-arm64-arm64-libvirt-xsm pass
 test-amd64-amd64-libvirt pass
 test-arm64-arm64-libvirt pass
 test-armhf-armhf-libvirt pass
 test-amd64-amd64-libvirt-pairpass
 test-amd64-amd64-libvirt-qcow2   pass
 test-arm64-arm64-libvirt-qcow2   pass
 test-amd64-amd64-libvirt-raw pass
 test-arm64-arm64-libvirt-raw pass
 test-amd64-amd64-libvirt-vhd pass
 test-armhf-armhf-libvirt-vhd pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/libvirt.git
   3146305fd3..da95bcb6b2  da95bcb6b2d9b04958e0f2603202801dd29debb8 -> 
xen-tested-master



[linux-linus test] 185930: tolerable FAIL - PUSHED

2024-05-07 Thread osstest service owner
flight 185930 linux-linus real [real]
flight 185938 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/185930/
http://logs.test-lab.xenproject.org/osstest/logs/185938/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl-raw   8 xen-bootfail pass in 185938-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-raw 14 migrate-support-check fail in 185938 never pass
 test-armhf-armhf-xl-raw 15 saverestore-support-check fail in 185938 never pass
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 185928
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 185928
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 185928
 test-armhf-armhf-xl-credit2   8 xen-boot fail  like 185928
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 185928
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 185928
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 185928
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-vhd 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-qcow214 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-qcow215 saverestore-support-checkfail   never pass

version targeted for testing:
 linuxdccb07f2914cdab2ac3a5b6c98406f765acab803
baseline version:
 linuxdd5a440a31fae6e459c0d627162825505361

Last test of basis   185928  2024-05-06 15:43:38 Z0 days
Testing same since   185930  2024-05-06 23:44:00 Z0 days1 attempts


People who touched revisions under test:
  Andy Shevchenko 
  Chengming Zhou 
  Dan Carpenter 
  David Rientjes 
  David Sterba 
  Geert Uytterhoeven 
  Josef Bacik 
  Linus Torvalds 
  Maxime Ripard 
  Nicolas Bouchinet 
  Qu Wenruo 
  Uwe Kleine-König 
  Vlastimil Babka 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  

Re: [PATCH v3 6/9] xen/arm64: bpi: Add missing code symbol annotations

2024-05-07 Thread Julien Grall

Hi,

On 06/05/2024 13:54, Edgar E. Iglesias wrote:

On Sat, May 4, 2024 at 2:14 AM Stefano Stabellini
 wrote:


On Wed, 1 May 2024, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

Use the generic xen/linkage.h macros to annotate code symbols
and add missing annotations.

Signed-off-by: Edgar E. Iglesias 
---
  xen/arch/arm/arm64/bpi.S | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/xen/arch/arm/arm64/bpi.S b/xen/arch/arm/arm64/bpi.S
index 4e63825220..b16e4d1e29 100644
--- a/xen/arch/arm/arm64/bpi.S
+++ b/xen/arch/arm/arm64/bpi.S
@@ -52,14 +52,15 @@
   * micro-architectures in a system.
   */
  .align   11
-ENTRY(__bp_harden_hyp_vecs_start)
+FUNC(__bp_harden_hyp_vecs_start)
  .rept 4
  vectors hyp_traps_vector
  .endr
-ENTRY(__bp_harden_hyp_vecs_end)
+GLOBAL(__bp_harden_hyp_vecs_end)
+END(__bp_harden_hyp_vecs_start)


Shouldn't GLOBAL be changed to FUNC as well?



I was a bit unsure but went for GLOBAL since the _end labels point to
addresses after and outside of the code sequence.
But I don't have a strong opinion and am happy to change them to FUNC
if you feel that's better.


I don't think it should be FUNC as this is not meant to be called 
directly. I am also under the impression that we were planning to get rid of
GLOBAL() as well.


Furthermore, __bp_harden_hyp_vecs_start is not a function per se. It is
a pointer to the vector table.


From a brief look, the same remarks would apply to the rest of bpi.S.
So I think we want to switch all the ENTRY() to LABEL().
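
Something like (sketch; whether the _end labels also want an END() is a
separate question):

    LABEL(__bp_harden_hyp_vecs_start)
        .rept 4
        vectors hyp_traps_vector
        .endr
    LABEL(__bp_harden_hyp_vecs_end)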


Cheers,

--
Julien Grall



[ovmf test] 185937: all pass - PUSHED

2024-05-07 Thread osstest service owner
flight 185937 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185937/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 987bea6525d70cd01649472c93d19f89d41d83a2
baseline version:
 ovmf 1c0d4ae2c0fd24164873947c2e262c499ecf13b5

Last test of basis   185935  2024-05-07 05:12:56 Z0 days
Testing same since   185937  2024-05-07 07:43:05 Z0 days1 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Ray Ni 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   1c0d4ae2c0..987bea6525  987bea6525d70cd01649472c93d19f89d41d83a2 -> 
xen-tested-master



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-07 Thread Mickaël Salaün
On Mon, May 06, 2024 at 06:34:53PM GMT, Sean Christopherson wrote:
> On Mon, May 06, 2024, Mickaël Salaün wrote:
> > On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > > ---
> > > > 
> > > > Changes since v1:
> > > > * New patch. Making user space aware of Heki properties was requested by
> > > >   Sean Christopherson.
> > > 
> > > No, I suggested having userspace _control_ the pinning[*], not merely be 
> > > notified
> > > of pinning.
> > > 
> > >  : IMO, manipulation of protections, both for memory (this patch) and CPU 
> > > state
> > >  : (control registers in the next patch) should come from userspace.  I 
> > > have no
> > >  : objection to KVM providing plumbing if necessary, but I think 
> > > userspace needs
> > >  : to have full control over the actual state.
> > >  : 
> > >  : One of the things that caused Intel's control register pinning series 
> > > to stall
> > >  : out was how to handle edge cases like kexec() and reboot.  Deferring 
> > > to userspace
> > >  : means the kernel doesn't need to define policy, e.g. when to unprotect 
> > > memory,
> > >  : and avoids questions like "should userspace be able to overwrite 
> > > pinned control
> > >  : registers".
> > >  : 
> > >  : And like the confidential VM use case, keeping userspace in the loop 
> > > is a big
> > >  : benefit, e.g. the guest can't circumvent protections by coercing 
> > > userspace into
> > >  : writing to protected memory.
> > > 
> > > I stand by that suggestion, because I don't see a sane way to handle 
> > > things like
> > > kexec() and reboot without having a _much_ more sophisticated policy than 
> > > would
> > > ever be acceptable in KVM.
> > > 
> > > I think that can be done without KVM having any awareness of CR pinning 
> > > whatsoever.
> > > E.g. userspace just needs the ability to intercept CR writes and inject 
> > > #GPs.  Off
> > > the cuff, I suspect the uAPI could look very similar to MSR filtering.  
> > > E.g. I bet
> > > userspace could enforce MSR pinning without any new KVM uAPI at all.
> > > 
> > > [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com
> > 
> > OK, I had concern about the control not directly coming from the guest,
> > especially in the case of pKVM and confidential computing, but I get you
> 
> Hardware-based CoCo is completely out of scope, because KVM has zero 
> visibility
> into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
> shouldn't intercept CR0/CR4 for SNP guests).
> 
> And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM 
> might
> help enforce policies that are defined by hardware/firmware, but KVM doesn't
> define any of its own.
> 
> If pKVM on x86 comes along, then KVM will likely get in the business of 
> defining
> policy, but until that happens, KVM needs to stay firmly out of the picture.
> 
> > point.  It should indeed be quite similar to the MSR filtering on the
> > userspace side, except that we need another interface for the guest to
> > request such change (i.e. self-protection).
> > 
> > Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> > forward the request to userspace with a VM exit instead?  That would
> > also enable userspace to get the request and directly configure the CR
> > pinning with the same VM exit.
> 
> No?  Maybe?  I strongly suspect that full support will need a richer set of 
> APIs
> than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated 
> SMM,
> and so on and so forth.  And that's just for CR pinning.
> 
> And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
> delegation or restriction, i.e. there's no way for the guest to communicate to
> the hypervisor that a less privileged component is allowed to perform some 
> action,
> nor is there a way for the guest to say some chunk of CPL0 code *isn't* 
> allowed
> to request transition.  Delegation and restriction all has to be done 
> out-of-band.
> 
> It'd probably be more annoying to set up initially, but I think a synthetic 
> device
> with an MMIO-based interface would be more powerful and flexible in the long 
> run.
> Then userspace can evolve without needing to wait for KVM to catch up.
> 
> Actually, potential bad/crazy idea.  Why does the _host_ need to define 
> policy?
> Linux already knows what assets it wants to (un)protect and when.  What's 
> missing
> is a way for the guest kernel to effectively deprivilege and re-authenticate
> itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, 
> with a
> bit of pKVM mixed in?
> 
> Borrowing VTL terminology, where VTL0 is the least privileged, userspace 
> launches
> the VM at VTL0.  At some point, the guest triggers the deprivileging sequence 
> and
> userspace creates VTL1.  Userspace also provides a way for VTL0 to restrict 
> access to
> its memory, e.g. to effectively make the page tables for 

[PATCH v2 0/2] Some fixes for the existing dynamic dtbo code

2024-05-07 Thread Henry Wang
During the review process for v1 of the dynamic dtbo series, some
issues in the existing code were identified. Discussions of them can
be found in [1] (for the first patch) and [2] (for the second patch).

Since the main part of the remaining dynamic dtbo series requires more
rework, just send these fixes for now.

[1] 
https://lore.kernel.org/xen-devel/835099c8-6cf0-4f6d-899b-07388df89...@xen.org/
[2] 
https://lore.kernel.org/xen-devel/eaea1986-a27e-4d6c-932f-1d0a9918861f@perard/

Henry Wang (2):
  xen/common/dt-overlay: Fix missing lock when removing the device
  tools/xl: Correct the help information and exit code of the dt-overlay
command

 tools/xl/xl_vmcontrol.c | 6 +++---
 xen/common/dt-overlay.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
2.34.1




[PATCH v2 2/2] tools/xl: Correct the help information and exit code of the dt-overlay command

2024-05-07 Thread Henry Wang
Fix the name mismatch in the xl dt-overlay command: the command name
should be "dt-overlay" instead of "dt_overlay".

Fix the exit code of the dt-overlay command: use EXIT_FAILURE
instead of ERROR_FAIL.

Fixes: 61765a07e3d8 ("tools/xl: Add new xl command overlay for device tree overlay support")
Suggested-by: Anthony PERARD 
Signed-off-by: Henry Wang 
---
v2:
- New patch
---
 tools/xl/xl_vmcontrol.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 98f6bd2e76..02575d5d36 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -1278,7 +1278,7 @@ int main_dt_overlay(int argc, char **argv)
 const int overlay_remove_op = 2;
 
 if (argc < 2) {
-help("dt_overlay");
+help("dt-overlay");
 return EXIT_FAILURE;
 }
 
@@ -1302,11 +1302,11 @@ int main_dt_overlay(int argc, char **argv)
 fprintf(stderr, "failed to read the overlay device tree file %s\n",
 overlay_config_file);
 free(overlay_dtb);
-return ERROR_FAIL;
+return EXIT_FAILURE;
 }
 } else {
 fprintf(stderr, "overlay dtbo file not provided\n");
-return ERROR_FAIL;
+return EXIT_FAILURE;
 }
 
 rc = libxl_dt_overlay(ctx, overlay_dtb, overlay_dtb_size, op);
-- 
2.34.1




[PATCH v2 1/2] xen/common/dt-overlay: Fix missing lock when removing the device

2024-05-07 Thread Henry Wang
If CONFIG_DEBUG=y, the below assertion will be triggered:
(XEN) Assertion 'rw_is_locked(&dt_host_lock)' failed at drivers/passthrough/device_tree.c:146
(XEN) [ Xen-4.19-unstable  arm64  debug=y  Not tainted ]
(XEN) CPU:0
(XEN) PC: 0a257418 iommu_remove_dt_device+0x8c/0xd4
(XEN) LR: 0a2573a0
(XEN) SP: 8000fff7fb30
(XEN) CPSR:   0249 MODE:64-bit EL2h (Hypervisor, handler)
[...]

(XEN) Xen call trace:
(XEN)[<0a257418>] iommu_remove_dt_device+0x8c/0xd4 (PC)
(XEN)[<0a2573a0>] iommu_remove_dt_device+0x14/0xd4 (LR)
(XEN)[<0a20797c>] dt-overlay.c#remove_node_resources+0x8c/0x90
(XEN)[<0a207f14>] dt-overlay.c#remove_nodes+0x524/0x648
(XEN)[<0a208460>] dt_overlay_sysctl+0x428/0xc68
(XEN)[<0a2707f8>] arch_do_sysctl+0x1c/0x2c
(XEN)[<0a230b40>] do_sysctl+0x96c/0x9ec
(XEN)[<0a271e08>] traps.c#do_trap_hypercall+0x1e8/0x288
(XEN)[<0a273490>] do_trap_guest_sync+0x448/0x63c
(XEN)[<0a25c480>] entry.o#guest_sync_slowpath+0xa8/0xd8
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) Assertion 'rw_is_locked(&dt_host_lock)' failed at drivers/passthrough/device_tree.c:146
(XEN) 

This is because iommu_remove_dt_device() is called without taking the
dt_host_lock. dt_host_lock is meant to ensure that the DT node will not
disappear behind our back. So fix the issue by taking the lock as soon as
we get hold of overlay_node.

Fixes: 7e5c4a8b86f1 ("xen/arm: Implement device tree node removal functionalities")
Signed-off-by: Henry Wang 
---
v2:
- Take the lock as soon as getting hold of overlay_node.
v1.1:
- Move the unlock position before the check of rc.
---
 xen/common/dt-overlay.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/common/dt-overlay.c b/xen/common/dt-overlay.c
index 1b197381f6..25d15cbcb1 100644
--- a/xen/common/dt-overlay.c
+++ b/xen/common/dt-overlay.c
@@ -429,6 +429,8 @@ static int remove_nodes(const struct overlay_track *tracker)
 if ( overlay_node == NULL )
 return -EINVAL;
 
+write_lock(&dt_host_lock);
+
 rc = remove_descendant_nodes_resources(overlay_node);
 if ( rc )
 return rc;
@@ -439,8 +441,6 @@ static int remove_nodes(const struct overlay_track *tracker)
 
 dt_dprintk("Removing node: %s\n", overlay_node->full_name);
 
-write_lock(&dt_host_lock);
-
 rc = dt_overlay_remove_node(overlay_node);
 if ( rc )
 {
-- 
2.34.1




[ovmf test] 185935: all pass - PUSHED

2024-05-07 Thread osstest service owner
flight 185935 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/185935/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 1c0d4ae2c0fd24164873947c2e262c499ecf13b5
baseline version:
 ovmf c12bbc14900aa5c70eec8c0576757c2182db3d01

Last test of basis   185932  2024-05-07 02:44:06 Z0 days
Testing same since   185935  2024-05-07 05:12:56 Z0 days1 attempts


People who touched revisions under test:
  Xianglei Cai 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   c12bbc1490..1c0d4ae2c0  1c0d4ae2c0fd24164873947c2e262c499ecf13b5 -> 
xen-tested-master