Re: [PATCH] x86: amend 'n' debug-key output with SMI count

2024-01-24 Thread Jan Beulich
On 24.01.2024 17:24, Andrew Cooper wrote:
> On 24/01/2024 3:27 pm, Jan Beulich wrote:
>> ... if available only, of course.
>>
>> Signed-off-by: Jan Beulich 
>>
>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -406,9 +406,15 @@ void __init early_cpu_init(bool verbose)
>>  paddr_bits -= (ebx >> 6) & 0x3f;
>>  }
>>  
>> -if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
>> +if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) {
>> +uint64_t smi_count;
>> +
>>  park_offline_cpus = opt_mce;
>>  
>> +if (!verbose && !rdmsr_safe(MSR_SMI_COUNT, smi_count))
>> +setup_force_cpu_cap(X86_FEATURE_SMI_COUNT);
>> +}
>> +
> 
> I know you're re-using an existing condition, but I think it's more
> likely that it's Intel-only than common to VIA and Shanghai.

Then again when re-using the condition I questioned how likely it is
that people actually use Xen on CPUs of these two vendors, when the
respective code is only bit-rotting.

> Also, why is this gated on verbose?
> 
> (I think I can see why this is a rhetorical question, and I expect you can
> guess what the feedback will be.)

Hmm, no, I don't think I can guess that. The reason is simple: In
case the MSR doesn't exist, I'd like to avoid the respective (debug)
log message, emitted while recovering from the #GP, appearing twice.
(Which imo eliminates the only guess I might otherwise have: Don't
add complexity [the extra part of the condition] when it's not
needed.)

>> --- a/xen/arch/x86/nmi.c
>> +++ b/xen/arch/x86/nmi.c
>> @@ -589,9 +589,20 @@ static void cf_check do_nmi_stats(unsign
>>  unsigned int cpu;
>>  bool pend, mask;
>>  
>> -printk("CPU\tNMI\n");
>> +printk("CPU\tNMI%s\n", boot_cpu_has(X86_FEATURE_SMI_COUNT) ? "\tSMI" : "");
>>  for_each_online_cpu ( cpu )
>> -printk("%3u\t%3u\n", cpu, per_cpu(nmi_count, cpu));
>> +{
>> +printk("%3u\t%3u", cpu, per_cpu(nmi_count, cpu));
>> +if ( boot_cpu_has(X86_FEATURE_SMI_COUNT) )
>> +{
>> +unsigned int smi_count, dummy;
>> +
>> +rdmsr(MSR_SMI_COUNT, smi_count, dummy);
>> +printk("\t%3u\n", smi_count);
> 
> This reads MSR_SMI_COUNT repeatedly on the same CPU.
> 
> You'll need to IPI all CPUs to dump the count into a per-cpu variable.

Oh, how embarrassing.
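For reference, the rework being asked for might look roughly like the sketch below: snapshot the counter on every CPU via IPI first, then print. This is only an illustration (helper and variable names here are made up, not the eventual patch):

```c
/* Sketch: snapshot MSR_SMI_COUNT on each CPU before printing, instead of
 * re-reading the MSR repeatedly on whichever CPU runs the key handler. */
static DEFINE_PER_CPU(unsigned int, smi_count);

static void cf_check read_smi_count(void *unused)
{
    uint64_t val;

    rdmsrl(MSR_SMI_COUNT, val);
    this_cpu(smi_count) = val;
}

/* In do_nmi_stats(): */
if ( boot_cpu_has(X86_FEATURE_SMI_COUNT) )
    on_each_cpu(read_smi_count, NULL, 1);  /* IPI all CPUs and wait */

for_each_online_cpu ( cpu )
    printk("%3u\t%3u\t%3u\n", cpu,
           per_cpu(nmi_count, cpu), per_cpu(smi_count, cpu));
```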

Jan



Re: [PATCH] x86/ucode: Fix stability of the Raw CPU Policy rescan

2024-01-24 Thread Jan Beulich
On 24.01.2024 18:23, Andrew Cooper wrote:
> On 24/01/2024 3:37 pm, Jan Beulich wrote:
>> On 23.01.2024 21:59, Andrew Cooper wrote:
>>> Always run microcode_update_helper() on the BSP, so the updated Raw CPU
>>> policy doesn't get non-BSP topology details included.
>> Wouldn't it be better (and consistent with ...
>>
>>> Have calculate_raw_cpu_policy() clear the instantaneous XSTATE sizes.  The
>>> value XCR0 | MSR_XSS had when we scanned the policy isn't terribly
>>> interesting to report.
>> ... this) to purge these details from the raw policy as well then?
> 
> I did spend some time considering this.
> 
> Rerunning on the same CPU is more resilient to new topology leaves, so
> we'd want to be doing that irrespective.

I'm afraid I don't understand this: If a ucode update surfaced new leaves,
they surely would appear on all CPUs? IOW my question still stands: Wouldn't
we better zap topology data from the raw policy (thus also not propagating
it into other policies)? At which point retrieval becomes independent of
what CPU it is run on (if there were any other CPU-specific pieces of data,
similar zapping should happen for them).
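A zapping step of the kind suggested might look something like this. The field names are rough approximations of Xen's cpu_policy structures, not a concrete patch:

```c
/* Sketch: drop instance-specific data from the Raw policy so a rescan
 * yields the same result regardless of which CPU performs it. */
static void zap_cpu_specific(struct cpu_policy *p)
{
    /* Initial APIC ID (leaf 1 EBX[31:24]) is per-CPU by nature. */
    p->basic.apic_id = 0;

    /* Extended topology enumeration (leaf 0xb) likewise. */
    memset(p->topo.raw, 0, sizeof(p->topo.raw));

    /* Instantaneous XSTATE sizes depend on the current XCR0 | MSR_XSS. */
    p->xstate.raw[0].b = 0;
}
```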

Surely using CPU0 here isn't much of a problem, as this is a pretty
infrequent event. But generally I'd like to avoid "preferring" CPU0 as much
as possible. Hence I'd prefer if even in cases like this one we could avoid
it.

> The XCR0/XSS state really is transient, and the useful information is
> everywhere else in leaf 0xd.

Sure, but this is still independent of what CPU the retrieval is run on.

Jan



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Jan Beulich
On 24.01.2024 18:51, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
 On 23.01.2024 16:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>  {
>>>  int irq, pirq, ret;
>>>  
>>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>  
>>>  case MAP_PIRQ_TYPE_MSI:
>>>  case MAP_PIRQ_TYPE_MULTI_MSI:
>>> +pcidevs_lock();
>>>  ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> +pcidevs_unlock();
>>>  break;
>>
>
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.

 Yet why would we prefer to acquire a global lock when a per-domain one
 suffices?
>>>
>>> I was hoping to introduce fewer changes, especially if they are not
>>> strictly required, as it's less risk.  I'm always quite wary of
>>> locking changes.
>>
>> In which case more description / code commenting is needed. The pattern
>> of the assertions looks dangerous.
> 
> Is the perception of danger because you fear some of the pcidevs
> lock usage might be there not just for preventing the pdev from going
> away, but also to guarantee exclusive access to certain state?

Indeed. In my view the main purpose of locks is to guard state. Their
use here to guard against devices going away is imo rather an abuse; as
mentioned before, this should instead be achieved e.g. via refcounting.
And it's bad enough already that pcidevs_lock() alone has been abused
this way, without proper marking (leaving us to guess in many places).
It gets worse when a second lock can now also serve this same purpose.

>> Even if (as you say in a later reply)
>> this is only temporary, we all know how long "temporary" can be. It
>> might even be advisable to introduce a helper construct.
> 
> The aim here was to modify as little as possible, in order to avoid
> having to analyze all possible users of pcidevs lock, and thus not
> block the vPCI work on the probably lengthy and difficult analysis.
> 
> Not sure adding a construct makes it much better, as I didn't want to
> give the impression all checks for the pcidevs lock can merely be
> replaced by the new construct.

Of course such a construct could only be used in places where it can
be shown to be appropriate.

Jan



Re: [RFC KERNEL PATCH v4 3/3] PCI/sysfs: Add gsi sysfs for pci_dev

2024-01-24 Thread Chen, Jiqian
On 2024/1/24 00:02, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 10:13:52AM +, Chen, Jiqian wrote:
>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
 There is a need for some scenarios to use gsi sysfs.
 For example, when Xen passes a device through to a domU, it will
 use the GSI to map a pirq, but currently userspace can't get the GSI
 number.
 So, add gsi sysfs for that and for other potential scenarios.
>> ...
> 
>>> I don't know enough about Xen to know why it needs the GSI in
>>> userspace.  Is this passthrough brand new functionality that can't be
>>> done today because we don't expose the GSI yet?
>>
>> In Xen architecture, there is a privileged domain named Dom0 that
>> has ACPI support and is responsible for detecting and controlling
>> the hardware, also it performs privileged operations such as the
>> creation of normal (unprivileged) domains DomUs. When we give to a
>> DomU direct access to a device, we need also to route the physical
>> interrupts to the DomU. In order to do so Xen needs to setup and map
>> the interrupts appropriately.
> 
> What kernel interfaces are used for this setup and mapping?
For passthrough devices, the setup and mapping that route physical interrupts
to a DomU are done on the Xen hypervisor side; the hypervisor only needs
userspace to provide the GSI info. See the Xen code: xc_physdev_map_pirq
requires the GSI, then issues a hypercall to pass it into the hypervisor,
which does the mapping and routing; the kernel doesn't do the setup and
mapping itself.
For devices on a PVH Dom0, Dom0 sets up interrupts for devices as the
bare-metal Linux kernel does, via acpi_pci_irq_enable -> acpi_register_gsi ->
__acpi_register_gsi -> acpi_register_gsi_ioapic.
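For context, a sysfs attribute of the kind proposed could be sketched along these lines, loosely following drivers/pci/pci-sysfs.c conventions. A `gsi` field on struct pci_dev is assumed here to be what the series adds; this is an illustration, not the submitted patch:

```c
static ssize_t gsi_show(struct device *dev, struct device_attribute *attr,
                        char *buf)
{
    struct pci_dev *pdev = to_pci_dev(dev);

    /* Expose the GSI so userspace (e.g. Xen toolstack) can map a pirq. */
    return sysfs_emit(buf, "%u\n", pdev->gsi);
}
static DEVICE_ATTR_RO(gsi);
```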

-- 
Best regards,
Jiqian Chen.


[xen-unstable test] 184449: tolerable FAIL - PUSHED

2024-01-24 Thread osstest service owner
flight 184449 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184449/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 184434
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 184434
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 184434
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 184434
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 184434
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 184434
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 184434
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 184434
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 184434
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 184434
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 184434
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 184434
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  98ae35cab0e40e59963f9f58345bf378b9522d74
baseline version:
 xen  f67bddf3bccd99a5fee968c3b3f288db6a57d3be

Last test of basis   184434  2024-01-23 14:39:00 Z1 days
Testing same since   184449  2024-01-24 08:52:12 Z0 days1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Jan Beulich 
  Oleksii Kurochko 


Re: [PATCH] consolidate do_bug_frame() / bug_fn_t

2024-01-24 Thread Andrew Cooper
On 24/01/2024 7:28 am, Jan Beulich wrote:
> On 24.01.2024 02:34, Stefano Stabellini wrote:
>> I managed to get back to read the mailing list and noticed this patch.
>>
>> Is it still relevant and needs to be reviewed?
>>
>> Are there any outstanding disagreements between maintainers on the
>> approach to take here?  Or should I just go ahead and review it?
> It is still relevant from my pov, and everything that may be controversial
> is said ...

BUGFRAME_* cannot legitimately modify the interrupted context.  Two are
fatal paths, and other two are side-effect-less as far as C can tell.

So the infrastructure ought to take a const pointer.

The reason why this pointer is non-const is to do with the interaction
of the serial and keyhandler infrastructures.  Because we're adjusting
that for other reasons, I was hoping it would subsequently be easy to
switch Xen to being properly const in this regard.

Turns out it is:

 
https://gitlab.com/xen-project/people/andyhhp/xen/-/commit/4f857075005da1d28632e4f9198c2e7d0f404b9a

with a couple of caveats.  (Only the buster-gcc-ibt run failed, so I've
got some cf_check-ing to adjust, but all the other builds worked fine).


To make the serial code compile, I ended up having to revert patch 2 of
the regs series, which I believe is safe to do following patches 3-5, which
un-plumb the regs pointer deeper in the call chain.  If this turns
out to be true, then the patch ought to be added and reverted in the
same series so it isn't left hanging about after the fact.

The _$X_poll() functions are used in timer context, which means there's
an outer regs context already latched, and that's arguably a better
context to use anyway for 'd'.

This in turn allows us to remove a #UD from a fast(ish) path, and remove
some per-cpu or static variables which are just used for non-standard
parameter passing because run_in_exception_handler() doesn't let you
pass any.
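The pattern being removed is the classic workaround for run_in_exception_handler() taking no argument. Roughly (names here are hypothetical, for illustration only):

```c
/* Non-standard parameter passing: stash the "argument" in a static,
 * then take the #UD so the handler can read it back. */
static unsigned char stashed_key;

static void cf_check dump_handler(struct cpu_user_regs *regs)
{
    handle_key(stashed_key, regs);     /* recover the smuggled parameter */
}

static void trigger(unsigned char key)
{
    stashed_key = key;                 /* parameter via static variable */
    run_in_exception_handler(dump_handler); /* invoked via #UD */
}
```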


This leaves the '%' debugger infrastructure.  Being a debugger, it's
making arbitrary changes anyway and I'd much rather cast away constness
for a debugger, than to keep everything else mutable when it oughtn't to be.

If absolutely nothing else, registration and handling '%' ought to be
from x86 code rather than common code, which would remove the
do_debugger_trap_fatal() layering violation.

But, the more I look into the gdbstub the more I'm convinced that it
doesn't work.  For example, this gem:

/* Resuming after we've stopped used to work, but more through luck than
any actual intention.  It doesn't at the moment. */

From c/s b69f92f3012 in July 2004, and more specifically the commit
which added the gdbstub functionality to begin with.  I.e. it doesn't
appear to have ever supported more than "poke around in the crashed
state".  In the 2 decades that no one has fixed this, we've gained far
better technologies for doing this, such as running it in a VM.

I am going to submit some patches deleting gdbstub.  It clearly had not
much value to begin with, and is definitely not worth the problems
it is creating in adjacent code these days.

~Andrew



Re: ACPI VFCT table too short on PVH dom0 (was: Re: E820 memory allocation issue on Threadripper platforms)

2024-01-24 Thread Patrick Plenefisch
On Wed, Jan 24, 2024 at 3:23 AM Roger Pau Monné 
wrote:

> > >
> > >>
> > >> > I had been working with ASRock support ever since I got my brand
> > >> > new motherboard because I couldn't see the BIOS/UEFI screens. I
> could
> > >> boot
> > >> > up, and once native linux took control amdgpu got the screens/gpu
> > >> working
> > >> > fine. I finally managed to be able to see the UEFI/BIOS setup
> screens by
> > >> > patching my VBIOS: I extracted the VBIOS image of a cheap R5 430
> OEM,
> > >> ran
> > >> > GOPupd to update the VBIOS UEFI component (GOP) from version 1.60 to
> > >> 1.67.
> > >> > That allowed the UEFI to actually initialize and use a screen.
> However,
> > >> I
> > >> > later realized that only 1 monitor was lighting up in the bios: my
> > >> monitor
> > >> > plugged into the Radeon RX 480 that was still on VBIOS GOP 1.60. It
> > >> appears
> > >> > the GOP was initializing the RX 480 too, despite not being flashed
> with
> > >> the
> > >> > latest itself. I am working on an email to asrock support about
> that.
> > >> Once
> > >> > I get into linux (native or PV), both monitors light up as expected.
> > >> Also,
> > >> > If I boot linux PVH from grub, they also work.
> > >>
> > >> OK, that's good, so that would be UEFI -> grub -> Xen -> Linux PVH?
> > >>
> > >
> > > Correct. Inserting grub into the chain "fixes" the acpi tables and
> things
> > > work correctly.
> > >
> >
> > Ok, I am not sure what I did the other day to get it to work, but I can't
> > replicate *any* PVH success today. One driver (radeon or amdgpu) always
> > complains the VFCT table is wrong, and leads to the symptoms previously
> > reported.
>
> Are you sure you are using Xen 4.18?  Some of the Xen logs you
> provided did use Xen 4.17, which doesn't have the VFCT fix.
>
> Can you please provide the `xl dmesg` for the non-working case when
> using grub2?
>

Yes, I expected it to fail on 4.17, but failing on 4.18 surprised me. I
tried this build with several kernels, with similar results.

Interestingly, with xen 4.18 + Debian 12 + Linux 6.1 (recompiled with the
working start address), I get the following graphics modes:
Grub: old low resolution with matching low-resolution col×row count
Xen font, Xen boot: high resolution with low-resolution col×row count
Xen font, kernel early boot: high resolution with low-resolution col×row count
Linux font, kernel later boot: high resolution with high-resolution col×row count
At ~18s in the kernel logs, no graphics anymore

I have attached the log of serial output (xen+linux) after grub


>
> Thanks, Roger.
>
Xen 4.18.0
(XEN) Xen version 4.18.0 (patrick@) (gcc (Debian 10.2.1-6) 10.2.1 20210110) 
debug=n Wed Jan 17 18:31:53 EST 2024
(XEN) Latest ChangeSet: 
(XEN) build-id: e68e82847550e40a25eea66ea8a2e1d865385a1d
(XEN) Console output is synchronous.
(XEN) Bootloader: GRUB 2.06-13+deb12u1
(XEN) Command line: placeholder dom0=pvh dom0_mem=8G,max:8G loglvl=all 
e820-verbose=true console_to_ring sync_console guest_loglvl=all com1=115200,8n1 
console=com1,vga no-real-mode edd=off
(XEN) Xen image load base address: 0x9f60
(XEN) Video information:
(XEN)  VGA is graphics mode 2560x1440, 32 bpp
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
(XEN) Disc information:
(XEN)  Found 0 MBR signatures
(XEN)  Found 6 EDD information structures
(XEN) CPU Vendor: AMD, Family 25 (0x19), Model 24 (0x18), Stepping 1 (raw 
00a10f81)
(XEN) Enabling Supervisor Shadow Stacks
(XEN) Initial EFI RAM map:
(XEN)  [, 0009] (usable)
(XEN)  [0010, 03ff] (usable)
(XEN)  [0400, 04045fff] (ACPI NVS)
(XEN)  [04046000, 09afefff] (usable)
(XEN)  [09aff000, 09ff] (reserved)
(XEN)  [0a00, 0aff] (usable)
(XEN)  [0b00, 0b020fff] (reserved)
(XEN)  [0b021000, a04e8fff] (usable)
(XEN)  [a04e9000, a64e8fff] (reserved)
(XEN)  [a64e9000, a747efff] (ACPI data)
(XEN)  [a747f000, a947efff] (ACPI NVS)
(XEN)  [a947f000, addfefff] (reserved)
(XEN)  [addff000, afffafff] (usable)
(XEN)  [afffb000, afff] (reserved)
(XEN)  [0001, 00403dbb] (usable)
(XEN)  [000a, 000f] (reserved)
(XEN)  [b000, bfff] (reserved)
(XEN)  [df20, df2f] (reserved)
(XEN)  [e000, efff] (reserved)
(XEN)  [fea0, feaf] (reserved)
(XEN)  [fec0, fec00fff] (reserved)
(XEN)  [fec1, fec10fff] (reserved)
(XEN)  [fec3, fec30fff] (reserved)
(XEN)  [fed0, fed00fff] (reserved)
(XEN)  [fed4, fed44fff] (reserved)
(XEN)  [fed8, fed8] (reserved)
(XEN)  [ff00, ] (reserved)
(XEN)  [00403dbc, 00403fff] (reser

Re: [PATCH 2/2] xen/dm: arm: Introduce inject_msi2 DM op

2024-01-24 Thread Stefano Stabellini
On Wed, 24 Jan 2024, Andrew Cooper wrote:
> On 24/01/2024 1:07 am, Stefano Stabellini wrote:
> >> I agree with Jan that we could reuse
> >> xendevicemodel_inject_msi by assuming that PCI BDF zero is invalid.
> 
> I've already explained why that will break future x86.  (See the other
> thread off this patch for specifics).
> 
> When we add vIOMMUs to guests, we *must* have correct source ids.  0 is
> ambiguous, so not safe to reuse.
> 
> 
> Allocating a new subop now is the only way to not shoot in the foot later.

OK, what you wrote makes sense to me.

Re: Xen 4.18rc/ARM64 on Raspberry Pi 4B: Traffic in DomU crashing Dom0 when using VLANs

2024-01-24 Thread Andrew Cooper
On 23/01/2024 8:29 am, Bertrand Marquis wrote:
>> Thanks for raising the visibility of this.  I'm not familiar with ARM,
>> but if I were investigating this I'd try to figure out what the
>> "unhandled" error messages are.  "gnttab_mark_dirty not implemented
>> yet" looks pretty sus too, and also sounds like it might be something
>> ARM-specific.
> I tried to explain those and they are not the reason for the problem.

The "gnttab_mark_dirty not implemented yet" printk() should be deleted.

It's /* TODO - logdirty support */ in a form that is actively unhelpful
to people using a debug hypervisor to debug other things.

~Andrew



Re: Xen 4.18rc/ARM64 on Raspberry Pi 4B: Traffic in DomU crashing Dom0 when using VLANs

2024-01-24 Thread Paul Leiber

Elliot, Bertrand, George: Thanks for your replies.

Am 23.01.2024 um 09:29 schrieb Bertrand Marquis:

Hi,

I will try to explain some of the messages hereafter, but I am not sure of the
reason for the crash, so I will just set some pointers...


On 22 Jan 2024, at 11:46, George Dunlap  wrote:

On Fri, Jan 19, 2024 at 8:32 PM Elliott Mitchell  wrote:


On Sun, Jan 14, 2024 at 10:54:24PM +0100, Paul Leiber wrote:


Am 22.10.2023 um 07:42 schrieb Paul Leiber:

Am 13.10.2023 um 20:56 schrieb Paul Leiber:

Hi Xen developers list,

TL;DR:
--

Causing certain web server traffic on a secondary VLAN on Raspberry Pi
under vanilla Debian/UEFI in combination with Xen leads to complete
system reboot (watchdog triggering for Dom0). Other strange things are
happening.

Description:
--

I recently set up Xen (self compiled, Version 4.18-rc) on a Raspberry
Pi 4B (on vanilla Debian Bookworm, UEFI boot mode). Until some time
ago, everything worked well with Dom0, one DomU and one bridge.

Then I wanted to actually make use of the virtualization and started
to set up a second Debian Bookworm DomU (using xen-create-image) for
monitoring my systems with zabbix (a webserver based system monitoring
solution). The bridge used for this setup was the device bridging the
hardware NIC. I installed zabbix, set it up, and everything went well,
I could access the web interface without any problem.

Then I set up VLANs (initally using VLAN numbers 1 and 2) to separate
network traffic between the DomUs. I made the existing device bridge
VLAN 1 (bridge 1) and created a secondary device for bridging VLAN 2
(bridge 2). Using only bridge 1 / VLAN 1 everything works well, I can
access the zabbix web interface without any noticeable issue. After
switching the zabbix DomU to VLAN 2 / bridge 2, everything seemingly
keeps on working well, I can ping different devices in my network from
the zabbix DomU and vice versa, I can ssh into the machine.
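[Editorially, the setup described above corresponds roughly to an /etc/network/interfaces fragment like the following. Interface names and VLAN numbers are taken from this report; the exact option set is an assumption for reproduction purposes:]

```
auto enabcm6e4ei0.10
iface enabcm6e4ei0.10 inet manual

auto xenbr0
iface xenbr0 inet manual
    # bridge 1, carrying VLAN 10
    bridge_ports enabcm6e4ei0.10
    bridge_stp off
    bridge_fd 0

auto enabcm6e4ei0.20
iface enabcm6e4ei0.20 inet manual

auto xenbr1
iface xenbr1 inet manual
    # bridge 2, carrying VLAN 20
    bridge_ports enabcm6e4ei0.20
    bridge_stp off
    bridge_fd 0
```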

However, as soon as I remotely access the zabbix web interface, the
complete system (DomUs and Dom0) becomes unresponsive and reboots
after some time (usually seconds, sometimes 1-2 minutes). The reboot
is reliably reproducable.

I didn't see any error message in any log (zabbix, DomU syslog, Dom0
syslog) except for the following lines immediately before the system
reboots on the Xen serial console:

(XEN) Watchdog timer fired for domain 0
(XEN) Hardware Dom0 shutdown: watchdog rebooting machine

As soon as I change the bridge to bridge 1 (with or without VLAN
setup), the web interface is accessible again after booting the zabbix
DomU, no reboots.

So I assume that causing specific traffic on the virtual NIC when
using a VLAN setup with more than one VLAN under Xen makes the Dom0
system hard crash. Of course, there might be other causes that I'm not
aware of, but to me, this seems to be the most likely explanation
right now.

What I tried:
-

1. I changed the VLAN numbers. First to 101, 102, 103 etc. This was
when I noticed another strange thing: VLANs with numbers >99 simply
don't work on my Raspberry Pi under Debian, with or without Xen. VLAN
99 works, VLAN 100 (or everything else >99 that I tried) doesn't work.
If I choose a number >99, the VLAN is not configured, "ip a" doesn't
list it. Other Debian systems on x64 architecture don't show this
behavior, there, it was no problem to set up VLANs > 99. Therefore,
I've changed the VLANs to 10, 20, 30 etc., which worked. But it didn't
solve the initial problem of the crashing Dom0 and DomUs.

2. Different bridge options, without noticable effect:
bridge_stp off  # dont use STP (spanning tree proto)
bridge_waitport 0   # dont wait for port to be available
bridge_fd 0 # no forward delay

3. Removing IPv6: No noticable effect.

4. Network traffic analysis: Now, here it becomes _really_ strange. I
started tcpdumps on Dom0, and depending on on which interface/bridge
traffic was logged, the problem went away, meaning, the DomU was
running smoothly for hours, even when accessing the zabbix web
interface. Stopping the log makes the system crash (as above, after
seconds up to 1-2 minutes) reproducably if I access the zabbix web
interface.

Logging enabcm6e4ei0 (NIC): no crashes
Logging enabcm6e4ei0.10 (VLAN 10): instant crash
Logging enabcm6e4ei0.20 (VLAN 20): no crashes
Logging xenbr0 (on VLAN 10): instant crash
Logging xenbr1 (on VLAN 20): no crashes

I am clinging to the thought that there must be a rational explanation
for why logging the traffic on certain interfaces/bridges should avoid
the crash of the complete system, while logging other
interfaces/bridges doesn't. I myself can't think of one.

I checked the dumps of enabcm6e4ei0.10 and xenbr0 (where the system
crashes) with wireshark, nothing sticks out to me (but I am really no
expert in analyzing network traffic). Dumps can be provided.

5. Watchdog: I tried to dig deeper into the cause for the watchdog
triggering. However, I didn't find any useful documentation on the we

[PATCH v2 1/2] pmstat: Limit hypercalls under HWP

2024-01-24 Thread Jason Andryuk
When HWP is active, the cpufreq P-state information is not updated.  In
that case, return -EOPNOTSUPP instead of bogus, incomplete info.

Similarly, set_cpufreq_para() is not applicable when HWP is active.
Many of the options already checked the governor and were inaccessible,
but SCALING_MIN/MAX_FREQ was still accessible (though it would do
nothing).  Add an earlier HWP check to handle all cases.

Signed-off-by: Jason Andryuk 
---
v2:
Use -EOPNOTSUPP in both places

 xen/drivers/acpi/pmstat.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 85097d463c..86588ddd42 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -66,6 +66,8 @@ int do_get_pm_info(struct xen_sysctl_get_pmstat *op)
 return -ENODEV;
 if ( !cpufreq_driver.init )
 return -ENODEV;
+if ( hwp_active() )
+return -EOPNOTSUPP;
 if ( !pmpt || !(pmpt->perf.init & XEN_PX_INIT) )
 return -EINVAL;
 break;
@@ -329,6 +331,9 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
 if ( !policy || !policy->governor )
 return -EINVAL;
 
+if ( hwp_active() )
+return -EOPNOTSUPP;
+
 switch(op->u.set_para.ctrl_type)
 {
 case SCALING_MAX_FREQ:
-- 
2.43.0




[PATCH v2 2/2] xenpm: Print message for disabled commands

2024-01-24 Thread Jason Andryuk
xenpm get-cpufreq-states currently just prints no output when cpufreq is
disabled or HWP is running.  Have it print an appropriate message.  The
cpufreq disabled one mirrors the cpuidle disabled one.

cpufreq disabled:
$ xenpm get-cpufreq-states
Either Xen cpufreq is disabled or no valid information is registered!

Under HWP:
$ xenpm get-cpufreq-states
P-State information not supported.  Try get-cpufreq-average or start.

Also allow xenpm to handle EOPNOTSUPP from the pmstat hypercalls.
EOPNOTSUPP is returned when HWP is active in some cases and allows the
differentiation from cpufreq being disabled.

Signed-off-by: Jason Andryuk 
---
v2:
New

 tools/misc/xenpm.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index d982482a3f..79c618590b 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -362,7 +362,15 @@ static int show_pxstat_by_cpuid(xc_interface *xc_handle, int cpuid)
 
 ret = get_pxstat_by_cpuid(xc_handle, cpuid, &pxstatinfo);
 if ( ret )
+{
+if ( ret == -ENODEV )
+fprintf(stderr,
+"Either Xen cpufreq is disabled or no valid information is registered!\n");
+   else if ( ret == -EOPNOTSUPP )
+fprintf(stderr,
+"P-State information not supported.  Try get-cpufreq-average or start.\n");
 return ret;
+}
 
 print_pxstat(cpuid, &pxstatinfo);
 
@@ -382,9 +390,11 @@ void pxstat_func(int argc, char *argv[])
 {
 /* show pxstates on all cpus */
 int i;
-for ( i = 0; i < max_cpu_nr; i++ )
-if ( show_pxstat_by_cpuid(xc_handle, i) == -ENODEV )
+for ( i = 0; i < max_cpu_nr; i++ ) {
+int ret = show_pxstat_by_cpuid(xc_handle, i);
+if ( ret == -ENODEV || ret == -EOPNOTSUPP )
 break;
+   }
 }
 else
 show_pxstat_by_cpuid(xc_handle, cpuid);
@@ -432,7 +442,7 @@ static uint64_t *sum, *sum_cx, *sum_px;
 
 static void signal_int_handler(int signo)
 {
-int i, j, k;
+int i, j, k, ret;
 struct timeval tv;
 int cx_cap = 0, px_cap = 0;
 xc_cputopo_t *cputopo = NULL;
@@ -473,7 +483,8 @@ static void signal_int_handler(int signo)
 }
 }
 
-if ( get_pxstat_by_cpuid(xc_handle, 0, NULL) != -ENODEV )
+ret = get_pxstat_by_cpuid(xc_handle, 0, NULL);
+if ( ret != -ENODEV && ret != -EOPNOTSUPP )
 {
 px_cap = 1;
 for ( i = 0; i < max_cpu_nr; i++ )
-- 
2.43.0




[PATCH v2 0/2] Improve xenpm output under HWP

2024-01-24 Thread Jason Andryuk
Have xen return -EOPNOTSUPP for more pmstat hypercalls instead of
returning misleading data under HWP.  Enhance xenpm to handle
-EOPNOTSUPP and provide more informative messages.

Jason Andryuk (2):
  pmstat: Limit hypercalls under HWP
  xenpm: Print message for disabled commands

 tools/misc/xenpm.c| 19 +++
 xen/drivers/acpi/pmstat.c |  5 +
 2 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.43.0




Re: [PATCH 2/2] xen/dm: arm: Introduce inject_msi2 DM op

2024-01-24 Thread Andrew Cooper
On 24/01/2024 1:07 am, Stefano Stabellini wrote:
>> I agree with Jan that we could reuse
>> xendevicemodel_inject_msi by assuming that PCI BDF zero is invalid.

I've already explained why that will break future x86.  (See the other
thread off this patch for specifics).

When we add vIOMMUs to guests, we *must* have correct source ids.  0 is
ambiguous, so not safe to reuse.


Allocating a new subop now is the only way to avoid shooting ourselves in the foot later.

~Andrew



Re: [PATCH 2/2] xen/dm: arm: Introduce inject_msi2 DM op

2024-01-24 Thread Andrew Cooper
On 14/01/2024 10:01 am, Mykyta Poturai wrote:
> diff --git a/tools/include/xendevicemodel.h b/tools/include/xendevicemodel.h
> index 797e0c6b29..4833e55bce 100644
> --- a/tools/include/xendevicemodel.h
> +++ b/tools/include/xendevicemodel.h
> @@ -236,6 +236,20 @@ int xendevicemodel_inject_msi(
>  xendevicemodel_handle *dmod, domid_t domid, uint64_t msi_addr,
>  uint32_t msi_data);
>  
> +/**
> + * This function injects an MSI into a guest.
> + *
> + * @parm dmod a handle to an open devicemodel interface.
> + * @parm domid the domain id to be serviced
> + * @parm msi_addr the MSI address (0xfeex)

This 0xfeex is an x86-ism which I doubt is correct for ARM.  I'd
suggest dropping it from the comment.

> + * @parm source_id the PCI SBDF of the source device
> + * @parm msi_data the MSI data
> + * @return 0 on success, -1 on failure.
> +*/
> +int xendevicemodel_inject_msi2(
> +xendevicemodel_handle *dmod, domid_t domid, uint64_t msi_addr, uint32_t source_id,
> +uint32_t msi_data, unsigned int source_id_valid);

The Source ID is always valid when making this call.  It is only within
the hypervisor itself that we may need to worry about the source ID
being invalid.

This means you don't have the flags field, and as a consequence, there's
no padding to worry about.

Also, the msi_ prefix to address and data are redundant.  Either drop
them, or put a prefix on source_id too, because we shouldn't be
inconsistent here.

> +
>  /**
>   * This function enables tracking of changes in the VRAM area.
>   *
> diff --git a/tools/libs/devicemodel/core.c b/tools/libs/devicemodel/core.c
> index 8e619eeb0a..17ad00c5d9 100644
> --- a/tools/libs/devicemodel/core.c
> +++ b/tools/libs/devicemodel/core.c
> @@ -448,6 +448,28 @@ int xendevicemodel_set_irq_level(
>  return xendevicemodel_op(dmod, domid, 1, &op, sizeof(op));
>  }
>  
> +int xendevicemodel_inject_msi2(
> +xendevicemodel_handle *dmod, domid_t domid, uint64_t msi_addr, uint32_t source_id,
> +uint32_t msi_data, unsigned int source_id_valid)
> +{
> +struct xen_dm_op op;
> +struct xen_dm_op_inject_msi2 *data;
> +
> +memset(&op, 0, sizeof(op));
> +
> +op.op = XEN_DMOP_inject_msi2;
> +data = &op.u.inject_msi2;
> +
> +data->addr = msi_addr;
> +data->data = msi_data;
> +if ( source_id_valid ) {
> +data->source_id = source_id;
> +data->flags = XEN_DMOP_MSI_SOURCE_ID_VALID;
> +}
> +
> +return xendevicemodel_op(dmod, domid, 1, &op, sizeof(op));

{
    struct xen_dm_op op = {
        .op = XEN_DMOP_inject_msi2,
        .u.inject_msi2 = {
            .addr = msi_addr,
            .data = msi_data,
            .source_id = source_id,
        },
    };

    return xendevicemodel_op(dmod, domid, 1, &op, sizeof(op));
}


> diff --git a/xen/arch/x86/hvm/dm.c b/xen/arch/x86/hvm/dm.c
> index 462691f91d..a4a0e3dff9 100644
> --- a/xen/arch/x86/hvm/dm.c
> +++ b/xen/arch/x86/hvm/dm.c
> @@ -539,6 +540,18 @@ int dm_op(const struct dmop_args *op_args)
>  break;
>  }
>  
> +case XEN_DMOP_inject_msi2:
> +{
> +const struct xen_dm_op_inject_msi2 *data =
> +&op.u.inject_msi2;
> +
> +if ( !(data->flags & XEN_DMOP_MSI_SOURCE_ID_VALID) )
> +printk(XENLOG_WARNING "XEN_DMOP_inject_msi2: source_id is ignored\n");
> +
> +rc = hvm_inject_msi(d, data->addr, data->data);

You need a prep patch adding a source id parameter into hvm_inject_msi().

The XEN_DMOP_inject_msi case can probably pass in 0 in the short term,
and it can probably be discarded internally.

As I said before, the source id doesn't matter until we start trying to
expose vIOMMUs to guests, at which point I suspect the easiest option
will simply be to reject XEN_DMOP_inject_msi against a domain with a
vIOMMU and force Qemu/whatever in dom0 to use XEN_DMOP_inject_msi2.

~Andrew



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Stewart Hildebrand
On 1/24/24 03:21, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
>> On 1/23/24 09:29, Jan Beulich wrote:
>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
 @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
  {
  struct msi_desc *old_desc;
  
 -ASSERT(pcidevs_locked());
 -
  if ( !pdev || !pdev->msix )
  return -ENODEV;
  
 +ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
 +
  if ( msi->entry_nr >= pdev->msix->nr_entries )
  return -EINVAL;
>>>
>>> Further looking at this - is dereferencing pdev actually safe without
>>> holding the global lock?
> 
> It is safe because either the global pcidevs lock or the per-domain
> pci_lock will be held, which should prevent the device from being
> removed.
> 
>> Are you referring to the new placement of the ASSERT, which opens up the 
>> possibility that pdev could be dereferenced and the function return before 
>> the ASSERT? If that is what you mean, I see your point. The ASSERT was 
>> placed there simply because we wanted to check that pdev != NULL first. See 
>> prior discussion at [1]. Hmm.. How about splitting the pdev-checking 
>> condition? E.g.:
>>
>> if ( !pdev )
>> return -ENODEV;
>>
>> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>
>> if ( !pdev->msix )
>> return -ENODEV;
> 
> I'm not specially worried about the position of the assert, those are
> just debug messages at the end.
> 
> One worry I have after further looking at the code, when called from
> ns16550_init_postirq(), does the device have pdev->domain set?
> 
> That case would satisfy the first condition of the assert, so won't
> attempt to dereference pdev->domain, but still would be good to ensure
> consistency here wrt the state of pdev->domain.

Indeed. How about this?

if ( !pdev )
return -ENODEV;

ASSERT(pcidevs_locked() ||
   (pdev->domain && rw_is_locked(&pdev->domain->pci_lock)));

if ( !pdev->msix )
return -ENODEV;

And similarly in __pci_enable_msi(), without the !pdev->msix check. And 
similarly in pci_enable_msi(), which then should also gain its own if ( !pdev ) 
return -ENODEV; check.



[libvirt test] 184445: tolerable all pass - PUSHED

2024-01-24 Thread osstest service owner
flight 184445 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184445/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 184431
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 184431
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 184431
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-arm64-arm64-libvirt-qcow2 15 saverestore-support-checkfail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass

version targeted for testing:
 libvirt  91f9a9fb4fc0d34ed8d7a869de3d9f87687c3618
baseline version:
 libvirt  642af05e3e58b90cb9d15936678f16bd7e454d26

Last test of basis   184431  2024-01-23 04:20:35 Z1 days
Testing same since   184445  2024-01-24 04:20:28 Z0 days1 attempts


People who touched revisions under test:
  Michal Privoznik 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsmpass
 test-amd64-amd64-libvirt-xsm pass
 test-arm64-arm64-libvirt-xsm pass
 test-amd64-i386-libvirt-xsm  pass
 test-amd64-amd64-libvirt pass
 test-arm64-arm64-libvirt pass
 test-armhf-armhf-libvirt pass
 test-amd64-i386-libvirt  pass
 test-amd64-amd64-libvirt-pairpass
 test-amd64-i386-libvirt-pair pass
 test-arm64-arm64-libvirt-qcow2   pass
 test-armhf-armhf-libvirt-qcow2   pass
 test-arm64-arm64-libvirt-raw pass
 test-armhf-armhf-libvirt-raw pass
 test-amd64-i386-libvirt-raw  pass
 test-amd64-amd64-libvirt-vhd pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

[ovmf test] 184453: all pass - PUSHED

2024-01-24 Thread osstest service owner
flight 184453 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184453/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf ff52277e3796c50511e544219dde0b21edf5c53e
baseline version:
 ovmf 97e1ef87300cdf01f5b21cd4c5ee1d8df6ae1f39

Last test of basis   184448  2024-01-24 08:42:37 Z0 days
Testing same since   184453  2024-01-24 16:14:24 Z0 days1 attempts


People who touched revisions under test:
  Ming Tan 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   97e1ef8730..ff52277e37  ff52277e3796c50511e544219dde0b21edf5c53e -> xen-tested-master



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Roger Pau Monné
On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> On 24.01.2024 10:24, Roger Pau Monné wrote:
> > On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>  On 15.01.2024 20:43, Stewart Hildebrand wrote:
> > @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >  {
> >  int irq, pirq, ret;
> >  
> > +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> 
>  If either lock is sufficient to hold here, ...
> 
> > --- a/xen/arch/x86/physdev.c
> > +++ b/xen/arch/x86/physdev.c
> > @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >  
> >  case MAP_PIRQ_TYPE_MSI:
> >  case MAP_PIRQ_TYPE_MULTI_MSI:
> > +pcidevs_lock();
> >  ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> > +pcidevs_unlock();
> >  break;
> 
> >>>
> >>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>> lock to keep the path unmodified, as there's no need to hold the
> >>> per-domain rwlock.
> >>
> >> Yet why would we prefer to acquire a global lock when a per-domain one
> >> suffices?
> > 
> > I was hoping to introduce fewer changes, especially if they are not
> > strictly required, as it's less risk.  I'm always quite wary of
> > locking changes.
> 
> In which case more description / code commenting is needed. The pattern
> of the assertions looks dangerous.

Is that perception of dangerousness because you fear some of the pcidevs
lock usages might be there not just to prevent the pdev from going
away, but also to guarantee exclusive access to certain state?

> Even if (as you say in a later reply)
> this is only temporary, we all know how long "temporary" can be. It
> might even be advisable to introduce a helper construct.

The aim here was to modify as little as possible, in order to avoid
having to analyze all possible users of pcidevs lock, and thus not
block the vPCI work on the probably lengthy and difficult analysis.

Not sure adding a construct makes it much better, as I didn't want to
give the impression all checks for the pcidevs lock can merely be
replaced by the new construct.

Thanks, Roger.



[PATCH v5 2/3] x86/iommu: switch hwdom IOMMU to use a rangeset

2024-01-24 Thread Roger Pau Monne
The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases.  It's also not accounting for any reserved regions
past the last RAM address.

Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.

On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:

x old
+ new
N   Min   MaxMedian   AvgStddev
x   5 2.2364154e+10  2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+   5   1025012   1033036   1026188 1028276.2 3623.1194
Difference at 95.0% confidence
-2.26214e+10 +/- 4.42931e+08
-99.9955% +/- 9.05152e-05%
(Student's t, pooled s = 3.03701e+08)

Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.

Note there's a change for HVM domains (ie: PVH dom0) that get switched to
create the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible.  Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().

No change intended in the resulting mappings that a hardware domain gets.

Signed-off-by: Roger Pau Monné 
---
Changes since v4:
 - Add default case to handle ACPI and NVS regions (which are not mapped)
   unless in the low 4GB and when inclusive mode is set.
 - Add changelog entry.
 - Dropped Jan's R-b.

Changes since v3:
 - Remove unnecessary line wraps.

Changes since v2:
 - Simplify a bit the logic related to the inclusive option, at the cost of
   making some no-op calls in some cases.

Changes since v1:
 - Split from bigger patch.
 - Remove unneeded default case.

x86/iommu: add CHANGELOG entry for hwdom setup time improvements

Signed-off-by: Roger Pau Monné 
---
 CHANGELOG.md|   2 +
 xen/drivers/passthrough/x86/iommu.c | 149 ++--
 2 files changed, 35 insertions(+), 116 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 723d06425431..3e8b996e4718 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 ### Changed
  - Changed flexible array definitions in public I/O interface headers to not
use "1" as the number of array elements.
+ - On x86:
+   - Reduce IOMMU setup time for hardware domain.
 
 ### Added
  - On x86:
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index fc5215a9dc40..c90755ff58fa 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -300,76 +300,6 @@ void iommu_identity_map_teardown(struct domain *d)
 }
 }
 
-static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
- unsigned long pfn,
- unsigned long max_pfn)
-{
-mfn_t mfn = _mfn(pfn);
-unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
-
-/*
- * Set up 1:1 mapping for dom0. Default to include only conventional RAM
- * areas and let RMRRs include needed reserved regions. When set, the
- * inclusive mapping additionally maps in every pfn up to 4GB except those
- * that fall in unusable ranges for PV Dom0.
- */
-if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-return 0;
-
-switch ( type = page_get_ram_type(mfn) )
-{
-case RAM_TYPE_UNUSABLE:
-return 0;
-
-case RAM_TYPE_CONVENTIONAL:
-if ( iommu_hwdom_strict )
-return 0;
-break;
-
-default:
-if ( type & RAM_TYPE_RESERVED )
-{
-if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-perms = 0;
-}
-else if ( is_hvm_domain(d) )
-return 0;
-else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
-perms = 0;
-}
-
-/* Check that it doesn't overlap with the Interrupt Address Range. */
-if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-return 0;
-/* ... or the IO-APIC */
-if ( has_vioapic(d) )
-{
-for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
-if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-return 0;
-}
-else if ( is_pv_domain(d) )
-{
-/*
- * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
- * ones there (also for e.g. HPET in certain cases), so it should also
- * have such established for IOMMUs.
- */
-if ( iomem_access_permitted(d, pfn, pfn) &&
- rangeset_contains_s

[PATCH v5 0/3] x86/iommu: improve setup time of hwdom IOMMU

2024-01-24 Thread Roger Pau Monne
Hello,

The aim of the series is to reduce boot time setup of IOMMU page tables
for dom0.

This patches rework the hardware domain IOMMU setup to use a rangeset
instead of iterating over all addresses up to the max RAM page.  See
patch 2/3 for performance figures.

Jan: I've kept your RB in patch 1 with the off-by-one adjustments, and
dropped it for patch 2 with the addition of the default label in the
switch over the e820 ranges.  Such a default label is required to continue
to deal equally with the ACPI and NVS ranges.

Thanks, Roger.

Roger Pau Monne (3):
  x86/iommu: remove regions not to be mapped
  x86/iommu: switch hwdom IOMMU to use a rangeset
  x86/iommu: cleanup unused functions

 CHANGELOG.md|   2 +
 xen/arch/x86/hvm/io.c   |  15 ++-
 xen/arch/x86/include/asm/hvm/io.h   |   4 +-
 xen/arch/x86/include/asm/setup.h|   2 +-
 xen/arch/x86/setup.c|  81 ++--
 xen/arch/x86/tboot.c|   2 +-
 xen/drivers/passthrough/x86/iommu.c | 196 
 7 files changed, 146 insertions(+), 156 deletions(-)

-- 
2.43.0




[PATCH v5 3/3] x86/iommu: cleanup unused functions

2024-01-24 Thread Roger Pau Monne
Remove xen_in_range() and vpci_is_mmcfg_address() now that they are unused.

Adjust comments to point to the new functions that replace the existing ones.

No functional change.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Jan Beulich 
---
Changes since v2:
 - Do remove vpci_is_mmcfg_address().
---
Can be squashed with the previous patch if desired, split as a separate patch
for clarity.
---
 xen/arch/x86/hvm/io.c |  5 ---
 xen/arch/x86/include/asm/hvm/io.h |  3 --
 xen/arch/x86/include/asm/setup.h  |  1 -
 xen/arch/x86/setup.c  | 53 ++-
 xen/arch/x86/tboot.c  |  2 +-
 5 files changed, 3 insertions(+), 61 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index a42854c52b65..06283b41c463 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -364,11 +364,6 @@ static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
 return NULL;
 }
 
-bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
-{
-return vpci_mmcfg_find(d, addr);
-}
-
int __hwdom_init vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
 {
 const struct hvm_mmcfg *mmcfg;
diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index e1e5e6fe7491..24d1b6134f02 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -153,9 +153,6 @@ int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
 /* Destroy tracked MMCFG areas. */
 void destroy_vpci_mmcfg(struct domain *d);
 
-/* Check if an address is between a MMCFG region for a domain. */
-bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
-
 /* Remove MMCFG regions from a given rangeset. */
 int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
 
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index cd07d98101d8..1ced1299c77b 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -36,7 +36,6 @@ unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 void *bootstrap_map(const module_t *mod);
 
-int xen_in_range(unsigned long mfn);
 int remove_xen_ranges(struct rangeset *r);
 
 extern uint8_t kbd_shift_flags;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index c9f65c3a70b8..8082aac303e0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1343,7 +1343,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 relocated = true;
 
 /*
- * This needs to remain in sync with xen_in_range() and the
+ * This needs to remain in sync with remove_xen_ranges() and the
  * respective reserve_e820_ram() invocation below. No need to
  * query efi_boot_mem_unused() here, though.
  */
@@ -1495,7 +1495,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 if ( using_2M_mapping() )
 efi_boot_mem_unused(NULL, NULL);
 
-/* This needs to remain in sync with xen_in_range(). */
+/* This needs to remain in sync with remove_xen_ranges(). */
 if ( efi_boot_mem_unused(&eb_start, &eb_end) )
 {
 reserve_e820_ram(&boot_e820, __pa(_stext), __pa(eb_start));
@@ -2089,55 +2089,6 @@ void arch_get_xen_caps(xen_capabilities_info_t *info)
 }
 }
 
-int __hwdom_init xen_in_range(unsigned long mfn)
-{
-paddr_t start, end;
-int i;
-
-enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
-static struct {
-paddr_t s, e;
-} xen_regions[nr_regions] __hwdom_initdata;
-
-/* initialize first time */
-if ( !xen_regions[0].s )
-{
-/* S3 resume code (and other real mode trampoline code) */
-xen_regions[region_s3].s = bootsym_phys(trampoline_start);
-xen_regions[region_s3].e = bootsym_phys(trampoline_end);
-
-/*
- * This needs to remain in sync with the uses of the same symbols in
- * - __start_xen() (above)
- * - is_xen_fixed_mfn()
- * - tboot_shutdown()
- */
-
-/* hypervisor .text + .rodata */
-xen_regions[region_ro].s = __pa(&_stext);
-xen_regions[region_ro].e = __pa(&__2M_rodata_end);
-/* hypervisor .data + .bss */
-xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
-xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
-if ( efi_boot_mem_unused(&start, &end) )
-{
-ASSERT(__pa(start) >= xen_regions[region_rw].s);
-ASSERT(__pa(end) <= xen_regions[region_rw].e);
-xen_regions[region_rw].e = __pa(start);
-xen_regions[region_bss].s = __pa(end);
-xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
-}
-}
-
-start = (paddr_t)mfn << PAGE_SHIFT;
-end = start + PAGE_SIZE;
-for ( i = 0; i < nr_regions; i++ )
-if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
-return 1;
-
-r

[PATCH v5 1/3] x86/iommu: remove regions not to be mapped

2024-01-24 Thread Roger Pau Monne
Introduce the code to remove regions not to be mapped from the rangeset
that will be used to setup the IOMMU page tables for the hardware domain.

This change also introduces two new functions: remove_xen_ranges() and
vpci_subtract_mmcfg() that copy the logic in xen_in_range() and
vpci_is_mmcfg_address() respectively and remove the ranges that would otherwise
be intercepted by the original functions.

Note that the rangeset is still not populated.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Jan Beulich 
---
Changes since v4:
 - Fix off-by-one when removing the Xen used ranges, as the rangesets are
   inclusive.

Changes since v3:
 - Remove unnecessary line wrapping.

Changes since v1:
 - Split from bigger patch.
---
 xen/arch/x86/hvm/io.c   | 16 
 xen/arch/x86/include/asm/hvm/io.h   |  3 ++
 xen/arch/x86/include/asm/setup.h|  1 +
 xen/arch/x86/setup.c| 48 +++
 xen/drivers/passthrough/x86/iommu.c | 61 +
 5 files changed, 129 insertions(+)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index d75af83ad01f..a42854c52b65 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -369,6 +369,22 @@ bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
 return vpci_mmcfg_find(d, addr);
 }
 
+int __hwdom_init vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
+{
+const struct hvm_mmcfg *mmcfg;
+
+list_for_each_entry ( mmcfg, &d->arch.hvm.mmcfg_regions, next )
+{
+int rc = rangeset_remove_range(r, PFN_DOWN(mmcfg->addr),
+   PFN_DOWN(mmcfg->addr + mmcfg->size - 1));
+
+if ( rc )
+return rc;
+}
+
+return 0;
+}
+
 static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
paddr_t addr, pci_sbdf_t *sbdf)
 {
diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index a97731657801..e1e5e6fe7491 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -156,6 +156,9 @@ void destroy_vpci_mmcfg(struct domain *d);
 /* Check if an address is between a MMCFG region for a domain. */
 bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
 
+/* Remove MMCFG regions from a given rangeset. */
+int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 9a460e4db8f4..cd07d98101d8 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -37,6 +37,7 @@ void discard_initial_images(void);
 void *bootstrap_map(const module_t *mod);
 
 int xen_in_range(unsigned long mfn);
+int remove_xen_ranges(struct rangeset *r);
 
 extern uint8_t kbd_shift_flags;
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 897b7e92082e..c9f65c3a70b8 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -2138,6 +2138,54 @@ int __hwdom_init xen_in_range(unsigned long mfn)
 return 0;
 }
 
+int __hwdom_init remove_xen_ranges(struct rangeset *r)
+{
+paddr_t start, end;
+int rc;
+
+/* S3 resume code (and other real mode trampoline code) */
+rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
+   PFN_DOWN(bootsym_phys(trampoline_end)) - 1);
+if ( rc )
+return rc;
+
+/*
+ * This needs to remain in sync with the uses of the same symbols in
+ * - __start_xen()
+ * - is_xen_fixed_mfn()
+ * - tboot_shutdown()
+ */
+/* hypervisor .text + .rodata */
+rc = rangeset_remove_range(r, PFN_DOWN(__pa(&_stext)),
+   PFN_DOWN(__pa(&__2M_rodata_end)) - 1);
+if ( rc )
+return rc;
+
+/* hypervisor .data + .bss */
+if ( efi_boot_mem_unused(&start, &end) )
+{
+ASSERT(__pa(start) >= __pa(&__2M_rwdata_start));
+rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+   PFN_DOWN(__pa(start)) - 1);
+if ( rc )
+return rc;
+ASSERT(__pa(end) <= __pa(&__2M_rwdata_end));
+rc = rangeset_remove_range(r, PFN_DOWN(__pa(end)),
+   PFN_DOWN(__pa(&__2M_rwdata_end)) - 1);
+if ( rc )
+return rc;
+}
+else
+{
+rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+   PFN_DOWN(__pa(&__2M_rwdata_end)) - 1);
+if ( rc )
+return rc;
+}
+
+return 0;
+}
+
 static int __hwdom_init cf_check io_bitmap_cb(
 unsigned long s, unsigned long e, void *ctx)
 {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 59b0c7e980ca..fc5215a9dc40 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -370,6 +370,14 @@

Re: [PATCH] x86/ucode: Fix stability of the Raw CPU Policy rescan

2024-01-24 Thread Andrew Cooper
On 24/01/2024 3:37 pm, Jan Beulich wrote:
> On 23.01.2024 21:59, Andrew Cooper wrote:
>> Always run microcode_update_helper() on the BSP, so that the updated Raw CPU
>> policy doesn't get non-BSP topology details included.
> Wouldn't it be better (and consistent with ...
>
>> Have calculate_raw_cpu_policy() clear the instantaneous XSTATE sizes.  The
>> value XCR0 | MSR_XSS had when we scanned the policy isn't terribly
>> interesting to report.
> ... this) to purge these details from the raw policy as well then?

I did spend some time considering this.

Rerunning on the same CPU is more resilient to new topology leaves, so
we'd want to be doing that irrespective.

The XCR0/XSS state really is transient, and the useful information is
everywhere else in leaf 0xd.

>
>> When CPUID Masking is active, it affects CPUID instructions issued by Xen
>> too.  Transiently disable masking to get a clean scan.
>>
>> Fixes: 694d79ed5aac ("x86/ucode: Refresh raw CPU policy after microcode load")
>> Signed-off-by: Andrew Cooper 
> Irrespective of the question above, I'm also okay with the change as is:
> Reviewed-by: Jan Beulich 

Thanks.

~Andrew



Linux Xen PV CPA W^X violation false-positives

2024-01-24 Thread Jason Andryuk
Xen PV domains show CPA W^X violations like:

CPA detected W^X violation: 0064 -> 0067 range: 0x8881 - 0x88810fff PFN 10
WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 __change_page_attr_set_clr+0x113a/0x11c0
Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O)
CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G   O   6.1.38 #1
Workqueue: events bpf_prog_free_deferred
RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0
Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 2a fd ff ff 48 8b 85 60 ff ff ff 48
RSP: e02b:c9367c48 EFLAGS: 00010282
RAX:  RBX: 000ef064 RCX: 
RDX: 0003 RSI: f7ff RDI: 
RBP: c9367d48 R08:  R09: c9367aa0
R10: 0001 R11: 0001 R12: 0067
R13: 0001 R14: 8881 R15: c9367d60
FS:  () GS:88800b80() knlGS:
CS:  e030 DS:  ES:  CR0: 80050033
CR2: 7fdbaeda01c0 CR3: 04312000 CR4: 00050660
Call Trace:
 
 ? show_regs.cold+0x1a/0x1f
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? __warn+0x7b/0xc0
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? report_bug+0x111/0x1a0
 ? handle_bug+0x4d/0xa0
 ? exc_invalid_op+0x19/0x70
 ? asm_exc_invalid_op+0x1b/0x20
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? debug_smp_processor_id+0x17/0x20
 ? ___cache_free+0x2e/0x1e0
 ? _raw_spin_unlock+0x1e/0x40
 ? __purge_vmap_area_lazy+0x2ea/0x6b0
 set_direct_map_default_noflush+0x7c/0xa0
 __vunmap+0x1ac/0x280
 __vfree+0x1d/0x60
 vfree+0x27/0x40
 __bpf_prog_free+0x44/0x50
 bpf_prog_free_deferred+0x104/0x120
 process_one_work+0x1ca/0x3d0
 ? process_one_work+0x3d0/0x3d0
 worker_thread+0x45/0x3c0
 ? process_one_work+0x3d0/0x3d0
 kthread+0xe2/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 
---[ end trace  ]---

Xen provides a set of page tables that the guest executes out of when it
starts.  The L1 entries are shared between level2_ident_pgt and
level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in
the level2_ident_pgt entries.  verify_rwx() only checks the l1 entry and
reports a false-positive violation.

Here is a dump of some kernel virtual addresses and the corresponding
L1 and L2 entries:
This is the start of the directmap (ident) and they have NX (bit 63) set
in the PMD.
ndvm-pv (1): [0.466778] va=8880 pte=00100027 level: 1
ndvm-pv (1): [0.466788] va=8880 pmd=8242c067 level: 2
Directmap for kernel text:
ndvm-pv (1): [0.466795] va=88800100 pte=00100165 level: 1
ndvm-pv (1): [0.466801] va=88800100 pmd=82434067 level: 2
ndvm-pv (1): [0.466807] va=88800101 pte=001001010065 level: 1
ndvm-pv (1): [0.466814] va=88800101 pmd=82434067 level: 2
The start of the kernel text highmap is unmapped:
ndvm-pv (1): [0.466820] va=8000 pte= level: 3
ndvm-pv (1): [0.466826] va=8000 pmd= level: 3
Kernel PMD for .text has NX bit clear
ndvm-pv (1): [0.466832] va=8100 pte=00100165 level: 1
ndvm-pv (1): [0.466838] va=8100 pmd=02434067 level: 2
Kernel PTE for rodata_end has NX bit set
ndvm-pv (1): [0.466846] va=81e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466874] va=81e62000 pmd=0243b067 level: 2
Directmap of rodata_end
ndvm-pv (1): [0.466907] va=888001e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466913] va=888001e62000 pmd=8243b067 level: 2
Directmap of a low RAM address
ndvm-pv (1): [0.466920] va=8881 pte=00110027 level: 1
ndvm-pv (1): [0.466926] va=8881 pmd=8242c067 level: 2
Directmap of another RAM address close to but below kernel text
ndvm-pv (1): [0.466932] va=88800096c000 pte=00100096c027 level: 1
ndvm-pv (1): [0.466938] va=88800096c000 pmd=82430067 level: 2

Here are some L2 entries showing the differing NX bits for l2_ident vs.
l2_kernel while they point at the same L1 addresses:
ndvm-pv (1): [0.466944]  l2_ident[  0] pmd=8242c067
ndvm-pv (1): [0.466949]  l2_ident[  1] pmd=8242d067
ndvm-pv (1): [0.466955]  l2_ident[  8] pmd=82434067
ndvm-pv (1): [0.466959]  l2_ident[  9] pmd=82435067
ndvm-pv (1): [0.466964]  l2_ident[ 14] pmd=8243a067
ndvm-pv (1): [0.466969]  l2_ident[ 15] pmd=8243b067
ndvm-pv (1): [0.466974] l2_kernel[  8] pmd=02434067
ndvm-pv (1): [0.466979] l2_ker

Re: [PATCH] pmstat: Limit hypercalls under HWP

2024-01-24 Thread Jason Andryuk
On Tue, Jan 23, 2024 at 3:17 AM Jan Beulich  wrote:
>
> On 22.01.2024 20:09, Jason Andryuk wrote:
> > When HWP is active, the cpufreq P-state information is not updated.  In
> > that case, return -ENODEV instead of bogus, incomplete info.  The xenpm
> > command already supports skipping results when -ENODEV is returned, so
> > it is re-used when -EOPNOTSUPP might be more accurate.
> >
> > Similarly, set_cpufreq_para() is not applicable when HWP is active.
> > Many of the options already checked the governor and were inaccessible,
> > but SCALING_MIN/MAX_FREQ was still accessible (though it would do
> > nothing).  Add an earlier HWP check to handle all cases.
> >
> > Signed-off-by: Jason Andryuk 
> > ---
> > `xenpm get-cpufreq-states` now doesn't return any output.  It also exits
> > successfully since xenpm doesn't check the returns there.
>
> This isn't very nice. Is there nothing sensible that can be output
> instead in the HWP case? If not, I think it would help if
> inapplicability of the command would be indicated by at least one line
> of output. Or might it make sense to at least fall back to
> get-cpufreq-average in that case?

Sorry, I should have explained more, but, yes, not nice.  "No output
and exits with success" is how xenpm works if -ENODEV is received -
which I guess occurs when cpufreq is disabled (regardless of HWP).  I
found that surprising, but that behaviour is matched under HWP with
this patch.

Yes, `xenpm get-cpufreq-states` can be enhanced.  The re-use of ENODEV
was useful to make `xenpm get-cpufreq-average` output C-states but
skip P-states.  If EOPNOTSUPP is used, then that can differentiate
when HWP is used.  So `xenpm get-cpufreq-states` can print a message
when cpufreq is disabled, and a different one when P-states are
unavailable.

Regards,
Jason



[PATCH] xen/events: close evtchn after mapping cleanup

2024-01-24 Thread Maximilian Heyne
shutdown_pirq and startup_pirq do not take the
irq_mapping_update_lock because they can't, due to lock inversion: both
are called with the irq_desc->lock being taken. The lock order,
however, is first irq_mapping_update_lock and then irq_desc->lock.

This opens multiple races:
- shutdown_pirq can be interrupted by a function that allocates an event
  channel:

  CPU0CPU1
  shutdown_pirq {
xen_evtchn_close(e)
  __startup_pirq {
EVTCHNOP_bind_pirq
  -> returns just freed evtchn e
set_evtchn_to_irq(e, irq)
  }
xen_irq_info_cleanup() {
  set_evtchn_to_irq(e, -1)
}
  }

  Assume here that event channel e refers to the same event channel
  number.
  After this race the evtchn_to_irq mapping for e is invalid (-1).

- __startup_pirq races with __unbind_from_irq in a similar way. Because
  __startup_pirq doesn't take irq_mapping_update_lock it can grab the
  evtchn that __unbind_from_irq is currently freeing and cleaning up. In
  this case even though the event channel is allocated, its mapping can
  be unset in evtchn_to_irq.

The fix is to first clean up the mappings and then close the event
channel. This way, when an event channel gets allocated, its
potential previous evtchn_to_irq mappings are guaranteed to be unset already.
This is also the reverse of the allocation order, where first the event
channel is allocated and then the mappings are set up.

On a 5.10 kernel prior to commit 3fcdaf3d7634 ("xen/events: modify internal
[un]bind interfaces"), we hit a BUG like the following during probing of NVMe
devices. The issue is that during nvme_setup_io_queues, pci_free_irq
is called for every device which results in a call to shutdown_pirq.
With many NVMe devices it is therefore likely to hit this race during
boot, because multiple calls to shutdown_pirq and startup_pirq
potentially run in parallel.

  [ cut here ]
  blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; 
indirect descriptors: enabled; bounce buffer: enabled
  kernel BUG at drivers/xen/events/events_base.c:499!
  invalid opcode:  [#1] SMP PTI
  CPU: 44 PID: 375 Comm: kworker/u257:23 Not tainted 
5.10.201-191.748.amzn2.x86_64 #1
  Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
  Workqueue: nvme-reset-wq nvme_reset_work
  RIP: 0010:bind_evtchn_to_cpu+0xdf/0xf0
  Code: 5d 41 5e c3 cc cc cc cc 44 89 f7 e8 2b 55 ad ff 49 89 c5 48 85 c0 0f 84 
64 ff ff ff 4c 8b 68 30 41 83 fe ff 0f 85 60 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 
00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00
  RSP: :c9000d533b08 EFLAGS: 00010046
  RAX:  RBX:  RCX: 0006
  RDX: 0028 RSI:  RDI: 
  RBP: 888107419680 R08:  R09: 82d72b00
  R10:  R11:  R12: 01ed
  R13:  R14:  R15: 0002
  FS:  () GS:88bc8b50() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2:  CR3: 02610001 CR4: 001706e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   ? show_trace_log_lvl+0x1c1/0x2d9
   ? show_trace_log_lvl+0x1c1/0x2d9
   ? set_affinity_irq+0xdc/0x1c0
   ? __die_body.cold+0x8/0xd
   ? die+0x2b/0x50
   ? do_trap+0x90/0x110
   ? bind_evtchn_to_cpu+0xdf/0xf0
   ? do_error_trap+0x65/0x80
   ? bind_evtchn_to_cpu+0xdf/0xf0
   ? exc_invalid_op+0x4e/0x70
   ? bind_evtchn_to_cpu+0xdf/0xf0
   ? asm_exc_invalid_op+0x12/0x20
   ? bind_evtchn_to_cpu+0xdf/0xf0
   ? bind_evtchn_to_cpu+0xc5/0xf0
   set_affinity_irq+0xdc/0x1c0
   irq_do_set_affinity+0x1d7/0x1f0
   irq_setup_affinity+0xd6/0x1a0
   irq_startup+0x8a/0xf0
   __setup_irq+0x639/0x6d0
   ? nvme_suspend+0x150/0x150
   request_threaded_irq+0x10c/0x180
   ? nvme_suspend+0x150/0x150
   pci_request_irq+0xa8/0xf0
   ? __blk_mq_free_request+0x74/0xa0
   queue_request_irq+0x6f/0x80
   nvme_create_queue+0x1af/0x200
   nvme_create_io_queues+0xbd/0xf0
   nvme_setup_io_queues+0x246/0x320
   ? nvme_irq_check+0x30/0x30
   nvme_reset_work+0x1c8/0x400
   process_one_work+0x1b0/0x350
   worker_thread+0x49/0x310
   ? process_one_work+0x350/0x350
   kthread+0x11b/0x140
   ? __kthread_bind_mask+0x60/0x60
   ret_from_fork+0x22/0x30
  Modules linked in:
  ---[ end trace a11715de1eee1873 ]---

Fixes: d46a78b05c0e ("xen: implement pirq type event channels")
Cc: sta...@vger.kernel.org
Co-debugged-by: Andrew Panyakin 
Signed-off-by: Maximilian Heyne 
---
 drivers/xen/events/events_base.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index

Re: [PATCH] x86: amend 'n' debug-key output with SMI count

2024-01-24 Thread Andrew Cooper
On 24/01/2024 3:27 pm, Jan Beulich wrote:
> ... if available only, of course.
>
> Signed-off-by: Jan Beulich 
>
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -406,9 +406,15 @@ void __init early_cpu_init(bool verbose)
>   paddr_bits -= (ebx >> 6) & 0x3f;
>   }
>  
> - if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
> + if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) {
> + uint64_t smi_count;
> +
>   park_offline_cpus = opt_mce;
>  
> + if (!verbose && !rdmsr_safe(MSR_SMI_COUNT, smi_count))
> + setup_force_cpu_cap(X86_FEATURE_SMI_COUNT);
> + }
> +

I know you're re-using an existing condition, but I think it's more
likely that it's Intel-only than common to VIA and Shanghai.

Also, why is it gated on verbose?

(I think I can see why this is a rhetorical question, and I expect you can
guess what the feedback will be.)

>   initialize_cpu_data(0);
>  }
>  
> --- a/xen/arch/x86/include/asm/cpufeatures.h
> +++ b/xen/arch/x86/include/asm/cpufeatures.h
> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF,X86_SY
>  XEN_CPUFEATURE(MFENCE_RDTSC,  X86_SYNTH( 9)) /* MFENCE synchronizes 
> RDTSC */
>  XEN_CPUFEATURE(XEN_SMEP,  X86_SYNTH(10)) /* SMEP gets used by Xen 
> itself */
>  XEN_CPUFEATURE(XEN_SMAP,  X86_SYNTH(11)) /* SMAP gets used by Xen 
> itself */
> -/* Bit 12 unused. */
> +XEN_CPUFEATURE(SMI_COUNT, X86_SYNTH(12)) /* MSR_SMI_COUNT exists */
>  XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>  XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional 
> branch hardening */
> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -28,6 +28,8 @@
>  #define  TEST_CTRL_SPLITLOCK_DETECT (_AC(1, ULL) << 29)
>  #define  TEST_CTRL_SPLITLOCK_DISABLE(_AC(1, ULL) << 31)
>  
> +#define MSR_SMI_COUNT   0x0034
> +
>  #define MSR_INTEL_CORE_THREAD_COUNT 0x0035
>  #define  MSR_CTC_THREAD_MASK0x
>  #define  MSR_CTC_CORE_MASK  _AC(0x, U)
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -589,9 +589,20 @@ static void cf_check do_nmi_stats(unsign
>  unsigned int cpu;
>  bool pend, mask;
>  
> -printk("CPU\tNMI\n");
> +printk("CPU\tNMI%s\n", boot_cpu_has(X86_FEATURE_SMI_COUNT) ? "\tSMI" : 
> "");
>  for_each_online_cpu ( cpu )
> -printk("%3u\t%3u\n", cpu, per_cpu(nmi_count, cpu));
> +{
> +printk("%3u\t%3u", cpu, per_cpu(nmi_count, cpu));
> +if ( boot_cpu_has(X86_FEATURE_SMI_COUNT) )
> +{
> +unsigned int smi_count, dummy;
> +
> +rdmsr(MSR_SMI_COUNT, smi_count, dummy);
> +printk("\t%3u\n", smi_count);

This reads MSR_SMI_COUNT repeatedly on the same CPU.

You'll need to IPI all CPUs to dump the count into a per-cpu variable.

~Andrew



Re: [PATCH] x86/NMI: refine "watchdog stuck" log message

2024-01-24 Thread Andrew Cooper
On 24/01/2024 3:21 pm, Jan Beulich wrote:
> Observing
>
> "Testing NMI watchdog on all CPUs: 0 stuck"
>
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
>
> "Testing NMI watchdog on all CPUs: {0} stuck"
>
> or, with multiple CPUs having an issue, e.g.
>
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>
> to make more obvious that a lone number is not a count of CPUs.
>
> Signed-off-by: Jan Beulich 

I'd forgotten it was still opencoded like this.

Acked-by: Andrew Cooper , but if you felt like
turning this into using a scratch cpumask and %*pb, then I think that
would be nicer still.

~Andrew



Re: [PATCH] x86/NMI: refine "watchdog stuck" log message

2024-01-24 Thread Roger Pau Monné
On Wed, Jan 24, 2024 at 04:21:24PM +0100, Jan Beulich wrote:
> Observing
> 
> "Testing NMI watchdog on all CPUs: 0 stuck"
> 
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
> 
> "Testing NMI watchdog on all CPUs: {0} stuck"

To make this even more obvious, maybe it could be prefixed with "error":

"Testing NMI watchdog on all CPUs: error {0} stuck"

Hm, albeit I don't like it that much

> 
> or, with multiple CPUs having an issue, e.g.
> 
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
> 
> to make more obvious that a lone number is not a count of CPUs.
> 
> Signed-off-by: Jan Beulich 

Reviewed-by: Roger Pau Monné 

> ---
> In principle "sep" could also fulfill the job of "ok"; it felt to me as
> if this may not be liked very much, though.

I think I prefer it the current way.

Thanks, Roger.



[PATCH] x86/entry: Avoid register spilling in cr4_pv32_restore()

2024-01-24 Thread Andrew Cooper
cr4_pv32_restore() needs two registers.  Right now, it spills %rdx and
clobbers %rax.

However, %rcx is free to use at all callsites.  Annotate CR4_PV32_RESTORE with
our usual clobber comments, and swap %rdx for %rcx in the non-fatal paths.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 

I suspect we can further improve this by using %r14 rather than
GET_CPUINFO_FIELD(), but I'll leave that to a future change.
---
 xen/arch/x86/x86_64/compat/entry.S | 17 +++--
 xen/arch/x86/x86_64/entry.S|  8 
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/x86_64/compat/entry.S 
b/xen/arch/x86/x86_64/compat/entry.S
index 49811a56e965..d4f0e4804090 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -23,7 +23,7 @@ FUNC(entry_int82)
 
 sti
 
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 
 GET_CURRENT(bx)
 
@@ -163,17 +163,15 @@ FUNC(compat_restore_all_guest)
 _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 END(compat_restore_all_guest)
 
-/* This mustn't modify registers other than %rax. */
+/* Callers can cope with both %rax and %rcx being clobbered. */
 FUNC(cr4_pv32_restore)
-push  %rdx
-GET_CPUINFO_FIELD(cr4, dx)
-mov   (%rdx), %rax
+GET_CPUINFO_FIELD(cr4, cx)
+mov   (%rcx), %rax
 test  $XEN_CR4_PV32_BITS, %eax
 jnz   0f
 orcr4_pv32_mask(%rip), %rax
 mov   %rax, %cr4
-mov   %rax, (%rdx)
-pop   %rdx
+mov   %rax, (%rcx)
 ret
 0:
 #ifndef NDEBUG
@@ -191,7 +189,6 @@ FUNC(cr4_pv32_restore)
 BUG
 1:
 #endif
-pop   %rdx
 xor   %eax, %eax
 ret
 END(cr4_pv32_restore)
@@ -227,7 +224,7 @@ UNLIKELY_END(compat_syscall_gpf)
 END(compat_syscall)
 
 FUNC(compat_sysenter)
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 movq  VCPU_trap_ctxt(%rbx),%rcx
 cmpb  $X86_EXC_GP, UREGS_entry_vector(%rsp)
 movzwl VCPU_sysenter_sel(%rbx),%eax
@@ -242,7 +239,7 @@ FUNC(compat_sysenter)
 END(compat_sysenter)
 
 FUNC(compat_int80_direct_trap)
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 call  compat_create_bounce_frame
 jmp   compat_test_all_events
 END(compat_int80_direct_trap)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index c3f6b667a72a..6c53c0091168 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -309,7 +309,7 @@ FUNC(cstar_enter)
 .Lcstar_cr3_okay:
 sti
 
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 
 movq  STACK_CPUINFO_FIELD(current_vcpu)(%rbx), %rbx
 
@@ -712,7 +712,7 @@ FUNC(common_interrupt)
 cmovnz %r12d, %ebx
 .Lintr_cr3_okay:
 
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 movq %rsp,%rdi
 callq do_IRQ
 mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
@@ -762,7 +762,7 @@ handle_exception_saved:
 jz.Lcr4_pv32_done
 cmpb  $0,DOMAIN_is_32bit_pv(%rax)
 je.Lcr4_pv32_done
-call  cr4_pv32_restore
+call  cr4_pv32_restore /* Clob: ac */
 /*
  * An NMI or #MC may occur between clearing CR4.SMEP / CR4.SMAP in
  * compat_restore_all_guest and it actually returning to guest
@@ -1046,7 +1046,7 @@ FUNC(handle_ist_exception)
 .List_cr3_okay:
 
 #ifdef CONFIG_PV
-CR4_PV32_RESTORE
+CR4_PV32_RESTORE /* Clob: ac */
 testb $3,UREGS_cs(%rsp)
 jz1f
 /*

base-commit: 98ae35cab0e40e59963f9f58345bf378b9522d74
-- 
2.30.2




Re: [PATCH] x86/entry: replace two GET_CURRENT() uses

2024-01-24 Thread Andrew Cooper
On 24/01/2024 3:23 pm, Jan Beulich wrote:
> Now that we have %r14 set up using GET_STACK_END() in a number of
> places, in two places we can eliminate the redundancy of GET_CURRENT()
> also invoking that macro. In handle_ist_exception() actually go a step
> farther and avoid using %rbx altogether when retrieving the processor
> ID: Obtain the current vCPU pointer only in the PV32-specific code
> actually needing it.
>
> Signed-off-by: Jan Beulich 

Reviewed-by: Andrew Cooper 



Re: [PATCH v3 30/34] xen/riscv: add minimal stuff to processor.h to build full Xen

2024-01-24 Thread Jan Beulich
On 24.01.2024 16:33, Oleksii wrote:
> On Wed, 2024-01-24 at 12:27 +0100, Jan Beulich wrote:
>> On 24.01.2024 11:12, Oleksii wrote:
>>> On Wed, 2024-01-24 at 09:19 +0100, Jan Beulich wrote:
 On 23.01.2024 18:08, Oleksii wrote:
> On Tue, 2024-01-23 at 12:39 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> @@ -53,6 +56,18 @@ struct cpu_user_regs
>>>  unsigned long pregs;
>>>  };
>>>  
>>> +/* TODO: need to implement */
>>> +#define cpu_to_core(cpu)   (0)
>>> +#define cpu_to_socket(cpu) (0)
>>> +
>>> +static inline void cpu_relax(void)
>>> +{
>>> +    /* Encoding of the pause instruction */
>>> +    __asm__ __volatile__ ( ".insn 0x10F" );
>>
>> binutils 2.40 knows "pause" - why use .insn then?
> I thought that for this instruction it is needed to have
> extension
> ZIHINTPAUSE ( according to Linux Kernel source code [1] ) and
> to
> cover
> older version.

 Well, of course you'll need to enable the extension then for gas.
 But
 as long as you use the insn unconditionally, that's all fine and
 natural. Another thing would be if you meant to also run on
 systems
 not supporting the extension: Then the above use of .insn would
 need
 to become conditional anyway.
>>> Then it makes sense to use "pause". 
>>> Let's assume that for now we are running only on systems which
>>> support
>>> the extension until we face a compilation issue on some
>>> system.
>>
>> Gives me the impression that you still don't properly separate the
>> two
>> aspects: One is what systems Xen is to run on, and other is what's
>> needed to make Xen build properly. The first needs documenting (and
>> ideally at some point actually enforcing), while the latter may
>> require
>> e.g. compiler command line option adjustments.
> I understand that it will be required update "-march=..._zihintpause"
> and it should be a check that this extension is supported by a
> toolchain.
> 
> But I am not sure that I know how I can enforce that a system should
> have this extension, and considering the Linux kernel implementation,
> which always uses the pause instruction, it looks like all available
> systems support this extension.

Which is why I said documenting will suffice, at least for now.

Jan



Re: [PATCH] x86/ucode: Fix stability of the Raw CPU Policy rescan

2024-01-24 Thread Jan Beulich
On 23.01.2024 21:59, Andrew Cooper wrote:
> Always run microcode_update_helper() on the BSP, so that the updated Raw CPU
> policy doesn't get non-BSP topology details included.

Wouldn't it be better (and consistent with ...

> Have calculate_raw_cpu_policy() clear the instantaneous XSTATE sizes.  The
> value XCR0 | MSR_XSS had when we scanned the policy isn't terribly interesting
> to report.

... this) to purge these details from the raw policy as well then?

> When CPUID Masking is active, it affects CPUID instructions issued by Xen
> too.  Transiently disable masking to get a clean scan.
> 
> Fixes: 694d79ed5aac ("x86/ucode: Refresh raw CPU policy after microcode load")
> Signed-off-by: Andrew Cooper 

Irrespective of the question above, I'm also okay with the change as is:
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH v3 30/34] xen/riscv: add minimal stuff to processor.h to build full Xen

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 12:27 +0100, Jan Beulich wrote:
> On 24.01.2024 11:12, Oleksii wrote:
> > On Wed, 2024-01-24 at 09:19 +0100, Jan Beulich wrote:
> > > On 23.01.2024 18:08, Oleksii wrote:
> > > > On Tue, 2024-01-23 at 12:39 +0100, Jan Beulich wrote:
> > > > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > > > @@ -53,6 +56,18 @@ struct cpu_user_regs
> > > > > >  unsigned long pregs;
> > > > > >  };
> > > > > >  
> > > > > > +/* TODO: need to implement */
> > > > > > +#define cpu_to_core(cpu)   (0)
> > > > > > +#define cpu_to_socket(cpu) (0)
> > > > > > +
> > > > > > +static inline void cpu_relax(void)
> > > > > > +{
> > > > > > +    /* Encoding of the pause instruction */
> > > > > > +    __asm__ __volatile__ ( ".insn 0x10F" );
> > > > > 
> > > > > binutils 2.40 knows "pause" - why use .insn then?
> > > > I thought that for this instruction it is needed to have
> > > > extension
> > > > ZIHINTPAUSE ( according to Linux Kernel source code [1] ) and
> > > > to
> > > > cover
> > > > older version.
> > > 
> > > Well, of course you'll need to enable the extension then for gas.
> > > But
> > > as long as you use the insn unconditionally, that's all fine and
> > > natural. Another thing would be if you meant to also run on
> > > systems
> > > not supporting the extension: Then the above use of .insn would
> > > need
> > > to become conditional anyway.
> > Then it makes sense to use "pause". 
> > Let's assume that for now we are running only on systems which
> > support
> > the extension until we face a compilation issue on some
> > system.
> 
> Gives me the impression that you still don't properly separate the
> two
> aspects: One is what systems Xen is to run on, and other is what's
> needed to make Xen build properly. The first needs documenting (and
> ideally at some point actually enforcing), while the latter may
> require
> e.g. compiler command line option adjustments.
I understand that it will be required update "-march=..._zihintpause"
and it should be a check that this extension is supported by a
toolchain.

But I am not sure that I know how I can enforce that a system should
have this extension, and considering the Linux kernel implementation,
which always uses the pause instruction, it looks like all available
systems support this extension.

But I agree that what I wrote above isn't fully correct.

~ Oleksii



Re: [PATCH v3 16/34] xen/lib: introduce generic find next bit operations

2024-01-24 Thread Jan Beulich
On 24.01.2024 16:04, Oleksii wrote:
> On Wed, 2024-01-24 at 12:24 +0100, Jan Beulich wrote:
>> On 24.01.2024 10:34, Oleksii wrote:
>>> On Tue, 2024-01-23 at 14:37 +0100, Jan Beulich wrote:
 On 23.01.2024 13:34, Oleksii wrote:
> On Tue, 2024-01-23 at 12:14 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> --- /dev/null
>>> +++ b/xen/lib/find-next-bit.c
>>> [...]
>>
>> I was going to ask that you convince git to actually present
>> a
>> proper
>> diff, to make visible what changes. But other than the
>> description
>> says
>> you don't really move the file, you copy it. Judging from
>> further
>> titles
>> there's also nowhere you'd make Arm actually use this now
>> generic
>> code.
> I wanted to do it separately, outside this patch series to
> simplify
> review and not have Arm specific changes in RISC-V patch
> series.

 Then do it the other way around: Make a separate _prereq_ change
 truly
 moving the file.
>>> So this one patch should be separated by 2? One which moves find-
>>> next-
>>> bit.c from Arm to xen/lib, and second where xen/lib/Makefile is
>>> updated.
>>
>> No, that would break the Arm build. I suggested breaking out this
>> patch from the series, and then doing what the description says:
>> Actually move the file. I don't think I suggested splitting the
>> patch. Even the breaking out of the series was only because you
>> said "I wanted to do it separately, outside this patch series".
> What I meant was that I would like to have a patch which introduces
> generic version of find-next-bit in the current patch series and
> provide a separate patch outside of the current patch series which
> switches Arm to use generic version.

I understand that this is what you meant. Yet I don't like the
duplication of code, even if it's only temporary. The more that iirc
git can show proper history for moved files, while there'll be a
disconnect when you first add a 2nd copy and later purge the original.
If you want this change separate from the series - fine. But then, as
said, please as a prereq patch.

Jan



[PATCH] x86: amend 'n' debug-key output with SMI count

2024-01-24 Thread Jan Beulich
... if available only, of course.

Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -406,9 +406,15 @@ void __init early_cpu_init(bool verbose)
paddr_bits -= (ebx >> 6) & 0x3f;
}
 
-   if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
+   if (!(c->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) {
+   uint64_t smi_count;
+
park_offline_cpus = opt_mce;
 
+   if (!verbose && !rdmsr_safe(MSR_SMI_COUNT, smi_count))
+   setup_force_cpu_cap(X86_FEATURE_SMI_COUNT);
+   }
+
initialize_cpu_data(0);
 }
 
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF,X86_SY
 XEN_CPUFEATURE(MFENCE_RDTSC,  X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC 
*/
 XEN_CPUFEATURE(XEN_SMEP,  X86_SYNTH(10)) /* SMEP gets used by Xen 
itself */
 XEN_CPUFEATURE(XEN_SMAP,  X86_SYNTH(11)) /* SMAP gets used by Xen 
itself */
-/* Bit 12 unused. */
+XEN_CPUFEATURE(SMI_COUNT, X86_SYNTH(12)) /* MSR_SMI_COUNT exists */
 XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
 XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */
 XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional 
branch hardening */
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -28,6 +28,8 @@
 #define  TEST_CTRL_SPLITLOCK_DETECT (_AC(1, ULL) << 29)
 #define  TEST_CTRL_SPLITLOCK_DISABLE(_AC(1, ULL) << 31)
 
+#define MSR_SMI_COUNT   0x0034
+
 #define MSR_INTEL_CORE_THREAD_COUNT 0x0035
 #define  MSR_CTC_THREAD_MASK0x
 #define  MSR_CTC_CORE_MASK  _AC(0x, U)
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -589,9 +589,20 @@ static void cf_check do_nmi_stats(unsign
 unsigned int cpu;
 bool pend, mask;
 
-printk("CPU\tNMI\n");
+printk("CPU\tNMI%s\n", boot_cpu_has(X86_FEATURE_SMI_COUNT) ? "\tSMI" : "");
 for_each_online_cpu ( cpu )
-printk("%3u\t%3u\n", cpu, per_cpu(nmi_count, cpu));
+{
+printk("%3u\t%3u", cpu, per_cpu(nmi_count, cpu));
+if ( boot_cpu_has(X86_FEATURE_SMI_COUNT) )
+{
+unsigned int smi_count, dummy;
+
+rdmsr(MSR_SMI_COUNT, smi_count, dummy);
+printk("\t%3u\n", smi_count);
+}
+else
+printk("\n");
+}
 
 if ( !hardware_domain || !(v = domain_vcpu(hardware_domain, 0)) )
 return;



[PATCH] x86/entry: replace two GET_CURRENT() uses

2024-01-24 Thread Jan Beulich
Now that we have %r14 set up using GET_STACK_END() in a number of
places, in two places we can eliminate the redundancy of GET_CURRENT()
also invoking that macro. In handle_ist_exception() actually go a step
farther and avoid using %rbx altogether when retrieving the processor
ID: Obtain the current vCPU pointer only in the PV32-specific code
actually needing it.

Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -118,6 +118,7 @@ void __dummy__(void)
 #endif
 
 OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
 OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
 OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
 OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -749,7 +749,7 @@ FUNC(handle_exception, 0)
 .Lxcpt_cr3_okay:
 
 handle_exception_saved:
-GET_CURRENT(bx)
+mov   STACK_CPUINFO_FIELD(current_vcpu)(%r14), %rbx
 testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
 jzexception_with_ints_disabled
 
@@ -1128,9 +1128,8 @@ handle_ist_exception:
 #ifdef CONFIG_PV
 testb $3,UREGS_cs(%rsp)
 jzrestore_all_xen
-GET_CURRENT(bx)
 /* Send an IPI to ourselves to cover for the lack of event checking. */
-movl  VCPU_processor(%rbx),%eax
+mov   STACK_CPUINFO_FIELD(processor_id)(%r14), %eax
 shll  $IRQSTAT_shift,%eax
 leaq  irq_stat+IRQSTAT_softirq_pending(%rip),%rcx
 cmpl  $0,(%rcx,%rax,1)
@@ -1139,6 +1138,7 @@ handle_ist_exception:
 call  send_IPI_self
 1:
 #ifdef CONFIG_PV32
+mov   STACK_CPUINFO_FIELD(current_vcpu)(%r14), %rbx
 movq  VCPU_domain(%rbx),%rax
 cmpb  $0,DOMAIN_is_32bit_pv(%rax)
 jerestore_all_guest



[PATCH] x86/NMI: refine "watchdog stuck" log message

2024-01-24 Thread Jan Beulich
Observing

"Testing NMI watchdog on all CPUs: 0 stuck"

it felt like it's not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to

"Testing NMI watchdog on all CPUs: {0} stuck"

or, with multiple CPUs having an issue, e.g.

"Testing NMI watchdog on all CPUs: {0,40} stuck"

to make more obvious that a lone number is not a count of CPUs.

Signed-off-by: Jan Beulich 
---
In principle "sep" could also fulfill the job of "ok"; it felt to me as
if this may not be liked very much, though.

--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -167,13 +167,14 @@ static void __init cf_check wait_for_nmi
 void __init check_nmi_watchdog(void)
 {
 static unsigned int __initdata prev_nmi_count[NR_CPUS];
-int cpu;
+unsigned int cpu;
+char sep = '{';
 bool ok = true;
 
 if ( nmi_watchdog == NMI_NONE )
 return;
 
-printk("Testing NMI watchdog on all CPUs:");
+printk("Testing NMI watchdog on all CPUs: ");
 
 for_each_online_cpu ( cpu )
 prev_nmi_count[cpu] = per_cpu(nmi_count, cpu);
@@ -189,12 +190,13 @@ void __init check_nmi_watchdog(void)
 {
 if ( per_cpu(nmi_count, cpu) - prev_nmi_count[cpu] < 2 )
 {
-printk(" %d", cpu);
+printk("%c%u", sep, cpu);
+sep = ',';
 ok = false;
 }
 }
 
-printk(" %s\n", ok ? "ok" : "stuck");
+printk("%s\n", ok ? "ok" : "} stuck");
 
 /*
  * Now that we know it works we can reduce NMI frequency to



Re: Thoughts on current Xen EDAC/MCE situation

2024-01-24 Thread Elliott Mitchell
On Wed, Jan 24, 2024 at 08:23:15AM +0100, Jan Beulich wrote:
> On 23.01.2024 23:52, Elliott Mitchell wrote:
> > On Tue, Jan 23, 2024 at 11:44:03AM +0100, Jan Beulich wrote:
> >> On 22.01.2024 21:53, Elliott Mitchell wrote:
> >>
> >>> I find the present handling of MCE in Xen an odd choice.  Having Xen do
> >>> most of the handling of MCE events is a behavior matching a traditional
> >>> stand-alone hypervisor.  Yet Xen was originally pushing any task not
> >>> requiring hypervisor action onto Domain 0.
> >>
> >> Not exactly. Xen in particular deals with all of CPU and all of memory.
> >> Dom0 may be unaware of the full amount of CPUs in the system, nor the
> >> full memory map (without resorting to interfaces specifically making
> >> that information available, but not to be used for Dom0 kernel's own
> >> acting as a kernel).
> > 
> > Why would this be an issue?
> 
> Well, counter question: For all of ...
> 
> > I would expect the handling to be roughly:  NMI -> Xen; Xen schedules a
> > Dom0 vCPU which is eligible to run on the pCPU onto the pCPU; Dom0
> > examines registers/MSRs, Dom0 then issues a hypercall to Xen telling
> > Xen how to resolve the issue (no action, fix memory contents, kill page).
> > 
> > Ideally there would be an idle Dom0 vCPU, but interrupting a busy vCPU
> > would be viable.  It would even be reasonable to ignore affinity and
> > grab any Dom0 vCPU.
> > 
> > Dom0 has 2 purposes for the address.  First, to pass it back to Xen.
> > Second, to report it to a system administrator so they could restart the
> > system with that address marked as bad.  Dom0 wouldn't care whether the
> > address was directly accessible to it or not.
> > 
> > The proposed hypercall should report back what was affected by a UE
> > event.  A given site might have a policy that if $some_domain is hit by a
> > UE, everything is restarted.  Meanwhile Dom0 or Xen being the one hit
> > could deserve urgent action.
> 
> ... this, did you first look at the code and figure out how what you
> suggest could be seamlessly integrated? Part of your suggestion (if I
> got it right) is, after all, to make maintenance on the Dom0 kernel side
> easy. I expect such adjustments being not overly intrusive would also be
> an acceptance criterion for the maintainers.

Maintenance on the Dom0 kernel isn't the issue.

One issue is making MCE reporting when running on Xen consistent with
MCE reporting when not running on Xen.  Notably a similar level of
information, and ideally the tools which assist with analyzing failures
continuing to work when running on Xen.

Another issue is doing a better job of keeping Xen up to date with MCE
handling as new hardware with new MCE implementations shows up.

> Second - since you specifically talk about UE: The more code is involved
> in handling, the higher the chance of the #MC ending up fatal to the
> system.

Indeed.  Yet right now I'm more concerned over whether MCE reporting is
happening at all.  There aren't very many messages at all.

> Third, as to Dom0's purposes of having the address: If all it is to use
> it for is to pass it back to Xen, paths in the respective drivers will
> necessarily be entirely different for the Xen vs the native cases.

I'm less than certain of the best place for Xen to intercept MCE events.
For UE memory events, the simplest approach on Linux might be to wrap the
memory_failure() function.  Yet for Linux/x86,
mce_register_decode_chain() also looks like a very good candidate.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





Re: [PATCH v3 16/34] xen/lib: introduce generic find next bit operations

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 12:24 +0100, Jan Beulich wrote:
> On 24.01.2024 10:34, Oleksii wrote:
> > On Tue, 2024-01-23 at 14:37 +0100, Jan Beulich wrote:
> > > On 23.01.2024 13:34, Oleksii wrote:
> > > > On Tue, 2024-01-23 at 12:14 +0100, Jan Beulich wrote:
> > > > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > > > --- a/xen/common/Kconfig
> > > > > > +++ b/xen/common/Kconfig
> > > > > > @@ -47,6 +47,9 @@ config ARCH_MAP_DOMAIN_PAGE
> > > > > >  config GENERIC_BUG_FRAME
> > > > > >     bool
> > > > > >  
> > > > > > +config GENERIC_FIND_NEXT_BIT
> > > > > > +   bool
> > > > > 
> > > > > There's no need for this, as ...
> > > > > 
> > > > > > --- a/xen/lib/Makefile
> > > > > > +++ b/xen/lib/Makefile
> > > > > > @@ -3,6 +3,7 @@ obj-$(CONFIG_X86) += x86/
> > > > > >  lib-y += bsearch.o
> > > > > >  lib-y += ctors.o
> > > > > >  lib-y += ctype.o
> > > > > > +lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find-next-bit.o
> > > > > 
> > > > > ... you're moving this to lib/. Or have you encountered any
> > > > > issue
> > > > > with building this uniformly, and you forgot to mention this
> > > > > in
> > > > > the description?
> > > > I didn't check. My intention was to provide an opportunity for an
> > > > architecture to choose whether to use the generic version or not.
> > > > Otherwise, I expected that we would have multiple definitions of
> > > > the function.
> > > > 
> > > > But considering that they are all defined under #ifdef...#endif
> > > > we
> > > > can
> > > > remove the declaration of the config GENERIC_FIND_NEXT_BIT.
> > > 
> > > What #ifdef / #endif would matter here? What's in lib/ is intended
> > > to be generic anyway. And what is in the resulting lib.a won't be
> > > used by an arch if it has an arch-specific implementation.
> > If what is in lib.a won't be used by an arch that has an
> > arch-specific implementation then, for sure, I have to drop
> > CONFIG_GENERIC_FIND_NEXT_BIT.
> > But I don't really understand: if lib.a is linked with Xen, wouldn't
> > there be an issue when some arch implements a find-next-bit function,
> > since we would then have multiple definitions (one in lib.a and one
> > arch-specific)? Probably I have to look at how it is done.
> 
> You're aware of how linking works? Objects are pulled out of archives
> only if there's no other definition for a to-be-resolved symbol
> provided by a particular object in the archive.
I wasn't aware about the case of the archive. Thanks for the
explanation.

> 
> > > Problems could arise if an arch had an inline function colliding
> > > with the out-of-line one. But that's about the odd case where I
> > > could see a need to make the building of one of the objects
> > > conditional. And you'll note that within this
> > Could you please clarify What does it mean "out-of-line" ?
> 
> "not inline"
> 
> > > > > > --- /dev/null
> > > > > > +++ b/xen/lib/find-next-bit.c
> > > > > > [...]
> > > > > 
> > > > > I was going to ask that you convince git to actually present a
> > > > > proper diff, to make visible what changes. But, other than what
> > > > > the description says, you don't really move the file; you copy
> > > > > it. Judging from further titles there's also nowhere you'd make
> > > > > Arm actually use this now-generic code.
> > > > I wanted to do it separately, outside this patch series, to
> > > > simplify review and not have Arm-specific changes in the RISC-V
> > > > patch series.
> > > 
> > > Then do it the other way around: Make a separate _prereq_ change
> > > truly
> > > moving the file.
> > So this one patch should be split in two? One which moves
> > find-next-bit.c from Arm to xen/lib, and a second where
> > xen/lib/Makefile is updated.
> 
> No, that would break the Arm build. I suggested breaking out this
> patch from the series, and then doing what the description says:
> Actually move the file. I don't think I suggested splitting the
> patch. Even the breaking out of the series was only because you
> said "I wanted to do it separately, outside this patch series".
What I meant was that I would like to have a patch which introduces the
generic version of find-next-bit in the current patch series, and to
provide a separate patch outside of the current patch series which
switches Arm to the generic version.

~ Oleksii



Re: [PATCH v3 15/34] xen/riscv: introduce atomic.h

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 12:19 +0100, Jan Beulich wrote:
> On 24.01.2024 10:23, Oleksii wrote:
> > On Tue, 2024-01-23 at 14:30 +0100, Jan Beulich wrote:
> > > On 23.01.2024 13:24, Oleksii wrote:
> > > > On Tue, 2024-01-23 at 11:30 +0100, Jan Beulich wrote:
> > > > > On 23.01.2024 11:21, Oleksii wrote:
> > > > > > On Mon, 2024-01-22 at 17:56 +0100, Jan Beulich wrote:
> > > > > > > On 22.12.2023 16:12, Oleksii Kurochko wrote:
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/xen/arch/riscv/include/asm/fence.h
> > > > > > > > @@ -0,0 +1,13 @@
> > > > > > > > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > > > > > > > +#ifndef _ASM_RISCV_FENCE_H
> > > > > > > > +#define _ASM_RISCV_FENCE_H
> > > > > > > > +
> > > > > > > > +#ifdef CONFIG_SMP
> > > > > > > > +#define RISCV_ACQUIRE_BARRIER  "\tfence r ,
> > > > > > > > rw\n"
> > > > > > > > +#define RISCV_RELEASE_BARRIER  "\tfence rw, 
> > > > > > > > w\n"
> > > > > > > > +#else
> > > > > > > > +#define RISCV_ACQUIRE_BARRIER
> > > > > > > > +#define RISCV_RELEASE_BARRIER
> > > > > > > > +#endif
> > > > > > > 
> > > > > > > Do you really care about the !SMP case? On x86 at least
> > > > > > > we
> > > > > > > stopped
> > > > > > > special-
> > > > > > > casing that configuration many years ago (the few cases
> > > > > > > where
> > > > > > > for
> > > > > > > typically
> > > > > > > build reasons it matters, using CONFIG_NR_CPUS is
> > > > > > > sufficient). If
> > > > > > > you
> > > > > > > care
> > > > > > > about it, there needs to be somewhere you actually
> > > > > > > #define
> > > > > > > CONFIG_SMP.
> > > > > > Can't we use CONFIG_NR_CPUS instead of CONFIG_SMP?
> > > > > 
> > > > > You can. Question is whether there's a point in doing so. Do
> > > > > you
> > > > > expect people to actually want to run Xen on single-CPU
> > > > > systems?
> > > > > They're generally not overly well suited for virtualization
> > > > > ...
> > > > Just to clarify.
> > > > 
> > > > Do you mean a physically single-CPU system?
> > > > Then I don't expect to run Xen on such systems and it is not
> > > > necessary to define *_BARRIER in this case. Should we add a build
> > > > error notification that we don't support single-CPU systems in
> > > > this header?
> > > > 
> > > > If you are speaking about having, let's say, 4 CPUs with only 1
> > > > CPU currently supported by Xen, then it still makes sense.
> > > 
> > > No, that's still not what I mean. The question is: Is it useful
> > > for
> > > you
> > > to _special case_ the NR_CPUS=1 case? Or is it instead simpler to
> > > handle
> > > NR_CPUS=1 the same as NR_CPUS>1 (accepting less than ideal
> > > performance,
> > > on the basis that in reality nobody's expected to use such in
> > > production
> > > anyway)?
> > NR_CPUS=1 is sometimes useful for debugging. At least at the start I
> > used it several times, but honestly I don't remember when I last used
> > that case after SMP support was added and context_switch() was fixed.
> 
> And "sometimes is useful for debugging" warrants introducing special
> cases? I've not suggested disallowing that configuration. I'm merely
> asking whether it isn't easier to have the barriers there at all
> times. Just like on x86 we now leave the LOCK prefixes in place at
> all times.
I misunderstood your initial suggestion. In this case we can always
have the barriers. I'll then drop the #ifdef CONFIG_SMP.
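Dropping the #ifdef, the revised fence.h would presumably shrink to just the unconditional definitions (a sketch of the outcome, not the committed version):

```c
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef _ASM_RISCV_FENCE_H
#define _ASM_RISCV_FENCE_H

/*
 * Always emit the fences; the NR_CPUS=1 case simply pays the (small)
 * cost rather than special-casing UP builds.
 */
#define RISCV_ACQUIRE_BARRIER   "\tfence r , rw\n"
#define RISCV_RELEASE_BARRIER   "\tfence rw, w\n"

#endif /* _ASM_RISCV_FENCE_H */
```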

Thanks for clarification.

~ Oleksii
> 
> > Probably I misunderstand the real idea of NR_CPUS. Does NR_CPUS
> > represent a number of logical CPUs, which can be different from the
> > physical number of CPUs?
> 
> No.
> 
> Jan




[PATCH AUTOSEL 5.10 6/7] xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import

2024-01-24 Thread Sasha Levin
From: Oleksandr Tyshchenko 

[ Upstream commit 2d2db7d40254d5fb53b11ebd703cd1ed0c5de7a1 ]

DO NOT access the underlying struct page of an sg table exported
by DMA-buf in dmabuf_imp_to_refs(), this is not allowed.
Please see drivers/dma-buf/dma-buf.c:mangle_sg_table() for details.

Fortunately, here (for special Xen device) we can avoid using
pages and calculate gfns directly from dma addresses provided by
the sg table.

Suggested-by: Daniel Vetter 
Signed-off-by: Oleksandr Tyshchenko 
Acked-by: Christian König 
Reviewed-by: Stefano Stabellini 
Acked-by: Daniel Vetter 
Link: https://lore.kernel.org/r/20240107103426.2038075-1-olekst...@gmail.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/xen/gntdev-dmabuf.c | 50 ++---
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/xen/gntdev-dmabuf.c b/drivers/xen/gntdev-dmabuf.c
index 4c13cbc99896..398ea69c176c 100644
--- a/drivers/xen/gntdev-dmabuf.c
+++ b/drivers/xen/gntdev-dmabuf.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -56,7 +57,7 @@ struct gntdev_dmabuf {
 
/* Number of pages this buffer has. */
int nr_pages;
-   /* Pages of this buffer. */
+   /* Pages of this buffer (only for dma-buf export). */
struct page **pages;
 };
 
@@ -490,7 +491,7 @@ static int dmabuf_exp_from_refs(struct gntdev_priv *priv, int flags,
 /* DMA buffer import support. */
 
 static int
-dmabuf_imp_grant_foreign_access(struct page **pages, u32 *refs,
+dmabuf_imp_grant_foreign_access(unsigned long *gfns, u32 *refs,
int count, int domid)
 {
grant_ref_t priv_gref_head;
@@ -513,7 +514,7 @@ dmabuf_imp_grant_foreign_access(struct page **pages, u32 *refs,
}
 
gnttab_grant_foreign_access_ref(cur_ref, domid,
-   xen_page_to_gfn(pages[i]), 0);
+   gfns[i], 0);
refs[i] = cur_ref;
}
 
@@ -535,7 +536,6 @@ static void dmabuf_imp_end_foreign_access(u32 *refs, int count)
 
 static void dmabuf_imp_free_storage(struct gntdev_dmabuf *gntdev_dmabuf)
 {
-   kfree(gntdev_dmabuf->pages);
kfree(gntdev_dmabuf->u.imp.refs);
kfree(gntdev_dmabuf);
 }
@@ -555,12 +555,6 @@ static struct gntdev_dmabuf *dmabuf_imp_alloc_storage(int count)
if (!gntdev_dmabuf->u.imp.refs)
goto fail;
 
-   gntdev_dmabuf->pages = kcalloc(count,
-  sizeof(gntdev_dmabuf->pages[0]),
-  GFP_KERNEL);
-   if (!gntdev_dmabuf->pages)
-   goto fail;
-
gntdev_dmabuf->nr_pages = count;
 
for (i = 0; i < count; i++)
@@ -582,7 +576,8 @@ dmabuf_imp_to_refs(struct gntdev_dmabuf_priv *priv, struct device *dev,
struct dma_buf *dma_buf;
struct dma_buf_attachment *attach;
struct sg_table *sgt;
-   struct sg_page_iter sg_iter;
+   struct sg_dma_page_iter sg_iter;
+   unsigned long *gfns;
int i;
 
dma_buf = dma_buf_get(fd);
@@ -630,26 +625,31 @@ dmabuf_imp_to_refs(struct gntdev_dmabuf_priv *priv, struct device *dev,
 
gntdev_dmabuf->u.imp.sgt = sgt;
 
-   /* Now convert sgt to array of pages and check for page validity. */
+   gfns = kcalloc(count, sizeof(*gfns), GFP_KERNEL);
+   if (!gfns) {
+   ret = ERR_PTR(-ENOMEM);
+   goto fail_unmap;
+   }
+
+   /*
+* Now convert sgt to array of gfns without accessing underlying pages.
+* It is not allowed to access the underlying struct page of an sg table
+* exported by DMA-buf, but since we deal with special Xen dma device here
+* (not a normal physical one) look at the dma addresses in the sg table
+* and then calculate gfns directly from them.
+*/
i = 0;
-   for_each_sgtable_page(sgt, &sg_iter, 0) {
-   struct page *page = sg_page_iter_page(&sg_iter);
-   /*
-* Check if page is valid: this can happen if we are given
-* a page from VRAM or other resources which are not backed
-* by a struct page.
-*/
-   if (!pfn_valid(page_to_pfn(page))) {
-   ret = ERR_PTR(-EINVAL);
-   goto fail_unmap;
-   }
+   for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
+   dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
+   unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
 
-   gntdev_dmabuf->pages[i++] = page;
+   gfns[i++] = pfn_to_gfn(pfn);
}
 
-   ret = ERR_PTR(dmabuf_imp_grant_foreign_access(gntdev_dmabuf->pages,
+   ret = ERR_PTR(dmabuf_imp_grant_foreign_access(gfns,
  gntdev_dmab

[PATCH AUTOSEL 5.15 7/9] xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import

2024-01-24 Thread Sasha Levin

[PATCH AUTOSEL 6.1 7/9] xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import

2024-01-24 Thread Sasha Levin

[PATCH AUTOSEL 6.6 09/11] xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import

2024-01-24 Thread Sasha Levin

[PATCH AUTOSEL 6.7 11/13] xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import

2024-01-24 Thread Sasha Levin
From: Oleksandr Tyshchenko 

[ Upstream commit 2d2db7d40254d5fb53b11ebd703cd1ed0c5de7a1 ]

DO NOT access the underlying struct page of an sg table exported
by DMA-buf in dmabuf_imp_to_refs(), this is not allowed.
Please see drivers/dma-buf/dma-buf.c:mangle_sg_table() for details.

Fortunately, here (for special Xen device) we can avoid using
pages and calculate gfns directly from dma addresses provided by
the sg table.

Suggested-by: Daniel Vetter 
Signed-off-by: Oleksandr Tyshchenko 
Acked-by: Christian König 
Reviewed-by: Stefano Stabellini 
Acked-by: Daniel Vetter 
Link: https://lore.kernel.org/r/20240107103426.2038075-1-olekst...@gmail.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/xen/gntdev-dmabuf.c | 50 ++---
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/xen/gntdev-dmabuf.c b/drivers/xen/gntdev-dmabuf.c
index 4440e626b797..42adc2c1e06b 100644
--- a/drivers/xen/gntdev-dmabuf.c
+++ b/drivers/xen/gntdev-dmabuf.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -50,7 +51,7 @@ struct gntdev_dmabuf {
 
/* Number of pages this buffer has. */
int nr_pages;
-   /* Pages of this buffer. */
+   /* Pages of this buffer (only for dma-buf export). */
struct page **pages;
 };
 
@@ -484,7 +485,7 @@ static int dmabuf_exp_from_refs(struct gntdev_priv *priv, 
int flags,
 /* DMA buffer import support. */
 
 static int
-dmabuf_imp_grant_foreign_access(struct page **pages, u32 *refs,
+dmabuf_imp_grant_foreign_access(unsigned long *gfns, u32 *refs,
int count, int domid)
 {
grant_ref_t priv_gref_head;
@@ -507,7 +508,7 @@ dmabuf_imp_grant_foreign_access(struct page **pages, u32 
*refs,
}
 
gnttab_grant_foreign_access_ref(cur_ref, domid,
-   xen_page_to_gfn(pages[i]), 0);
+   gfns[i], 0);
refs[i] = cur_ref;
}
 
@@ -529,7 +530,6 @@ static void dmabuf_imp_end_foreign_access(u32 *refs, int 
count)
 
 static void dmabuf_imp_free_storage(struct gntdev_dmabuf *gntdev_dmabuf)
 {
-   kfree(gntdev_dmabuf->pages);
kfree(gntdev_dmabuf->u.imp.refs);
kfree(gntdev_dmabuf);
 }
@@ -549,12 +549,6 @@ static struct gntdev_dmabuf *dmabuf_imp_alloc_storage(int 
count)
if (!gntdev_dmabuf->u.imp.refs)
goto fail;
 
-   gntdev_dmabuf->pages = kcalloc(count,
-  sizeof(gntdev_dmabuf->pages[0]),
-  GFP_KERNEL);
-   if (!gntdev_dmabuf->pages)
-   goto fail;
-
gntdev_dmabuf->nr_pages = count;
 
for (i = 0; i < count; i++)
@@ -576,7 +570,8 @@ dmabuf_imp_to_refs(struct gntdev_dmabuf_priv *priv, struct 
device *dev,
struct dma_buf *dma_buf;
struct dma_buf_attachment *attach;
struct sg_table *sgt;
-   struct sg_page_iter sg_iter;
+   struct sg_dma_page_iter sg_iter;
+   unsigned long *gfns;
int i;
 
dma_buf = dma_buf_get(fd);
@@ -624,26 +619,31 @@ dmabuf_imp_to_refs(struct gntdev_dmabuf_priv *priv, 
struct device *dev,
 
gntdev_dmabuf->u.imp.sgt = sgt;
 
-   /* Now convert sgt to array of pages and check for page validity. */
+   gfns = kcalloc(count, sizeof(*gfns), GFP_KERNEL);
+   if (!gfns) {
+   ret = ERR_PTR(-ENOMEM);
+   goto fail_unmap;
+   }
+
+   /*
+* Now convert sgt to array of gfns without accessing underlying pages.
+* It is not allowed to access the underlying struct page of an sg table
+* exported by DMA-buf, but since we deal with special Xen dma device 
here
+* (not a normal physical one) look at the dma addresses in the sg table
+* and then calculate gfns directly from them.
+*/
i = 0;
-   for_each_sgtable_page(sgt, &sg_iter, 0) {
-   struct page *page = sg_page_iter_page(&sg_iter);
-   /*
-* Check if page is valid: this can happen if we are given
-* a page from VRAM or other resources which are not backed
-* by a struct page.
-*/
-   if (!pfn_valid(page_to_pfn(page))) {
-   ret = ERR_PTR(-EINVAL);
-   goto fail_unmap;
-   }
+   for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
+   dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
+   unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
 
-   gntdev_dmabuf->pages[i++] = page;
+   gfns[i++] = pfn_to_gfn(pfn);
}
 
-   ret = ERR_PTR(dmabuf_imp_grant_foreign_access(gntdev_dmabuf->pages,
+   ret = ERR_PTR(dmabuf_imp_grant_foreign_access(gfns,
  gntdev_dmab
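The frame-number arithmetic in the final hunk above can be sketched in isolation. This is a minimal sketch assuming 4 KiB pages, with the dma_to_phys/bfn_to_pfn/pfn_to_gfn translations modelled as identities (which holds only for this illustration, not for a real Xen PV domain):

```python
PAGE_SHIFT = 12  # assuming 4 KiB pages

def xen_pfn_down(addr: int) -> int:
    """XEN_PFN_DOWN(): round an address down to its page frame number."""
    return addr >> PAGE_SHIFT

def gfn_from_dma(addr: int) -> int:
    # mirrors: gfns[i++] = pfn_to_gfn(bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr))))
    # with the three translation helpers modelled as identity maps here
    return xen_pfn_down(addr)

# any address within the same 4 KiB frame yields the same gfn
assert gfn_from_dma(0x12345678) == 0x12345
assert gfn_from_dma(0x12345FFF) == 0x12345
```

The point of the patch is that only this address arithmetic is needed, so the underlying struct page of the exported sg table is never touched.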

Re: Community Process Group - Proposal

2024-01-24 Thread David Woodhouse
On Tue, 2024-01-23 at 08:34 +0100, Jan Beulich wrote:
> On 22.01.2024 23:47, Stefano Stabellini wrote:
> > On Mon, 22 Jan 2024, Jan Beulich wrote:
> > > What definitely needs clarifying is what "review" is: Are R-b tags
> > > counted, or is it the number of replies sent commenting on patches?
> > 
> > Yes, I think this needs to be clarified. I would say Reviewed-by tags.
> 
> Which may end up unfair. It's not uncommon for one person to do a lot
> of review on a patch, and for someone else to then ack the final
> version that goes in. In the end this is then no different from basing
> the decision on simple numbers, without regard to actual (potentially
> heavily differing) effort behind each individual instance.


Perhaps that isn't such a bad thing. It would mean that a hypothetical
reviewer who only ever nitpicks and is holding up progress of a patch
which others find acceptable does not get 'credit' for reviewing that
particular patch.




Re: Community Process Group - Proposal

2024-01-24 Thread Kelly Choi

Hi Jan,
Please see my reply to the points below

Many thanks,
Kelly Choi

Community Manager
Xen Project


On Mon, Jan 22, 2024 at 10:32 AM Jan Beulich  wrote:

> On 17.01.2024 18:10, Kelly Choi wrote:
> > Hi everyone,
> >
> > I've spent a bit of time talking to various individuals and the advisory
> > board about setting up a new Community Process Group.
> >
> > A survey was recently conducted to identify how the community as a whole
> > feels about a certain situation. It was not intended to ban specific
> > wording or create a policy to do so, but more to give context that the
> > community has a wide range of ideas, and individuals may agree and
> disagree
> > a lot more frequently than we as individuals might think. It helps us
> > understand that as a community there are many situations where it is not
> > clear. As such, the results indicated a very even split among the
> > community, which indicates a larger problem as we may not always come to
> > agreement.
> >
> > There is obvious frustration with how certain matters are handled, as
> some
> > members may want the project to move faster, whereas others like to take
> a
> > cautious approach. Given we are an open source project, differences in
> > opinion are likely to happen and what we don’t want to do is cause
> further
> > frustration.
> >
> > *This is where I would like to propose the idea of a ‘Community Process
> > Group’.*
> >
> > A CPG can help as they will be able to advise members on similar issues
> > regarding community processes or appeals and decide on the best way
> > forward. To help with this process, I have consulted with various
> > individuals including some committers and conduct team members.
> >
> > *What is a CPG’s purpose?*
> > In the first instance, we would expect an informal vote to resolve most
> > disagreements. However, there will be certain circumstances where the
> > outcome is not always agreed on.
> >
> > A CPG will be your second point of call, where you can escalate matters
> > quickly for a democratic solution.
>
> Between informal voting and this "second point of call", where does
> formal voting go?
>

Formal voting among committers will still exist in matters related to
technical decisions. The CPG handles disputes around processes and areas
where there are disagreements that can cause frustration around the
community, but don't necessarily break the code of conduct. For example,
naming conventions or how to move forward in cases of non-actionable
feedback.

>
> > Their purpose is to resolve process issues and informal vote appeals to
> > avoid matters going to a formal vote, but also act as a representative on
> > behalf of others in the community on future matters.
> >
> > For example:
> >
> >- Naming conventions
> >- Whether feedback requesting changes on a patch series is acceptable
> >- How to move forward in case of non-actionable feedback to a patch
> >series
> >- How to move forward when a contributor or reviewer has not been
> >responsive
> >- Policy questions not related to the code of conduct
> >
> > *What is their role and responsibility?*
> >
> > The CPG has the authority to propose a resolution to situations where
> there
> > are disagreements, that don’t involve large technical decisions. Their
> > decision proposed should be accepted as final since members will have
> > discussed the best steps and come to a consensus vote.
> >
> > The CPG does not aim to replace the committers' authority or the advisory
> > board but instead holds the authority to make decisions that are in the
> > best interest of the community in relation to processes. Committers still
> > hold the power should there be a formal escalation regarding technical
> > decisions, but this would be extremely rare. Advisory Board members hold
> > the final power regarding project and business-wide decisions.
>
> Nevertheless it doesn't become clear to me how adding yet another authority
> besides the committers will actually help.
>

We've seen cases where committers can disagree, and rather than calling a
formal vote each time among those that disagree, the wider community can
help provide clarity on the best steps forward.
The individuals in this CPG aim to help represent wider collective feedback
and prevent disagreements from coming to a standstill, which we have seen
in the past.

>
> > *How are members selected?*
> > The CPG will be composed of 5 randomly selected members in total.
> > An odd number has been purposely selected to avoid an impasse during
> > decisions.
> >
> > The criteria:
> > Individual members must be active contributors and are willing to help
> the
> > community succeed. As such they must be a part of the following groups:
> >
> >- Committers
> >- Active Maintainers: maintainers with >= 20 reviews in the last 2
> >releases
> >- Active Contributors: contributors with >= 10 commits in the last 2
> >releases
>
> I'm afraid I can't leave this uncommented,

Re: [PATCH] hw/i386/pc_piix: Make piix_intx_routing_notifier_xen() more device independent

2024-01-24 Thread Philippe Mathieu-Daudé

On 14/1/24 13:25, Michael S. Tsirkin wrote:

On Sun, Jan 14, 2024 at 12:21:59PM +, Bernhard Beschow wrote:



Am 7. Januar 2024 23:16:23 UTC schrieb Bernhard Beschow :

This is a follow-up on commit 89965db43cce "hw/isa/piix3: Avoid Xen-specific
variant of piix3_write_config()" which introduced
piix_intx_routing_notifier_xen(). This function is implemented in board code but
accesses the PCI configuration space of the PIIX ISA function to determine the
PCI interrupt routes. Avoid this by reusing pci_device_route_intx_to_irq() which
makes piix_intx_routing_notifier_xen() more device-agnostic.

One remaining improvement would be making piix_intx_routing_notifier_xen()
agnostic towards the number of PCI interrupt routes and moving it to xen-hvm.
This might be useful for possible Q35 Xen efforts but remains a future exercise
for now.

Signed-off-by: Bernhard Beschow 


Hi Michael,

could you tag this, too? Or do we need another R-b?

Best regards,
Bernhard


tagged, too.


FYI merged as commit ebd92d6de3.



[linux-linus test] 184443: tolerable FAIL - PUSHED

2024-01-24 Thread osstest service owner
flight 184443 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184443/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 184432
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 184432
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 184432
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 184432
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 184432
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 184432
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 184432
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 184432
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass

version targeted for testing:
 linux    615d300648869c774bd1fe54b4627bb0c20faed4
baseline version:
 linux    7ed2632ec7d72e926b9e8bcc9ad1bb0cd37274bf

Last test of basis   184432  2024-01-23 05:00:38 Z1 days
Testing same since   184443  2024-01-24 01:43:31 Z0 days1 attempts


People who touched revisions under test:
  Geert Uytterhoeven 
  Linus Torvalds 
  Petr Pavlu 
  Steven Rostedt (Google) 
  Tom Zanussi 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  

Re: [PATCH 1/2] arm: vgic: Add the ability to trigger MSIs from the Hypervisor

2024-01-24 Thread Julien Grall

Hi,

On 14/01/2024 10:01, Mykyta Poturai wrote:

Add the vgic_its_trigger_msi() function to the vgic interface. This
function allows injecting MSIs from the Hypervisor into the guest,
which is useful for userspace PCI backend drivers.

Signed-off-by: Mykyta Poturai 
---
  xen/arch/arm/include/asm/vgic.h | 11 +++
  xen/arch/arm/vgic-v3-its.c  | 35 +
  2 files changed, 46 insertions(+)

diff --git a/xen/arch/arm/include/asm/vgic.h b/xen/arch/arm/include/asm/vgic.h
index 922779ce14..4695743848 100644
--- a/xen/arch/arm/include/asm/vgic.h
+++ b/xen/arch/arm/include/asm/vgic.h
@@ -317,6 +317,17 @@ extern bool vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int ir
  extern void vgic_check_inflight_irqs_pending(struct domain *d, struct vcpu *v,
   unsigned int rank, uint32_t r);
  
+#ifdef CONFIG_HAS_ITS

+int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,
+u32 devid, u32 eventid);
+#else
+static inline int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,
+u32 devid, u32 eventid)
+{
+return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAS_ITS */
+
  #endif /* !CONFIG_NEW_VGIC */
  
  /*** Common VGIC functions used by Xen arch code /

diff --git a/xen/arch/arm/vgic-v3-its.c b/xen/arch/arm/vgic-v3-its.c
index 70b5aeb822..683a378f6e 100644
--- a/xen/arch/arm/vgic-v3-its.c
+++ b/xen/arch/arm/vgic-v3-its.c
@@ -1484,6 +1484,41 @@ static int vgic_v3_its_init_virtual(struct domain *d, paddr_t guest_addr,
  return 0;
  }
  
+int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,

+u32 devid, u32 eventid)
+{
+struct vcpu *vcpu;
+struct virt_its *pos, *temp;
+struct virt_its *its = NULL;
+uint32_t vlpi;
+bool ret;
+
+list_for_each_entry_safe( pos, temp, &d->arch.vgic.vits_list, vits_list )
+{
+if ( pos->doorbell_address == doorbell_address )
+{
+its = pos;
+break;
+}
+}
+
+if ( !its )
+return -EINVAL;
+
+spin_lock(&its->its_lock);
+ret = read_itte(its, devid, eventid, &vcpu, &vlpi);
+spin_unlock(&its->its_lock);
+if ( !ret )
+return -1;
+
+if ( vlpi == INVALID_LPI )
+return -1;


Reading the code, I think you want to use get_event_pending_irq(). This 
will return the associated pending_irq. Then you can...



+
+vgic_vcpu_inject_lpi(its->d, vlpi);


... open-code vgic_vcpu_inject_lpi(). This would avoid access the guest 
memory (done by read_itte) and reduce to just one lookup (today you are 
doing two: read_itte() and irq_to_pending()).


Cheers,

--
Julien Grall



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Jan Beulich
On 24.01.2024 10:24, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
 On 15.01.2024 20:43, Stewart Hildebrand wrote:
> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  {
>  int irq, pirq, ret;
>  
> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));

 If either lock is sufficient to hold here, ...

> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>  
>  case MAP_PIRQ_TYPE_MSI:
>  case MAP_PIRQ_TYPE_MULTI_MSI:
> +pcidevs_lock();
>  ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> +pcidevs_unlock();
>  break;

>>>
>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>> lock to keep the path unmodified, as there's no need to hold the
>>> per-domain rwlock.
>>
>> Yet why would we prefer to acquire a global lock when a per-domain one
>> suffices?
> 
> I was hoping to introduce fewer changes, especially if they are not
> strictly required, as it's less risk.  I'm always quite worried about
> locking changes.

In which case more description / code commenting is needed. The pattern
of the assertions looks dangerous. Even if (as you say in a later reply)
this is only temporary, we all know how long "temporary" can be. It
might even be advisable to introduce a helper construct. That would then
be where the respective code comment goes, clarifying that the _sole_
purpose is to guarantee a pdev to not go away; no further protection of
any state, no other critical region aspects (assuming of course all of
this is actually true for all of the affected use sites).

Jan



Re: [PATCH v3 30/34] xen/riscv: add minimal stuff to processor.h to build full Xen

2024-01-24 Thread Jan Beulich
On 24.01.2024 11:12, Oleksii wrote:
> On Wed, 2024-01-24 at 09:19 +0100, Jan Beulich wrote:
>> On 23.01.2024 18:08, Oleksii wrote:
>>> On Tue, 2024-01-23 at 12:39 +0100, Jan Beulich wrote:
 On 22.12.2023 16:13, Oleksii Kurochko wrote:
> @@ -53,6 +56,18 @@ struct cpu_user_regs
>  unsigned long pregs;
>  };
>  
> +/* TODO: need to implement */
> +#define cpu_to_core(cpu)   (0)
> +#define cpu_to_socket(cpu) (0)
> +
> +static inline void cpu_relax(void)
> +{
> +    /* Encoding of the pause instruction */
> +    __asm__ __volatile__ ( ".insn 0x10F" );

 binutils 2.40 knows "pause" - why use .insn then?
>>> I thought that for this instruction it is needed to have extension
>>> ZIHINTPAUSE ( according to Linux Kernel source code [1] ) and to
>>> cover
>>> older version.
>>
>> Well, of course you'll need to enable the extension then for gas. But
>> as long as you use the insn unconditionally, that's all fine and
>> natural. Another thing would be if you meant to also run on systems
>> not supporting the extension: Then the above use of .insn would need
>> to become conditional anyway.
> Then it makes sense to use "pause". 
> Let's assume that for now we are running only on systems which support
> the extension until we won't face compilation issue for some system.

Gives me the impression that you still don't properly separate the two
aspects: One is what systems Xen is to run on, and other is what's
needed to make Xen build properly. The first needs documenting (and
ideally at some point actually enforcing), while the latter may require
e.g. compiler command line option adjustments.

Jan



Re: [PATCH 0/2] Add support for MSI injection on Arm

2024-01-24 Thread Julien Grall

Hi,

On 14/01/2024 10:01, Mykyta Poturai wrote:

This series adds the base support for MSI injection on Arm. This is
needed to streamline virtio-pci interrupt triggering.

With these patches, MSIs can be triggered in guests by issuing the new
DM op, inject_msi2. This op is similar to inject_msi, but it allows
specifying the source id of the MSI.

We chose the approach of adding a new DM op instead of using the pad
field of inject_msi because we have no clear way of distinguishing
between set and unset pad fields. The new implementation also adds a flags
field to clearly specify whether the SBDF is set.

Patches were tested on QEMU with QEMU virtio-pci backends, with
virtio-pci patches and patches for ITS support for DomUs applied.


I have seen some virtio-pci patches on the ML. But nothing about ITS 
support in the DomUs (at least in recent months).


It would feel odd to me to even consider merging this series before the 
ITS support for DomUs is merged. So what's the plan for it?


Cheers,

--
Julien Grall



Re: [PATCH v3 16/34] xen/lib: introduce generic find next bit operations

2024-01-24 Thread Jan Beulich
On 24.01.2024 10:34, Oleksii wrote:
> On Tue, 2024-01-23 at 14:37 +0100, Jan Beulich wrote:
>> On 23.01.2024 13:34, Oleksii wrote:
>>> On Tue, 2024-01-23 at 12:14 +0100, Jan Beulich wrote:
 On 22.12.2023 16:13, Oleksii Kurochko wrote:
> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -47,6 +47,9 @@ config ARCH_MAP_DOMAIN_PAGE
>  config GENERIC_BUG_FRAME
>   bool
>  
> +config GENERIC_FIND_NEXT_BIT
> + bool

 There's no need for this, as ...

> --- a/xen/lib/Makefile
> +++ b/xen/lib/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_X86) += x86/
>  lib-y += bsearch.o
>  lib-y += ctors.o
>  lib-y += ctype.o
> +lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find-next-bit.o

 ... you're moving this to lib/. Or have you encountered any issue
 with building this uniformly, and you forgot to mention this in
 the description?
>>> I didn't check. My intention was to provide an opportunity to check
>>> whether an architecture wants to use the generic version or not.
>>> Otherwise, I expected that we would have multiple definitions of the
>>> function.
>>>
>>> But considering that they are all defined under #ifdef...#endif we can
>>> remove the declaration of the config GENERIC_FIND_NEXT_BIT.
>>
>> What #ifdef / #endif would matter here? What's in lib/ is intended to
>> be generic anyway. And what is in the resulting lib.a won't be used by
>> an arch if it has an arch-specific implementation.
> If what is implemented in lib.a won't be used by an arch that has an
> arch-specific implementation then, for sure, I have to drop
> CONFIG_GENERIC_FIND_NEXT_BIT.
> But I don't really understand: if lib.a is linked with Xen, shouldn't
> there be an issue when some arch implements a find-next-bit function,
> since we would then have multiple definitions ( one in lib.a and one
> arch specific )? Probably, I have to look at how it is done.

You're aware how linking works? Objects are pulled out of archives only
if there's no other definition for a to-be-resolved symbol provided by
a particular object in the archive.

>> Problems could arise if an arch had an inline function colliding with
>> the out-of-line one. But that's about the old case where I could see a
>> need to make the building of one of the objects conditional. And
>> you'll note that within this Makefile there are pretty few
>> conditionals.
> Could you please clarify What does it mean "out-of-line" ?

"not inline"

> --- /dev/null
> +++ b/xen/lib/find-next-bit.c
> [...]

 I was going to ask that you convince git to actually present a proper
 diff, to make visible what changes. But other than the description says
 you don't really move the file, you copy it. Judging from further titles
 there's also nowhere you'd make Arm actually use this now generic code.
>>> I wanted to do it separately, outside this patch series to simplify
>>> review and not have Arm specific changes in RISC-V patch series.
>>
>> Then do it the other way around: Make a separate _prereq_ change
>> truly moving the file.
> So this one patch should be split in two? One which moves
> find-next-bit.c from Arm to xen/lib, and a second where
> xen/lib/Makefile is updated.

No, that would break the Arm build. I suggested breaking out this
patch from the series, and then doing what the description says:
Actually move the file. I don't think I suggested splitting the
patch. Even the breaking out of the series was only because you
said "I wanted to do it separately, outside this patch series".

Jan



Re: [PATCH 1/2] arm: vgic: Add the ability to trigger MSIs from the Hypervisor

2024-01-24 Thread Julien Grall

Hi,

On 24/01/2024 01:17, Stefano Stabellini wrote:

On Sun, 14 Jan 2024, Mykyta Poturai wrote:

Add the vgic_its_trigger_msi() function to the vgic interface. This
function allows injecting MSIs from the Hypervisor into the guest,
which is useful for userspace PCI backend drivers.

Signed-off-by: Mykyta Poturai 
---
  xen/arch/arm/include/asm/vgic.h | 11 +++
  xen/arch/arm/vgic-v3-its.c  | 35 +
  2 files changed, 46 insertions(+)

diff --git a/xen/arch/arm/include/asm/vgic.h b/xen/arch/arm/include/asm/vgic.h
index 922779ce14..4695743848 100644
--- a/xen/arch/arm/include/asm/vgic.h
+++ b/xen/arch/arm/include/asm/vgic.h
@@ -317,6 +317,17 @@ extern bool vgic_migrate_irq(struct vcpu *old, struct vcpu *new, unsigned int ir
  extern void vgic_check_inflight_irqs_pending(struct domain *d, struct vcpu *v,
   unsigned int rank, uint32_t r);
  
+#ifdef CONFIG_HAS_ITS

+int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,
+u32 devid, u32 eventid);
+#else
+static inline int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,
+u32 devid, u32 eventid)
+{
+return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAS_ITS */
+
  #endif /* !CONFIG_NEW_VGIC */
  
  /*** Common VGIC functions used by Xen arch code /

diff --git a/xen/arch/arm/vgic-v3-its.c b/xen/arch/arm/vgic-v3-its.c
index 70b5aeb822..683a378f6e 100644
--- a/xen/arch/arm/vgic-v3-its.c
+++ b/xen/arch/arm/vgic-v3-its.c
@@ -1484,6 +1484,41 @@ static int vgic_v3_its_init_virtual(struct domain *d, paddr_t guest_addr,
  return 0;
  }
  
+int vgic_its_trigger_msi(struct domain *d, paddr_t doorbell_address,

+u32 devid, u32 eventid)
+{
+struct vcpu *vcpu;
+struct virt_its *pos, *temp;
+struct virt_its *its = NULL;
+uint32_t vlpi;
+bool ret;
+
+list_for_each_entry_safe( pos, temp, &d->arch.vgic.vits_list, vits_list )
+{
+if ( pos->doorbell_address == doorbell_address )
+{
+its = pos;
+break;
+}
+}
+
+if ( !its )
+return -EINVAL;
+
+spin_lock(&its->its_lock);
+ret = read_itte(its, devid, eventid, &vcpu, &vlpi);
+spin_unlock(&its->its_lock);
+if ( !ret )
+return -1;
+
+if ( vlpi == INVALID_LPI )
+return -1;


We need a better error code, maybe EINVAL or ENOENT ?


EINVAL tends to be overloaded. I would prefer ENOENT.

Cheers,

--
Julien Grall



Re: [PATCH v3 15/34] xen/riscv: introduce atomic.h

2024-01-24 Thread Jan Beulich
On 24.01.2024 10:23, Oleksii wrote:
> On Tue, 2024-01-23 at 14:30 +0100, Jan Beulich wrote:
>> On 23.01.2024 13:24, Oleksii wrote:
>>> On Tue, 2024-01-23 at 11:30 +0100, Jan Beulich wrote:
 On 23.01.2024 11:21, Oleksii wrote:
> On Mon, 2024-01-22 at 17:56 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:12, Oleksii Kurochko wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/riscv/include/asm/fence.h
>>> @@ -0,0 +1,13 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>>> +#ifndef _ASM_RISCV_FENCE_H
>>> +#define _ASM_RISCV_FENCE_H
>>> +
>>> +#ifdef CONFIG_SMP
>>> +#define RISCV_ACQUIRE_BARRIER  "\tfence r , rw\n"
>>> +#define RISCV_RELEASE_BARRIER  "\tfence rw,  w\n"
>>> +#else
>>> +#define RISCV_ACQUIRE_BARRIER
>>> +#define RISCV_RELEASE_BARRIER
>>> +#endif
>>
>> Do you really care about the !SMP case? On x86 at least we stopped
>> special-casing that configuration many years ago (the few cases where
>> for typically build reasons it matters, using CONFIG_NR_CPUS is
>> sufficient). If you care about it, there needs to be somewhere you
>> actually #define CONFIG_SMP.
> Can't we use CONFIG_NR_CPUS instead of CONFIG_SMP?

 You can. Question is whether there's a point in doing so. Do you
 expect people to actually want to run Xen on single-CPU systems?
 They're generally not overly well suited for virtualization ...
>>> Just to clarify.
>>>
>>> Do you mean a physically single-CPU system?
>>> Then I don't expect to run Xen on such systems and it is not
>>> necessary to define *_BARRIER in this case. Should we add a build
>>> error notification that we don't support single-CPU systems in this
>>> header?
>>>
>>> If you mean we have, say, 4 CPUs and only 1 CPU is currently
>>> supported by Xen, then it still makes sense.
>>
>> No, that's still not what I mean. The question is: Is it useful for
>> you to _special case_ the NR_CPUS=1 case? Or is it instead simpler to
>> handle NR_CPUS=1 the same as NR_CPUS>1 (accepting less than ideal
>> performance, on the basis that in reality nobody's expected to use
>> such in production anyway)?
> NR_CPUS=1 sometimes is useful for debugging. At least, at the start I
> used that several times, but ITBO I don't remember when I used that
> case after SMP support was added and context_switch() was fixed.

And "sometimes is useful for debugging" warrants introducing special
cases? I've not suggested disallowing that configuration. I'm merely
asking whether it isn't easier to have the barriers there at all
times. Just like on x86 we now leave the LOCK prefixes in place at
all times.

> Probably, I misunderstand the real idea of NR_CPUS. Does NR_CPUS
> represent a number of logical CPUs which can be different from physical
> amount of CPU?

No.

Jan



[ovmf test] 184448: all pass - PUSHED

2024-01-24 Thread osstest service owner
flight 184448 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184448/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 97e1ef87300cdf01f5b21cd4c5ee1d8df6ae1f39
baseline version:
 ovmf d24187a81f724fc2af4f739ad92a9b158c9254df

Last test of basis   184446  2024-01-24 06:41:04 Z0 days
Testing same since   184448  2024-01-24 08:42:37 Z0 days1 attempts


People who touched revisions under test:
  de...@edk2.groups.io 
  Jeff Brasen 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   d24187a81f..97e1ef8730  97e1ef87300cdf01f5b21cd4c5ee1d8df6ae1f39 -> xen-tested-master



Re: [BUG]i2c_hid_acpi broken with 4.17.2 on Framework Laptop 13 AMD

2024-01-24 Thread Mario Limonciello

On 1/22/2024 11:06, Sébastien Chaumat wrote:



On Wed, Jan 17, 2024 at 03:20, Mario Limonciello  wrote:


On 1/16/2024 10:18, Jan Beulich wrote:
 > On 16.01.2024 16:52, Sébastien Chaumat wrote:
 >> On Tue, Jan 2, 2024 at 21:23, Sébastien Chaumat  wrote:
 >>
 >>>
 >>>   output of gpioinfo
 
  kernel alone :
 
           line   5: unnamed         input active-low
consumer=interrupt
           line  84: unnamed         input active-low
consumer=interrupt
 
  xen:
 
           line   5: unnamed         input active-low
           line  84: unnamed         input active-low
 
  xen with skipping IRQ7 double init :
 
           line   5: unnamed         input active-low
consumer=interrupt
           line  84: unnamed         input active-low
 
 
  So definitely progressing.
 
 >>>
 >>> Checking /sys/kernel/irq/7
 >>>
 >>> kernel alone :
 >>>   actions: pinctrl_amd
 >>>   chip_name: IR-IO-APIC
 >>>   hwirq: 7
 >>>   name: fasteoi
 >>>   per_cpu_count: 0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0
 >>>   type: level
 >>>   wakeup: enabled
 >>>
 >>> xen skipping IRQ7 double init :
 >>>
 >>> actions: pinctrl_amd
 >>>   chip_name: xen-pirq
 >>>   hwirq:
 >>>   name: ioapic-level
 >>>   per_cpu_count: 0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
 >>>   type: edge
 >>>   wakeup: disabled
 >>>
 >>> So the skip of IRQ7 in pci_xen_initial_domain() sets the
correct handler
 >>>   (IIUC xen uses the ioapic-level and handles the eoi
separately), but not
 >>> the correct type (still edge).
 >>> I guess this may explains the results above.
 >>>
 >>>
 >>   Mario (in CC) patched the pinctrl_amd to flush pending
interrupt before
 >> starting the driver for the GPIO.
 >>
 >> This helped in  the sense of there's no more pending interrupt
on IRQ7
 >> (whatever the handler is, level or edge) but then the touchpad
is not
 >> detected by i2c-hid.
 >>
 >> Is there any work in progress related to the incorrect IRQ
configuration ?
 >
 > I'm not aware of any. As per my recollection it's still not entirely
 > clear where in the kernel things go astray. And to be honest I don't
 > feel comfortable trying to half-blindly address this, e.g. by trying
 > to circumvent / defer the early setting up of the low 16 IRQs.
 >
 > Jan

Shot in the dark - but could this be a problem where PCAT_COMPAT from
the MADT is being ignored causing PIC not to be setup properly in the
Xen case?

See https://lore.kernel.org/all/875y2u5s8g.ffs@tglx/ for some context.

At least we know that no MADT override is found by xen for INT7 as no 
INT_SRC_OVR message is printed.


Do we expect one, @Mario Limonciello?


No; the INT_SRC_OVR you'll see on Framework 13 AMD is on IRQ 2 and IRQ 9.




RE: [PATCH] xen: Drop out of coroutine context xen_invalidate_map_cache_entry

2024-01-24 Thread Peng Fan
> Subject: Re: [PATCH] xen: Drop out of coroutine context
> xen_invalidate_map_cache_entry
> 
> On Tue, 16 Jan 2024, Peng Fan (OSS) wrote:
> > From: Peng Fan 
> >
> > xen_invalidate_map_cache_entry is not expected to run in a coroutine.
> > Without this, there is a crash:
> >
> > signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
> > threadid=) at pthread_kill.c:78
> > at /usr/src/debug/glibc/2.38+git-r0/sysdeps/posix/raise.c:26
> > fmt=0x9e1ca8a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> > assertion=assertion@entry=0xe0d25740 "!qemu_in_coroutine()",
> > file=file@entry=0xe0d301a8 "../qemu-xen-dir-remote/block/graph-
> lock.c", line=line@entry=260,
> > function=function@entry=0xe0e522c0 <__PRETTY_FUNCTION__.3>
> "bdrv_graph_rdlock_main_loop") at assert.c:92
> > assertion=assertion@entry=0xe0d25740 "!qemu_in_coroutine()",
> > file=file@entry=0xe0d301a8 "../qemu-xen-dir-remote/block/graph-
> lock.c", line=line@entry=260,
> > function=function@entry=0xe0e522c0 <__PRETTY_FUNCTION__.3>
> "bdrv_graph_rdlock_main_loop") at assert.c:101
> > at ../qemu-xen-dir-remote/block/graph-lock.c:260
> > at /home/Freenix/work/sw-stash/xen/upstream/tools/qemu-xen-dir-
> remote/include/block/graph-lock.h:259
> > host=host@entry=0x742c8000, size=size@entry=2097152)
> > at ../qemu-xen-dir-remote/block/io.c:3362
> > host=0x742c8000, size=2097152)
> > at ../qemu-xen-dir-remote/block/block-backend.c:2859
> > host=, size=, max_size=)
> > at ../qemu-xen-dir-remote/block/block-ram-registrar.c:33
> > size=2097152, max_size=2097152)
> > at ../qemu-xen-dir-remote/hw/core/numa.c:883
> > buffer=buffer@entry=0x743c5000 "")
> > at ../qemu-xen-dir-remote/hw/xen/xen-mapcache.c:475
> > buffer=buffer@entry=0x743c5000 "")
> > at ../qemu-xen-dir-remote/hw/xen/xen-mapcache.c:487
> > as=as@entry=0xe1ca3ae8 ,
> buffer=0x743c5000,
> > len=, is_write=is_write@entry=true,
> > access_len=access_len@entry=32768)
> > at ../qemu-xen-dir-remote/system/physmem.c:3199
> > dir=DMA_DIRECTION_FROM_DEVICE, len=,
> > buffer=, as=0xe1ca3ae8 )
> > at /home/Freenix/work/sw-stash/xen/upstream/tools/qemu-xen-dir-
> remote/include/sysemu/dma.h:236
> > elem=elem@entry=0xf620aa30, len=len@entry=32769)
> > at ../qemu-xen-dir-remote/hw/virtio/virtio.c:758
> > elem=elem@entry=0xf620aa30, len=len@entry=32769,
> idx=idx@entry=0)
> > at ../qemu-xen-dir-remote/hw/virtio/virtio.c:919
> > elem=elem@entry=0xf620aa30, len=32769)
> > at ../qemu-xen-dir-remote/hw/virtio/virtio.c:994
> > req=req@entry=0xf620aa30, status=status@entry=0 '\000')
> > at ../qemu-xen-dir-remote/hw/block/virtio-blk.c:67
> > ret=0) at ../qemu-xen-dir-remote/hw/block/virtio-blk.c:136
> > at ../qemu-xen-dir-remote/block/block-backend.c:1559
> > --Type  for more, q to quit, c to continue without paging--
> > at ../qemu-xen-dir-remote/block/block-backend.c:1614
> > i1=) at ../qemu-xen-dir-remote/util/coroutine-
> ucontext.c:177
> > at ../sysdeps/unix/sysv/linux/aarch64/setcontext.S:123
> >
> > Signed-off-by: Peng Fan 
> 
> Hi Peng! Many thanks for the patch and for the investigation!
> 
> Only one minor question below
> 
> 
> > ---
> >  hw/xen/xen-mapcache.c | 31 +--
> >  1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c index
> > f7d974677d..4e1bb665ee 100644
> > --- a/hw/xen/xen-mapcache.c
> > +++ b/hw/xen/xen-mapcache.c
> > @@ -481,11 +481,38 @@ static void
> xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
> >  g_free(entry);
> >  }
> >
> > -void xen_invalidate_map_cache_entry(uint8_t *buffer)
> > +typedef struct XenMapCacheData {
> > +Coroutine *co;
> > +uint8_t *buffer;
> > +int ret;
> 
> Do we need int ret? It doesn't look like we are using it.

Good catch, it is not needed, I will drop it in V2.

Thanks,
Peng.

> 
> 
> > +} XenMapCacheData;
> > +
> > +static void xen_invalidate_map_cache_entry_bh(void *opaque)
> >  {
> > +XenMapCacheData *data = opaque;
> > +
> >  mapcache_lock();
> > -xen_invalidate_map_cache_entry_unlocked(buffer);
> > +xen_invalidate_map_cache_entry_unlocked(data->buffer);
> >  mapcache_unlock();
> > +
> > +aio_co_wake(data->co);
> > +}
> > +
> > +void coroutine_mixed_fn xen_invalidate_map_cache_entry(uint8_t
> > +*buffer) {
> > +if (qemu_in_coroutine()) {
> > +XenMapCacheData data = {
> > +.co = qemu_coroutine_self(),
> > +.buffer = buffer,
> > +};
> > +aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
> > +xen_invalidate_map_cache_entry_bh, &data);
> > +qemu_coroutine_yield();
> > +} else {
> > +mapcache_lock();
> > +xen_invalidate_map_cache_entry_unlocked(buffer);
> > + 

Re: [PATCH] x86/p2m-pt: fix off by one in entry check assert

2024-01-24 Thread George Dunlap
On Wed, Jan 24, 2024 at 8:45 AM Roger Pau Monne  wrote:
>
> The MMIO RO rangeset overlap check is bogus: the rangeset is inclusive so the
> passed end mfn should be the last mfn to be mapped (not last + 1).
>
> Fixes: 6fa1755644d0 ('amd/npt/shadow: replace assert that prevents creating 
> 2M/1G MMIO entries')
> Signed-off-by: Roger Pau Monné 

Reviewed-by: George Dunlap 
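The off-by-one is easy to see with a toy model of inclusive-rangeset semantics (numbers made up for illustration; Xen's rangesets store [start, end] inclusively, so the end passed to the overlap check must be the last mfn mapped, not last + 1):

```python
def overlaps(s1, e1, s2, e2):
    """Inclusive-range overlap check, mirroring rangeset semantics."""
    return s1 <= e2 and s2 <= e1

ro_start, ro_end = 0x200, 0x2ff   # a read-only MMIO range, inclusive

mfn, nr = 0x1fc, 4                # mapping covers mfns 0x1fc..0x1ff

buggy = overlaps(ro_start, ro_end, mfn, mfn + nr)      # end off by one
fixed = overlaps(ro_start, ro_end, mfn, mfn + nr - 1)  # last mfn mapped

print(buggy, fixed)   # the buggy form reports a false overlap
```

Passing `mfn + nr` makes the check include one mfn past the mapping, so an adjacent (non-overlapping) read-only range trips the assert.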



Re: [PATCH v3 30/34] xen/riscv: add minimal stuff to processor.h to build full Xen

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 09:19 +0100, Jan Beulich wrote:
> On 23.01.2024 18:08, Oleksii wrote:
> > On Tue, 2024-01-23 at 12:39 +0100, Jan Beulich wrote:
> > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > Signed-off-by: Oleksii Kurochko 
> > > > ---
> > > > Changes in V3:
> > > >  - Update the commit message
> > > 
> > > ??? (again)
> > The same as with the previous one: asm/processor.h was changed to
> > processor.h.
> > 
> > > 
> > > > @@ -53,6 +56,18 @@ struct cpu_user_regs
> > > >  unsigned long pregs;
> > > >  };
> > > >  
> > > > +/* TODO: need to implement */
> > > > +#define cpu_to_core(cpu)   (0)
> > > > +#define cpu_to_socket(cpu) (0)
> > > > +
> > > > +static inline void cpu_relax(void)
> > > > +{
> > > > +    /* Encoding of the pause instruction */
> > > > +    __asm__ __volatile__ ( ".insn 0x10F" );
> > > 
> > > binutils 2.40 knows "pause" - why use .insn then?
> > I thought that this instruction requires the ZIHINTPAUSE extension
> > (according to Linux kernel source code [1]), and the .insn form was
> > meant to cover older binutils versions.
> 
> Well, of course you'll need to enable the extension then for gas. But
> as long as you use the insn unconditionally, that's all fine and
> natural. Another thing would be if you meant to also run on systems
> not supporting the extension: Then the above use of .insn would need
> to become conditional anyway.
Then it makes sense to use "pause".
Let's assume for now that we are running only on systems which support
the extension, until we face a compilation issue on some system.

> 
> > > > +    barrier();
> > > 
> > > Why?
> > Just to be aligned with the Linux kernel implementation from which
> > this function was taken.
> 
> Hmm, looking more closely we have an (open-coded) barrier even on
> x86.
> So I suppose it's really wanted (to keep the compiler from moving
> memory accesses around this construct), but then you may want to
> consider using
> 
>     __asm__ __volatile__ ( "pause" ::: "memory" );
> 
> here. First and foremost because at least in the general case two
> separate asm()s aren't the same as one combined one (volatile ones
> are more restricted, but I'd always err on the safe side, even if
> just to avoid giving bad examples which later on may be taken as a
> basis for deriving other code).
It makes sense; I'll update the inline assembly code.

Thanks.

~ Oleksii




Re: [PATCH v3 29/34] xen/riscv: add minimal stuff to page.h to build full Xen

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 09:09 +0100, Jan Beulich wrote:
> On 23.01.2024 17:54, Oleksii wrote:
> > On Tue, 2024-01-23 at 12:36 +0100, Jan Beulich wrote:
> > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > Signed-off-by: Oleksii Kurochko 
> > > > Acked-by: Jan Beulich 
> > > > ---
> > > > Changes in V3:
> > > >  - update the commit message
> > > 
> > > Once again I find this puzzling, considering there's no commit
> > > message
> > > at all.
> > By that I meant that asm/page.h was changed to page.h
> 
> Oh. Can you say "title" or "subject" when you mean that, and "commit
> message" (or "description") only when you actually mean the
> description?
Sure, I'll stick to the proposed terminology next time.
Thanks.

~ Oleksii



Re: [PATCH v3 27/34] xen/riscv: define an address of frame table

2024-01-24 Thread Oleksii
On Wed, 2024-01-24 at 09:07 +0100, Jan Beulich wrote:
> On 23.01.2024 17:50, Oleksii wrote:
> > On Tue, 2024-01-23 at 12:32 +0100, Jan Beulich wrote:
> > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > @@ -22,25 +30,56 @@
> > > >   *
> > > >   * It means that:
> > > >   *   top VA bits are simply ignored for the purpose of translating to PA.
> > > > +#endif
> > > >   *
> > > > - * ===========================================================================
> > > > - *    Start addr    |   End addr    |  Size  | Slot       | area description
> > > > - * ===========================================================================
> > > > - * C080 |   |1016 MB | L2 511     | Unused
> > > > - * C060 |  C080 |  2 MB  | L2 511     | Fixmap
> > > > - * C020 |  C060 |  4 MB  | L2 511     | FDT
> > > > - * C000 |  C020 |  2 MB  | L2 511     | Xen
> > > > - * ...  |  1 GB  | L2 510     | Unused
> > > > - * 0032 |  007F8000 | 309 GB | L2 200-509 | Direct map
> > > > - * ...  |  1 GB  | L2 199     | Unused
> > > > - * 0031 |  0031C000 |  3 GB  | L2 196-198 | Frametable
> > > > - * ...  |  1 GB  | L2 195     | Unused
> > > > - * 00308000 |  0030C000 |  1 GB  | L2 194     | VMAP
> > > > - * ...  | 194 GB | L2 0 - 193 | Unused
> > > > - * ===========================================================================
> > > > + *   SATP_MODE_SV32   | SATP_MODE_SV39   | SATP_MODE_SV48   | SATP_MODE_SV57
> > > > + * ==================|==================|==================|==================
> > > > + * BA0 | FFE0 | C000 | FF80 |
> > > > + * BA1 | 1900 | 0032 | 6400 | 00C8
> > > > + * BA2 | 1880 | 0031 | 6200 | 00C4
> > > > + * BA3 | 1840 | 00308000 | 6100 | 00C2
> > > >   *
> > > > -#endif
> > > > + * ===========================================================================
> > > > + * Start addr |   End addr      |  Size  | Root PT slot | Area description
> > > > + * ===========================================================================
> > > > + * BA0 + 0x80 |     |1016 MB |     511      | Unused
> > > > + * BA0 + 0x40 |  BA0 + 0x80 |  2 MB  |     511      | Fixmap
> > > > + * BA0 + 0x20 |  BA0 + 0x40 |  4 MB  |     511      | FDT
> > > > + * BA0    |  BA0 + 0x20 |  2 MB  |     511      | Xen
> > > > + * ...  |  1 GB  |     510      | Unused
> > > > + * BA1 + 0x00 |  BA1 + 0x4D8000 | 309 GB |   200-509    | Direct map
> > > 
> > > This definitely can't be right for SV32. Others may be problematic,
> > > too, like ...
> > > 
> > > > + * ...  |  1 GB  |     199      | Unused
> > > > + * BA2 + 0x00 |  BA2 + 0xC000   |  3 GB  |   196-198    | Frametable
> > > 
> > > ... this one. Otoh I'd expect both to potentially be much larger in
> > > SV48 and SV57 modes.
> > Regarding Sv32, it looks to me that only BA0 and the End addr in the
> > first line are incorrect, as the address size is 32 bits.
> > 
> > Regarding other modes, yes, the Size column should be changed. Also,
> > the size of the frametable should be recalculated.
> > 
> > Do we really need size column?
> > 
> > Wouldn't it be enough only have PT slot number?
> 
> Perhaps.
> 
> > Would it be better to have separate table for each mode?
> 
> Don't know.
Then I'll play around different ways to display memory layout.

> 
> > > > +#define VPN_BITS    (9)
> > > 
> > > This need to move ...
> > > 
> > > > +#define HYP_PT_ROOT_LEVEL (CONFIG_PAGING_LEVELS - 1)
> > > > +
> > > > +#ifdef CONFIG_RISCV_64
> > > 
> > > ... here, I think, for not being applicable to SV32?
> > You are right, it is not applicable to Sv32. In the case of Sv32, it
> > should be 10.
> > But I am not sure that it is correct to only move this definition,
> > as RISCV-64 can also use Sv32. So it looks like VPN_BITS should be
> > guarded by "#if RV_STAGE1_MODE == Sv32".
> 
> Can it? The spec talk

Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Roger Pau Monné
On Wed, Jan 24, 2024 at 09:56:42AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:23, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> --- a/xen/arch/x86/hvm/vmsi.c
> >>> +++ b/xen/arch/x86/hvm/vmsi.c
> >>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq 
> >>> *pirq, uint64_t gtable)
> >>>  struct msixtbl_entry *entry, *new_entry;
> >>>  int r = -EINVAL;
> >>>  
> >>> -ASSERT(pcidevs_locked());
> >>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>  ASSERT(rw_is_write_locked(&d->event_lock));
> >>>  
> >>>  if ( !msixtbl_initialised(d) )
> >>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct 
> >>> pirq *pirq)
> >>>  struct pci_dev *pdev;
> >>>  struct msixtbl_entry *entry;
> >>>  
> >>> -ASSERT(pcidevs_locked());
> >>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>  ASSERT(rw_is_write_locked(&d->event_lock));
> >>
> >> I was hoping to just ack this patch, but the two changes above look
> >> questionable to me: How can it be that holding _either_ lock is okay?
> >> It's not obvious in this context that consumers have to hold both
> >> locks now. In fact consumers looks to be the callers of
> >> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
> >> against themselves or against one another are avoided by holding
> >> d->event_lock.
> > 
> > The reason for the change here is that msixtbl_pt_{un,}register() gets
> > called by pt_irq_{create,destroy}_bind(), which is in turn called by
> > vPCI code (pcidevs_locked()) that has been switched to not take the
> > pcidevs lock anymore, and hence the ASSERT would trigger.
> 
> I understand this is the motivation for the change, but that doesn't
> (alone) render the construct above sensible / correct.

But we agreed that for the purpose of the device not going anyway,
either the pcidevs or the per-domain pci_lock should be held, both are
valid for the purpose, and hence functions have adjusted to take that
into account.

So your concern is about the pcidevs lock being used here not just for
preventing the device from being removed, but also for protecting MSI
related state?

> >> My only guess then for the original need of holding pcidevs_lock is
> >> the use of msi_desc->dev, with the desire for the device to not go
> >> away. Yet the description doesn't talk about interactions of the per-
> >> domain PCI lock with that one at all; it all circles around the
> >> domain'd vPCI lock.
> > 
> > I do agree that it looks like the original intention of holding
> > pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> > not sure it's possible for the device to go away while the domain
> > event_lock is hold, as device removal would need to take that same
> > lock in order to destroy the irq_desc.
> 
> Yes, that matches an observation of mine as well. If we can simplify
> (rather then complicate) locking, I'd prefer if we did. May need to
> be a separate (prereq) patch, though.

Hm, yes, that might be an option, and doing a pre-patch that removes
the need to have pcidevs locked here would avoid what seem to be
controversial changes.

> >> Feels like I'm missing something that's obvious to everyone else.
> >> Or maybe this part of the patch is actually unrelated, and should be
> >> split off (with its own [proper] justification)? Or wouldn't it then
> >> be better to also change the other paths leading here to acquire the
> >> per-domain PCI lock?
> > 
> > Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> > vpci_msi_arch_enable()... are switched in this patch to use the
> > per-domain pci_lock instead of pcidevs lock.
> 
> Hence my question: Can't we consolidate to all paths only using the
> per-domain lock? That would make these odd-looking assertions become
> normal-looking again.

Hm, I think that's more work than originally planned, as the initial
plan was to use both locks during an interim period in order to avoid
doing a full sweeping change to switch to the per-domain one.

Regards, Roger.



Re: [PATCH v3 16/34] xen/lib: introduce generic find next bit operations

2024-01-24 Thread Oleksii
On Tue, 2024-01-23 at 14:37 +0100, Jan Beulich wrote:
> On 23.01.2024 13:34, Oleksii wrote:
> > On Tue, 2024-01-23 at 12:14 +0100, Jan Beulich wrote:
> > > On 22.12.2023 16:13, Oleksii Kurochko wrote:
> > > > --- a/xen/common/Kconfig
> > > > +++ b/xen/common/Kconfig
> > > > @@ -47,6 +47,9 @@ config ARCH_MAP_DOMAIN_PAGE
> > > >  config GENERIC_BUG_FRAME
> > > >     bool
> > > >  
> > > > +config GENERIC_FIND_NEXT_BIT
> > > > +   bool
> > > 
> > > There's no need for this, as ...
> > > 
> > > > --- a/xen/lib/Makefile
> > > > +++ b/xen/lib/Makefile
> > > > @@ -3,6 +3,7 @@ obj-$(CONFIG_X86) += x86/
> > > >  lib-y += bsearch.o
> > > >  lib-y += ctors.o
> > > >  lib-y += ctype.o
> > > > +lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find-next-bit.o
> > > 
> > > ... you're moving this to lib/. Or have you encountered any issue
> > > with building this uniformly, and you forgot to mention this in
> > > the description?
> > I didn't check. My intention was to provide an opportunity to check
> > whether an architecture wants to use the generic version or not.
> > Otherwise, I expected that we would have multiple definitions of the
> > function.
> > 
> > But considering that they are all defined under #ifdef...#endif we
> > can
> > remove the declaration of the config GENERIC_FIND_NEXT_BIT.
> 
> What #ifdef / #endif would matter here? What's in lib/ is intended to
> be generic anyway. And what is in the resulting lib.a won't be used
> by an arch if it has an arch-specific implementation.
If what is implemented in lib.a won't be used by an arch that has an
arch-specific implementation, then, for sure, I have to drop
CONFIG_GENERIC_FIND_NEXT_BIT.
But I don't really understand: if lib.a is linked with Xen, wouldn't
there be an issue when some arch implements a find-next-bit function,
as we would then have multiple definitions (one in lib.a and one
arch-specific)? Probably, I have to look at how it is done.

> Problems could arise if an arch had an inline function colliding with
> the out-of-line one. But that's about the old case where I could see
> a need to make the building of one of the objects conditional. And
> you'll note that within this Makefile there are pretty few
> conditionals.
Could you please clarify what "out-of-line" means?
> 
> > > > --- /dev/null
> > > > +++ b/xen/lib/find-next-bit.c
> > > > [...]
> > > 
> > > I was going to ask that you convince git to actually present a
> > > proper
> > > diff, to make visible what changes. But other than the
> > > description
> > > says
> > > you don't really move the file, you copy it. Judging from further
> > > titles
> > > there's also nowhere you'd make Arm actually use this now generic
> > > code.
> > I wanted to do it separately, outside this patch series, to simplify
> > review and not have Arm-specific changes in the RISC-V patch series.
> 
> Then do it the other way around: Make a separate _prereq_ change
> truly
> moving the file.
So this one patch should be split in two? One which moves
find-next-bit.c from Arm to xen/lib, and a second where
xen/lib/Makefile is updated.

> 
> > Regarding a proper diff: you would like me to make git show that the
> > file was copied from Arm and is not a newly created file. Do I
> > understand you correctly?
> 
> Not quite, I think. Git has move detection (and we've seen that in
> action in other patches of yours). So when truly moving a file, what
> (if anything) is changed is easily visible.
I think I am still a little bit confused, but I think the answer to my
question above can clarify that.

~ Oleksii



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Roger Pau Monné
On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:07, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int 
> >>> index, int *pirq_p,
> >>>  {
> >>>  int irq, pirq, ret;
> >>>  
> >>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>
> >> If either lock is sufficient to hold here, ...
> >>
> >>> --- a/xen/arch/x86/physdev.c
> >>> +++ b/xen/arch/x86/physdev.c
> >>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int 
> >>> *index, int *pirq_p,
> >>>  
> >>>  case MAP_PIRQ_TYPE_MSI:
> >>>  case MAP_PIRQ_TYPE_MULTI_MSI:
> >>> +pcidevs_lock();
> >>>  ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>> +pcidevs_unlock();
> >>>  break;
> >>
> > 
> > IIRC (Stewart can further comment) this is done holding the pcidevs
> > lock to keep the path unmodified, as there's no need to hold the
> > per-domain rwlock.
> 
> Yet why would we prefer to acquire a global lock when a per-domain one
> suffices?

I was hoping to introduce fewer changes, especially if they are not
strictly required, as it's less risk.  I'm always quite wary of
locking changes.

Thanks, Roger.



Re: [PATCH v3 15/34] xen/riscv: introduce atomic.h

2024-01-24 Thread Oleksii
On Tue, 2024-01-23 at 14:30 +0100, Jan Beulich wrote:
> On 23.01.2024 13:24, Oleksii wrote:
> > On Tue, 2024-01-23 at 11:30 +0100, Jan Beulich wrote:
> > > On 23.01.2024 11:21, Oleksii wrote:
> > > > On Mon, 2024-01-22 at 17:56 +0100, Jan Beulich wrote:
> > > > > On 22.12.2023 16:12, Oleksii Kurochko wrote:
> > > > > > --- /dev/null
> > > > > > +++ b/xen/arch/riscv/include/asm/fence.h
> > > > > > @@ -0,0 +1,13 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > > > > > +#ifndef _ASM_RISCV_FENCE_H
> > > > > > +#define _ASM_RISCV_FENCE_H
> > > > > > +
> > > > > > +#ifdef CONFIG_SMP
> > > > > > +#define RISCV_ACQUIRE_BARRIER  "\tfence r , rw\n"
> > > > > > +#define RISCV_RELEASE_BARRIER  "\tfence rw,  w\n"
> > > > > > +#else
> > > > > > +#define RISCV_ACQUIRE_BARRIER
> > > > > > +#define RISCV_RELEASE_BARRIER
> > > > > > +#endif
> > > > > 
> > > > > Do you really care about the !SMP case? On x86 at least we
> > > > > stopped
> > > > > special-
> > > > > casing that configuration many years ago (the few cases where
> > > > > for
> > > > > typically
> > > > > build reasons it matters, using CONFIG_NR_CPUS is
> > > > > sufficient). If
> > > > > you
> > > > > care
> > > > > about it, there needs to be somewhere you actually #define
> > > > > CONFIG_SMP.
> > > > Can't we use CONFIG_NR_CPUS instead of CONFIG_SMP?
> > > 
> > > You can. Question is whether there's a point in doing so. Do you
> > > expect people to actually want to run Xen on single-CPU systems?
> > > They're generally not overly well suited for virtualization ...
> > Just to clarify.
> > 
> > Do you mean a physically single-CPU system?
> > Then I don't expect to run Xen on such systems and it is not
> > necessary to define *_BARRIER in this case. Should we add a build
> > error notification that we don't support single-CPU systems in this
> > header?
> > 
> > If you mean we have, say, 4 CPUs and only 1 CPU is currently
> > supported by Xen, then it still makes sense.
> 
> No, that's still not what I mean. The question is: Is it useful for
> you
> to _special case_ the NR_CPUS=1 case? Or is it instead simpler to
> handle
> NR_CPUS=1 the same as NR_CPUS>1 (accepting less than ideal
> performance,
> on the basis that in reality nobody's expected to use such in
> production
> anyway)?
NR_CPUS=1 is sometimes useful for debugging. At least, at the start I
used that several times, but to be honest I don't remember using that
case after SMP support was added and context_switch() was fixed.

Probably I misunderstand the real idea of NR_CPUS. Does NR_CPUS
represent the number of logical CPUs, which can differ from the
physical number of CPUs?
If yes, then what I wrote above was about physical CPUs, and in the
context of logical CPUs I don't need a special case for NR_CPUS=1.

~ Oleksii




Re: [PATCH 2/3] x86/entry: Make #PF/NMI/INT0x82 more amenable to livepatching

2024-01-24 Thread Roger Pau Monné
On Tue, Jan 23, 2024 at 02:43:15PM +0100, Jan Beulich wrote:
> On 23.01.2024 14:37, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 10:22:10AM +0100, Jan Beulich wrote:
> >> On 22.01.2024 19:17, Andrew Cooper wrote:
> >>> It is bad form to have inter-function fallthrough.  It only functions 
> >>> right
> >>> now because alignment padding bytes are NOPs.
> >>
> >> But that's a requirement anyway in executable sections.
> > 
> > Really?  I was under the impression we wanted to replace the padding
> > nops with rets maybe, or even poison the padding with int3 or ud2.
> 
> Well, that would be a decision of ours. Which then imo can't be described as
> "only functions right now because ..." The assembler can't[1] use other than
> NOPs by default, as it can't know whether fall-through is intended.

So it's not a strict requirement of ELF that padding is done using
nops, it's just the default decision of the assembler because it
doesn't know better.

Thanks, Roger.



Re: [PATCH 3/3] x86/entry: Make intra-funciton symbols properly local

2024-01-24 Thread Roger Pau Monné
On Mon, Jan 22, 2024 at 06:17:14PM +, Andrew Cooper wrote:
> Each of these symbols are local to their main function.  By not having them
> globally visible, livepatch's binary diffing logic can reason about the
> functions properly.
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Wei Liu 
> CC: Konrad Rzeszutek Wilk 
> CC: Ross Lagerwall 
> 
> sysenter_eflags_saved() is an open question.  This does need to be globally
> visible, and I think any livepatch modifying sysenter_entry() will need a
> manual adjustment to do_debug() to update the reference there.

Hm, yes, this stuff is indeed problematic.  There's also the usage of
sysenter_entry in do_debug(), which will also need adjusting to
account for the position of the payload text replacement, as there's
no explicit guarantee the payload will be loaded at an address
greater than the current sysenter_entry position.

> ---
>  xen/arch/x86/x86_64/compat/entry.S | 20 ++--
>  xen/arch/x86/x86_64/entry.S| 22 +++---
>  2 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/xen/arch/x86/x86_64/compat/entry.S 
> b/xen/arch/x86/x86_64/compat/entry.S
> index 4fbd89cea1a9..1e9e0455eaf3 100644
> --- a/xen/arch/x86/x86_64/compat/entry.S
> +++ b/xen/arch/x86/x86_64/compat/entry.S
> @@ -41,7 +41,7 @@ FUNC(compat_test_all_events)
>  shll  $IRQSTAT_shift,%eax
>  leaq  irq_stat+IRQSTAT_softirq_pending(%rip),%rcx
>  cmpl  $0,(%rcx,%rax,1)
> -jne   compat_process_softirqs
> +jne   .L_compat_process_softirqs
>  
>  /* Inject exception if pending. */
>  lea   VCPU_trap_bounce(%rbx), %rdx
> @@ -49,11 +49,11 @@ FUNC(compat_test_all_events)
>  jnz   .Lcompat_process_trapbounce
>  
>  cmpb  $0, VCPU_mce_pending(%rbx)
> -jne   compat_process_mce
> +jne   .L_compat_process_mce
>  .Lcompat_test_guest_nmi:
>  cmpb  $0, VCPU_nmi_pending(%rbx)
> -jne   compat_process_nmi
> -compat_test_guest_events:
> +jne   .L_compat_process_nmi
> +.L_compat_test_guest_events:
>  movq  VCPU_vcpu_info(%rbx),%rax
>  movzwl COMPAT_VCPUINFO_upcall_pending(%rax),%eax
>  decl  %eax
> @@ -71,7 +71,7 @@ compat_test_guest_events:
>  jmp   compat_test_all_events
>  
>  /* %rbx: struct vcpu */
> -LABEL_LOCAL(compat_process_softirqs)
> +LABEL_LOCAL(.L_compat_process_softirqs)
>  sti
>  call  do_softirq
>  jmp   compat_test_all_events
> @@ -84,7 +84,7 @@ LABEL_LOCAL(.Lcompat_process_trapbounce)
>  jmp   compat_test_all_events
>  
>  /* %rbx: struct vcpu */
> -LABEL_LOCAL(compat_process_mce)
> +LABEL_LOCAL(.L_compat_process_mce)
>  testb $1 << VCPU_TRAP_MCE,VCPU_async_exception_mask(%rbx)
>  jnz   .Lcompat_test_guest_nmi
>  sti
> @@ -96,12 +96,12 @@ LABEL_LOCAL(compat_process_mce)
>  movb %dl,VCPU_mce_old_mask(%rbx)# iret hypercall
>  orl  $1 << VCPU_TRAP_MCE,%edx
>  movb %dl,VCPU_async_exception_mask(%rbx)
> -jmp   compat_process_trap
> +jmp   .L_compat_process_trap
>  
>  /* %rbx: struct vcpu */
> -LABEL_LOCAL(compat_process_nmi)
> +LABEL_LOCAL(.L_compat_process_nmi)
>  testb $1 << VCPU_TRAP_NMI,VCPU_async_exception_mask(%rbx)
> -jnz   compat_test_guest_events
> +jnz   .L_compat_test_guest_events
>  sti
>  movb  $0, VCPU_nmi_pending(%rbx)
>  call  set_guest_nmi_trapbounce
> @@ -112,7 +112,7 @@ LABEL_LOCAL(compat_process_nmi)
>  orl  $1 << VCPU_TRAP_NMI,%edx
>  movb %dl,VCPU_async_exception_mask(%rbx)
>  /* FALLTHROUGH */
> -compat_process_trap:
> +.L_compat_process_trap:
>  leaq  VCPU_trap_bounce(%rbx),%rdx
>  call  compat_create_bounce_frame
>  jmp   compat_test_all_events
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index fc64ef1fd460..130462ba0e1a 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -188,7 +188,7 @@ FUNC_LOCAL(restore_all_guest)
>  
>  RESTORE_ALL
>  testw $TRAP_syscall,4(%rsp)
> -jziret_exit_to_guest
> +jz.L_iret_exit_to_guest
>  
>  movq  24(%rsp),%r11   # RFLAGS
>  andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
> @@ -220,7 +220,7 @@ FUNC_LOCAL(restore_all_guest)
>  LABEL_LOCAL(.Lrestore_rcx_iret_exit_to_guest)
>  movq  8(%rsp), %rcx   # RIP
>  /* No special register assumptions. */
> -iret_exit_to_guest:
> +.L_iret_exit_to_guest:
>  andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
>  orl   $X86_EFLAGS_IF,24(%rsp)
>  addq  $8,%rsp
> @@ -432,10 +432,10 @@ UNLIKELY_END(msi_check)
>  cmove %rdi, %rdx
>  
>  test  %rdx, %rdx
> -jzint80_slow_path
> +jz.L_int80_slow_path
>  #else
>  test  %rdi, %rdi
> -jzint80_slow_path
> +j

Re: [PATCH v3 13/34] xen/riscv: introduce cmpxchg.h

2024-01-24 Thread Oleksii
On Tue, 2024-01-23 at 14:27 +0100, Jan Beulich wrote:
> On 23.01.2024 13:18, Oleksii wrote:
> > On Tue, 2024-01-23 at 11:28 +0100, Jan Beulich wrote:
> > > On 23.01.2024 11:15, Oleksii wrote:
> > > > On Mon, 2024-01-22 at 17:27 +0100, Jan Beulich wrote:
> > > > > On 22.12.2023 16:12, Oleksii Kurochko wrote:
> > > > > > +static inline unsigned long __xchg(volatile void *ptr,
> > > > > > unsigned
> > > > > > long x, int size)
> > > > > > +{
> > > > > > +    switch (size) {
> > > > > > +    case 1:
> > > > > > +    return __cmpxchg_case_1(ptr, (uint32_t)-1, x);
> > > > > > +    case 2:
> > > > > > +    return __cmpxchg_case_2(ptr, (uint32_t)-1, x);
> > > > > 
> > > > > How are these going to work? You'll compare against ~0, and
> > > > > if
> > > > > the
> > > > > value
> > > > > in memory isn't ~0, memory won't be updated; you will only
> > > > > (correctly)
> > > > > return the value found in memory.
> > > > > 
> > > > > Or wait - looking at __cmpxchg_case_{1,2}() far further down,
> > > > > you
> > > > > ignore
> > > > > "old" there. Which apparently means they'll work for the use
> > > > > here,
> > > > > but
> > > > > not for the use in __cmpxchg().
> > > > Yes, the trick is that old is ignored and is read in
> > > > __emulate_cmpxchg_case1_2() before __cmpxchg_case_4 is called:
> > > >     do
> > > >     {
> > > >         read_val = read_func(aligned_ptr);
> > > >         swapped_new = read_val & ~mask;
> > > >         swapped_new |= masked_new;
> > > >         ret = cmpxchg_func(aligned_ptr, read_val, swapped_new);
> > > >     } while ( ret != read_val );
> > > > read_val here is 'old'.
> > > > 
> > > > But now I am not 100% sure that it is correct for __cmpxchg...
> > > 
> > > It just can't be correct - you can't ignore "old" there. I think
> > > you
> > > want simple cmpxchg primitives, which xchg then uses in a loop
> > > (while
> > > cmpxchg uses them plainly).
> > But xchg doesn't require 'old' value, so it should be ignored in
> > some
> > way by cmpxchg.
> 
> Well, no. If you have only cmpxchg, I think your only choice is - as
> said - to read the old value and then loop over cmpxchg until that
> succeeds. Not really different from other operations which need
> emulating using cmpxchg.
Then it looks like the main error in __emulate_cmpxchg_case1_2 is that
I read the value each time, so read_val = read_func(aligned_ptr); 
should be before the do {...} while(). Also, it would be better to
rename it to old_val or just old.

> 
> > > > > > +static always_inline unsigned short __cmpxchg_case_2(volatile uint32_t *ptr,
> > > > > > +                                                      uint32_t old,
> > > > > > +                                                      uint32_t new)
> > > > > > +{
> > > > > > +    (void) old;
> > > > > > +
> > > > > > +    if (((unsigned long)ptr & 3) == 3)
> > > > > > +    {
> > > > > > +#ifdef CONFIG_64BIT
> > > > > > +        return __emulate_cmpxchg_case1_2((uint64_t *)ptr, new,
> > > > > > +                                         readq, __cmpxchg_case_8, 0xU);
> > > > > 
> > > > > What if ((unsigned long)ptr & 7) == 7 (which is a sub-case of
> > > > > what the if() above checks for)? Isn't it more reasonable to
> > > > > require aligned 16-bit quantities here? Or if mis-aligned
> > > > > addresses are okay, you could as well emulate using
> > > > > __cmpxchg_case_4().
> > > > Yes, it will be more reasonable. I'll use IS_ALIGNED instead.
> > > 
> > > Not sure I get your use of "instead" here correctly. There's more
> > > to change here than just the if() condition.
> > I meant something like:
> > 
> > if ( IS_ALIGNED(ptr, 16) )
> >     __emulate_cmpxchg_case1_2(...);
> > else
> >     assert_failed("ptr isn't aligned\n");
> 
> Except that you'd better not use assert_failed() directly anywhere,
> and the above is easier as
> 
>     ASSERT(IS_ALIGNED(ptr, 16));
>     __emulate_cmpxchg_case1_2(...);
> 
> anyway (leaving aside that I guess you mean 2, not 16).
Yeah, it should be 2. Thanks.

~ Oleksii


Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Jan Beulich
On 23.01.2024 16:23, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> --- a/xen/arch/x86/hvm/vmsi.c
>>> +++ b/xen/arch/x86/hvm/vmsi.c
>>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq 
>>> *pirq, uint64_t gtable)
>>>  struct msixtbl_entry *entry, *new_entry;
>>>  int r = -EINVAL;
>>>  
>>> -ASSERT(pcidevs_locked());
>>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>  ASSERT(rw_is_write_locked(&d->event_lock));
>>>  
>>>  if ( !msixtbl_initialised(d) )
>>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct 
>>> pirq *pirq)
>>>  struct pci_dev *pdev;
>>>  struct msixtbl_entry *entry;
>>>  
>>> -ASSERT(pcidevs_locked());
>>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>  ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> I was hoping to just ack this patch, but the two changes above look
>> questionable to me: How can it be that holding _either_ lock is okay?
>> It's not obvious in this context that consumers have to hold both
>> locks now. In fact consumers looks to be the callers of
>> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
>> against themselves or against one another are avoided by holding
>> d->event_lock.
> 
> The reason for the change here is that msixtbl_pt_{un,}register() gets
> called by pt_irq_{create,destroy}_bind(), which is in turn called by
> vPCI code (pcidevs_locked()) that has been switched to not take the
> pcidevs lock anymore, and hence the ASSERT would trigger.

I understand this is the motivation for the change, but that doesn't
(alone) render the construct above sensible / correct.

>> My only guess then for the original need of holding pcidevs_lock is
>> the use of msi_desc->dev, with the desire for the device to not go
>> away. Yet the description doesn't talk about interactions of the
>> per-domain PCI lock with that one at all; it all circles around the
>> domain's vPCI lock.
> 
> I do agree that it looks like the original intention of holding
> pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> not sure it's possible for the device to go away while the domain
> event_lock is held, as device removal would need to take that same
> lock in order to destroy the irq_desc.

Yes, that matches an observation of mine as well. If we can simplify
(rather than complicate) locking, I'd prefer if we did. May need to
be a separate (prereq) patch, though.

>> Feels like I'm missing something that's obvious to everyone else.
>> Or maybe this part of the patch is actually unrelated, and should be
>> split off (with its own [proper] justification)? Or wouldn't it then
>> be better to also change the other paths leading here to acquire the
>> per-domain PCI lock?
> 
> Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> vpci_msi_arch_enable()... are switched in this patch to use the
> per-domain pci_lock instead of pcidevs lock.

Hence my question: Can't we consolidate to all paths only using the
per-domain lock? That would make these odd-looking assertions become
normal-looking again.

Jan



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Jan Beulich
On 24.01.2024 06:07, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, 
>>> struct msi_info *msi,
>>>  {
>>>  struct msi_desc *old_desc;
>>>  
>>> -ASSERT(pcidevs_locked());
>>> -
>>>  if ( !pdev || !pdev->msix )
>>>  return -ENODEV;
>>>  
>>> +ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>> +
>>>  if ( msi->entry_nr >= pdev->msix->nr_entries )
>>>  return -EINVAL;
>>
>> Further looking at this - is dereferencing pdev actually safe without holding
>> the global lock?
> 
> Are you referring to the new placement of the ASSERT, which opens up the 
> possibility that pdev could be dereferenced and the function return before 
> the ASSERT? If that is what you mean, I see your point. The ASSERT was placed 
> there simply because we wanted to check that pdev != NULL first. See prior 
> discussion at [1]. Hmm.. How about splitting the pdev-checking condition? 
> E.g.:
> 
> if ( !pdev )
> return -ENODEV;
> 
> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> 
> if ( !pdev->msix )
> return -ENODEV;

Yes, this is the particular arrangement I would have expected (at least
partly based on the guessing of the purpose of holding those locks).

Jan

> [1] 
> https://lore.kernel.org/xen-devel/85a52f8d-d6db-4478-92b1-2b6305769...@amd.com/




[xen-unstable test] 184434: tolerable FAIL - PUSHED

2024-01-24 Thread osstest service owner
flight 184434 xen-unstable real [real]
flight 184447 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/184434/
http://logs.test-lab.xenproject.org/osstest/logs/184447/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass 
in 184447-retest
 test-amd64-amd64-libvirt-vhd 19 guest-start/debian.repeat fail pass in 
184447-retest

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds 20 guest-localmigrate/x10   fail REGR. vs. 184429

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-vhd  13 guest-start  fail  like 184426
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 184429
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 184429
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 184429
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 184429
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 184429
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 184429
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 184429
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 184429
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 184429
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 184429
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 184429
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 184429
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  f67bddf3bccd99

Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Jan Beulich
On 23.01.2024 16:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int 
>>> index, int *pirq_p,
>>>  {
>>>  int irq, pirq, ret;
>>>  
>>> +ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int 
>>> *index, int *pirq_p,
>>>  
>>>  case MAP_PIRQ_TYPE_MSI:
>>>  case MAP_PIRQ_TYPE_MULTI_MSI:
>>> +pcidevs_lock();
>>>  ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> +pcidevs_unlock();
>>>  break;
>>
> 
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.

Yet why would we prefer to acquire a global lock when a per-domain one
suffices?

Jan



[PATCH] x86/p2m-pt: fix off by one in entry check assert

2024-01-24 Thread Roger Pau Monne
The MMIO RO rangeset overlap check is bogus: the rangeset is inclusive so the
passed end mfn should be the last mfn to be mapped (not last + 1).

Fixes: 6fa1755644d0 ('amd/npt/shadow: replace assert that prevents creating 
2M/1G MMIO entries')
Signed-off-by: Roger Pau Monné 
---
 xen/arch/x86/mm/p2m-pt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 640a11f5647f..348130d0dd3b 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -552,7 +552,7 @@ static void check_entry(mfn_t mfn, p2m_type_t new, 
p2m_type_t old,
 if ( new == p2m_mmio_direct )
 ASSERT(!mfn_eq(mfn, INVALID_MFN) &&
!rangeset_overlaps_range(mmio_ro_ranges, mfn_x(mfn),
-mfn_x(mfn) + (1UL << order)));
+mfn_x(mfn) + (1UL << order) - 1));
 else if ( p2m_allows_invalid_mfn(new) || new == p2m_invalid ||
   new == p2m_mmio_dm )
 ASSERT(mfn_valid(mfn) || mfn_eq(mfn, INVALID_MFN));
-- 
2.43.0




Re: [XEN PATCH 1/3] xen: introduce static_assert_unreachable()

2024-01-24 Thread Jan Beulich
On 24.01.2024 09:20, Federico Serafini wrote:
> On 22/01/24 15:02, Jan Beulich wrote:
>> On 22.01.2024 14:48, Federico Serafini wrote:
>>> --- a/xen/include/xen/compiler.h
>>> +++ b/xen/include/xen/compiler.h
>>> @@ -64,6 +64,14 @@
>>>   # define fallthroughdo {} while (0)  /* fallthrough */
>>>   #endif
>>>   
>>> +/*
>>> + * Add the following macro to check that a program point is considered
>>> + * unreachable by the static analysis performed by the compiler,
>>> + * even at optimization level -O0.
>>> + */
>>> +#define static_assert_unreachable() \
>>> +asm(".error \"unreachable program point reached\"");
>>
>> Did you check the diagnostic that results when this check actually
>> triggers? I expect it will be not really obvious from the message
>> you introduce where the issue actually is. I expect we will want
>> to use some of __FILE__ / __LINE__ / __FUNCTION__ to actually
>> supply such context.
> 
> The assembler error comes with file and line information, for example:
> 
> ./arch/x86/include/asm/uaccess.h: Assembler messages:
> ./arch/x86/include/asm/uaccess.h:377: Error: unreachable program point 
> reached
> 
> At line 377 there is a use of get_unsafe_size() where I passed a wrong
> size as argument. Is that clear enough?

Hmm, yes, looks like it might be sufficient. Mentioning __FUNCTION__ may
still add value, though. But I see now that __FILE__ / __LINE__ are
already covered for.

> What do you think about modifying the message as follows:
> "unreachability static assert failed."

I'm okay-ish with the original text, and I like it slightly better than
this new suggestion. If we want "static assert" in the output, then maybe
"static assertion failed: unreachable".

>> Also: Stray semicolon and (nit) missing blanks.
> 
> It is not clear to me where the missing blanks are.

Just like for other keywords:

asm ( ".error \"unreachable program point reached\"" )

Jan



[ovmf test] 184446: all pass - PUSHED

2024-01-24 Thread osstest service owner
flight 184446 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/184446/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf d24187a81f724fc2af4f739ad92a9b158c9254df
baseline version:
 ovmf 1063665fa5466ece0814a3e764ee3382656956a1

Last test of basis   18  2024-01-24 01:58:35 Z0 days
Testing same since   184446  2024-01-24 06:41:04 Z0 days1 attempts


People who touched revisions under test:
  Jeff Brasen 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   1063665fa5..d24187a81f  d24187a81f724fc2af4f739ad92a9b158c9254df -> 
xen-tested-master



Re: ACPI VFCT table too short on PVH dom0 (was: Re: E820 memory allocation issue on Threadripper platforms)

2024-01-24 Thread Roger Pau Monné
On Tue, Jan 23, 2024 at 08:27:18PM -0500, Patrick Plenefisch wrote:
> On Sat, Jan 20, 2024 at 8:33 PM Patrick Plenefisch 
> wrote:
> 
> >
> >
> > On Fri, Jan 19, 2024 at 6:06 AM Roger Pau Monné 
> > wrote:
> >
> >> On Fri, Jan 19, 2024 at 02:44:35AM -0500, Patrick Plenefisch wrote:
> >> > On Thu, Jan 18, 2024 at 7:41 AM Roger Pau Monné 
> >> > wrote:
> >> >
> >> > >
> >> > > From that environment (PVH dom0) can you see if you can dump the
> >> > > contents of the VFCT table?  I don't have a system with that table, so
> >> > > not sure if this will work (because iasl is unlikely to know how to
> >> > > decode it):
> >> > >
> >> > > # acpidump -n VFCT -o table.dump
> >> > > # acpixtract -a table.dump
> >> > > # iasl -d vfct.dat
> >> > > # cat vfct.dsl
> >> > >
> >> > > Would be good if you can compare the output from what you get on a PV
> >> > > dom0 or when running native Linux.
> >> > >
> >> > > I'm also adding some AMD folks, as IIRC they did some fixes to Linux
> >> > > in order to get some AMD graphics cards running on a PVH dom0, maybe
> >> > > they can provide some additional input.
> >> > >
> >> > >
> >> > Well, this is pretty weird. I'll go into more details because it may be
> >> > relevant.
> >>
> >> Wow, you have certainly gone out of the beaten path here.
> >>
> >
> > Yeah, I wasn't expecting this many issues for being an early adopter of
> > Threaderipper 7000!
> >
> >
> >>
> >> > I had been working with ASRock support ever since I got my brand
> >> > new motherboard because I couldn't see the BIOS/UEFI screens. I could
> >> boot
> >> > up, and once native linux took control amdgpu got the screens/gpu
> >> working
> >> > fine. I finally managed to be able to see the UEFI/BIOS setup screens by
> >> > patching my VBIOS: I extracted the VBIOS image of a cheap R5 430 OEM,
> >> ran
> >> > GOPupd to update the VBIOS UEFI component (GOP) from version 1.60 to
> >> 1.67.
> >> > That allowed the UEFI to actually initialize and use a screen. However,
> >> I
> >> > later realized that only 1 monitor was lighting up in the bios: my
> >> monitor
> >> > plugged into the Radeon RX 480 that was still on VBIOS GOP 1.60. It
> >> appears
> >> > the GOP was initializing the RX 480 too, despite not being flashed with
> >> the
> >> > latest itself. I am working on an email to asrock support about that.
> >> Once
> >> > I get into linux (native or PV), both monitors light up as expected.
> >> Also,
> >> > If I boot linux PVH from grub, they also work.
> >>
> >> OK, that's good, so that would be UEFI -> grub -> Xen -> Linux PVH?
> >>
> >
> > Correct. Inserting grub into the chain "fixes" the acpi tables and things
> > work correctly.
> >
> 
> Ok, I am not sure what I did the other day to get it to work, but I can't
> replicate *any* PVH success today. One driver (radeon or amdgpu) always
> complains the VFCT table is wrong, and leads to the symptoms previously
> reported.

Are you sure you are using Xen 4.18?  Some of the Xen logs you
provided did use Xen 4.17, which doesn't have the VFCT fix.

Can you please provide the `xl dmesg` for the non-working case when
using grub2?

Thanks, Roger.



Re: [PATCH v3 31/34] xen/riscv: add minimal stuff to mm.h to build full Xen

2024-01-24 Thread Jan Beulich
On 23.01.2024 18:27, Oleksii wrote:
> On Tue, 2024-01-23 at 14:03 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> +#define _PGC_extra    PG_shift(10)
>>> +#define PGC_extra PG_mask(1, 10)
>>> +
>>> +#define is_xen_heap_page(page) ((page)->count_info & PGC_xen_heap)
>>> +#define is_xen_heap_mfn(mfn) \
>>> +    (mfn_valid(mfn) && is_xen_heap_page(mfn_to_page(mfn)))
>>> +
>>> +#define is_xen_fixed_mfn(mfn)   \
>>> +    ((mfn_to_maddr(mfn) >= virt_to_maddr(&_start)) &&   \
>>> + (mfn_to_maddr(mfn) <= virt_to_maddr((vaddr_t)_end - 1)))
>>
>> Why does _start need prefixing with & and _end prefixing with a cast?
>> First and foremost both want to be consistent. And then preferably
>> with as little extra clutter as possible.
> This is how it was defined in Arm. I think both can be cast.
> I'll update that.

Judging from your present use of virt_to_maddr(&_start), I'd assume
you're fine without casts. And when casts aren't needed, they're
better avoided.

Jan



Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure

2024-01-24 Thread Roger Pau Monné
On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
> > On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, 
> >> struct msi_info *msi,
> >>  {
> >>  struct msi_desc *old_desc;
> >>  
> >> -ASSERT(pcidevs_locked());
> >> -
> >>  if ( !pdev || !pdev->msix )
> >>  return -ENODEV;
> >>  
> >> +ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> >> +
> >>  if ( msi->entry_nr >= pdev->msix->nr_entries )
> >>  return -EINVAL;
> > 
> > Further looking at this - is dereferencing pdev actually safe without 
> > holding
> > the global lock?

It is safe because either the global pcidevs lock or the per-domain
pci_lock will be held, which should prevent the device from being
removed.

> Are you referring to the new placement of the ASSERT, which opens up the 
> possibility that pdev could be dereferenced and the function return before 
> the ASSERT? If that is what you mean, I see your point. The ASSERT was placed 
> there simply because we wanted to check that pdev != NULL first. See prior 
> discussion at [1]. Hmm.. How about splitting the pdev-checking condition? 
> E.g.:
> 
> if ( !pdev )
> return -ENODEV;
> 
> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> 
> if ( !pdev->msix )
> return -ENODEV;

I'm not specially worried about the position of the assert, those are
just debug messages at the end.

One worry I have after further looking at the code, when called from
ns16550_init_postirq(), does the device have pdev->domain set?

That case would satisfy the first condition of the assert, so won't
attempt to dereference pdev->domain, but still would be good to ensure
consistency here wrt the state of pdev->domain.

Regards, Roger.



Re: [XEN PATCH 1/3] xen: introduce static_assert_unreachable()

2024-01-24 Thread Federico Serafini

On 22/01/24 15:02, Jan Beulich wrote:

On 22.01.2024 14:48, Federico Serafini wrote:

Introduce macro static_assert_unreachable() to check that a program
point is considered unreachable by the static analysis performed by the
compiler, even at optimization level -O0.


Is it really intended to limit use of this macro to cases where even
at -O0 the compiler would eliminate respective code? Note that right
now even debug builds are done with some optimization, and some of
the DCE we're relying on depends on that (iirc).


I'll remove this restriction.





--- a/xen/include/xen/compiler.h
+++ b/xen/include/xen/compiler.h
@@ -64,6 +64,14 @@
  # define fallthroughdo {} while (0)  /* fallthrough */
  #endif
  
+/*

+ * Add the following macro to check that a program point is considered
+ * unreachable by the static analysis performed by the compiler,
+ * even at optimization level -O0.
+ */
+#define static_assert_unreachable() \
+asm(".error \"unreachable program point reached\"");


Did you check the diagnostic that results when this check actually
triggers? I expect it will be not really obvious from the message
you introduce where the issue actually is. I expect we will want
to use some of __FILE__ / __LINE__ / __FUNCTION__ to actually
supply such context.


The assembler error comes with file and line information, for example:

./arch/x86/include/asm/uaccess.h: Assembler messages:
./arch/x86/include/asm/uaccess.h:377: Error: unreachable program point 
reached


At line 377 there is a use of get_unsafe_size() where I passed a wrong
size as argument. Is that clear enough?

What do you think about modifying the message as follows:
"unreachability static assert failed."




Also: Stray semicolon and (nit) missing blanks.


It is not clear to me where the missing blanks are.




Finally I wonder about case: We have ASSERT_UNREACHABLE() and it
may be indicated to use all uppercase here as well.


Ok.


--
Federico Serafini, M.Sc.

Software Engineer, BUGSENG (http://bugseng.com)



Re: [PATCH v3 30/34] xen/riscv: add minimal stuff to processor.h to build full Xen

2024-01-24 Thread Jan Beulich
On 23.01.2024 18:08, Oleksii wrote:
> On Tue, 2024-01-23 at 12:39 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> Signed-off-by: Oleksii Kurochko 
>>> ---
>>> Changes in V3:
>>>  - Update the commit message
>>
>> ??? (again)
> The same as with previous. asm/processor.h was changed to processor.h
> 
>>
>>> @@ -53,6 +56,18 @@ struct cpu_user_regs
>>>  unsigned long pregs;
>>>  };
>>>  
>>> +/* TODO: need to implement */
>>> +#define cpu_to_core(cpu)   (0)
>>> +#define cpu_to_socket(cpu) (0)
>>> +
>>> +static inline void cpu_relax(void)
>>> +{
>>> +    /* Encoding of the pause instruction */
>>> +    __asm__ __volatile__ ( ".insn 0x10F" );
>>
>> binutils 2.40 knows "pause" - why use .insn then?
> I thought that this instruction requires the ZIHINTPAUSE extension
> (according to Linux kernel source code [1]), and that .insn covers
> older binutils versions.

Well, of course you'll need to enable the extension then for gas. But
as long as you use the insn unconditionally, that's all fine and
natural. Another thing would be if you meant to also run on systems
not supporting the extension: Then the above use of .insn would need
to become conditional anyway.

>>> +    barrier();
>>
>> Why?
> Just to be aligned with the Linux kernel implementation from which this
> function was taken.

Hmm, looking more closely we have an (open-coded) barrier even on x86.
So I suppose it's really wanted (to keep the compiler from moving
memory accesses around this construct), but then you may want to
consider using

__asm__ __volatile__ ( "pause" ::: "memory" );

here. First and foremost because at least in the general case two
separate asm()s aren't the same as one combined one (volatile ones
are more restricted, but I'd always err on the safe side, even if
just to avoid giving bad examples which later on may be taken as a
basis for deriving other code).

Jan



Re: [PATCH v3 29/34] xen/riscv: add minimal stuff to page.h to build full Xen

2024-01-24 Thread Jan Beulich
On 23.01.2024 17:54, Oleksii wrote:
> On Tue, 2024-01-23 at 12:36 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> Signed-off-by: Oleksii Kurochko 
>>> Acked-by: Jan Beulich 
>>> ---
>>> Changes in V3:
>>>  - update the commit message
>>
>> Once again I find this puzzling, considering there's no commit
>> message
>> at all.
> By that I meant that asm/page.h was changed to page.h.

Oh. Can you say "title" or "subject" when you mean that, and "commit
message" (or "description") only when you actually mean the description?

Jan



Re: [PATCH v3 27/34] xen/riscv: define an address of frame table

2024-01-24 Thread Jan Beulich
On 23.01.2024 17:50, Oleksii wrote:
> On Tue, 2024-01-23 at 12:32 +0100, Jan Beulich wrote:
>> On 22.12.2023 16:13, Oleksii Kurochko wrote:
>>> @@ -22,25 +30,56 @@
>>>   *
>>>   * It means that:
>>>   *   top VA bits are simply ignored for the purpose of translating to PA.
>>> +#endif
>>>   *
>>> - * ============================================================================
>>> - *    Start addr    |   End addr    |  Size  | Slot       | area description
>>> - * ============================================================================
>>> - * C080 |   |1016 MB | L2 511     | Unused
>>> - * C060 |  C080 |  2 MB  | L2 511     | Fixmap
>>> - * C020 |  C060 |  4 MB  | L2 511     | FDT
>>> - * C000 |  C020 |  2 MB  | L2 511     | Xen
>>> - * ...  |  1 GB  | L2 510     | Unused
>>> - * 0032 |  007F8000 | 309 GB | L2 200-509 | Direct map
>>> - * ...  |  1 GB  | L2 199     | Unused
>>> - * 0031 |  0031C000 |  3 GB  | L2 196-198 | Frametable
>>> - * ...  |  1 GB  | L2 195     | Unused
>>> - * 00308000 |  0030C000 |  1 GB  | L2 194     | VMAP
>>> - * ...  | 194 GB | L2 0 - 193 | Unused
>>> - * ============================================================================
>>> + *   SATP_MODE_SV32   | SATP_MODE_SV39   | SATP_MODE_SV48   | SATP_MODE_SV57
>>> + * ==================|==================|==================|===============
>>> + * BA0 | FFE0 | C000 | FF80 |
>>> + * BA1 | 1900 | 0032 | 6400 | 00C8
>>> + * BA2 | 1880 | 0031 | 6200 | 00C4
>>> + * BA3 | 1840 | 00308000 | 6100 | 00C2
>>>   *
>>> -#endif
>>> + * ============================================================================
>>> + * Start addr |   End addr  |  Size  | Root PT slot | Area description
>>> + * ============================================================================
>>> + * BA0 + 0x80 |     |1016 MB | 511      | Unused
>>> + * BA0 + 0x40 |  BA0 + 0x80 |  2 MB  | 511      | Fixmap
>>> + * BA0 + 0x20 |  BA0 + 0x40 |  4 MB  | 511      | FDT
>>> + * BA0    |  BA0 + 0x20 |  2 MB  | 511      | Xen
>>> + * ...  |  1 GB  | 510      | Unused
>>> + * BA1 + 0x00 |  BA1 + 0x4D8000 | 309 GB |   200-509    | Direct map
>>
>> This definitely can't be right for SV32. Others may be problematic,
>> too, like ...
>>
>>> + * ...  |  1 GB  | 199      | Unused
>>> + * BA2 + 0x00 |  BA2 + 0xC000   |  3 GB  |   196-198    | Frametable
>>
>> ... this one. Otoh I'd expect both to potentially be much larger in
>> SV48 and SV57 modes.
> Regarding Sv32, it looks to me like only BA0 and the end address on the
> first line are incorrect, as the address size is 32 bits.
> 
> Regarding other modes, yes, the Size column should be changed. Also, the
> size of the frame table should be recalculated.
> 
> Do we really need size column?
> 
> Wouldn't it be enough only have PT slot number?

Perhaps.

> Would it be better to have separate table for each mode?

Don't know.

>>> +#define VPN_BITS    (9)
>>
>> This need to move ...
>>
>>> +#define HYP_PT_ROOT_LEVEL (CONFIG_PAGING_LEVELS - 1)
>>> +
>>> +#ifdef CONFIG_RISCV_64
>>
>> ... here, I think, for not being applicable to SV32?
> You are right, it is not applicable for Sv32. In case of Sv32, it
> should be 10.
> But I am not sure that it is correct only to move this definition, as
> RISCV-64 can also use Sv32. So it looks like VPN_BITS should depend on
> "#if RV_STAGE1_MODE == Sv32".

Can it? The spec talks of SXLEN=32 implying SV32, while SXLEN=64 permits
SV39, SV48, and SV57. No mention of SV32 there.

>>> +#define SLOTN_ENTRY_BITS    (HYP_PT_ROOT_LEVEL * VPN_BITS +
>>> PAGE_SHIFT)
>>> +#define SLOTN(slot) (_AT(vaddr_t, slot) <<
>>> SLOTN_ENTRY_BITS)
>>> +#define SLOTN_ENTRY_SIZE    SLOTN(1)
>>
>> Do you have any example of how/where this going to be used?
> Yes, it will be used to define DIRECTMAP_SIZE:
> #define DIRECTMAP_SIZE  (SLOTN_ENTRY_SIZE * (509-200))

How about

#define DIRECTMAP_SIZE  (SLOTN(509) - SLOTN(200))

instead?

>>> +#define XEN_VIRT_START 0xC000 /* (_AC(-1, UL) + 1 - GB(1)) */
>>
>> Won't /* -GB(1) */ do, thus allowing the line to also be padded such
>> that it matches neighboring ones in layout?
> Could you please clarify what do you mean by padded h