Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-31 Thread Jan Beulich
On 31.05.2024 09:31, Roger Pau Monné wrote:
> On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
>> On 29.05.2024 18:14, Roger Pau Monné wrote:
>>> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
>>>> On 29.05.2024 17:03, Roger Pau Monné wrote:
>>>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>>>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does 
>>>>>>> so from
>>>>>>> a cpu_hotplug_{begin,done}() region the function will still return 
>>>>>>> success,
>>>>>>> because a CPU taking the rwlock in read mode after having taken it in 
>>>>>>> write
>>>>>>> mode is allowed.  Such behavior however defeats the purpose of 
>>>>>>> get_cpu_maps(),
>>>>>>> as it should always return false when called while a CPU hot{,un}plug
>>>>>>> operation is in progress.
>>>>>>
>>>>>> I'm not sure I can agree with this. The CPU doing said operation ought 
>>>>>> to be
>>>>>> aware of what it is itself doing. And all other CPUs will get back false 
>>>>>> from
>>>>>> get_cpu_maps().
>>>>>
>>>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
>>>>> the interrupts that might be handled while that operation is in
>>>>> progress, see below for a concrete example.
>>>>>
>>>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>>>>>> as it could decide to use the shorthand even when a CPU operation is in
>>>>>>> progress.
>>>>>>
>>>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>>>>>> offline enough to not be in cpu_online_map anymore, it may well need to 
>>>>>> still
>>>>>> be the target of IPIs, and targeting it with a shorthand then is still 
>>>>>> fine.
>>>>>
>>>>> The issue is in the online path: there's a window where the CPU is
>>>>> online (and the lapic active), but cpu_online_map hasn't been updated
>>>>> yet.  A specific example would be time_calibration() being executed on
>>>>> the CPU that is running cpu_up().  That could result in a shorthand
>>>>> IPI being used, but the mask in r.cpu_calibration_map not containing
>>>>> the CPU that's being brought up online because it's not yet added to
>>>>> cpu_online_map.  Then the number of CPUs actually running
>>>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
>>>>> in r.cpu_calibration_map.
>>>>
>>>> I see, but maybe only partly. Prior to the CPU having its bit set in
>>>> cpu_online_map, can it really take interrupts already? Shouldn't it be
>>>> running with IRQs off until later, thus preventing it from making it
>>>> into the rendezvous function in the first place? But yes, I can see
>>>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
>>>> might cause problems, too.
>>>
>>> The interrupt will get set in IRR and handled when interrupts are
>>> enabled.
>>>
>>>>
>>>> Plus, with how the rendezvous function is invoked (via
>>>> on_selected_cpus() with the mask copied from cpu_online_map), the
>>>> first check in smp_call_function_interrupt() ought to prevent the
>>>> function from being called on the CPU being onlined. A problem would
>>>> arise though if the IPI arrived later and call_data was already
>>>> (partly or fully) overwritten with the next request.
>>>
>>> Yeah, there's a small window where the fields in call_data are out of
>>> sync.
>>>
>>>>>> In any event this would again affect only the CPU leading the CPU 
>>>>>> operation,
>>>>>> which should clearly know at which point(s) it is okay to send IPIs. Are 
>>>>>> we
>>>>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
>>>>>
>>>>> Yes, I've seen the time rendezvous happening while in the middle of a
>>>>> hotplug operation, and the CPU coordinating the rendezvous being the
>>>>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.

Re: [XEN PATCH v2 13/15] x86/ioreq: guard VIO_realmode_completion with CONFIG_VMX

2024-05-31 Thread Jan Beulich
On 31.05.2024 10:05, Sergiy Kibrik wrote:
> 16.05.24 15:11, Jan Beulich:
>> On 15.05.2024 11:24, Sergiy Kibrik wrote:
>>> --- a/xen/arch/x86/hvm/emulate.c
>>> +++ b/xen/arch/x86/hvm/emulate.c
>>> @@ -2667,7 +2667,9 @@ static int _hvm_emulate_one(struct hvm_emulate_ctxt 
>>> *hvmemul_ctxt,
>>>   break;
>>>   
>>>   case VIO_mmio_completion:
>>> +#ifdef CONFIG_VMX
>>>   case VIO_realmode_completion:
>>> +#endif
>>>   BUILD_BUG_ON(sizeof(hvio->mmio_insn) < sizeof(hvmemul_ctxt->insn_buf));
>>>   hvio->mmio_insn_bytes = hvmemul_ctxt->insn_buf_bytes;
>>>   memcpy(hvio->mmio_insn, hvmemul_ctxt->insn_buf, hvio->mmio_insn_bytes);
>>
>> This change doesn't buy us anything, does it?
> 
> why not? Code won't compile w/o it.
> Or do you mean hiding the whole VIO_realmode_completion enum under 
> CONFIG_VMX wasn't really useful?

That's what I meant, by implication. To me it's extra #ifdef-ary without
real gain.

Jan



Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works

2024-05-31 Thread Jan Beulich
On 29.05.2024 17:28, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/include/asm/irq.h
>>> +++ b/xen/arch/x86/include/asm/irq.h
>>> @@ -28,6 +28,32 @@ typedef struct {
>>>  
>>>  struct irq_desc;
>>>  
>>> +/*
>>> + * Xen logic for moving interrupts around CPUs allows manipulating 
>>> interrupts
>>> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is 
>>> as
>>> + * follows:
>>> + *
>>> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
>>> + * 2. New cpu_mask and vector are set, vector is setup at the new 
>>> destination.
>>> + * 3. move_in_progress is set.
>>> + * 4. Interrupt source is updated to target new CPU and vector.
>>> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
>>> + * 6. When an interrupt is delivered at the new destination (cpu_mask) as 
>>> part
>>> + *of acking the interrupt move_in_progress is cleared and 
>>> move_cleanup_count
>>
>> Nit: A comma after "interrupt" may help reading.
>>
>>> + *is set to the weight of online CPUs in old_cpu_mask.
>>> + *IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
>>
>> These last two steps aren't precise enough, compared to what the code does.
>> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
>> empty, what you describe is done. If, however, the result is empty, the
>> vector is released right away (this code may be there just in case, but I
>> think it shouldn't be omitted here).
> 
> I've left that out because I got the impression it made the text more
> complex to follow (with the extra branch) for no real benefit, but I'm
> happy to attempt to add it.

Why "no real benefit"? Isn't the goal to accurately describe what code does
(in various places)? If the result isn't an accurate description in one
specific regard, how reliable would the rest be from a reader's perspective?

>>> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
>>> + *vector entry and decrease the count in move_cleanup_count.  The CPU 
>>> that
>>> + *sets move_cleanup_count to 0 releases the vector.
>>> + *
>>> + * Note that when interrupt movement (either move_in_progress or
>>> + * move_cleanup_count set) is in progress it's not possible to move the
>>> + * interrupt to yet a different CPU.
>>> + *
>>> + * By keeping the vector in the old CPU(s) configured until the interrupt 
>>> is
>>> + * acked on the new destination Xen allows draining any pending interrupts 
>>> at
>>> + * the old destinations.
>>> + */
>>>  struct arch_irq_desc {
>>>  s16 vector;  /* vector itself is only 8 bits, */
>>>  s16 old_vector;  /* but we use -1 for unassigned  */
>>
>> I take it that it is not a goal to (also) describe under what conditions
>> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
>> least because the 2nd from last paragraph lightly touches that area.
> 
> Right, I was mostly focused on moves (forcefully) initiated from
> fixup_irqs(), which is different from the opportunistic affinity
> changes signaled by IRQ_MOVE_PENDING.
> 
> Not sure whether I want to mention this ahead of the list in a
> paragraph, or just add it as a step.  Do you have any preference?

I think ahead might be better. But I also won't insist on it being added.
Just if you don't, perhaps mention in the description that leaving that
out is intentional.

Jan



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-31 Thread Jan Beulich
On 29.05.2024 18:14, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
>> On 29.05.2024 17:03, Roger Pau Monné wrote:
>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does 
>>>>> so from
>>>>> a cpu_hotplug_{begin,done}() region the function will still return 
>>>>> success,
>>>>> because a CPU taking the rwlock in read mode after having taken it in 
>>>>> write
>>>>> mode is allowed.  Such behavior however defeats the purpose of 
>>>>> get_cpu_maps(),
>>>>> as it should always return false when called while a CPU hot{,un}plug
>>>>> operation is in progress.
>>>>
>>>> I'm not sure I can agree with this. The CPU doing said operation ought to 
>>>> be
>>>> aware of what it is itself doing. And all other CPUs will get back false 
>>>> from
>>>> get_cpu_maps().
>>>
>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
>>> the interrupts that might be handled while that operation is in
>>> progress, see below for a concrete example.
>>>
>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>>>> as it could decide to use the shorthand even when a CPU operation is in
>>>>> progress.
>>>>
>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>>>> offline enough to not be in cpu_online_map anymore, it may well need to 
>>>> still
>>>> be the target of IPIs, and targeting it with a shorthand then is still 
>>>> fine.
>>>
>>> The issue is in the online path: there's a window where the CPU is
>>> online (and the lapic active), but cpu_online_map hasn't been updated
>>> yet.  A specific example would be time_calibration() being executed on
>>> the CPU that is running cpu_up().  That could result in a shorthand
>>> IPI being used, but the mask in r.cpu_calibration_map not containing
>>> the CPU that's being brought up online because it's not yet added to
>>> cpu_online_map.  Then the number of CPUs actually running
>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
>>> in r.cpu_calibration_map.
>>
>> I see, but maybe only partly. Prior to the CPU having its bit set in
>> cpu_online_map, can it really take interrupts already? Shouldn't it be
>> running with IRQs off until later, thus preventing it from making it
>> into the rendezvous function in the first place? But yes, I can see
>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
>> might cause problems, too.
> 
> The interrupt will get set in IRR and handled when interrupts are
> enabled.
> 
>>
>> Plus, with how the rendezvous function is invoked (via
>> on_selected_cpus() with the mask copied from cpu_online_map), the
>> first check in smp_call_function_interrupt() ought to prevent the
>> function from being called on the CPU being onlined. A problem would
>> arise though if the IPI arrived later and call_data was already
>> (partly or fully) overwritten with the next request.
> 
> Yeah, there's a small window where the fields in call_data are out of
> sync.
> 
>>>> In any event this would again affect only the CPU leading the CPU 
>>>> operation,
>>>> which should clearly know at which point(s) it is okay to send IPIs. Are we
>>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
>>>
>>> Yes, I've seen the time rendezvous happening while in the middle of a
>>> hotplug operation, and the CPU coordinating the rendezvous being the
>>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
>>
>> Right, yet together with ...
>>
>>>> Together with the earlier paragraph the critical window would be between 
>>>> the
>>>> CPU being taken off of cpu_online_map and the CPU actually going "dead" 
>>>> (i.e.
>>>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
>>>> then the question would be what bad, if any, would happen to that CPU if an
>>>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
>>>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.

Re: [PATCH 1/2] arch/irq: Make irq_ack_none() mandatory

2024-05-31 Thread Jan Beulich
On 30.05.2024 20:40, Andrew Cooper wrote:
> Any non-stub implementation of these is going to have to do something here.

For whatever definition of "something", seeing ...

> --- a/xen/arch/arm/irq.c
> +++ b/xen/arch/arm/irq.c
> @@ -31,12 +31,12 @@ struct irq_guest
>  unsigned int virq;
>  };
>  
> -static void ack_none(struct irq_desc *irq)
> +void irq_ack_none(struct irq_desc *irq)
>  {
>  printk("unexpected IRQ trap at irq %02x\n", irq->irq);
>  }

... this, which - perhaps apart from the word "trap" - is entirely Arm-
independent, and could hence quite well live in a common code fallback
implementation. Nevertheless with patch 2 clearly being an improvement,
both patches:
Reviewed-by: Jan Beulich 

Jan
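
A minimal sketch of the common-code fallback suggested above (illustrative only; it simply lifts the Arm message minus "trap", and the format specifier would need to match irq_desc's field type):

    /* Possible common-code default, as hinted at above - not from the patch. */
    void irq_ack_none(struct irq_desc *desc)
    {
        printk("unexpected IRQ %u\n", desc->irq);
    }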



Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen

2024-05-31 Thread Jan Beulich
On 30.05.2024 19:23, Andrew Cooper wrote:
> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>> Signed-off-by: Oleksii Kurochko 
>> Acked-by: Jan Beulich 
> 
> This patch looks like it can go in independently?  Or does it depend on
> having bitops.h working in practice?

The latter, iirc. In fact I had already tried at least twice to pull this ahead.

Jan



Re: [PATCH v12 8/8] xen/README: add compiler and binutils versions for RISC-V64

2024-05-31 Thread Jan Beulich
On 30.05.2024 21:52, Andrew Cooper wrote:
> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>> diff --git a/README b/README
>> index c8a108449e..30da5ff9c0 100644
>> --- a/README
>> +++ b/README
>> @@ -48,6 +48,10 @@ provided by your OS distributor:
>>- For ARM 64-bit:
>>  - GCC 5.1 or later
>>  - GNU Binutils 2.24 or later
>> +  - For RISC-V 64-bit:
>> +- GCC 12.2 or later
>> +- GNU Binutils 2.39 or later
> 
> I would like to petition for GCC 10 and Binutils 2.35.
> 
> These are the versions provided as cross-compilers by Debian, and
> therefore are the versions I would prefer to do smoke testing with.

See why I asked to amend the specified versions by a softening sentence that
you (only now) said you dislike? The "this is what we use in CI" makes it a
very random choice, entirely unrelated to the compiler's abilities.

Jan



Re: [RFC XEN PATCH v8 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi

2024-05-30 Thread Jan Beulich
On 30.05.2024 13:19, Chen, Jiqian wrote:
> On 2024/5/29 20:22, Jan Beulich wrote:
>> On 29.05.2024 13:13, Chen, Jiqian wrote:
>>> On 2024/5/29 15:10, Jan Beulich wrote:
>>>> On 29.05.2024 08:56, Chen, Jiqian wrote:
>>>>> On 2024/5/29 14:31, Jan Beulich wrote:
>>>>>> On 29.05.2024 04:41, Chen, Jiqian wrote:
>>>>>>> But I found in function init_irq_data:
>>>>>>> for ( irq = 0; irq < nr_irqs_gsi; irq++ )
>>>>>>> {
>>>>>>> int rc;
>>>>>>>
>>>>>>> desc = irq_to_desc(irq);
>>>>>>> desc->irq = irq;
>>>>>>>
>>>>>>> rc = init_one_irq_desc(desc);
>>>>>>> if ( rc )
>>>>>>> return rc;
>>>>>>> }
>>>>>>> Does it mean that when irq < nr_irqs_gsi, the gsi and irq is a 1:1 
>>>>>>> mapping?
>>>>>>
>>>>>> No, as explained before. I also don't see how you would derive that from 
>>>>>> the code above.
>>>>> Because here set desc->irq = irq, and it seems there is no other place to 
>>>>> change this desc->irq, so, gsi 1 is considered to be irq 1.
>>>>
>>>> What are you taking this from? The loop bound isn't nr_gsis, and the 
>>>> iteration
>>>> variable isn't in GSI space either; it's in IRQ numbering space. In this 
>>>> loop
>>>> we're merely leveraging that every GSI has a corresponding IRQ;
>>>> there are no assumptions made about the mapping between the two. Afaics at 
>>>> least.
>>>>
>>>>>> "nr_irqs_gsi" describes what its name says: The number of
>>>>>> IRQs mapping to a (_some_) GSI. That's to tell them from the non-GSI 
>>>>>> (i.e.
>>>>>> mainly MSI) ones. There's no implication whatsoever on the IRQ <-> GSI
>>>>>> mapping.
>>>>>>
>>>>>>> What's more, when using PHYSDEVOP_setup_gsi, it calls mp_register_gsi,
>>>>>>> and in mp_register_gsi, it uses " desc = irq_to_desc(gsi); " to get 
>>>>>>> irq_desc directly.
>>>>>>
>>>>>> Which may be wrong, while that wrong-ness may not have hit anyone in
>>>>>> practice (for reasons that would need working out).
>>>>>>
>>>>>>> Combining above, can we consider "gsi == irq" when irq < nr_irqs_gsi ?
>>>>>>
>>>>>> Again - no.
>>>>> Since you are certain that they are not equal, could you tell me where it
>>>>> shows they are not equal or where their mappings are built,
>>>>> so that I can know how to convert a GSI to an IRQ.
>>>>
>>>> I did point you at the ACPI Interrupt Source Override structure before.
>>>> We're parsing those in acpi_parse_int_src_ovr(), to give you a place to
>>>> start going from.
>>> Oh! I think I know.
>>> If I want to transform gsi to irq, I need to do below:
>>> int irq, entry, ioapic, pin;
>>>
>>> ioapic = mp_find_ioapic(gsi);
>>> pin = gsi - mp_ioapic_routing[ioapic].gsi_base;
>>> entry = find_irq_entry(ioapic, pin, mp_INT);
>>> irq = pin_2_irq(entry, ioapic, pin);
>>>
>>> Am I right?
>>
>> This looks plausible, yes.
> I dump all mpc_config_intsrc of array mp_irqs, it shows:
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 0 dstapic 
> 33 dstirq 2
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 15 srcbus 0 srcbusirq 9 dstapic 
> 33 dstirq 9
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 1 dstapic 
> 33 dstirq 1
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 3 dstapic 
> 33 dstirq 3
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 4 dstapic 
> 33 dstirq 4
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 5 dstapic 
> 33 dstirq 5
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 6 dstapic 
> 33 dstirq 6
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 7 dstapic 
> 33 dstirq 7
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 8 dstapic 
> 33 dstirq 8
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 10 dstapic 
> 33 dstirq 10
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 11 dstapic 
> 33 dstirq 11
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 12 dstapic 
> 33 dstirq 12
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 13 dstapic 
> 33 dstirq 13
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 14 dstapic 
> 33 dstirq 14
> (XEN) find_irq_entry type 3 irqtype 0 irqflag 0 srcbus 0 srcbusirq 15 dstapic 
> 33 dstirq 15
> 
> It seems only Legacy irq and gsi[0:15] has a mapping in mp_irqs.
> Other gsi can be considered 1:1 mapping with irq? Or are there other places 
> reflect the mapping between irq and gsi?

It may be uncommon to have overrides for higher GSIs, but I don't think ACPI
disallows that.
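
For convenience, the translation quoted earlier in this thread could be wrapped up roughly as below; the helper name and the error handling are purely illustrative, not code from the series:

    /* Illustrative wrapper around the quoted snippet; not code from the series. */
    static int gsi_2_irq(unsigned int gsi)
    {
        int ioapic = mp_find_ioapic(gsi);
        unsigned int pin;
        int entry;

        if ( ioapic < 0 )
            return -EINVAL;

        pin = gsi - mp_ioapic_routing[ioapic].gsi_base;
        entry = find_irq_entry(ioapic, pin, mp_INT);

        return pin_2_irq(entry, ioapic, pin);
    }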

Jan



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-29 Thread Jan Beulich
On 29.05.2024 17:03, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so 
>>> from
>>> a cpu_hotplug_{begin,done}() region the function will still return success,
>>> because a CPU taking the rwlock in read mode after having taken it in write
>>> mode is allowed.  Such behavior however defeats the purpose of 
>>> get_cpu_maps(),
>>> as it should always return false when called while a CPU hot{,un}plug
>>> operation is in progress.
>>
>> I'm not sure I can agree with this. The CPU doing said operation ought to be
>> aware of what it is itself doing. And all other CPUs will get back false from
>> get_cpu_maps().
> 
> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> the interrupts that might be handled while that operation is in
> progress, see below for a concrete example.
> 
>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>> as it could decide to use the shorthand even when a CPU operation is in
>>> progress.
>>
>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>> offline enough to not be in cpu_online_map anymore, it may well need to still
>> be the target of IPIs, and targeting it with a shorthand then is still fine.
> 
> The issue is in the online path: there's a window where the CPU is
> online (and the lapic active), but cpu_online_map hasn't been updated
> yet.  A specific example would be time_calibration() being executed on
> the CPU that is running cpu_up().  That could result in a shorthand
> IPI being used, but the mask in r.cpu_calibration_map not containing
> the CPU that's being brought up online because it's not yet added to
> cpu_online_map.  Then the number of CPUs actually running
> time_calibration_rendezvous_fn won't match the weight of the cpumask
> in r.cpu_calibration_map.

I see, but maybe only partly. Prior to the CPU having its bit set in
cpu_online_map, can it really take interrupts already? Shouldn't it be
running with IRQs off until later, thus preventing it from making it
into the rendezvous function in the first place? But yes, I can see
how the IRQ (IPI) then being delivered later (once IRQs are enabled)
might cause problems, too.

Plus, with how the rendezvous function is invoked (via
on_selected_cpus() with the mask copied from cpu_online_map), the
first check in smp_call_function_interrupt() ought to prevent the
function from being called on the CPU being onlined. A problem would
arise though if the IPI arrived later and call_data was already
(partly or fully) overwritten with the next request.
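
For reference, the "first check" referred to is roughly the following (a paraphrase of common/smp.c from memory, not a verbatim copy):

    static void smp_call_function_interrupt(void)
    {
        void (*func)(void *info) = call_data.func;
        void *info = call_data.info;
        unsigned int cpu = smp_processor_id();

        /*
         * CPUs not named in the caller-provided mask (copied from
         * cpu_online_map) bail out here, which is what keeps a CPU that is
         * still being onlined out of the rendezvous - unless the IPI is only
         * delivered after call_data has been reused, as discussed above.
         */
        if ( !cpumask_test_cpu(cpu, &call_data.selected) )
            return;

        /* ... invoke func and maintain the started/finished counts ... */
    }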

>> In any event this would again affect only the CPU leading the CPU operation,
>> which should clearly know at which point(s) it is okay to send IPIs. Are we
>> actually sending any IPIs from within CPU-online or CPU-offline paths?
> 
> Yes, I've seen the time rendezvous happening while in the middle of a
> hotplug operation, and the CPU coordinating the rendezvous being the
> one doing the CPU hotplug operation, so get_cpu_maps() returning true.

Right, yet together with ...

>> Together with the earlier paragraph the critical window would be between the
>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
>> then the question would be what bad, if any, would happen to that CPU if an
>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
>>
>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
>>> already held in write mode by the current CPU, as read_trylock() would
>>> otherwise return true.
>>>
>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already 
>>> locked in write mode')
>>
>> I'm puzzled by this as well: Prior to that and the change referenced by its
>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
>> same as permitting read locking even when the write lock is already held by
>> the local CPU.
> 
> I see, so the Fixes should be:
> 
> x86/smp: use APIC ALLBUT destination shorthand when possible
> 
> Instead, which is the commit that started using get_cpu_maps() in
> send_IPI_mask().

... this I then wonder whether it's really only the condition in
send_IPI_mask() which needs further amending, rather than fiddling with
get_cpu_maps().
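
For readers without the tree at hand, the shorthand gate in send_IPI_mask() is roughly of the following shape (a paraphrase; the exact conditions are reproduced from memory and may differ in detail):

    /* Sketch of the decision in x86's send_IPI_mask(); not a verbatim copy. */
    if ( system_state > SYS_STATE_smp_boot &&
         !unaccounted_cpus && !disabled_cpus &&
         local_irq_is_enabled() &&
         (cpus_locked = get_cpu_maps()) &&       /* <- the check at issue */
         (park_offline_cpus ||
          cpumask_equal(&cpu_online_map, &cpu_present_map)) )
    {
        /* Use the APIC ALLBUT shorthand: every CPU but the sender is hit. */
    }
    else
    {
        /* Send the IPI to the requested destinations one by one. */
    }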

Jan



Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()

2024-05-29 Thread Jan Beulich
On 29.05.2024 17:20, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> The current callers of stop_machine_run() outside of init code already have 
>>> the
>>> CPU maps locked, and hence there's no reason for stop_machine_run() to 
>>> attempt
>>> to lock again.
>>
>> While purely from a description perspective this is okay, ...
>>
>>> --- a/xen/common/stop_machine.c
>>> +++ b/xen/common/stop_machine.c
>>> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, 
>>> unsigned int cpu)
>>>  BUG_ON(!local_irq_is_enabled());
>>>  BUG_ON(!is_idle_vcpu(current));
>>>  
>>> -/* cpu_online_map must not change. */
>>> -if ( !get_cpu_maps() )
>>> +/*
>>> + * cpu_online_map must not change.  The only two callers of
>>> + * stop_machine_run() outside of init code already have the CPU map 
>>> locked.
>>> + */
>>
>> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
>> and RISC-V will have as their code becomes more complete?
>>
>> I'm also unconvinced that requiring ...
>>
>>> +if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
>>
>> ... this for all future (post-init) uses of stop_machine_run() is a good
>> idea. It is quite a bit more natural, to me at least, for the function to
>> effect this itself, as is the case prior to your change.
> 
> This is mostly a pre-req for the next change that switches
> get_cpu_maps() to return false if the current CPU is holding the CPU
> maps lock in write mode.
> 
> IF we don't want to go this route we need a way to signal
> send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
> because get_cpu_maps() alone is not suitable.
> 
> Overall I don't like the corner case where get_cpu_maps() returns true
> if a CPU hot{,un}plug operation is taking place in the current CPU
> context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
> operations can be in progress if it returns true.

I'm not convinced of looking at it this way. To me the guarantee is
merely that no CPU operation is taking place _elsewhere_. As indicated,
imo the local CPU should be well aware of what context it's actually in,
and hence what is (or is not) appropriate to do at a particular point in
time.

I guess what I'm missing is an example of a concrete code path where
things presently go wrong.

Jan



Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count

2024-05-29 Thread Jan Beulich
On 29.05.2024 17:15, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> When adjusting move_cleanup_count to account for CPUs that are offline also
>>> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
>>> those again creating and create an imbalance in move_cleanup_count.
>>
>> I'm in trouble with "creating"; I can't seem to be able to guess what you may
>> have meant.
> 
> Oh, sorry, that's a typo.
> 
> I was meaning to point out that not removing the already subtracted
> CPUs from the mask can lead to further calls to fixup_irqs()
> subtracting them again and move_cleanup_count possibly underflowing.
> 
> Would you prefer to write it as:
> 
> "... could subtract those again and possibly underflow move_cleanup_count."

Fine with me. Looks like simply deleting "creating" and keeping the rest
as it was would be okay too? Whatever you prefer in the end.
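
(To spell the failure mode out with a concrete sequence - a worked illustration, not text from the patch:)

    /*
     * Say old_cpu_mask = {1,2} and move_cleanup_count = 2.
     *
     * 1) CPU1 goes offline.  fixup_irqs() subtracts the weight of the
     *    offline part of old_cpu_mask (= 1), so move_cleanup_count becomes
     *    1, but old_cpu_mask is left as {1,2}.
     * 2) CPU2 goes offline.  The offline part of old_cpu_mask is now {1,2},
     *    weight 2, which gets subtracted again - CPU1 is counted twice and
     *    move_cleanup_count underflows.
     *
     * Pruning old_cpu_mask in step 1 (the hunk below) makes step 2 subtract
     * only what is still outstanding.
     */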

>>> --- a/xen/arch/x86/irq.c
>>> +++ b/xen/arch/x86/irq.c
>>> @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>>>  desc->arch.move_cleanup_count -= cpumask_weight(affinity);
>>>  if ( !desc->arch.move_cleanup_count )
>>>  release_old_vec(desc);
>>> +else
>>> +/*
>>> + * Adjust old_cpu_mask to account for the offline CPUs,
>>> + * otherwise further calls to fixup_irqs() could subtract those
>>> + * again and possibly underflow the counter.
>>> + */
>>> +cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
>>> +            &cpu_online_map);
>>>  }
>>
>> While functionality-wise okay, imo it would be slightly better to use
>> "affinity" here as well, so that even without looking at context beyond
>> what's shown here there is a direct connection to the cpumask_weight()
>> call. I.e.
>>
>> cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
>>                affinity);
>>
>> Thoughts?
> 
> It was more straightforward for me to reason that removing the offline
> CPUs is OK, but I can see that you might prefer to use 'affinity',
> because that's the weight that's subtracted from move_cleanup_count.
> Using either should lead to the same result if my understanding is
> correct.

That was the conclusion I came to, or else I wouldn't have made the
suggestion. Unless you have a strong preference for the as-is form, I'd
indeed prefer the suggested alternative.

Jan



Re: [PATCH v11 2/9] xen: introduce generic non-atomic test_*bit()

2024-05-29 Thread Jan Beulich
On 29.05.2024 16:58, Oleksii K. wrote:
> static always_inline bool test_bit(int nr, const volatile void *addr)
> On Wed, 2024-05-29 at 12:06 +0200, Jan Beulich wrote:
>> On 29.05.2024 11:59, Julien Grall wrote:
>>> I didn't realise this was an existing comment. I think the
>>> suggestion is 
>>> a little bit odd because you could use the atomic version of the
>>> helper.
>>>
>>> Looking at Linux, the second sentence was dropped. But not the
>>> first 
>>> one. I would suggest to do the same. IOW keep:
>>>
>>> "
>>> If two examples of this operation race, one can appear to succeed
>>> but 
>>> actually fail.
>>> "
>>
>> As indicated, I'm okay with that being retained, but only in a form
>> that
>> actually makes sense. I've explained before (to Oleksii) what I
>> consider
>> wrong in this way of wording things.
> 
> Would it be suitable to rephrase it in the following way:
>  * This operation is non-atomic and can be reordered.
>- * If two examples of this operation race, one can appear to succeed
>- * but actually fail.  You must protect multiple accesses with a lock.
>+ * If two instances of this operation race, one may succeed in updating
>+ * the bit in memory, but actually fail. It should be protected from
>+ * potentially racy behavior.
>  */
> static always_inline bool
> __test_and_set_bit(int nr, volatile void *addr)

You've lost the "appear to" ahead of "succeed". Yet even with the adjustment
I still don't see what the "appear to succeed" really is supposed to mean
here. The issue (imo) isn't with success or failure, but with the write of
one CPU potentially being immediately overwritten by another CPU, not
observing the bit change that the first CPU did. IOW both will succeed (or
appear to succeed), but one update may end up being lost. Maybe "..., both
may update memory with their view of the new value, not taking into account
the update the respectively other one did"? Or "..., both will appear to
succeed, but one of the updates may be lost"?

Jan



Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works

2024-05-29 Thread Jan Beulich
On 29.05.2024 11:01, Roger Pau Monne wrote:
> --- a/xen/arch/x86/include/asm/irq.h
> +++ b/xen/arch/x86/include/asm/irq.h
> @@ -28,6 +28,32 @@ typedef struct {
>  
>  struct irq_desc;
>  
> +/*
> + * Xen logic for moving interrupts around CPUs allows manipulating interrupts
> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
> + * follows:
> + *
> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> + * 2. New cpu_mask and vector are set, vector is setup at the new 
> destination.
> + * 3. move_in_progress is set.
> + * 4. Interrupt source is updated to target new CPU and vector.
> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> + * 6. When an interrupt is delivered at the new destination (cpu_mask) as 
> part
> + *of acking the interrupt move_in_progress is cleared and 
> move_cleanup_count

Nit: A comma after "interrupt" may help reading.

> + *is set to the weight of online CPUs in old_cpu_mask.
> + *IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.

These last two steps aren't precise enough, compared to what the code does.
old_cpu_mask is first reduced to online CPUs therein. If the result is non-
empty, what you describe is done. If, however, the result is empty, the
vector is released right away (this code may be there just in case, but I
think it shouldn't be omitted here).
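
The logic being summarized there, including the branch just mentioned, is roughly the following (a paraphrase, not the literal code):

    /* Sketch of step 6 as performed from the ack path. */
    cpumask_t cleanup_mask;

    cpumask_and(&cleanup_mask, desc->arch.old_cpu_mask, &cpu_online_map);
    if ( cpumask_empty(&cleanup_mask) )
        /* No online CPU left to drain: release the old vector right away. */
        release_old_vec(desc);
    else
    {
        desc->arch.move_cleanup_count = cpumask_weight(&cleanup_mask);
        send_IPI_mask(&cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
    }
    desc->arch.move_in_progress = 0;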

> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
> + *vector entry and decrease the count in move_cleanup_count.  The CPU 
> that
> + *sets move_cleanup_count to 0 releases the vector.
> + *
> + * Note that when interrupt movement (either move_in_progress or
> + * move_cleanup_count set) is in progress it's not possible to move the
> + * interrupt to yet a different CPU.
> + *
> + * By keeping the vector in the old CPU(s) configured until the interrupt is
> + * acked on the new destination Xen allows draining any pending interrupts at
> + * the old destinations.
> + */
>  struct arch_irq_desc {
>  s16 vector;  /* vector itself is only 8 bits, */
>  s16 old_vector;  /* but we use -1 for unassigned  */

I take it that it is not a goal to (also) describe under what conditions
an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
least because the 2nd from last paragraph lightly touches that area.

Jan



Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway

2024-05-29 Thread Jan Beulich
On 29.05.2024 11:01, Roger Pau Monne wrote:
> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so 
> from
> a cpu_hotplug_{begin,done}() region the function will still return success,
> because a CPU taking the rwlock in read mode after having taken it in write
> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> as it should always return false when called while a CPU hot{,un}plug operation
> is in progress.

I'm not sure I can agree with this. The CPU doing said operation ought to be
aware of what it is itself doing. And all other CPUs will get back false from
get_cpu_maps().

>  Otherwise the logic in send_IPI_mask() for example is wrong,
> as it could decide to use the shorthand even when a CPU operation is in
> progress.

It's also not becoming clear what's wrong there: As long as a CPU isn't
offline enough to not be in cpu_online_map anymore, it may well need to still
be the target of IPIs, and targeting it with a shorthand then is still fine.

In any event this would again affect only the CPU leading the CPU operation,
which should clearly know at which point(s) it is okay to send IPIs. Are we
actually sending any IPIs from within CPU-online or CPU-offline paths?
Together with the earlier paragraph the critical window would be between the
CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
then the question would be what bad, if any, would happen to that CPU if an
IPI was still targeted at it by way of using the shorthand. I'm pretty sure
it runs with IRQs off at that time, so no ordinary IRQ could be delivered.

> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> already held in write mode by the current CPU, as read_trylock() would
> otherwise return true.
> 
> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already 
> locked in write mode')

I'm puzzled by this as well: Prior to that and the change referenced by its
Fixes: tag, recursive spin locks were used. For the purposes here that's the
same as permitting read locking even when the write lock is already held by
the local CPU.
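
For readers without the tree at hand, the helpers being argued about are thin wrappers around a single rwlock, roughly as below (simplified sketch of xen/common/cpu.c from memory; side effects such as the RCU barrier in cpu_hotplug_begin() are omitted):

    static DEFINE_RWLOCK(cpu_add_remove_lock);

    bool get_cpu_maps(void)
    {
        /*
         * Succeeds unless another CPU holds the lock for writing.  Because
         * the rwlock lets the write-lock holder also take the read lock,
         * this still returns true on the CPU sitting inside
         * cpu_hotplug_begin()/cpu_hotplug_done() - the behaviour the patch
         * wants to change.
         */
        return read_trylock(&cpu_add_remove_lock);
    }

    void put_cpu_maps(void)       { read_unlock(&cpu_add_remove_lock); }
    void cpu_hotplug_begin(void)  { write_lock(&cpu_add_remove_lock); }
    void cpu_hotplug_done(void)   { write_unlock(&cpu_add_remove_lock); }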

Jan



Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()

2024-05-29 Thread Jan Beulich
On 29.05.2024 11:01, Roger Pau Monne wrote:
> The current callers of stop_machine_run() outside of init code already have 
> the
> CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
> to lock again.

While purely from a description perspective this is okay, ...

> --- a/xen/common/stop_machine.c
> +++ b/xen/common/stop_machine.c
> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, 
> unsigned int cpu)
>  BUG_ON(!local_irq_is_enabled());
>  BUG_ON(!is_idle_vcpu(current));
>  
> -/* cpu_online_map must not change. */
> -if ( !get_cpu_maps() )
> +/*
> + * cpu_online_map must not change.  The only two callers of
> + * stop_machine_run() outside of init code already have the CPU map 
> locked.
> + */

... the "two" here is not unlikely to quickly go stale; who knows what PPC
and RISC-V will have as their code becomes more complete?

I'm also unconvinced that requiring ...

> +if ( system_state >= SYS_STATE_active && !cpu_map_locked() )

... this for all future (post-init) uses of stop_machine_run() is a good
idea. It is quite a bit more natural, to me at least, for the function to
effect this itself, as is the case prior to your change.

In fact I'm puzzled by Arm's (init-time) use: They use it twice, for errata
workaround and feature enabling. They do that after bringing up secondary
CPUs, but still from start_xen(). How's that going to work for CPUs brought
offline and then back online later on? __cpu_up() isn't __init, so there's
no obvious sign that soft-{off,on}lining isn't possible on Arm. Cc-ing Arm
maintainers.

Jan

> +{
> +ASSERT_UNREACHABLE();
>  return -EBUSY;
> +}
>  
>  nr_cpus = num_online_cpus();
>  if ( cpu_online(this) )
> @@ -92,10 +98,7 @@ int stop_machine_run(int (*fn)(void *data), void *data, 
> unsigned int cpu)
>  
>  /* Must not spin here as the holder will expect us to be descheduled. */
>  if ( !spin_trylock(&stopmachine_lock) )
> -{
> -put_cpu_maps();
>  return -EBUSY;
> -}
>  
>  stopmachine_data.fn = fn;
>  stopmachine_data.fn_data = data;
> @@ -136,8 +139,6 @@ int stop_machine_run(int (*fn)(void *data), void *data, 
> unsigned int cpu)
>  
>  spin_unlock(&stopmachine_lock);
>  
> -put_cpu_maps();
> -
>  return ret;
>  }
>  
> diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
> index e1d4eb59675c..d8c8264c58b0 100644
> --- a/xen/include/xen/cpu.h
> +++ b/xen/include/xen/cpu.h
> @@ -13,6 +13,8 @@ void put_cpu_maps(void);
>  void cpu_hotplug_begin(void);
>  void cpu_hotplug_done(void);
>  
> +bool cpu_map_locked(void);
> +
>  /* Receive notification of CPU hotplug events. */
>  void register_cpu_notifier(struct notifier_block *nb);
>  
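
The cpu_map_locked() helper declared in the last hunk isn't defined in the quoted context; a plausible shape would be the below, where the "write-locked by the local CPU" query is an assumption (the series may spell it differently):

    bool cpu_map_locked(void)
    {
        /* rw_is_write_locked_by_me() is a guessed name, not a known API. */
        return rw_is_write_locked_by_me(&cpu_add_remove_lock);
    }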




Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count

2024-05-29 Thread Jan Beulich
On 29.05.2024 11:01, Roger Pau Monne wrote:
> When adjusting move_cleanup_count to account for CPUs that are offline also
> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
> those again creating and create an imbalance in move_cleanup_count.

I'm in trouble with "creating"; I can't seem to be able to guess what you may
have meant.

> Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in 
> fixup_irqs()')
> Signed-off-by: Roger Pau Monné 

With the above clarified (adjustment can be done while committing)
Reviewed-by: Jan Beulich 

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>  desc->arch.move_cleanup_count -= cpumask_weight(affinity);
>  if ( !desc->arch.move_cleanup_count )
>  release_old_vec(desc);
> +else
> +/*
> + * Adjust old_cpu_mask to account for the offline CPUs,
> + * otherwise further calls to fixup_irqs() could subtract those
> + * again and possibly underflow the counter.
> + */
> +cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
> +            &cpu_online_map);
>  }

While functionality-wise okay, imo it would be slightly better to use
"affinity" here as well, so that even without looking at context beyond
what's shown here there is a direct connection to the cpumask_weight()
call. I.e.

cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
   affinity);

Thoughts?

Jan



Re: [RFC XEN PATCH v8 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi

2024-05-29 Thread Jan Beulich
On 29.05.2024 13:13, Chen, Jiqian wrote:
> On 2024/5/29 15:10, Jan Beulich wrote:
>> On 29.05.2024 08:56, Chen, Jiqian wrote:
>>> On 2024/5/29 14:31, Jan Beulich wrote:
>>>> On 29.05.2024 04:41, Chen, Jiqian wrote:
>>>>> But I found in function init_irq_data:
>>>>> for ( irq = 0; irq < nr_irqs_gsi; irq++ )
>>>>> {
>>>>> int rc;
>>>>>
>>>>> desc = irq_to_desc(irq);
>>>>> desc->irq = irq;
>>>>>
>>>>> rc = init_one_irq_desc(desc);
>>>>> if ( rc )
>>>>> return rc;
>>>>> }
>>>>> Does it mean that when irq < nr_irqs_gsi, the gsi and irq is a 1:1 
>>>>> mapping?
>>>>
>>>> No, as explained before. I also don't see how you would derive that from 
>>>> the code above.
>>> Because here set desc->irq = irq, and it seems there is no other place to 
>>> change this desc->irq, so, gsi 1 is considered to be irq 1.
>>
>> What are you taking this from? The loop bound isn't nr_gsis, and the 
>> iteration
>> variable isn't in GSI space either; it's in IRQ numbering space. In this loop
>> we're merely leveraging that every GSI has a corresponding IRQ;
>> there are no assumptions made about the mapping between the two. Afaics at 
>> least.
>>
>>>> "nr_irqs_gsi" describes what its name says: The number of
>>>> IRQs mapping to a (_some_) GSI. That's to tell them from the non-GSI (i.e.
>>>> mainly MSI) ones. There's no implication whatsoever on the IRQ <-> GSI
>>>> mapping.
>>>>
>>>>> What's more, when using PHYSDEVOP_setup_gsi, it calls mp_register_gsi,
>>>>> and in mp_register_gsi, it uses " desc = irq_to_desc(gsi); " to get 
>>>>> irq_desc directly.
>>>>
>>>> Which may be wrong, while that wrong-ness may not have hit anyone in
>>>> practice (for reasons that would need working out).
>>>>
>>>>> Combining above, can we consider "gsi == irq" when irq < nr_irqs_gsi ?
>>>>
>>>> Again - no.
>>> Since you are certain that they are not equal, could you tell me where it
>>> shows they are not equal or where their mappings are built,
>>> so that I can know how to convert a GSI to an IRQ.
>>
>> I did point you at the ACPI Interrupt Source Override structure before.
>> We're parsing those in acpi_parse_int_src_ovr(), to give you a place to
>> start going from.
> Oh! I think I know.
> If I want to transform gsi to irq, I need to do below:
>   int irq, entry, ioapic, pin;
> 
>   ioapic = mp_find_ioapic(gsi);
>   pin = gsi - mp_ioapic_routing[ioapic].gsi_base;
>   entry = find_irq_entry(ioapic, pin, mp_INT);
>   irq = pin_2_irq(entry, ioapic, pin);
> 
> Am I right?

This looks plausible, yes.

Jan



Re: [XEN PATCH v2 12/15] x86/vmx: guard access to cpu_has_vmx_* in common code

2024-05-29 Thread Jan Beulich
On 29.05.2024 12:58, Sergiy Kibrik wrote:
> 16.05.24 10:32, Jan Beulich:
>> On 16.05.2024 02:50, Stefano Stabellini wrote:
>>> On Wed, 15 May 2024, Sergiy Kibrik wrote:
>>>> There're several places in common code, outside of arch/x86/hvm/vmx,
>>>> where cpu_has_vmx_* get accessed without checking if VMX present first.
>>>> We may want to guard these macros, as they read global variables defined
>>>> inside vmx-specific files -- so VMX can be made optional later on.
>>>>
>>>> Signed-off-by: Sergiy Kibrik 
>>>> CC: Andrew Cooper 
>>>> CC: Jan Beulich 
>>>> ---
>>>> Here I've tried a different approach from prev.patches [1,2] -- instead of
>>>> modifying whole set of cpu_has_{svm/vmx}_* macros, we can:
>>>>   1) do not touch SVM part at all, because just as Andrew pointed out 
>>>> they're
>>>> used inside arch/x86/hvm/svm only.
>>>>   2) track several places in common code where cpu_has_vmx_* features are
>>>> checked out and guard them using cpu_has_vmx condition
>>>>   3) two of cpu_has_vmx_* macros being used in common code are checked in 
>>>> a bit
>>>> more tricky way, so instead of making complex conditionals even more 
>>>> complicated,
>>>> we can instead integrate cpu_has_vmx condition inside these two macros.
>>>>
>>>> This patch aims to replace [1,2] from v1 series by doing steps above.
>>>>
> [..]
>>>
>>> I am missing some of the previous discussions but why can't we just fix
>>> all of the cpu_has_vmx_* #defines in vmcs.h to also check for
>>> cpu_has_vmx?
>>>
>>> That seems easier and simpler than to add ad-hoc checks at the invocations?
>>
>> I'd like to take the question on step further: Following 0b5f149338e3
>> ("x86/HVM: hide SVM/VMX when their enabling is prohibited by firmware"),
>> is this change needed at all? IOW is there a path left where cpu_has_vmx
>> may be false, by any cpu_has_vmx_* may still yield true?
>>
> 
> This change is about exec control variables (vmx_secondary_exec_control, 
> vmx_pin_based_exec_control etc) not been built, because they're in vmx 
> code, but accessed in common code. The description is probably unclear 
> about that.
> Also build issues related to VMX can be solved differently, without 
> touching cpu_has_vmx_* macros and related logic at all.
> I can move exec control variables from vmcs.c to common hvm.c, this 
> would be simpler change and directly related to problem that I'm having.

That would be moving them one layer too high. Proper disentangling then
will need to wait until that data is actually part of the (host) CPU
policy. For the time being your change may thus be acceptable, assuming
that we won't be very quick in doing said move.
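
As a rough illustration of option 3 from the quoted description - folding the cpu_has_vmx check into one of the affected macros (names per vmcs.h, possibly inexact):

    #define cpu_has_vmx_virtualize_apic_accesses \
        ( cpu_has_vmx && \
          (vmx_secondary_exec_control & \
           SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) )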

Jan



Re: [PATCH] Partial revert of "x86/MCE: optional build of AMD/Intel MCE code"

2024-05-29 Thread Jan Beulich
On 29.05.2024 12:37, Andrew Cooper wrote:
> {cmci,lmce}_support are written during S3 resume, so cannot live in
> __ro_after_init.  Move them back to being __read_mostly, as they were
> originally.
> 
> Link: https://gitlab.com/xen-project/xen/-/jobs/6966698361
> Fixes: 19b6e9f9149f ("x86/MCE: optional build of AMD/Intel MCE code")
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 





[PATCH] x86/MCE: don't treat S3 resume the same as boot

2024-05-29 Thread Jan Beulich
Checks for running on CPU0 are bogus; they identify S3 resume the same
as being the BSP while booting. Replace with the same check we use
elsewhere to properly limit things to just the BSP.

Link: https://gitlab.com/xen-project/xen/-/jobs/6966698361
Fixes: 19b6e9f9149f ("x86/MCE: optional build of AMD/Intel MCE code")
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -767,7 +767,7 @@ static void intel_init_mca(struct cpuinf
 lmce = intel_enable_lmce();
 
 #define CAP(enabled, name) ((enabled) ? ", " name : "")
-if ( smp_processor_id() == 0 )
+if ( c == &boot_cpu_data )
 {
 dprintk(XENLOG_INFO,
 "MCA Capability: firstbank %d, extended MCE MSR %d%s%s%s%s\n",



Re: [PATCH v11 2/9] xen: introduce generic non-atomic test_*bit()

2024-05-29 Thread Jan Beulich
On 29.05.2024 11:59, Julien Grall wrote:
> Hi,
> 
> On 29/05/2024 09:36, Jan Beulich wrote:
>> On 29.05.2024 09:50, Oleksii K. wrote:
>>> On Tue, 2024-05-28 at 09:53 +0100, Julien Grall wrote:
>>>>>>> +/**
>>>>>>> + * generic_test_bit - Determine whether a bit is set
>>>>>>> + * @nr: bit number to test
>>>>>>> + * @addr: Address to start counting from
>>>>>>> + *
>>>>>>> + * This operation is non-atomic and can be reordered.
>>>>>>> + * If two examples of this operation race, one can appear to succeed
>>>>>>> + * but actually fail.  You must protect multiple accesses with a lock.
>>>>>>> + */
>>>>>>
>>>>>> You got carried away updating comments - there's no raciness for
>>>>>> simple test_bit(). As is also expressed by its name not having
>>>>>> those
>>>>>> double underscores that the others have.
>>>>> Then it is true for every function in this header. Based on the naming
>>>>> the conclusion can be done if it is atomic/non-atomic and can/can't be
>>>>> reordered.
>>>>
>>>> So let me start with that my only request is to keep the existing
>>>> comments as you move it. It looks like there were none of test_bit()
>>>> before.
>>> Just to clarify that I understand correctly.
>>>
>>> Do we need any comment above functions generic_*()? Based on that they
>>> are implemented in generic way they will be always "non-atomic and can
>>> be reordered.".
>>
>> I indicated before that I think reproducing the same comments __test_and_*
>> already have also for generic_* isn't overly useful. If someone insisted
>> on them being there as well, I could live with that, though.
> 
> Would you be ok if the comment is only on top of the __test_and_* 
> version? (So no comments on top of the generic_*)

That's my preferred variant, actually. The alternative I would also be
okay-ish with is to have the comments also ahead of generic_*.

>>> Do you find the following comment useful?
>>>
>>> " * If two examples of this operation race, one can appear to succeed
>>>   * but actually fail.  You must protect multiple accesses with a lock."
>>>
>>> It seems to me that it can dropped as basically "non-atomic and can be
>>> reordered." means that.
>>
>> I agree, or else - as indicated before - the wording would need to further
>> change. Yet iirc you've added that in response to a comment from Julien,
>> so you'll primarily want his input as to the presence of something along
>> these lines.
> 
> I didn't realise this was an existing comment. I think the suggestion is 
> a little bit odd because you could use the atomic version of the helper.
> 
> Looking at Linux, the second sentence was dropped. But not the first 
> one. I would suggest to do the same. IOW keep:
> 
> "
> If two examples of this operation race, one can appear to succeed but 
> actually fail.
> "

As indicated, I'm okay with that being retained, but only in a form that
actually makes sense. I've explained before (to Oleksii) what I consider
wrong in this way of wording things.

Jan



Re: [PATCH v11 2/9] xen: introduce generic non-atomic test_*bit()

2024-05-29 Thread Jan Beulich
On 29.05.2024 09:50, Oleksii K. wrote:
> On Tue, 2024-05-28 at 09:53 +0100, Julien Grall wrote:
>>>>> +/**
>>>>> + * generic_test_bit - Determine whether a bit is set
>>>>> + * @nr: bit number to test
>>>>> + * @addr: Address to start counting from
>>>>> + *
>>>>> + * This operation is non-atomic and can be reordered.
>>>>> + * If two examples of this operation race, one can appear to succeed
>>>>> + * but actually fail.  You must protect multiple accesses with a lock.
>>>>> + */

>>>> You got carried away updating comments - there's no raciness for
>>>> simple test_bit(). As is also expressed by its name not having those
>>>> double underscores that the others have.
>>> Then it is true for every function in this header. Based on the naming
>>> the conclusion can be done if it is atomic/non-atomic and can/can't be
>>> reordered.
>>
>> So let me start with that my only request is to keep the existing 
>> comments as you move it. It looks like there were none of test_bit()
>> before.
> Just to clarify that I understand correctly.
> 
> Do we need any comment above functions generic_*()? Based on that they
> are implemented in generic way they will be always "non-atomic and can
> be reordered.".

I indicated before that I think reproducing the same comments __test_and_*
already have also for generic_* isn't overly useful. If someone insisted
on them being there as well, I could live with that, though.

> Do you find the following comment useful?
> 
> " * If two examples of this operation race, one can appear to succeed
>  * but actually fail.  You must protect multiple accesses with a lock."
> 
> It seems to me that it can dropped as basically "non-atomic and can be
> reordered." means that.

I agree, or else - as indicated before - the wording would need to further
change. Yet iirc you've added that in response to a comment from Julien,
so you'll primarily want his input as to the presence of something along
these lines.

Jan



Re: Updated Xen 4.19 schedule

2024-05-29 Thread Jan Beulich
On 29.05.2024 10:03, Oleksii K. wrote:
> Hello everyone,
> 
> I would like to announce that I have decided to update the Xen 4.19
> schedule due to the extended feature freeze period and the upcoming Xen
> Summit next week.
> 
> I propose the following updates:
>Code Freeze: from May 31 to June 7

This is almost fully the week of the summit. With ...

>Hard Code Freeze: from June 21 to June 28

... this, did you perhaps mean May 31 to June 20 there?

>Final commits: from July 5 to July 12

Somewhat similarly, what's between June 28 and July 5?

Jan

>Release: July 17
> The release date is shifted, but it still remains in July, which seems
> acceptable to me.
> 
> One more thing:
> No release ack is needed before Rc1. Please commit bug fixes at will.
> 
> Have a nice day.
> 
> Best regards,
> Oleksii




Re: [PATCH for-4.19 v3 0/3] xen/x86: support foreign mappings for HVM/PVH

2024-05-29 Thread Jan Beulich
On 17.05.2024 15:33, Roger Pau Monne wrote:
> Hello,
> 
> The following series attempts to solve a shortcoming of HVM/PVH guests
> with the lack of support for foreign mappings.  Such lack of support
> prevents using PVH based guests as stubdomains for example.
> 
> Add support in a way similar to how it's done on Arm, by iterating over
> the p2m based on the maximum gfn.
> 
> Patch 2 is not strictly needed.  Moving the enablement of altp2m from an
> HVM param to a create domctl flag avoids any possible race with the HVM
> param changing after it's been evaluated.  Note the param can only be
> set by the control domain, and libxl currently sets it at domain
> create.  Also altp2m enablement is different from activation, as
> activation does happen during runtime of the domain.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (3):
>   xen/x86: account number of foreign mappings in the p2m
>   xen/x86: enable altp2m at create domain domctl
>   xen/x86: remove foreign mappings from the p2m on teardown

Here, too, I'd like to ask whether to keep this as a candidate for 4.19, or
whether to postpone. Afaict what's still missing are Arm and tool chain acks
on patch 2.

Jan

>  CHANGELOG.md|  1 +
>  tools/libs/light/libxl_create.c | 23 +-
>  tools/libs/light/libxl_x86.c| 26 +--
>  tools/ocaml/libs/xc/xenctrl_stubs.c |  2 +-
>  xen/arch/arm/domain.c   |  6 +++
>  xen/arch/x86/domain.c   | 28 
>  xen/arch/x86/hvm/hvm.c  | 23 +-
>  xen/arch/x86/include/asm/p2m.h  | 32 +-
>  xen/arch/x86/mm/p2m-basic.c | 18 
>  xen/arch/x86/mm/p2m.c   | 68 +++--
>  xen/include/public/domctl.h | 20 -
>  xen/include/public/hvm/params.h |  9 +---
>  12 files changed, 215 insertions(+), 41 deletions(-)
> 




Re: [PATCH v2 for-4.19 0.5/13] xen: Introduce CONFIG_SELF_TESTS

2024-05-29 Thread Jan Beulich
On 28.05.2024 16:22, Andrew Cooper wrote:
> ... and move x86's stub_selftest() under this new option.
> 
> There is value in having these tests included in release builds too.
> 
> It will shortly be used to gate the bitops unit tests on all architectures.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

> I've gone with SELF_TESTS rather than BOOT_TESTS, because already in bitops
> we've got compile time tests (which aren't strictly boot time), and the
> livepatching testing wants to be included here and is definitely not boot
> time.

I second this consideration, fwiw.

Jan



Re: [RFC XEN PATCH v8 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi

2024-05-29 Thread Jan Beulich
On 29.05.2024 08:56, Chen, Jiqian wrote:
> On 2024/5/29 14:31, Jan Beulich wrote:
>> On 29.05.2024 04:41, Chen, Jiqian wrote:
>>> Hi,
>>> On 2024/5/17 19:50, Jan Beulich wrote:
>>>> On 17.05.2024 13:14, Chen, Jiqian wrote:
>>>>> On 2024/5/17 18:51, Jan Beulich wrote:
>>>>>> On 17.05.2024 12:45, Chen, Jiqian wrote:
>>>>>>> On 2024/5/16 22:01, Jan Beulich wrote:
>>>>>>>> On 16.05.2024 11:52, Jiqian Chen wrote:
>>>>>>>>> +if ( gsi >= nr_irqs_gsi )
>>>>>>>>> +{
>>>>>>>>> +ret = -EINVAL;
>>>>>>>>> +break;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +if ( !irq_access_permitted(current->domain, gsi) ||
>>>>>>>>
>>>>>>>> I.e. assuming IRQ == GSI? Is that a valid assumption when any number of
>>>>>>>> source overrides may be surfaced by ACPI?
>>>>>>> All irqs smaller than nr_irqs_gsi are gsi, aren't they?
>>>>>>
>>>>>> They are, but there's not necessarily a 1:1 mapping.
>>>>> Oh, so do I need to add a new gsi_caps to store granted gsi?
>>>>
>>>> Probably not. You ought to be able to translate between GSI and IRQ,
>>>> and then continue to record in / check against IRQ permissions.
>>> But I found in function init_irq_data:
>>> for ( irq = 0; irq < nr_irqs_gsi; irq++ )
>>> {
>>> int rc;
>>>
>>> desc = irq_to_desc(irq);
>>> desc->irq = irq;
>>>
>>> rc = init_one_irq_desc(desc);
>>> if ( rc )
>>> return rc;
>>> }
>>> Does it mean that when irq < nr_irqs_gsi, the gsi and irq is a 1:1 mapping?
>>
>> No, as explained before. I also don't see how you would derive that from the 
>> code above.
> Because desc->irq is set to irq here, and there seems to be no other place
> that changes desc->irq, GSI 1 is considered to map to IRQ 1.

What are you taking this from? The loop bound isn't nr_gsis, and the iteration
variable isn't in GSI space either; it's in IRQ numbering space. In this loop
we're merely leveraging that every GSI has a corresponding IRQ; there are no
assumptions made about the mapping between the two. Afaics at least.

>> "nr_irqs_gsi" describes what its name says: The number of
>> IRQs mapping to a (_some_) GSI. That's to tell them from the non-GSI (i.e.
>> mainly MSI) ones. There's no implication whatsoever on the IRQ <-> GSI
>> mapping.
>>
>>> What's more, when using PHYSDEVOP_setup_gsi, it calls mp_register_gsi,
>>> and in mp_register_gsi, it uses " desc = irq_to_desc(gsi); " to get 
>>> irq_desc directly.
>>
>> Which may be wrong, while that wrong-ness may not have hit anyone in
>> practice (for reasons that would need working out).
>>
>>> Combining above, can we consider "gsi == irq" when irq < nr_irqs_gsi ?
>>
>> Again - no.
> Since you are certain that they are not always equal, could you tell me
> where it is shown that they are not equal, or where their mappings are built,
> so that I know how to do the GSI <-> IRQ conversion.

I did point you at the ACPI Interrupt Source Override structure before.
We're parsing those in acpi_parse_int_src_ovr(), to give you a place to
start going from.

Jan



Re: [RFC XEN PATCH v8 5/5] domctl: Add XEN_DOMCTL_gsi_permission to grant gsi

2024-05-29 Thread Jan Beulich
On 29.05.2024 04:41, Chen, Jiqian wrote:
> Hi,
> On 2024/5/17 19:50, Jan Beulich wrote:
>> On 17.05.2024 13:14, Chen, Jiqian wrote:
>>> On 2024/5/17 18:51, Jan Beulich wrote:
>>>> On 17.05.2024 12:45, Chen, Jiqian wrote:
>>>>> On 2024/5/16 22:01, Jan Beulich wrote:
>>>>>> On 16.05.2024 11:52, Jiqian Chen wrote:
>>>>>>> +if ( gsi >= nr_irqs_gsi )
>>>>>>> +{
>>>>>>> +ret = -EINVAL;
>>>>>>> +break;
>>>>>>> +}
>>>>>>> +
>>>>>>> +if ( !irq_access_permitted(current->domain, gsi) ||
>>>>>>
>>>>>> I.e. assuming IRQ == GSI? Is that a valid assumption when any number of
>>>>>> source overrides may be surfaced by ACPI?
>>>>> All irqs smaller than nr_irqs_gsi are gsi, aren't they?
>>>>
>>>> They are, but there's not necessarily a 1:1 mapping.
>>> Oh, so do I need to add a new gsi_caps to store granted gsi?
>>
>> Probably not. You ought to be able to translate between GSI and IRQ,
>> and then continue to record in / check against IRQ permissions.
> But I found in function init_irq_data:
> for ( irq = 0; irq < nr_irqs_gsi; irq++ )
> {
> int rc;
> 
> desc = irq_to_desc(irq);
> desc->irq = irq;
> 
> rc = init_one_irq_desc(desc);
> if ( rc )
> return rc;
> }
> Does it mean that when irq < nr_irqs_gsi, the gsi and irq is a 1:1 mapping?

No, as explained before. I also don't see how you would derive that from
the code above. "nr_irqs_gsi" describes what its name says: The number of
IRQs mapping to a (_some_) GSI. That's to tell them from the non-GSI (i.e.
mainly MSI) ones. There's no implication whatsoever on the IRQ <-> GSI
mapping.

> What's more, when using PHYSDEVOP_setup_gsi, it calls mp_register_gsi,
> and in mp_register_gsi, it uses " desc = irq_to_desc(gsi); " to get irq_desc 
> directly.

Which may be wrong, while that wrong-ness may not have hit anyone in
practice (for reasons that would need working out).

> Combining above, can we consider "gsi == irq" when irq < nr_irqs_gsi ?

Again - no.

Jan



Re: [PATCH] x86/svm: Rework VMCB_ACCESSORS() to use a plain type name

2024-05-28 Thread Jan Beulich
On 28.05.2024 17:32, Andrew Cooper wrote:
> This avoids having a function call in a typeof() expression.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 




Re: [PATCH v2 07/13] x86/bitops: Improve arch_ffs() in the general case

2024-05-28 Thread Jan Beulich
On 28.05.2024 14:30, Andrew Cooper wrote:
> On 27/05/2024 2:37 pm, Jan Beulich wrote:
>> On 27.05.2024 15:27, Jan Beulich wrote:
>>> On 24.05.2024 22:03, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/include/asm/bitops.h
>>>> +++ b/xen/arch/x86/include/asm/bitops.h
>>>> @@ -432,12 +432,28 @@ static inline int ffsl(unsigned long x)
>>>>  
>>>>  static always_inline unsigned int arch_ffs(unsigned int x)
>>>>  {
>>>> -int r;
>>>> +unsigned int r;
>>>> +
>>>> +if ( __builtin_constant_p(x > 0) && x > 0 )
>>>> +{
>>>> +/* Safe, when the compiler knows that x is nonzero. */
>>>> +asm ( "bsf %[val], %[res]"
>>>> +  : [res] "=r" (r)
>>>> +  : [val] "rm" (x) );
>>>> +}
>>> In patch 11 relevant things are all in a single patch, making it easier
>>> to spot that this is dead code: The sole caller already has a
>>> __builtin_constant_p(), hence I don't see how the one here could ever
>>> return true. With that the respective part of the description is then
>>> questionable, too, I'm afraid: Where did you observe any actual effect
>>> from this? Or if you did - what am I missing?
>> Hmm, thinking about it: I suppose that's why you have
>> __builtin_constant_p(x > 0), not __builtin_constant_p(x). I have to admit
>> I'm (positively) surprised that the former may return true when the latter
>> doesn't.
> 
> So was I, but this recommendation came straight from the GCC mailing
> list.  And it really does work, even back in obsolete versions of GCC.
> 
> __builtin_constant_p() operates on an expression not a value, and is
> documented as such.

Of course.

>>  Nevertheless I'm inclined to think this deserves a brief comment.
> 
> There is a comment, and it's even visible in the snippet.

The comment is about the asm(); it is neither placed to clearly relate
to __builtin_constant_p(), nor is it saying anything about this specific
property of it. You said you were equally surprised; don't you think
that when both of us are surprised, a specific (even if brief) comment
is warranted?
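
For instance (wording only a suggestion, to sit right above the check):

    /*
     * Note: __builtin_constant_p() is applied to the expression (x > 0),
     * not to x itself.  The compiler can often prove the comparison to be
     * constant (e.g. when the caller has already tested x against zero)
     * even when the value of x isn't known at compile time.
     */
    if ( __builtin_constant_p(x > 0) && x > 0 )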

>> As an aside, to better match the comment inside the if()'s body, how about
>>
>> if ( __builtin_constant_p(!!x) && x )
>>
>> ? That also may make a little more clear that this isn't just a style
>> choice, but actually needed for the intended purpose.
> 
> I am not changing the logic.
> 
> Apart from anything else, your suggestion is trivially buggy.  I care
> about whether the RHS collapses to a constant, and the only way of doing
> that correctly is asking the compiler about the *exact* expression. 
> Asking about some other expression which you hope - but do not know -
> that the compiler will treat equivalently is bogus.  It would be
> strictly better to only take the else clause, than to have both halves
> emitted.
> 
> This is the form I've tested extensively.  It's also the clearest form
> IMO.  You can experiment with alternative forms when we're not staring
> down code freeze of 4.19.

"Clearest form" is almost always a matter of taste. To me, comparing
unsigned values with > or < against 0 is generally at least suspicious.
Using != is typically better (again: imo), and simply omitting the != 0
then is shorter with no difference in effect. Except in peculiar cases
like this one, where indeed it took me some time to figure why the
comparison operator may not be omitted.

All that said: I'm not going to insist on any change; the R-b previously
offered still stands. I would highly appreciate though if the (further)
comment asked for could be added.

What I definitely dislike here is you - not for the first time - turning
down remarks because a change of yours is late. This feels even more so
bad when considering that I'm typically replying to your patches with
pretty little turnaround. Whereas various of mine, pending in part for
years, do not seem to deserve any review comments at all. Unlike before,
where it was "only" improvements or feature additions, meanwhile even
bug fixes are left sit like that. If I may be blunt: This may not work
this way for much longer. At some point I will need to artificially
delay reviews, making them dependent on my own work also being allowed
to make progress. I question though whether that would be in everyone's
interest.

Jan



Re: [PATCH v11 2/9] xen: introduce generic non-atomic test_*bit()

2024-05-28 Thread Jan Beulich
On 28.05.2024 10:37, Oleksii K. wrote:
> On Tue, 2024-05-28 at 08:20 +0200, Jan Beulich wrote:
>> On 24.05.2024 13:08, Oleksii Kurochko wrote:
>>> +/**
>>> + * generic_test_bit - Determine whether a bit is set
>>> + * @nr: bit number to test
>>> + * @addr: Address to start counting from
>>> + *
>>> + * This operation is non-atomic and can be reordered.
>>> + * If two examples of this operation race, one can appear to
>>> succeed
>>> + * but actually fail.  You must protect multiple accesses with a
>>> lock.
>>> + */
>>
>> You got carried away updating comments - there's no raciness for
>> simple test_bit(). As is also expressed by its name not having those
>> double underscores that the others have.
> Then it is true for every function in this header. Based on the naming,
> the conclusion can be drawn whether it is atomic/non-atomic and can/can't be
> reordered. So the comments don't seem very useful.

I disagree - test_bit() is different, in not being a read-modify-write
operation.

Jan



Re: [XEN PATCH 4/4] x86/traps: address violation of MISRA C Rule 8.4

2024-05-28 Thread Jan Beulich
On 27.05.2024 16:53, Nicola Vetrini wrote:
> Rule 8.4 states: "A compatible declaration shall be visible when
> an object or function with external linkage is defined".
> 
> The function do_general_protection is either used in asm code
> or only within this unit, so there is no risk of this getting
> out of sync with its definition, but the function must remain
> extern.
> 
> Therefore, this function is deviated using a comment-based deviation.
> No functional change.
> 
> Signed-off-by: Nicola Vetrini 

Acked-by: Jan Beulich 

Albeit here, too, I'm not entirely happy with the wording, but I'll let
it go as is. While "there is no risk of this getting out of sync with
its definition" is true for the C side, there's always the risk of
assembly going out of sync with C, simply because the two cannot be
properly connected by any means.

Jan



Re: [XEN PATCH 3/4] x86: address violations of MISRA C Rule 8.4

2024-05-28 Thread Jan Beulich
On 27.05.2024 16:53, Nicola Vetrini wrote:
> Rule 8.4 states: "A compatible declaration shall be visible when
> an object or function with external linkage is defined."
> 
> These variables are only referenced from asm modules, so they
> need to be extern and there is negligible risk of them being
> used improperly without noticing.

"asm modules" isn't quite correct, as there's one use from inline
assembly. I have to admit I have no good wording suggestion other than
explicitly covering both: "asm modules or inline assembly". Yet that
then is ambiguous, as a use in inline assembly may also mean that
symbol is actually visible to the compiler by being mentioned as one of
the operands. Better ideas?

> As a result, they can be exempted using a comment-based deviation.
> No functional change.
> 
> Signed-off-by: Nicola Vetrini 

With suitably adjusted wording:
Acked-by: Jan Beulich 

Jan



Re: [PATCH v11 2/9] xen: introduce generic non-atomic test_*bit()

2024-05-28 Thread Jan Beulich
On 24.05.2024 13:08, Oleksii Kurochko wrote:
> The following generic functions were introduced:
> * test_bit
> * generic__test_and_set_bit
> * generic__test_and_clear_bit
> * generic__test_and_change_bit
> 
> These functions and macros can be useful for architectures
> that don't have corresponding arch-specific instructions.
> 
> Also, the patch introduces the following generics which are
> used by the functions mentioned above:
> * BITOP_BITS_PER_WORD
> * BITOP_MASK
> * BITOP_WORD
> * BITOP_TYPE
> 
> BITS_PER_BYTE was moved to xen/bitops.h as BITS_PER_BYTE is the same
> across architectures.
> 
> The following approach was chosen for generic*() and arch*() bit
> operation functions:
> If the bit operation function that is going to be generic starts
> with the prefix "__", then the corresponding generic/arch function
> will also contain the "__" prefix. For example:
>  * test_bit() will be defined using arch_test_bit() and
>generic_test_bit().
>  * __test_and_set_bit() will be defined using
>arch__test_and_set_bit() and generic__test_and_set_bit().
> 
> Signed-off-by: Oleksii Kurochko 
> ---
>  Reviewed-by: Jan Beulich jbeul...@suse.com? Jan gave his R-by for the 
> previous
>  version of the patch, but some changes were done, so I wasn't sure if I could
>  use the R-by in this patch version.

That really depends on the nature of the changes. Seeing the pretty
long list below, I think it was appropriate to drop the R-b.

> ---
> Changes in V11:
>  - fix identation in generic_test_bit() function.
>  - move definition of BITS_PER_BYTE from /bitops.h to xen/bitops.h

I realize you were asked to do this, but I'm not overly happy with it.
BITS_PER_BYTE is far more similar to BITS_PER_LONG than to
BITOP_BITS_PER_WORD. Plus in any event that change is entirely unrelated
here.

>  - drop the changes in arm64/livepatch.c.
>  - update the the comments on top of functions: generic__test_and_set_bit(),
>generic__test_and_clear_bit(),  generic__test_and_change_bit(),
>generic_test_bit().
>  - Update footer after Signed-off section.
>  - Rebase the patch on top of the staging branch, so it can be merged when
>    the necessary approvals are given.
>  - Update the commit message.
> ---
>  xen/arch/arm/include/asm/bitops.h |  69 ---
>  xen/arch/ppc/include/asm/bitops.h |  54 -
>  xen/arch/ppc/include/asm/page.h   |   2 +-
>  xen/arch/ppc/mm-radix.c   |   2 +-
>  xen/arch/x86/include/asm/bitops.h |  31 ++---
>  xen/include/xen/bitops.h  | 185 ++
>  6 files changed, 196 insertions(+), 147 deletions(-)
> 
> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -103,8 +103,193 @@ static inline int generic_flsl(unsigned long x)
>   * Include this here because some architectures need generic_ffs/fls in
>   * scope
>   */
> +
> +#define BITOP_BITS_PER_WORD 32
> +typedef uint32_t bitop_uint_t;
> +
> +#define BITOP_MASK(nr)  ((bitop_uint_t)1 << ((nr) % BITOP_BITS_PER_WORD))
> +
> +#define BITOP_WORD(nr)  ((nr) / BITOP_BITS_PER_WORD)
> +
> +#define BITS_PER_BYTE 8

This, if to be moved at all, very clearly doesn't belong here. As much
as it's unrelated to this change, it's also unrelated to bitops as a
whole.

> +extern void __bitop_bad_size(void);
> +
> +#define bitop_bad_size(addr) (sizeof(*(addr)) < sizeof(bitop_uint_t))
> +
> +/* - Please tidy above here - */
> +
>  #include 
>  
> +/**
> + * generic__test_and_set_bit - Set a bit and return its old value
> + * @nr: Bit to set
> + * @addr: Address to count from
> + *
> + * This operation is non-atomic and can be reordered.
> + * If two examples of this operation race, one can appear to succeed

Why "examples"? Do you mean "instances"? It's further unclear what
"succeed" and "fail" here mean. The function after all has two
purposes: Checking and setting the specified bit. Therefore I think
you mean "succeed in updating the bit in memory", yet then it's
still unclear what the "appear" here means: The caller has no
indication of success/failure. Therefore I think "appear to" also
wants dropping.

> + * but actually fail.  You must protect multiple accesses with a lock.

That's not "must". An often better alternative is to use the atomic
forms instead. "Multiple" is imprecise, too: "Potentially racy" or
some such would come closer.

Also did Julien(?) really ask that these comments be reproduced on
both the functions supposed to be used throughout the codebase _and_
the generic_*() ones (being merely internal helpers/fallbacks)?

> +/**
> + * generic_test_

Re: Code freeze date for Xen 4.19 is 31.05.2024

2024-05-27 Thread Jan Beulich
On 27.05.2024 17:52, Oleksii K. wrote:
> On Mon, 2024-05-27 at 17:12 +0200, Jan Beulich wrote:
>> On 27.05.2024 15:58, Oleksii K. wrote:
>>> I would like to remind you that the code freeze date for Xen 4.19
>>> is
>>> May 31, 2024.
>>
>> I may be confused: With feature freeze having been last Friday aiui,
>> isn't
>> this a little too short a period? I was actually expecting a longer-
>> than-
>> normal one to account for the Xen Summit week.
> It makes sense to move the last day to June 14. Consequently, we would
> need to adjust the schedule as follows:
> 
> Hard Code Freeze: from June 21 to June 28
> Final commits: from July 5 to July 12
> Release: July 17
> 
> And this schedule also looks good to me.
> 
> However, another option is to combine the Code Freeze and Hard Code
> Freeze periods since both phases are about accepting only bug fixes,
> differing only in the severity of the fixes.

I'm okay with whichever way you want it. All I'm seeking is clarity.

Jan



Re: [XEN PATCH v4 0/3] x86: make Intel/AMD vPMU & MCE support configurable

2024-05-27 Thread Jan Beulich
Oleksii,

On 22.05.2024 10:37, Sergiy Kibrik wrote:
> Three remaining patches to separate support of Intel & AMD CPUs in Xen build.
> Most of related patches from previous series had already been merged.
> Specific changes since v3 are provided per-patch.
> 
> v3 series here:
> https://lore.kernel.org/xen-devel/cover.1715673586.git.sergiy_kib...@epam.com/
> 
>   -Sergiy
> 
> Sergiy Kibrik (3):
>   x86/intel: move vmce_has_lmce() routine to header
>   x86/MCE: add default switch case in init_nonfatal_mce_checker()
>   x86/MCE: optional build of AMD/Intel MCE code

As I'm apparently confused as to the state 4.19 is in, may I please ask
whether this series is still okay to go in, or whether it should be
postponed until after branching.

Thanks, Jan



Re: [XEN PATCH v4 2/3] x86/MCE: add default switch case in init_nonfatal_mce_checker()

2024-05-27 Thread Jan Beulich
On 22.05.2024 10:42, Sergiy Kibrik wrote:
> The default switch case block is wanted here, to handle situations such as
> an unexpected c->x86_vendor value -- then no mcheck init is done, but a
> misleading message still gets logged anyway.
> 
> Signed-off-by: Sergiy Kibrik 

Acked-by: Jan Beulich 





Re: [XEN PATCH v4 1/3] x86/intel: move vmce_has_lmce() routine to header

2024-05-27 Thread Jan Beulich
On 22.05.2024 10:40, Sergiy Kibrik wrote:
> --- a/xen/arch/x86/msr.c
> +++ b/xen/arch/x86/msr.c
> @@ -24,6 +24,8 @@
>  
>  #include 
>  
> +#include "cpu/mcheck/mce.h"

Considering that I asked about this before, I'm a little irritated
that this is entirely unadorned. Such an unusual #include wants
explaining some, you'll find similar comments elsewhere:

#include "cpu/mcheck/mce.h" /* for vmce_has_lmce() */

With that, which could also be adjusted while committing,
Acked-by: Jan Beulich 

Jan



Re: [XEN PATCH 2/4] automation/eclair_analysis: avoid an ECLAIR warning about escaping

2024-05-27 Thread Jan Beulich
On 27.05.2024 16:53, Nicola Vetrini wrote:
> The parentheses in this regular expression should be doubly
> escaped because they are undergo escaping twice.

Do you maybe mean "undergo expansion twice"?

Jan

> Signed-off-by: Nicola Vetrini 
> ---
>  automation/eclair_analysis/ECLAIR/deviations.ecl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/automation/eclair_analysis/ECLAIR/deviations.ecl 
> b/automation/eclair_analysis/ECLAIR/deviations.ecl
> index b9b377c56b25..cf62a874d928 100644
> --- a/automation/eclair_analysis/ECLAIR/deviations.ecl
> +++ b/automation/eclair_analysis/ECLAIR/deviations.ecl
> @@ -405,8 +405,8 @@ explicit comment indicating the fallthrough intention is 
> present."
>  #
>  
>  -doc_begin="printf()-like functions are allowed to use the variadic features 
> provided by stdarg.h."
> --config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(^.*printk\(.*\)$)))"}
> --config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(^.*printf\(.*\)$)))"}
> +-config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(^.*printk\\(.*\\)$)))"}
> +-config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(^.*printf\\(.*\\)$)))"}
>  
> -config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(name(panic)&(function"}
>  
> -config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(name(elf_call_log_callback)&(function"}
>  
> -config=MC3R1.R17.1,reports+={deliberate,"any_area(^.*va_list.*$&(ancestor_or_self(name(vprintk_common)&(function"}




Re: [XEN PATCH 1/4] docs/misra: exclude gdbsx from MISRA compliance

2024-05-27 Thread Jan Beulich
On 27.05.2024 16:53, Nicola Vetrini wrote:
> These files are used when debugging Xen, and are not meant to comply
> with MISRA rules at the moment.
> 
> No functional change.
> 
> Signed-off-by: Nicola Vetrini 

Acked-by: Jan Beulich 





Re: Code freeze date for Xen 4.19 is 31.05.2024

2024-05-27 Thread Jan Beulich
On 27.05.2024 15:58, Oleksii K. wrote:
> I would like to remind you that the code freeze date for Xen 4.19 is
> May 31, 2024.

I may be confused: With feature freeze having been last Friday aiui, isn't
this a little too short a period? I was actually expecting a longer-than-
normal one to account for the Xen Summit week.

Jan

> I'm okay with bug fixes being committed without my release ack (just CC
> me), except in cases where one of the maintainers gives a strong NACK.
> 
> Have a nice week!
> 
> Best regards,
> Oleksii




Re: [PATCH v2 13/13] xen/bitops: Rearrange the top of xen/bitops.h

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> The #include  can move to the top of the file now that
> generic_f?s() have been untangled.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 





Re: [PATCH v2 12/13] xen/bitops: Clean up ffs64()/fls64() definitions

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> Implement ffs64() and fls64() as plain static inlines, dropping the ifdefary
> and intermediate generic_f?s64() forms.
> 
> Add tests for all interesting bit positions at 32bit boundaries.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
with two remarks:

> --- a/xen/common/bitops.c
> +++ b/xen/common/bitops.c
> @@ -24,6 +24,22 @@ static void __init test_ffs(void)
>  CHECK(ffsl, 1UL << 32, 33);
>  CHECK(ffsl, 1UL << 63, 64);
>  #endif
> +
> +/*
> + * unsigned int ffs64(uint64_t)
> + *
> + * 32-bit builds of Xen have to split this into two adjacent operations,
> + * so test all interesting bit positions across the divide.
> + */
> +CHECK(ffs64, 0, 0);
> +CHECK(ffs64, 1, 1);
> +CHECK(ffs64, 3, 1);
> +CHECK(ffs64, 7, 1);
> +CHECK(ffs64, 6, 2);
> +
> +CHECK(ffs64, 0x8000000080000000ULL, 32);
> +CHECK(ffs64, 0x8000000100000000ULL, 33);
> +CHECK(ffs64, 0x8000000000000000ULL, 64);

With the intermediate blank line, the respective part of the comment doesn't
look to be related to these 3 lines. Could I talk you into moving that part
down?

> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -60,6 +60,14 @@ static always_inline __pure unsigned int ffsl(unsigned 
> long x)
>  #endif
>  }
>  
> +static always_inline __pure unsigned int ffs64(uint64_t x)
> +{
> +if ( BITS_PER_LONG == 64 )

In principle >= 64 would be okay here, and hence I'd prefer if we used that
less strict form. Yet I'm not going to insist.

Jan



Re: [PATCH v2 11/13] xen/bitops: Implement fls()/flsl() in common logic

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> From: Oleksii Kurochko 
> 
> This is most easily done together because of how arm32 is currently
> structured, but it does just mirror the existing ffs()/ffsl() work.
> 
> Introduce compile and boot time testing.
> 
> Signed-off-by: Oleksii Kurochko 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
with small adjustments possibly to be done on the earlier similar patches
also done here.

Jan




Re: [PATCH v2 07/13] x86/bitops: Improve arch_ffs() in the general case

2024-05-27 Thread Jan Beulich
On 27.05.2024 15:27, Jan Beulich wrote:
> On 24.05.2024 22:03, Andrew Cooper wrote:
>> --- a/xen/arch/x86/include/asm/bitops.h
>> +++ b/xen/arch/x86/include/asm/bitops.h
>> @@ -432,12 +432,28 @@ static inline int ffsl(unsigned long x)
>>  
>>  static always_inline unsigned int arch_ffs(unsigned int x)
>>  {
>> -int r;
>> +unsigned int r;
>> +
>> +if ( __builtin_constant_p(x > 0) && x > 0 )
>> +{
>> +/* Safe, when the compiler knows that x is nonzero. */
>> +asm ( "bsf %[val], %[res]"
>> +  : [res] "=r" (r)
>> +  : [val] "rm" (x) );
>> +}
> 
> In patch 11 relevant things are all in a single patch, making it easier
> to spot that this is dead code: The sole caller already has a
> __builtin_constant_p(), hence I don't see how the one here could ever
> return true. With that the respective part of the description is then
> questionable, too, I'm afraid: Where did you observe any actual effect
> from this? Or if you did - what am I missing?

Hmm, thinking about it: I suppose that's why you have
__builtin_constant_p(x > 0), not __builtin_constant_p(x). I have to admit
I'm (positively) surprised that the former may return true when the latter
doesn't. Nevertheless I'm inclined to think this deserves a brief comment.

As an aside, to better match the comment inside the if()'s body, how about

if ( __builtin_constant_p(!!x) && x )

? That also may make a little more clear that this isn't just a style
choice, but actually needed for the intended purpose.

Jan



Re: [PATCH v2 07/13] x86/bitops: Improve arch_ffs() in the general case

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> --- a/xen/arch/x86/include/asm/bitops.h
> +++ b/xen/arch/x86/include/asm/bitops.h
> @@ -432,12 +432,28 @@ static inline int ffsl(unsigned long x)
>  
>  static always_inline unsigned int arch_ffs(unsigned int x)
>  {
> -int r;
> +unsigned int r;
> +
> +if ( __builtin_constant_p(x > 0) && x > 0 )
> +{
> +/* Safe, when the compiler knows that x is nonzero. */
> +asm ( "bsf %[val], %[res]"
> +  : [res] "=r" (r)
> +  : [val] "rm" (x) );
> +}

In patch 11 relevant things are all in a single patch, making it easier
to spot that this is dead code: The sole caller already has a
__builtin_constant_p(), hence I don't see how the one here could ever
return true. With that the respective part of the description is then
questionable, too, I'm afraid: Where did you observe any actual effect
from this? Or if you did - what am I missing?

Jan



Re: [PATCH v2 10/13] xen/bitops: Delete find_first_set_bit()

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> No more users.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 





Re: [PATCH v2 09/13] xen/bitops: Replace find_first_set_bit() with ffsl() - 1

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> --- a/xen/arch/x86/hvm/hpet.c
> +++ b/xen/arch/x86/hvm/hpet.c
> @@ -335,7 +335,7 @@ static void timer_sanitize_int_route(HPETState *h, 
> unsigned int tn)
>   * enabled pick the first irq.
>   */
>  timer_config(h, tn) |=
> -MASK_INSR(find_first_set_bit(timer_int_route_cap(h, tn)),
> +MASK_INSR(ffsl(timer_int_route_cap(h, tn)) - 1,
>HPET_TN_ROUTE);
>  }

This can be just ffs().

> @@ -409,7 +409,7 @@ static int cf_check hpet_write(
>  {
>  bool active;
>  
> -i = find_first_set_bit(new_val);
> +i = ffsl(new_val) - 1;
>  if ( i >= HPET_TIMER_NUM )
>  break;

This in principle can be, too, but would require a little further care.

> @@ -535,14 +535,14 @@ static int cf_check hpet_write(
>  /* stop/start timers whos state was changed by this write. */
>  while (stop_timers)
>  {
> -i = find_first_set_bit(stop_timers);
> +i = ffsl(stop_timers) - 1;
>  __clear_bit(i, &stop_timers);
>  hpet_stop_timer(h, i, guest_time);
>  }
>  
>  while (start_timers)
>  {
> -i = find_first_set_bit(start_timers);
> +i = ffsl(start_timers) - 1;
>  __clear_bit(i, &start_timers);
>  hpet_set_timer(h, i, guest_time);
>  }

Same here; in fact {start,stop}_timers are needlessly unsigned long in
the first place.

> --- a/xen/arch/x86/include/asm/pt-contig-markers.h
> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
> @@ -60,7 +60,7 @@ static bool pt_update_contig_markers(uint64_t *pt, unsigned 
> int idx,
>  /* Step 1: Reduce markers in lower numbered entries. */
>  while ( i )
>  {
> -b = find_first_set_bit(i);
> +b = ffsl(i) - 1;
>  i &= ~(1U << b);

Considering i's type and the immediately following expression, this again
can easily be just ffs().

> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -137,7 +137,7 @@ static void set_iommu_ptes_present(unsigned long pt_mfn,
>  ASSERT(!pde->u);
>  
>  if ( pde > table )
> -ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> +    ASSERT(pde->ign0 == ffsl(pde - table) - 1);

pde pointing into the page starting at table, this can be ffs(), too.

Preferably with at least the easy adjustments done:
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH v2 08/13] xen/bitops: Implement ffsl() in common logic

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> Just like ffs() in the previous changes.  Express the upper bound of the
> testing in terms of BITS_PER_LONG as it varies between architectures.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

> @@ -458,6 +441,24 @@ static always_inline unsigned int arch_ffs(unsigned int 
> x)
>  }
>  #define arch_ffs arch_ffs
>  
> +static always_inline unsigned int arch_ffsl(unsigned long x)
> +{
> +unsigned int r;
> +
> +/* See arch_ffs() for safety discussions. */
> +if ( __builtin_constant_p(x > 0) && x > 0 )

See remark on arch_ffs() for possible slight reduction of code.

Jan



Re: [PATCH v2 07/13] x86/bitops: Improve arch_ffs() in the general case

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> The asm in arch_ffs() is safe but inefficient.
> 
> CMOV would be an improvement over a conditional branch, but for 64bit CPUs
> both Intel and AMD have provided enough details about the behaviour for a zero
> input.  It is safe to pre-load the destination register with -1 and drop the
> conditional logic.
> 
> However, it is common to find ffs() in a context where the optimiser knows
> that x is nonzero even if the value isn't known precisely, and in that case
> it's safe to drop the preload of -1 too.
> 
> There are only a handful of uses of ffs() in the x86 build, and all of them
> improve as a result of this:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-31 (-31)
>   Function old new   delta
>   mask_write   114 107  -7
>   xmem_pool_alloc 1063    1039 -24
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
with one suggestion:

> --- a/xen/arch/x86/include/asm/bitops.h
> +++ b/xen/arch/x86/include/asm/bitops.h
> @@ -432,12 +432,28 @@ static inline int ffsl(unsigned long x)
>  
>  static always_inline unsigned int arch_ffs(unsigned int x)
>  {
> -int r;
> +unsigned int r;
> +
> +if ( __builtin_constant_p(x > 0) && x > 0 )

__builtin_constant_p(x) surely will do. In fact even the other "> 0" could
in principle be left out here.

Jan



Re: [PATCH v2 06/13] xen/bitops: Implement ffs() in common logic

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> Perform constant-folding unconditionally, rather than having it implemented
> inconsistently between architectures.
> 
> Confirm the expected behaviour with compile time and boot time tests.
> 
> For non-constant inputs, use arch_ffs() if provided but fall back to
> generic_ffsl() if not.  In particular, RISC-V doesn't have a builtin that
> works in all configurations.
> 
> For x86, rename ffs() to arch_ffs() and adjust the prototype.
> 
> For PPC, __builtin_ctz() is 1/3 of the size of the transform to
> generic_fls().  Drop the definition entirely.  ARM too benefits in the general
> case by using __builtin_ctz(), but less dramatically because it was using
> optimised asm().
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 





Re: [XEN PATCH v2 07/15] x86: guard cpu_has_{svm/vmx} macros with CONFIG_{SVM/VMX}

2024-05-27 Thread Jan Beulich
On 27.05.2024 12:27, Sergiy Kibrik wrote:
> 23.05.24 17:50, Jan Beulich:
>> On 23.05.2024 15:07, Sergiy Kibrik wrote:
>>> 16.05.24 14:12, Jan Beulich:
>>>> On 15.05.2024 11:12, Sergiy Kibrik wrote:
>>>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>>>>> @@ -81,7 +81,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>>>#define cpu_has_sse3boot_cpu_has(X86_FEATURE_SSE3)
>>>>>#define cpu_has_pclmulqdq   boot_cpu_has(X86_FEATURE_PCLMULQDQ)
>>>>>#define cpu_has_monitor boot_cpu_has(X86_FEATURE_MONITOR)
>>>>> -#define cpu_has_vmx boot_cpu_has(X86_FEATURE_VMX)
>>>>> +#define cpu_has_vmx ( IS_ENABLED(CONFIG_VMX) && \
>>>>> +  boot_cpu_has(X86_FEATURE_VMX))
>>>>>#define cpu_has_eistboot_cpu_has(X86_FEATURE_EIST)
>>>>>#define cpu_has_ssse3   boot_cpu_has(X86_FEATURE_SSSE3)
>>>>>#define cpu_has_fma boot_cpu_has(X86_FEATURE_FMA)
>>>>> @@ -109,7 +110,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>>>
>>>>>/* CPUID level 0x8001.ecx */
>>>>>#define cpu_has_cmp_legacy  boot_cpu_has(X86_FEATURE_CMP_LEGACY)
>>>>> -#define cpu_has_svm boot_cpu_has(X86_FEATURE_SVM)
>>>>> +#define cpu_has_svm ( IS_ENABLED(CONFIG_SVM) && \
>>>>> +  boot_cpu_has(X86_FEATURE_SVM))
>>>>>#define cpu_has_sse4a   boot_cpu_has(X86_FEATURE_SSE4A)
>>>>>#define cpu_has_xop boot_cpu_has(X86_FEATURE_XOP)
>>>>>#define cpu_has_skinit  boot_cpu_has(X86_FEATURE_SKINIT)
>>>>
>>>> Hmm, leaving aside the style issue (stray blanks after opening parentheses,
>>>> and as a result one-off indentation on the wrapped lines) I'm not really
>>>> certain we can do this. The description goes into detail why we would want
>>>> this, but it doesn't cover at all why it is safe for all present (and
>>>> ideally also future) uses. I wouldn't be surprised if we had VMX/SVM checks
>>>> just to derive further knowledge from that, without them being directly
>>>> related to the use of VMX/SVM. Take a look at calculate_hvm_max_policy(),
>>>> for example. While it looks to be okay there, it may give you an idea of
>>>> what I mean.
>>>>
>>>> Things might become better separated if instead for such checks we used
>>>> host and raw CPU policies instead of cpuinfo_x86.x86_capability[]. But
>>>> that's still pretty far out, I'm afraid.
>>>
>>> I've followed a suggestion you made for patch in previous series:
>>>
> [..]
>>
>> See the "If not, ..." that I had put there. Doing the change just 
>> mechanically
>> isn't enough, you also need to make clear (in the description) that you
>> verified it's safe to have this way.
>>
>>> yet if this approach can potentially be unsafe (I'm not completely sure
>>> it's safe), should we instead fallback to the way it was done in v1
>>> series? I.e. guard calls to vmx/svm-specific calls where needed, like in
>>> these 3 patches:
>>>
> [..]
>>
>> I don't like this sprinkling around of IS_ENABLED() very much. Maybe we want
>> to have two new helpers (say using_svm() and using_vmx()), to be used in 
>> place
>> of most but possibly not all cpu_has_{svm,vmx}? Doing such a transformation
>> would then kind of implicitly answer the safety question above, as at every
>> use site you'd need to judge whether the replacement is correct. If it's
>> correct everywhere, the construct(s) as proposed in this version could then 
>> be
>> considered to be used in this very shape (instead of introducing the two new
>> helpers). But of course the transition could also be done gradually then,
>> touching only those uses that previously you touched in 1), 2), and 3).
>>
> 
> Now I might be seeing your concerns; if I understood correctly, the
> situation is the following.
> 
>   Taking the cpu_has_vmx macro as an example, it can be used to prove either
> of the following two statements: 1) VMX features can be used, or 2) the CPU
> provides VMX features.
> Currently they're the same for Xen, yet after this patch series they're
> not, as a situation is possible where a non-VMX build would be able to get
> executed o

Re: [PATCH v2 05/13] xen/bitops: Implement generic_f?sl() in lib/

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> generic_f?s() being static inline is the cause of lots of the complexity
> between the common and arch-specific bitops.h
> 
> They appear to be static inline for constant-folding reasons (ARM uses them
> for this), but there are better ways to achieve the same effect.
> 
> It is presumptuous that an unrolled binary search is the right algorithm to
> use on all microarchitectures.  Indeed, it's not for the eventual users, but
> that can be addressed at a later point.
> 
> It is also nonsense to implement the int form as the base primitive and
> construct the long form from 2x int in 64-bit builds, when it's just one extra
> step to operate at the native register width.
> 
> Therefore, implement generic_f?sl() in lib/.  They're not actually needed in
> x86/ARM/PPC by the end of the cleanup (i.e. the functions will be dropped by
> the linker), and they're only expected to be needed by RISC-V on hardware which
> lacks the Zbb extension.
> 
> Implement generic_fls() in terms of generic_flsl() for now, but this will be
> cleaned up in due course.
> 
> Provide basic runtime testing using __constructor inside the lib/ file.  This
> is important, as it means testing runs if and only if generic_f?sl() are used
> elsewhere in Xen.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
with a suggestion and a question.

> I suspect we want to swap CONFIG_DEBUG for CONFIG_BOOT_UNIT_TESTS in due
> course.  These ought to be able to be used in a release build too.

+1

> --- /dev/null
> +++ b/xen/lib/generic-ffsl.c
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include 
> +#include 
> +#include 
> +
> +unsigned int generic_ffsl(unsigned long x)
> +{
> +unsigned int r = 1;
> +
> +if ( !x )
> +return 0;
> +
> +#if BITS_PER_LONG > 32

To be future-proof, perhaps ahead of this

#if BITS_PER_LONG > 64
# error "..."
#endif

or a functionally similar BUILD_BUG_ON()?
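
E.g. (purely to illustrate the latter option):

    /* unsigned long wider than 64 bits isn't catered for below. */
    BUILD_BUG_ON(BITS_PER_LONG > 64);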

> --- /dev/null
> +++ b/xen/lib/generic-flsl.c
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include 
> +#include 
> +#include 
> +
> +/* Mask of type UL with the upper x bits set. */
> +#define UPPER_MASK(x) (~0UL << (BITS_PER_LONG - (x)))
> +
> +unsigned int generic_flsl(unsigned long x)
> +{
> +unsigned int r = BITS_PER_LONG;
> +
> +if ( !x )
> +return 0;
> +
> +#if BITS_PER_LONG > 32
> +if ( !(x & UPPER_MASK(32)) )
> +{
> +x <<= 32;
> +r -= 32;
> +}
> +#endif
> +if ( !(x & UPPER_MASK(16)) )
> +{
> +x <<= 16;
> +r -= 16;
> +}
> +if ( !(x & UPPER_MASK(8)) )
> +{
> +x <<= 8;
> +r -= 8;
> +}
> +if ( !(x & UPPER_MASK(4)) )
> +{
> +x <<= 4;
> +r -= 4;
> +}
> +if ( !(x & UPPER_MASK(2)) )
> +{
> +x <<= 2;
> +r -= 2;
> +}
> +if ( !(x & UPPER_MASK(1)) )
> +{
> +x <<= 1;
> +r -= 1;
> +}
> +
> +return r;
> +}

While, as you say, the expectation is for this code to not commonly come
into actual use, I still find the algorithm a little inefficient in terms
of the constants used, specifically considering how they would need
instantiating in resulting assembly. It may be that Arm's fancy constant-
move insns can actually efficiently synthesize them, but I think on most
other architectures it would be more efficient (and presumably no less
efficient on Arm) to shift the "remaining" value right, thus allowing for
successively smaller (and hence easier to instantiate) constants to be
used.
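
Concretely, the alternative being suggested might look like this (an
illustrative sketch only, untested):

unsigned int generic_flsl(unsigned long x)
{
    unsigned int r = 1;

    if ( !x )
        return 0;

#if BITS_PER_LONG > 32
    if ( x & (~0UL << 32) ) { x >>= 32; r += 32; }
#endif
    if ( x & 0xffff0000U ) { x >>= 16; r += 16; }
    if ( x & 0xff00 ) { x >>= 8; r += 8; }
    if ( x & 0xf0 ) { x >>= 4; r += 4; }
    if ( x & 0xc ) { x >>= 2; r += 2; }
    if ( x & 0x2 ) { x >>= 1; r += 1; }

    /* Only ever small, easily synthesized constants are needed above. */
    return r;
}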

Jan



Re: [PATCH v2 02/13] xen/bitops: Cleanup ahead of rearrangements

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
>  * Rename __attribute_pure__ to just __pure before it gains users.
>  * Introduce __constructor which is going to be used in lib/, and is
>unconditionally cf_check.
>  * Identify the areas of xen/bitops.h which are a mess.
>  * Introduce xen/boot-check.h as helpers for compile and boot time testing.
>This provides a statement of the ABI, and a confirmation that arch-specific
>implementations behave as expected.
> 
> Sadly Clang 7 and older isn't happy with the compile time checks.  Skip them,
> and just rely on the runtime checks.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

Further remarks, though:

> ---
>  xen/include/xen/bitops.h | 13 ++--
>  xen/include/xen/boot-check.h | 60 
>  xen/include/xen/compiler.h   |  3 +-
>  3 files changed, 72 insertions(+), 4 deletions(-)
>  create mode 100644 xen/include/xen/boot-check.h

The bulk of the changes isn't about bitops; it's just that you're intending
to first use it for testing there. The subject prefix therefore is somewhat
misleading.

> --- /dev/null
> +++ b/xen/include/xen/boot-check.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +/*
> + * Helpers for boot-time checks of basic logic, including confirming that
> + * examples which should be calculated by the compiler are.
> + */
> +#ifndef XEN_BOOT_CHECK_H
> +#define XEN_BOOT_CHECK_H
> +
> +#include 
> +
> +/* Hide a value from the optimiser. */
> +#define HIDE(x) \
> +({ typeof(x) _x = (x); asm volatile ( "" : "+r" (_x) ); _x; })

In principle this is a macro that could be of use elsewhere. That's also
reflected in its entirely generic name. It therefore feels mis-placed in
this header. Otoh though the use of "+r" is more restricting than truly
necessary: While I'm not sure if "+g" would work, i.e. if that wouldn't
cause issues with literals, pretty surely "+rm" ought to work, removing
the strict requirement for the compiler to put a certain value in a
register.
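
I.e. something like (illustrative only; whether every supported compiler is
happy with "rm" here would need double checking):

/* Hide a value from the optimiser, without forcing it into a register. */
#define HIDE(x) \
    ({ typeof(x) _x = (x); asm volatile ( "" : "+rm" (_x) ); _x; })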

Assuming you may have reservations against "+g" / "+rm" (and hence the
construct wants keeping here), maybe rename to e.g. BOOT_CHECK_HIDE()?
Alternatively, if generalized, moving to xen/macros.h would seem
appropriate to me.

Finally, plainly as a remark with no request for any change (but
possibly a minor argument against moving to xen/macros.h), this construct
won't, afaict, work if x is of array(-of-const) type. A more specialized
variant may need introducing, should any such use ever appear.

Jan



Re: [PATCH v2 04/13] xen/page_alloc: Coerce min(flsl(), foo) expressions to being unsigned

2024-05-27 Thread Jan Beulich
On 24.05.2024 22:03, Andrew Cooper wrote:
> This is in order to maintain bisectability through the subsequent changes,
> where flsl() changes sign-ness non-atomically by architecture.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 





Re: [PATCH for-4.19? v4 4/6] x86: Make the maximum number of altp2m views configurable

2024-05-27 Thread Jan Beulich
On 27.05.2024 01:55, Petr Beneš wrote:
> On Tue, May 21, 2024 at 12:59 PM Jan Beulich  wrote:
>>
>> The compared entities don't really fit together. I think we want a new
>> MAX_NR_ALTP2M, which - for the time being - could simply be

Note that you've stripped too much context - "the compared entities" is
left without any meaning here, yet that's relevant to my earlier reply.

>> #define MAX_NR_ALTP2M MAX_EPTP
>>
>> in the header. That would then be a suitable replacement for the
>> min(ARRAY_SIZE(d->arch.altp2m_p2m), MAX_EPTP) that you're adjusting
>> elsewhere. Which however raises the question whether in EPT-specific
>> code the min() wouldn't better survive, as min(d->nr_altp2m, MAX_EPTP).
>>
> 
> As you mentioned in a previous email, I've removed all the min(...,
> MAX_EPTP) invocations from the code, since nr_altp2m is validated to
> be no greater than that value. The only remaining places where this
> value occurs are:
> 
> - In my newly introduced condition in arch_sanitise_domain_config:
> 
> if ( config->nr_altp2m > MAX_EPTP )
> {
> dprintk(XENLOG_INFO, "nr_altp2m must be <= %lu\n", MAX_NR_ALTP2M);
> return -EINVAL;
> }

This is suspicious: You compare against one value but log another. This
isn't EPT-specific, so shouldn't use MAX_EPTP.
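
I.e. presumably something along the lines of (assuming the %lu specifier
still matches whatever type MAX_NR_ALTP2M ends up having):

    if ( config->nr_altp2m > MAX_NR_ALTP2M )
    {
        dprintk(XENLOG_INFO, "nr_altp2m must be <= %lu\n", MAX_NR_ALTP2M);
        return -EINVAL;
    }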

> - In hap_enable():
> 
> for ( i = 0; i < MAX_EPTP; i++ )
> {
> d->arch.altp2m_eptp[i] = mfn_x(INVALID_MFN);
> d->arch.altp2m_visible_eptp[i] = mfn_x(INVALID_MFN);
> }
> 
> Note that altp2m_eptp/altp2m_visible_eptp is never accessed beyond
> nr_altp2m. From what you're saying, it sounds to me like I should only
> replace the first mentioned occurrence with MAX_NR_ALTP2M. Correct me
> if I'm wrong.

Yes. I suspect though that there may be further places that want adjusting.

>>> @@ -403,12 +403,12 @@ long p2m_set_mem_access_multi(struct domain *d,
>>>  /* altp2m view 0 is treated as the hostp2m */
>>>  if ( altp2m_idx )
>>>  {
>>> -if ( altp2m_idx >= min(ARRAY_SIZE(d->arch.altp2m_p2m), MAX_EPTP) ||
>>> - d->arch.altp2m_eptp[array_index_nospec(altp2m_idx, MAX_EPTP)] 
>>> ==
>>> - mfn_x(INVALID_MFN) )
>>> +if ( altp2m_idx >= d->nr_altp2m ||
>>> + d->arch.altp2m_eptp[array_index_nospec(altp2m_idx, 
>>> d->nr_altp2m)]
>>> + == mfn_x(INVALID_MFN) )
>>
>> Please don't break previously correct style: Binary operators (here: == )
>> belong onto the end of the earlier line. That'll render the line too long
>> again, but you want to deal with that e.g. thus:
>>
>>  d->arch.altp2m_eptp[array_index_nospec(altp2m_idx,
>> d->nr_altp2m)] ==
>>  mfn_x(INVALID_MFN) )
>>
> 
> Roger suggested introducing the altp2m_get_p2m() function, which I
> like. I think introducing altp2m_get_eptp/visible_eptp and
> altp2m_set_eptp/visible_eptp would also elegantly solve the issue of
> overly long lines. My question is: if I go this route, should I
> strictly replace with these functions only accesses that use
> array_index_nospec()? Or should I replace all array accesses? For
> example:
> 
> for ( i = 0; i < d->nr_altp2m; i++ )
> {
> struct p2m_domain *p2m;
> 
> if ( d->arch.altp2m_eptp[i] == mfn_x(INVALID_MFN) )
> continue;
> 
> p2m = d->arch.altp2m_p2m[i];
> 
> p2m_lock(p2m);
> p2m->ept.ad = value;
> p2m_unlock(p2m);
> }
> 
> ... should I be consistent and also replace these accesses with
> altp2m_get_eptp/altp2m_get_p2m (which will internally use
> array_index_nospec), or should I leave them as they are?

Perhaps leave them as they are, unless you can technically justify the
adjustment.
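
For reference, the accessor being discussed might be shaped like this (a
sketch under the assumptions of this series, i.e. d->nr_altp2m existing and
being bounded at domain creation):

static inline struct p2m_domain *altp2m_get_p2m(struct domain *d,
                                                unsigned int idx)
{
    ASSERT(idx < d->nr_altp2m);

    return d->arch.altp2m_p2m[array_index_nospec(idx, d->nr_altp2m)];
}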

Jan



Re: [PATCH v10 02/14] xen: introduce generic non-atomic test_*bit()

2024-05-24 Thread Jan Beulich
On 24.05.2024 09:25, Oleksii K. wrote:
> On Fri, 2024-05-24 at 08:48 +0200, Jan Beulich wrote:
>> On 23.05.2024 18:40, Oleksii K. wrote:
>>> On Thu, 2024-05-23 at 15:33 +0100, Julien Grall wrote:
>>>> On 23/05/2024 15:11, Oleksii K. wrote:
>>>>> On Thu, 2024-05-23 at 14:00 +0100, Julien Grall wrote:
>>>>>> On 17/05/2024 14:54, Oleksii Kurochko wrote:
>>>>>>> diff --git a/xen/arch/arm/arm64/livepatch.c
>>>>>>> b/xen/arch/arm/arm64/livepatch.c
>>>>>>> index df2cebedde..4bc8ed9be5 100644
>>>>>>> --- a/xen/arch/arm/arm64/livepatch.c
>>>>>>> +++ b/xen/arch/arm/arm64/livepatch.c
>>>>>>> @@ -10,7 +10,6 @@
>>>>>>>    #include 
>>>>>>>    #include 
>>>>>>>    
>>>>>>> -#include 
>>>>>>
>>>>>> It is a bit unclear how this change is related to the patch.
>>>>>> Can
>>>>>> you
>>>>>> explain in the commit message?
>>>>> Probably it doesn't need anymore. I will double check and if
>>>>> this
>>>>> change is not needed, I will just drop it in the next patch
>>>>> version.
>>>>>
>>>>>>
>>>>>>>    #include 
>>>>>>>    #include 
>>>>>>>    #include 
>>>>>>> diff --git a/xen/arch/arm/include/asm/bitops.h
>>>>>>> b/xen/arch/arm/include/asm/bitops.h
>>>>>>> index 5104334e48..8e16335e76 100644
>>>>>>> --- a/xen/arch/arm/include/asm/bitops.h
>>>>>>> +++ b/xen/arch/arm/include/asm/bitops.h
>>>>>>> @@ -22,9 +22,6 @@
>>>>>>>    #define __set_bit(n,p)    set_bit(n,p)
>>>>>>>    #define __clear_bit(n,p)  clear_bit(n,p)
>>>>>>>    
>>>>>>> -#define BITOP_BITS_PER_WORD 32
>>>>>>> -#define BITOP_MASK(nr)  (1UL << ((nr) %
>>>>>>> BITOP_BITS_PER_WORD))
>>>>>>> -#define BITOP_WORD(nr)  ((nr) /
>>>>>>> BITOP_BITS_PER_WORD)
>>>>>>>    #define BITS_PER_BYTE   8
>>>>>>
>>>>>> OOI, any reason BITS_PER_BYTE has not been moved as well? I
>>>>>> don't
>>>>>> expect
>>>>>> the value to change across arch.
>>>>> I can move it to generic one header too in the next patch
>>>>> version.
>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/xen/include/xen/bitops.h
>>>>>>> b/xen/include/xen/bitops.h
>>>>>>> index f14ad0d33a..6eeeff0117 100644
>>>>>>> --- a/xen/include/xen/bitops.h
>>>>>>> +++ b/xen/include/xen/bitops.h
>>>>>>> @@ -65,10 +65,141 @@ static inline int
>>>>>>> generic_flsl(unsigned
>>>>>>> long
>>>>>>> x)
>>>>>>>     * scope
>>>>>>>     */
>>>>>>>    
>>>>>>> +#define BITOP_BITS_PER_WORD 32
>>>>>>> +typedef uint32_t bitop_uint_t;
>>>>>>> +
>>>>>>> +#define BITOP_MASK(nr)  ((bitop_uint_t)1 << ((nr) %
>>>>>>> BITOP_BITS_PER_WORD))
>>>>>>> +
>>>>>>> +#define BITOP_WORD(nr)  ((nr) / BITOP_BITS_PER_WORD)
>>>>>>> +
>>>>>>> +extern void __bitop_bad_size(void);
>>>>>>> +
>>>>>>> +#define bitop_bad_size(addr) (sizeof(*(addr)) <
>>>>>>> sizeof(bitop_uint_t))
>>>>>>> +
>>>>>>>    /* - Please tidy above here 
>>>>>>> 
>>>>>>> -
>>>>>>>  */
>>>>>>>    
>>>>>>>    #include 
>>>>>>>    
>>>>>>> +/**
>>>>>>> + * generic__test_and_set_bit - Set a bit and return its
>>>>>>> old
>>>>>>> value
>>>>>>> + * @nr: Bit to set
>>>>>>> + * @addr: Address to count from
>>>>>>> + *
>>>>>>> + * This operation is non-atomic and can be reorde

Re: [XEN PATCH v2 07/15] x86: guard cpu_has_{svm/vmx} macros with CONFIG_{SVM/VMX}

2024-05-24 Thread Jan Beulich
On 24.05.2024 01:36, Stefano Stabellini wrote:
> On Thu, 23 May 2024, Jan Beulich wrote:
>> On 23.05.2024 15:07, Sergiy Kibrik wrote:
>>> 16.05.24 14:12, Jan Beulich:
>>>> On 15.05.2024 11:12, Sergiy Kibrik wrote:
>>>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>>>>> @@ -81,7 +81,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>>>   #define cpu_has_sse3boot_cpu_has(X86_FEATURE_SSE3)
>>>>>   #define cpu_has_pclmulqdq   boot_cpu_has(X86_FEATURE_PCLMULQDQ)
>>>>>   #define cpu_has_monitor boot_cpu_has(X86_FEATURE_MONITOR)
>>>>> -#define cpu_has_vmx boot_cpu_has(X86_FEATURE_VMX)
>>>>> +#define cpu_has_vmx ( IS_ENABLED(CONFIG_VMX) && \
>>>>> +  boot_cpu_has(X86_FEATURE_VMX))
>>>>>   #define cpu_has_eistboot_cpu_has(X86_FEATURE_EIST)
>>>>>   #define cpu_has_ssse3   boot_cpu_has(X86_FEATURE_SSSE3)
>>>>>   #define cpu_has_fma boot_cpu_has(X86_FEATURE_FMA)
>>>>> @@ -109,7 +110,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>>>   
>>>>>   /* CPUID level 0x8001.ecx */
>>>>>   #define cpu_has_cmp_legacy  boot_cpu_has(X86_FEATURE_CMP_LEGACY)
>>>>> -#define cpu_has_svm boot_cpu_has(X86_FEATURE_SVM)
>>>>> +#define cpu_has_svm ( IS_ENABLED(CONFIG_SVM) && \
>>>>> +  boot_cpu_has(X86_FEATURE_SVM))
>>>>>   #define cpu_has_sse4a   boot_cpu_has(X86_FEATURE_SSE4A)
>>>>>   #define cpu_has_xop boot_cpu_has(X86_FEATURE_XOP)
>>>>>   #define cpu_has_skinit  boot_cpu_has(X86_FEATURE_SKINIT)
>>>>
>>>> Hmm, leaving aside the style issue (stray blanks after opening parentheses,
>>>> and as a result one-off indentation on the wrapped lines) I'm not really
>>>> certain we can do this. The description goes into detail why we would want
>>>> this, but it doesn't cover at all why it is safe for all present (and
>>>> ideally also future) uses. I wouldn't be surprised if we had VMX/SVM checks
>>>> just to derive further knowledge from that, without them being directly
>>>> related to the use of VMX/SVM. Take a look at calculate_hvm_max_policy(),
>>>> for example. While it looks to be okay there, it may give you an idea of
>>>> what I mean.
>>>>
>>>> Things might become better separated if instead for such checks we used
>>>> host and raw CPU policies instead of cpuinfo_x86.x86_capability[]. But
>>>> that's still pretty far out, I'm afraid.
>>>
>>> I've followed a suggestion you made for patch in previous series:
>>>
>>> https://lore.kernel.org/xen-devel/8fbd604e-5e5d-410c-880f-2ad257bbe...@suse.com/
>>
>> See the "If not, ..." that I had put there. Doing the change just 
>> mechanically
>> isn't enough, you also need to make clear (in the description) that you
>> verified it's safe to have this way.
> 
> What does it mean to "verified it's safe to have this way"? "Safe" in
> what way?

"Safe" as in "not breaking existing logic", anywhere.

Jan



Re: [PATCH v10 02/14] xen: introduce generic non-atomic test_*bit()

2024-05-24 Thread Jan Beulich
On 23.05.2024 18:40, Oleksii K. wrote:
> On Thu, 2024-05-23 at 15:33 +0100, Julien Grall wrote:
>> On 23/05/2024 15:11, Oleksii K. wrote:
>>> On Thu, 2024-05-23 at 14:00 +0100, Julien Grall wrote:
 On 17/05/2024 14:54, Oleksii Kurochko wrote:
> diff --git a/xen/arch/arm/arm64/livepatch.c
> b/xen/arch/arm/arm64/livepatch.c
> index df2cebedde..4bc8ed9be5 100644
> --- a/xen/arch/arm/arm64/livepatch.c
> +++ b/xen/arch/arm/arm64/livepatch.c
> @@ -10,7 +10,6 @@
>    #include 
>    #include 
>    
> -#include 

 It is a bit unclear how this change is related to the patch. Can
 you
 explain in the commit message?
>>> Probably it doesn't need anymore. I will double check and if this
>>> change is not needed, I will just drop it in the next patch
>>> version.
>>>

>    #include 
>    #include 
>    #include 
> diff --git a/xen/arch/arm/include/asm/bitops.h
> b/xen/arch/arm/include/asm/bitops.h
> index 5104334e48..8e16335e76 100644
> --- a/xen/arch/arm/include/asm/bitops.h
> +++ b/xen/arch/arm/include/asm/bitops.h
> @@ -22,9 +22,6 @@
>    #define __set_bit(n,p)    set_bit(n,p)
>    #define __clear_bit(n,p)  clear_bit(n,p)
>    
> -#define BITOP_BITS_PER_WORD 32
> -#define BITOP_MASK(nr)  (1UL << ((nr) %
> BITOP_BITS_PER_WORD))
> -#define BITOP_WORD(nr)  ((nr) / BITOP_BITS_PER_WORD)
>    #define BITS_PER_BYTE   8

 OOI, any reason BITS_PER_BYTE has not been moved as well? I don't
 expect
 the value to change across arch.
>>> I can move it to generic one header too in the next patch version.
>>>

 [...]

> diff --git a/xen/include/xen/bitops.h
> b/xen/include/xen/bitops.h
> index f14ad0d33a..6eeeff0117 100644
> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -65,10 +65,141 @@ static inline int generic_flsl(unsigned
> long
> x)
>     * scope
>     */
>    
> +#define BITOP_BITS_PER_WORD 32
> +typedef uint32_t bitop_uint_t;
> +
> +#define BITOP_MASK(nr)  ((bitop_uint_t)1 << ((nr) %
> BITOP_BITS_PER_WORD))
> +
> +#define BITOP_WORD(nr)  ((nr) / BITOP_BITS_PER_WORD)
> +
> +extern void __bitop_bad_size(void);
> +
> +#define bitop_bad_size(addr) (sizeof(*(addr)) <
> sizeof(bitop_uint_t))
> +
>    /* - Please tidy above here 
> -
>  */
>    
>    #include 
>    
> +/**
> + * generic__test_and_set_bit - Set a bit and return its old
> value
> + * @nr: Bit to set
> + * @addr: Address to count from
> + *
> + * This operation is non-atomic and can be reordered.
> + * If two examples of this operation race, one can appear to
> succeed
> + * but actually fail.  You must protect multiple accesses with
> a
> lock.
> + */

 Sorry for only mentioning this on v10. I think this comment
 should be
 duplicated (or moved to) on top of test_bit() because this is
 what
 everyone will use. This will avoid the developper to follow the
 function
 calls and only notice the x86 version which says "This function
 is
 atomic and may not be reordered." and would be wrong for all the
 other arch.
>>> It makes sense to add this comment on top of test_bit(), but I am
>>> curious if it is needed to mention that for x86 arch_test_bit() "is
>>> atomic and may not be reordered":
>>
>> I would say no because any developer modifying common code can't
>> rely on it.
>>
>>>
>>>   * This operation is non-atomic and can be reordered. ( Exception:
>>> for
>>> * x86 arch_test_bit() is atomic and may not be reordered )
>>>   * If two examples of this operation race, one can appear to
>>> succeed
>>>   * but actually fail.  You must protect multiple accesses with a
>>> lock.
>>>   */
>>>

> +static always_inline bool
> +generic__test_and_set_bit(int nr, volatile void *addr)
> +{
> +    bitop_uint_t mask = BITOP_MASK(nr);
> +    volatile bitop_uint_t *p = (volatile bitop_uint_t *)addr +
> BITOP_WORD(nr);
> +    bitop_uint_t old = *p;
> +
> +    *p = old | mask;
> +    return (old & mask);
> +}
> +
> +/**
> + * generic__test_and_clear_bit - Clear a bit and return its old value
> + * @nr: Bit to clear
> + * @addr: Address to count from
> + *
> + * This operation is non-atomic and can be reordered.
> + * If two examples of this operation race, one can appear to succeed
> + * but actually fail.  You must protect multiple accesses with a lock.
> + */

 Same applies here and ...

> +static always_inline bool
> +generic__test_and_clear_bit(int nr, volatile void *addr)
> +{
> +    bitop_uint_t mask = BITOP_MASK(nr);
> +    volatile bitop_uint_t *p = 

Re: [PATCH 6/7] x86/cpuid: Fix handling of XSAVE dynamic leaves

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> First, if XSAVE is available in hardware but not visible to the guest, the
> dynamic leaves shouldn't be filled in.
> 
> Second, the comment concerning XSS state is wrong.  VT-x doesn't manage
> host/guest state automatically, but there is provision for "host only" bits to
> be set, so the implications are still accurate.
> 
> Introduce xstate_compressed_size() to mirror the uncompressed one.  Cross
> check it at boot.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
Irrespective ...

> v3:
>  * Adjust commit message about !XSAVE guests
>  * Rebase over boot time cross check
>  * Use raw policy

... it should probably have occurred to me earlier on to ask: Why raw policy?
Isn't the host one the more appropriate one to use for any kind of internal
decisions?

Jan



Re: [PATCH 4/7] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> @@ -611,6 +587,40 @@ static bool valid_xcr0(uint64_t xcr0)
>  return true;
>  }
>  
> +unsigned int xstate_uncompressed_size(uint64_t xcr0)
> +{
> +unsigned int size = XSTATE_AREA_MIN_SIZE, i;
> +
> +ASSERT((xcr0 & ~X86_XCR0_STATES) == 0);

I'm puzzled by the combination of this assertion and ...

> +if ( xcr0 == xfeature_mask )
> +return xsave_cntxt_size;

... this conditional return. Yes, right now we don't support/use any XSS
components, but without any comment the assertion looks overly restrictive
to me.

> @@ -818,14 +834,14 @@ void xstate_init(struct cpuinfo_x86 *c)
>   * xsave_cntxt_size is the max size required by enabled features.
>   * We know FP/SSE and YMM about eax, and nothing about edx at 
> present.
>   */
> -xsave_cntxt_size = hw_uncompressed_size(feature_mask);
> +xsave_cntxt_size = cpuid_count_ebx(0xd, 0);
>  printk("xstate: size: %#x and states: %#"PRIx64"\n",
> xsave_cntxt_size, xfeature_mask);
>  }
>  else
>  {
>  BUG_ON(xfeature_mask != feature_mask);
> -BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
> +BUG_ON(xsave_cntxt_size != cpuid_count_ebx(0xd, 0));
>  }

Hmm, this may make re-basing of said earlier patch touching this code yet
more interesting. Or maybe it actually simplifies things, will need to see
... The overall comment remains though: Patches pending for so long should
really take priority over creating yet more new ones. But what do I do - I
can't enforce this, unless I was now going to block your work the same way.
Which I don't mean to do.

Jan



Re: [PATCH 7/7] x86/defns: Clean up X86_{XCR0,XSS}_* constants

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> With the exception of one case in read_bndcfgu() which can use ilog2(),
> the *_POS defines are unused.
> 
> X86_XCR0_X87 is the name used by both the SDM and APM, rather than
> X86_XCR0_FP.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Jan Beulich 





Re: [PATCH 3/7] x86/boot: Collect the Raw CPU Policy earlier on boot

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> This is a tangle, but it's a small step in the right direction.
> 
> xstate_init() is shortly going to want data from the Raw policy.
> calculate_raw_cpu_policy() is sufficiently separate from the other policies to
> be safe to do.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Would you mind taking a look at
https://lists.xen.org/archives/html/xen-devel/2021-04/msg01335.html
to make clear (to me at least) how far we can perhaps find common ground
on what wants doing when? (Of course the local version I have has been
constantly re-based, so some of the function names would have changed from
what's visible there.)

> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -845,7 +845,6 @@ static void __init calculate_hvm_def_policy(void)
>  
>  void __init init_guest_cpu_policies(void)
>  {
> -calculate_raw_cpu_policy();
>  calculate_host_policy();
>  
>  if ( IS_ENABLED(CONFIG_PV) )
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1888,7 +1888,9 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> long mbi_p)
>  
>  tsx_init(); /* Needs microcode.  May change HLE/RTM feature bits. */
>  
> -identify_cpu(&boot_cpu_data);
> +calculate_raw_cpu_policy(); /* Needs microcode.  No other dependencies. */
> +
> +identify_cpu(&boot_cpu_data); /* Needs microcode and raw policy. */

You don't introduce any dependency on raw policy here, and there cannot possibly
have been such a dependency before (unless there was a bug somewhere). Therefore
I consider this latter comment misleading at this point.

Jan



Re: [PATCH 2/7] x86/xstate: Cross-check dynamic XSTATE sizes at boot

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> Right now, xstate_ctxt_size() performs a cross-check of size with CPUID in for
> every call.  This is expensive, being used for domain create/migrate, as well
> as to service certain guest CPUID instructions.
> 
> Instead, arrange to check the sizes once at boot.  See the code comments for
> details.  Right now, it just checks hardware against the algorithm
> expectations.  Later patches will add further cross-checking.
> 
> Introduce the missing X86_XCR0_* and X86_XSS_* constants, and a couple of
> missing CPUID bits.  This is to maximise coverage in the sanity check, even if
> we don't expect to use/virtualise some of these features any time soon.  Leave
> HDC and HWP alone for now.  We don't have CPUID bits from them stored nicely.

Since you say "the missing", ...

> --- a/xen/arch/x86/include/asm/x86-defns.h
> +++ b/xen/arch/x86/include/asm/x86-defns.h
> @@ -77,7 +77,7 @@
>  #define X86_CR4_PKS0x0100 /* Protection Key Supervisor */
>  
>  /*
> - * XSTATE component flags in XCR0
> + * XSTATE component flags in XCR0 | MSR_XSS
>   */
>  #define X86_XCR0_FP_POS   0
>  #define X86_XCR0_FP   (1ULL << X86_XCR0_FP_POS)
> @@ -95,11 +95,34 @@
>  #define X86_XCR0_ZMM  (1ULL << X86_XCR0_ZMM_POS)
>  #define X86_XCR0_HI_ZMM_POS   7
>  #define X86_XCR0_HI_ZMM   (1ULL << X86_XCR0_HI_ZMM_POS)
> +#define X86_XSS_PROC_TRACE(_AC(1, ULL) <<  8)
>  #define X86_XCR0_PKRU_POS 9
>  #define X86_XCR0_PKRU (1ULL << X86_XCR0_PKRU_POS)
> +#define X86_XSS_PASID (_AC(1, ULL) << 10)
> +#define X86_XSS_CET_U (_AC(1, ULL) << 11)
> +#define X86_XSS_CET_S (_AC(1, ULL) << 12)
> +#define X86_XSS_HDC   (_AC(1, ULL) << 13)
> +#define X86_XSS_UINTR (_AC(1, ULL) << 14)
> +#define X86_XSS_LBR   (_AC(1, ULL) << 15)
> +#define X86_XSS_HWP   (_AC(1, ULL) << 16)
> +#define X86_XCR0_TILE_CFG (_AC(1, ULL) << 17)
> +#define X86_XCR0_TILE_DATA(_AC(1, ULL) << 18)

... I'm wondering if you deliberately left out APX (bit 19).

Since you're re-doing some of what I have long had in patches already,
I'd also like to ask whether the final underscore in each of the two AMX
names is really useful in your opinion. While rebasing isn't going
to be difficult either way, it would be yet simpler with
X86_XCR0_TILECFG and X86_XCR0_TILEDATA, as I've had it in my patches
for over 3 years.
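
For concreteness, the spellings in question would look like this (just a
sketch to illustrate the naming; these lines are not part of the quoted
patch, and placing APX at bit 19 is inferred only from the question above):

    #define X86_XCR0_APX      (_AC(1, ULL) << 19)
    #define X86_XCR0_TILECFG  (_AC(1, ULL) << 17)
    #define X86_XCR0_TILEDATA (_AC(1, ULL) << 18)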

> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -604,9 +604,156 @@ static bool valid_xcr0(uint64_t xcr0)
>  if ( !(xcr0 & X86_XCR0_BNDREGS) != !(xcr0 & X86_XCR0_BNDCSR) )
>  return false;
>  
> +/* TILE_CFG and TILE_DATA must be the same. */
> +if ( !(xcr0 & X86_XCR0_TILE_CFG) != !(xcr0 & X86_XCR0_TILE_DATA) )
> +return false;
> +
>  return true;
>  }
>  
> +struct xcheck_state {
> +uint64_t states;
> +uint32_t uncomp_size;
> +uint32_t comp_size;
> +};
> +
> +static void __init check_new_xstate(struct xcheck_state *s, uint64_t new)
> +{
> +uint32_t hw_size;
> +
> +BUILD_BUG_ON(X86_XCR0_STATES & X86_XSS_STATES);
> +
> +BUG_ON(s->states & new); /* States only increase. */
> +BUG_ON(!valid_xcr0(s->states | new)); /* Xen thinks it's a good value. */
> +BUG_ON(new & ~(X86_XCR0_STATES | X86_XSS_STATES)); /* Known state. */
> +BUG_ON((new & X86_XCR0_STATES) &&
> +   (new & X86_XSS_STATES)); /* User or supervisor, not both. */
> +
> +s->states |= new;
> +if ( new & X86_XCR0_STATES )
> +{
> +if ( !set_xcr0(s->states & X86_XCR0_STATES) )
> +BUG();
> +}
> +else
> +set_msr_xss(s->states & X86_XSS_STATES);
> +
> +/*
> + * Check the uncompressed size.  Some XSTATEs are out-of-order and fill 
> in
> + * prior holes in the state area, so we check that the size doesn't
> + * decrease.
> + */
> +hw_size = cpuid_count_ebx(0xd, 0);
> +
> +if ( hw_size < s->uncomp_size )
> +panic("XSTATE 0x%016"PRIx64", new bits {%63pbl}, uncompressed hw 
> size %#x < prev size %#x\n",
> +  s->states, , hw_size, s->uncomp_size);
> +
> +s->uncomp_size = hw_size;
> +
> +/*
> + * Check the compressed size, if available.  All components strictly
> + * appear in index order.  In principle there are no holes, but some
> + * components have their base address 64-byte aligned for efficiency
> + * reasons (e.g. AMX-TILE) and there are other components small enough to
> + * fit in the gap (e.g. PKRU) without increasing the overall length.
> + */
> +hw_size = cpuid_count_ebx(0xd, 1);
> +
> +if ( cpu_has_xsavec )
> +{
> +if ( hw_size < s->comp_size )
> +panic("XSTATE 0x%016"PRIx64", new bits {%63pbl}, compressed hw 
> size %#x < prev size %#x\n",
> +  s->states, , hw_size, s->comp_size);
> +
> +s->comp_size = hw_size;
> +}
> +else

Re: [PATCH 1/7] x86/xstate: Fix initialisation of XSS cache

2024-05-23 Thread Jan Beulich
On 23.05.2024 13:16, Andrew Cooper wrote:
> The clobbering of this_cpu(xcr0) and this_cpu(xss) to architecturally invalid
> values is to force the subsequent set_xcr0() and set_msr_xss() to reload the
> hardware register.
> 
> While XCR0 is reloaded in xstate_init(), MSR_XSS isn't.  This causes
> get_msr_xss() to return the invalid value, and logic of the form:
> 
>   old = get_msr_xss();
>   set_msr_xss(new);
>   ...
>   set_msr_xss(old);
> 
> to try and restore the architecturally invalid value.
> 
> The architecturally invalid value must be purged from the cache, meaning the
> hardware register must be written at least once.  This in turn highlights that
> the invalid value must only be used in the case that the hardware register is
> available.
> 
> Fixes: f7f4a523927f ("x86/xstate: reset cached register values on resume")
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

However, I view it as pretty unfair that now I will need to re-base
https://lists.xen.org/archives/html/xen-devel/2021-04/msg01336.html
over ...

> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -641,13 +641,6 @@ void xstate_init(struct cpuinfo_x86 *c)
>  return;
>  }
>  
> -/*
> - * Zap the cached values to make set_xcr0() and set_msr_xss() really
> - * write it.
> - */
> -this_cpu(xcr0) = 0;
> -this_cpu(xss) = ~0;
> -
>  cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
>  feature_mask = (((u64)edx << 32) | eax) & XCNTXT_MASK;
>  BUG_ON(!valid_xcr0(feature_mask));
> @@ -657,8 +650,19 @@ void xstate_init(struct cpuinfo_x86 *c)
>   * Set CR4_OSXSAVE and run "cpuid" to get xsave_cntxt_size.
>   */
>  set_in_cr4(X86_CR4_OSXSAVE);
> +
> +/*
> + * Zap the cached values to make set_xcr0() and set_msr_xss() really 
> write
> + * the hardware register.
> + */
> +this_cpu(xcr0) = 0;
>  if ( !set_xcr0(feature_mask) )
>  BUG();
> +if ( cpu_has_xsaves )
> +{
> +this_cpu(xss) = ~0;
> +set_msr_xss(0);
> +}

... this change, kind of breaking again your nice arrangement. Seeing for how
long that change has been pending, it _really_ should have gone in ahead of
this one, with you then sorting how you'd like things to be arranged in the
combined result, rather than me re-posting and then either again not getting
any feedback for years, or you disliking what I've done. Oh well ...

Jan



Re: [xen-4.17-testing test] 186087: regressions - FAIL

2024-05-23 Thread Jan Beulich
On 23.05.2024 16:40, osstest service owner wrote:
> flight 186087 xen-4.17-testing real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/186087/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  build-amd64   6 xen-buildfail REGR. vs. 
> 185864
>  build-amd64-xsm   6 xen-buildfail REGR. vs. 
> 185864
>  build-i386-xsm6 xen-buildfail REGR. vs. 
> 185864
>  build-i3866 xen-buildfail REGR. vs. 
> 185864
>  build-amd64-prev  6 xen-buildfail REGR. vs. 
> 185864
>  build-i386-prev   6 xen-buildfail REGR. vs. 
> 185864

These look to be recurring, yet at the same time they look to be infrastructure
issues. Since this isn't happening for the first time, I'm not sure we can
simply wait and hope for the problem to clear itself.

Jan



Re: [XEN PATCH v2 07/15] x86: guard cpu_has_{svm/vmx} macros with CONFIG_{SVM/VMX}

2024-05-23 Thread Jan Beulich
On 23.05.2024 15:07, Sergiy Kibrik wrote:
> 16.05.24 14:12, Jan Beulich:
>> On 15.05.2024 11:12, Sergiy Kibrik wrote:
>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>>> @@ -81,7 +81,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>   #define cpu_has_sse3boot_cpu_has(X86_FEATURE_SSE3)
>>>   #define cpu_has_pclmulqdq   boot_cpu_has(X86_FEATURE_PCLMULQDQ)
>>>   #define cpu_has_monitor boot_cpu_has(X86_FEATURE_MONITOR)
>>> -#define cpu_has_vmx boot_cpu_has(X86_FEATURE_VMX)
>>> +#define cpu_has_vmx ( IS_ENABLED(CONFIG_VMX) && \
>>> +  boot_cpu_has(X86_FEATURE_VMX))
>>>   #define cpu_has_eistboot_cpu_has(X86_FEATURE_EIST)
>>>   #define cpu_has_ssse3   boot_cpu_has(X86_FEATURE_SSSE3)
>>>   #define cpu_has_fma boot_cpu_has(X86_FEATURE_FMA)
>>> @@ -109,7 +110,8 @@ static inline bool boot_cpu_has(unsigned int feat)
>>>   
>>>   /* CPUID level 0x8001.ecx */
>>>   #define cpu_has_cmp_legacy  boot_cpu_has(X86_FEATURE_CMP_LEGACY)
>>> -#define cpu_has_svm boot_cpu_has(X86_FEATURE_SVM)
>>> +#define cpu_has_svm ( IS_ENABLED(CONFIG_SVM) && \
>>> +  boot_cpu_has(X86_FEATURE_SVM))
>>>   #define cpu_has_sse4a   boot_cpu_has(X86_FEATURE_SSE4A)
>>>   #define cpu_has_xop boot_cpu_has(X86_FEATURE_XOP)
>>>   #define cpu_has_skinit  boot_cpu_has(X86_FEATURE_SKINIT)
>>
>> Hmm, leaving aside the style issue (stray blanks after opening parentheses,
>> and as a result one-off indentation on the wrapped lines) I'm not really
>> certain we can do this. The description goes into detail why we would want
>> this, but it doesn't cover at all why it is safe for all present (and
>> ideally also future) uses. I wouldn't be surprised if we had VMX/SVM checks
>> just to derive further knowledge from that, without them being directly
>> related to the use of VMX/SVM. Take a look at calculate_hvm_max_policy(),
>> for example. While it looks to be okay there, it may give you an idea of
>> what I mean.
>>
>> Things might become better separated if instead for such checks we used
>> host and raw CPU policies instead of cpuinfo_x86.x86_capability[]. But
>> that's still pretty far out, I'm afraid.
> 
> I've followed a suggestion you made for patch in previous series:
> 
> https://lore.kernel.org/xen-devel/8fbd604e-5e5d-410c-880f-2ad257bbe...@suse.com/

See the "If not, ..." that I had put there. Doing the change just mechanically
isn't enough, you also need to make clear (in the description) that you
verified it's safe to have it this way.

> yet if this approach can potentially be unsafe (I'm not completely sure 
> it's safe), should we instead fall back to the way it was done in the v1
> series? I.e. guard the vmx/svm-specific calls where needed, like in
> these 3 patches:
> 
> 1) 
> https://lore.kernel.org/xen-devel/20240416063328.3469386-1-sergiy_kib...@epam.com/
> 
> 2) 
> https://lore.kernel.org/xen-devel/20240416063740.3469592-1-sergiy_kib...@epam.com/
> 
> 3) 
> https://lore.kernel.org/xen-devel/20240416063947.3469718-1-sergiy_kib...@epam.com/

I don't like this sprinkling around of IS_ENABLED() very much. Maybe we want
to have two new helpers (say using_svm() and using_vmx()), to be used in place
of most but possibly not all cpu_has_{svm,vmx}? Doing such a transformation
would then kind of implicitly answer the safety question above, as at every
use site you'd need to judge whether the replacement is correct. If it's
correct everywhere, the construct(s) as proposed in this version could then be
considered to be used in this very shape (instead of introducing the two new
helpers). But of course the transition could also be done gradually then,
touching only those uses that previously you touched in 1), 2), and 3).
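
For illustration, such helpers might be as simple as (only a sketch; the
exact bodies are an assumption, not something worked out here):

    #define using_vmx() (IS_ENABLED(CONFIG_VMX) && boot_cpu_has(X86_FEATURE_VMX))
    #define using_svm() (IS_ENABLED(CONFIG_SVM) && boot_cpu_has(X86_FEATURE_SVM))

with plain cpu_has_{svm,vmx} then retaining their current, un-gated meaning.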

Jan



Re: [XEN PATCH v2 06/15] x86/p2m: guard altp2m code with CONFIG_ALTP2M option

2024-05-23 Thread Jan Beulich
On 23.05.2024 12:44, Sergiy Kibrik wrote:
> 16.05.24 14:01, Jan Beulich:
>> On 15.05.2024 11:10, Sergiy Kibrik wrote:
>>> --- a/xen/arch/x86/include/asm/hvm/hvm.h
>>> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
>>> @@ -670,7 +670,7 @@ static inline bool hvm_hap_supported(void)
>>>   /* returns true if hardware supports alternate p2m's */
>>>   static inline bool hvm_altp2m_supported(void)
>>>   {
>>> -return hvm_funcs.caps.altp2m;
>>> +return IS_ENABLED(CONFIG_ALTP2M) && hvm_funcs.caps.altp2m;
>>
>> Which in turn raises the question whether the altp2m struct field shouldn't
>> become conditional upon CONFIG_ALTP2M too (or rather: instead, as the change
>> here then would need to be done differently). Yet maybe that would entail
>> further changes elsewhere, so may well better be left for later.
> 
>   but hvm_funcs.caps.altp2m is only a capability bit -- is it worth
> making it conditional?

Well, the comment was more based on the overall principle than the actual
space savings that might result. Plus as said - likely that would not work
anyway without further changes elsewhere. So perhaps okay to leave as you
have it.

>>> --- a/xen/arch/x86/mm/Makefile
>>> +++ b/xen/arch/x86/mm/Makefile
>>> @@ -1,7 +1,7 @@
>>>   obj-y += shadow/
>>>   obj-$(CONFIG_HVM) += hap/
>>>   
>>> -obj-$(CONFIG_HVM) += altp2m.o
>>> +obj-$(CONFIG_ALTP2M) += altp2m.o
>>
>> This change I think wants to move to patch 5.
>>
> 
> If this moves to patch 5 then HVM=y && ALTP2M=n configuration 
> combination will break the build in between patch 5 and 6, so I've 
> decided to put it together with fixes of these build failures in patch 6.

Hmm, yes, I think I see what you mean.

> Maybe I can merge patch 5 & 6 together then ?

Perhaps more consistent that way, yes.

Jan



Re: [PATCH v2 2/8] xen/x86: Simplify header dependencies in x86/hvm

2024-05-23 Thread Jan Beulich
On 23.05.2024 16:37, Roger Pau Monné wrote:
> On Wed, May 08, 2024 at 01:39:21PM +0100, Alejandro Vallejo wrote:
>> --- a/xen/arch/x86/include/asm/hvm/hvm.h
>> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
>> @@ -798,6 +798,12 @@ static inline void hvm_update_vlapic_mode(struct vcpu 
>> *v)
>>  alternative_vcall(hvm_funcs.update_vlapic_mode, v);
>>  }
>>  
>> +static inline void hvm_vlapic_sync_pir_to_irr(struct vcpu *v)
>> +{
>> +if ( hvm_funcs.sync_pir_to_irr )
>> +alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
> 
> Nit: for consistency the wrappers are usually named hvm_,
> so in this case it would be hvm_sync_pir_to_irr(), or the hvm_funcs
> field should be renamed to vlapic_sync_pir_to_irr.

Funny you should mention that: See my earlier comment as well as what
was committed.

Jan




Re: [xen-unstable-smoke test] 186107: regressions - FAIL

2024-05-23 Thread Jan Beulich
On 23.05.2024 15:45, osstest service owner wrote:
> flight 186107 xen-unstable-smoke real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/186107/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  build-armhf   6 xen-buildfail REGR. vs. 
> 186064

Found ninja-1.11.1 at /usr/bin/ninja

ERROR: Clock skew detected. File /usr/bin/bash has a time stamp 
1682259478.4465s in the future.

A full log can be found at 
/home/osstest/build.186107.build-armhf/xen/tools/qemu-xen-build/meson-logs/meson-log.txt

ERROR: meson setup failed

make: Entering directory 
'/home/osstest/build.186107.build-armhf/xen/tools/qemu-xen-build'
config-host.mak is out-of-date, running configure
  GIT ui/keycodemapdb meson tests/fp/berkeley-testfloat-3 
tests/fp/berkeley-softfloat-3 dtc
bash: line 4: ./config.status: No such file or directory
make: *** No rule to make target 'config-host.mak', needed by 
'Makefile.prereqs'.  Stop.
make: *** Waiting for unfinished jobs
make: Leaving directory 
'/home/osstest/build.186107.build-armhf/xen/tools/qemu-xen-build'
make[2]: *** [Makefile:212: subdir-all-qemu-xen-dir] Error 2
make[2]: Leaving directory '/home/osstest/build.186107.build-armhf/xen/tools'
make[1]: *** 
[/home/osstest/build.186107.build-armhf/xen/tools/../tools/Rules.mk:199: 
subdirs-all] Error 2
make[1]: Leaving directory '/home/osstest/build.186107.build-armhf/xen/tools'
make: *** [Makefile:63: build-tools] Error 2

This suggests to me that there's some issue with the build host.

Jan




Re: [PATCH v4 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-23 Thread Jan Beulich
On 23.05.2024 16:22, Marek Marczykowski-Górecki wrote:
> On Wed, May 22, 2024 at 05:39:02PM +0200, Marek Marczykowski-Górecki wrote:
>> On older systems, XHCI xcap had a layout that no other (interesting) 
>> registers
>> were placed on the same page as the debug capability, so Linux was fine with
>> making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
>> needs to write to some other registers on the same page too.
>>
>> Add a generic API for making just parts of an MMIO page R/O and use it to fix
>> USB3 console with share=yes or share=hwdom options. More details in commit
>> messages.
>>
>> Marek Marczykowski-Górecki (2):
>>   x86/mm: add API for marking only part of a MMIO page read only
>>   drivers/char: Use sub-page ro API to make just xhci dbc cap RO
> 
> Does any other x86 maintainer feel comfortable ack-ing this series? Jan
> already reviewed 2/2 here (but not 1/2 in this version),

Which, btw, isn't to mean I'm not going to look at it. But 2/2 was the
lower hanging fruit ...

Jan

> but also said
> he is not comfortable with letting this in without a second maintainer
> approval: 
> https://lore.kernel.org/xen-devel/7655e401-b927-4250-ae63-05361a5ee...@suse.com/




Re: [PATCH v4 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-23 Thread Jan Beulich
On 22.05.2024 17:39, Marek Marczykowski-Górecki wrote:
> Not the whole page, which may contain other registers too. The XHCI
> specification describes DbC as designed to be controlled by a different
> driver, but does not mandate placing registers on a separate page. In fact
> on Tiger Lake and newer (at least), this page does contain other registers
> that Linux tries to use. And with share=yes, a domU would use them too.
> Without this patch, PV dom0 would fail to initialize the controller,
> while HVM would be killed on EPT violation.
> 
> With `share=yes`, this patch gives domU more access to the emulator
> (although a HVM with any emulated device already has plenty of it). This
> configuration is already documented as unsafe with untrusted guests and
> not security supported.
> 
> Signed-off-by: Marek Marczykowski-Górecki 

Reviewed-by: Jan Beulich 





Re: [PATCH] docs/misra: rules for mass adoption

2024-05-23 Thread Jan Beulich
On 23.05.2024 03:26, Stefano Stabellini wrote:
> @@ -725,12 +787,25 @@ maintainers if you want to suggest a change.
>   - The Standard Library function system of  shall not be used
>   -
>  
> +   * - `Rule 22.1 
> `_
> + - Required
> + - All resources obtained dynamically by means of Standard Library
> +   functions shall be explicitly released
> + -
> + - Xen doesn't provide, use, or link against a Standard Library 
> [#xen-stdlib]_

The empty sub-bullet-point looks stray here.

> @@ -748,6 +823,31 @@ maintainers if you want to suggest a change.
> stream has been closed
>   -
>  
> +   * - `Rule 22.7 
> `_
> + - Required
> + - The macro EOF shall only be compared with the unmodified return
> +   value from any Standard Library function capable of returning EOF
> + - Xen doesn't provide, use, or link against a Standard Library 
> [#xen-stdlib]_

Shouldn't this remark also be replicated ...

> +   * - `Rule 22.8 
> `_
> + - Required
> + - The value of errno shall be set to zero prior to a call to an
> +   errno-setting-function
> + -
> +
> +   * - `Rule 22.9 
> `_
> + - Required
> + - The value of errno shall be tested against zero after calling an
> +   errno-setting-function
> + -
> +
> +   * - `Rule 22.10 
> `_
> + - Required
> + - The value of errno shall only be tested when the last function to
> +   be called was an errno-setting-function
> + -

... for all three of these, seeing that errno is something a (standard) library
would provide? Or alternatively should remarks here say that we simply have no
errno?

Jan



Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-23 Thread Jan Beulich
On 22.05.2024 18:21, Roger Pau Monné wrote:
> On Wed, May 22, 2024 at 03:34:29PM +0200, Jan Beulich wrote:
>> On 22.05.2024 15:16, Roger Pau Monné wrote:
>>> On Tue, May 21, 2024 at 12:30:32PM +0200, Jan Beulich wrote:
>>>> On 17.05.2024 15:33, Roger Pau Monne wrote:
>>>>> Enabling it using an HVM param is fragile, and complicates the logic when
>>>>> deciding whether options that interact with altp2m can also be enabled.
>>>>>
>>>>> Leave the HVM param value for consumption by the guest, but prevent it 
>>>>> from
> being set.  Enabling is now done using an additional altp2m specific 
>>>>> field in
>>>>> xen_domctl_createdomain.
>>>>>
>>>>> Note that albeit only currently implemented in x86, altp2m could be 
>>>>> implemented
>>>>> in other architectures, hence why the field is added to 
>>>>> xen_domctl_createdomain
>>>>> instead of xen_arch_domainconfig.
>>>>>
>>>>> Signed-off-by: Roger Pau Monné 
>>>>
>>>> Reviewed-by: Jan Beulich  # hypervisor
>>>> albeit with one question:
>>>>
>>>>> --- a/xen/arch/x86/domain.c
>>>>> +++ b/xen/arch/x86/domain.c
>>>>> @@ -637,6 +637,8 @@ int arch_sanitise_domain_config(struct 
>>>>> xen_domctl_createdomain *config)
>>>>>  bool hap = config->flags & XEN_DOMCTL_CDF_hap;
>>>>>  bool nested_virt = config->flags & XEN_DOMCTL_CDF_nested_virt;
>>>>>  unsigned int max_vcpus;
>>>>> +unsigned int altp2m_mode = MASK_EXTR(config->altp2m_opts,
>>>>> + XEN_DOMCTL_ALTP2M_mode_mask);
>>>>>  
>>>>>  if ( hvm ? !hvm_enabled : !IS_ENABLED(CONFIG_PV) )
>>>>>  {
>>>>> @@ -715,6 +717,26 @@ int arch_sanitise_domain_config(struct 
>>>>> xen_domctl_createdomain *config)
>>>>>  return -EINVAL;
>>>>>  }
>>>>>  
>>>>> +if ( config->altp2m_opts & ~XEN_DOMCTL_ALTP2M_mode_mask )
>>>>> +{
>>>>> +dprintk(XENLOG_INFO, "Invalid altp2m options selected: %#x\n",
>>>>> +config->flags);
>>>>> +return -EINVAL;
>>>>> +}
>>>>> +
>>>>> +if ( altp2m_mode && nested_virt )
>>>>> +{
>>>>> +dprintk(XENLOG_INFO,
>>>>> +"Nested virt and altp2m are not supported together\n");
>>>>> +return -EINVAL;
>>>>> +}
>>>>> +
>>>>> +if ( altp2m_mode && !hap )
>>>>> +{
>>>>> +dprintk(XENLOG_INFO, "altp2m is only supported with HAP\n");
>>>>> +return -EINVAL;
>>>>> +}
>>>>
>>>> Should this last one perhaps be further extended to permit altp2m with EPT
>>>> only?
>>>
>>> Hm, yes, that would be more accurate as:
>>>
>>> if ( altp2m_mode && (!hap || !hvm_altp2m_supported()) )
>>
>> Wouldn't
>>
>>if ( altp2m_mode && !hvm_altp2m_supported() )
>>
>> suffice? hvm_funcs.caps.altp2m is not supposed to be set when no HAP,
>> as long as HAP continues to be a pre-condition?
> 
> No, `hap` here signals whether the domain is using HAP, and we need to
> take this int account, otherwise we would allow enabling altp2m for
> domains using shadow.

Oh, right. But then the original form is good enough HAP-wise, as a request
to use HAP when HAP isn't available is dealt with elsewhere. The
!hvm_altp2m_supported() is still wanted imo (for there potentially being
other restrictions), but then in a separate check, not resulting in a HAP-
specific log message. I'll commit the patch in its original form, and that
further addition can then be an incremental change.
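
Concretely, such an incremental change might add (sketch only; the exact
wording of the message is an assumption):

    if ( altp2m_mode && !hvm_altp2m_supported() )
    {
        dprintk(XENLOG_INFO, "altp2m is not supported on this hardware\n");
        return -EINVAL;
    }

as a check separate from the HAP-specific one.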

Jan



Re: [PATCH v3 2/2] x86: detect PIT aliasing on ports other than 0x4[0-3]

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:57, Jason Andryuk wrote:
> On 2024-05-22 08:59, Jan Beulich wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -427,6 +427,74 @@ static struct platform_timesource __init
>>   .resume = resume_pit,
>>   };
>>   
>> +unsigned int __initdata pit_alias_mask;
>> +
>> +static void __init probe_pit_alias(void)
>> +{
>> +unsigned int mask = 0x1c;
>> +uint8_t val = 0;
>> +
>> +if ( !opt_probe_port_aliases )
>> +return;
>> +
>> +/*
>> + * Use channel 2 in mode 0 for probing.  In this mode even a non-initial
>> + * count is loaded independent of counting being / becoming enabled.  
>> Thus
>> + * we have a 16-bit value fully under our control, to write and then 
>> check
>> + * whether we can also read it back unaltered.
>> + */
>> +
>> +/* Turn off speaker output and disable channel 2 counting. */
>> +outb(inb(0x61) & 0x0c, 0x61);
>> +
>> +outb(PIT_LTCH_CH(2) | PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY,
>> + PIT_MODE);
>> +
>> +do {
>> +uint8_t val2;
>> +unsigned int offs;
>> +
>> +outb(val, PIT_CH2);
>> +outb(val ^ 0xff, PIT_CH2);
>> +
>> +/* Wait for the Null Count bit to clear. */
>> +do {
>> +/* Latch status. */
>> +outb(PIT_RDB | PIT_RDB_NO_COUNT | PIT_RDB_CH2, PIT_MODE);
>> +
>> +/* Try to make sure we're actually having a PIT here. */
>> +val2 = inb(PIT_CH2);
>> +if ( (val2 & ~(PIT_STATUS_OUT_PIN | PIT_STATUS_NULL_COUNT)) !=
>> + (PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY) )
>> +return;
>> +} while ( val2 & (1 << 6) );
> 
> You can use PIT_STATUS_NULL_COUNT here.

Indeed, and I meant to but then forgot. Thanks for noticing.

> With that:
> Reviewed-by: Jason Andryuk 

Thanks.

Jan



Re: [PATCH v2 4/4] x86/shadow: Don't leave trace record field uninitialized

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:17, Andrew Cooper wrote:
> From: Jan Beulich 
> 
> The emulation_count field is set only conditionally right now. Convert
> all field setting to an initializer, thus guaranteeing that field to be
> set to 0 (default initialized) when GUEST_PAGING_LEVELS != 3.
> 
> Rework trace_shadow_emulate() to be consistent with the other trace helpers.
> 
> Coverity-ID: 1598430
> Fixes: 9a86ac1aa3d2 ("xentrace 5/7: Additional tracing for the shadow code")
> Signed-off-by: Jan Beulich 
> Acked-by: Roger Pau Monné 

Your additional changes look pretty much independent of what my original
patch did. I don't mind the folding though, yet I think you need to add
your own S-o-b as well. Then in turn
Acked-by: Jan Beulich 

Jan



Re: [PATCH v2 3/4] x86/shadow: Rework trace_shadow_emulate_other() as sh_trace_gfn_va()

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:17, Andrew Cooper wrote:
> sh_trace_gfn_va() is very similar to sh_trace_gl1e_va(), and a rather shorter
> name than trace_shadow_emulate_other().  Like sh_trace_gl1e_va(), there is no
> need to pack the trace record.

Provided record size can freely change (here for the 3-level case) without
breaking consumers, i.e. similar to patch 2.

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -2010,29 +2010,30 @@ static void sh_trace_gl1e_va(uint32_t event, 
> guest_l1e_t gl1e, guest_va_t va)
>  }
>  }
>  
> -static inline void trace_shadow_emulate_other(u32 event,
> - guest_va_t va,
> - gfn_t gfn)
> +/* Shadow trace event with a gfn, linear address and flags. */
> +static void sh_trace_gfn_va(uint32_t event, gfn_t gfn, guest_va_t va)
>  {
>  if ( tb_init_done )
>  {
> -struct __packed {
> -/* for PAE, guest_l1e may be 64 while guest_va may be 32;
> -   so put it first for alignment sake. */
> +struct {
> +/*
> + * For GUEST_PAGING_LEVELS=3 (PAE paging), gfn is 64 while
> + * guest_va is 32.  Put it first to avoid padding.
> + */
>  #if GUEST_PAGING_LEVELS == 2
> -u32 gfn;
> +uint32_t gfn;
>  #else
> -u64 gfn;
> +uint64_t gfn;
>  #endif
>  guest_va_t va;
> -} d;
> -
> -event |= ((GUEST_PAGING_LEVELS-2)<<8);
> -
> -d.gfn=gfn_x(gfn);
> -d.va = va;
> +uint32_t flags;
> +} d = {
> +.gfn = gfn_x(gfn),
> +.va = va,
> +.flags = this_cpu(trace_shadow_path_flags),
> +};

There's again no function call involved here, so having tb_init_done checked
only in (inlined) sh_trace() ought to again be enough?
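
To spell out the alternative being suggested (a sketch only, not part of
the series as posted):

    static void sh_trace(uint32_t event, unsigned int extra,
                         const void *extra_data)
    {
        if ( tb_init_done )
            trace(event | ((GUEST_PAGING_LEVELS - 2) << 8), extra, extra_data);
    }

with callers such as sh_trace_gfn_va() then dropping their own tb_init_done
check.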

Jan



Re: [PATCH v2 1/4] x86/shadow: Rework trace_shadow_gen() into sh_trace_va()

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:47, Andrew Cooper wrote:
> On 22/05/2024 2:40 pm, Jan Beulich wrote:
>> On 22.05.2024 15:17, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/mm/shadow/multi.c
>>> +++ b/xen/arch/x86/mm/shadow/multi.c
>>> @@ -1974,13 +1974,17 @@ typedef u32 guest_va_t;
>>>  typedef u32 guest_pa_t;
>>>  #endif
>>>  
>>> -static inline void trace_shadow_gen(u32 event, guest_va_t va)
>>> +/* Shadow trace event with GUEST_PAGING_LEVELS folded into the event 
>>> field. */
>>> +static void sh_trace(uint32_t event, unsigned int extra, const void 
>>> *extra_data)
>>> +{
>>> +trace(event | ((GUEST_PAGING_LEVELS - 2) << 8), extra, extra_data);
>>> +}
>>> +
>>> +/* Shadow trace event with the guest's linear address. */
>>> +static void sh_trace_va(uint32_t event, guest_va_t va)
>>>  {
>>>  if ( tb_init_done )
>>> -{
>>> -event |= (GUEST_PAGING_LEVELS-2)<<8;
>>> -    trace(event, sizeof(va), &va);
>>> -}
>>> +sh_trace(event, sizeof(va), &va);
>>>  }
>> If any tb_init_done check, then perhaps rather in sh_trace()? With that
>> (and provided you agree)
>> Reviewed-by: Jan Beulich 
> 
> Sadly not.  That leads to double reads of tb_init_done when tracing is
> compiled in.

Not here, but I can see how that could happen in principle, when ...

> When GCC can't fully inline the structure initialisation, it can't prove
> that a function call modified tb_init_done.  This is why I arranged all
> the trace cleanup in this way.

... inlining indeed doesn't happen. Patch 2 fits the one here in this regard
(no function calls); I have yet to look at patch 3, though.

But anyway, the present placement, while likely a little redundant, is not
the end of the world, so my R-b holds either way.

Jan



Re: [PATCH v2 2/4] x86/shadow: Introduce sh_trace_gl1e_va()

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:17, Andrew Cooper wrote:
> trace_shadow_fixup() and trace_not_shadow_fault() both write out identical
> trace records.  Reimplement them in terms of a common sh_trace_gl1e_va().
> 
> There's no need to pack the trace record, even in the case of PAE paging.

Isn't this altering the generated trace record for the 4-level case, in
size changing from 20 to 24 bytes?

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -1987,51 +1987,26 @@ static void sh_trace_va(uint32_t event, guest_va_t va)
>  sh_trace(event, sizeof(va), &va);
>  }
>  
> -static inline void trace_shadow_fixup(guest_l1e_t gl1e,
> -  guest_va_t va)
> +/* Shadow trace event with a gl1e, linear address and flags. */
> +static void sh_trace_gl1e_va(uint32_t event, guest_l1e_t gl1e, guest_va_t va)
>  {
>  if ( tb_init_done )
>  {
> -struct __packed {
> -/* for PAE, guest_l1e may be 64 while guest_va may be 32;
> -   so put it first for alignment sake. */
> -guest_l1e_t gl1e;
> -guest_va_t va;
> -u32 flags;
> -} d;
> -u32 event;
> -
> -event = TRC_SHADOW_FIXUP | ((GUEST_PAGING_LEVELS-2)<<8);
> -
> -d.gl1e = gl1e;
> -d.va = va;
> -d.flags = this_cpu(trace_shadow_path_flags);
> -
> -trace(event, sizeof(d), &d);
> -}
> -}
> -
> -static inline void trace_not_shadow_fault(guest_l1e_t gl1e,
> -  guest_va_t va)
> -{
> -if ( tb_init_done )
> -{
> -struct __packed {
> -/* for PAE, guest_l1e may be 64 while guest_va may be 32;
> -   so put it first for alignment sake. */
> +struct {
> +/*
> + * For GUEST_PAGING_LEVELS=3 (PAE paging), guest_l1e is 64 while
> + * guest_va is 32.  Put it first to avoid padding.
> + */
>  guest_l1e_t gl1e;
>  guest_va_t va;
> -u32 flags;
> -} d;
> -u32 event;
> -
> -event = TRC_SHADOW_NOT_SHADOW | ((GUEST_PAGING_LEVELS-2)<<8);
> -
> -d.gl1e = gl1e;
> -d.va = va;
> -d.flags = this_cpu(trace_shadow_path_flags);
> -
> -trace(event, sizeof(d), &d);
> +uint32_t flags;
> +} d = {
> +.gl1e = gl1e,
> +.va = va,
> +.flags = this_cpu(trace_shadow_path_flags),
> +};
> +
> +sh_trace(event, sizeof(d), &d);
>  }
>  }

Unlike in patch 1, it's less clear here whether the tb_init_done check is
actually better kept where it is. In principle the compiler
should be able to re-arrange code enough to make it identical no matter
which way it's written, at which point it might again be more desirable
to have the check solely in sh_trace().

Jan



Re: [PATCH v2 1/4] x86/shadow: Rework trace_shadow_gen() into sh_trace_va()

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:17, Andrew Cooper wrote:
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -1974,13 +1974,17 @@ typedef u32 guest_va_t;
>  typedef u32 guest_pa_t;
>  #endif
>  
> -static inline void trace_shadow_gen(u32 event, guest_va_t va)
> +/* Shadow trace event with GUEST_PAGING_LEVELS folded into the event field. 
> */
> +static void sh_trace(uint32_t event, unsigned int extra, const void 
> *extra_data)
> +{
> +trace(event | ((GUEST_PAGING_LEVELS - 2) << 8), extra, extra_data);
> +}
> +
> +/* Shadow trace event with the guest's linear address. */
> +static void sh_trace_va(uint32_t event, guest_va_t va)
>  {
>  if ( tb_init_done )
> -{
> -event |= (GUEST_PAGING_LEVELS-2)<<8;
> -trace(event, sizeof(va), &va);
> -}
> +sh_trace(event, sizeof(va), &va);
>  }

If any tb_init_done check, then perhaps rather in sh_trace()? With that
(and provided you agree)
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:16, Roger Pau Monné wrote:
> On Tue, May 21, 2024 at 12:30:32PM +0200, Jan Beulich wrote:
>> On 17.05.2024 15:33, Roger Pau Monne wrote:
>>> Enabling it using an HVM param is fragile, and complicates the logic when
>>> deciding whether options that interact with altp2m can also be enabled.
>>>
>>> Leave the HVM param value for consumption by the guest, but prevent it from
> being set.  Enabling is now done using an additional altp2m specific field 
>>> in
>>> xen_domctl_createdomain.
>>>
>>> Note that albeit only currently implemented in x86, altp2m could be 
>>> implemented
>>> in other architectures, hence why the field is added to 
>>> xen_domctl_createdomain
>>> instead of xen_arch_domainconfig.
>>>
>>> Signed-off-by: Roger Pau Monné 
>>
>> Reviewed-by: Jan Beulich  # hypervisor
>> albeit with one question:
>>
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -637,6 +637,8 @@ int arch_sanitise_domain_config(struct 
>>> xen_domctl_createdomain *config)
>>>  bool hap = config->flags & XEN_DOMCTL_CDF_hap;
>>>  bool nested_virt = config->flags & XEN_DOMCTL_CDF_nested_virt;
>>>  unsigned int max_vcpus;
>>> +unsigned int altp2m_mode = MASK_EXTR(config->altp2m_opts,
>>> + XEN_DOMCTL_ALTP2M_mode_mask);
>>>  
>>>  if ( hvm ? !hvm_enabled : !IS_ENABLED(CONFIG_PV) )
>>>  {
>>> @@ -715,6 +717,26 @@ int arch_sanitise_domain_config(struct 
>>> xen_domctl_createdomain *config)
>>>  return -EINVAL;
>>>  }
>>>  
>>> +if ( config->altp2m_opts & ~XEN_DOMCTL_ALTP2M_mode_mask )
>>> +{
>>> +dprintk(XENLOG_INFO, "Invalid altp2m options selected: %#x\n",
>>> +config->flags);
>>> +return -EINVAL;
>>> +}
>>> +
>>> +if ( altp2m_mode && nested_virt )
>>> +{
>>> +dprintk(XENLOG_INFO,
>>> +"Nested virt and altp2m are not supported together\n");
>>> +return -EINVAL;
>>> +}
>>> +
>>> +if ( altp2m_mode && !hap )
>>> +{
>>> +dprintk(XENLOG_INFO, "altp2m is only supported with HAP\n");
>>> +return -EINVAL;
>>> +}
>>
>> Should this last one perhaps be further extended to permit altp2m with EPT
>> only?
> 
> Hm, yes, that would be more accurate as:
> 
> if ( altp2m_mode && (!hap || !hvm_altp2m_supported()) )

Wouldn't

   if ( altp2m_mode && !hvm_altp2m_supported() )

suffice? hvm_funcs.caps.altp2m is not supposed to be set when no HAP,
as long as HAP continues to be a pre-condition?

> Would you be fine adjusting at commit, or would you prefer me to send
> an updated version?

I'd be happy to fold in whatever we settle on.

Jan




Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Jan Beulich
On 22.05.2024 15:22, Marek Marczykowski-Górecki wrote:
> On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> +static void subpage_mmio_write_emulate(
>>> +mfn_t mfn,
>>> +unsigned int offset,
>>> +const void *data,
>>> +unsigned int len)
>>> +{
>>> +struct subpage_ro_range *entry;
>>> +void __iomem *addr;
>>
>> Wouldn't this better be pointer-to-volatile, with ...
> 
> Shouldn't then most other uses of __iomem in the code base be this way
> too? I see volatile only in few places...

Quite likely, yet being consistent at least in new code is certainly
desirable.

Jan



[PATCH v3 2/2] x86: detect PIT aliasing on ports other than 0x4[0-3]

2024-05-22 Thread Jan Beulich
... in order to also deny Dom0 access through the alias ports (commonly
observed on Intel chipsets). Without this it is only giving the
impression of denying access to PIT. Unlike for CMOS/RTC, do detection
pretty early, to avoid disturbing normal operation later on (even if
typically we won't use much of the PIT).

Like for CMOS/RTC a fundamental assumption of the probing is that reads
from the probed alias port won't have side effects (beyond such that PIT
reads have anyway) in case it does not alias the PIT's.

As to the port 0x61 accesses: Unlike other accesses we do, this masks
off the top four bits (in addition to the bottom two ones), following
Intel chipset documentation saying that these (read-only) bits should
only be written with zero.

Signed-off-by: Jan Beulich 
---
If Xen was running on top of another instance of itself (in HVM mode,
not PVH, i.e. not as a shim), prior to 14f42af3f52d ('x86/vPIT: account
for "counter stopped" time') I'm afraid our vPIT logic would not have
allowed the "Try to further make sure ..." check to pass in the Xen
running on top: We don't respect the gate bit being clear when handling
counter reads. (There are more unhandled [and unmentioned as being so]
aspects of PIT behavior though, yet it's unclear in how far addressing
at least some of them would be useful.)
---
v3: Use PIT_* in dom0_setup_permissions(). Use #define-s introduced by
new earlier patch.
v2: Use new command line option. Re-base over changes to earlier
patches. Use ISOLATE_LSB().

--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct memsize {
 long nr_pages;
@@ -495,7 +496,11 @@ int __init dom0_setup_permissions(struct
 rc |= ioports_deny_access(d, 0x4D0, 0x4D1);
 
 /* Interval Timer (PIT). */
-rc |= ioports_deny_access(d, 0x40, 0x43);
+for ( offs = 0, i = ISOLATE_LSB(pit_alias_mask) ?: 4;
+  offs <= pit_alias_mask; offs += i )
+if ( !(offs & ~pit_alias_mask) )
+rc |= ioports_deny_access(d, PIT_CH0 + offs, PIT_MODE + offs);
+
 /* PIT Channel 2 / PC Speaker Control. */
 rc |= ioports_deny_access(d, 0x61, 0x61);
 
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -49,6 +49,7 @@ extern unsigned long highmem_start;
 #endif
 
 extern unsigned int i8259A_alias_mask;
+extern unsigned int pit_alias_mask;
 
 extern int8_t opt_smt;
 extern int8_t opt_probe_port_aliases;
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -427,6 +427,74 @@ static struct platform_timesource __init
 .resume = resume_pit,
 };
 
+unsigned int __initdata pit_alias_mask;
+
+static void __init probe_pit_alias(void)
+{
+unsigned int mask = 0x1c;
+uint8_t val = 0;
+
+if ( !opt_probe_port_aliases )
+return;
+
+/*
+ * Use channel 2 in mode 0 for probing.  In this mode even a non-initial
+ * count is loaded independent of counting being / becoming enabled.  Thus
+ * we have a 16-bit value fully under our control, to write and then check
+ * whether we can also read it back unaltered.
+ */
+
+/* Turn off speaker output and disable channel 2 counting. */
+outb(inb(0x61) & 0x0c, 0x61);
+
+outb(PIT_LTCH_CH(2) | PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY,
+ PIT_MODE);
+
+do {
+uint8_t val2;
+unsigned int offs;
+
+outb(val, PIT_CH2);
+outb(val ^ 0xff, PIT_CH2);
+
+/* Wait for the Null Count bit to clear. */
+do {
+/* Latch status. */
+outb(PIT_RDB | PIT_RDB_NO_COUNT | PIT_RDB_CH2, PIT_MODE);
+
+/* Try to make sure we're actually having a PIT here. */
+val2 = inb(PIT_CH2);
+if ( (val2 & ~(PIT_STATUS_OUT_PIN | PIT_STATUS_NULL_COUNT)) !=
+ (PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY) )
+return;
+} while ( val2 & (1 << 6) );
+
+/*
+ * Try to further make sure we're actually having a PIT here.
+ *
+ * NB: Deliberately |, not ||, as we always want both reads.
+ */
+val2 = inb(PIT_CH2);
+if ( (val2 ^ val) | (inb(PIT_CH2) ^ val ^ 0xff) )
+return;
+
+for ( offs = ISOLATE_LSB(mask); offs <= mask; offs <<= 1 )
+{
+if ( !(mask & offs) )
+continue;
+val2 = inb(PIT_CH2 + offs);
+if ( (val2 ^ val) | (inb(PIT_CH2 + offs) ^ val ^ 0xff) )
+mask &= ~offs;
+}
+} while ( mask && (val += 0x0b) );  /* Arbitrary uneven number. */
+
+if ( mask )
+{
+dprintk(XENLOG_INFO, "PIT aliasing mask: %02x\n", mask);
+pit_alias_mask = mask;
+}
+}
+
 /
  * PLATFORM TIMER 2: HIGH PRECISION EVENT TIMER (HPET)
  */
@@ -2416,6 +2484,8 @@ void __init early

[PATCH v3 1/2] x86/PIT: supply and use #define-s

2024-05-22 Thread Jan Beulich
Help reading of code programming the PIT by introducing constants for
control word, read back and latch commands, as well as status.

Requested-by: Jason Andryuk 
Signed-off-by: Jan Beulich 
---
v3: New.

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -983,7 +983,7 @@ static unsigned int __init get_8254_time
 
 /*spin_lock_irqsave(_lock, flags);*/
 
-outb_p(0x00, PIT_MODE);
+outb_p(PIT_LTCH_CH(0), PIT_MODE);
 count = inb_p(PIT_CH0);
 count |= inb_p(PIT_CH0) << 8;
 
--- a/xen/arch/x86/include/asm/time.h
+++ b/xen/arch/x86/include/asm/time.h
@@ -58,4 +58,38 @@ struct time_scale;
 void set_time_scale(struct time_scale *ts, u64 ticks_per_sec);
 u64 scale_delta(u64 delta, const struct time_scale *scale);
 
+/* Programmable Interval Timer (8254) */
+
+/* Timer Control Word */
+#define PIT_TCW_CH(n) ((n) << 6)
+/* Lower bits also Timer Status. */
+#define PIT_RW_MSB(1 << 5)
+#define PIT_RW_LSB(1 << 4)
+#define PIT_RW_LSB_MSB(PIT_RW_LSB | PIT_RW_MSB)
+#define PIT_MODE_EOC  (0 << 1)
+#define PIT_MODE_ONESHOT  (1 << 1)
+#define PIT_MODE_RATE_GEN (2 << 1)
+#define PIT_MODE_SQUARE_WAVE  (3 << 1)
+#define PIT_MODE_SW_STROBE(4 << 1)
+#define PIT_MODE_HW_STROBE(5 << 1)
+#define PIT_BINARY(0 << 0)
+#define PIT_BCD   (1 << 0)
+
+/* Read Back Command */
+#define PIT_RDB   PIT_TCW_CH(3)
+#define PIT_RDB_NO_COUNT  (1 << 5)
+#define PIT_RDB_NO_STATUS (1 << 4)
+#define PIT_RDB_CH2   (1 << 3)
+#define PIT_RDB_CH1   (1 << 2)
+#define PIT_RDB_CH0   (1 << 1)
+#define PIT_RDB_RSVD  (1 << 0)
+
+/* Counter Latch Command */
+#define PIT_LTCH_CH(n)PIT_TCW_CH(n)
+
+/* Timer Status */
+#define PIT_STATUS_OUT_PIN(1 << 7)
+#define PIT_STATUS_NULL_COUNT (1 << 6)
+/* Lower bits match Timer Control Word. */
+
 #endif /* __X86_TIME_H__ */
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -222,7 +222,7 @@ static void cf_check timer_interrupt(int
 
 spin_lock_irq(_lock);
 
-outb(0x80, PIT_MODE);
+outb(PIT_LTCH_CH(2), PIT_MODE);
 count  = inb(PIT_CH2);
 count |= inb(PIT_CH2) << 8;
 
@@ -245,7 +245,8 @@ static void preinit_pit(void)
 {
 /* Set PIT channel 0 to HZ Hz. */
 #define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
-outb_p(0x34, PIT_MODE);/* binary, mode 2, LSB/MSB, ch 0 */
+outb_p(PIT_TCW_CH(0) | PIT_RW_LSB_MSB | PIT_MODE_RATE_GEN | PIT_BINARY,
+   PIT_MODE);
 outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
 outb(LATCH >> 8, PIT_CH0); /* MSB */
 #undef LATCH
@@ -356,7 +357,7 @@ static u64 cf_check read_pit_count(void)
 
 spin_lock_irqsave(_lock, flags);
 
-outb(0x80, PIT_MODE);
+outb(PIT_LTCH_CH(2), PIT_MODE);
 count16  = inb(PIT_CH2);
 count16 |= inb(PIT_CH2) << 8;
 
@@ -383,7 +384,8 @@ static s64 __init cf_check init_pit(stru
  */
 #define CALIBRATE_LATCH CALIBRATE_VALUE(CLOCK_TICK_RATE)
 BUILD_BUG_ON(CALIBRATE_LATCH >> 16);
-outb(0xb0, PIT_MODE);  /* binary, mode 0, LSB/MSB, Ch 2 */
+outb(PIT_TCW_CH(2) | PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY,
+ PIT_MODE);
 outb(CALIBRATE_LATCH & 0xff, PIT_CH2); /* LSB of count */
 outb(CALIBRATE_LATCH >> 8, PIT_CH2);   /* MSB of count */
 #undef CALIBRATE_LATCH
@@ -408,7 +410,8 @@ static s64 __init cf_check init_pit(stru
 static void cf_check resume_pit(struct platform_timesource *pts)
 {
 /* Set CTC channel 2 to mode 0 again; initial value does not matter. */
-outb(0xb0, PIT_MODE); /* binary, mode 0, LSB/MSB, Ch 2 */
+outb(PIT_TCW_CH(2) | PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY,
+ PIT_MODE);
 outb(0, PIT_CH2); /* LSB of count */
 outb(0, PIT_CH2); /* MSB of count */
 }
@@ -2456,7 +2459,8 @@ static int _disable_pit_irq(bool init)
 }
 
 /* Disable PIT CH0 timer interrupt. */
-outb_p(0x30, PIT_MODE);
+outb_p(PIT_TCW_CH(0) | PIT_RW_LSB_MSB | PIT_MODE_EOC | PIT_BINARY,
+   PIT_MODE);
 outb_p(0, PIT_CH0);
 outb_p(0, PIT_CH0);
 
@@ -2562,17 +2566,18 @@ int hwdom_pit_access(struct ioreq *ioreq
 case PIT_MODE:
 if ( ioreq->dir == IOREQ_READ )
 return 0; /* urk! */
-switch ( ioreq->data & 0xc0 )
+switch ( ioreq->data & PIT_TCW_CH(3) )
 {
-case 0xc0: /* Read Back */
-if ( ioreq->data & 0x08 )/* Select Channel 2? */
-outb(ioreq->data & 0xf8, PIT_MODE);
-if ( !(ioreq->data & 0x06) ) /* Select Channel 0/1? */
+case PIT_RDB: /* Read Back */
+if ( ioreq->data & PIT_RDB_CH2 )
+outb(ioreq->data & ~(PIT_RDB_CH1 | PIT_RDB_CH0 | PIT_RDB_RSVD),
+ PIT_MODE);
+if ( !

[PATCH v3 0/2] x86: detect PIT aliasing on ports other than 0x4[0-3]

2024-05-22 Thread Jan Beulich
1: PIT: supply and use #define-s
2: detect PIT aliasing on ports other than 0x4[0-3]

No functional change from v2, just the introduction of the new prereq
patch to help overall readability.

Jan



Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Jan Beulich
On 22.05.2024 12:36, Marek Marczykowski-Górecki wrote:
> On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned 
>>> long gla,
>>>  goto out_put_gfn;
>>>  }
>>>  
>>> +if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present 
>>> &&
>>> + subpage_mmio_write_accept(mfn, gla) &&
>>
>> Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?
> 
> Right, the PV path hits mmio_ro_emulated_write() without my changes
> already.
> Do you suggest putting subpage_mmio_write_accept() under #ifdef
> CONFIG_HVM?

That's not just me, but also Misra.

>>> + (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
>>> +{
>>> +rc = 1;
>>> +goto out_put_gfn;
>>> +}
>>
>> Overall this new if() is pretty similar to the immediate preceding one.
>> So similar that I wonder whether the two shouldn't be folded. 
> 
> I can do that if you prefer.
> 
>> In fact
>> it looks as if the new one is needed only for the case where you'd pass
>> through (to a DomU) a device partially used by Xen. That could certainly
>> do with mentioning explicitly.
> 
> Well, the change in mmio_ro_emulated_write() is relevant to both dom0
> and domU. It simply wasn't reachable (in this case) for HVM domU before
> (but was for PV already).

The remark was about the code here only. Of course that other change you
talk about is needed for both, and I wasn't meaning to suggest Dom0 had
worked (in this regard) prior to your change.

>>> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
>>> +{
>>> +void __iomem *mapped_page;
>>> +
>>> +if ( entry->mapped )
>>> +return entry->mapped;
>>> +
>>> +mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
>>> +
>>> +spin_lock(_ro_lock);
>>> +/* Re-check under the lock */
>>> +if ( entry->mapped )
>>> +{
>>> +spin_unlock(_ro_lock);
>>> +iounmap(mapped_page);
>>
>> The only unmap is on an error path here and on another error path elsewhere.
>> IOW it looks as if devices with such marked pages are meant to never be hot
>> unplugged. I can see that being intentional for the XHCI console, but imo
>> such a restriction also needs prominently calling out in a comment next to
>> e.g. the function declaration.
> 
> The v1 included subpage_mmio_ro_remove() function (which would need to
> be used in case of hot-unplug of such device, if desirable), but since
> this series doesn't introduce any use of it (as you say, it isn't
> desirable for XHCI console specifically), you asked me to remove it...
> 
> Should I add an explicit comment about the limitation, instead of having
> it implicit by not having subpage_mmio_ro_remove() there?

That's what I was asking for in my earlier comment, yes.

>>> --- a/xen/arch/x86/pv/ro-page-fault.c
>>> +++ b/xen/arch/x86/pv/ro-page-fault.c
>>> @@ -330,6 +330,7 @@ static int mmio_ro_do_page_fault(struct 
>>> x86_emulate_ctxt *ctxt,
>>>  return X86EMUL_UNHANDLEABLE;
>>>  }
>>>  
>>> +mmio_ro_ctxt.mfn = mfn;
>>>  ctxt->data = &mmio_ro_ctxt;
>>>  if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
>>>  return x86_emulate(ctxt, _intercept_ops);
>>
>> Wouldn't you better set .mfn only on the "else" path, just out of context?
>> Suggesting that the new field in the struct could actually overlay the
>> (seg,bdf) tuple (being of relevance only to MMCFG intercept handling).
>> This would be more for documentation purposes than to actually save space.
>> (If so, perhaps the "else" itself would also better be dropped while making
>> the adjustment.)
> 
> I can do that if you prefer. But personally, I find such use of a
> union risky (without some means for a compiler to actually enforce its
> proper use) - while for correct code it may save some space, it makes
> the impact of a type confusion bug potentially worse - now that the
> unexpected value would be potentially attacker controlled.
> For a documentation purpose I can simply add a comment.

Well, I'm not going to insist on using a union. But I am pretty firm on
expecting the setting of .mfn to move down. Not using a union will then
mean static analysis tools may point out that .mfn is left uninitialized
for the first invocation of x86_emulate() visible above.
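
For reference, the overlay idea would amount to something like this (rough
sketch; only the mfn/seg/bdf fields are taken from the patch context, the
rest of the layout is an assumption):

    struct mmio_ro_emulate_ctxt {
        /* ... other members unchanged ... */
        union {
            struct {
                unsigned int seg, bdf; /* MMCFG intercept path */
            };
            mfn_t mfn;                 /* sub-page r/o MMIO path */
        };
    };

though, as said, a comment documenting which field is valid on which path may
serve the documentation purpose equally well.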

Jan



[PATCH] x86/shadow: don't leave trace record field uninitialized

2024-05-22 Thread Jan Beulich
The emulation_count field is set only conditionally right now. Convert
all field setting to an initializer, thus guaranteeing that field to be
set to 0 (default initialized) when GUEST_PAGING_LEVELS != 3.

While there also drop the "event" local variable, thus eliminating an
instance of the being phased out u32 type.

Coverity ID: 1598430
Fixes: 9a86ac1aa3d2 ("xentrace 5/7: Additional tracing for the shadow code")
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2093,20 +2093,18 @@ static inline void trace_shadow_emulate(
 guest_l1e_t gl1e, write_val;
 guest_va_t va;
 uint32_t flags:29, emulation_count:3;
-} d;
-u32 event;
-
-event = TRC_SHADOW_EMULATE | ((GUEST_PAGING_LEVELS-2)<<8);
-
-d.gl1e = gl1e;
-d.write_val.l1 = this_cpu(trace_emulate_write_val);
-d.va = va;
+} d = {
+.gl1e = gl1e,
+.write_val.l1 = this_cpu(trace_emulate_write_val),
+.va = va,
 #if GUEST_PAGING_LEVELS == 3
-d.emulation_count = this_cpu(trace_extra_emulation_count);
+.emulation_count = this_cpu(trace_extra_emulation_count),
 #endif
-d.flags = this_cpu(trace_shadow_path_flags);
+.flags = this_cpu(trace_shadow_path_flags),
+};
 
-trace(event, sizeof(d), );
+trace(TRC_SHADOW_EMULATE | ((GUEST_PAGING_LEVELS - 2) << 8),
+  sizeof(d), &d);
 }
 }
 #endif /* CONFIG_HVM */



Re: New Defects reported by Coverity Scan for XenProject

2024-05-22 Thread Jan Beulich
On 22.05.2024 11:56, scan-ad...@coverity.com wrote:
> ** CID 1598431:  Memory - corruptions  (OVERRUN)
> 
> 
> 
> *** CID 1598431:  Memory - corruptions  (OVERRUN)
> /xen/common/trace.c: 798 in trace()
> 792 }
> 793 
> 794 if ( rec_size > bytes_to_wrap )
> 795 insert_wrap_record(buf, rec_size);
> 796 
> 797 /* Write the original record */
 CID 1598431:  Memory - corruptions  (OVERRUN)
 Overrunning callee's array of size 28 by passing argument "extra" 
 (which evaluates to 31) in call to "__insert_record".
> 798 __insert_record(buf, event, extra, cycles, rec_size, extra_data);
> 799 
> 800 unlock:
> 801 spin_unlock_irqrestore(&this_cpu(t_lock), flags);
> 802 
> 803 /* Notify trace buffer consumer that we've crossed the high water 
> mark. */

How does the tool conclude that "extra" evaluates to 31, when at the top of
the function it is clearly checked to be less than 28?

> ** CID 1598430:  Uninitialized variables  (UNINIT)
> 
> 
> 
> *** CID 1598430:  Uninitialized variables  (UNINIT)
> /xen/arch/x86/mm/shadow/multi.c: 2109 in trace_shadow_emulate()
> 2103 d.va = va;
> 2104 #if GUEST_PAGING_LEVELS == 3
> 2105 d.emulation_count = this_cpu(trace_extra_emulation_count);
> 2106 #endif
> 2107 d.flags = this_cpu(trace_shadow_path_flags);
> 2108 
 CID 1598430:  Uninitialized variables  (UNINIT)
 Using uninitialized value "d". Field "d.emulation_count" is 
 uninitialized when calling "trace".
> 2109 trace(event, sizeof(d), &d);
> 2110 }
> 2111 }
> 2112 #endif /* CONFIG_HVM */
> 2113 
> 2114 
> /**/

This, otoh, looks to be a valid (but long-standing) issue, which I'll make
a patch for.

Jan



Re: [PATCH v15 2/5] vpci/header: emulate PCI_COMMAND register for guests

2024-05-22 Thread Jan Beulich
On 22.05.2024 11:28, Roger Pau Monné wrote:
> On Fri, May 17, 2024 at 01:06:12PM -0400, Stewart Hildebrand wrote:
>> @@ -754,9 +774,23 @@ static int cf_check init_header(struct pci_dev *pdev)
>>  return -EOPNOTSUPP;
>>  }
>>  
>> -/* Setup a handler for the command register. */
>> -rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, 
>> PCI_COMMAND,
>> -   2, header);
>> +/*
>> + * Setup a handler for the command register.
>> + *
>> + * TODO: If support for emulated bits is added, re-visit how to handle
>> + * PCI_COMMAND_PARITY, PCI_COMMAND_SERR, and PCI_COMMAND_FAST_BACK.
>> + */
>> +rc = vpci_add_register_mask(pdev->vpci,
>> +is_hwdom ? vpci_hw_read16 : guest_cmd_read,
>> +cmd_write, PCI_COMMAND, 2, header, 0, 0,
>> +PCI_COMMAND_RSVDP_MASK |
>> +(is_hwdom ? 0
>> +  : PCI_COMMAND_IO |
>> +PCI_COMMAND_PARITY |
>> +PCI_COMMAND_WAIT |
>> +PCI_COMMAND_SERR |
>> +PCI_COMMAND_FAST_BACK),
> 
> We want to allow full access to the hw domain and only apply the
> PCI_COMMAND_RSVDP_MASK when !is_hwdom in order to keep the current
> behavior for dom0.
> 
> I don't think it makes a difference in practice, but we are very lax
> in explicitly not applying any such restrictions to dom0.
> 
> With that fixed:
> 
> Reviewed-by: Roger Pau Monné 

Makes sense to me, so please feel free to retain my R-b with that adjustment.
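
In other words, the reserved-bit masking would move inside the conditional,
along these lines (sketch only; the local variable is purely illustrative,
the real change would adjust the argument in place):

    uint32_t rsvdp_mask = is_hwdom ? 0
                                   : PCI_COMMAND_RSVDP_MASK |
                                     PCI_COMMAND_IO |
                                     PCI_COMMAND_PARITY |
                                     PCI_COMMAND_WAIT |
                                     PCI_COMMAND_SERR |
                                     PCI_COMMAND_FAST_BACK;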

Jan



Re: [PATCH v2 2/8] xen/x86: Simplify header dependencies in x86/hvm

2024-05-22 Thread Jan Beulich
On 08.05.2024 14:39, Alejandro Vallejo wrote:
> Otherwise it's not possible to call functions described in hvm/vlapic.h from 
> the
> inline functions of hvm/hvm.h.
> 
> This is because a static inline in vlapic.h depends on hvm.h, and pulls it
> transitively through vpt.h. The ultimate cause is having hvm.h included in any
> of the "v*.h" headers, so break the cycle by moving the guilty inline into hvm.h.
> 
> No functional change.
> 
> Signed-off-by: Alejandro Vallejo 

In principle:
Reviewed-by: Jan Beulich 
But see below for one possible adjustment.

> ---
> v2:
>   * New patch. Prereq to moving vlapic_cpu_policy_changed() onto hvm.h

That hook invocation living outside of hvm/hvm.h was an outlier anyway,
so even without the planned further work this is probably a good move.

> --- a/xen/arch/x86/include/asm/hvm/hvm.h
> +++ b/xen/arch/x86/include/asm/hvm/hvm.h
> @@ -798,6 +798,12 @@ static inline void hvm_update_vlapic_mode(struct vcpu *v)
>  alternative_vcall(hvm_funcs.update_vlapic_mode, v);
>  }
>  
> +static inline void hvm_vlapic_sync_pir_to_irr(struct vcpu *v)
> +{
> +if ( hvm_funcs.sync_pir_to_irr )
> +alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
> +}

The hook doesn't have "vlapic" in its name. Therefore, instead of prepending
hvm_ to the original name for the wrapper, how about replacing the vlapic_
prefix that was there? That would then also fit better with the naming scheme
used for other hooks and their wrappers. Happy to adjust while committing, so
long as you don't disagree.
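
I.e. the wrapper would then read (merely the rename, shown for clarity):

    static inline void hvm_sync_pir_to_irr(struct vcpu *v)
    {
        if ( hvm_funcs.sync_pir_to_irr )
            alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
    }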

Jan



Re: [PATCH v10 03/14] xen/bitops: implement fls{l}() in common logic

2024-05-22 Thread Jan Beulich
On 22.05.2024 09:37, Oleksii K. wrote:
> On Tue, 2024-05-21 at 13:18 +0200, Jan Beulich wrote:
>> On 17.05.2024 15:54, Oleksii Kurochko wrote:
>>> To avoid the compilation error below, it is necessary to update the
>>> places in common/page_alloc.c where flsl() is used, as flsl() now
>>> returns unsigned int:
>>>
>>> ./include/xen/kernel.h:18:21: error: comparison of distinct pointer
>>> types lacks a cast [-Werror]
>>>    18 | (void) (&_x == &_y);    \
>>>   | ^~
>>>     common/page_alloc.c:1843:34: note: in expansion of macro 'min'
>>>  1843 | unsigned int inc_order = min(MAX_ORDER, flsl(e
>>> - s) - 1);
>>>
>>> generic_fls{l} was used instead of __builtin_clz{l}(x) because, if x
>>> is 0, the result of the builtin is undefined.
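
For reference, the zero-input concern is exactly what the explicit check in a
generic helper avoids - a standalone illustration, not the Xen implementation:

    #include <stdio.h>

    /* fls()-style semantics: __builtin_clz(0) is undefined, hence the check. */
    static unsigned int fls_demo(unsigned int x)
    {
        return x ? 32 - __builtin_clz(x) : 0;
    }

    int main(void)
    {
        printf("%u %u %u\n", fls_demo(0), fls_demo(1), fls_demo(0x80000000u));
        /* prints: 0 1 32 */
        return 0;
    }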
>>>
>>> The prototype of the per-architecture fls{l}() functions was
>>> changed to
>>> return 'unsigned int' to align with the generic implementation of
>>> these
>>> functions and avoid introducing signed/unsigned mismatches.
>>>
>>> Signed-off-by: Oleksii Kurochko 
>>> ---
>>>  The patch is almost independent from Andrew's patch series
>>>  (
>>> https://lore.kernel.org/xen-devel/20240313172716.2325427-1-andrew.coop...@citrix.com/T/#t
>>> )
>>>  except test_fls() function which IMO can be merged as a separate
>>> patch after Andrew's patch
>>>  will be fully ready.
>>
>> If there wasn't this dependency (I don't think it's "almost
>> independent"),
>> I'd be offering R-b with again one nit below.
> 
> Aren't all changes, except those in xen/common/bitops.c, independent? I
> could move these changes in xen/common/bitops.c to a separate commit. I
> think it is safe to commit them ( an introduction of common logic for
> fls{l}() and tests ) separately since the CI tests have passed.

Technically they might be, but contextually there are further conflicts.
Just try "patch --dry-run" on top of a plain staging tree. You really
need to settle, perhaps consulting Andrew, whether you want to go on top
of his change, or ahead of it. I'm not willing to approve a patch that's
presented one way but then is (kind of) expected to go in the other way.

Jan



Re: [PATCH v3 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-22 Thread Jan Beulich
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> On older systems, the XHCI extended capability (xcap) was laid out such that
> no other (interesting) registers were placed on the same page as the debug
> capability, so Linux was fine with making the whole page R/O. But at least on
> Tiger Lake and Alder Lake, Linux needs to write to some other registers on
> the same page too.
> 
> Add a generic API for making just parts of an MMIO page R/O and use it to fix
> USB3 console with share=yes or share=hwdom options. More details in commit
> messages.
> 
> Technically it may still qualify for 4.19, since v1 was sent well before
> last posting date. But I realize it's quite late and it isn't top
> priority series, so if it won't hit 4.19, it's okay with me too.
> 
> Marek Marczykowski-Górecki (2):
>   x86/mm: add API for marking only part of a MMIO page read only
>   drivers/char: Use sub-page ro API to make just xhci dbc cap RO
> 
>  xen/arch/x86/hvm/emulate.c  |   2 +-
>  xen/arch/x86/hvm/hvm.c  |   8 +-
>  xen/arch/x86/include/asm/mm.h   |  18 ++-
>  xen/arch/x86/mm.c   | 268 +-
>  xen/arch/x86/pv/ro-page-fault.c |   1 +-
>  xen/drivers/char/xhci-dbc.c |  27 +--
>  6 files changed, 309 insertions(+), 15 deletions(-)

Just to mention it here again, with v2 having been quite some time ago:
Like for that other work of yours I'm not really convinced the complexity
is worth the gain. Ultimately this may once again mean that I'll demand
a 2nd maintainer's ack, once technically I may be okay to offer R-b.

Jan



Re: [PATCH v3 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-22 Thread Jan Beulich
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> --- a/xen/drivers/char/xhci-dbc.c
> +++ b/xen/drivers/char/xhci-dbc.c
> @@ -1216,20 +1216,19 @@ static void __init cf_check 
> dbc_uart_init_postirq(struct serial_port *port)
>  break;
>  }
>  #ifdef CONFIG_X86
> -/*
> - * This marks the whole page as R/O, which may include other registers
> - * unrelated to DbC. Xen needs only DbC area protected, but it seems
> - * Linux's XHCI driver (as of 5.18) works without writting to the whole
> - * page, so keep it simple.
> - */
> -if ( rangeset_add_range(mmio_ro_ranges,
> -PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> - uart->dbc.xhc_dbc_offset),
> -PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> -   uart->dbc.xhc_dbc_offset +
> -sizeof(*uart->dbc.dbc_reg)) - 1) )
> -printk(XENLOG_INFO
> -   "Error while adding MMIO range of device to 
> mmio_ro_ranges\n");
> +if ( subpage_mmio_ro_add(
> + (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> +  uart->dbc.xhc_dbc_offset,
> + sizeof(*uart->dbc.dbc_reg)) )
> +{
> +printk(XENLOG_WARNING
> +   "Error while marking MMIO range of XHCI console as R/O, "
> +   "making the whole device R/O (share=no)\n");

Since you mention "share=no" here, wouldn't you then better also update the
respective struct field, even if (right now) there may be nothing subsequently
using that? Except that dbc_ensure_running() actually is looking at it, and
that's not an __init function.

> +if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
> +printk(XENLOG_WARNING
> +   "Failed to mark read-only %pp used for XHCI console\n",
> +   &uart->dbc.sbdf);
> +}
>  #endif
>  }

It's been a long time since v2 and the description doesn't say anything in
this regard: Is there a reason not to retain the rangeset addition alongside
the pci_ro_device() on the fallback path?

Jan



Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Jan Beulich
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned 
> long gla,
>  goto out_put_gfn;
>  }
>  
> +if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> + subpage_mmio_write_accept(mfn, gla) &&

Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?

> + (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> +{
> +rc = 1;
> +goto out_put_gfn;
> +}

Overall this new if() is pretty similar to the immediate preceding one.
So similar that I wonder whether the two shouldn't be folded. In fact
it looks as if the new one is needed only for the case where you'd pass
through (to a DomU) a device partially used by Xen. That could certainly
do with mentioning explicitly.

> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)

Considering what the function does and what it returns, perhaps better
s/get/map/? The "get_page" part of the name generally has a different
meaning in Xen's memory management.

> +{
> +void __iomem *mapped_page;
> +
> +if ( entry->mapped )
> +return entry->mapped;
> +
> +mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> +
> +spin_lock(&subpage_ro_lock);
> +/* Re-check under the lock */
> +if ( entry->mapped )
> +{
> +spin_unlock(&subpage_ro_lock);
> +iounmap(mapped_page);

The only unmap is on an error path here and on another error path elsewhere.
IOW it looks as if devices with such marked pages are meant to never be hot
unplugged. I can see that being intentional for the XHCI console, but imo
such a restriction also needs prominently calling out in a comment next to
e.g. the function declaration.
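
Something along these lines next to the declaration would address this
(wording is only a suggestion):

    /*
     * Note: pages registered here get mapped on first use and are never
     * unmapped again, i.e. a device whose MMIO is (partially) marked R/O
     * this way must not be hot-unplugged.
     */
    int subpage_mmio_ro_add(paddr_t start, size_t size);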

> +return entry->mapped;
> +}
> +
> +entry->mapped = mapped_page;
> +spin_unlock(&subpage_ro_lock);
> +return entry->mapped;
> +}
> +
> +static void subpage_mmio_write_emulate(
> +mfn_t mfn,
> +unsigned int offset,
> +const void *data,
> +unsigned int len)
> +{
> +struct subpage_ro_range *entry;
> +void __iomem *addr;

Wouldn't this better be pointer-to-volatile, with ...

> +list_for_each_entry(entry, &subpage_ro_ranges, list)
> +{
> +if ( mfn_eq(entry->mfn, mfn) )
> +{
> +if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +{
> + write_ignored:
> +gprintk(XENLOG_WARNING,
> +"ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len 
> %u\n",
> +mfn_x(mfn), offset, len);
> +return;
> +}
> +
> +addr = subpage_mmio_get_page(entry);
> +if ( !addr )
> +{
> +gprintk(XENLOG_ERR,
> +"Failed to map page for MMIO write at 
> 0x%"PRI_mfn"%03x\n",
> +mfn_x(mfn), offset);
> +return;
> +}
> +
> +switch ( len )
> +{
> +case 1:
> +writeb(*(const uint8_t*)data, addr);
> +break;
> +case 2:
> +writew(*(const uint16_t*)data, addr);
> +break;
> +case 4:
> +writel(*(const uint32_t*)data, addr);
> +break;
> +case 8:
> +writeq(*(const uint64_t*)data, addr);
> +break;

... this being how it's written? (If so, volatile suitably carried through to
other places as well.)
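
I.e. something along the lines of (sketch only, using the "map" name suggested
further up for the helper):

    /* volatile carried through the local (and the helper's return type). */
    volatile void __iomem *addr = subpage_mmio_map_page(entry);

    if ( addr )
        writeb(*(const uint8_t *)data, addr);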

> +default:
> +/* mmio_ro_emulated_write() already validated the size */
> +ASSERT_UNREACHABLE();
> +goto write_ignored;
> +}
> +return;
> +}
> +}
> +/* Do not print message for pages without any writable parts. */
> +}
> +
> +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> +{
> +unsigned int offset = PAGE_OFFSET(gla);
> +const struct subpage_ro_range *entry;
> +
> +list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)

Considering the other remark about respective devices impossible to go
away, is the RCU form here really needed? Its use gives the (false)
impression of entry removal being possible.

> +if ( mfn_eq(entry->mfn, mfn) &&
> + !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )

Btw, "qwords" in the field name is kind of odd when SUBPAGE_MMIO_RO_ALIGN
in principle suggests that changing granularity ought to be possible by
simply adjusting that #define. Maybe "->ro_elems"?

> --- a/xen/arch/x86/pv/ro-page-fault.c
> +++ b/xen/arch/x86/pv/ro-page-fault.c
> @@ -330,6 +330,7 @@ static int mmio_ro_do_page_fault(struct x86_emulate_ctxt 
> *ctxt,
>  return X86EMUL_UNHANDLEABLE;
>  }
>  
> +mmio_ro_ctxt.mfn = mfn;
> +ctxt->data = &mmio_ro_ctxt;
>  if ( 

Re: [PATCH 2/3] xen/x86: Drop useless non-Kconfig CONFIG_* variables

2024-05-22 Thread Jan Beulich
On 21.05.2024 19:15, Andrew Cooper wrote:
> These are all either completely unused, or do nothing useful.
> 
> Signed-off-by: Andrew Cooper 

Not an objection, i.e. you're fine to commit as is with Stefano's R-b, yet
still a question:

> @@ -30,11 +29,8 @@
>  /* Intel P4 currently has largest cache line (L2 line size is 128 bytes). */
>  #define CONFIG_X86_L1_CACHE_SHIFT 7
>  
> -#define CONFIG_ACPI_SRAT 1
>  #define CONFIG_ACPI_CSTATE 1
>  
> -#define CONFIG_WATCHDOG 1

I wonder if this wouldn't make sense to become a proper Kconfig setting,
thus ...

> --- a/xen/include/xen/watchdog.h
> +++ b/xen/include/xen/watchdog.h
> @@ -9,8 +9,6 @@
>  
>  #include 
>  
> -#ifdef CONFIG_WATCHDOG
> -
>  /* Try to set up a watchdog. */
>  int watchdog_setup(void);
>  
> @@ -23,13 +21,4 @@ void watchdog_disable(void);
>  /* Is the watchdog currently enabled. */
>  bool watchdog_enabled(void);
>  
> -#else
> -
> -#define watchdog_setup() ((void)0)
> -#define watchdog_enable() ((void)0)
> -#define watchdog_disable() ((void)0)
> -#define watchdog_enabled() ((void)0)
> -
> -#endif

... assigning purpose to these stubs.
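
I.e. something like the following, with CONFIG_WATCHDOG then coming from
Kconfig (sketch only; the stub return values are merely illustrative):

    #ifdef CONFIG_WATCHDOG

    /* Try to set up a watchdog. */
    int watchdog_setup(void);
    void watchdog_enable(void);
    void watchdog_disable(void);
    bool watchdog_enabled(void);

    #else

    static inline int watchdog_setup(void) { return -ENOSYS; }
    static inline void watchdog_enable(void) {}
    static inline void watchdog_disable(void) {}
    static inline bool watchdog_enabled(void) { return false; }

    #endif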

Jan



Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-21 Thread Jan Beulich
On 21.05.2024 17:33, Marek Marczykowski-Górecki wrote:
> On Tue, May 21, 2024 at 05:16:58PM +0200, Jan Beulich wrote:
>> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/arch/x86/include/asm/mm.h
>>> +++ b/xen/arch/x86/include/asm/mm.h
>>> @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
>>>  void memguard_guard_stack(void *p);
>>>  void memguard_unguard_stack(void *p);
>>>  
>>> +/*
>>> + * Add more precise r/o marking for a MMIO page. Range specified here
>>> + * will still be R/O, but the rest of the page (not marked as R/O via 
>>> another
>>> + * call) will have writes passed through.
>>> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
>>> + *
>>> + * This API cannot be used for overlapping ranges, nor for pages already 
>>> added
>>> + * to mmio_ro_ranges separately.
>>> + *
>>> + * Return values:
>>> + *  - negative: error
>>> + *  - 0: success
>>> + */
>>> +#define SUBPAGE_MMIO_RO_ALIGN 8
>>
>> This isn't just alignment, but also (and perhaps more importantly) 
>> granularity.
>> I think the name wants to express this.
> 
> SUBPAGE_MMIO_RO_GRANULARITY? Sounds a bit long...

..._GRAN? ..._CHUNK? ..._UNIT? (Perhaps also want to have MMIO_RO_ first.)

>>> +static int __init subpage_mmio_ro_add_page(
>>> +mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
>>> +{
>>> +struct subpage_ro_range *entry = NULL, *iter;
>>> +unsigned int i;
>>> +
>>> +list_for_each_entry(iter, &subpage_ro_ranges, list)
>>> +{
>>> +if ( mfn_eq(iter->mfn, mfn) )
>>> +{
>>> +entry = iter;
>>> +break;
>>> +}
>>> +}
>>> +if ( !entry )
>>> +{
>>> +/* iter == NULL marks it was a newly allocated entry */
>>> +iter = NULL;
>>> +entry = xzalloc(struct subpage_ro_range);
>>> +if ( !entry )
>>> +return -ENOMEM;
>>> +entry->mfn = mfn;
>>> +}
>>> +
>>> +for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
>>> +{
>>> +int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
>>> +entry->ro_qwords);
>>
>> Why int, not bool?
> 
> Because __test_and_set_bit returns int. But I can change to bool if you
> prefer.

test_bit() and the test-and-set family of functions should eventually all be
changed to return bool. One less use site to then also take care of.

Jan



Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-21 Thread Jan Beulich
On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. Range specified here
> + * will still be R/O, but the rest of the page (not marked as R/O via another
> + * call) will have writes passed through.
> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> + *
> + * This API cannot be used for overlapping ranges, nor for pages already 
> added
> + * to mmio_ro_ranges separately.
> + *
> + * Return values:
> + *  - negative: error
> + *  - 0: success
> + */
> +#define SUBPAGE_MMIO_RO_ALIGN 8

This isn't just alignment, but also (and perhaps more importantly) granularity.
I think the name wants to express this.

> @@ -4910,6 +4921,260 @@ long arch_memory_op(unsigned long cmd, 
> XEN_GUEST_HANDLE_PARAM(void) arg)
>  return rc;
>  }
>  
> +/*
> + * Mark part of the page as R/O.
> + * Returns:
> + * - 0 on success - first range in the page
> + * - 1 on success - subsequent range in the page
> + * - <0 on error
> + *
> + * This needs subpage_ro_lock already taken */

Nit: Comment style (full stop and */ on its own line).
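
I.e. in the usual Xen form:

    /*
     * Mark part of the page as R/O.
     * Returns:
     * - 0 on success - first range in the page
     * - 1 on success - subsequent range in the page
     * - <0 on error
     *
     * This needs subpage_ro_lock already taken.
     */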

> +static int __init subpage_mmio_ro_add_page(
> +mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> +{
> +struct subpage_ro_range *entry = NULL, *iter;
> +unsigned int i;
> +
> +list_for_each_entry(iter, &subpage_ro_ranges, list)
> +{
> +if ( mfn_eq(iter->mfn, mfn) )
> +{
> +entry = iter;
> +break;
> +}
> +}
> +if ( !entry )
> +{
> +/* iter == NULL marks it was a newly allocated entry */
> +iter = NULL;
> +entry = xzalloc(struct subpage_ro_range);
> +if ( !entry )
> +return -ENOMEM;
> +entry->mfn = mfn;
> +}
> +
> +for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +{
> +int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
> +entry->ro_qwords);

Why int, not bool?

> +ASSERT(!oldbit);
> +}
> +
> +if ( !iter )
> +list_add(&entry->list, &subpage_ro_ranges);
> +
> +return iter ? 1 : 0;
> +}
> +
> +/* This needs subpage_ro_lock already taken */
> +static void __init subpage_mmio_ro_remove_page(
> +mfn_t mfn,
> +int offset_s,
> +int offset_e)

Can either of these be negative? The more that ...

> +{
> +struct subpage_ro_range *entry = NULL, *iter;
> +unsigned int i;

... this is used ...

> +list_for_each_entry(iter, &subpage_ro_ranges, list)
> +{
> +if ( mfn_eq(iter->mfn, mfn) )
> +{
> +entry = iter;
> +break;
> +}
> +}
> +if ( !entry )
> +return;
> +
> +for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )

... with both of them?

> +__clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> +
> +if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> +return;
> +
> +list_del(&entry->list);
> +if ( entry->mapped )
> +iounmap(entry->mapped);
> +xfree(entry);
> +}
> +
> +int __init subpage_mmio_ro_add(
> +paddr_t start,
> +size_t size)
> +{
> +mfn_t mfn_start = maddr_to_mfn(start);
> +paddr_t end = start + size - 1;
> +mfn_t mfn_end = maddr_to_mfn(end);
> +unsigned int offset_end = 0;
> +int rc;
> +bool subpage_start, subpage_end;
> +
> +ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> +ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));
> +if ( !IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN) )
> +size = ROUNDUP(size, SUBPAGE_MMIO_RO_ALIGN);
> +
> +if ( !size )
> +return 0;
> +
> +if ( mfn_eq(mfn_start, mfn_end) )
> +{
> +/* Both starting and ending parts handled at once */
> +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE 
> - 1;
> +subpage_end = false;
> +}
> +else
> +{
> +subpage_start = PAGE_OFFSET(start);
> +subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> +}
> +
> +spin_lock(&subpage_ro_lock);
> +
> +if ( subpage_start )
> +{
> +offset_end = mfn_eq(mfn_start, mfn_end) ?
> + PAGE_OFFSET(end) :
> + (PAGE_SIZE - 1);
> +rc = subpage_mmio_ro_add_page(mfn_start,
> +  PAGE_OFFSET(start),
> +  offset_end);
> +if ( rc < 0 )
> +goto err_unlock;
> +/* Check if not marking R/W part of a page intended to be fully R/O 
> */
> +ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> +  mfn_x(mfn_start)));
> +}
> +
> +if ( subpage_end )
> +{
> +rc = 

Re: [XEN PATCH 4/4] x86_64/cpu_idle: address violations of MISRA C Rule 20.7

2024-05-21 Thread Jan Beulich
On 16.05.2024 01:19, Stefano Stabellini wrote:
> On Wed, 15 May 2024, Nicola Vetrini wrote:
>> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
>> of macro parameters shall be enclosed in parentheses". Therefore, some
>> macro definitions should gain additional parentheses to ensure that all
>> current and future users will be safe with respect to expansions that
>> can possibly alter the semantics of the passed-in macro parameter.
>>
>> No functional change.
>>
>> Signed-off-by: Nicola Vetrini 
> 
> Reviewed-by: Stefano Stabellini 

Acked-by: Jan Beulich 





Re: [XEN PATCH 3/4] x86_64/uaccess: address violations of MISRA C Rule 20.7

2024-05-21 Thread Jan Beulich
On 16.05.2024 01:19, Stefano Stabellini wrote:
> On Wed, 15 May 2024, Nicola Vetrini wrote:
>> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
>> of macro parameters shall be enclosed in parentheses". Therefore, some
>> macro definitions should gain additional parentheses to ensure that all
>> current and future users will be safe with respect to expansions that
>> can possibly alter the semantics of the passed-in macro parameter.
>>
>> xlat_malloc_init is touched for consistency, despite the construct
>> being already deviated.
>>
>> No functional change.
>>
>> Signed-off-by: Nicola Vetrini 
> 
> Reviewed-by: Stefano Stabellini 

Acked-by: Jan Beulich 




Re: [XEN PATCH 2/4] x86/hvm: address violations of MISRA C Rule 20.7

2024-05-21 Thread Jan Beulich
On 16.05.2024 01:18, Stefano Stabellini wrote:
> On Wed, 15 May 2024, Nicola Vetrini wrote:
>> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
>> of macro parameters shall be enclosed in parentheses". Therefore, some
>> macro definitions should gain additional parentheses to ensure that all
>> current and future users will be safe with respect to expansions that
>> can possibly alter the semantics of the passed-in macro parameter.
>>
>> No functional change.
>>
>> Signed-off-by: Nicola Vetrini 
> 
> Reviewed-by: Stefano Stabellini 
> 
> 
>> ---
>>  xen/arch/x86/hvm/mtrr.c | 2 +-
>>  xen/arch/x86/hvm/rtc.c  | 2 +-
>>  xen/arch/x86/include/asm/hvm/save.h | 2 +-
>>  3 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
>> index 32f74c1db03b..1079851f70ed 100644
>> --- a/xen/arch/x86/hvm/mtrr.c
>> +++ b/xen/arch/x86/hvm/mtrr.c
>> @@ -16,7 +16,7 @@
>>  #include 
>>  
>>  /* Get page attribute fields (PAn) from PAT MSR. */
>> -#define pat_cr_2_paf(pat_cr,n)  ((((uint64_t)pat_cr) >> ((n)<<3)) & 0xff)
>> +#define pat_cr_2_paf(pat_cr, n)  ((((uint64_t)(pat_cr)) >> ((n) << 3)) & 0xff)
> 
> just a cosmetic change

Not really, as Nicola already pointed out. However, there's an excess pair of
parentheses which better would have been re-arranged rather than adding yet
another pair:

#define pat_cr_2_paf(pat_cr, n)  (((uint64_t)(pat_cr) >> ((n) << 3)) & 0xff)

Preferably with that (which I'll likely take the liberty of adjusting while
committing)
Acked-by: Jan Beulich 

Jan



Re: [XEN PATCH 1/4] x86/vpmu: address violations of MISRA C Rule 20.7

2024-05-21 Thread Jan Beulich
On 16.05.2024 01:17, Stefano Stabellini wrote:
> On Wed, 15 May 2024, Nicola Vetrini wrote:
>> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
>> of macro parameters shall be enclosed in parentheses". Therefore, some
>> macro definitions should gain additional parentheses to ensure that all
>> current and future users will be safe with respect to expansions that
>> can possibly alter the semantics of the passed-in macro parameter.
>>
>> No functional change.
>>
>> Signed-off-by: Nicola Vetrini 
> 
> Reviewed-by: Stefano Stabellini 

Acked-by: Jan Beulich 




