Re: [PATCH] x86: Use printk_once() instead of opencoding it

2023-05-10 Thread Jan Beulich
On 10.05.2023 21:33, Andrew Cooper wrote:
> Technically our helper post-dates all of these examples, but it's good cleanup
> nevertheless.  None of these examples should be using fully locked
> test_and_set_bool() in the first place.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 





Re: [PATCH 5/6] x86/cpu-policy: Disentangle X86_NR_FEAT and FEATURESET_NR_ENTRIES

2023-05-10 Thread Jan Beulich
On 10.05.2023 22:13, Andrew Cooper wrote:
> On 09/05/2023 3:24 pm, Jan Beulich wrote:
>> On 09.05.2023 16:03, Andrew Cooper wrote:
>>> On 08/05/2023 8:45 am, Jan Beulich wrote:
 On 04.05.2023 21:39, Andrew Cooper wrote:
> When adding new words to a featureset, there is a reasonable amount of
> boilerplate and it is preforable to split the addition into multiple 
> patches.
>
> GCC 12 spotted a real (transient) error which occurs when splitting 
> additions
> like this.  Right now, FEATURESET_NR_ENTRIES is dynamically generated 
> from the
> highest numeric XEN_CPUFEATURE() value, and can be less than what the
> FEATURESET_* constants suggest the length of a featureset bitmap ought to 
> be.
>
> This causes the policy <-> featureset converters to genuinely access
> out-of-bounds on the featureset array.
>
> Rework X86_NR_FEAT to be related to FEATURESET_* alone, allowing it
> specifically to grow larger than FEATURESET_NR_ENTRIES.
>
> Reported-by: Jan Beulich 
> Signed-off-by: Andrew Cooper 
 While, like you, I could live with the previous patch even if I don't
 particularly like it, I'm not convinced of the route you take here.
>>> It's the route you tentatively agreed to in
>>> https://lore.kernel.org/xen-devel/a282c338-98ab-6c3f-314b-267a5a82b...@suse.com/
>> Right. Yet I deliberately said "may be the best" there, as something
>> better might turn up. And getting the two numbers to always agree, as
>> suggested, might end up being better.
> 
> Then don't write "yes" if what you actually mean is "I'd prefer a
> different option if possible", which is a "no".
> 
> I cannot read your mind, and we both know I do not have time to waste on
> this task.
> 
> And now I have to go and spend yet more time doing it differently.

I'm sorry for that. Yet please also allow for me to re-consider an earlier
voiced view, once I see things in more detail.

Jan



Re: [PATCH 1/6] x86/cpu-policy: Drop build time cross-checks of featureset sizes

2023-05-10 Thread Jan Beulich
On 10.05.2023 17:06, Andrew Cooper wrote:
> On 09/05/2023 5:15 pm, Jan Beulich wrote:
>> On 09.05.2023 17:59, Andrew Cooper wrote:
>>> On 09/05/2023 3:28 pm, Jan Beulich wrote:
 On 09.05.2023 15:04, Andrew Cooper wrote:
> On 08/05/2023 7:47 am, Jan Beulich wrote:
>> On 04.05.2023 21:39, Andrew Cooper wrote:
>>> These BUILD_BUG_ON()s exist to cover the curious absence of a 
>>> diagnostic for
>>> code which looks like:
>>>
>>>   uint32_t foo[1] = { 1, 2, 3 };
>>>
>>> However, GCC 12 at least does now warn for this:
>>>
>>>   foo.c:1:24: error: excess elements in array initializer [-Werror]
>>> 884 | uint32_t foo[1] = { 1, 2, 3 };
>>> |^
>>>   foo.c:1:24: note: (near initialization for 'foo')
>> I'm pretty sure all gcc versions we support diagnose such cases. In turn
>> the arrays in question don't have explicit dimensions at their
>> definition sites, and hence they derive their dimensions from their
>> initializers. So the build-time-checks are about the arrays in fact
>> obtaining the right dimensions, i.e. the initializers being suitable.
>>
>> With the core part of the reasoning not being applicable, I'm afraid I
>> can't even say "okay with an adjusted description".
> Now I'm extra confused.
>
> I put those BUILD_BUG_ON()'s in because I was not getting a diagnostic
> when I was expecting one, and there was a bug in the original featureset
> work caused by this going wrong.
>
> But godbolt seems to agree that even GCC 4.1 notices.
>
> Maybe it was some other error (C file not seeing the header properly?)
> which disappeared across the upstream review?
 Or maybe, by mistake, too few initializer fields? But what exactly it
 was probably doesn't matter. If this patch is to stay (see below), some
 different description will be needed anyway (or the change be folded
 into the one actually invalidating those BUILD_BUG_ON()s).

> Either way, these aren't appropriate, and need deleting before patch 5,
> because the check is no longer valid when a featureset can be longer
> than the autogen length.
 Well, they need deleting if we stick to the approach chosen there right
 now. If we switched to my proposed alternative, they better would stay.
>>> Given that all versions of GCC do warn, I don't see any justification
>>> for them to stay.
>> All versions warn when the variable declarations / definitions have a
>> dimension specified, and then there are excess initializers. Yet none
>> of the five affected arrays have a dimension specified in their
>> definitions.
>>
>> Even if dimensions were added, we'd then have only covered half of
>> what the BUILD_BUG_ON()s cover right now: There could then be fewer
>> than intended initializer fields, and things may still be screwed. I
>> think it was for this very reason why BUILD_BUG_ON() was chosen.
> 
> ???
> 
> The dimensions already exist, as proved by the fact GCC can spot the
> violation.

Where? Quoting cpu-policy.c:

const uint32_t known_features[] = INIT_KNOWN_FEATURES;

static const uint32_t __initconst pv_max_featuremask[] = INIT_PV_MAX_FEATURES;
static const uint32_t hvm_shadow_max_featuremask[] = 
INIT_HVM_SHADOW_MAX_FEATURES;
static const uint32_t __initconst hvm_hap_max_featuremask[] =
INIT_HVM_HAP_MAX_FEATURES;
static const uint32_t __initconst pv_def_featuremask[] = INIT_PV_DEF_FEATURES;
static const uint32_t __initconst hvm_shadow_def_featuremask[] =
INIT_HVM_SHADOW_DEF_FEATURES;
static const uint32_t __initconst hvm_hap_def_featuremask[] =
INIT_HVM_HAP_DEF_FEATURES;
static const uint32_t deep_features[] = INIT_DEEP_FEATURES;

I notice that known_features[], as an exception, has its dimension declared
in cpuid.h.

> On the other hand, zero extending a featureset is explicitly how they're
> supposed to work.  How do you think Xapi has coped with migration
> compatibility over the years, not to mention the microcode changes which
> lengthen a featureset?
> 
> So no, there was never any problem with constructs of the form uint32_t
> foo[10] = { 1, } in the first place.
> 
> The BUILD_BUG_ON()s therefore serve no purpose at all.

As per above the very minimum would be to accompany their dropping with
adding of explicitly specified dimensions for all the static arrays. I'm
not entirely certain about the other side (the zero-extension), but I'd
likely end up simply trusting you on that.

Jan



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Jan Beulich
On 10.05.2023 17:19, Roger Pau Monné wrote:
> On Wed, May 10, 2023 at 03:30:21PM +0200, Jan Beulich wrote:
>> On 10.05.2023 12:22, Roger Pau Monné wrote:
>>> On Wed, May 10, 2023 at 12:00:51PM +0200, Jan Beulich wrote:
 On 10.05.2023 10:27, Roger Pau Monné wrote:
> On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
>> On 09.05.2023 12:41, Roger Pau Monne wrote:
>>> When translating an address that falls inside of a superpage in the
>>> IOMMU page tables the fetching of the PTE physical address field
>>> wasn't using dma_pte_addr(), which caused the returned data to be
>>> corrupt as it would contain bits not related to the address field.
>>
>> I'm afraid I don't understand:
>>
>>> --- a/xen/drivers/passthrough/vtd/iommu.c
>>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>>> @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct 
>>> domain *domain, daddr_t addr,
>>>  
>>>  if ( !alloc )
>>>  {
>>> -pte_maddr = 0;
>>>  if ( !dma_pte_present(*pte) )
>>> +{
>>> +pte_maddr = 0;
>>>  break;
>>> +}
>>>  
>>>  /*
>>>   * When the leaf entry was requested, pass back the 
>>> full PTE,
>>>   * with the address adjusted to account for the 
>>> residual of
>>>   * the walk.
>>>   */
>>> -pte_maddr = pte->val +
>>> +pte_maddr +=
>>>  (addr & ((1UL << level_to_offset_bits(level)) - 1) 
>>> &
>>>   PAGE_MASK);
>>
>> With this change you're now violating what the comment says (plus what
>> the comment ahead of the function says). And it says what it says for
>> a reason - see intel_iommu_lookup_page(), which I think your change is
>> breaking.
>
> Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
> the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
> doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
> masking to remove the bits above PADDR_BITS.

 Oh, right. But that's a missing dma_pte_addr() in intel_iommu_lookup_page()
 then. (It would likely be better anyway to switch "uint64_t val" to
 "struct dma_pte pte" there, to make more visible that it's a PTE we're
 dealing with.) I indeed overlooked this aspect when doing the earlier
 change.
>>>
>>> I guess I'm still confused, as the other return value for target == 0
>>> (when the address is not part of a superpage) does return
>>> dma_pte_addr(pte).  I think that needs further fixing then.
>>
>> Hmm, indeed. But I think it's worse than this: addr_to_dma_page_maddr()
>> also does one too many iterations in that case. All "normal" callers
>> supply a positive "target". We need to terminate the walk at level 1
>> also when target == 0.
> 
> Don't we do that already due to the following check:
> 
> if ( --level == target )
> break;
> 
> Which prevents mapping the PTE address as a page table directory?

I don't think this is enough - this code, afaict right now, is only
sufficient when target >= 1.

Jan



Re: [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters

2023-05-10 Thread Jan Beulich
On 10.05.2023 20:11, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 6:43 AM Jan Beulich  wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> --- a/tools/misc/xenpm.c
>>> +++ b/tools/misc/xenpm.c
>>> @@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
>>>  pause();
>>>  }
>>>
>>> +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
>>> +  unsigned int *activity_window,
>>> +  const char **units)
>>> +{
>>> +unsigned int mantissa = hwp->activity_window & 
>>> HWP_ACT_WINDOW_MANTISSA_MASK;
>>> +unsigned int exponent =
>>> +(hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
>>> +HWP_ACT_WINDOW_EXPONENT_MASK;
>>
>> I wish we had MASK_EXTR() in common-macros.h. While really a comment on
>> patch 7 - HWP_ACT_WINDOW_EXPONENT_SHIFT is redundant information and
>> should imo be omitted from the public interface, in favor of just a
>> (suitably shifted) mask value. Also note how those constants all lack
>> proper XEN_ prefixes.
> 
> I'll add a patch adding MASK_EXTR() & MASK_INSR() to common-macros.h
> and use those - is there any reason not to do that?

I don't think there is, but I'm also not a maintainer of that code.

>>> +unsigned int multiplier = 1;
>>> +unsigned int i;
>>> +
>>> +if ( hwp->activity_window == 0 )
>>> +{
>>> +*units = "hardware selected";
>>> +*activity_window = 0;
>>> +
>>> +return;
>>> +}
>>
>> While in line with documentation, any mantissa of 0 results in a 0us
>> window, which I assume would then also mean "hardware selected".
> 
> I hadn't considered that.  The hardware seems to allow you to write a
> 0 mantissa, non-0 exponent.  From the SDM, it's unclear what that
> would mean.  The code as written would display "0 us", "0 ms", or "0
> s" - not "0 hardware selected".  Do you want more explicity printing
> for those cases?  I think it's fine to have a distinction between the
> output.  "0 hardware selected" is the known valid value that is
> working as expected.  The other ones being something different seems
> good to me since we don't really know what they mean.

Keeping things - apart from perhaps adding a respective comment - is
okay, as long as we don't know any better.

Jan



Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace

2023-05-10 Thread Jan Beulich
On 10.05.2023 19:49, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 6:26 AM Jan Beulich  wrote:
>>
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>>> hardware rather closely.
>>>
>>> We need the features bitmask to indicated fields supported by the actual
>>> hardware.
>>>
>>> The use of uint8_t parameters matches the hardware size.  uint32_t
>>> entries grows the sysctl_t past the build assertion in setup.c.  The
>>> uint8_t ranges are supported across multiple generations, so hopefully
>>> they won't change.
>>
>> Still it feels a little odd for values to be this narrow. Aiui the
>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>> used by HWP. So you could widen the union in struct
>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>> placed scaling_cur_freq could be included as well ...
> 
> The values are narrow, but they match the hardware.  It works for HWP,
> so there is no need to change at this time AFAICT.
> 
> Do you want me to make this change?

Well, much depends on what these 8-bit values actually express (I did
raise this question in one of the replies to your patches, as I wasn't
able to find anything in the SDM). That'll then hopefully allow to
make some educated prediction on on how likely it is that a future
variant of hwp would want to widen them. (Was it energy_perf that went
from 4 to 8 bits at some point, which you even comment upon in the
public header?)

> On Mon, May 8, 2023 at 6:47 AM Jan Beulich  wrote:
>> On 08.05.2023 12:25, Jan Beulich wrote:
>>> On 01.05.2023 21:30, Jason Andryuk wrote:
 Extend xen_get_cpufreq_para to return hwp parameters.  These match the
 hardware rather closely.

 We need the features bitmask to indicated fields supported by the actual
 hardware.

 The use of uint8_t parameters matches the hardware size.  uint32_t
 entries grows the sysctl_t past the build assertion in setup.c.  The
 uint8_t ranges are supported across multiple generations, so hopefully
 they won't change.
>>>
>>> Still it feels a little odd for values to be this narrow. Aiui the
>>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>>> used by HWP. So you could widen the union in struct
>>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>>> placed scaling_cur_freq could be included as well ...
>>
>> Having seen patch 9 now as well, I wonder whether here (or in a separate
>> patch) you don't want to limit providing inapplicable data (for example
>> not filling *scaling_available_governors would even avoid an allocation,
>> thus removing a possible reason for failure), while there (or again in a
>> separate patch) you'd also limit what the tool reports (inapplicable
>> output causes confusion / questions at best).
> 
> The xenpm output only shows relevant information:
> 
> # xenpm get-cpufreq-para 11
> cpu id   : 11
> affected_cpus: 11
> cpuinfo frequency: base [160] max [490]
> scaling_driver   : hwp-cpufreq
> scaling_avail_gov: hwp
> current_governor : hwp
> hwp variables:
>   hardware limits: lowest [1] most_efficient [11]
>  : guaranteed [11] highest [49]
>   configured limits  : min [1] max [255] energy_perf [128]
>  : activity_window [0 hardware selected]
>  : desired [0 hw autonomous]
> turbo mode   : enabled
> 
> The scaling_*_freq values, policy->{min,max,cur} are filled in with
> base, max and 0 in hwp_get_cpu_speeds(), so it's not totally invalid
> values being returned.  The governor registration restricting to only
> internal governors when HWP is active means it only has the single
> governor.  I think it's okay as-is, but let me know if you want
> something changed.

Well, the main connection here was to the possible overloading of
space in the sysctl struct, by widening what the union covers. That
can of course only be done for fields which don't convey useful data.
If we go without the further overloading, I guess we can for now leave
the tool as you have it, and deal with possible tidying later on.

Jan



Re: [XEN PATCH 2/2] x86/Dom0: Use streaming decompression for ZSTD compressed kernels

2023-05-10 Thread Jan Beulich
On 10.05.2023 19:30, Rafaël Kooi wrote:
> On 10/05/2023 11:48, Jan Beulich wrote:
>> On 10.05.2023 10:51, Rafaël Kooi wrote:
>>> On 10/05/2023 10:03, Jan Beulich wrote:> On 10.05.2023 02:18, Rafaël Kooi 
>>> wrote:
> --- a/xen/common/decompress.c
> +++ b/xen/common/decompress.c
> @@ -3,11 +3,26 @@
>#include 
>#include 
>
> +typedef struct _ZSTD_state
> +{
> +void *write_buf;
> +unsigned int write_pos;
> +} ZSTD_state;
> +
>static void __init cf_check error(const char *msg)
>{
>printk("%s\n", msg);
>}
>
> +static int __init cf_check ZSTD_flush(void *buf, unsigned int pos,
> +  void *userptr)
> +{
> +ZSTD_state *state = (ZSTD_state*)userptr;
> +memcpy(state->write_buf + state->write_pos, buf, pos);
> +state->write_pos += pos;
> +return pos;
> +}

 This doesn't really belong here, but will (I expect) go away anyway once
 you drop the earlier patch.

>>>
>>> The ZSTD_flush will have to stay, as that is how the decompressor will
>>> start streaming decompression. The difference will be that the book
>>> keeping will be "global" (to the translation unit).
>>
>> But this bookkeeping should be entirely in zstd code (i.e. presumably
>> unzstd.c).
>>
> 
> The implementation of the decompression functions seem to indicate
> otherwise. Referring to unzstd.c:`unzstd`, the function will take the
> streaming decompression path if either `fill` or `flush` have been
> supplied. I cross checked with unlzma.c and unxz.c, and that seems to
> have similar behavior in regards to flushing the output data. The
> `flush` function is passed a buffer to a chunk of decompressed data with
> `pos` being the size of the chunk. For the sake of consistency I don't
> think it's a good idea to deviate from this behavior in just unzstd.c.

Well, so far we don't use any flush functions, do we? The question
could therefore also be put differently: In how far is the flush
function you introduce zstd-specific? If it isn't be other than the
fact that it is only passed to unzstd(), perhaps it shouldn't be
named as if zstd-specific?

>if ( len >= 4 && !memcmp(inbuf, "\x28\xb5\x2f\xfd", 4) )
> - return unzstd(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +{
> +// NOTE (Rafaël): On Arch Linux the kernel is compressed in a way
> +// that requires streaming ZSTD decompression. Otherwise 
> decompression
> +// will fail when using a unified EFI binary. Somehow 
> decompression
> +// works when not using a unified EFI binary, I suspect this is 
> the
> +// kernel self decompressing. Or there is a code path that I am 
> not
> +// aware of that takes care of the use case properly.

 Along the lines of what I've said for the description, this wants to avoid
 terms like "somehow" if at all possible.
>>>
>>> I've used the term "somehow" because I don't know why decompression
>>> works when Xen loads the kernel from the EFI file system. I assume the
>>> kernel still gets unpacked by Xen, right? Or does the kernel unpack
>>> itself?
>>
>> The handling of Dom0 kernel decompression ought to be entirely independent
>> of EFI vs legacy. Unless I'm wrong with that (mis-remembering), you
>> mentioning EFI is potentially misleading. And yes, at least on x86 the
>> kernel is decompressed by Xen (by peeking into the supplied bzImage). The
>> difference between a plain bzImage and a "unified EFI binary" is what you
>> will want to outline in the description (and at least mention in the
>> comment). What I'm wondering is whether there simply is an issue with size
>> determination when the kernel is taken from the .kernel section.
>>
> 
> Assuming you are talking about size determination of the compressed
> bzImage, I notice a discrepancy in the size of the ZSTD stream and the
> reported size by the vmlinuz-* header upon further investigation. When
> using the streaming decompression code I made it output how many bytes
> it reads from the extracted-but-still-compressed bzImage. The code
> reads 12,327,560 bytes, but the size of the compressed bzImage in the
> header is 12,327,564 bytes. In xen/arch/x86/bzImage.c `decompress` is
> called with `orig_image_len`, when the function `output_length`
> calculates the end address and removes 4 bytes from that address. If I
> remove the last 4 bytes from the compressed bzImage then `unzstd` will
> unpack it with `unzstd bzImage.zst -o bzImage`, otherwise it will
> complain with `zstd: /*stdin*\: unknown header`. With this new
> information I think the correct solution is to try calling `decompress`
> a second time with with `orig_image_len - 4` if it fails.

That's not very likely to be an appropriate solution. If the sizes
diverge by 4, that difference needs explaining. Once understood, it'll
hopefu

[ovmf test] 180611: all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180611 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180611/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf f47415e0315e2b0cf7bf1d00634f46deee802227
baseline version:
 ovmf 6fb2760dc8939b16a906b8e6bb224764907168f3

Last test of basis   180604  2023-05-10 15:12:03 Z0 days
Testing same since   180611  2023-05-11 00:42:25 Z0 days1 attempts


People who touched revisions under test:
  Rebecca Cran 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   6fb2760dc8..f47415e031  f47415e0315e2b0cf7bf1d00634f46deee802227 -> 
xen-tested-master



Re: [PATCH v5 05/14] tools/xenstore: use accounting buffering for node accounting

2023-05-10 Thread Juergen Gross

On 10.05.23 23:31, Julien Grall wrote:

Hi,

On 10/05/2023 13:54, Juergen Gross wrote:

On 09.05.23 20:46, Julien Grall wrote:

Hi Juergen,

On 08/05/2023 12:47, Juergen Gross wrote:

Add the node accounting to the accounting information buffering in
order to avoid having to undo it in case of failure.

This requires to call domain_nbentry_dec() before any changes to the
data base, as it can return an error now.

Signed-off-by: Juergen Gross 
---
V5:
- add error handling after domain_nbentry_dec() calls (Julien Grall)
---
  tools/xenstore/xenstored_core.c   | 29 +++--
  tools/xenstore/xenstored_domain.h |  4 ++--
  2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 8392bdec9b..22da434e2a 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1454,7 +1454,6 @@ static void destroy_node_rm(struct connection *conn, 
struct node *node)

  static int destroy_node(struct connection *conn, struct node *node)
  {
  destroy_node_rm(conn, node);
-    domain_nbentry_dec(conn, get_node_owner(node));
  /*
   * It is not possible to easily revert the changes in a transaction.
@@ -1645,6 +1644,9 @@ static int delnode_sub(const void *ctx, struct 
connection *conn,

  if (ret > 0)
  return WALK_TREE_SUCCESS_STOP;
+    if (domain_nbentry_dec(conn, get_node_owner(node)))
+    return WALK_TREE_ERROR_STOP;


I think there is a potential issue with the buffering here. In case of 
failure, the node could have been removed, but the quota would not be 
properly accounted.


You mean the case where another node has been deleted and due to accounting
buffering the corrected accounting data wouldn't be committed?

This is no problem, as an error returned by delnode_sub() will result in
corrupt() being called, which in turn will correct the accounting data.


To me corrupt() is a big hammer and it feels wrong to call it when I think we 
have easier/faster way to deal with the issue. Could we instead call 
acc_commit() before returning?


You are aware that this is a very problematic situation we are in?

We couldn't allocate a small amount of memory (around 64 bytes)! Xenstored
will probably die within milliseconds. Using the big hammer in such a
situation is fine IMO. It will maybe result in solving the problem by
freeing of memory (quite unlikely, though), but it won't leave xenstored
in a worse state than with your suggestion.

And calling acc_commit() here wouldn't really help, as accounting data
couldn't be recorded, so there are missing updates anyway due to the failed
call of domain_nbentry_dec().

Also, I think a comment would be warrant to explain why we are returning 
WALK_TREE_ERROR_STOP here when...



+
  /* In case of error stop the walk. */
  if (!ret && do_tdb_delete(conn, &key, &node->acc))
  return WALK_TREE_SUCCESS_STOP;


... this is not the case when do_tdb_delete() fails for some reasons.


The main idea was that the remove is working from the leafs towards the root.
In case one entry can't be removed, we should just stop.

OTOH returning WALK_TREE_ERROR_STOP might be cleaner, as this would make sure
that accounting data is repaired afterwards. Even if do_tdb_delete() can't
really fail in normal configurations, as the only failure reasons are:

- the node isn't found (quite unlikely, as we just read it before entering
   delnode_sub()), or
- writing the updated data base failed (in normal configurations it is in
   already allocated memory, so no way to fail that)

I think I'll switch to return WALK_TREE_ERROR_STOP here.


See above for a different proposal.


Without deleting the node in the data base this would be another accounting
data inconsistency, so calling corrupt() is the correct cleanup measure.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


[linux-linus test] 180606: regressions - FAIL

2023-05-10 Thread osstest service owner
flight 180606 linux-linus real [real]
flight 180612 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/180606/
http://logs.test-lab.xenproject.org/osstest/logs/180612/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-armhf-armhf-xl-credit1   8 xen-boot fail REGR. vs. 180278

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-vhd 21 guest-start/debian.repeat fail pass in 180612-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt  8 xen-boot fail  like 180278
 test-armhf-armhf-libvirt-raw  8 xen-boot fail  like 180278
 test-armhf-armhf-xl-rtds  8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 180278
 test-armhf-armhf-xl   8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180278
 test-armhf-armhf-examine  8 reboot   fail  like 180278
 test-armhf-armhf-xl-multivcpu  8 xen-boot fail like 180278
 test-armhf-armhf-libvirt-qcow2  8 xen-bootfail like 180278
 test-armhf-armhf-xl-credit2   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-vhd   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-arndale   8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linuxad2fd53a7870a395b8564697bef6c329d017c6c9
baseline version:
 linux6c538e1adbfc696ac4747fb10d63e704344f763d

Last test of basis   180278  2023-04-16 19:41:46 Z   24 days
Failing since180281  2023-04-17 06:24:36 Z   23 days   44 attempts
Testing same since   180606  2023-05-10 16:41:43 Z0 days1 attempts


2365 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pvops 

[PATCH net-next v2 1/2] net: introduce and use skb_frag_fill_page_desc()

2023-05-10 Thread Yunsheng Lin
Most users use __skb_frag_set_page()/skb_frag_off_set()/
skb_frag_size_set() to fill the page desc for a skb frag.

Introduce skb_frag_fill_page_desc() to do that.

net/bpf/test_run.c does not call skb_frag_off_set() to
set the offset, "copy_from_user(page_address(page), ...)"
and 'shinfo' being part of the 'data' kzalloced in
bpf_test_init() suggest that it is assuming offset to be
initialized as zero, so call skb_frag_fill_page_desc()
with offset being zero for this case.

Also, skb_frag_set_page() is not used anymore, so remove
it.

Signed-off-by: Yunsheng Lin 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Simon Horman 
---
 .../net/ethernet/aquantia/atlantic/aq_ring.c  |  6 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  5 ++-
 drivers/net/ethernet/chelsio/cxgb3/sge.c  |  5 ++-
 drivers/net/ethernet/emulex/benet/be_main.c   | 32 ++-
 drivers/net/ethernet/freescale/enetc/enetc.c  |  5 ++-
 .../net/ethernet/fungible/funeth/funeth_rx.c  |  5 ++-
 drivers/net/ethernet/marvell/mvneta.c |  5 ++-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  4 +--
 drivers/net/ethernet/sun/cassini.c|  8 ++---
 drivers/net/virtio_net.c  |  4 +--
 drivers/net/vmxnet3/vmxnet3_drv.c |  4 +--
 drivers/net/xen-netback/netback.c |  4 +--
 include/linux/skbuff.h| 27 ++--
 net/bpf/test_run.c|  3 +-
 net/core/gro.c|  4 +--
 net/core/pktgen.c | 13 +---
 net/core/skbuff.c |  7 ++--
 net/tls/tls_device.c  | 10 +++---
 net/xfrm/xfrm_ipcomp.c|  5 +--
 19 files changed, 64 insertions(+), 92 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 7f933175cbda..4de22eed099a 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -532,10 +532,10 @@ static bool aq_add_rx_fragment(struct device *dev,
  buff_->rxdata.pg_off,
  buff_->len,
  DMA_FROM_DEVICE);
-   skb_frag_off_set(frag, buff_->rxdata.pg_off);
-   skb_frag_size_set(frag, buff_->len);
sinfo->xdp_frags_size += buff_->len;
-   __skb_frag_set_page(frag, buff_->rxdata.page);
+   skb_frag_fill_page_desc(frag, buff_->rxdata.page,
+   buff_->rxdata.pg_off,
+   buff_->len);
 
buff_->is_cleaned = 1;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index dcd9367f05af..efaff5018af8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1085,9 +1085,8 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
 
cons_rx_buf = &rxr->rx_agg_ring[cons];
-   skb_frag_off_set(frag, cons_rx_buf->offset);
-   skb_frag_size_set(frag, frag_len);
-   __skb_frag_set_page(frag, cons_rx_buf->page);
+   skb_frag_fill_page_desc(frag, cons_rx_buf->page,
+   cons_rx_buf->offset, frag_len);
shinfo->nr_frags = i + 1;
__clear_bit(cons, rxr->rx_agg_bmap);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c 
b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index efa7f401529e..2e9a74fe0970 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -2184,9 +2184,8 @@ static void lro_add_page(struct adapter *adap, struct 
sge_qset *qs,
len -= offset;
 
rx_frag += nr_frags;
-   __skb_frag_set_page(rx_frag, sd->pg_chunk.page);
-   skb_frag_off_set(rx_frag, sd->pg_chunk.offset + offset);
-   skb_frag_size_set(rx_frag, len);
+   skb_frag_fill_page_desc(rx_frag, sd->pg_chunk.page,
+   sd->pg_chunk.offset + offset, len);
 
skb->len += len;
skb->data_len += len;
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 7e408bcc88de..3164ed205cf7 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2343,11 +2343,10 @@ static void skb_fill_rx_data(struct be_rx_obj *rxo, 
struct sk_buff *skb,
hdr_len = ETH_HLEN;
memcpy(skb->data, start, hdr_len);
skb_shinfo(skb)->nr_frags = 1;
-   skb_frag_set_page(skb, 0, page_info->page);
-   skb_frag_off_set(&skb_shinfo(skb)->frags[0],
-page_info->page_offset + hdr_len

Re: DROPME [XEN PATCH 0/2] Use streaming decompression for ZSTD kernels

2023-05-10 Thread Rafaël Kooi

On 10/05/2023 02:18, Rafaël Kooi wrote:> I've attempted to get Xen to boot Arch 
Linux as a unified EFI binary.

Using https://xenbits.xen.org/docs/unstable/misc/efi.html as my source
of information, I've been able to build a unified binary. When trying to
boot the kernel Xen complains that the stream is corrupt
("ZSTD-compressed data is corrupt"). I've been able to reproduce the
issue locally in user-mode, and confirmed that the issue is also present
in the latest ZSTD version.

Using streaming decompression the kernel gets unpacked properly and the
output is the same as if doing `cat kernel.zst | unzstd > bzImage`.

A problem I ran into was that adding book keeping to decompress.c would
result in either a .data section being added or a .bss.* section. The
linker would complain about this. And since I am not familiar with this
code, and why it is this way, I opted to add a user-pointer to the
internal decompression API.

Rafaël Kooi (2):
   xen/decompress: Add a user pointer for book keeping in the callbacks
   x86/Dom0: Use streaming decompression for ZSTD compressed kernels

  xen/common/bunzip2.c | 23 --
  xen/common/decompress.c  | 37 ++--
  xen/common/unlz4.c   | 15 ---
  xen/common/unlzma.c  | 30 +
  xen/common/unlzo.c   | 13 +++--
  xen/common/unxz.c| 11 ++-
  xen/common/unzstd.c  | 13 +++--
  xen/include/xen/decompress.h | 10 +++---
  8 files changed, 97 insertions(+), 55 deletions(-)

--
2.40.0



This patch can be dropped in its entirety. The issue that this code
fixes is a symptom of another issue entirely. What ended up being the
issue was that the SSD of my laptop is dead, and that I messed up the
alignment of the sections inserted into xen.efi.

I plan to add some sanity checks to the EFI boot loader code to inform
the user if the sections are misaligned. That's a much more user friendly
error than whatever the decompressor will report.

To Jan, sorry for wasting your time, and thanks again for being patient
with me.

Rafaël



[qemu-mainline test] 180600: tolerable FAIL - PUSHED

2023-05-10 Thread osstest service owner
flight 180600 qemu-mainline real [real]
flight 180608 qemu-mainline real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/180600/
http://logs.test-lab.xenproject.org/osstest/logs/180608/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-i386-pair11 xen-install/dst_host fail pass in 180608-retest

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 180586
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 180586
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 180586
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 180586
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 180586
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 180586
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180586
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 180586
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass

version targeted for testing:
 qemuu577e648bdb524d1984659baf1bd6165de2edae83
baseline version:
 qemuu271477b59e723250f17a7e20f139262057921b6a

Last test of basis   180586  2023-05-09 06:38:31 Z1 days
Testing same since   180600  2023-05-10 05:40:33 Z0 days1 attempts


People who touched revisions under test:
  Babu Moger 
  Kim Phillips 
  Michael Roth 
  Michael S. Tsirkin 
  Paolo Bonzini 
  Richard Henderson 
  Santosh Shukla 
  Thomas H

[xen-unstable-smoke test] 180607: tolerable all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180607 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180607/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  31c65549746179e16cf3f82b694b4b1e0b7545ca
baseline version:
 xen  ed6b7c0266e512c1207c07911da14e684f47b909

Last test of basis   180592  2023-05-09 21:03:16 Z1 days
Testing same since   180607  2023-05-10 19:03:28 Z0 days1 attempts


People who touched revisions under test:
  Alejandro Vallejo 
  Andrew Cooper 
  Anthony PERARD 
  Christian Lindig 
  Roger Pau Monné 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   ed6b7c0266..31c6554974  31c65549746179e16cf3f82b694b4b1e0b7545ca -> smoke



Re: [PATCH v5 05/14] tools/xenstore: use accounting buffering for node accounting

2023-05-10 Thread Julien Grall

Hi,

On 10/05/2023 13:54, Juergen Gross wrote:

On 09.05.23 20:46, Julien Grall wrote:

Hi Juergen,

On 08/05/2023 12:47, Juergen Gross wrote:

Add the node accounting to the accounting information buffering in
order to avoid having to undo it in case of failure.

This requires to call domain_nbentry_dec() before any changes to the
data base, as it can return an error now.

Signed-off-by: Juergen Gross 
---
V5:
- add error handling after domain_nbentry_dec() calls (Julien Grall)
---
  tools/xenstore/xenstored_core.c   | 29 +++--
  tools/xenstore/xenstored_domain.h |  4 ++--
  2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/tools/xenstore/xenstored_core.c 
b/tools/xenstore/xenstored_core.c

index 8392bdec9b..22da434e2a 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1454,7 +1454,6 @@ static void destroy_node_rm(struct connection 
*conn, struct node *node)

  static int destroy_node(struct connection *conn, struct node *node)
  {
  destroy_node_rm(conn, node);
-    domain_nbentry_dec(conn, get_node_owner(node));
  /*
   * It is not possible to easily revert the changes in a 
transaction.
@@ -1645,6 +1644,9 @@ static int delnode_sub(const void *ctx, struct 
connection *conn,

  if (ret > 0)
  return WALK_TREE_SUCCESS_STOP;
+    if (domain_nbentry_dec(conn, get_node_owner(node)))
+    return WALK_TREE_ERROR_STOP;


I think there is a potential issue with the buffering here. In case of 
failure, the node could have been removed, but the quota would not be 
properly accounted.


You mean the case where another node has been deleted and due to accounting
buffering the corrected accounting data wouldn't be committed?

This is no problem, as an error returned by delnode_sub() will result in
corrupt() being called, which in turn will correct the accounting data.


To me corrupt() is a big hammer and it feels wrong to call it when I 
think we have easier/faster way to deal with the issue. Could we instead 
call acc_commit() before returning?




Also, I think a comment would be warrant to explain why we are 
returning WALK_TREE_ERROR_STOP here when...



+
  /* In case of error stop the walk. */
  if (!ret && do_tdb_delete(conn, &key, &node->acc))
  return WALK_TREE_SUCCESS_STOP;


... this is not the case when do_tdb_delete() fails for some reasons.


The main idea was that the remove is working from the leafs towards the 
root.

In case one entry can't be removed, we should just stop.

OTOH returning WALK_TREE_ERROR_STOP might be cleaner, as this would make 
sure

that accounting data is repaired afterwards. Even if do_tdb_delete() can't
really fail in normal configurations, as the only failure reasons are:

- the node isn't found (quite unlikely, as we just read it before entering
   delnode_sub()), or
- writing the updated data base failed (in normal configurations it is in
   already allocated memory, so no way to fail that)

I think I'll switch to return WALK_TREE_ERROR_STOP here.


See above for a different proposal.

Cheers,

--
Julien Grall



Re: [RFC PATCH] xen/arm: arm32: Enable smpboot on Arm32 based systems

2023-05-10 Thread Julien Grall

Hi Ayan,

On 04/05/2023 09:57, Ayan Kumar Halder wrote:


On 03/05/2023 18:43, Julien Grall wrote:

Hi Ayan,

Hi Julien,


On 03/05/2023 17:49, Ayan Kumar Halder wrote:


On 03/05/2023 08:40, Julien Grall wrote:

Hi,

Hi Julien,


Title: Did you mean "Enable spin table"?

Yes, that would be more concrete.


On 02/05/2023 11:58, Ayan Kumar Halder wrote:
On some of the Arm32 based systems (eg Cortex-R52), smpboot is 
supported.


Same here.

Yes


In these systems PSCI may not always be supported. In case of 
Cortex-R52, there
is no EL3 or secure mode. Thus, PSCI is not supported as it 
requires EL3.


Thus, we use 'spin-table' mechanism to boot the secondary cpus. The 
primary
cpu provides the startup address of the secondary cores. This 
address is

provided using the 'cpu-release-addr' property.

To support smpboot, we have copied the code from 
xen/arch/arm/arm64/smpboot.c


I would rather prefer if we don't duplicate the code but instead 
move the logic in common code.

Ack



with the following changes :-

1. 'enable-method' is an optional property. Refer to the comment in
https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/cpus.yaml
"  # On ARM 32-bit systems this property is optional"


Looking at this list, "spin-table" doesn't seem to be supported
for 32-bit systems. 


However, looking at 
https://developer.arm.com/documentation/den0013/d/Multi-core-processors/Booting-SMP-systems/SMP-boot-in-Linux , it seems "spin-table" is a valid boot mechanism for Armv7 cpus.


I am not able to find the associated code in Linux 32-bit. Do you have 
any pointer?


Unfortunately, no.

I see that in linux, "spin-table" support is added in 
arch/arm64/kernel/smp_spin_table.c. So, there seems to be no Arm32 
support for this.






Can you point me to the discussion/patch where this would be added?


Actually, this is the first discussion I am having with regards to 
adding a "spin-table" support on Arm32.


I was asking for the discussion on the Device-Tree/Linux ML or code.
I don't really want to do a "spin-table" support if this is not even 
supported in Linux.


I see your point. But that brings me to my next question, how do I parse 
cpu node specific properties for Arm32 cpus.


I was probably not very clear in my previous message. What I don't want 
is for Xen to use unofficial bindings.


IOW, if the existing binding doesn't allow the spin-table on arm32 (IMHO 
it is not clear) and it makes sense to use them, then we should first 
consider to send a patch against the documentation and merge it before 
Xen can use the properties.




In 
https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/cpus.yaml , I see some of the properties valid for Arm32 cpus.


For example:-

secondary-boot-reg
rockchip,pmu

Also, it says "additionalProperties: true" which means I can add 
platform specific properties under the cpu node.


For clarification, are you saying the bindings below is not yet 
official? If so, then this should be first discussed with the 
Device-Tree folks.


Please correct me if 
mistaken.


My cpu nodes will look like this :-

     cpu@1 {
     device_type = "cpu";
     compatible = "arm,armv8";
     reg = <0x00 0x01>;
     enable-method = "spin-table";


The enable-method "spin-table" is generic and expect property like 
cpu-release-addr. Yet...


     amd-cpu-release-addr = <0xEB58C010>; /* might also use 
"secondary-boot-reg" */

     amd-cpu-reset-addr = <0xEB58C000>;
     amd-cpu-reset-delay = <0xF0>;
     amd-cpu-re


... these are all AMD properties. What are the reasons to not use the 
generic "spin-table" bindings?


If you can't use it, then I think you should define a new type of 
enable-method.



     phandle = <0x03>;
     };

     cpu@2 {
     device_type = "cpu";
     compatible = "arm,armv8";
     reg = <0x00 0x02>;
     enable-method = "spin-table";
     amd-cpu-release-addr = <0xEB59C010>; /* might also use 
"secondary-boot-reg" */

     amd-cpu-reset-addr = <0xEB59C000>;
     amd-cpu-reset-delay = <0xF0>;
     amd-cpu-re
     phandle = <0x03>;
     };

If the reasoning makes sense, then does the following proposed change 
looks sane ?


I would first like to understand a bit more about how the bindings were 
created and whether this was discussed with the Device-Tree community.


Cheers,

--
Julien Grall



Re: [PATCH 5/6] x86/cpu-policy: Disentangle X86_NR_FEAT and FEATURESET_NR_ENTRIES

2023-05-10 Thread Andrew Cooper
On 09/05/2023 3:24 pm, Jan Beulich wrote:
> On 09.05.2023 16:03, Andrew Cooper wrote:
>> On 08/05/2023 8:45 am, Jan Beulich wrote:
>>> On 04.05.2023 21:39, Andrew Cooper wrote:
 When adding new words to a featureset, there is a reasonable amount of
 boilerplate and it is preforable to split the addition into multiple 
 patches.

 GCC 12 spotted a real (transient) error which occurs when splitting 
 additions
 like this.  Right now, FEATURESET_NR_ENTRIES is dynamically generated from 
 the
 highest numeric XEN_CPUFEATURE() value, and can be less than what the
 FEATURESET_* constants suggest the length of a featureset bitmap ought to 
 be.

 This causes the policy <-> featureset converters to genuinely access
 out-of-bounds on the featureset array.

 Rework X86_NR_FEAT to be related to FEATURESET_* alone, allowing it
 specifically to grow larger than FEATURESET_NR_ENTRIES.

 Reported-by: Jan Beulich 
 Signed-off-by: Andrew Cooper 
>>> While, like you, I could live with the previous patch even if I don't
>>> particularly like it, I'm not convinced of the route you take here.
>> It's the route you tentatively agreed to in
>> https://lore.kernel.org/xen-devel/a282c338-98ab-6c3f-314b-267a5a82b...@suse.com/
> Right. Yet I deliberately said "may be the best" there, as something
> better might turn up. And getting the two numbers to always agree, as
> suggested, might end up being better.

Then don't write "yes" if what you actually mean is "I'd prefer a
different option if possible", which is a "no".

I cannot read your mind, and we both know I do not have time to waste on
this task.

And now I have to go and spend yet more time doing it differently.

~Andrew



[PATCH] x86: Use printk_once() instead of opencoding it

2023-05-10 Thread Andrew Cooper
Technically our helper post-dates all of these examples, but it's good cleanup
nevertheless.  None of these examples should be using fully locked
test_and_set_bool() in the first place.

No functional change.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
CC: Jun Nakajima 
CC: Kevin Tian 
---
 xen/arch/x86/cpu/amd.c | 22 --
 xen/arch/x86/hvm/vmx/vmx.c | 18 ++
 xen/arch/x86/srat.c|  8 ++--
 xen/arch/x86/time.c| 19 ---
 4 files changed, 20 insertions(+), 47 deletions(-)

diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index caafe4474021..630adead2fc1 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -1061,25 +1061,19 @@ static void cf_check init_amd(struct cpuinfo_x86 *c)
 
rdmsrl(MSR_AMD64_LS_CFG, value);
if (!(value & (1 << 15))) {
-   static bool_t warned;
-
-   if (c == &boot_cpu_data || opt_cpu_info ||
-   !test_and_set_bool(warned))
-   printk(KERN_WARNING
-  "CPU%u: Applying workaround for erratum 
793\n",
-  smp_processor_id());
+   if (c == &boot_cpu_data || opt_cpu_info)
+   printk_once(XENLOG_WARNING
+   "CPU%u: Applying workaround for 
erratum 793\n",
+   smp_processor_id());
wrmsrl(MSR_AMD64_LS_CFG, value | (1 << 15));
}
} else if (c->x86 == 0x12) {
rdmsrl(MSR_AMD64_DE_CFG, value);
if (!(value & (1U << 31))) {
-   static bool warned;
-
-   if (c == &boot_cpu_data || opt_cpu_info ||
-   !test_and_set_bool(warned))
-   printk(KERN_WARNING
-  "CPU%u: Applying workaround for erratum 
665\n",
-  smp_processor_id());
+   if (c == &boot_cpu_data || opt_cpu_info)
+   printk_once(XENLOG_WARNING
+   "CPU%u: Applying workaround for 
erratum 665\n",
+   smp_processor_id());
wrmsrl(MSR_AMD64_DE_CFG, value | (1U << 31));
}
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 096c69251d58..0f392fc0d4fe 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1183,16 +1183,11 @@ static void cf_check vmx_get_segment_register(
  */
 if ( unlikely(!vmx_vmcs_try_enter(v)) )
 {
-static bool_t warned;
+printk_once(XENLOG_WARNING "Segment register inaccessible for %pv\n"
+"(If you see this outside of debugging activity,"
+" please report to xen-devel@lists.xenproject.org)\n",
+v);
 
-if ( !warned )
-{
-warned = 1;
-printk(XENLOG_WARNING "Segment register inaccessible for %pv\n"
-   "(If you see this outside of debugging activity,"
-   " please report to xen-devel@lists.xenproject.org)\n",
-   v);
-}
 memset(reg, 0, sizeof(*reg));
 return;
 }
@@ -2301,10 +2296,9 @@ static bool cf_check vmx_test_pir(const struct vcpu *v, 
uint8_t vec)
 static void cf_check vmx_handle_eoi(uint8_t vector, int isr)
 {
 uint8_t old_svi = set_svi(isr);
-static bool warned;
 
-if ( vector != old_svi && !test_and_set_bool(warned) )
-printk(XENLOG_WARNING "EOI for %02x but SVI=%02x\n", vector, old_svi);
+if ( vector != old_svi )
+printk_once(XENLOG_WARNING "EOI for %02x but SVI=%02x\n", vector, 
old_svi);
 }
 
 static void cf_check vmx_enable_msr_interception(struct domain *d, uint32_t 
msr)
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 56749ddca526..3f70338e6e23 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -55,7 +55,6 @@ nodeid_t setup_node(unsigned pxm)
 {
nodeid_t node;
unsigned idx;
-   static bool warned;
static unsigned nodes_found;
 
BUILD_BUG_ON(MAX_NUMNODES >= NUMA_NO_NODE);
@@ -75,11 +74,8 @@ nodeid_t setup_node(unsigned pxm)
if (pxm2node[idx].node == NUMA_NO_NODE)
goto finish;
 
-   if (!warned) {
-   printk(KERN_WARNING "SRAT: Too many proximity domains (%#x)\n",
-  pxm);
-   warned = true;
-   }
+   printk_once(XENLOG_WARNING "SRAT: Too many proximity domains (%#x)\n",
+   pxm);
 
return NUMA_NO_NODE;
 
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index bc75e1ae7d42..f5e30d4e0236 100644
---

Re: [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 6:43 AM Jan Beulich  wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > Print HWP-specific parameters.  Some are always present, but others
> > depend on hardware support.
> >
> > Signed-off-by: Jason Andryuk 
> > ---
> > v2:
> > Style fixes
> > Declare i outside loop
> > Replace repearted hardware/configured limits with spaces
> > Fixup for hw_ removal
> > Use XEN_HWP_GOVERNOR
> > Use HWP_ACT_WINDOW_EXPONENT_*
> > Remove energy_perf hw autonomous - 0 doesn't mean autonomous
> > ---
> >  tools/misc/xenpm.c | 65 ++
> >  1 file changed, 65 insertions(+)
> >
> > diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
> > index ce8d7644d0..b2defde0d4 100644
> > --- a/tools/misc/xenpm.c
> > +++ b/tools/misc/xenpm.c
> > @@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
> >  pause();
> >  }
> >
> > +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
> > +  unsigned int *activity_window,
> > +  const char **units)
>
> The function's return value would be nice to use for one of the two
> values that are being returned.

Ok, I'll return activity_window.

> > +{
> > +unsigned int mantissa = hwp->activity_window & 
> > HWP_ACT_WINDOW_MANTISSA_MASK;
> > +unsigned int exponent =
> > +(hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
> > +HWP_ACT_WINDOW_EXPONENT_MASK;
>
> I wish we had MASK_EXTR() in common-macros.h. While really a comment on
> patch 7 - HWP_ACT_WINDOW_EXPONENT_SHIFT is redundant information and
> should imo be omitted from the public interface, in favor of just a
> (suitably shifted) mask value. Also note how those constants all lack
> proper XEN_ prefixes.

I'll add a patch adding MASK_EXTR() & MASK_INSR() to common-macros.h
and use those - is there any reason not to do that?

I'll also add XEN_ prefixes.

> > +unsigned int multiplier = 1;
> > +unsigned int i;
> > +
> > +if ( hwp->activity_window == 0 )
> > +{
> > +*units = "hardware selected";
> > +*activity_window = 0;
> > +
> > +return;
> > +}
>
> While in line with documentation, any mantissa of 0 results in a 0us
> window, which I assume would then also mean "hardware selected".

I hadn't considered that.  The hardware seems to allow you to write a
0 mantissa, non-0 exponent.  From the SDM, it's unclear what that
would mean.  The code as written would display "0 us", "0 ms", or "0
s" - not "0 hardware selected".  Do you want more explicity printing
for those cases?  I think it's fine to have a distinction between the
output.  "0 hardware selected" is the known valid value that is
working as expected.  The other ones being something different seems
good to me since we don't really know what they mean.

> > @@ -773,6 +811,33 @@ static void print_cpufreq_para(int cpuid, struct 
> > xc_get_cpufreq_para *p_cpufreq)
> > p_cpufreq->scaling_cur_freq);
> >  }
> >
> > +if ( strcmp(p_cpufreq->scaling_governor, XEN_HWP_GOVERNOR) == 0 )
> > +{
> > +const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
> > +
> > +printf("hwp variables:\n");
> > +printf("  hardware limits: lowest [%u] most_efficient [%u]\n",
>
> Here and ...
>
> > +   hwp->lowest, hwp->most_efficient);
> > +printf(" : guaranteed [%u] highest [%u]\n",
> > +   hwp->guaranteed, hwp->highest);
> > +printf("  configured limits  : min [%u] max [%u] energy_perf 
> > [%u]\n",
>
> ... here I wonder what use the underscores are in produced output. I'd
> use blanks. If you really want a separator there, then please use
> dashes.

I'll use blanks.

Regards,
Jason



Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 6:26 AM Jan Beulich  wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> > hardware rather closely.
> >
> > We need the features bitmask to indicated fields supported by the actual
> > hardware.
> >
> > The use of uint8_t parameters matches the hardware size.  uint32_t
> > entries grows the sysctl_t past the build assertion in setup.c.  The
> > uint8_t ranges are supported across multiple generations, so hopefully
> > they won't change.
>
> Still it feels a little odd for values to be this narrow. Aiui the
> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> used by HWP. So you could widen the union in struct
> xen_get_cpufreq_para (in a binary but not necessarily source compatible
> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> placed scaling_cur_freq could be included as well ...

The values are narrow, but they match the hardware.  It works for HWP,
so there is no need to change at this time AFAICT.

Do you want me to make this change?

> > --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -506,6 +506,31 @@ static const struct cpufreq_driver __initconstrel 
> > hwp_cpufreq_driver =
> >  .update = hwp_cpufreq_update,
> >  };
> >
> > +int get_hwp_para(const struct cpufreq_policy *policy,
>
> While I don't really mind a policy being passed into here, ...
>
> > + struct xen_hwp_para *hwp_para)
> > +{
> > +unsigned int cpu = policy->cpu;
>
> ... this is its only use afaics, and hence the caller could as well pass
> in just a CPU number?

Sounds good.

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -292,6 +292,31 @@ struct xen_ondemand {
> >  uint32_t up_threshold;
> >  };
> >
> > +struct xen_hwp_para {
> > +/*
> > + * bits 6:0   - 7bit mantissa
> > + * bits 9:7   - 3bit base-10 exponent
> > + * btis 15:10 - Unused - must be 0
> > + */
> > +#define HWP_ACT_WINDOW_MANTISSA_MASK  0x7f
> > +#define HWP_ACT_WINDOW_EXPONENT_MASK  0x7
> > +#define HWP_ACT_WINDOW_EXPONENT_SHIFT 7
> > +uint16_t activity_window;
> > +/* energy_perf range 0-255 if 1. Otherwise 0-15 */
> > +#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0)
> > +/* activity_window supported if 1 */
> > +#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1)
> > +uint8_t features; /* bit flags for features */
> > +uint8_t lowest;
> > +uint8_t most_efficient;
> > +uint8_t guaranteed;
> > +uint8_t highest;
> > +uint8_t minimum;
> > +uint8_t maximum;
> > +uint8_t desired;
> > +uint8_t energy_perf;
>
> These fields could do with some more commentary. To be honest I had
> trouble figuring (from the SDM) what exact meaning specific numeric
> values have. Readers of this header should at the very least be told
> where they can turn to in order to understand what these fields
> communicate. (FTAOD this could be section names, but please not
> section numbers. The latter are fine to use in a discussion, but
> they're changing too frequently to make them useful in code
> comments.)

Sounds good.  I'll add some description.

> > +};
>
> Also, if you decide to stick to uint8_t, then the trailing padding
> field (another uint8_t) wants making explicit. I'm on the edge
> whether to ask to also check the field: Right here the struct is
> "get only", and peeking ahead you look to be introducing a separate
> sub-op for "set". Perhaps if you added /* OUT */ at the top of the
> new struct? (But if you don't check the field for being zero, then
> you'll want to set it to zero for forward compatibility.)

Thanks for catching.  I'll add the padding field and zero it.

On Mon, May 8, 2023 at 6:47 AM Jan Beulich  wrote:
>
> On 08.05.2023 12:25, Jan Beulich wrote:
> > On 01.05.2023 21:30, Jason Andryuk wrote:
> >> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> >> hardware rather closely.
> >>
> >> We need the features bitmask to indicated fields supported by the actual
> >> hardware.
> >>
> >> The use of uint8_t parameters matches the hardware size.  uint32_t
> >> entries grows the sysctl_t past the build assertion in setup.c.  The
> >> uint8_t ranges are supported across multiple generations, so hopefully
> >> they won't change.
> >
> > Still it feels a little odd for values to be this narrow. Aiui the
> > scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> > used by HWP. So you could widen the union in struct
> > xen_get_cpufreq_para (in a binary but not necessarily source compatible
> > manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> > placed scaling_cur_freq could be included as well ...
>
> Having seen patch 9 now as well, I wonder whether here (or in a separate
> patch) you don't want to limit providing inapplicable data (for example
> not filling *scaling_available_governors would even avoid an all

Re: [XEN PATCH 2/2] x86/Dom0: Use streaming decompression for ZSTD compressed kernels

2023-05-10 Thread Rafaël Kooi

On 10/05/2023 11:48, Jan Beulich wrote:> First of all - please don't drop Cc-s 
when replying. I'm restoring

xen-devel@ here at least.



Apologies, I didn't notice that replying dropped the Cc-s. Should I send
the emails again with the proper Cc-s?


On 10.05.2023 10:51, Rafaël Kooi wrote:

On 10/05/2023 10:03, Jan Beulich wrote:> On 10.05.2023 02:18, Rafaël Kooi wrote:

On Arch Linux kernel decompression will fail when Xen has been unified
with the kernel and initramfs as a single binary. This change works for
both streaming and non-streaming ZSTD content.


This could do with better explaining what "unified" means, and how
streaming decompression actually makes a difference.



I don't mind explaining it further, but with the efi documentation for
it existing on xenbits, should I just refer to that?


You may of course refer to existing documentation. Iirc that doesn't
cover any compression aspects, though.



Right, I'll think about what ends up being the clearest explanation.


--- a/xen/common/decompress.c
+++ b/xen/common/decompress.c
@@ -3,11 +3,26 @@
   #include 
   #include 
   
+typedef struct _ZSTD_state

+{
+void *write_buf;
+unsigned int write_pos;
+} ZSTD_state;
+
   static void __init cf_check error(const char *msg)
   {
   printk("%s\n", msg);
   }
   
+static int __init cf_check ZSTD_flush(void *buf, unsigned int pos,

+  void *userptr)
+{
+ZSTD_state *state = (ZSTD_state*)userptr;
+memcpy(state->write_buf + state->write_pos, buf, pos);
+state->write_pos += pos;
+return pos;
+}


This doesn't really belong here, but will (I expect) go away anyway once
you drop the earlier patch.



The ZSTD_flush will have to stay, as that is how the decompressor will
start streaming decompression. The difference will be that the book
keeping will be "global" (to the translation unit).


But this bookkeeping should be entirely in zstd code (i.e. presumably
unzstd.c).



The implementation of the decompression functions seem to indicate
otherwise. Referring to unzstd.c:`unzstd`, the function will take the
streaming decompression path if either `fill` or `flush` have been
supplied. I cross checked with unlzma.c and unxz.c, and that seems to
have similar behavior in regards to flushing the output data. The
`flush` function is passed a buffer to a chunk of decompressed data with
`pos` being the size of the chunk. For the sake of consistency I don't
think it's a good idea to deviate from this behavior in just unzstd.c.

I could wrap the decompression code in another file and function, but
in my opinion it should stay here and be renamed to something generic
like `stream_flush` or `chunk_flush`.


   if ( len >= 4 && !memcmp(inbuf, "\x28\xb5\x2f\xfd", 4) )
-   return unzstd(inbuf, len, NULL, NULL, outbuf, NULL, error);
+{
+// NOTE (Rafaël): On Arch Linux the kernel is compressed in a way
+// that requires streaming ZSTD decompression. Otherwise decompression
+// will fail when using a unified EFI binary. Somehow decompression
+// works when not using a unified EFI binary, I suspect this is the
+// kernel self decompressing. Or there is a code path that I am not
+// aware of that takes care of the use case properly.


Along the lines of what I've said for the description, this wants to avoid
terms like "somehow" if at all possible.


I've used the term "somehow" because I don't know why decompression
works when Xen loads the kernel from the EFI file system. I assume the
kernel still gets unpacked by Xen, right? Or does the kernel unpack
itself?


The handling of Dom0 kernel decompression ought to be entirely independent
of EFI vs legacy. Unless I'm wrong with that (mis-remembering), you
mentioning EFI is potentially misleading. And yes, at least on x86 the
kernel is decompressed by Xen (by peeking into the supplied bzImage). The
difference between a plain bzImage and a "unified EFI binary" is what you
will want to outline in the description (and at least mention in the
comment). What I'm wondering is whether there simply is an issue with size
determination when the kernel is taken from the .kernel section.



Assuming you are talking about size determination of the compressed
bzImage, I notice a discrepancy in the size of the ZSTD stream and the
reported size by the vmlinuz-* header upon further investigation. When
using the streaming decompression code I made it output how many bytes
it reads from the extracted-but-still-compressed bzImage. The code
reads 12,327,560 bytes, but the size of the compressed bzImage in the
header is 12,327,564 bytes. In xen/arch/x86/bzImage.c `decompress` is
called with `orig_image_len`, when the function `output_length`
calculates the end address and removes 4 bytes from that address. If I
remove the last 4 bytes from the compressed bzImage then `unzstd` will
unpack it with `unzstd bzImage.zst -o bzImage`, otherwise it will
complain with `zstd: /*stdin*\: un

[libvirt test] 180599: tolerable all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180599 libvirt real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180599/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 180552
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 180552
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 180552
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-arm64-arm64-libvirt-qcow2 15 saverestore-support-checkfail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass

version targeted for testing:
 libvirt  3d6bc5c61101aadd6fca5d558a44a1cba8120178
baseline version:
 libvirt  9b8bb536ff999fa61e41869bd98a026b8e23378f

Last test of basis   180552  2023-05-06 04:18:49 Z4 days
Testing same since   180599  2023-05-10 04:18:55 Z0 days1 attempts


People who touched revisions under test:
  Erik Skultety 
  Michal Privoznik 
  Peter Krempa 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsmpass
 test-amd64-amd64-libvirt-xsm pass
 test-arm64-arm64-libvirt-xsm pass
 test-amd64-i386-libvirt-xsm  pass
 test-amd64-amd64-libvirt pass
 test-arm64-arm64-libvirt pass
 test-armhf-armhf-libvirt pass
 test-amd64-i386-libvirt  pass
 test-amd64-amd64-libvirt-pairpass
 test-amd64-i386-libvirt-pair pass
 test-arm64-arm64-libvirt-qcow2   pass
 test-armhf-armhf-libvirt-qcow2   pass
 test-arm64-arm64-libvirt-raw pass
 test-armhf-armhf-libvirt-raw pass
 test-amd64-i386-libvirt-raw  pass
 test-amd64-amd64-libvirt-vhd pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports

[ovmf test] 180604: all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180604 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180604/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 6fb2760dc8939b16a906b8e6bb224764907168f3
baseline version:
 ovmf e6447d2a08f5ca585816d093e79a01dad3781f98

Last test of basis   180603  2023-05-10 13:12:05 Z0 days
Testing same since   180604  2023-05-10 15:12:03 Z0 days1 attempts


People who touched revisions under test:
  Ard Biesheuvel 
  Gerd Hoffmann 
  Jiewen Yao 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   e6447d2a08..6fb2760dc8  6fb2760dc8939b16a906b8e6bb224764907168f3 -> 
xen-tested-master



Re: [PATCH v2] tools: convert bitfields to unsigned type

2023-05-10 Thread Anthony PERARD
On Tue, May 09, 2023 at 09:10:04AM +0200, Juergen Gross wrote:
> On 08.05.23 18:46, Olaf Hering wrote:
> > clang complains about the signed type:
> > 
> > implicit truncation from 'int' to a one-bit wide bit-field changes value 
> > from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> > 
> > The potential ABI change in libxenvchan is covered by the Xen version based 
> > SONAME.
> > 
> > Signed-off-by: Olaf Hering 
> 
> Reviewed-by: Juergen Gross 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



[linux-linus test] 180598: regressions - FAIL

2023-05-10 Thread osstest service owner
flight 180598 linux-linus real [real]
flight 180605 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/180598/
http://logs.test-lab.xenproject.org/osstest/logs/180605/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-armhf-armhf-xl-credit1   8 xen-boot fail REGR. vs. 180278

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-libvirt-qcow2 19 guest-start/debian.repeat fail pass in 
180605-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt  8 xen-boot fail  like 180278
 test-armhf-armhf-libvirt-raw  8 xen-boot fail  like 180278
 test-armhf-armhf-xl-rtds  8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 180278
 test-armhf-armhf-xl   8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180278
 test-armhf-armhf-examine  8 reboot   fail  like 180278
 test-armhf-armhf-xl-multivcpu  8 xen-boot fail like 180278
 test-armhf-armhf-libvirt-qcow2  8 xen-bootfail like 180278
 test-armhf-armhf-xl-credit2   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-vhd   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-arndale   8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 180278
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linux16a8829130ca22666ac6236178a6233208d425c3
baseline version:
 linux6c538e1adbfc696ac4747fb10d63e704344f763d

Last test of basis   180278  2023-04-16 19:41:46 Z   23 days
Failing since180281  2023-04-17 06:24:36 Z   23 days   43 attempts
Testing same since   180598  2023-05-10 04:12:30 Z0 days1 attempts


2361 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-arm64-pvopspass
 build-armhf-pvopspass
 build-i386-pv

Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-10 Thread Juergen Gross

On 10.05.23 15:30, Borislav Petkov wrote:

On Wed, May 10, 2023 at 01:36:41AM +0200, Borislav Petkov wrote:

More staring at this tomorrow, on a clear head.


Yeah, I'm going to leave it as is. Tried doing a union with bitfields
but doesn't get any prettier.

Next crapola:

The Intel box says now:

[8.138683] sgx: EPC section 0x8020-0x85ff
[8.204838] pmd_set_huge: Cannot satisfy [mem 0x8020-0x8040] with a 
huge-page mapping due to MTRR override, uniform: 0

(I've extended the debug output).

and that happens because

[8.174229] mtrr_type_lookup: mtrr_state_set: 1
[8.178909] mtrr_type_lookup: start: 0x8020, cache_map[3].start: 
0x8880

that's

 if (start < cache_map[i].start) {

in mtrr_type_lookup(). I fail to see how that check would work for the
range 0x8020-0x8040 and the MTRR map is:

[0.000587] MTRR map: 4 entries (3 fixed + 1 variable; max 23), built from 
10 variable MTRRs
[0.000588]   0: -0009 write-back
[0.000589]   1: 000a-000b uncachable
[0.000590]   2: 000c-000f write-protect
[0.000591]   3: 8880- uncachable

so the UC range comes after this one we request.

[8.186372] mtrr_type_lookup: type: 0x6, cache_map[3].type: 0x0

now the next type merging happens and the 3rd region's type is UC, ofc.

[8.192433] type_merge: type: 0x6, new_type: 0x0, effective_type: 0x0, clear 
uniform

we clear uniform and we fail:

[8.200331] mtrr_type_lookup: ret, uniform: 0

So this map lookup thing is wrong in this case.



Urgh, yes, there is something missing:

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c 
b/arch/x86/kernel/cpu/mtrr/generic.c
index 031f7ea8e72b..9544e7d13bb3 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -521,8 +521,12 @@ u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
for (i = 0; i < cache_map_n && start < end; i++) {
if (start >= cache_map[i].end)
continue;
-   if (start < cache_map[i].start)
+   if (start < cache_map[i].start) {
type = type_merge(type, mtrr_state.def_type, uniform);
+   start = cache_map[i].start;
+   if (end <= start)
+   break;
+   }
type = type_merge(type, cache_map[i].type, uniform);

start = cache_map[i].end;


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v5 12/14] tools/xenstore: use generic accounting for remaining quotas

2023-05-10 Thread Juergen Gross

On 09.05.23 21:21, Julien Grall wrote:

Hi Juergen,

On 08/05/2023 12:47, Juergen Gross wrote:

The maxrequests, node size, number of node permissions, and path length
quota are a little bit special, as they are either active in
transactions only (maxrequests), or they are just per item instead of
count values. Nevertheless being able to know the maximum number of
those quota related values per domain would be beneficial, so add them
to the generic accounting.

The per domain value will never show current numbers other than zero,
but the maximum number seen can be gathered the same way as the number
of nodes during a transaction.

To be able to use the const qualifier for a new function switch
domain_is_unprivileged() to take a const pointer, too.

For printing the quota/max values, adapt the print format string to
the longest quota name (now 17 characters long).

Signed-off-by: Juergen Gross 
---
V5:
- add comment (Julien Grall)
- add missing quota printing (Julien Grall)
---
  tools/xenstore/xenstored_core.c    | 15 +
  tools/xenstore/xenstored_core.h    |  2 +-
  tools/xenstore/xenstored_domain.c  | 45 +-
  tools/xenstore/xenstored_domain.h  |  6 
  tools/xenstore/xenstored_transaction.c |  4 +--
  tools/xenstore/xenstored_watch.c   |  2 +-
  6 files changed, 55 insertions(+), 19 deletions(-)

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index c98d30561f..fce73b883e 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -799,8 +799,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, 
struct node *node,

  + node->perms.num * sizeof(node->perms.p[0])
  + node->datalen + node->childlen;
-    if (!no_quota_check && domain_is_unprivileged(conn) &&
-    data.dsize >= quota_max_entry_size) {
+    /* Call domain_max_chk() in any case in order to record max values. */
+    if (domain_max_chk(conn, ACC_NODESZ, data.dsize, quota_max_entry_size)
+    && !no_quota_check) {
  errno = ENOSPC;
  return errno;
  }
@@ -1170,7 +1171,7 @@ static bool valid_chars(const char *node)
 "0123456789-/_@") == strlen(node));
  }
-bool is_valid_nodename(const char *node)
+bool is_valid_nodename(const struct connection *conn, const char *node)
  {
  int local_off = 0;
  unsigned int domid;
@@ -1190,7 +1191,8 @@ bool is_valid_nodename(const char *node)
  if (sscanf(node, "/local/domain/%5u/%n", &domid, &local_off) != 1)
  local_off = 0;
-    if (strlen(node) > local_off + quota_max_path_len)
+    if (domain_max_chk(conn, ACC_PATHLEN, strlen(node) - local_off,
+   quota_max_path_len))
  return false;
  return valid_chars(node);
@@ -1252,7 +1254,7 @@ static struct node *get_node_canonicalized(struct 
connection *conn,

  *canonical_name = canonicalize(conn, ctx, name);
  if (!*canonical_name)
  return NULL;
-    if (!is_valid_nodename(*canonical_name)) {
+    if (!is_valid_nodename(conn, *canonical_name)) {
  errno = EINVAL;
  return NULL;
  }
@@ -1778,8 +1780,7 @@ static int do_set_perms(const void *ctx, struct 
connection *conn,

  return EINVAL;
  perms.num--;
-    if (domain_is_unprivileged(conn) &&
-    perms.num > quota_nb_perms_per_node)
+    if (domain_max_chk(conn, ACC_NPERM, perms.num, quota_nb_perms_per_node))
  return ENOSPC;
  permstr = in->buffer + strlen(in->buffer) + 1;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 3564d85d7d..9339820156 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -258,7 +258,7 @@ void check_store(void);
  void corrupt(struct connection *conn, const char *fmt, ...);
  /* Is this a valid node name? */
-bool is_valid_nodename(const char *node);
+bool is_valid_nodename(const struct connection *conn, const char *node);
  /* Get name of parent node. */
  char *get_parent(const void *ctx, const char *node);
diff --git a/tools/xenstore/xenstored_domain.c 
b/tools/xenstore/xenstored_domain.c

index 6f3a27765a..b3a719569e 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -431,7 +431,7 @@ int domain_get_quota(const void *ctx, struct connection 
*conn,

  return ENOMEM;
  #define ent(t, e) \
-    resp = talloc_asprintf_append(resp, "%-16s: %8u (max: %8u\n", #t, \
+    resp = talloc_asprintf_append(resp, "%-17s: %8u (max: %8u\n", #t, \
    d->acc[e].val, d->acc[e].max); \
  if (!resp) return ENOMEM
@@ -440,6 +440,10 @@ int domain_get_quota(const void *ctx, struct connection 
*conn,

  ent(transactions, ACC_TRANS);
  ent(outstanding, ACC_OUTST);
  ent(memory, ACC_MEM);
+    ent(transaction-nodes, ACC_TRANSNODES);
+    ent(node-permissions, ACC_NPERM);
+    ent(path-length, ACC_PATHLEN);
+    ent(node-size, ACC_NODESZ);
  #undef ent
@@ -457,7 +461,7 @@ int do

Re: [PATCH v4 2/3] tools: Use new xc function for some xc_domain_getinfo() calls

2023-05-10 Thread Anthony PERARD
On Tue, May 09, 2023 at 05:07:11PM +0100, Alejandro Vallejo wrote:
> Move calls that require a information about a single precisely identified
> domain to the new xc_domain_getinfo_single().
> 
> Signed-off-by: Alejandro Vallejo 
> Reviewed-by: Andrew Cooper 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Roger Pau Monné
On Wed, May 10, 2023 at 03:30:21PM +0200, Jan Beulich wrote:
> On 10.05.2023 12:22, Roger Pau Monné wrote:
> > On Wed, May 10, 2023 at 12:00:51PM +0200, Jan Beulich wrote:
> >> On 10.05.2023 10:27, Roger Pau Monné wrote:
> >>> On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
>  On 09.05.2023 12:41, Roger Pau Monne wrote:
> > When translating an address that falls inside of a superpage in the
> > IOMMU page tables the fetching of the PTE physical address field
> > wasn't using dma_pte_addr(), which caused the returned data to be
> > corrupt as it would contain bits not related to the address field.
> 
>  I'm afraid I don't understand:
> 
> > --- a/xen/drivers/passthrough/vtd/iommu.c
> > +++ b/xen/drivers/passthrough/vtd/iommu.c
> > @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct 
> > domain *domain, daddr_t addr,
> >  
> >  if ( !alloc )
> >  {
> > -pte_maddr = 0;
> >  if ( !dma_pte_present(*pte) )
> > +{
> > +pte_maddr = 0;
> >  break;
> > +}
> >  
> >  /*
> >   * When the leaf entry was requested, pass back the 
> > full PTE,
> >   * with the address adjusted to account for the 
> > residual of
> >   * the walk.
> >   */
> > -pte_maddr = pte->val +
> > +pte_maddr +=
> >  (addr & ((1UL << level_to_offset_bits(level)) - 1) 
> > &
> >   PAGE_MASK);
> 
>  With this change you're now violating what the comment says (plus what
>  the comment ahead of the function says). And it says what it says for
>  a reason - see intel_iommu_lookup_page(), which I think your change is
>  breaking.
> >>>
> >>> Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
> >>> the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
> >>> doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
> >>> masking to remove the bits above PADDR_BITS.
> >>
> >> Oh, right. But that's a missing dma_pte_addr() in intel_iommu_lookup_page()
> >> then. (It would likely be better anyway to switch "uint64_t val" to
> >> "struct dma_pte pte" there, to make more visible that it's a PTE we're
> >> dealing with.) I indeed overlooked this aspect when doing the earlier
> >> change.
> > 
> > I guess I'm still confused, as the other return value for target == 0
> > (when the address is not part of a superpage) does return
> > dma_pte_addr(pte).  I think that needs further fixing then.
> 
> Hmm, indeed. But I think it's worse than this: addr_to_dma_page_maddr()
> also does one too many iterations in that case. All "normal" callers
> supply a positive "target". We need to terminate the walk at level 1
> also when target == 0.

Don't we do that already due to the following check:

if ( --level == target )
break;

Which prevents mapping the PTE address as a page table directory?

Thanks, Roger.



Re: [PATCH v4 1/3] tools: Modify single-domid callers of xc_domain_getinfolist()

2023-05-10 Thread Anthony PERARD
On Tue, May 09, 2023 at 05:07:10PM +0100, Alejandro Vallejo wrote:
> xc_domain_getinfolist() internally relies on a sysctl that performs
> a linear search for the domids. Many callers of xc_domain_getinfolist()
> who require information about a precise domid are much better off calling
> xc_domain_getinfo_single() instead, that will use the getdomaininfo domctl
> instead and ensure the returned domid matches the requested one. The domtctl
> will find the domid faster too, because that uses hashed lists.
> 
> Signed-off-by: Alejandro Vallejo 
> Reviewed-by: Andrew Cooper 
> Acked-by: Christian Lindig 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH 1/6] x86/cpu-policy: Drop build time cross-checks of featureset sizes

2023-05-10 Thread Andrew Cooper
On 09/05/2023 5:15 pm, Jan Beulich wrote:
> On 09.05.2023 17:59, Andrew Cooper wrote:
>> On 09/05/2023 3:28 pm, Jan Beulich wrote:
>>> On 09.05.2023 15:04, Andrew Cooper wrote:
 On 08/05/2023 7:47 am, Jan Beulich wrote:
> On 04.05.2023 21:39, Andrew Cooper wrote:
>> These BUILD_BUG_ON()s exist to cover the curious absence of a diagnostic 
>> for
>> code which looks like:
>>
>>   uint32_t foo[1] = { 1, 2, 3 };
>>
>> However, GCC 12 at least does now warn for this:
>>
>>   foo.c:1:24: error: excess elements in array initializer [-Werror]
>> 884 | uint32_t foo[1] = { 1, 2, 3 };
>> |^
>>   foo.c:1:24: note: (near initialization for 'foo')
> I'm pretty sure all gcc versions we support diagnose such cases. In turn
> the arrays in question don't have explicit dimensions at their
> definition sites, and hence they derive their dimensions from their
> initializers. So the build-time-checks are about the arrays in fact
> obtaining the right dimensions, i.e. the initializers being suitable.
>
> With the core part of the reasoning not being applicable, I'm afraid I
> can't even say "okay with an adjusted description".
 Now I'm extra confused.

 I put those BUILD_BUG_ON()'s in because I was not getting a diagnostic
 when I was expecting one, and there was a bug in the original featureset
 work caused by this going wrong.

 But godbolt seems to agree that even GCC 4.1 notices.

 Maybe it was some other error (C file not seeing the header properly?)
 which disappeared across the upstream review?
>>> Or maybe, by mistake, too few initializer fields? But what exactly it
>>> was probably doesn't matter. If this patch is to stay (see below), some
>>> different description will be needed anyway (or the change be folded
>>> into the one actually invalidating those BUILD_BUG_ON()s).
>>>
 Either way, these aren't appropriate, and need deleting before patch 5,
 because the check is no longer valid when a featureset can be longer
 than the autogen length.
>>> Well, they need deleting if we stick to the approach chosen there right
>>> now. If we switched to my proposed alternative, they better would stay.
>> Given that all versions of GCC do warn, I don't see any justification
>> for them to stay.
> All versions warn when the variable declarations / definitions have a
> dimension specified, and then there are excess initializers. Yet none
> of the five affected arrays have a dimension specified in their
> definitions.
>
> Even if dimensions were added, we'd then have only covered half of
> what the BUILD_BUG_ON()s cover right now: There could then be fewer
> than intended initializer fields, and things may still be screwed. I
> think it was for this very reason why BUILD_BUG_ON() was chosen.

???

The dimensions already exist, as proved by the fact GCC can spot the
violation.

On the other hand, zero extending a featureset is explicitly how they're
supposed to work.  How do you think Xapi has coped with migration
compatibility over the years, not to mention the microcode changes which
lengthen a featureset?

So no, there was never any problem with constructs of the form uint32_t
foo[10] = { 1, } in the first place.

The BUILD_BUG_ON()s therefore serve no purpose at all.

~Andrew



[ovmf test] 180603: all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180603 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180603/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf e6447d2a08f5ca585816d093e79a01dad3781f98
baseline version:
 ovmf 373a95532a785108f625877d993451d4a6d36713

Last test of basis   180601  2023-05-10 10:12:15 Z0 days
Testing same since   180603  2023-05-10 13:12:05 Z0 days1 attempts


People who touched revisions under test:
  Rebecca Cran 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   373a95532a..e6447d2a08  e6447d2a08f5ca585816d093e79a01dad3781f98 -> 
xen-tested-master



Re: [PATCH v2] Fix install.sh for systemd

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 1:14 PM Olaf Hering  wrote:
>
> On a fedora system, if you run `sudo sh install.sh` you break your
> system.  The installation clobbers /var/run, a symlink to /run.  A
> subsequent boot fails when /var/run and /run are different since
> accesses through /var/run can't find items that now only exist in /run
> and vice-versa.
>
> Skip populating /var/run/xen during make install.
> The directory is already created by some scripts. Adjust all remaining
> scripts to create XEN_RUN_DIR at runtime.
>
> XEN_RUN_STORED is covered by XEN_RUN_DIR because xenstored is usually
> started afterwards.
>
> Reported-by: Jason Andryuk 
> Signed-off-by: Olaf Hering 

Tested-by: Jason Andryuk 

I tested with Fedora/systemd.

Thanks, Olaf.

Regards,
Jason



[PATCH 3/3] libxl: create ~/control/feature-balloon

2023-05-10 Thread Yann Dirson
Like other feature controls it has to be created by the toolstack before
the guest can advertise the feature.

Reported-by: zithro 
Signed-off-by: Yann Dirson 
---
 tools/libs/light/libxl_create.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index ec8eab02c2..85eccc90cd 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -863,6 +863,9 @@ retry_transaction:
 libxl__xs_mknod(gc, t,
 GCSPRINTF("%s/control/sysrq", dom_path),
 rwperm, ARRAY_SIZE(rwperm));
+libxl__xs_mknod(gc, t,
+GCSPRINTF("%s/control/feature-balloon", dom_path),
+rwperm, ARRAY_SIZE(rwperm));
 
 libxl__xs_mknod(gc, t,
 GCSPRINTF("%s/data", dom_path),
-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com



[PATCH 2/3] docs: document ~/control/feature-balloon

2023-05-10 Thread Yann Dirson
This flag has been in use by XAPI for a long time (was present at creation
of the github xen-api/squeezed repo in 2013), and could be used by other
toolstacks.

Signed-off-by: Yann Dirson 
---
 docs/misc/xenstore-paths.pandoc | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/docs/misc/xenstore-paths.pandoc b/docs/misc/xenstore-paths.pandoc
index a604f6b1c6..6c4e2c3da2 100644
--- a/docs/misc/xenstore-paths.pandoc
+++ b/docs/misc/xenstore-paths.pandoc
@@ -509,6 +509,12 @@ This may be initialized to "" by the toolstack and may 
then be set
 to 0 or 1 by a guest to indicate whether it is capable of responding
 to a mode value written to ~/control/laptop-slate-mode.
 
+ ~/control/feature-balloon
+
+This may be initialized to "" by the toolstack and may then be set to
+0 or 1 by a guest to indicate whether it is capable of memory
+ballooning, and responds to values written to ~/memory/target.
+
 ### Domain Controlled Paths
 
  ~/data/* [w]
-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com



[PATCH 0/3] officializing xenstore control/feature-balloon entry

2023-05-10 Thread Yann Dirson
The main topic of this patch series is the ~/control/feature-balloon
entry used by XAPI, prompted by the report of xe-guest-utilities on
FreeBSD not being able to report the feature when using just libxl on
the host.

First patch is a bit off-topic, but included here because it fixes the
text from which this feature description was adapted from.

Yann Dirson (3):
  docs: fix complex-and-wrong xenstore-path wording
  docs: document ~/control/feature-balloon
  libxl: create ~/control/feature-balloon

 docs/misc/xenstore-paths.pandoc | 16 ++--
 tools/libs/light/libxl_create.c |  3 +++
 2 files changed, 13 insertions(+), 6 deletions(-)

-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com



[PATCH 1/3] docs: fix complex-and-wrong xenstore-path wording

2023-05-10 Thread Yann Dirson
"0 or 1 ... to indicate whether it is capable or incapable, respectively"
is luckily just swapped words.  Making this shorter will
make the reading easier.

Signed-off-by: Yann Dirson 
---
 docs/misc/xenstore-paths.pandoc | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/misc/xenstore-paths.pandoc b/docs/misc/xenstore-paths.pandoc
index f07ef90f63..a604f6b1c6 100644
--- a/docs/misc/xenstore-paths.pandoc
+++ b/docs/misc/xenstore-paths.pandoc
@@ -454,9 +454,8 @@ The precise protocol is not yet documented.
  ~/control/feature-suspend = (""|"0"|"1") [w]
 
 These may be initialized to "" by the toolstack and may then be set
-to 0 or 1 by a guest to indicate whether it is capable or incapable,
-respectively, of responding to the corresponding command when written
-to ~/control/shutdown.
+to 0 or 1 by a guest to indicate whether it is capable of responding
+to the corresponding command when written to ~/control/shutdown.
 A toolstack may then sample the feature- value at the point of issuing
 a PV control command and respond accordingly:
 
@@ -507,9 +506,8 @@ string back to the control node.
  ~/control/feature-laptop-slate-mode = (""|"0"|"1") [w]
 
 This may be initialized to "" by the toolstack and may then be set
-to 0 or 1 by a guest to indicate whether it is capable or incapable,
-respectively, of responding to a mode value written to
-~/control/laptop-slate-mode.
+to 0 or 1 by a guest to indicate whether it is capable of responding
+to a mode value written to ~/control/laptop-slate-mode.
 
 ### Domain Controlled Paths
 
-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com



Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-10 Thread Jan Beulich
On 10.05.2023 15:54, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
>>> The other issue is that if you select "hwp" as the governor, but HWP
>>> hardware support is not available, then hwp_available() needs to reset
>>> the governor back to the default.  This feels like a layering
>>> violation.
>>
>> Layering violation - yes. But why would the governor need resetting in
>> this case? If HWP was asked for but isn't available, I don't think any
>> other cpufreq handling (and hence governor) should be put in place.
>> And turning off cpufreq altogether (if necessary in the first place)
>> wouldn't, to me, feel as much like a layering violation.
> 
> My goal was for Xen to use HWP if available and fallback to the acpi
> cpufreq driver if not.  That to me seems more user-friendly than
> disabling cpufreq.
> 
> if ( hwp_available() )
> ret = hwp_register_driver();
> else
> ret = cpufreq_register_driver(&acpi_cpufreq_driver);

That's fine as a (future) default, but for now using hwp requires a
command line option, and if that option says "hwp" then it ought to
be hwp imo.

> If we are setting cpufreq_opt_governor to enter hwp_available(), but
> then HWP isn't available, it seems to me that we need to reset
> cpufreq_opt_governor when exiting hwp_available() false.

This may be necessary in the future, but shouldn't be necessary right
now. It's not entirely clear to me how that future is going to look
like, command line option wise.

Jan



Re: [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 5:53 AM Jan Beulich  wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -13,6 +13,8 @@
> >  #include 
> >  #include 
> >
> > +static bool hwp_in_use;
>
> __ro_after_init again, please.

Of course.  (I'd already made the change locally after the earlier ones.)

> > --- a/xen/include/acpi/pdc_intel.h
> > +++ b/xen/include/acpi/pdc_intel.h
> > @@ -17,6 +17,7 @@
> >  #define ACPI_PDC_C_C1_FFH(0x0100)
> >  #define ACPI_PDC_C_C2C3_FFH  (0x0200)
> >  #define ACPI_PDC_SMP_P_HWCOORD   (0x0800)
> > +#define ACPI_PDC_CPPC_NTV_INT(0x1000)
>
> I can probably live with NTV (albeit I'd prefer NATIVE), but INT is too
> ambiguous for my taste: Can at least that become INTR, please?

Sounds good.  I'm switching to ACPI_PDC_CPPC_NATIVE_INTR.

> With at least the minimal adjustments
> Reviewed-by: Jan Beulich 

Thank you.

Regards,
Jason



Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
>
> On 05.05.2023 17:35, Jason Andryuk wrote:
> > On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
> >>
> >> On 04.05.2023 18:56, Jason Andryuk wrote:
> >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
>  On 01.05.2023 21:30, Jason Andryuk wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> > oprofile, rather than detectin
> >  available support.
> >
> >  ### cpufreq
> > -> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
> >  | dom0-kernel`
> > +> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
> >  | dom0-kernel`
> 
>  Considering you use a special internal governor, the 4 governor 
>  alternatives are
>  meaningless for hwp. Hence at the command line level recognizing "hwp" 
>  as if it
>  was another governor name would seem better to me. This would then also 
>  get rid
>  of one of the two special "no-" prefix parsing cases (which I'm not 
>  overly
>  happy about).
> 
>  Even if not done that way I'm puzzled by the way you spell out the 
>  interaction
>  of "hwp" and "hdc": As you say in the description, "hdc" is meaningful 
>  only when
>  "hwp" was specified, so even if not merged with the governors group 
>  "hwp" should
>  come first, and "hdc" ought to be rejected if "hwp" wasn't first 
>  specified. (The
>  way you've spelled it out it actually looks to be kind of the other way 
>  around.)
> >>>
> >>> I placed them in alphabetical order, but, yes, it doesn't make sense.
> >>>
>  Strictly speaking "maxfreq" and "minfreq" also should be objected to 
>  when "hwp"
>  was specified.
> 
>  Overall I'm getting the impression that beyond your "verbose" related 
>  adjustment
>  more is needed, if you're meaning to get things closer to how we parse 
>  the
>  option (splitting across multiple lines to help see what I mean):
> 
>  `= none
>   | {{  | xen } [:{powersave|performance|ondemand|userspace}
>    
>  [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
>    [,verbose]]}
>   | dom0-kernel`
> 
>  (We're still parsing in a more relaxed way, e.g. minfreq may come ahead 
>  of
>  maxfreq, but better be more tight in the doc than too relaxed.)
> 
>  Furthermore while max/min freq don't apply directly, there are still two 
>  MSRs
>  controlling bounds at the package and logical processor levels.
> >>>
> >>> Well, we only program the logical processor level MSRs because we
> >>> don't have a good idea of the packages to know when we can skip
> >>> writing an MSR.
> >>>
> >>> How about this:
> >>> `= none
> >>>  | {{  | xen } {
> >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
> >>> | [:hwp[,hdc]] }
> >>>   [,verbose]]}
> >>>  | dom0-kernel`
> >>
> >> Looks right, yes.
> >
> > There is a wrinkle to using the hwp governor.  The hwp governor was
> > named "hwp-internal", so it needs to be renamed to "hwp" for use with
> > command line parsing.  That means the checking for "-internal" needs
> > to change to just "hwp" which removes the generality of the original
> > implementation.
>
> I'm afraid I don't see why this would strictly be necessary or a
> consequence.

Maybe I took your comment too far when you mentioned using hwp as a
fake governor.  I used the actual HWP struct cpufreq_governor to hook
into cpufreq_cmdline_parse().  cpufreq_cmdline_parse() uses the that
name for comparison.  But the naming stops being an issue if struct
cpufreq_governor gains a bool .internal flag.  That flag also makes
the registration checks clearer.

> > The other issue is that if you select "hwp" as the governor, but HWP
> > hardware support is not available, then hwp_available() needs to reset
> > the governor back to the default.  This feels like a layering
> > violation.
>
> Layering violation - yes. But why would the governor need resetting in
> this case? If HWP was asked for but isn't available, I don't think any
> other cpufreq handling (and hence governor) should be put in place.
> And turning off cpufreq altogether (if necessary in the first place)
> wouldn't, to me, feel as much like a layering violation.

My goal was for Xen to use HWP if available and fallback to the acpi
cpufreq driver if not.  That to me seems more user-friendly than
disabling cpufreq.

if ( hwp_available() )
ret = hwp_register_driver();
else
ret = cpufreq_register_driver(&acpi_cpufreq_driver);

If we are setting cpufreq_opt_governor to enter hwp_available(), but
then HWP is

Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-10 Thread Borislav Petkov
On Wed, May 10, 2023 at 03:30:24PM +0200, Borislav Petkov wrote:
> So this map lookup thing is wrong in this case.

Btw, current tree is:

https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc1-mtrr

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-10 Thread Borislav Petkov
On Wed, May 10, 2023 at 01:36:41AM +0200, Borislav Petkov wrote:
> More staring at this tomorrow, on a clear head.

Yeah, I'm going to leave it as is. Tried doing a union with bitfields
but doesn't get any prettier.

Next crapola:

The Intel box says now:

[8.138683] sgx: EPC section 0x8020-0x85ff
[8.204838] pmd_set_huge: Cannot satisfy [mem 0x8020-0x8040] with a 
huge-page mapping due to MTRR override, uniform: 0

(I've extended the debug output).

and that happens because

[8.174229] mtrr_type_lookup: mtrr_state_set: 1
[8.178909] mtrr_type_lookup: start: 0x8020, cache_map[3].start: 
0x8880

that's

 if (start < cache_map[i].start) {

in mtrr_type_lookup(). I fail to see how that check would work for the
range 0x8020-0x8040 and the MTRR map is:

[0.000587] MTRR map: 4 entries (3 fixed + 1 variable; max 23), built from 
10 variable MTRRs
[0.000588]   0: -0009 write-back
[0.000589]   1: 000a-000b uncachable
[0.000590]   2: 000c-000f write-protect
[0.000591]   3: 8880- uncachable

so the UC range comes after this one we request.

[8.186372] mtrr_type_lookup: type: 0x6, cache_map[3].type: 0x0

now the next type merging happens and the 3rd region's type is UC, ofc.

[8.192433] type_merge: type: 0x6, new_type: 0x0, effective_type: 0x0, clear 
uniform

we clear uniform and we fail:

[8.200331] mtrr_type_lookup: ret, uniform: 0

So this map lookup thing is wrong in this case.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Jan Beulich
On 10.05.2023 12:22, Roger Pau Monné wrote:
> On Wed, May 10, 2023 at 12:00:51PM +0200, Jan Beulich wrote:
>> On 10.05.2023 10:27, Roger Pau Monné wrote:
>>> On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
 On 09.05.2023 12:41, Roger Pau Monne wrote:
> When translating an address that falls inside of a superpage in the
> IOMMU page tables the fetching of the PTE physical address field
> wasn't using dma_pte_addr(), which caused the returned data to be
> corrupt as it would contain bits not related to the address field.

 I'm afraid I don't understand:

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct 
> domain *domain, daddr_t addr,
>  
>  if ( !alloc )
>  {
> -pte_maddr = 0;
>  if ( !dma_pte_present(*pte) )
> +{
> +pte_maddr = 0;
>  break;
> +}
>  
>  /*
>   * When the leaf entry was requested, pass back the full 
> PTE,
>   * with the address adjusted to account for the residual 
> of
>   * the walk.
>   */
> -pte_maddr = pte->val +
> +pte_maddr +=
>  (addr & ((1UL << level_to_offset_bits(level)) - 1) &
>   PAGE_MASK);

 With this change you're now violating what the comment says (plus what
 the comment ahead of the function says). And it says what it says for
 a reason - see intel_iommu_lookup_page(), which I think your change is
 breaking.
>>>
>>> Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
>>> the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
>>> doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
>>> masking to remove the bits above PADDR_BITS.
>>
>> Oh, right. But that's a missing dma_pte_addr() in intel_iommu_lookup_page()
>> then. (It would likely be better anyway to switch "uint64_t val" to
>> "struct dma_pte pte" there, to make more visible that it's a PTE we're
>> dealing with.) I indeed overlooked this aspect when doing the earlier
>> change.
> 
> I guess I'm still confused, as the other return value for target == 0
> (when the address is not part of a superpage) does return
> dma_pte_addr(pte).  I think that needs further fixing then.

Hmm, indeed. But I think it's worse than this: addr_to_dma_page_maddr()
also does one too many iterations in that case. All "normal" callers
supply a positive "target". We need to terminate the walk at level 1
also when target == 0.

Jan



Re: [PATCH 3/3] x86: Use CpuidUserDis if an AMD HVM guest toggles CPUID faulting

2023-05-10 Thread Jan Beulich
On 10.05.2023 12:52, Alejandro Vallejo wrote:
> On Wed, May 10, 2023 at 10:15:31AM +0200, Jan Beulich wrote:
>> On 09.05.2023 12:05, Andrew Cooper wrote:
>>> On 08/05/2023 2:18 pm, Jan Beulich wrote:
 On 05.05.2023 19:57, Alejandro Vallejo wrote:
> This is in order to aid guests of AMD hardware that we have exposed
> CPUID faulting to. If they try to modify the Intel MSR that enables
> the feature, trigger levelling so AMD's version of it (CpuidUserDis)
> is used instead.
>
> Signed-off-by: Alejandro Vallejo 
> ---
>  xen/arch/x86/msr.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
 Don't you also need to update cpu-policy.c:calculate_host_policy()
 for the guest to actually know it can use the functionality? Which
 in turn would appear to require some form of adjustment to
 lib/x86/policy.c:x86_cpu_policies_are_compatible().
>>>
>>> I asked Alejandro to do it like this.
>>>
>>> Advertising this to guests requires plumbing another MSR into the
>>> infrastructure which isn't quite set up properly let, and is in flux
>>> from my work.
>>
>> Maybe there was some misunderstanding here, as I realize only now. I
>> wasn't asking to expose the AMD feature, but instead I was after
>>
>> /* 0x00ce  MSR_INTEL_PLATFORM_INFO */
>> /* probe_cpuid_faulting() sanity checks presence of 
>> MISC_FEATURES_ENABLES */
>> p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
>>
>> wanting to be extended by "|| boot_cpu_has(X86_FEATURE_CPUID_USER_DIS)".
>> That, afaict, has no connection to plumbing yet another MSR.
> 
> Let me backtrack a little. There's 2 different (but related) matters under
> discussion:
> 
>  1. Trapping PV guest attempts to run the cpuid instruction
>  2. Providing a virtualized CPUID faulting interface so a guest kernel
> can intercept the CPUID instructions of code running under it.
> 
> This series was meant to provide fix (1) on AMD hardware. I did go a bit
> beyond in v1, trying to provide support for a unified faulting mechanism
> on AMD, but providing a virtualized CPUID faulting interface needs to be
> done a bit more carefully than that. Doing it partially now just adds tech
> debt to be paid when CpuidUserDis is exposed to the domains.
> 
> Changing the policy to expose the Intel interface when AMD is the backend
> falls on (2), which is probably better off done separately in a unified way.
> 
> v2 removes the changes to x86/msr.c so only (1) is addressed.
> 
> Does this make sense?

Certainly. Nevertheless I would like to understand what Andrew meant,
even if only for staying better in sync with all the work he has been
doing (and is still planning to do) in the area of CPU policies and
leveling.

Jan



Re: [PATCH v5 05/14] tools/xenstore: use accounting buffering for node accounting

2023-05-10 Thread Juergen Gross

On 09.05.23 20:46, Julien Grall wrote:

Hi Juergen,

On 08/05/2023 12:47, Juergen Gross wrote:

Add the node accounting to the accounting information buffering in
order to avoid having to undo it in case of failure.

This requires to call domain_nbentry_dec() before any changes to the
data base, as it can return an error now.

Signed-off-by: Juergen Gross 
---
V5:
- add error handling after domain_nbentry_dec() calls (Julien Grall)
---
  tools/xenstore/xenstored_core.c   | 29 +++--
  tools/xenstore/xenstored_domain.h |  4 ++--
  2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 8392bdec9b..22da434e2a 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1454,7 +1454,6 @@ static void destroy_node_rm(struct connection *conn, 
struct node *node)

  static int destroy_node(struct connection *conn, struct node *node)
  {
  destroy_node_rm(conn, node);
-    domain_nbentry_dec(conn, get_node_owner(node));
  /*
   * It is not possible to easily revert the changes in a transaction.
@@ -1645,6 +1644,9 @@ static int delnode_sub(const void *ctx, struct 
connection *conn,

  if (ret > 0)
  return WALK_TREE_SUCCESS_STOP;
+    if (domain_nbentry_dec(conn, get_node_owner(node)))
+    return WALK_TREE_ERROR_STOP;


I think there is a potential issue with the buffering here. In case of failure, 
the node could have been removed, but the quota would not be properly accounted.


You mean the case where another node has been deleted and due to accounting
buffering the corrected accounting data wouldn't be committed?

This is no problem, as an error returned by delnode_sub() will result in
corrupt() being called, which in turn will correct the accounting data.

Also, I think a comment would be warrant to explain why we are returning 
WALK_TREE_ERROR_STOP here when...



+
  /* In case of error stop the walk. */
  if (!ret && do_tdb_delete(conn, &key, &node->acc))
  return WALK_TREE_SUCCESS_STOP;


... this is not the case when do_tdb_delete() fails for some reasons.


The main idea was that the remove is working from the leafs towards the root.
In case one entry can't be removed, we should just stop.

OTOH returning WALK_TREE_ERROR_STOP might be cleaner, as this would make sure
that accounting data is repaired afterwards. Even if do_tdb_delete() can't
really fail in normal configurations, as the only failure reasons are:

- the node isn't found (quite unlikely, as we just read it before entering
  delnode_sub()), or
- writing the updated data base failed (in normal configurations it is in
  already allocated memory, so no way to fail that)

I think I'll switch to return WALK_TREE_ERROR_STOP here.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


[ovmf test] 180601: all pass - PUSHED

2023-05-10 Thread osstest service owner
flight 180601 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/180601/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 373a95532a785108f625877d993451d4a6d36713
baseline version:
 ovmf 9165a7e95ec6263c04c8babfdbe8bee133959300

Last test of basis   180597  2023-05-10 03:12:17 Z0 days
Testing same since   180601  2023-05-10 10:12:15 Z0 days1 attempts


People who touched revisions under test:
  Michael D Kinney 
  Rebecca Cran 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   9165a7e95e..373a95532a  373a95532a785108f625877d993451d4a6d36713 -> 
xen-tested-master



[xen-unstable test] 180596: tolerable FAIL - PUSHED

2023-05-10 Thread osstest service owner
flight 180596 xen-unstable real [real]
flight 180602 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/180596/
http://logs.test-lab.xenproject.org/osstest/logs/180602/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-i386-qemut-rhel6hvm-amd  7 xen-install   fail pass in 180602-retest

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 180589
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 180589
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 180589
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 180589
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 180589
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 180589
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 180589
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180589
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 180589
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 180589
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 180589
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 180589
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass

version targeted for testing:
 xen  ed6b7c0266e512c1207c07911da14e684f47b909
baseline version:
 xen  8b1ac353b4db7c5bb2f82cb6afee9cc641e756a4

Last test of basis   180589  2023-05-09 13:39:59 Z0

Re: [PATCH 0/3] Add CpuidUserDis support

2023-05-10 Thread Alejandro Vallejo
On Mon, May 08, 2023 at 11:06:31AM +0200, Jan Beulich wrote:
> On 05.05.2023 19:57, Alejandro Vallejo wrote:
> > Nowadays AMD supports trapping the CPUID instruction from ring3 to ring0,
> 
> Since it's relevant for PV32: Their doc talks about CPL > 0, i.e. not just
> ring 3. Therefore I wonder whether ...

Very true. It's CPL>0, not just ring3. Noted and changed on v2.

> 
> > (CpuidUserDis)
> 
> ... we shouldn't deviate from the PM and avoid the misleading use of "user"
> in our internal naming.
> 
> Jan
> 

IMO it's going to be confusing enough as is. We'll eventually have
virtualized versions of both Intel and AMD that may or may not be
cross-hooked with each other (e.g: virtualized Intel interface on
an AMD host) due to backward compatibility. That means we'll probably
want:

 1. A name for the Intel mechanism, currently "CPUID faulting"
 2. A name for the AMD mechanism, currently "CpuidUserDis"
 3. A generic name for the cpuid-can-be-trapped behaviour, currently
overloaded with the Intel name (but could do with a Xen-specific one).
It doesn't matter a lot now, but it will once the AMD interface is
virtualized.

Sure, we could give it an alternative name on AMD, but we still need yet
another one to disambiguate the trapping behaviour from the specific
mechanism that does it.

Using the AMD manual name does have the upside that it's easier to check
the manual without having to remember the AMD-specific feature that
corresponds to a Xen-specific name. That said, if you have a good suggestion
I'm happy to change it. So long as in the end result is (1), (2) and (3)
have non-ambiguous names.

Alejandro



Re: [PATCH 3/3] x86: Use CpuidUserDis if an AMD HVM guest toggles CPUID faulting

2023-05-10 Thread Alejandro Vallejo
On Wed, May 10, 2023 at 10:15:31AM +0200, Jan Beulich wrote:
> On 09.05.2023 12:05, Andrew Cooper wrote:
> > On 08/05/2023 2:18 pm, Jan Beulich wrote:
> >> On 05.05.2023 19:57, Alejandro Vallejo wrote:
> >>> This is in order to aid guests of AMD hardware that we have exposed
> >>> CPUID faulting to. If they try to modify the Intel MSR that enables
> >>> the feature, trigger levelling so AMD's version of it (CpuidUserDis)
> >>> is used instead.
> >>>
> >>> Signed-off-by: Alejandro Vallejo 
> >>> ---
> >>>  xen/arch/x86/msr.c | 9 -
> >>>  1 file changed, 8 insertions(+), 1 deletion(-)
> >> Don't you also need to update cpu-policy.c:calculate_host_policy()
> >> for the guest to actually know it can use the functionality? Which
> >> in turn would appear to require some form of adjustment to
> >> lib/x86/policy.c:x86_cpu_policies_are_compatible().
> > 
> > I asked Alejandro to do it like this.
> > 
> > Advertising this to guests requires plumbing another MSR into the
> > infrastructure which isn't quite set up properly let, and is in flux
> > from my work.
> 
> Maybe there was some misunderstanding here, as I realize only now. I
> wasn't asking to expose the AMD feature, but instead I was after
> 
> /* 0x00ce  MSR_INTEL_PLATFORM_INFO */
> /* probe_cpuid_faulting() sanity checks presence of MISC_FEATURES_ENABLES 
> */
> p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
> 
> wanting to be extended by "|| boot_cpu_has(X86_FEATURE_CPUID_USER_DIS)".
> That, afaict, has no connection to plumbing yet another MSR.
> 
> Jan

Let me backtrack a little. There's 2 different (but related) matters under
discussion:

 1. Trapping PV guest attempts to run the cpuid instruction
 2. Providing a virtualized CPUID faulting interface so a guest kernel
can intercept the CPUID instructions of code running under it.

This series was meant to provide fix (1) on AMD hardware. I did go a bit
beyond in v1, trying to provide support for a unified faulting mechanism
on AMD, but providing a virtualized CPUID faulting interface needs to be
done a bit more carefully than that. Doing it partially now just adds tech
debt to be paid when CpuidUserDis is exposed to the domains.

Changing the policy to expose the Intel interface when AMD is the backend
falls on (2), which is probably better off done separately in a unified way.

v2 removes the changes to x86/msr.c so only (1) is addressed.

Does this make sense?

Alejandro



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Roger Pau Monné
On Wed, May 10, 2023 at 12:00:51PM +0200, Jan Beulich wrote:
> On 10.05.2023 10:27, Roger Pau Monné wrote:
> > On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
> >> On 09.05.2023 12:41, Roger Pau Monne wrote:
> >>> When translating an address that falls inside of a superpage in the
> >>> IOMMU page tables the fetching of the PTE physical address field
> >>> wasn't using dma_pte_addr(), which caused the returned data to be
> >>> corrupt as it would contain bits not related to the address field.
> >>
> >> I'm afraid I don't understand:
> >>
> >>> --- a/xen/drivers/passthrough/vtd/iommu.c
> >>> +++ b/xen/drivers/passthrough/vtd/iommu.c
> >>> @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct 
> >>> domain *domain, daddr_t addr,
> >>>  
> >>>  if ( !alloc )
> >>>  {
> >>> -pte_maddr = 0;
> >>>  if ( !dma_pte_present(*pte) )
> >>> +{
> >>> +pte_maddr = 0;
> >>>  break;
> >>> +}
> >>>  
> >>>  /*
> >>>   * When the leaf entry was requested, pass back the full 
> >>> PTE,
> >>>   * with the address adjusted to account for the residual 
> >>> of
> >>>   * the walk.
> >>>   */
> >>> -pte_maddr = pte->val +
> >>> +pte_maddr +=
> >>>  (addr & ((1UL << level_to_offset_bits(level)) - 1) &
> >>>   PAGE_MASK);
> >>
> >> With this change you're now violating what the comment says (plus what
> >> the comment ahead of the function says). And it says what it says for
> >> a reason - see intel_iommu_lookup_page(), which I think your change is
> >> breaking.
> > 
> > Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
> > the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
> > doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
> > masking to remove the bits above PADDR_BITS.
> 
> Oh, right. But that's a missing dma_pte_addr() in intel_iommu_lookup_page()
> then. (It would likely be better anyway to switch "uint64_t val" to
> "struct dma_pte pte" there, to make more visible that it's a PTE we're
> dealing with.) I indeed overlooked this aspect when doing the earlier
> change.

I guess I'm still confused, as the other return value for target == 0
(when the address is not part of a superpage) does return
dma_pte_addr(pte).  I think that needs further fixing then.

Thanks, Roger.



Re: [XEN][PATCH v6 16/19] xen/arm: Implement device tree node addition functionalities

2023-05-10 Thread Michal Orzel



On 03/05/2023 01:36, Vikram Garhwal wrote:
> Update sysctl XEN_SYSCTL_dt_overlay to enable support for dtbo nodes addition
> using device tree overlay.
> 
> xl dt-overlay add file.dtbo:
> Each time overlay nodes are added using .dtbo, a new fdt(memcpy of
> device_tree_flattened) is created and updated with overlay nodes. This
> updated fdt is further unflattened to a dt_host_new. Next, it checks if 
> any
> of the overlay nodes already exists in the dt_host. If overlay nodes 
> doesn't
> exist then find the overlay nodes in dt_host_new, find the overlay node's
> parent in dt_host and add the nodes as child under their parent in the
> dt_host. The node is attached as the last node under target parent.
> 
> Finally, add IRQs, add device to IOMMUs, set permissions and map MMIO for 
> the
> overlay node.
> 
> When a node is added using overlay, a new entry is allocated in the
> overlay_track to keep the track of memory allocation due to addition of 
> overlay
> node. This is helpful for freeing the memory allocated when a device tree node
> is removed.
> 
> The main purpose of this to address first part of dynamic programming i.e.
> making xen aware of new device tree node which means updating the dt_host with
> overlay node information. Here we are adding/removing node from dt_host, and
> checking/setting IOMMU and IRQ permission but never mapping them to any 
> domain.
> Right now, mapping/Un-mapping will happen only when a new domU is
> created/destroyed using "xl create".
> 
> Signed-off-by: Vikram Garhwal 
> ---
>  xen/common/dt-overlay.c | 510 
>  1 file changed, 510 insertions(+)
> 
> diff --git a/xen/common/dt-overlay.c b/xen/common/dt-overlay.c
> index b89cceab84..09ea46111b 100644
> --- a/xen/common/dt-overlay.c
> +++ b/xen/common/dt-overlay.c
> @@ -33,6 +33,25 @@ static struct dt_device_node *
>  return child_node;
>  }
>  
> +/*
> + * Returns next node to the input node. If node has children then return
> + * last descendant's next node.
> +*/
> +static struct dt_device_node *
> +dt_find_next_node(struct dt_device_node *dt, const struct dt_device_node 
> *node)
> +{
> +struct dt_device_node *np;
> +
> +dt_for_each_device_node(dt, np)
> +if ( np == node )
> +break;
> +
> +if ( np->child )
> +np = find_last_descendants_node(np);
> +
> +return np->allnext;
> +}
> +
>  static int dt_overlay_remove_node(struct dt_device_node *device_node)
>  {
>  struct dt_device_node *np;
> @@ -106,6 +125,76 @@ static int dt_overlay_remove_node(struct dt_device_node 
> *device_node)
>  return 0;
>  }
>  
> +static int dt_overlay_add_node(struct dt_device_node *device_node,
> +   const char *parent_node_path)
> +{
> +struct dt_device_node *parent_node;
> +struct dt_device_node *next_node;
> +
> +parent_node = dt_find_node_by_path(parent_node_path);
> +
> +if ( parent_node == NULL )
> +{
> +dt_dprintk("Parent node %s not found. Overlay node will not be 
> added\n",
> +   parent_node_path);
> +return -EINVAL;
> +}
> +
> +/* If parent has no child. */
> +if ( parent_node->child == NULL )
> +{
> +next_node = parent_node->allnext;
> +device_node->parent = parent_node;
> +parent_node->allnext = device_node;
> +parent_node->child = device_node;
> +}
> +else
> +{
> +struct dt_device_node *np;
> +/* If parent has at least one child node.
incorrect comment style, should be:
/*
 *
 */

> + * Iterate to the last child node of parent.
> + */
> +for ( np = parent_node->child; np->sibling != NULL; np = np->sibling 
> );
> +
> +/* Iterate over all child nodes of np node. */
> +if ( np->child )
> +{
> +struct dt_device_node *np_last_descendant;
> +
> +np_last_descendant = find_last_descendants_node(np);
> +
> +next_node = np_last_descendant->allnext;
> +np_last_descendant->allnext = device_node;
> +}
> +else
> +{
> +next_node = np->allnext;
> +np->allnext = device_node;
> +}
> +
> +device_node->parent = parent_node;
> +np->sibling = device_node;
> +np->sibling->sibling = NULL;
> +}
> +
> +/* Iterate over all child nodes of device_node to add children too. */
> +if ( device_node->child )
> +{
> +struct dt_device_node *device_node_last_descendant;
> +
> +device_node_last_descendant = 
> find_last_descendants_node(device_node);
empty line

> +/* Plug next_node at the end of last children of device_node. */
> +device_node_last_descendant->allnext = next_node;
> +}
> +else
> +{
> +/* Now plug next_node at the end of device_node. */
> +device_node->allnext = next_node;
> +}
> +
> +return 0;
> +}
> +
> 

Re: [PATCH] x86/iommu: fix wrong iterator type in arch_iommu_hwdom_init()

2023-05-10 Thread Jan Beulich
On 10.05.2023 12:03, Jan Beulich wrote:
> On 10.05.2023 10:32, Roger Pau Monné wrote:
>> On Tue, May 09, 2023 at 06:25:05PM +0200, Jan Beulich wrote:
>>> On 09.05.2023 13:03, Roger Pau Monne wrote:
 The 'i' iterator index stores a pdx, not a pfn, and hence the initial
 assignation of start (which stores a pfn) needs a conversion from pfn
 to pdx.
>>>
>>> Strictly speaking: Yes. But pdx compression skips the bottom MAX_ORDER
>>> bits, so ...
>>
>> Oh, that wasn't obvious to me.
>>
 --- a/xen/drivers/passthrough/x86/iommu.c
 +++ b/xen/drivers/passthrough/x86/iommu.c
 @@ -406,7 +406,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain 
 *d)
   */
  start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>
>>> ... with this, ...
>>>
 -for ( i = start, count = 0; i < top; )
 +for ( i = pfn_to_pdx(start), count = 0; i < top; )
>>>
>>> ... this is an expensive identity transformation. Could I talk you into
>>> adding
>>>
>>> ASSERT(start == pfn_to_pdx(start));
>>>
>>> instead (or the corresponding BUG_ON() if you'd prefer that, albeit then
>>> the expensive identity transformation will still be there even in release
>>> builds; not that it matters all that much right here, but still)?
>>
>> So far the value of start is not influenced by hardware, so having an
>> assert should be fine.
>>
>> Given that the assignation is just done once at the start of the loop
>> I don't see it being that relevant to the performance of this piece of
>> code TBH, the more that we do a pdx_to_pfn() for each loop
>> iteration, so my preference would be to use the proposed change.
> 
> Well, okay, but then please with the description also "softened" a
> little (it isn't really "needs", but e.g. "better would undergo"),

And in the title then perhaps s/fix wrong/adjust/.

Jan

> alongside ...
> 
>>> In any event, with no real bug fixed (unless I'm overlooking something),
>>> I would suggest to drop the Fixes: tag.
>>
>> Right, I could drop that.
> 
> ... this.
> 
> Jan
> 




Re: [PATCH] x86/iommu: fix wrong iterator type in arch_iommu_hwdom_init()

2023-05-10 Thread Jan Beulich
On 10.05.2023 10:32, Roger Pau Monné wrote:
> On Tue, May 09, 2023 at 06:25:05PM +0200, Jan Beulich wrote:
>> On 09.05.2023 13:03, Roger Pau Monne wrote:
>>> The 'i' iterator index stores a pdx, not a pfn, and hence the initial
>>> assignation of start (which stores a pfn) needs a conversion from pfn
>>> to pdx.
>>
>> Strictly speaking: Yes. But pdx compression skips the bottom MAX_ORDER
>> bits, so ...
> 
> Oh, that wasn't obvious to me.
> 
>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>> @@ -406,7 +406,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain 
>>> *d)
>>>   */
>>>  start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>
>> ... with this, ...
>>
>>> -for ( i = start, count = 0; i < top; )
>>> +for ( i = pfn_to_pdx(start), count = 0; i < top; )
>>
>> ... this is an expensive identity transformation. Could I talk you into
>> adding
>>
>> ASSERT(start == pfn_to_pdx(start));
>>
>> instead (or the corresponding BUG_ON() if you'd prefer that, albeit then
>> the expensive identity transformation will still be there even in release
>> builds; not that it matters all that much right here, but still)?
> 
> So far the value of start is not influenced by hardware, so having an
> assert should be fine.
> 
> Given that the assignation is just done once at the start of the loop
> I don't see it being that relevant to the performance of this piece of
> code TBH, the more that we do a pdx_to_pfn() for each loop
> iteration, so my preference would be to use the proposed change.

Well, okay, but then please with the description also "softened" a
little (it isn't really "needs", but e.g. "better would undergo"),
alongside ...

>> In any event, with no real bug fixed (unless I'm overlooking something),
>> I would suggest to drop the Fixes: tag.
> 
> Right, I could drop that.

... this.

Jan



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Jan Beulich
On 10.05.2023 10:27, Roger Pau Monné wrote:
> On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
>> On 09.05.2023 12:41, Roger Pau Monne wrote:
>>> When translating an address that falls inside of a superpage in the
>>> IOMMU page tables the fetching of the PTE physical address field
>>> wasn't using dma_pte_addr(), which caused the returned data to be
>>> corrupt as it would contain bits not related to the address field.
>>
>> I'm afraid I don't understand:
>>
>>> --- a/xen/drivers/passthrough/vtd/iommu.c
>>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>>> @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct domain 
>>> *domain, daddr_t addr,
>>>  
>>>  if ( !alloc )
>>>  {
>>> -pte_maddr = 0;
>>>  if ( !dma_pte_present(*pte) )
>>> +{
>>> +pte_maddr = 0;
>>>  break;
>>> +}
>>>  
>>>  /*
>>>   * When the leaf entry was requested, pass back the full 
>>> PTE,
>>>   * with the address adjusted to account for the residual of
>>>   * the walk.
>>>   */
>>> -pte_maddr = pte->val +
>>> +pte_maddr +=
>>>  (addr & ((1UL << level_to_offset_bits(level)) - 1) &
>>>   PAGE_MASK);
>>
>> With this change you're now violating what the comment says (plus what
>> the comment ahead of the function says). And it says what it says for
>> a reason - see intel_iommu_lookup_page(), which I think your change is
>> breaking.
> 
> Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
> the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
> doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
> masking to remove the bits above PADDR_BITS.

Oh, right. But that's a missing dma_pte_addr() in intel_iommu_lookup_page()
then. (It would likely be better anyway to switch "uint64_t val" to
"struct dma_pte pte" there, to make more visible that it's a PTE we're
dealing with.) I indeed overlooked this aspect when doing the earlier
change.

Jan



Re: [XEN PATCH 2/2] x86/Dom0: Use streaming decompression for ZSTD compressed kernels

2023-05-10 Thread Jan Beulich
First of all - please don't drop Cc-s when replying. I'm restoring
xen-devel@ here at least.

On 10.05.2023 10:51, Rafaël Kooi wrote:
> On 10/05/2023 10:03, Jan Beulich wrote:> On 10.05.2023 02:18, Rafaël Kooi 
> wrote:
>>> On Arch Linux kernel decompression will fail when Xen has been unified
>>> with the kernel and initramfs as a single binary. This change works for
>>> both streaming and non-streaming ZSTD content.
>>
>> This could do with better explaining what "unified" means, and how
>> streaming decompression actually makes a difference.
>>
> 
> I don't mind explaining it further, but with the efi documentation for
> it existing on xenbits, should I just refer to that?

You may of course refer to existing documentation. Iirc that doesn't
cover any compression aspects, though.

>>> --- a/xen/common/decompress.c
>>> +++ b/xen/common/decompress.c
>>> @@ -3,11 +3,26 @@
>>>   #include 
>>>   #include 
>>>   
>>> +typedef struct _ZSTD_state
>>> +{
>>> +void *write_buf;
>>> +unsigned int write_pos;
>>> +} ZSTD_state;
>>> +
>>>   static void __init cf_check error(const char *msg)
>>>   {
>>>   printk("%s\n", msg);
>>>   }
>>>   
>>> +static int __init cf_check ZSTD_flush(void *buf, unsigned int pos,
>>> +  void *userptr)
>>> +{
>>> +ZSTD_state *state = (ZSTD_state*)userptr;
>>> +memcpy(state->write_buf + state->write_pos, buf, pos);
>>> +state->write_pos += pos;
>>> +return pos;
>>> +}
>>
>> This doesn't really belong here, but will (I expect) go away anyway once
>> you drop the earlier patch.
>>
> 
> The ZSTD_flush will have to stay, as that is how the decompressor will
> start streaming decompression. The difference will be that the book
> keeping will be "global" (to the translation unit).

But this bookkeeping should be entirely in zstd code (i.e. presumably
unzstd.c).

>>>   if ( len >= 4 && !memcmp(inbuf, "\x28\xb5\x2f\xfd", 4) )
>>> -   return unzstd(inbuf, len, NULL, NULL, outbuf, NULL, error);
>>> +{
>>> +// NOTE (Rafaël): On Arch Linux the kernel is compressed in a way
>>> +// that requires streaming ZSTD decompression. Otherwise 
>>> decompression
>>> +// will fail when using a unified EFI binary. Somehow decompression
>>> +// works when not using a unified EFI binary, I suspect this is the
>>> +// kernel self decompressing. Or there is a code path that I am not
>>> +// aware of that takes care of the use case properly.
>>
>> Along the lines of what I've said for the description, this wants to avoid
>> terms like "somehow" if at all possible.
> 
> I've used the term "somehow" because I don't know why decompression
> works when Xen loads the kernel from the EFI file system. I assume the
> kernel still gets unpacked by Xen, right? Or does the kernel unpack
> itself?

The handling of Dom0 kernel decompression ought to be entirely independent
of EFI vs legacy. Unless I'm wrong with that (mis-remembering), you
mentioning EFI is potentially misleading. And yes, at least on x86 the
kernel is decompressed by Xen (by peeking into the supplied bzImage). The
difference between a plain bzImage and a "unified EFI binary" is what you
will want to outline in the description (and at least mention in the
comment). What I'm wondering is whether there simply is an issue with size
determination when the kernel is taken from the .kernel section.

> When I present the v2 of this patch, do I add you as a reviewer? Or will
> that be done by the merger?

I'm afraid I don't understand the question. You will continue to Cc
respective maintainers, which will include me. In case you refer to a
Reviewed-by: tag - you can only add such tags once they were offered to
you by the respective person. For this specific one it doesn't mean "an
earlier version of this was looked at by " but "this is deemed
okay by ".

Jan



Re: [patch v3 08/36] x86/smpboot: Split up native_cpu_up() into separate phases and document them

2023-05-10 Thread Peter Zijlstra
On Tue, May 09, 2023 at 10:11:05PM +0200, Thomas Gleixner wrote:
> On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> > On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> > Not to the detriment of this patch, but this barrier() and it's comment
> > seem weird vs smp_callin(). That function ends with an atomic bitop (it
> > has to, at the very least it must not be weaker than store-release) but
> > also has an explicit wmb() to order setup vs CPU_STARTING.
> >
> > (arguably that should be a full fence *AND* get a comment)
> 
> TBH: I'm grasping for something 'arguable': What's the point of this
> wmb() or even a mb()?
> 
> Most of the [w]mb()'s in smpboot.c except those in mwait_play_dead()
> have a very distinct voodoo programming smell.

Oh fully agreed, esp. without a comment these things are hugely suspect.
I could not immediately see purpose either.

My arguably argument was about IF it was needed at all, then it would
make more sense to me to also constrain loads. But I'd be more than
happy to see the whole thing go. But perhaps not in this series?



Re: [PATCH] x86/iommu: fix wrong iterator type in arch_iommu_hwdom_init()

2023-05-10 Thread Roger Pau Monné
On Tue, May 09, 2023 at 06:25:05PM +0200, Jan Beulich wrote:
> On 09.05.2023 13:03, Roger Pau Monne wrote:
> > The 'i' iterator index stores a pdx, not a pfn, and hence the initial
> > assignation of start (which stores a pfn) needs a conversion from pfn
> > to pdx.
> 
> Strictly speaking: Yes. But pdx compression skips the bottom MAX_ORDER
> bits, so ...

Oh, that wasn't obvious to me.

> > --- a/xen/drivers/passthrough/x86/iommu.c
> > +++ b/xen/drivers/passthrough/x86/iommu.c
> > @@ -406,7 +406,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain 
> > *d)
> >   */
> >  start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> 
> ... with this, ...
> 
> > -for ( i = start, count = 0; i < top; )
> > +for ( i = pfn_to_pdx(start), count = 0; i < top; )
> 
> ... this is an expensive identity transformation. Could I talk you into
> adding
> 
> ASSERT(start == pfn_to_pdx(start));
> 
> instead (or the corresponding BUG_ON() if you'd prefer that, albeit then
> the expensive identity transformation will still be there even in release
> builds; not that it matters all that much right here, but still)?

So far the value of start is not influenced by hardware, so having an
assert should be fine.

Given that the assignation is just done once at the start of the loop
I don't see it being that relevant to the performance of this piece of
code TBH, the more that we do a pdx_to_pfn() for each loop
iteration, so my preference would be to use the proposed change.

> In any event, with no real bug fixed (unless I'm overlooking something),
> I would suggest to drop the Fixes: tag.

Right, I could drop that.

Thanks, Roger.



Re: [PATCH] iommu/vtd: fix address translation for superpages

2023-05-10 Thread Roger Pau Monné
On Tue, May 09, 2023 at 06:06:45PM +0200, Jan Beulich wrote:
> On 09.05.2023 12:41, Roger Pau Monne wrote:
> > When translating an address that falls inside of a superpage in the
> > IOMMU page tables the fetching of the PTE physical address field
> > wasn't using dma_pte_addr(), which caused the returned data to be
> > corrupt as it would contain bits not related to the address field.
> 
> I'm afraid I don't understand:
> 
> > --- a/xen/drivers/passthrough/vtd/iommu.c
> > +++ b/xen/drivers/passthrough/vtd/iommu.c
> > @@ -359,16 +359,18 @@ static uint64_t addr_to_dma_page_maddr(struct domain 
> > *domain, daddr_t addr,
> >  
> >  if ( !alloc )
> >  {
> > -pte_maddr = 0;
> >  if ( !dma_pte_present(*pte) )
> > +{
> > +pte_maddr = 0;
> >  break;
> > +}
> >  
> >  /*
> >   * When the leaf entry was requested, pass back the full 
> > PTE,
> >   * with the address adjusted to account for the residual of
> >   * the walk.
> >   */
> > -pte_maddr = pte->val +
> > +pte_maddr +=
> >  (addr & ((1UL << level_to_offset_bits(level)) - 1) &
> >   PAGE_MASK);
> 
> With this change you're now violating what the comment says (plus what
> the comment ahead of the function says). And it says what it says for
> a reason - see intel_iommu_lookup_page(), which I think your change is
> breaking.

Hm, but the code in intel_iommu_lookup_page() is now wrong as it takes
the bits in DMA_PTE_CONTIG_MASK as part of the physical address when
doing the conversion to mfn?  maddr_to_mfn() doesn't perform a any
masking to remove the bits above PADDR_BITS.

Thanks, Roger.



Re: [PATCH 3/3] x86: Use CpuidUserDis if an AMD HVM guest toggles CPUID faulting

2023-05-10 Thread Jan Beulich
On 09.05.2023 12:05, Andrew Cooper wrote:
> On 08/05/2023 2:18 pm, Jan Beulich wrote:
>> On 05.05.2023 19:57, Alejandro Vallejo wrote:
>>> This is in order to aid guests of AMD hardware that we have exposed
>>> CPUID faulting to. If they try to modify the Intel MSR that enables
>>> the feature, trigger levelling so AMD's version of it (CpuidUserDis)
>>> is used instead.
>>>
>>> Signed-off-by: Alejandro Vallejo 
>>> ---
>>>  xen/arch/x86/msr.c | 9 -
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>> Don't you also need to update cpu-policy.c:calculate_host_policy()
>> for the guest to actually know it can use the functionality? Which
>> in turn would appear to require some form of adjustment to
>> lib/x86/policy.c:x86_cpu_policies_are_compatible().
> 
> I asked Alejandro to do it like this.
> 
> Advertising this to guests requires plumbing another MSR into the
> infrastructure which isn't quite set up properly let, and is in flux
> from my work.

Maybe there was some misunderstanding here, as I realize only now. I
wasn't asking to expose the AMD feature, but instead I was after

/* 0x00ce  MSR_INTEL_PLATFORM_INFO */
/* probe_cpuid_faulting() sanity checks presence of MISC_FEATURES_ENABLES */
p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;

wanting to be extended by "|| boot_cpu_has(X86_FEATURE_CPUID_USER_DIS)".
That, afaict, has no connection to plumbing yet another MSR.

Jan



Re: [XEN PATCH 2/2] x86/Dom0: Use streaming decompression for ZSTD compressed kernels

2023-05-10 Thread Jan Beulich
On 10.05.2023 02:18, Rafaël Kooi wrote:
> On Arch Linux kernel decompression will fail when Xen has been unified
> with the kernel and initramfs as a single binary. This change works for
> both streaming and non-streaming ZSTD content.

This could do with better explaining what "unified" means, and how
streaming decompression actually makes a difference.

> --- a/xen/common/decompress.c
> +++ b/xen/common/decompress.c
> @@ -3,11 +3,26 @@
>  #include 
>  #include 
>  
> +typedef struct _ZSTD_state
> +{
> +void *write_buf;
> +unsigned int write_pos;
> +} ZSTD_state;
> +
>  static void __init cf_check error(const char *msg)
>  {
>  printk("%s\n", msg);
>  }
>  
> +static int __init cf_check ZSTD_flush(void *buf, unsigned int pos,
> +  void *userptr)
> +{
> +ZSTD_state *state = (ZSTD_state*)userptr;
> +memcpy(state->write_buf + state->write_pos, buf, pos);
> +state->write_pos += pos;
> +return pos;
> +}

This doesn't really belong here, but will (I expect) go away anyway once
you drop the earlier patch.

> @@ -17,22 +32,32 @@ int __init decompress(void *inbuf, unsigned int len, void 
> *outbuf)
>  #endif
>  
>  if ( len >= 3 && !memcmp(inbuf, "\x42\x5a\x68", 3) )
> -return bunzip2(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +return bunzip2(inbuf, len, NULL, NULL, outbuf, NULL, error, NULL);
>  
>  if ( len >= 6 && !memcmp(inbuf, "\3757zXZ", 6) )
> -return unxz(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +return unxz(inbuf, len, NULL, NULL, outbuf, NULL, error, NULL);
>  
>  if ( len >= 2 && !memcmp(inbuf, "\135\000", 2) )
> -return unlzma(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +return unlzma(inbuf, len, NULL, NULL, outbuf, NULL, error, NULL);
>  
>  if ( len >= 5 && !memcmp(inbuf, "\x89LZO", 5) )
> -return unlzo(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +return unlzo(inbuf, len, NULL, NULL, outbuf, NULL, error, NULL);
>  
>  if ( len >= 2 && !memcmp(inbuf, "\x02\x21", 2) )
> - return unlz4(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +return unlz4(inbuf, len, NULL, NULL, outbuf, NULL, error, NULL);

This also looks wrong here - if the earlier patch was to be kept, I expect
all these adjustments would have to move there. Otherwise build would be
broken when just the 1st patch is in place.

>  if ( len >= 4 && !memcmp(inbuf, "\x28\xb5\x2f\xfd", 4) )
> - return unzstd(inbuf, len, NULL, NULL, outbuf, NULL, error);
> +{
> +// NOTE (Rafaël): On Arch Linux the kernel is compressed in a way
> +// that requires streaming ZSTD decompression. Otherwise 
> decompression
> +// will fail when using a unified EFI binary. Somehow decompression
> +// works when not using a unified EFI binary, I suspect this is the
> +// kernel self decompressing. Or there is a code path that I am not
> +// aware of that takes care of the use case properly.

Along the lines of what I've said for the description, this wants to avoid
terms like "somehow" if at all possible.

We also don't normally put our names in such comments.

Finally please see ./CODING_STYLE for how we expect comments to be
formatted.

Jan



Re: [XEN PATCH 0/2] Use streaming decompression for ZSTD kernels

2023-05-10 Thread Jan Beulich
On 10.05.2023 02:18, Rafaël Kooi wrote:
> A problem I ran into was that adding book keeping to decompress.c would
> result in either a .data section being added or a .bss.* section. The
> linker would complain about this. And since I am not familiar with this
> code, and why it is this way, I opted to add a user-pointer to the
> internal decompression API.

At least gunzip uses global variables to track state as well. Without you
being explicit, I guess you mean "Error: size of : is ",
which is emitted by our build system, not the linker? That's easy to overcome
without touching all decompressors - use __initdata. See common/gunzip.c.

Jan



Re: [PATCH v2] piix: fix regression during unplug in Xen HVM domUs

2023-05-10 Thread Olaf Hering
Wed, 10 May 2023 00:58:27 +0200 Olaf Hering :

> In my debugging (with v8.0.0) it turned out the three pci_set_word
> causes the domU to hang. In fact, it is just the last one:
> 
>pci_set_byte(pci_conf + 0x20, 0x01);  /* BMIBA: 20-23h */
> 
> It changes the value from 0xc121 to 0x1.

If I disable just "pci_set_word(pci_conf + PCI_COMMAND, 0x);" it works as 
well.
It changes the value from 0x5 to 0.

In general I feel it is wrong to fiddle with PCI from the host side.
This is most likely not the intention of the Xen unplug protocol.
I'm sure the guest does not expect such changes under the hood.
It happens to work by luck with pvops kernels because their PCI discovery
is done after the unplug.

So, what do we do here to get this off the table?


Olaf


pgpXHFrKQIvDy.pgp
Description: Digitale Signatur von OpenPGP