On Wed, Nov 24, 2021 at 09:11:52PM +0000, Andrew Cooper wrote:
> OSSTest has identified a 3rd regression caused by this change. Migration
> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> fails with:
>
> xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr
> ffffffff (22 = Invalid argument): Internal error
> xc: error: Restore failed (22 = Invalid argument): Internal error
>
> which is a safety check to prevent resuming the guest when the CPUID data has
> been truncated. The problem is caused by shrinking of the max policies, which
> is an ABI that needs handling compatibly between different versions of Xen.
>
> Furthermore, shrinking of the default policies also breaks things in some
> cases, because certain cpuid= settings in a VM config file which used to have
> an effect will now be silently discarded.
I don't think they will be silently discarded. xc_cpuid_xend_policy
will attempt to get the CPUID leaf from the current/host/default
policies and fail because the leaf couldn't be found.
There's an issue with callers of xc_cpuid_apply_policy that pass a
featureset (which is not used by the upstream tools) as in that case
the featureset is translated to a CPUID policy without checking that
the set features are correctly exposed regarding the maximum leaves
set in CPUID, and thus features could be silently dropped, but as said
that's not used by the `cpuid=` config file option.
This possibly needs fixing anyway after the release, in order to
assert that the bits set in the featureset are correctly exposed given
the max leaves reported in the policy.
I think the above paragraph should be rewored as:
"Furthermore, shrinking of the default policies also breaks things in
some cases, because certain cpuid= settings in a VM config file which
used to work will now be refused. Also external toolstacks that
attempt to set the CPUID policy from a featureset might now see some
filled leaves not reachable due to the shrinking done to the default
domain policy before applying the featureset."
> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> one new case where cpuid= settings might not apply correctly) and restores the
> same behaviour as Xen 4.15.
>
> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to
> actual leaf contents")
> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max
> leaves")
> Signed-off-by: Andrew Cooper <[email protected]>
Acked-by: Roger Pau Monné <[email protected]>
Likely with the paragraph adjusted if agreed.
> ---
> CC: Jan Beulich <[email protected]>
> CC: Roger Pau Monné <[email protected]>
> CC: Wei Liu <[email protected]>
> CC: Ian Jackson <[email protected]>
>
> Passing CI runs:
> https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/415822110
> https://cirrus-ci.com/build/4915643318272000
>
> This supercedes
> https://lore.kernel.org/xen-devel/[email protected]/
> which is a step in the right direction (it fixes the OSSTest breakage), but
> incomplete (doesn't fix the potential cpuid= breakage).
>
> For 4.16. This is the 3rd breakage caused by 540d911c2813, and there is no
> reasonable workaround. Obviously, this revert isn't completely without risk,
> but going wholesale back to the 4.15 behaviour is by far the safest option at
> this point.
I agree. Also the shrinking done by 540d911c2813 is not fixing any
issue we know about, so it's safer to just keep the previous behavior
of likely reporting empty leaves rather than risk more regressions
caused by the change.
Thanks, Roger.