On 25.11.2021 11:15, Roger Pau Monné wrote:
> On Thu, Nov 25, 2021 at 10:52:12AM +0100, Jan Beulich wrote:
>> On 25.11.2021 10:24, Roger Pau Monné wrote:
>>> On Thu, Nov 25, 2021 at 09:57:31AM +0100, Jan Beulich wrote:
>>>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 
>>>>> 4133)
>>>>> fails with:
>>>>>
>>>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, 
>>>>> msr ffffffff (22 = Invalid argument): Internal error
>>>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>>>
>>>>> which is a safety check to prevent resuming the guest when the CPUID data 
>>>>> has
>>>>> been truncated.  The problem is caused by shrinking of the max policies, 
>>>>> which
>>>>> is an ABI that needs handling compatibly between different versions of 
>>>>> Xen.
>>>>
>>>> Would you mind pointing me to the flight which has hit this problem? I
>>>> don't think I've seen any respective failure. I also don't think I've
>>>> seen any respective discussion on irc, so I really wonder where all
>>>> this is coming from all of the sudden. It's not like the change in
>>>> question would have gone in just yesterday.
>>>
>>> It's from a pair of newish boxes that Credativ and Ian where
>>> attempting to commission yesterday. Since the boxes are not yet in
>>> production Ian wasn't sure if the issue could be on the testing or
>>> hardware side, so emailed the details in private for us to provide
>>> some feedback on the issue. The error is at:
>>>
>>> http://logs.test-lab.xenproject.org/osstest/logs/166296/test-amd64-amd64-migrupgrade/info.html
>>
>> I see. Quite lucky timing then.
> 
> Indeed, it was pure luck that we got this just yesterday.
> 
>>
>>>>> Furthermore, shrinking of the default policies also breaks things in some
>>>>> cases, because certain cpuid= settings in a VM config file which used to 
>>>>> have
>>>>> an effect will now be silently discarded.
>>>>
>>>> I'm afraid I don't see what you're talking about here. Could you give
>>>> an example? Is this about features the flags of which live in the
>>>> higher leaves, which would have got stripped from the default policies?
>>>> Which would mean the stripping really should happen on the max policies
>>>> only (where it may not have much of an effect).
>>>
>>> I think there are two separate issues, which I tried to clarify in my
>>> reply to this same patch.
>>>
>>> Options set using cpuid= with xl could now be rejected when in
>>> previous versions they were accepted just fine. That's because the
>>> shrinking to the policies can now cause find_leaf calls in
>>> xc_cpuid_xend_policy to fail, and thus the whole operation would
>>> fail.
>>
>> Okay, this could be addressed by merely dropping the calls from
>> calculate_{pv,hvm}_def_policy(). Thinking about it, I can surely
>> agree they shouldn't have been put there in the first place.
>> Which would be quite the opposite of your initial proposal, where
>> you did drop them from calculate_{pv,hvm}_max_policy(). A guest
>> migrating in with a larger max leaf value should merely have that
>> max leaf value retained, but that ought to be possible without
>> dropping the shrinking from calculate_{pv,hvm}_max_policy().
> 
> I won't argue it's not possible to do that without dropping the shrink
> from calculate_{pv,hvm}_max_policy(), but given the point we are on
> the release we should consider the safest option, and IMO that's the
> revert of the shrinking from there in order to restore the previous
> behavior and have working migrations from 4.15 -> 4.16.
> 
> We can discuss other likely better approaches to solve this issue
> after the release.

Sure; as said earlier on I merely would like to understand things
sufficiently well before giving my ack.

Jan

>> Even
>> leaving aside migration, I guess an explicit request for a large
>> max leaf value should be honored; those possibly many trailing
>> leaves then would simply all be blank.
>>
>>> There's another likely error that only affects callers of
>>> xc_cpuid_apply_policy that pass a featureset (so not the upstream
>>> toolstack), where some leaves of the featureset could now be ignored
>>> by the guest if the max leaves value doesn't cover them anymore. Note
>>> this was already an issue even before 540d911c2813, as applying the
>>> featureset doesn't check that the set feature leaves are below the max
>>> leaf.
>>
>> If this was an issue before the commit to be reverted, I take it
>> the revert isn't going to help it?
> 
> I think the commit makes it more likely to hit the above scenario by
> shrinking max leaves.
> 
>> In which case this information
>> is interesting, but not applicable as justification for the
>> revert?
> 
> As said above, while the commit at hand is not introducing the issue
> with the featuresets, it makes it more likely by shrinking the max
> leaves, and IMO it's a regression from behavior in 4.15.
> 
> Ie: options set on the featureset on 4.15 would be exposed, while the
> same options could be hidden in 4.16 because of the shrinking to the
> default domain policies, if the user happens to set an option that's
> on an empty trailing featureset with a tail of zeroed leaves.
> 
> Thanks, Roger.
> 


Reply via email to