Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-14 Thread Marek Marczykowski-Górecki
On Tue, Sep 14, 2021 at 10:39:10AM +0200, Jan Beulich wrote:
> On 14.09.2021 09:14, Juergen Gross wrote:
> > On 13.09.21 14:50, Marek Marczykowski-Górecki wrote:
> >> Hi,
> >>
> >> Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
> >> kernel version.
> >> Test environment:
> >>   - Xen 4.14.2
> >>   - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
> >>   - Linux 5.13.13, confirmed also on 5.14
> >>
> >> The crash happens only if the initramfs has earlycpio with microcode.
> >> I don't have a serial console, but I've got a photo with crash message
> >> (from Xen, Linux didn't manage to print anything):
> >> https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
> >>
> >> Transcription of some of it:
> >>
> >>  mapping kernel into physical memory
> >>  about to get started
> >>  (XEN) Pagetable walk from 82810888:
> >>  (XEN)  L4[0x1ff] = 000332815067 2815
> >>  (XEN)  L3[0x1fe] = 000332816067 2816
> >>  (XEN)  L2[0x014] = 000334018067 4018
> >>  (XEN)  L1[0x010] = 000332810067 2810
> >>  (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
> >>  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> >>  (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
> >>  (XEN) CPU:0
> >>  (XEN) RIP:e033:[<>]
> > 
> > The domain's run state seems to be completely clobbered.
> > 
> > Did you try to boot the kernel with "earlyprintk=xen" to get some idea
> > how far it progressed?
> 
> I guess without my "xen/x86: allow "earlyprintk=xen" to work for PV Dom0"
> "earlyprintk=xen" would need to be accompanied by "console=xenboot". I
> have not tried whether this helps, this is merely a guess from having
> looked at the code relatively recently.

This boot was with "earlyprintk=xen" already, but I didn't know
about "console=xenboot".
Anyway, it seems it isn't relevant anymore.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-14 Thread Juergen Gross

On 14.09.21 10:33, Mike Rapoport wrote:
> On Tue, Sep 14, 2021 at 09:14:38AM +0200, Juergen Gross wrote:
> > On 13.09.21 14:50, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > >
> > > Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
> > > kernel version.
> > > Test environment:
> > >   - Xen 4.14.2
> > >   - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
> > >   - Linux 5.13.13, confirmed also on 5.14
> > >
> > > The crash happens only if the initramfs has earlycpio with microcode.
> > > I don't have a serial console, but I've got a photo with crash message
> > > (from Xen, Linux didn't manage to print anything):
> > > https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
> > >
> > > Transcription of some of it:
> > >
> > >  mapping kernel into physical memory
> > >  about to get started
> > >  (XEN) Pagetable walk from 82810888:
> > >  (XEN)  L4[0x1ff] = 000332815067 2815
> > >  (XEN)  L3[0x1fe] = 000332816067 2816
> > >  (XEN)  L2[0x014] = 000334018067 4018
> > >  (XEN)  L1[0x010] = 000332810067 2810
> > >  (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
> > >  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> > >  (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
> > >  (XEN) CPU:0
> > >  (XEN) RIP:e033:[<>]
> >
> > The domain's run state seems to be completely clobbered.
> >
> > Did you try to boot the kernel with "earlyprintk=xen" to get some idea
> > how far it progressed?
> >
> > I could imagine that doing the early reservations after the call of
> > e820__memory_setup() is problematic, as Xen PV guests have a hook in
> > this function performing some rather extended actions.
>
> Right, among them it may relocate initrd:
>
> https://elixir.bootlin.com/linux/latest/source/arch/x86/xen/setup.c#L872
>
> and this may cause the reported crash.
>
> > I'm not sure the call of early_reserve_memory() can be moved just before
> > the e820__memory_setup() call. If this is possible it should be done
> > IMO, if not then the reservations which have been at the start of
> > setup_arch() might need to go there again.
>
> early_reserve_memory() can be moved to the beginning of setup_arch().

IMO this should be the preferred fix. I will write a patch to do that.

> Another possibility is to move initrd relocation out of xen_setup_memory()
> and maybe even integrate it somehow in reserve_initrd().

This would be rather complicated, as xen_setup_memory() is changing the
memory map from one large memory chunk to match the memory map of the
host in case the system is running as dom0 (the need to do so has
historical reasons, changing that is not an option). The initrd needs to
be moved in case it is using memory which conflicts with the new
layout.


Juergen




Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-14 Thread Jan Beulich
On 14.09.2021 09:14, Juergen Gross wrote:
> On 13.09.21 14:50, Marek Marczykowski-Górecki wrote:
>> Hi,
>>
>> Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
>> kernel version.
>> Test environment:
>>   - Xen 4.14.2
>>   - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
>>   - Linux 5.13.13, confirmed also on 5.14
>>
>> The crash happens only if the initramfs has earlycpio with microcode.
>> I don't have a serial console, but I've got a photo with crash message
>> (from Xen, Linux didn't manage to print anything):
>> https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
>>
>> Transcription of some of it:
>>
>>  mapping kernel into physical memory
>>  about to get started
>>  (XEN) Pagetable walk from 82810888:
>>  (XEN)  L4[0x1ff] = 000332815067 2815
>>  (XEN)  L3[0x1fe] = 000332816067 2816
>>  (XEN)  L2[0x014] = 000334018067 4018
>>  (XEN)  L1[0x010] = 000332810067 2810
>>  (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
>>  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>  (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
>>  (XEN) CPU:0
>>  (XEN) RIP:e033:[<>]
> 
> The domain's run state seems to be completely clobbered.
> 
> Did you try to boot the kernel with "earlyprintk=xen" to get some idea
> how far it progressed?

I guess without my "xen/x86: allow "earlyprintk=xen" to work for PV Dom0"
"earlyprintk=xen" would need to be accompanied by "console=xenboot". I
have not tried whether this helps, this is merely a guess from having
looked at the code relatively recently.

Jan
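
[Editorial note] The "Pagetable walk" lines quoted above can be cross-checked by
splitting the faulting address into its four page-table indices. The sketch
below is only an illustration, not Xen code. It assumes that the archive
dropped the leading ffffffff from the address (so 82810888 is read as
0xffffffff82810888); the helper names pt_indices and pte_flags are made up
for this example. The flag names follow the standard x86-64 page-table entry
layout.

```python
def pt_indices(vaddr):
    """Split an x86-64 virtual address into 4-level page-table indices."""
    return {
        "L4": (vaddr >> 39) & 0x1FF,   # bits 47..39
        "L3": (vaddr >> 30) & 0x1FF,   # bits 38..30
        "L2": (vaddr >> 21) & 0x1FF,   # bits 29..21
        "L1": (vaddr >> 12) & 0x1FF,   # bits 20..12
        "offset": vaddr & 0xFFF,       # byte offset inside the 4 KiB page
    }


def pte_flags(entry):
    """Name the low flag bits that are set in a page-table entry."""
    names = {0: "present", 1: "writable", 2: "user", 5: "accessed", 6: "dirty"}
    return [name for bit, name in names.items() if entry & (1 << bit)]


# Assumption: the transcript's "82810888" is 0xffffffff82810888 with the
# sign-extension bits lost somewhere between the photo and the archive.
vaddr = 0xFFFFFFFF82810888

for level, index in pt_indices(vaddr).items():
    print(f"{level}: {index:#05x}")

# L1 entry taken from the transcript.
print(pte_flags(0x000332810067))
```

Under that assumption the computed indices are 0x1ff, 0x1fe, 0x014 and 0x010,
which agree with the L4/L3/L2/L1 slots printed by Xen, and the 0x067 low bits
of each entry decode as present, writable, user, accessed and dirty.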




Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-14 Thread Mike Rapoport
On Tue, Sep 14, 2021 at 09:14:38AM +0200, Juergen Gross wrote:
> On 13.09.21 14:50, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
> > kernel version.
> > Test environment:
> >   - Xen 4.14.2
> >   - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
> >   - Linux 5.13.13, confirmed also on 5.14
> > 
> > The crash happens only if the initramfs has earlycpio with microcode.
> > I don't have a serial console, but I've got a photo with crash message
> > (from Xen, Linux didn't manage to print anything):
> > https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
> > 
> > Transcription of some of it:
> > 
> >  mapping kernel into physical memory
> >  about to get started
> >  (XEN) Pagetable walk from 82810888:
> >  (XEN)  L4[0x1ff] = 000332815067 2815
> >  (XEN)  L3[0x1fe] = 000332816067 2816
> >  (XEN)  L2[0x014] = 000334018067 4018
> >  (XEN)  L1[0x010] = 000332810067 2810
> >  (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
> >  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> >  (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
> >  (XEN) CPU:0
> >  (XEN) RIP:e033:[<>]
> 
> The domain's run state seems to be completely clobbered.
> 
> Did you try to boot the kernel with "earlyprintk=xen" to get some idea
> how far it progressed?
> 
> I could imagine that doing the early reservations after the call of
> e820__memory_setup() is problematic, as Xen PV guests have a hook in
> this function performing some rather extended actions.

Right, among them it may relocate initrd:

https://elixir.bootlin.com/linux/latest/source/arch/x86/xen/setup.c#L872
 
and this may cause the reported crash.

> I'm not sure the call of early_reserve_memory() can be moved just before
> the e820__memory_setup() call. If this is possible it should be done
> IMO, if not then the reservations which have been at the start of
> setup_arch() might need to go there again.

early_reserve_memory() can be moved to the beginning of setup_arch().

Another possibility is to move initrd relocation out of xen_setup_memory()
and maybe even integrate it somehow in reserve_initrd().

-- 
Sincerely yours,
Mike.
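
[Editorial note] Why the ordering of the reservations matters can be shown with
a toy model. This is not kernel code: ToyMemblock is a made-up stand-in for
memblock, the memory size and initrd placement are invented numbers, and the
single allocation merely stands in for whatever the Xen hook called from
e820__memory_setup() may allocate or relocate. The only point is that a
top-down allocator can hand out memory that overlaps the initrd if the initrd
has not been reserved yet.

```python
class ToyMemblock:
    """Tiny stand-in for memblock: tracks reserved [start, end) ranges."""

    def __init__(self, size):
        self.size = size
        self.reserved = []

    def reserve(self, start, end):
        self.reserved.append((start, end))

    def alloc_top_down(self, length):
        """Return the highest free range of the given length and reserve it."""
        end = self.size
        while end - length >= 0:
            start = end - length
            clash = [r for r in self.reserved if r[0] < end and start < r[1]]
            if not clash:
                self.reserve(start, end)
                return (start, end)
            # Retry just below the lowest range we collided with.
            end = min(r[0] for r in clash)
        raise MemoryError("no free range large enough")


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def boot(reserve_first):
    """Return True if the hook's allocation ends up overlapping the initrd."""
    mem = ToyMemblock(size=100)
    initrd = (90, 100)                   # hypothetical placement
    if reserve_first:
        mem.reserve(*initrd)             # models early reservation first
    hook_alloc = mem.alloc_top_down(8)   # models the hook's allocation
    if not reserve_first:
        mem.reserve(*initrd)             # models reservation done too late
    return overlaps(hook_alloc, initrd)


print("reserve first, then hook :", boot(reserve_first=True))
print("hook first, then reserve :", boot(reserve_first=False))
```

In this model the overlap only occurs when the hook runs before the
reservation, which is the situation the proposal to move
early_reserve_memory() to the beginning of setup_arch() is meant to avoid.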



Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-14 Thread Juergen Gross

On 13.09.21 14:50, Marek Marczykowski-Górecki wrote:
> Hi,
>
> Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
> kernel version.
> Test environment:
>   - Xen 4.14.2
>   - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
>   - Linux 5.13.13, confirmed also on 5.14
>
> The crash happens only if the initramfs has earlycpio with microcode.
> I don't have a serial console, but I've got a photo with crash message
> (from Xen, Linux didn't manage to print anything):
> https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
>
> Transcription of some of it:
>
>  mapping kernel into physical memory
>  about to get started
>  (XEN) Pagetable walk from 82810888:
>  (XEN)  L4[0x1ff] = 000332815067 2815
>  (XEN)  L3[0x1fe] = 000332816067 2816
>  (XEN)  L2[0x014] = 000334018067 4018
>  (XEN)  L1[0x010] = 000332810067 2810
>  (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
>  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>  (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
>  (XEN) CPU:0
>  (XEN) RIP:e033:[<>]

The domain's run state seems to be completely clobbered.

Did you try to boot the kernel with "earlyprintk=xen" to get some idea
how far it progressed?

I could imagine that doing the early reservations after the call of
e820__memory_setup() is problematic, as Xen PV guests have a hook in
this function performing some rather extended actions.

I'm not sure the call of early_reserve_memory() can be moved just before
the e820__memory_setup() call. If this is possible it should be done
IMO, if not then the reservations which have been at the start of
setup_arch() might need to go there again.


Juergen

> I've bisected it down to the commit a799c2bd29d19c565f37fa038b31a0a1d44d0e4d
>
>  x86/setup: Consolidate early memory reservations
>
>  The early reservations of memory areas used by the firmware, bootloader,
>  kernel text and data are spread over setup_arch(). Moreover, some of them
>  happen *after* memblock allocations, e.g trim_platform_memory_ranges() and
>  trim_low_memory_range() are called after reserve_real_mode() that allocates
>  memory.
>
>  There was no corruption of these memory regions because memblock always
>  allocates memory either from the end of memory (in top-down mode) or above
>  the kernel image (in bottom-up mode). However, the bottom up mode is going
>  to be updated to span the entire memory [1] to avoid limitations caused by
>  KASLR.
>
>  Consolidate early memory reservations in a dedicated function to improve
>  robustness against future changes. Having the early reservations in one
>  place also makes it clearer what memory must be reserved before memblock
>  allocations are allowed.
>
>  Signed-off-by: Mike Rapoport
>  Signed-off-by: Borislav Petkov
>  Reviewed-by: Baoquan He
>  Acked-by: Borislav Petkov
>  Acked-by: David Hildenbrand
>  Link: [1] https://lore.kernel.org/lkml/20201217201214.3414100-2-g...@fb.com
>  Link: https://lkml.kernel.org/r/20210302100406.22059-2-r...@kernel.org
>
> Since this seems to affect Xen boot only, I'm copying xen-devel too.
>
> Any ideas?







Re: Linux 5.13+ as Xen dom0 crashes on Ryzen CPU (ucode loading related?)

2021-09-13 Thread Mike Rapoport
Hi Marek,

On Mon, Sep 13, 2021 at 02:50:00PM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> Since 5.13, the Xen (PV) dom0 crashes on boot, before even printing the
> kernel version.
> Test environment:
>  - Xen 4.14.2
>  - AMD Ryzen 5 4500U (reported also on AMD Ryzen 7 4750U)
>  - Linux 5.13.13, confirmed also on 5.14
> 
> The crash happens only if the initramfs has earlycpio with microcode.

Does the crash happen if you boot the same kernel and initrd directly
without Xen?

> I don't have a serial console, but I've got a photo with crash message
> (from Xen, Linux didn't manage to print anything):
> https://user-images.githubusercontent.com/726704/133084966-5038f37e-001b-4688-9f90-83d09be3dc2d.jpg
> 
> Transcription of some of it:
> 
> mapping kernel into physical memory
> about to get started
> (XEN) Pagetable walk from 82810888:
> (XEN)  L4[0x1ff] = 000332815067 2815
> (XEN)  L3[0x1fe] = 000332816067 2816
> (XEN)  L2[0x014] = 000334018067 4018
> (XEN)  L1[0x010] = 000332810067 2810
> (XEN) domain_crash_sync called from entry.S: fault at 82d04033e790 x86_64/entry.S#domain_crash_page_fault
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) [ Xen-4.14.2  x86_64  debug=n  Not tainted ]
> (XEN) CPU:0
> (XEN) RIP:e033:[<>]

Is it possible to get the actual RIP of the instruction that faulted? 
Feeding that to scripts/faddr2line would be just lovely.
 
> I've bisected it down to the commit a799c2bd29d19c565f37fa038b31a0a1d44d0e4d
> 
> x86/setup: Consolidate early memory reservations
> 
> Since this seems to affect Xen boot only, I'm copying xen-devel too.
> 
> Any ideas?

The only thing I can suggest for now is to move the reservations from
early_reserve_memory() back to where they were before this commit one by
one to see which move caused the crash.

-- 
Sincerely yours,
Mike.
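
[Editorial note] The "one by one" procedure suggested above can be written down
as a small loop. This is only a sketch of the bookkeeping: the reservation
names and the boots_ok callback are hypothetical placeholders (in practice each
step means rebuilding the kernel with one reservation moved back to its old
place and booting it under Xen), and the outcome shown in the usage lines is
invented purely to exercise the function.

```python
def find_culprits(reservations, boots_ok):
    """Move each reservation back on its own and record which moves fix boot.

    reservations: names of the reservations gathered in the consolidated
                  function (placeholders here).
    boots_ok:     callable taking the set of reservations moved back to their
                  old place and returning True if that build boots.
    """
    culprits = []
    for name in reservations:
        if boots_ok({name}):
            culprits.append(name)
    return culprits


# Hypothetical names and a made-up result, for demonstration only.
candidates = ["kernel_image", "initrd", "bios_regions"]
pretend_boot = lambda moved_back: "initrd" in moved_back
print(find_culprits(candidates, pretend_boot))
```

Testing each reservation in isolation keeps every trial build one change away
from the failing kernel, so a successful boot points at exactly one move.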