Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-12 Thread Thomas Gleixner
On Fri, 12 Jan 2018, Greg Kroah-Hartman wrote:
> On Fri, Jan 12, 2018 at 12:03:10AM +0100, Thomas Gleixner wrote:
> > So the transition to long mode for secondaries uses the trampoline pgd and
> > then jumps to secondary_startup_64, where CR3 is set to the real kernel
> > page tables.
> 
> Ok, so the summary is that this patch is only needed for the 4.4 and 4.9
> kernels, and _NOT_ for Linus's tree and 4.14, right?

Correct.


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Greg Kroah-Hartman
On Fri, Jan 12, 2018 at 12:03:10AM +0100, Thomas Gleixner wrote:
> On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> > On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> > > On Thu, 11 Jan 2018, Linus Torvalds wrote:
> > > 
> > > > > On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> > > > >
> > > > > 67a9108ed431 ("x86/efi: Build our own page table structures")
> > > > >
> > > > > got rid of EFI depending on real_mode_header->trampoline_pgd
> > > > 
> > > > > So I think it only got rid of it by default - the codepath is still
> > > > there, the allocation is still there, it's just that it's not actually
> > > > used unless somebody does that "efi=old_mmap" thing.
> > > 
> > > Yes, the trampoline_pgd is still around, but I can't figure out how it
> > > would be used after boot. Confused, digging more.
> > 
> > So coming back to the same commit. From the changelog:
> > 
> > This is caused by mapping EFI regions with RWX permissions.
> > There isn't much we can do to restrict the permissions for these
> > regions due to the way the firmware toolchains mix code and
> > data, but we can at least isolate these mappings so that they do
> > not appear in the regular kernel page tables.
> > 
> > In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
> > mapping") we started using 'trampoline_pgd' to map the EFI
> > regions because there was an existing identity mapping there
> > which we use during the SetVirtualAddressMap() call and for
> > broken firmware that accesses those addresses.
> > 
> > So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
> > efi_pgd, which we made use the proper size.
> > 
> > trampoline_pgd is since then only used to get into long mode in
> > realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
> > 
> > The runtime services stuff does not use it in kernel versions >= 4.6
> 
> But there is one very well hidden user for it after boot:
> 
> It's used for booting secondary CPUs from real mode
> 
> So the transition to long mode for secondaries uses the trampoline pgd and
> then jumps to secondary_startup_64, where CR3 is set to the real kernel
> page tables.

Ok, so the summary is that this patch is only needed for the 4.4 and 4.9
kernels, and _NOT_ for Linus's tree and 4.14, right?

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Guenter Roeck
On Thu, Jan 11, 2018 at 11:47:23PM +0100, Thomas Gleixner wrote:
> On Thu, 11 Jan 2018, Steven Sistare wrote:
> > On 1/11/2018 5:30 PM, Thomas Gleixner wrote:
> > > On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> > >> On Thu, 11 Jan 2018, Linus Torvalds wrote:
> > >>
> > >>> On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> > >>>>
> > >>>> 67a9108ed431 ("x86/efi: Build our own page table structures")
> > >>>>
> > >>>> got rid of EFI depending on real_mode_header->trampoline_pgd
> > >>>
> > >>> So I think it only got rid of it by default - the codepath is still
> > >>> there, the allocation is still there, it's just that it's not actually
> > >>> used unless somebody does that "efi=old_mmap" thing.
> > >>
> > >> Yes, the trampoline_pgd is still around, but I can't figure out how it
> > >> would be used after boot. Confused, digging more.
> > > 
> > > So coming back to the same commit. From the changelog:
> > > 
> > > This is caused by mapping EFI regions with RWX permissions.
> > > There isn't much we can do to restrict the permissions for these
> > > regions due to the way the firmware toolchains mix code and
> > > data, but we can at least isolate these mappings so that they do
> > > not appear in the regular kernel page tables.
> > > 
> > > In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
> > > mapping") we started using 'trampoline_pgd' to map the EFI
> > > regions because there was an existing identity mapping there
> > > which we use during the SetVirtualAddressMap() call and for
> > > broken firmware that accesses those addresses.
> > > 
> > > > So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
> > > > efi_pgd, which we made use the proper size.
> > > 
> > > trampoline_pgd is since then only used to get into long mode in
> > > realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
> > > 
> > > The runtime services stuff does not use it in kernel versions >= 4.6
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > 
> > Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are
> > independent of it.  When EFI_OLD_MEMMAP is enabled, the efi pgd is not
> > used, and the bug will not bite.
> 
> We have a fix queued in tip/x86/pti which addresses a missing NX clear, but
> that's a different story.
> 
Since you are talking about NX, I see this in last night's -next:

kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at fe007000
IP: 0xfe006e9d
PGD ffd6067 P4D ffd6067 PUD ffd5067 PMD ff73067 PTE 8fc09063
Oops: 0011 [#1] PREEMPT SMP PTI
Modules linked in:
CPU: 0 PID: 1 Comm: init Tainted: GW 4.15.0-rc7-next-20180111-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:0xfe006e9d
RSP: 0018:aee28000ffd0 EFLAGS: 0006
RAX: 000c RBX: 00400040 RCX: 7f2c4186ad6a
RDX:  RSI:  RDI: b6a0
RBP: 0008 R08: 037f R09: 0064
R10: 078bfbfd R11: 0246 R12: 7f2c41856a60
R13:  R14: 00402368 R15: 1000
FS:  () GS:95fecfc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: fe007000 CR3: 0d88a000 CR4: 003406f0
Call Trace:
Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <90> 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 90 90 90 
RIP: 0xfe006e9d RSP: aee28000ffd0
CR2: fe007000
---[ end trace a82b8742114c1785 ]---

Is this the issue you are talking about, or is the fix triggering the
crash?

Guenter
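
For readers decoding the oops above: the "Oops: 0011" field is the x86
page-fault error code printed in hex. Bit 0 means the page was present
(a protection violation rather than a missing page) and bit 4 means the
access was an instruction fetch, which matches the "tried to execute
NX-protected page" message. The following is a minimal sketch of that
decoding, not kernel code; the sk_* names are hypothetical and chosen
only for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* x86 page-fault error code bits (names are local to this sketch). */
enum {
    SK_PF_PROT  = 1 << 0,  /* 1: protection violation, 0: page not present */
    SK_PF_WRITE = 1 << 1,  /* 1: write access, 0: read access */
    SK_PF_USER  = 1 << 2,  /* 1: fault taken in user mode */
    SK_PF_RSVD  = 1 << 3,  /* 1: reserved bit set in a paging entry */
    SK_PF_INSTR = 1 << 4   /* 1: fault on an instruction fetch */
};

/* True for the pattern in the report: kernel-mode fetch from a present page. */
static bool sk_is_kernel_nx_fetch(unsigned long ec)
{
    return (ec & SK_PF_PROT) && (ec & SK_PF_INSTR) && !(ec & SK_PF_USER);
}

/* Write a one-line human-readable summary of the error code into buf. */
static void sk_describe_fault(unsigned long ec, char *buf, size_t len)
{
    const char *cause  = (ec & SK_PF_PROT) ? "protection violation"
                                           : "not-present page";
    const char *access = (ec & SK_PF_INSTR) ? "instruction fetch"
                       : (ec & SK_PF_WRITE) ? "write" : "read";
    const char *mode   = (ec & SK_PF_USER) ? "user" : "kernel";

    snprintf(buf, len, "%s on %s in %s mode", cause, access, mode);
}
```

Passing 0x11 to sk_describe_fault() yields "protection violation on
instruction fetch in kernel mode". The sketch only interprets the error
code; it says nothing about which mapping was wrongly marked NX.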


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> > On Thu, 11 Jan 2018, Linus Torvalds wrote:
> > 
> > > > On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> > > >
> > > > 67a9108ed431 ("x86/efi: Build our own page table structures")
> > > >
> > > > got rid of EFI depending on real_mode_header->trampoline_pgd
> > > 
> > > > So I think it only got rid of it by default - the codepath is still
> > > there, the allocation is still there, it's just that it's not actually
> > > used unless somebody does that "efi=old_mmap" thing.
> > 
> > Yes, the trampoline_pgd is still around, but I can't figure out how it
> > would be used after boot. Confused, digging more.
> 
> So coming back to the same commit. From the changelog:
> 
> This is caused by mapping EFI regions with RWX permissions.
> There isn't much we can do to restrict the permissions for these
> regions due to the way the firmware toolchains mix code and
> data, but we can at least isolate these mappings so that they do
> not appear in the regular kernel page tables.
> 
> In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
> mapping") we started using 'trampoline_pgd' to map the EFI
> regions because there was an existing identity mapping there
> which we use during the SetVirtualAddressMap() call and for
> broken firmware that accesses those addresses.
> 
> So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
> efi_pgd, which we made use the proper size.
> 
> trampoline_pgd is since then only used to get into long mode in
> realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
> 
> The runtime services stuff does not use it in kernel versions >= 4.6

But there is one very well hidden user for it after boot:

It's used for booting secondary CPUs from real mode.

So the transition to long mode for secondaries uses the trampoline pgd and
then jumps to secondary_startup_64, where CR3 is set to the real kernel
page tables.

Thanks,

tglx





Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Linus Torvalds
On Thu, Jan 11, 2018 at 2:42 PM, Steven Sistare wrote:
>
> Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are
> independent of it.  When EFI_OLD_MEMMAP is enabled, the efi pgd is not
> used, and the bug will not bite.

Ok, good. Thanks for checking.

   Linus


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Steven Sistare wrote:
> On 1/11/2018 5:30 PM, Thomas Gleixner wrote:
> > On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> >> On Thu, 11 Jan 2018, Linus Torvalds wrote:
> >>
> >>> On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> >>>>
> >>>> 67a9108ed431 ("x86/efi: Build our own page table structures")
> >>>>
> >>>> got rid of EFI depending on real_mode_header->trampoline_pgd
> >>>
> >>> So I think it only got rid of it by default - the codepath is still
> >>> there, the allocation is still there, it's just that it's not actually
> >>> used unless somebody does that "efi=old_mmap" thing.
> >>
> >> Yes, the trampoline_pgd is still around, but I can't figure out how it
> >> would be used after boot. Confused, digging more.
> > 
> > So coming back to the same commit. From the changelog:
> > 
> > This is caused by mapping EFI regions with RWX permissions.
> > There isn't much we can do to restrict the permissions for these
> > regions due to the way the firmware toolchains mix code and
> > data, but we can at least isolate these mappings so that they do
> > not appear in the regular kernel page tables.
> > 
> > In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
> > mapping") we started using 'trampoline_pgd' to map the EFI
> > regions because there was an existing identity mapping there
> > which we use during the SetVirtualAddressMap() call and for
> > broken firmware that accesses those addresses.
> > 
> > So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
> > efi_pgd, which we made use the proper size.
> > 
> > trampoline_pgd is since then only used to get into long mode in
> > realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
> > 
> > The runtime services stuff does not use it in kernel versions >= 4.6
> > 
> > Thanks,
> > 
> > tglx
> 
> Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are
> independent of it.  When EFI_OLD_MEMMAP is enabled, the efi pgd is not
> used, and the bug will not bite.

We have a fix queued in tip/x86/pti which addresses a missing NX clear, but
that's a different story.

Thanks,

tglx


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Steven Sistare
On 1/11/2018 5:30 PM, Thomas Gleixner wrote:
> On Thu, 11 Jan 2018, Thomas Gleixner wrote:
>> On Thu, 11 Jan 2018, Linus Torvalds wrote:
>>
>>> On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
>>>>
>>>> 67a9108ed431 ("x86/efi: Build our own page table structures")
>>>>
>>>> got rid of EFI depending on real_mode_header->trampoline_pgd
>>>
>>> So I think it only got rid of it by default - the codepath is still
>>> there, the allocation is still there, it's just that it's not actually
>>> used unless somebody does that "efi=old_mmap" thing.
>>
>> Yes, the trampoline_pgd is still around, but I can't figure out how it
>> would be used after boot. Confused, digging more.
> 
> So coming back to the same commit. From the changelog:
> 
> This is caused by mapping EFI regions with RWX permissions.
> There isn't much we can do to restrict the permissions for these
> regions due to the way the firmware toolchains mix code and
> data, but we can at least isolate these mappings so that they do
> not appear in the regular kernel page tables.
> 
> In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
> mapping") we started using 'trampoline_pgd' to map the EFI
> regions because there was an existing identity mapping there
> which we use during the SetVirtualAddressMap() call and for
> broken firmware that accesses those addresses.
> 
> So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
> efi_pgd, which we made use the proper size.
> 
> trampoline_pgd is since then only used to get into long mode in
> realmode/rm/trampoline_64.S and for reboot in machine_real_restart().
> 
> The runtime services stuff does not use it in kernel versions >= 4.6
> 
> Thanks,
> 
>   tglx

Yes, and addressing Linus' concern about EFI_OLD_MEMMAP, those paths are
independent of it.  When EFI_OLD_MEMMAP is enabled, the efi pgd is not
used, and the bug will not bite.

- Steve


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Thomas Gleixner wrote:
> On Thu, 11 Jan 2018, Linus Torvalds wrote:
> 
> > On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> > >
> > > 67a9108ed431 ("x86/efi: Build our own page table structures")
> > >
> > > got rid of EFI depending on real_mode_header->trampoline_pgd
> > 
> > So I think it only got rid of it by default - the codepath is still
> > there, the allocation is still there, it's just that it's not actually
> > used unless somebody does that "efi=old_mmap" thing.
> 
> Yes, the trampoline_pgd is still around, but I can't figure out how it
> would be used after boot. Confused, digging more.

So coming back to the same commit. From the changelog:

This is caused by mapping EFI regions with RWX permissions.
There isn't much we can do to restrict the permissions for these
regions due to the way the firmware toolchains mix code and
data, but we can at least isolate these mappings so that they do
not appear in the regular kernel page tables.

In commit d2f7cbe7b26a ("x86/efi: Runtime services virtual
mapping") we started using 'trampoline_pgd' to map the EFI
regions because there was an existing identity mapping there
which we use during the SetVirtualAddressMap() call and for
broken firmware that accesses those addresses.

So this very commit gets rid of the (ab)use of trampoline_pgd and allocates
efi_pgd, which we made use the proper size.

trampoline_pgd is since then only used to get into long mode in
realmode/rm/trampoline_64.S and for reboot in machine_real_restart().

The runtime services stuff does not use it in kernel versions >= 4.6

Thanks,

tglx





Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Steven Sistare wrote:
> On 1/11/2018 3:46 PM, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner wrote:
> >>
> >> 67a9108ed431 ("x86/efi: Build our own page table structures")
> >>
> >> got rid of EFI depending on real_mode_header->trampoline_pgd
> > 
> > So I think it only got rid of it by default - the codepath is still
> > there, the allocation is still there, it's just that it's not actually
> > used unless somebody does that "efi=old_mmap" thing.
> > 
> > Looking around, there's at least one quirk for the SGI UV1 system that
> > enables EFI_OLD_MMAP automatically. There might be others that I
> > missed, but I think that's it.
> > 
> > So it *can* trigger without "efi=old_mmap", but not on any normal machines.
> > 
> > And as Pavel points out, even when the bug is active, it's pretty hard
> > to actually trigger.
> > 
> > But yeah, there may be other EFI patches that I didn't notice that
> > changed things in other ways too.
> > 
> > >        Linus
> 
> The bug is not present in the latest upstream kernel because the efi_pgd is
> correctly aligned:
> 
>   arch/x86/platform/efi/efi_64.c
>     int __init efi_alloc_page_tables(void)
>       efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);

Yes, I came to exactly the same conclusion, but I didn't want to call Linus
a moron before I triple checked that trampoline_pgd is still there, but
only ever used to get out of the realmode swamp at boot.

Thanks,

tglx
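
The alignment point above can be illustrated with a small sketch. It
assumes, as I understand the page table isolation layout (the thread
itself does not spell this out), that the kernel and user PGDs are the
two adjacent pages of one order-1 allocation, so that switching between
them amounts to flipping bit 12 of the PGD address. Under that
assumption, a PGD that is a single page, or that starts on an odd page,
resolves to the wrong page when the bit is cleared. All sketch_* names
are hypothetical; only PGD_ALLOCATION_ORDER and efi_pgd come from the
thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Assumed layout: the user PGD is the page directly after the kernel
 * PGD, so the two addresses differ only in bit 12.
 */
static uint64_t sketch_kernel_pgd(uint64_t pgd)
{
    return pgd & ~SKETCH_PAGE_SIZE;   /* clear bit 12 */
}

static uint64_t sketch_user_pgd(uint64_t pgd)
{
    return pgd | SKETCH_PAGE_SIZE;    /* set bit 12 */
}

/*
 * A PGD allocated with the given page order tolerates the bit-12 switch
 * only if it spans at least two pages and is aligned to its own size.
 */
static bool sketch_pgd_switch_safe(uint64_t pgd, unsigned int order)
{
    uint64_t size = SKETCH_PAGE_SIZE << order;

    return order >= 1 && (pgd & (size - 1)) == 0;
}
```

For example, an order-0 PGD at address 0x5000 fails the check, and
sketch_kernel_pgd(0x5000) returns 0x4000, a page that does not belong
to it. An order-1 allocation at 0x4000 passes, with 0x5000 as its
second page.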


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Steven Sistare
On 1/11/2018 3:46 PM, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner  wrote:
>>
>> 67a9108ed431 ("x86/efi: Build our own page table structures")
>>
>> got rid of EFI depending on real_mode_header->trampoline_pgd
> 
> So I think it only got rid of by default - the codepath is still
> there, the allocation is still there, it's just that it's not actually
> used unless somebody does that "efi=old_mmap" thing.
> 
> Looking around, there's at least one quirk for the SGI UV1 system that
> enables EFI_OLD_MMAP automatically. There might be others that I
> missed, but I think that's it.
> 
> So it *can* trigger without "efi=old_mmap", but not on any normal machines.
> 
> And as Pavel points out, even when the bug is active, it's pretty hard
> to actually trigger.
> 
> But yeah, there may be other EFI patches that I didn't notice that
> changed things in other ways too.
> 
>Linus

The bug is not present in the latest upstream kernel because the efi_pgd is
correctly aligned:

  arch/x86/platform/efi/efi_64.c
int __init efi_alloc_page_tables(void)
  efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);

  arch/x86/include/asm/pgalloc.h
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif

Pavel's patch fixes kernels prior to
  67a9108ed431 ("x86/efi: Build our own page table structures")

where the efi pgd allocation looks like:

  arch/x86/realmode/init.c
void __init reserve_real_mode(void)
   mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
   base = __va(mem);
   real_mode_header = (struct real_mode_header *) base;

  void __init setup_real_mode(void)
trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);

Kernel versions between 67a9108ed431 and the latest also have the bug and
need a similar fix:

  arch/x86/platform/efi/efi_64.c

int __init efi_alloc_page_tables(void)
  efi_pgd = (pgd_t *)__get_free_page(gfp_mask);

int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
  pgd = efi_pgd;
  efi_scratch.efi_pgt = (pgd_t *)__pa(efi_pgd);

All of the code paths above are taken when *not* EFI_OLD_MMAP.

- Steve


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Linus Torvalds wrote:

> On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner  wrote:
> >
> > 67a9108ed431 ("x86/efi: Build our own page table structures")
> >
> > got rid of EFI depending on real_mode_header->trampoline_pgd
> 
> So I think it only got rid of by default - the codepath is still
> there, the allocation is still there, it's just that it's not actually
> used unless somebody does that "efi=old_mmap" thing.

Yes, the trampoline_pgd is still around, but I can't figure out how it
would be used after boot. Confused, digging more.

Thanks,

tglx



Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Linus Torvalds
On Thu, Jan 11, 2018 at 12:37 PM, Thomas Gleixner  wrote:
>
> 67a9108ed431 ("x86/efi: Build our own page table structures")
>
> got rid of EFI depending on real_mode_header->trampoline_pgd

So I think it only got rid of by default - the codepath is still
there, the allocation is still there, it's just that it's not actually
used unless somebody does that "efi=old_mmap" thing.

Looking around, there's at least one quirk for the SGI UV1 system that
enables EFI_OLD_MMAP automatically. There might be others that I
missed, but I think that's it.

So it *can* trigger without "efi=old_mmap", but not on any normal machines.

And as Pavel points out, even when the bug is active, it's pretty hard
to actually trigger.

But yeah, there may be other EFI patches that I didn't notice that
changed things in other ways too.

   Linus


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Thomas Gleixner
On Thu, 11 Jan 2018, Linus Torvalds wrote:

> [ Patch to make sure the EFI trampoline_pgd is properly aligned and
> has the double pgd that KPTI requires ]
> 
> On Thu, Jan 11, 2018 at 10:40 AM, Pavel Tatashin
>  wrote:
> > If it is better to resubmit this patch via git send-email, please let me 
> > know.
> 
> It would be better, because that way the patch can be more easily
> quoted and discussed.
> 
> That said, I do not see why this isn't an issue upstream too.
> 
> As far as I can tell, it's not just 4.4.110. Our current entry code
> does that ADJUST_KERNEL_CR3 dance too, which clears the
> PTI_SWITCH_MASK bit from cr3.
> 
> And that realmode trampoline pgd seems all to be just aligned to PAGE_SIZE.

Right, but see below.

> Now, in the modern world, we generate new page tables for EFI, but we
> still have that EFI_OLD_MEMMAP code that disables that. And afaik,
> EFI_OLD_MEMMAP has the exact same problem that your patch fixes in 4.4
> (where it's always on).
> 
> So I think this patch should go into the development kernel too.
> 
> Or maybe it already is, and I just haven't gotten it yet.

It's not. There is an efi oldmap fix pending, but that's a different story.

> Or - even more likely - I'm missing something entirely, and even
> EFI_OLD_MEMMAP solved this some other way upstream.

67a9108ed431 ("x86/efi: Build our own page table structures")

got rid of EFI depending on real_mode_header->trampoline_pgd

So I don't see how upstream needs the fix as the trampoline_pgd seems only
to be used when coming out of the boot loader.

Adding Matt. He stepped back from EFI, but he might still know.

> Adding Thomas Gleixner explicitly to the participants so that he can
> tell me I'm a moron and point me to the right thing.

Your wish is my command, but I need to stare some more before doing so.

Thanks,

tglx


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Pavel Tatashin



On 01/11/2018 03:10 PM, Greg Kroah-Hartman wrote:
> On Thu, Jan 11, 2018 at 01:36:50PM -0500, Pavel Tatashin wrote:
> > I have root caused the memory corruption panics/hangs that I've been
> > experiencing during boot with the latest 4.4.110 kernel. The problem
> > as was suspected by Andy Lutomirski is with interaction between PTI
> > and EFI. It may affect any system that has EFI bios.  I have not
> > verified if it can affect any other kernel beside 4.4.110
> > 
> > Attached is the fix for this issue with explanations that Steve
> > Sistare and I developed.
> 
> Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as
> well on this hardware?  Nor on the SLES12 SP3 kernel?
> 
> What is different there that 4.4 requires?  That worries me more than
> your fix (which looks good to me, fwiw.)


Hi Greg,

I have not studied other kernel versions; EFI has changed
substantially since 4.4. But even on 4.4.110, several things
have to happen for this bug to show up:


1. During boot, memblock must allocate an address that is not
2 * PAGE_SIZE aligned.

2. An NMI must arrive exactly while EFI has replaced the page table.

While I was debugging this problem, I tried enabling KASAN and vm_debug,
adding more printfs, etc., but every little change would cause the problem
to disappear or appear less frequently.


Thank you,
Pavel


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Linus Torvalds
On Thu, Jan 11, 2018 at 12:10 PM, Greg Kroah-Hartman
 wrote:
>
> Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as
> well on this hardware?  Nor on the SLES12 SP3 kernel?
>
> What is different there that 4.4 requires?  That worries me more than
> your fix (which looks good to me, fwiw.)

I really think it's simply that since v4.6, we've had commit
67a9108ed431 ("x86/efi: Build our own page table structures"), so no
normal EFI use actually uses the old legacy mapping unless you passed
in "efi=old_map" on the kernel command line.

So the bug is there in all versions, it's just that it's normally only
noticeable in 4.4.

But I might be missing some other difference, so take that with a pinch of salt.

 Linus


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Greg Kroah-Hartman
On Thu, Jan 11, 2018 at 01:36:50PM -0500, Pavel Tatashin wrote:
> I have root caused the memory corruption panics/hangs that I've been
> experiencing during boot with the latest 4.4.110 kernel. The problem
> as was suspected by Andy Lutomirski is with interaction between PTI
> and EFI. It may affect any system that has EFI bios.  I have not
> verified if it can affect any other kernel beside 4.4.110
> 
> Attached is the fix for this issue with explanations that Steve
> Sistare and I developed.

Nice, but why does this not show up in 4.9 and 4.14 and Linus's tree as
well on this hardware?  Nor on the SLES12 SP3 kernel?

What is different there that 4.4 requires?  That worries me more than
your fix (which looks good to me, fwiw.)

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Linus Torvalds
[ Patch to make sure the EFI trampoline_pgd is properly aligned and
has the double pgd that KPTI requires ]

On Thu, Jan 11, 2018 at 10:40 AM, Pavel Tatashin
 wrote:
> If it is better to resubmit this patch via git send-email, please let me know.

It would be better, because that way the patch can be more easily
quoted and discussed.

That said, I do not see why this isn't an issue upstream too.

As far as I can tell, it's not just 4.4.110. Our current entry code
does that ADJUST_KERNEL_CR3 dance too, which clears the
PTI_SWITCH_MASK bit from cr3.

And that realmode trampoline pgd seems all to be just aligned to PAGE_SIZE.

Now, in the modern world, we generate new page tables for EFI, but we
still have that EFI_OLD_MEMMAP code that disables that. And afaik,
EFI_OLD_MEMMAP has the exact same problem that your patch fixes in 4.4
(where it's always on).

So I think this patch should go into the development kernel too.

Or maybe it already is, and I just haven't gotten it yet.

Or - even more likely - I'm missing something entirely, and even
EFI_OLD_MEMMAP solved this some other way upstream.

Adding Thomas Gleixner explicitly to the participants so that he can
tell me I'm a moron and point me to the right thing.

   Linus


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Pavel Tatashin
If it is better to resubmit this patch via git send-email, please let me know.

Thank you,
Pavel

On Thu, Jan 11, 2018 at 1:36 PM, Pavel Tatashin
 wrote:
> I have root caused the memory corruption panics/hangs that I've been
> experiencing during boot with the latest 4.4.110 kernel. The problem
> as was suspected by Andy Lutomirski is with interaction between PTI
> and EFI. It may affect any system that has EFI bios.  I have not
> verified if it can affect any other kernel beside 4.4.110
>
> Attached is the fix for this issue with explanations that Steve
> Sistare and I developed.


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-11 Thread Pavel Tatashin
I have root caused the memory corruption panics/hangs that I've been
experiencing during boot with the latest 4.4.110 kernel. The problem
as was suspected by Andy Lutomirski is with interaction between PTI
and EFI. It may affect any system that has EFI bios.  I have not
verified if it can affect any other kernel beside 4.4.110

Attached is the fix for this issue with explanations that Steve
Sistare and I developed.
From 1189f3568a90ddd40e1418b9687def5d89153ee3 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin 
Date: Thu, 11 Jan 2018 06:50:25 -0800
Subject: [PATCH] x86/pti/efi: broken conversion from efi to kernel page table

In entry_64.S we have code like this:

	/* Unconditionally use kernel CR3 for do_nmi() */
	/* %rax is saved above, so OK to clobber here */
	ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER
	/* If PCID enabled, NOFLUSH now and NOFLUSH on return */
	ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID
	pushq	%rax
	/* mask off "user" bit of pgd address and 12 PCID bits: */
	andq	$(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
	movq	%rax, %cr3
2:

	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
	call	do_nmi

With this instruction:
	andq	$(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax

we unconditionally switch from whatever our CR3 was to the kernel page table.
But in arch/x86/platform/efi/efi_64.c we temporarily set a different page
table, one that does not have the kernel page table at a 0x1000 offset from it.

Look in efi_thunk() and efi_thunk_set_virtual_address_map().

So if an NMI arrives while CR3 points to that other page table, the entry
code clears 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was
set.

The efi page table comes from realmode/rm/trampoline_64.S:

arch/x86/realmode/rm/trampoline_64.S

141 .bss
142 .balign PAGE_SIZE
143 GLOBAL(trampoline_pgd) .space PAGE_SIZE

Notice: the alignment is PAGE_SIZE, so after masking off
KAISER_SHADOW_PGD_OFFSET, which is equal to PAGE_SIZE, we can end up with a
different page table address.

But even if we fix the alignment here, the trampoline binary is later copied
into dynamically allocated memory in reserve_real_mode(), so we need to
fix that place as well.

Fixes: 8a43ddfb93a0 ("KAISER: Kernel Address Isolation")

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
---
 arch/x86/include/asm/kaiser.h        | 8 ++++++++
 arch/x86/realmode/init.c             | 4 +++-
 arch/x86/realmode/rm/trampoline_64.S | 3 ++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 802bbbdfe143..e087bd7a8d29 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -19,6 +19,12 @@
 
 #define KAISER_SHADOW_PGD_OFFSET 0x1000
 
+/*
+ *  A page table address must have this alignment to stay the same when
+ *  KAISER_SHADOW_PGD_OFFSET mask is applied
+ */
+#define KAISER_KERNEL_PGD_ALIGNMENT (KAISER_SHADOW_PGD_OFFSET << 1)
+
 #ifdef __ASSEMBLY__
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 
@@ -71,6 +77,8 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax
 
 #else /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#define KAISER_KERNEL_PGD_ALIGNMENT PAGE_SIZE
+
 .macro SWITCH_KERNEL_CR3
 .endm
 .macro SWITCH_USER_CR3
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 0b7a63d98440..cfecb7d6c6a8 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -1,5 +1,6 @@
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -15,7 +16,8 @@ void __init reserve_real_mode(void)
 	size_t size = PAGE_ALIGN(real_mode_blob_end - real_mode_blob);
 
 	/* Has to be under 1M so we can execute real-mode AP code. */
-	mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
+	mem = memblock_find_in_range(0, 1 << 20, size,
+				     KAISER_KERNEL_PGD_ALIGNMENT);
 	if (!mem)
 		panic("Cannot allocate trampoline\n");
 
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index dac7b20d2f9d..781cca63f795 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "realmode.h"
 
 	.text
@@ -139,7 +140,7 @@ tr_gdt:
 tr_gdt_end:
 
 	.bss
-	.balign	PAGE_SIZE
+	.balign	KAISER_KERNEL_PGD_ALIGNMENT
 GLOBAL(trampoline_pgd)		.space	PAGE_SIZE
 
 	.balign	8
-- 
1.8.3.1



Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-10 Thread Serge E. Hallyn
Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Tue, Jan 09, 2018 at 01:49:48PM -0600, Serge E. Hallyn wrote:
> > Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> > > On Sat, Jan 06, 2018 at 02:20:16AM +0900, Alice Ferrazzi wrote:
> > > > On Thu, Jan 4, 2018 at 5:11 AM, Greg Kroah-Hartman
> > > >  wrote:
> > > > > This is the start of the stable review cycle for the 4.4.110 release.
> > > > > There are 37 patches in this series, all will be posted as a response
> > > > > to this one.  If anyone has any issues with these being applied, 
> > > > > please
> > > > > let me know.
> > > > >
> > > > > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > > > > Anything received after that time might be too late.
> > > > >
> > > > > The whole patch series can be found in one patch at:
> > > > > 
> > > > > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > > > > or in the git tree and branch at:
> > > > >   
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > > > >  linux-4.4.y
> > > > > and the diffstat can be found below.
> > > > >
> > > > 
> > > > This patchset merges correctly with Gentoo patches and GCC version 6.4.0
> > > > The kernel boot up correctly.
> > > > Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44
> > > 
> > > Great, but Gentoo really should be moving to 4.9 and 4.14 here, I
> > > hope no one running Gentoo is relying on 4.4 :)
> > 
> > Wait what?
> > 
> > According to https://www.kernel.org/category/releases.html
> > 4.4 should be the best bet for longest support, right?  Does
> > that page need to be updated?  If 4.4 is not going to be
> > supported, is there anything else with a possible 5-6 years
> > of support?
> 
> 4.4 is going to be supported, yes, but really, for a desktop/server
> system, why would you ever want to stick with it for anything longer
> than a year?  No new hardware support is added, and no new features that
> you would want are in there.
> 
> The LTS kernels are for the crazy embedded people that don't change
> their hardware systems, and have the insane huge number of out-of-tree
> patches.  No one else should be using those kernels, they should always
> be using newer ones, as there are always more issues fixed in newer
> kernels than older ones.
> 
> So again, I hope no one running Gentoo, which is a rolling, constantly
> updated distro, is using the old and crusty 4.4 kernel release.  To do
> so is to defeat the purpose of relying on Gentoo in the first place...

Ah, I see, yeah that makes sense :)

thanks,
-serge


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-10 Thread Greg Kroah-Hartman
On Tue, Jan 09, 2018 at 01:49:48PM -0600, Serge E. Hallyn wrote:
> Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> > On Sat, Jan 06, 2018 at 02:20:16AM +0900, Alice Ferrazzi wrote:
> > > On Thu, Jan 4, 2018 at 5:11 AM, Greg Kroah-Hartman
> > >  wrote:
> > > > This is the start of the stable review cycle for the 4.4.110 release.
> > > > There are 37 patches in this series, all will be posted as a response
> > > > to this one.  If anyone has any issues with these being applied, please
> > > > let me know.
> > > >
> > > > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > > > Anything received after that time might be too late.
> > > >
> > > > The whole patch series can be found in one patch at:
> > > > 
> > > > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > > > or in the git tree and branch at:
> > > >   
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > > >  linux-4.4.y
> > > > and the diffstat can be found below.
> > > >
> > > 
> > > This patchset merges correctly with Gentoo patches and GCC version 6.4.0
> > > The kernel boot up correctly.
> > > Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44
> > 
> > Great, but Gentoo really should be moving to 4.9 and 4.14 here, I
> > hope no one running Gentoo is relying on 4.4 :)
> 
> Wait what?
> 
> According to https://www.kernel.org/category/releases.html
> 4.4 should be the best bet for longest support, right?  Does
> that page need to be updated?  If 4.4 is not going to be
> supported, is there anything else with a possible 5-6 years
> of support?

4.4 is going to be supported, yes, but really, for a desktop/server
system, why would you ever want to stick with it for anything longer
than a year?  No new hardware support is added, and no new features that
you would want are in there.

The LTS kernels are for the crazy embedded people that don't change
their hardware systems, and have the insane huge number of out-of-tree
patches.  No one else should be using those kernels, they should always
be using newer ones, as there are always more issues fixed in newer
kernels than older ones.

So again, I hope no one running Gentoo, which is a rolling, constantly
updated distro, is using the old and crusty 4.4 kernel release.  To do
so is to defeat the purpose of relying on Gentoo in the first place...

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-09 Thread Serge E. Hallyn
Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Sat, Jan 06, 2018 at 02:20:16AM +0900, Alice Ferrazzi wrote:
> > On Thu, Jan 4, 2018 at 5:11 AM, Greg Kroah-Hartman
> >  wrote:
> > > This is the start of the stable review cycle for the 4.4.110 release.
> > > There are 37 patches in this series, all will be posted as a response
> > > to this one.  If anyone has any issues with these being applied, please
> > > let me know.
> > >
> > > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > > Anything received after that time might be too late.
> > >
> > > The whole patch series can be found in one patch at:
> > > 
> > > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > > or in the git tree and branch at:
> > >   
> > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > > linux-4.4.y
> > > and the diffstat can be found below.
> > >
> > 
> > This patchset merges correctly with Gentoo patches and GCC version 6.4.0
> > The kernel boot up correctly.
> > Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44
> 
> Great, but Gentoo really should be moving to 4.9 and 4.14 here, I
> hope no one running Gentoo is relying on 4.4 :)

Wait what?

According to https://www.kernel.org/category/releases.html
4.4 should be the best bet for longest support, right?  Does
that page need to be updated?  If 4.4 is not going to be
supported, is there anything else with a possible 5-6 years
of support?


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-08 Thread Pavel Tatashin
Here is one more:

[6.284763] EFI Variables Facility v0.08 2004-May-17
[6.555990] [ cut here ]
[6.561145] kernel BUG at
/scratch/ptatashi/linux-stable/mm/slub.c:3627!
[6.568625] invalid opcode:  [#1] SMP
[6.573219] Modules linked in:
[6.576639] CPU: 1 PID: 364 Comm: kworker/1:1 Not tainted
4.4.110_pt_stable #3
[6.584692] Hardware name: Oracle Corporation ORACLE SERVER
X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[6.595766] Workqueue: events clocksource_watchdog_work
[6.601611] task: 881fecd82b00 ti: 881fecda4000 task.ti:
881fecda4000
[6.609963] RIP: 0010:[]  []
kfree+0x14a/0x150
[6.618419] RSP: :881fecda7d40  EFLAGS: 00010246
[6.624348] RAX: 8106c280 RBX: 883ff114bfc0 RCX:
ffd8
[6.632314] RDX: 77ff8000 RSI: 0246 RDI:
883ff114bfc0
[6.640280] RBP: 881fecda7d58 R08:  R09:
881fff917300
[6.648244] R10:  R11: ea00ffc452c0 R12:
883fec2f4080
[6.656208] R13: 810a5bee R14:  R15:

[6.664175] FS:  () GS:881fff84()
knlGS:
[6.673208] CS:  0010 DS:  ES:  CR0: 80050033
[6.679623] CR2:  CR3: 01aa2000 CR4:
00360670
[6.687587] DR0:  DR1:  DR2:

[6.695553] DR3:  DR6: fffe0ff0 DR7:
0400
[6.703516] Stack:
[6.705759]  883ff114bfc0 883fec2f4080 819a26e8
881fecda7e00
[6.714061]  810a5bee 881f0020 881fecda7e10
881fecda7da8
[6.722363]   881f 881fecda7d90
881fecda7d90
[6.730666] Call Trace:
[6.733400]  []
kthread_create_on_node+0x14e/0x1a0
[6.740495]  []
clocksource_watchdog_work+0x25/0x40
[6.747679]  [] process_one_work+0x14f/0x400
[6.754181]  [] worker_thread+0x114/0x480
[6.760402]  [] ? rescuer_thread+0x310/0x310
[6.766913]  [] kthread+0xe5/0x100
[6.772456]  [] ? kthread_park+0x60/0x60
[6.778580]  [] ret_from_fork+0x3f/0x70
[6.784608]  [] ? kthread_park+0x60/0x60
[6.790721] Code: 8b 03 31 f6 f6 c4 40 74 04 41 8b 73 6c 4c 89 df
e8 1c a8 fa ff e9 73 ff ff ff 4c 8d 58 ff e9 20 ff ff ff 49 8b 43 20
a8 01 75 d4 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56
41 55
[6.812429] RIP  [] kfree+0x14a/0x150
[6.818273]  RSP 
[6.822177] ---[ end trace 4ce44d21c6d68eed ]---

On Mon, Jan 8, 2018 at 3:38 PM, Pavel Tatashin
 wrote:
> Hi Greg,
>
>
>
> On Mon, Jan 8, 2018 at 2:46 AM, Greg Kroah-Hartman
>  wrote:
>> On Sun, Jan 07, 2018 at 10:06:59AM -0500, Pavel Tatashin wrote:
>>> Hi Greg,
>>>
>>> I reverted suse12 back to:
>>> 13dae54cb229d078635f159dd8afe16ae683980b
>>> x86/kaiser: Move feature detection up (bsc#1068032).
>>>
>>> And, still do not see the problem. So, whatever fixes the issue comes
>>> before kaiser.
>>
>> Ok, thanks for the hint.
>>
>> As I can't duplicate this here at all, any specifics as to what
>> hardware/processor type this is?
>>
>
> BIOS:
> Version 2.17.1249. Copyright (C) 2016 American Megatrends, Inc.
> BIOS Date: 08/30/2016 10:35:36 Ver: 38050100
>
> ca-ostest442:linux-stable$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                40
> On-line CPU(s) list:   0-39
> Thread(s) per core:    2
> Core(s) per socket:    10
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Stepping:              1
> CPU MHz:               1738.601
> BogoMIPS:              4396.18
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0-9,20-29
> NUMA node1 CPU(s):     10-19,30-39
>
> Note, if I boot with nr_cpus=1, hang never happens, with nr_cpus=4
> happens but seldomly, and with all 40 CPUs happens on almost every
> reboot.
>
> As Hugh Dickins suggested, I am going to show panic outputs, as I get
> them. Here is one more panic (note output is not complete because
> machine reboots):
>
> [6.276456] EFI Variables Facility v0.08 2004-May-17
> [6.384665] BUG: unable to handle kernel paging request at
> 901fff5a6000
> [6.392461] IP: [] vmalloc_fault+0x1f8/0x340
> [6.398987] PGD 0
> [6.401242] Oops:  [#1] SMP
> [6.404866] Modules linked in:
> [6.408287] CPU: 10 PID: 0 Comm: swapper/10 Not tainted
> 4.4.110_pt_stable #2
> [6.416156] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 3
> 8050100 08/30/2016
> [6.427226] task: 883ff1e28000 ti: 883ff1e24000 task.ti:
> 883ff1e24000
> [6.435580] 

Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-08 Thread Pavel Tatashin
Hi Greg,



On Mon, Jan 8, 2018 at 2:46 AM, Greg Kroah-Hartman
 wrote:
> On Sun, Jan 07, 2018 at 10:06:59AM -0500, Pavel Tatashin wrote:
>> Hi Greg,
>>
>> I reverted suse12 back to:
>> 13dae54cb229d078635f159dd8afe16ae683980b
>> x86/kaiser: Move feature detection up (bsc#1068032).
>>
>> And, still do not see the problem. So, whatever fixes the issue comes
>> before kaiser.
>
> Ok, thanks for the hint.
>
> As I can't duplicate this here at all, any specifics as to what
> hardware/processor type this is?
>

BIOS:
Version 2.17.1249. Copyright (C) 2016 American Megatrends, Inc.
BIOS Date: 08/30/2016 10:35:36 Ver: 38050100

ca-ostest442:linux-stable$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1738.601
BogoMIPS:              4396.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

Note: if I boot with nr_cpus=1, the hang never happens; with nr_cpus=4 it
happens, but seldom; and with all 40 CPUs it happens on almost every
reboot.

As Hugh Dickins suggested, I am going to show panic outputs, as I get
them. Here is one more panic (note output is not complete because
machine reboots):

[6.276456] EFI Variables Facility v0.08 2004-May-17
[6.384665] BUG: unable to handle kernel paging request at
901fff5a6000
[6.392461] IP: [] vmalloc_fault+0x1f8/0x340
[6.398987] PGD 0
[6.401242] Oops:  [#1] SMP
[6.404866] Modules linked in:
[6.408287] CPU: 10 PID: 0 Comm: swapper/10 Not tainted
4.4.110_pt_stable #2
[6.416156] Hardware name: Oracle Corporation ORACLE SERVER
X6-2/ASM,MOTHERBOARD,1U, BIOS 3
8050100 08/30/2016
[6.427226] task: 883ff1e28000 ti: 883ff1e24000 task.ti:
883ff1e24000
[6.435580] RIP: 0010:[]  []
vmalloc_fault+0x1f8/0x340
[6.444819] RSP: :883ff1e27cc0  EFLAGS: 00010086
[6.450749] RAX: 881fff5a6058 RBX: 3000 RCX:
081fff5a6000
[6.458714] RDX: 8800 RSI: 901fff5a6000 RDI:

[6.466681] RBP: 883ff1e27cf0 R08: 0018 R09:
0002d2de
[6.474647] R10: 00032ef3 R11: 2e04 R12:
c9f0
[6.482615] R13: 8800 R14: 901fff5a6000 R15:
881fff5a6000
[6.490574] FS:  () GS:88407e60()
knlGS:
[6.499607] CS:  0010 DS:  ES:  CR0: 80050033
[6.506022] CR2: 901fff5a6000 CR3: 01aa2000 CR4:
00360670
[6.513989] DR0:  DR1:  DR2:

[6.521956] DR3:  DR6: fffe0ff0 DR7:
0400
[6.529923] Stack:
[6.532169]  881fff5a6000[6.532405] [ cut here
]
[6.532414] WARNING: CPU: 22 PID: 162


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-08 Thread Guillaume Tucker

On 05/01/18 00:06, Kevin Hilman wrote:
> kernelci.org bot writes:
>
>> stable-rc/linux-4.4.y boot: 100 boots: 4 failed, 93 passed with 1 offline, 2 conflicts (v4.4.109-38-g99abd6cdd65e)
>>
>> Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.4.y/kernel/v4.4.109-38-g99abd6cdd65e/
>> Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-4.4.y/kernel/v4.4.109-38-g99abd6cdd65e/
>>
>> Tree: stable-rc
>> Branch: linux-4.4.y
>> Git Describe: v4.4.109-38-g99abd6cdd65e
>> Git Commit: 99abd6cdd65e984d89c8565508a7a96ea0fce179
>> Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
>> Tested: 53 unique boards, 19 SoC families, 16 builds out of 178
>
> TL;DR;  All is well.
>
>> Boot Regressions Detected:
>>
>> arm:
>>
>>     exynos_defconfig:
>>         exynos5422-odroidxu3:
>>             lab-collabora: failing since 58 days (last pass: v4.4.95-21-g32458fcb7bd6 - first fail: v4.4.96-41-g336421367b9c)
>
> Long standing issue in lab-collabora (passing in other labs)  Guillaume?

This should be fixed now, with a tweak to the device config to
enable relocating the ramdisk and dtb:

https://review.linaro.org/#/c/23238/

>>     multi_v7_defconfig:
>>         armada-xp-linksys-mamba:
>>             lab-free-electrons: new failure (last pass: v4.4.109-36-g8b381424010c)
>
> Not a kernel issue, bootROM fails to start bootloader.  I pinged lab
> owners (Free Electrons)
>
>>         tegra124-nyan-big:
>>             lab-collabora: failing since 1 day (last pass: v4.4.109 - first fail: v4.4.109-36-g8b381424010c)
>>
>>     tegra_defconfig:
>>         tegra124-nyan-big:
>>             lab-collabora: failing since 1 day (last pass: v4.4.108-65-g57856049c0f8 - first fail: v4.4.109)
>
> This one is booting fine, but the command to power-off the board is
> timing out, resulting in a failure report.

Indeed, this was due to a crash of the lavapdu daemon - it's back
on track now.

(On a side note, the tegra124-nyan-big is still failing to boot
in mainline due to a genuine kernel driver issue.)

Guillaume


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-07 Thread Greg Kroah-Hartman
On Sun, Jan 07, 2018 at 10:06:59AM -0500, Pavel Tatashin wrote:
> Hi Greg,
> 
> I reverted suse12 back to:
> 13dae54cb229d078635f159dd8afe16ae683980b
> x86/kaiser: Move feature detection up (bsc#1068032).
> 
> And, still do not see the problem. So, whatever fixes the issue comes
> before kaiser.

Ok, thanks for the hint.

As I can't duplicate this here at all, any specifics as to what
hardware/processor type this is?

I can punt and say just "use 4.9 on this hardware if you have it",
right?  :)

I'll try to dig through the sles kernel some more, but given it is 2
patches, and I can't actually test the problem myself, it's not exactly
easy going...

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-07 Thread Pavel Tatashin
Hi Greg,

I reverted suse12 back to:
13dae54cb229d078635f159dd8afe16ae683980b
x86/kaiser: Move feature detection up (bsc#1068032).

And, still do not see the problem. So, whatever fixes the issue comes
before kaiser.

Pavel

On Sun, Jan 7, 2018 at 9:17 AM, Pavel Tatashin
 wrote:
> Hi Greg,
>
> I cloned and built suse12, and it does not have issues with EFI + PTI
> (kaiser) on my machine.
>
> BTW, I have also reproduced this problem on another machine with the
> same configuration, therefore, it is not specific only to one box.
> Also, as I mentioned earlier I am seeing the same issue with 4.1 +
> kaiser patches taken from 4.4.110.
>
> Thank you,
> Pavel
>
> On Sun, Jan 7, 2018 at 5:45 AM, Greg Kroah-Hartman
>  wrote:
>> On Fri, Jan 05, 2018 at 04:03:54PM -0500, Pavel Tatashin wrote:
>>> The hardware works :) I meant that before the patch linked in
>>> https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
>>> with that patch applied, I was able to boot it at least once, but it could
>>> be accidental. The hang/panic does not happen at the same time on every
>>> boot.
>>
>> Any chance you can grab the latest SLES 12 kernel and run it with pti
>> and efi enabled to see if that works properly for you or not?  I trust
>> SUSE's testing of their kernel, and odds are I'm just missing one of
>> their many other patches they have in their tree for other issues that
>> they have seen in the past.
>>
>> If you want, I can just send you the full patch that they run on top of
>> the latest 4.4 stable tree, so you don't have to dig it out of their git
>> repo if you can't find the binary image.
>>
>> thanks,
>>
>> greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-07 Thread Pavel Tatashin
Hi Greg,

I cloned and built suse12, and it does not have issues with EFI + PTI
(kaiser) on my machine.

BTW, I have also reproduced this problem on another machine with the
same configuration, therefore, it is not specific only to one box.
Also, as I mentioned earlier I am seeing the same issue with 4.1 +
kaiser patches taken from 4.4.110.

Thank you,
Pavel

On Sun, Jan 7, 2018 at 5:45 AM, Greg Kroah-Hartman
 wrote:
> On Fri, Jan 05, 2018 at 04:03:54PM -0500, Pavel Tatashin wrote:
>> The hardware works :) I meant that before the patch linked in
>> https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
>> with that patch applied, I was able to boot it at least once, but it could
>> be accidental. The hang/panic does not happen at the same time on every
>> boot.
>
> Any chance you can grab the latest SLES 12 kernel and run it with pti
> and efi enabled to see if that works properly for you or not?  I trust
> SUSE's testing of their kernel, and odds are I'm just missing one of
> their many other patches they have in their tree for other issues that
> they have seen in the past.
>
> If you want, I can just send you the full patch that they run on top of
> the latest 4.4 stable tree, so you don't have to dig it out of their git
> repo if you can't find the binary image.
>
> thanks,
>
> greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-07 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 04:03:54PM -0500, Pavel Tatashin wrote:
> The hardware works :) I meant that before the patch linked in
> https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
> with that patch applied, I was able to boot it at least once, but it could
> be accidental. The hang/panic does not happen at the same time on every
> boot.

Any chance you can grab the latest SLES 12 kernel and run it with pti
and efi enabled to see if that works properly for you or not?  I trust
SUSE's testing of their kernel, and odds are I'm just missing one of
their many other patches they have in their tree for other issues that
they have seen in the past.

If you want, I can just send you the full patch that they run on top of
the latest 4.4 stable tree, so you don't have to dig it out of their git
repo if you can't find the binary image.

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Mike Galbraith
On Fri, 2018-01-05 at 15:28 -0800, Hugh Dickins wrote:
> On Fri, Jan 5, 2018 at 6:03 AM, Mike Galbraith  wrote:
> > On Fri, 2018-01-05 at 14:34 +0100, Greg Kroah-Hartman wrote:
> >>
> >> Ok, we found two patches that were missing in 4.4-stable that were in
> >> the SLES12 tree (thanks to Jamie Iles), now I only have 19k more to sift
> >> through :)
> >
> > As you know, in enterprise, uname -r means you might find something
> > this old in your kernel if you look hard enough :)
> 
> Mike, I think there's a good chance that Greg's 4.4.110 final will fix
> your "segfault at ff5ff100" crashes: please give it a try when
> you can, and let us know - thanks.

Already done, and yes, it did.

-Mike


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Guenter Roeck

On 01/05/2018 12:54 PM, Greg Kroah-Hartman wrote:
> On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
>> On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
>>> This is the start of the stable review cycle for the 4.4.110 release.
>>> There are 37 patches in this series, all will be posted as a response
>>> to this one.  If anyone has any issues with these being applied, please
>>> let me know.
>>>
>>> Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
>>> Anything received after that time might be too late.
>>
>> Update: v4.4.110 final nosmp builds fail as follows:
>>
>> Error log:
>> arch/x86/entry/vdso/vma.c: In function ‘map_vdso’:
>> arch/x86/entry/vdso/vma.c:173:9: error:
>>   implicit declaration of function ‘pvclock_pvti_cpu0_va’
>
> x86-64 or i386?
> That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue,
> have a .config I can try?

Here is an easier way to reproduce the problem: make allnoconfig ; make

Guenter



Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Pavel Tatashin

Hi Hugh,

Thank you very much for your very thoughtful input.

I am quite positive this problem is a PTI regression, because I see exactly
the same problem with kernel 4.1, to which I back-ported all the necessary
PTI patches from 4.4.110. I will provide this thread with more information
as I collect it. I will also try to root-cause the problem.


The bug behaves like memory corruption, but with both the 4.1 and 4.4
kernels the problem goes away when I boot with the noefi parameter. So,
EFI + PTI is the culprit for this memory corruption.


Thank you,
Pavel

On 01/05/2018 06:15 PM, Hugh Dickins wrote:
> On Fri, Jan 5, 2018 at 1:03 PM, Pavel Tatashin
>  wrote:
>> The hardware works :) I meant that before the patch linked in
>> https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
>> with that patch applied, I was able to boot it at least once, but it could
>> be accidental. The hang/panic does not happen at the same time on every
>> boot.
>
> I get the feeling that it was accidental: it seems to me that you have
> a memory corruption problem, that gets shifted around by the different
> patches (or "noefi" or "nopti").
>
> Because yesterday your boots were able to get way beyond the "EFI
> Variables Facility" message, and I can't imagine why the EFI issue
> would not have been equally debilitating on yesterday's 110-rc, if it
> were in play.
>
> I did intend to ask you to send your System.map, for us to scan
> through: maybe some variable is marked __init and should not be, then
> the "Freeing unused kernel memory" frees it for random reuse.
>
> But today you didn't get anywhere near the "Freeing unused kernel
> memory", so that can't be it - or do you sometimes get that far today?
>
> You mention that the hang/panic does not happen at the same time on
> every boot: I think all I can ask is for you to keep supplying us with
> different examples (console messages) of where it occurs, in the hope
> that one of them will point us in the right direction.
>
> And it even seems possible that this has nothing to do with the
> 4.4.110 changes - that 4.4.109 plus some other random patches would
> unleash similar corruption. Though on balance that does seem unlikely.
>
> Hugh


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Hugh Dickins
On Fri, Jan 5, 2018 at 6:03 AM, Mike Galbraith  wrote:
> On Fri, 2018-01-05 at 14:34 +0100, Greg Kroah-Hartman wrote:
>>
>> Ok, we found two patches that were missing in 4.4-stable that were in
>> the SLES12 tree (thanks to Jamie Iles), now I only have 19k more to sift
>> through :)
>
> As you know, in enterprise, uname -r means you might find something
> this old in your kernel if you look hard enough :)

Mike, I think there's a good chance that Greg's 4.4.110 final will fix
your "segfault at ff5ff100" crashes: please give it a try when
you can, and let us know - thanks.

Hugh


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Hugh Dickins
On Fri, Jan 5, 2018 at 1:03 PM, Pavel Tatashin
 wrote:
> The hardware works :) I meant that before the patch linked in
> https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. But
> with that patch applied, I was able to boot it at least once, but it could
> be accidental. The hang/panic does not happen at the same time on every
> boot.

I get the feeling that it was accidental: it seems to me that you have
a memory corruption problem, that gets shifted around by the different
patches (or "noefi" or "nopti").

Because yesterday your boots were able to get way beyond the "EFI
Variables Facility" message, and I can't imagine why the EFI issue
would not have been equally debilitating on yesterday's 110-rc, if it
were in play.

I did intend to ask you to send your System.map, for us to scan
through: maybe some variable is marked __init and should not be, then
the "Freeing unused kernel memory" frees it for random reuse.
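
(A rough sketch of the System.map scan suggested above, in Python. It
assumes the usual three-column "address type name" layout and that the map
carries __init_begin and __init_end markers. The symbol names in SAMPLE_MAP
are made up for illustration. It only lists symbols that sit inside the
init range, as candidates to review by hand; it cannot tell whether a given
symbol is wrongly marked __init.)

```python
def parse_system_map(text):
    """Parse 'address type name' lines into (address, type, name) tuples."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, kind, name = parts
        symbols.append((int(addr, 16), kind, name))
    return symbols


def init_section_symbols(text, data_only=True):
    """List symbols located in [__init_begin, __init_end).

    With data_only=True, keep only data/bss symbols (types b, B, d, D),
    since variables are what would be freed and reused after boot.
    """
    symbols = parse_system_map(text)
    bounds = {name: addr for addr, _, name in symbols}
    start, end = bounds["__init_begin"], bounds["__init_end"]
    data_kinds = set("bBdD")
    return [
        name
        for addr, kind, name in symbols
        if start <= addr < end
        and name not in ("__init_begin", "__init_end")
        and (not data_only or kind in data_kinds)
    ]


# Hypothetical map fragment; the example_* names do not exist in the kernel.
SAMPLE_MAP = """\
ffffffff81000000 T _text
ffffffff81a00000 D example_runtime_table
ffffffff81f00000 R __init_begin
ffffffff81f00040 t example_setup
ffffffff81f01000 d example_boot_table
ffffffff82000000 R __init_end
"""

print(init_section_symbols(SAMPLE_MAP))
```

On a real build this would be called as
init_section_symbols(open("System.map").read()), and every name it prints
would still need checking against the code that uses it after boot.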

But today you didn't get anywhere near the "Freeing unused kernel
memory", so that can't be it - or do you sometimes get that far today?

You mention that the hang/panic does not happen at the same time on
every boot: I think all I can ask is for you to keep supplying us with
different examples (console messages) of where it occurs, in the hope
that one of them will point us in the right direction.

And it even seems possible that this has nothing to do with the
4.4.110 changes - that 4.4.109 plus some other random patches would
unleash similar corruption. Though on balance that does seem unlikely.

Hugh

>
> Pasha
>
>
> On 01/05/2018 03:45 PM, Greg Kroah-Hartman wrote:
>>
>> On Fri, Jan 05, 2018 at 02:18:32PM -0500, Pavel Tatashin wrote:
>>>
>>> Actually it helps: before, 4.4.110 never booted on my machine; now I
>>> was able to boot on a second try.
>>
>>
>> Wait, what?  This has never booted on 4.4.x before?  Did 4.4.108 work?
>> 109?  Are you sure this hardware even works?  :)
>>
>> thanks,
>>
>> greg k-h
>>
>


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Guenter Roeck
On Fri, Jan 05, 2018 at 09:54:45PM +0100, Greg Kroah-Hartman wrote:
> On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
> > On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
> > > This is the start of the stable review cycle for the 4.4.110 release.
> > > There are 37 patches in this series, all will be posted as a response
> > > to this one.  If anyone has any issues with these being applied, please
> > > let me know.
> > > 
> > > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > > Anything received after that time might be too late.
> > > 
> > 
> > Update: v4.4.110 final nosmp builds fail as follows:
> > 
> > 
> > Error log:
> > arch/x86/entry/vdso/vma.c: In function ‘map_vdso’:
> > arch/x86/entry/vdso/vma.c:173:9: error:
> > implicit declaration of function ‘pvclock_pvti_cpu0_va’
> 
> x86-64 or i386?

x86-64

> That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue,
> have a .config I can try?
> 
https://github.com/groeck/linux-build-test/blob/master/rootfs/x86_64/qemu_x86_64_pc_nosmp_defconfig

However,
https://github.com/groeck/linux-build-test/blob/master/rootfs/x86_64/qemu_x86_64_pc_defconfig
does build, and the only differences are:

30a31
> CONFIG_SMP=y
32a34,35
> CONFIG_NR_CPUS=24
> CONFIG_SCHED_SMT=y
44d46
< CONFIG_ACPI_CONTAINER=y

Both configurations have CONFIG_PARAVIRT_CLOCK disabled.

Guenter
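
(The comparison above can also be made by option name rather than by diff
line number. A minimal Python sketch, assuming plain .config/defconfig
text. NOSMP and SMP are cut-down illustrative fragments, and
CONFIG_EXAMPLE_COMMON is a made-up placeholder; they are not the real
defconfig files linked above.)

```python
def parse_config(text):
    """Return {option: value} for every line that sets a CONFIG_ option.

    Lines such as '# CONFIG_FOO is not set' are ignored, so an unset
    option simply does not appear in the result.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(old_text, new_text):
    """Return (only_in_old, only_in_new, changed) for two config texts."""
    old, new = parse_config(old_text), parse_config(new_text)
    only_old = {k: old[k] for k in old.keys() - new.keys()}
    only_new = {k: new[k] for k in new.keys() - old.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return only_old, only_new, changed


# Illustrative fragments only.
NOSMP = """\
CONFIG_EXAMPLE_COMMON=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_PARAVIRT_CLOCK is not set
"""
SMP = """\
CONFIG_EXAMPLE_COMMON=y
CONFIG_SMP=y
CONFIG_NR_CPUS=24
CONFIG_SCHED_SMT=y
# CONFIG_PARAVIRT_CLOCK is not set
"""

only_nosmp, only_smp, changed = diff_configs(NOSMP, SMP)
print("only in nosmp:", sorted(only_nosmp))
print("only in smp:  ", sorted(only_smp))
print("changed:      ", changed)
```

Note that a defconfig is not the final .config: options selected
indirectly during "make" configuration would only show up when comparing
the generated .config files.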


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Pavel Tatashin
The hardware works :) I meant that before the patch linked in 
https://lkml.org/lkml/2018/1/5/534, I was never able to boot 4.4.110. 
But with that patch applied, I was able to boot it at least once, but it 
could be accidental. The hang/panic does not happen at the same time on 
every boot.


Pasha

On 01/05/2018 03:45 PM, Greg Kroah-Hartman wrote:
> On Fri, Jan 05, 2018 at 02:18:32PM -0500, Pavel Tatashin wrote:
>> Actually it helps: before, 4.4.110 never booted on my machine; now I
>> was able to boot on a second try.
>
> Wait, what?  This has never booted on 4.4.x before?  Did 4.4.108 work?
> 109?  Are you sure this hardware even works?  :)
>
> thanks,
>
> greg k-h



Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 09:56:16AM -0800, Guenter Roeck wrote:
> On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.4.110 release.
> > There are 37 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > Anything received after that time might be too late.
> > 
> 
> Update: v4.4.110 final nosmp builds fail as follows:
> 
> 
> Error log:
> arch/x86/entry/vdso/vma.c: In function ‘map_vdso’:
> arch/x86/entry/vdso/vma.c:173:9: error:
>   implicit declaration of function ‘pvclock_pvti_cpu0_va’

x86-64 or i386?
That should be a CONFIG_PARAVIRT_CLOCK issue, not a smp build issue,
have a .config I can try?

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 10:12:38AM -0800, Guenter Roeck wrote:
> On Fri, Jan 05, 2018 at 04:00:55PM +0100, Greg Kroah-Hartman wrote:
> > On Thu, Jan 04, 2018 at 09:56:47AM -0800, Guenter Roeck wrote:
> > > 
> > > FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
> > 
> > That's good to know, hopefully 4.4.110-final also still works for you :)
> 
> It seems to be working. One patch to add for v4.4.111: 
> 
> 063fb3e56f6d ("x86/kasan: Write protect kasan zero shadow")
> 
> It is needed to be able to run KASAN enabled images in KVM.

Ugh, thanks for that; it looks like SLES is missing that one too.

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 02:18:32PM -0500, Pavel Tatashin wrote:
> Actually it helps: before, 4.4.110 never booted on my machine; now I
> was able to boot on a second try.

Wait, what?  This has never booted on 4.4.x before?  Did 4.4.108 work?
109?  Are you sure this hardware even works?  :)

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 10:15:00AM -0800, Andy Lutomirski wrote:
> On Fri, Jan 5, 2018 at 9:52 AM, Greg Kroah-Hartman
>  wrote:
> > On Fri, Jan 05, 2018 at 12:48:54PM -0500, Pavel Tatashin wrote:
> >> Boots successfully with "noefi" kernel parameter :)
> >
> > Thanks, that will help me narrow it down.  I'll dig through more patches
> > when I get home tonight...
> 
> I wish you luck.  The 4.4 series is "KAISER", not "KPTI", and the
> relevant code is spread all over the place and is generally garbage.
> See, for example, the turd called kaiser_set_shadow_pgd().  I would
> not be terribly surprised if that particular turd is biting here.
> 
> An alternative theory is that something is screwy in the EFI code.  I
> don't see anything directly wrong, but it's certainly a bit sketchy.
> The newer kernels carefully avoid using PCID 0 for real work to avoid
> corruption due to EFI and similar things.  The "KAISER" code has no
> such mitigation.  Fortunately, it seems to use PCID=0 for kernel and
> PCID=nonzero for user, so the obvious problem isn't present, but
> something could still be wrong.
> 
> Pavel, can you send your /proc/cpuinfo on a noefi boot?  (Just the
> first CPU worth is fine.)
> 
> FWIW, I said before that I have very little desire to help debug
> "KAISER".  I stand by that.

I totally understand, and do not expect your help at all.

Worst case, I point people at 4.14 and tell them to upgrade, I'm not
going to waste a ton of time on this for the same exact reasons you list
here.

And yeah, kaiser_set_shadow_pgd() is horrid, I've already gotten sucked
into it for long enough...

greg k-h
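
(For readers unfamiliar with the PCID point quoted above: with CR4.PCIDE
set, the low 12 bits of CR3 carry the PCID, and bit 63 of the value written
to CR3 tells the CPU it need not flush TLB entries tagged with that PCID.
A small Python sketch of that layout follows. The page-table addresses and
the nonzero PCID are arbitrary illustration values, not the ones the
"KAISER" backport actually uses.)

```python
PCID_BITS = 12
PCID_MASK = (1 << PCID_BITS) - 1   # CR3 bits 11:0 hold the PCID
NOFLUSH = 1 << 63                  # bit 63 of the value written to CR3


def make_cr3(pgd_phys, pcid, noflush=False):
    """Compose a CR3 value from a page-table base, a PCID and the no-flush bit."""
    if pgd_phys & PCID_MASK:
        raise ValueError("page table base must be 4 KiB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = pgd_phys | pcid
    if noflush:
        value |= NOFLUSH
    return value


def split_cr3(value):
    """Split a CR3 value back into (page table base, PCID, no-flush flag)."""
    base = value & ~(NOFLUSH | PCID_MASK)
    return base, value & PCID_MASK, bool(value & NOFLUSH)


# Arbitrary example addresses: one page table for kernel mode, one for user mode.
kernel_cr3 = make_cr3(0x1234000, pcid=0)
user_cr3 = make_cr3(0x1235000, pcid=1, noflush=True)

print(hex(kernel_cr3), split_cr3(kernel_cr3))
print(hex(user_cr3), split_cr3(user_cr3))
```

The concern raised in the quote is about which PCID firmware-related code
such as EFI may end up running under. The sketch only shows how the tag is
encoded, not how either kernel series assigns it.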


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Pavel Tatashin
Actually it helps: before, 4.4.110 never booted on my machine; now I
was able to boot on a second try.

On Fri, Jan 5, 2018 at 2:14 PM, Pavel Tatashin
 wrote:
> I hoped this patch would fix the EFI issue:
> https://lkml.org/lkml/2018/1/5/534
>
> But unfortunately it does not. I got a partial panic message this time:
>
> [4.737578] usb 1-1: new high-speed USB device number 2 using ehci-pci
> [4.846712] BUG: unable to handle kernel paging request at 00017e10
> [4.854509] IP: [] native_queued_spin_lock_slowpath+0xfe/0x170
> [4.862780] PGD 0
> [4.865034] Oops: 0002 [#1] SMP
> [4.868657] Modules linked in:
> [4.872075] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.110_pt_linux-v4.4.110 #3
> [4.880526] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [4.891596] task: 81aab500 ti: 81a98000 task.ti: 81a98000
> [4.899950] RIP: 0010:[]  [] native_queued_spin_lock_slowpath+0xfe/0x170
> [4.910936] RSP: :881fff803c88  EFLAGS: 00010002
> [4.916865] RAX: 206b RBX: 88407e611900 RCX: 881fff817e00
> [4.924831] RDX: 00017e10 RSI: 0004 RDI: 88407e611a58
> [4.932797] RBP: 881fff803c88 R08: 0101 R09: 
> [4.940764] R10: 5c96d000 R11: 88005c96d0c0 R12: 881ff25e52c8
> [4.948730] R13: 88407e6d1900 R14: 881fff8118c0 R15: 88407e6118c0
> [4.956696] FS:  () GS:881fff80() knlGS:
> [4.965727] CS:  0010 DS:  ES:  CR0: 80050033
> [4.972140] CR2: 00017e10 CR3: 01aa2000 CR4: 003606
>
> On Fri, Jan 5, 2018 at 1:21 PM, Pavel Tatashin
>  wrote:
>>> Pavel, can you send your /proc/cpuinfo on a noefi boot?  (Just the
>>> first CPU worth is fine.)
>>
>> With noefi option:
>>
>> [root@ca-ostest441 ~]# more /proc/cpuinfo
>> processor   : 0
>> vendor_id   : GenuineIntel
>> cpu family  : 6
>> model   : 79
>> model name  : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> stepping: 1
>> microcode   : 0xb1d
>> cpu MHz : 1971.406
>> cache size  : 25600 KB
>> physical id : 0
>> siblings: 20
>> core id : 0
>> cpu cores   : 10
>> apicid  : 0
>> initial apicid  : 0
>> fpu : yes
>> fpu_exception   : yes
>> cpuid level : 20
>> wp  : yes
>> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
>> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
>> aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
>> sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat
>> epb invpcid_single pln pts dtherm intel_pt kaiser tpr_shadow vnmi flexpriority
>> ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed
>> adx smap xsaveopt cqm_llc cqm_occup_llc
>> bugs:
>> bogomips: 4390.08
>> clflush size: 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Pavel Tatashin
I hoped this patch would fix the EFI issue:
https://lkml.org/lkml/2018/1/5/534

But unfortunately it does not. I got a partial panic message this time:

[4.737578] usb 1-1: new high-speed USB device number 2 using ehci-pci
[4.846712] BUG: unable to handle kernel paging request at 00017e10
[4.854509] IP: [] native_queued_spin_lock_slowpath+0xfe/0x170
[4.862780] PGD 0
[4.865034] Oops: 0002 [#1] SMP
[4.868657] Modules linked in:
[4.872075] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.110_pt_linux-v4.4.110 #3
[4.880526] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[4.891596] task: 81aab500 ti: 81a98000 task.ti: 81a98000
[4.899950] RIP: 0010:[]  [] native_queued_spin_lock_slowpath+0xfe/0x170
[4.910936] RSP: :881fff803c88  EFLAGS: 00010002
[4.916865] RAX: 206b RBX: 88407e611900 RCX: 881fff817e00
[4.924831] RDX: 00017e10 RSI: 0004 RDI: 88407e611a58
[4.932797] RBP: 881fff803c88 R08: 0101 R09: 
[4.940764] R10: 5c96d000 R11: 88005c96d0c0 R12: 881ff25e52c8
[4.948730] R13: 88407e6d1900 R14: 881fff8118c0 R15: 88407e6118c0
[4.956696] FS:  () GS:881fff80() knlGS:
[4.965727] CS:  0010 DS:  ES:  CR0: 80050033
[4.972140] CR2: 00017e10 CR3: 01aa2000 CR4: 003606
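
For anyone reading the trace above: the value after "Oops:" is the x86 page-fault error code, and CR2 holds the faulting address. Below is a minimal sketch of decoding that code in Python. The function name and table are hypothetical helpers written for illustration, not part of any kernel tool; only the bit meanings come from the architecture definition.

```python
# Illustrative decoder for the x86 page-fault error code printed after "Oops:".
# PF_BITS and decode_oops_error_code are hypothetical names for this sketch.
PF_BITS = [
    # (mask, meaning if set, meaning if clear or None)
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_error_code(code_hex):
    """Turn the hex string from an 'Oops: XXXX' line into readable parts."""
    code = int(code_hex, 16)
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


# The code from the trace in this message.
print(decode_oops_error_code("0002"))
```

For "0002" this reports a kernel-mode write to a page that is not present, which fits the "unable to handle kernel paging request" line and the low address in CR2.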

On Fri, Jan 5, 2018 at 1:21 PM, Pavel Tatashin
 wrote:
>> Pavel, can you send your /proc/cpuinfo on a noefi boot?  (Just the
>> first CPU worth is fine.)
>
> With noefi option:
>
> [root@ca-ostest441 ~]# more /proc/cpuinfo
> processor   : 0
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 79
> model name  : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> stepping: 1
> microcode   : 0xb1d
> cpu MHz : 1971.406
> cache size  : 25600 KB
> physical id : 0
> siblings: 20
> core id : 0
> cpu cores   : 10
> apicid  : 0
> initial apicid  : 0
> fpu : yes
> fpu_exception   : yes
> cpuid level : 20
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
> aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
> sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat
> epb invpcid_single pln pts dtherm intel_pt kaiser tpr_shadow vnmi flexpriority
> ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed
> adx smap xsaveopt cqm_llc cqm_occup_llc
> bugs:
> bogomips: 4390.08
> clflush size: 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Pavel Tatashin
> Pavel, can you send your /proc/cpuinfo on a noefi boot?  (Just the
> first CPU worth is fine.)

With noefi option:

[root@ca-ostest441 ~]# more /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 79
model name  : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping: 1
microcode   : 0xb1d
cpu MHz : 1971.406
cache size  : 25600 KB
physical id : 0
siblings: 20
core id : 0
cpu cores   : 10
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 20
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat
epb invpcid_single pln pts dtherm intel_pt kaiser tpr_shadow vnmi flexpriority
ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed
adx smap xsaveopt cqm_llc cqm_occup_llc
bugs:
bogomips: 4390.08
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
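
The request for /proc/cpuinfo was presumably about which paging-related features this CPU reports. A minimal sketch of checking that from the raw file, assuming the flags sit on a single line as the kernel emits them (the pasted copy above was wrapped by the terminal). The function names and the default flag list are choices made for this example only.

```python
# Sketch: pull the "flags" line out of /proc/cpuinfo text and test for a few
# features. cpu_flags() and report() are hypothetical helper names.

def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line in the text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=("pcid", "invpcid", "kaiser")):
    """Map each wanted flag name to True/False depending on presence."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


# Shortened sample in the same format as the output above.
sample = "processor\t: 0\nflags\t\t: fpu pcid invpcid_single kaiser invpcid\n"
print(report(sample))
```

On a live system the same check would be report(open("/proc/cpuinfo").read()). Flags are matched as whole tokens, so invpcid_single does not count as invpcid.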


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Andy Lutomirski
On Fri, Jan 5, 2018 at 9:52 AM, Greg Kroah-Hartman
 wrote:
> On Fri, Jan 05, 2018 at 12:48:54PM -0500, Pavel Tatashin wrote:
>> Boots successfully with "noefi" kernel parameter :)
>
> Thanks, that will help me narrow it down.  I'll dig through more patches
> when I get home tonight...

I wish you luck.  The 4.4 series is "KAISER", not "KPTI", and the
relevant code is spread all over the place and is generally garbage.
See, for example, the turd called kaiser_set_shadow_pgd().  I would
not be terribly surprised if that particular turd is biting here.

An alternative theory is that something is screwy in the EFI code.  I
don't see anything directly wrong, but it's certainly a bit sketchy.
The newer kernels carefully avoid using PCID 0 for real work to avoid
corruption due to EFI and similar things.  The "KAISER" code has no
such mitigation.  Fortunately, it seems to use PCID=0 for kernel and
PCID=nonzero for user, so the obvious problem isn't present, but
something could still be wrong.
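
As background for the PCID point above: with CR4.PCIDE set, the low 12 bits of CR3 carry the PCID, and bit 63 on a CR3 write asks the CPU not to flush TLB entries for that PCID. A minimal sketch of that encoding follows. The helper names are hypothetical, the user PCID value of 1 is an arbitrary example rather than what the KAISER code uses, and the CR3 value is the one printed in the oops elsewhere in this thread.

```python
# Sketch of how CR3 encodes a PCID when CR4.PCIDE is enabled.
# pcid_of() and make_cr3() are hypothetical helpers for illustration.
CR3_PCID_MASK = 0xFFF      # bits 0-11: PCID
CR3_NOFLUSH = 1 << 63      # bit 63 on write: keep TLB entries for this PCID


def pcid_of(cr3):
    """Extract the PCID field from a CR3 value."""
    return cr3 & CR3_PCID_MASK


def make_cr3(pgd_phys, pcid, noflush=False):
    """Combine a 4 KiB aligned page-table base with a PCID."""
    if pgd_phys & CR3_PCID_MASK:
        raise ValueError("page-table base must be 4 KiB aligned")
    if not 0 <= pcid <= CR3_PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = pgd_phys | pcid
    if noflush:
        value |= CR3_NOFLUSH
    return value


kernel_cr3 = 0x01AA2000                 # CR3 as printed in the oops
print(pcid_of(kernel_cr3))              # PCID field of the kernel CR3
print(hex(make_cr3(kernel_cr3, 1)))     # same base with example PCID 1
```

The kernel CR3 from the trace decodes to PCID 0, which is at least consistent with the "PCID=0 for kernel" reading above.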

Pavel, can you send your /proc/cpuinfo on a noefi boot?  (Just the
first CPU worth is fine.)

FWIW, I said before that I have very little desire to help debug
"KAISER".  I stand by that.


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Guenter Roeck
On Fri, Jan 05, 2018 at 04:00:55PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Jan 04, 2018 at 09:56:47AM -0800, Guenter Roeck wrote:
> > 
> > FWIW, v4.4.110-rc1 boots fine when merged into chromeos-4.4, on i7-7Y75.
> 
> That's good to know, hopefully 4.4.110-final also still works for you :)

It seems to be working. One patch to add for v4.4.111: 

063fb3e56f6d ("x86/kasan: Write protect kasan zero shadow")

It is needed to be able to run KASAN enabled images in KVM.

Guenter


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Sat, Jan 06, 2018 at 02:20:16AM +0900, Alice Ferrazzi wrote:
> On Thu, Jan 4, 2018 at 5:11 AM, Greg Kroah-Hartman
>  wrote:
> > This is the start of the stable review cycle for the 4.4.110 release.
> > There are 37 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > or in the git tree and branch at:
> >   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > linux-4.4.y
> > and the diffstat can be found below.
> >
> 
> This patchset merges correctly with Gentoo patches and GCC version 6.4.0.
> The kernel boots up correctly.
> Logs: http://kernel1.amd64.dev.gentoo.org:8010/#/builders/5/builds/44

Great, but Gentoo really should be moving to 4.9 and 4.14 here, I
hope no one running Gentoo is relying on 4.4 :)

thanks,

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 04:57:15PM +0100, Willy Tarreau wrote:
> On Fri, Jan 05, 2018 at 04:51:32PM +0100, Greg Kroah-Hartman wrote:
> > On Fri, Jan 05, 2018 at 10:32:49AM -0500, Pavel Tatashin wrote:
> (...)
> > > Reboots after about 30 seconds.
> > > 
> > > Boots fine with nopti option.
> > 
> > Crap.
> > 
> > And 4.9.75 works for you just fine?  Same with 4.15-rc6?
> > 
> > I'm wondering if this is some crazy gcc thing, given the ancient age of
> > what you are using (gcc 4.8.5).  I haven't used 4.x in many many years,
> > is this what comes with RHEL6?  What is the "base" distro you are
> > building this on, and anything special about the hardware being used
> > here?
> 
> I don't think so, I'm personally building with 4.7.4 and am not seeing
> this with 4.4.110.

Ok, looks like an efi issue...

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Guenter Roeck
On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.110 release.
> There are 37 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> Anything received after that time might be too late.
> 

Update: v4.4.110 final nosmp builds fail as follows:


Error log:
arch/x86/entry/vdso/vma.c: In function ‘map_vdso’:
arch/x86/entry/vdso/vma.c:173:9: error: implicit declaration of function ‘pvclock_pvti_cpu0_va’

Guenter


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Greg Kroah-Hartman
On Fri, Jan 05, 2018 at 12:48:54PM -0500, Pavel Tatashin wrote:
> Boots successfully with "noefi" kernel parameter :)

Thanks, that will help me narrow it down.  I'll dig through more patches
when I get home tonight...

greg k-h


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-05 Thread Guenter Roeck
On Fri, Jan 05, 2018 at 02:41:04PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Jan 04, 2018 at 03:45:55PM -0800, Guenter Roeck wrote:
> > On Wed, Jan 03, 2018 at 09:11:06PM +0100, Greg Kroah-Hartman wrote:
> > > This is the start of the stable review cycle for the 4.4.110 release.
> > > There are 37 patches in this series, all will be posted as a response
> > > to this one.  If anyone has any issues with these being applied, please
> > > let me know.
> > > 
> > > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > > Anything received after that time might be too late.
> > > 
> > 
> > > This is also reported to crash if loaded under qemu + haxm under Windows.
[ ... ]
> > The crash part of this problem may be solved with the following patch
> > (thanks to Hugh for the hint). There is still another problem, though -
> > with this patch applied, the qemu session aborts with "VCPU Shutdown
> > request", whatever that means.
> > 

v4.4.110 still suffers from "VCPU Shutdown request" with qemu+haxm.
Unfortunately I don't have any other information about the problem
at this time.

Guenter

