Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Rafael J. Wysocki
On Thursday, July 28, 2016 01:20:53 AM Rafael J. Wysocki wrote:
> On Wednesday, July 27, 2016 05:17:38 PM Josh Poimboeuf wrote:
> > On Thu, Jul 28, 2016 at 12:12:15AM +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, July 27, 2016 12:59:18 PM Josh Poimboeuf wrote:
> > > > Hm... I have a theory, but I'm not sure about it.  I noticed that
> > > > x86_acpi_enter_sleep_state(),
> > > 
> > > I think you mean x86_acpi_suspend_lowlevel().
> > 
> > Oops!
> > 
> > > > which is involved in suspend, overwrites
> > > > several global variables (e.g, initial_code) which are used by the CPU
> > > > boot code in head_64.S.  But surprisingly, it doesn't restore those
> > > > variables to their original values after it resumes.
> > > 
> > > Is the head_64.S code also used to bring up offline CPUs?
> > 
> > Yes.
> 
> OK
> 
> So it is really interesting why and how that stuff works for everybody.
> 
> Basically, CPU online should fail after a suspend-resume cycle, but it
> doesn't most of the time AFAICS.

do_boot_cpu() restores those values, so I think we're safe from that angle.

That should apply to the CPU online during resume from hibernation too.

Thanks,
Rafael
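
A minimal sketch of the point made above, not kernel code: every function here is a hypothetical stand-in that only borrows the names used in the thread. It assumes, per the message, that the suspend path leaves initial_code pointing at the wakeup entry and that do_boot_cpu() re-arms it before a CPU is started.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the entry points named in the thread. */
typedef int (*entry_fn)(void);

enum { ENTRY_START_SECONDARY = 1, ENTRY_WAKEUP_LONG64 = 2 };

int start_secondary(void) { return ENTRY_START_SECONDARY; }
int wakeup_long64(void)   { return ENTRY_WAKEUP_LONG64; }

/* Models the global that the head_64.S boot code reads to pick a jump target. */
static entry_fn initial_code = start_secondary;

/* Models x86_acpi_suspend_lowlevel(): overwrites the global and,
 * as noted in the thread, does not put the old value back. */
void suspend_lowlevel_model(void)
{
    initial_code = wakeup_long64;
}

/* Models the boot code: jump to whatever initial_code holds right now. */
int secondary_startup_model(void)
{
    assert(initial_code != NULL);
    return initial_code();
}

/* Models do_boot_cpu(): re-arm initial_code, then start the CPU. */
int do_boot_cpu_model(void)
{
    initial_code = start_secondary;
    return secondary_startup_model();
}
```

In this model a CPU started straight after suspend_lowlevel_model() would land in wakeup_long64(), while one started through do_boot_cpu_model() lands in start_secondary() regardless of the stale value.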



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Rafael J. Wysocki
On Wednesday, July 27, 2016 05:17:38 PM Josh Poimboeuf wrote:
> On Thu, Jul 28, 2016 at 12:12:15AM +0200, Rafael J. Wysocki wrote:
> > On Wednesday, July 27, 2016 12:59:18 PM Josh Poimboeuf wrote:
> > > Hm... I have a theory, but I'm not sure about it.  I noticed that
> > > x86_acpi_enter_sleep_state(),
> > 
> > I think you mean x86_acpi_suspend_lowlevel().
> 
> Oops!
> 
> > > which is involved in suspend, overwrites
> > > several global variables (e.g, initial_code) which are used by the CPU
> > > boot code in head_64.S.  But surprisingly, it doesn't restore those
> > > variables to their original values after it resumes.
> > 
> > Is the head_64.S code also used to bring up offline CPUs?
> 
> Yes.

OK

So it is really interesting why and how that stuff works for everybody.

Basically, CPU online should fail after a suspend-resume cycle, but it
doesn't most of the time AFAICS.

> > If not, then this is not the problem, because hibernation doesn't use it
> > for the boot CPU anyway.
> > 
> > > So if a suspend and resume were done before the hibernate, those
> > > variables would presumably have suspend-centric values, and the first
> > > time a CPU is brought up during the hibernation restore operation, it
> > > would jump to wakeup_long64() (the suspend resume function) instead of
> > > start_secondary (which is the normal CPU boot function).
> > > 
> > > So, if true, that would explain why my patch triggers a bug:
> > > wakeup_long64() always[*] jumps to .Lresume_point, which my patch
> > > affected.  Because of the FRAME_END, it would pop an extra value off the
> > > stack.  So when restore_processor_state() returns, it would return to
> > > whatever random address is on the stack after the real RIP.  Which is
> > > consistent with the oops from the bug.  It had a bad instruction
> > > pointer, which looked like a stack address.
> > 
> > OK, so why doesn't it break resume from suspend to RAM?
> 
> Because for suspend to RAM, it enters suspend through
> do_suspend_lowlevel(), which has the FRAME_BEGIN which corresponds to
> .Lresume_point's FRAME_END.
> 
> > wakeup_long64 is invoked by the CPU startup code then and doesn't the
> > FRAME_END affect that too?
> 
> Yes, I would imagine that any CPU startup operation (after
> suspend/resume to RAM) would be affected.

That would mean that your patch is needed anyway, wouldn't it?

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Josh Poimboeuf
On Thu, Jul 28, 2016 at 12:12:15AM +0200, Rafael J. Wysocki wrote:
> On Wednesday, July 27, 2016 12:59:18 PM Josh Poimboeuf wrote:
> > Hm... I have a theory, but I'm not sure about it.  I noticed that
> > x86_acpi_enter_sleep_state(),
> 
> I think you mean x86_acpi_suspend_lowlevel().

Oops!

> > which is involved in suspend, overwrites
> > several global variables (e.g, initial_code) which are used by the CPU
> > boot code in head_64.S.  But surprisingly, it doesn't restore those
> > variables to their original values after it resumes.
> 
> Is the head_64.S code also used to bring up offline CPUs?

Yes.

> If not, then this is not the problem, because hibernation doesn't use it
> for the boot CPU anyway.
> 
> > So if a suspend and resume were done before the hibernate, those
> > variables would presumably have suspend-centric values, and the first
> > time a CPU is brought up during the hibernation restore operation, it
> > would jump to wakeup_long64() (the suspend resume function) instead of
> > start_secondary (which is the normal CPU boot function).
> > 
> > So, if true, that would explain why my patch triggers a bug:
> > wakeup_long64() always[*] jumps to .Lresume_point, which my patch
> > affected.  Because of the FRAME_END, it would pop an extra value off the
> > stack.  So when restore_processor_state() returns, it would return to
> > whatever random address is on the stack after the real RIP.  Which is
> > consistent with the oops from the bug.  It had a bad instruction
> > pointer, which looked like a stack address.
> 
> OK, so why doesn't it break resume from suspend to RAM?

Because for suspend to RAM, it enters suspend through
do_suspend_lowlevel(), which has the FRAME_BEGIN which corresponds to
.Lresume_point's FRAME_END.

> wakeup_long64 is invoked by the CPU startup code then and doesn't the
> FRAME_END affect that too?

Yes, I would imagine that any CPU startup operation (after
suspend/resume to RAM) would be affected.

-- 
Josh
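
A minimal sketch of the FRAME_BEGIN/FRAME_END pairing described above, written as a toy stack model rather than kernel code. It assumes FRAME_BEGIN behaves as "push %rbp; mov %rsp, %rbp" and FRAME_END as "pop %rbp"; the constants and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16
#define RETURN_ADDR 0x1111u   /* hypothetical "real RIP" pushed by call */
#define STACK_JUNK  0x2222u   /* hypothetical value sitting above it */

struct cpu_model {
    uint64_t stack[STACK_WORDS];
    int sp;        /* index of the top slot; the stack grows toward index 0 */
    uint64_t rbp;
};

void push(struct cpu_model *c, uint64_t v)
{
    assert(c->sp > 0);
    c->stack[--c->sp] = v;
}

uint64_t pop(struct cpu_model *c)
{
    assert(c->sp < STACK_WORDS);
    return c->stack[c->sp++];
}

/* FRAME_BEGIN: push %rbp; mov %rsp, %rbp */
void frame_begin(struct cpu_model *c)
{
    push(c, c->rbp);
    c->rbp = (uint64_t)c->sp;
}

/* FRAME_END: pop %rbp */
void frame_end(struct cpu_model *c)
{
    c->rbp = pop(c);
}

/* Epilogue at the resume point: FRAME_END, then ret. Returns the ret target. */
uint64_t resume_epilogue(struct cpu_model *c)
{
    frame_end(c);
    return pop(c);
}

/* Returns where ret lands; with_frame_begin says whether the entry pushed %rbp. */
uint64_t ret_target(int with_frame_begin)
{
    struct cpu_model c = { .sp = STACK_WORDS, .rbp = 0 };

    push(&c, STACK_JUNK);   /* whatever the caller left on the stack */
    push(&c, RETURN_ADDR);  /* pushed by the call instruction */
    if (with_frame_begin)
        frame_begin(&c);
    return resume_epilogue(&c);
}
```

With the matching FRAME_BEGIN the model returns to RETURN_ADDR. Without it, FRAME_END consumes the return address and ret picks up the next stack word, which matches the "bad instruction pointer that looks like a stack address" symptom discussed in the thread.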


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Rafael J. Wysocki
On Thursday, July 28, 2016 12:12:15 AM Rafael J. Wysocki wrote:
> On Wednesday, July 27, 2016 12:59:18 PM Josh Poimboeuf wrote:
> > On Wed, Jul 27, 2016 at 01:08:21AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jul 27, 2016 at 12:42 AM, Rafael J. Wysocki wrote:
> > > > On Tuesday, July 26, 2016 04:53:19 PM Josh Poimboeuf wrote:
> > > >> On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
> > > >> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> > > >> > > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > The following commit:
> > > >> > > >
> > > >> > > > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > > >> > > > Author: Josh Poimboeuf 
> > > >> > > > Date:   Thu Jan 21 16:49:21 2016 -0600
> > > >> > > >
> > > >> > > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > > >> > > >
> > > >> > > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > > >> > > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > >> > > >
> > > >> > > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > > >> > > >
> > > >> > > > is reported to cause a resume-from-hibernation regression due to an attempt
> > > >> > > > to execute an NX page (we've seen quite a bit of that recently).
> > > >> > > >
> > > >> > > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > > >> > > > need to revert the above I'm afraid.
> > > >> >
> > > >> > So the bug is still there in 4.7 and it goes away after reverting the above
> > > >> > commit.  I guess I'll send a revert then.
> > > >>
> > > >> Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
> > > >> why this change causes a panic.  Is it really causing the panic or is it
> > > >> uncovering some other bug?
> > > >
> > > > It doesn't matter really.
> > > >
> > > > It surely interacts with something in a really odd way, but that only means
> > > > that its impact goes far beyond what was expected when it was applied.  Its
> > > > changelog is inadequate as a result and so on.
> > > >
> > > >> Maybe we should hold off on reverting until we understand the issue.
> > > >
> > > > Which very well may take forever.
> > > >
> > > > And AFAICS this is a fix for a theoretical issue and it *reliably* triggers a
> > > > very practical kernel panic for this particular reporter.  I'd rather live
> > > > with the theoretical issue unfixed to be honest.
> > > 
> > > Well, actually, the best part is that do_suspend_lowlevel() is not
> > > even called during hibernation or resume from it.  It only is called
> > > during suspend-to-RAM.
> > > 
> > > Question now is how the change made by the commit in question can
> > > affect hibernation which is an unrelated code path.  We know for a
> > > fact that it does affect it, but how?
> > 
> > Hm... I have a theory, but I'm not sure about it.  I noticed that
> > x86_acpi_enter_sleep_state(),
> 
> I think you mean x86_acpi_suspend_lowlevel().
> 
> > which is involved in suspend, overwrites
> > several global variables (e.g, initial_code) which are used by the CPU
> > boot code in head_64.S.  But surprisingly, it doesn't restore those
> > variables to their original values after it resumes.
> 
> Is the head_64.S code also used to bring up offline CPUs?
> 
> If not, then this is not the problem, because hibernation doesn't use it
> for the boot CPU anyway.
> 
> > So if a suspend and resume were done before the hibernate, those
> > variables would presumably have suspend-centric values, and the first
> > time a CPU is brought up during the hibernation restore operation, it
> > would jump to wakeup_long64() (the suspend resume function) instead of
> > start_secondary (which is the normal CPU boot function).
> > 
> > So, if true, that would explain why my patch triggers a bug:
> > wakeup_long64() always[*] jumps to .Lresume_point, which my patch
> > affected.  Because of the FRAME_END, it would pop an extra value off the
> > stack.  So when restore_processor_state() returns, it would return to
> > whatever random address is on the stack after the real RIP.  Which is
> > consistent with the oops from the bug.  It had a bad instruction
> > pointer, which looked like a stack address.
> 
> OK, so why doesn't it break resume from suspend to RAM?  wakeup_long64 is
> invoked by the CPU startup code then and doesn't the FRAME_END affect
> that too?

Ah, I see.  wakeup_long64 will restore RSP from saved_rsp and that points
to the right address already.  OK

Thanks,
Rafael
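
A minimal sketch of why restoring RSP from saved_rsp keeps the suspend-to-RAM path balanced. It is a toy model, not kernel code, and it assumes the stack pointer is recorded only after the frame-pointer push in do_suspend_lowlevel(); the constants and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 8
#define RETURN_ADDR 0x1111u   /* hypothetical return address of the caller */
#define CALLER_RBP  0x3333u   /* hypothetical frame pointer of the caller */

uint64_t stack_mem[STACK_WORDS];
int sp;         /* models %rsp as an index; the stack grows toward index 0 */
int saved_sp;   /* models saved_rsp */

/* Suspend side: call pushes the return address, FRAME_BEGIN pushes %rbp,
 * and only then is the stack pointer recorded (assumed ordering). */
void suspend_side(void)
{
    sp = STACK_WORDS;
    stack_mem[--sp] = RETURN_ADDR;
    stack_mem[--sp] = CALLER_RBP;
    saved_sp = sp;
}

/* Resume side: reload the stack pointer, then FRAME_END (pop %rbp) and ret.
 * Returns the address ret would jump to. */
uint64_t resume_side(void)
{
    sp = saved_sp;              /* wakeup_long64: restore RSP from saved_rsp */
    assert(sp + 1 < STACK_WORDS);
    sp++;                       /* FRAME_END: pop the saved %rbp */
    return stack_mem[sp++];     /* ret: pop the return address */
}
```

Because the recorded stack pointer already accounts for the pushed frame pointer, the pop in the resume epilogue is matched and ret sees the caller's return address, whatever the stack pointer held in between.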



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Rafael J. Wysocki
On Wednesday, July 27, 2016 12:59:18 PM Josh Poimboeuf wrote:
> On Wed, Jul 27, 2016 at 01:08:21AM +0200, Rafael J. Wysocki wrote:
> > On Wed, Jul 27, 2016 at 12:42 AM, Rafael J. Wysocki wrote:
> > > On Tuesday, July 26, 2016 04:53:19 PM Josh Poimboeuf wrote:
> > >> On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
> > >> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> > >> > > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > >> > > > Hi,
> > >> > > >
> > >> > > > The following commit:
> > >> > > >
> > >> > > > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > >> > > > Author: Josh Poimboeuf 
> > >> > > > Date:   Thu Jan 21 16:49:21 2016 -0600
> > >> > > >
> > >> > > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > >> > > >
> > >> > > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > >> > > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > >> > > >
> > >> > > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > >> > > >
> > >> > > > is reported to cause a resume-from-hibernation regression due to an attempt
> > >> > > > to execute an NX page (we've seen quite a bit of that recently).
> > >> > > >
> > >> > > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > >> > > > need to revert the above I'm afraid.
> > >> >
> > >> > So the bug is still there in 4.7 and it goes away after reverting the above
> > >> > commit.  I guess I'll send a revert then.
> > >>
> > >> Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
> > >> why this change causes a panic.  Is it really causing the panic or is it
> > >> uncovering some other bug?
> > >
> > > It doesn't matter really.
> > >
> > > It surely interacts with something in a really odd way, but that only means
> > > that its impact goes far beyond what was expected when it was applied.  Its
> > > changelog is inadequate as a result and so on.
> > >
> > >> Maybe we should hold off on reverting until we understand the issue.
> > >
> > > Which very well may take forever.
> > >
> > > And AFAICS this is a fix for a theoretical issue and it *reliably* triggers a
> > > very practical kernel panic for this particular reporter.  I'd rather live
> > > with the theoretical issue unfixed to be honest.
> > 
> > Well, actually, the best part is that do_suspend_lowlevel() is not
> > even called during hibernation or resume from it.  It only is called
> > during suspend-to-RAM.
> > 
> > Question now is how the change made by the commit in question can
> > affect hibernation which is an unrelated code path.  We know for a
> > fact that it does affect it, but how?
> 
> Hm... I have a theory, but I'm not sure about it.  I noticed that
> x86_acpi_enter_sleep_state(),

I think you mean x86_acpi_suspend_lowlevel().

> which is involved in suspend, overwrites
> several global variables (e.g, initial_code) which are used by the CPU
> boot code in head_64.S.  But surprisingly, it doesn't restore those
> variables to their original values after it resumes.

Is the head_64.S code also used to bring up offline CPUs?

If not, then this is not the problem, because hibernation doesn't use it
for the boot CPU anyway.

> So if a suspend and resume were done before the hibernate, those
> variables would presumably have suspend-centric values, and the first
> time a CPU is brought up during the hibernation restore operation, it
> would jump to wakeup_long64() (the suspend resume function) instead of
> start_secondary (which is the normal CPU boot function).
> 
> So, if true, that would explain why my patch triggers a bug:
> wakeup_long64() always[*] jumps to .Lresume_point, which my patch
> affected.  Because of the FRAME_END, it would pop an extra value off the
> stack.  So when restore_processor_state() returns, it would return to
> whatever random address is on the stack after the real RIP.  Which is
> consistent with the oops from the bug.  It had a bad instruction
> pointer, which looked like a stack address.

OK, so why doesn't it break resume from suspend to RAM?  wakeup_long64 is
invoked by the CPU startup code then and doesn't the FRAME_END affect
that too?

> But then again, maybe there's a hole in that theory, because how could
> hibernate after suspend/resume possibly even work today if the CPU boot
> goes to wakeup_long64() instead of start_secondary?

Right.

> So I could be missing something, or even completely off base.  But the
> missing restore of those variables does seem like a pretty huge
> oversight.  I wonder if the following patch would fix it?

We'll need to ask the reporter. :-)

> 
> diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
> index adb3eaf..cd76fc5 100644
> --- a/arch/x86/kernel/acpi/sleep.c
> +++ b/arch/x86/kernel/acpi/sleep.c
> @@ -45,6 

Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-27 Thread Josh Poimboeuf
On Wed, Jul 27, 2016 at 01:08:21AM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 27, 2016 at 12:42 AM, Rafael J. Wysocki wrote:
> > On Tuesday, July 26, 2016 04:53:19 PM Josh Poimboeuf wrote:
> >> On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
> >> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> >> > > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> >> > > > Hi,
> >> > > >
> >> > > > The following commit:
> >> > > >
> >> > > > commit 13523309495cdbd57a0d344c0d5d574987af007f
> >> > > > Author: Josh Poimboeuf 
> >> > > > Date:   Thu Jan 21 16:49:21 2016 -0600
> >> > > >
> >> > > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> >> > > >
> >> > > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> >> > > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> >> > > >
> >> > > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> >> > > >
> >> > > > is reported to cause a resume-from-hibernation regression due to an attempt
> >> > > > to execute an NX page (we've seen quite a bit of that recently).
> >> > > >
> >> > > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> >> > > > need to revert the above I'm afraid.
> >> >
> >> > So the bug is still there in 4.7 and it goes away after reverting the above
> >> > commit.  I guess I'll send a revert then.
> >>
> >> Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
> >> why this change causes a panic.  Is it really causing the panic or is it
> >> uncovering some other bug?
> >
> > It doesn't matter really.
> >
> > It surely interacts with something in a really odd way, but that only means
> > that its impact goes far beyond what was expected when it was applied.  Its
> > changelog is inadequate as a result and so on.
> >
> >> Maybe we should hold off on reverting until we understand the issue.
> >
> > Which very well may take forever.
> >
> > > And AFAICS this is a fix for a theoretical issue and it *reliably* triggers a
> > very practical kernel panic for this particular reporter.  I'd rather live
> > with the theoretical issue unfixed to be honest.
> 
> Well, actually, the best part is that do_suspend_lowlevel() is not
> even called during hibernation or resume from it.  It only is called
> during suspend-to-RAM.
> 
> Question now is how the change made by the commit in question can
> affect hibernation which is an unrelated code path.  We know for a
> fact that it does affect it, but how?

Hm... I have a theory, but I'm not sure about it.  I noticed that
x86_acpi_enter_sleep_state(), which is involved in suspend, overwrites
several global variables (e.g., initial_code) which are used by the CPU
boot code in head_64.S.  But surprisingly, it doesn't restore those
variables to their original values after it resumes.

So if a suspend and resume were done before the hibernate, those
variables would presumably have suspend-centric values, and the first
time a CPU is brought up during the hibernation restore operation, it
would jump to wakeup_long64() (the suspend resume function) instead of
start_secondary (which is the normal CPU boot function).
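
The overwrite-without-restore pattern in this theory can be sketched in plain userspace C. This is only an illustration of the hypothesis, not kernel code: the stub functions, the enum, and the function-pointer type are made-up stand-ins, and only the names initial_code, start_secondary and wakeup_long64 come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two entry points named in the thread. */
enum entry { ENTRY_START_SECONDARY, ENTRY_WAKEUP_LONG64 };

typedef enum entry (*entry_fn)(void);

/* Stand-ins for the real functions; they only report which one ran. */
static enum entry start_secondary_stub(void) { return ENTRY_START_SECONDARY; }
static enum entry wakeup_long64_stub(void)   { return ENTRY_WAKEUP_LONG64; }

/* Stand-in for the global that the CPU boot code jumps through. */
static entry_fn initial_code = start_secondary_stub;

/* Suspend path as described in the theory: overwrite, never restore. */
static void suspend_without_restore(void)
{
	initial_code = wakeup_long64_stub;
	/* ... the machine would sleep and wake up here ... */
}

/* Same path with a save/restore around the overwrite. */
static void suspend_with_restore(void)
{
	entry_fn prev_initial_code = initial_code;

	initial_code = wakeup_long64_stub;
	/* ... the machine would sleep and wake up here ... */
	initial_code = prev_initial_code;
}

/* What a later CPU bring-up would end up executing. */
static enum entry bring_up_cpu(void)
{
	assert(initial_code != NULL);
	return initial_code();
}
```

With these stand-ins, bring_up_cpu() still reports ENTRY_START_SECONDARY after suspend_with_restore(), but reports ENTRY_WAKEUP_LONG64 after suspend_without_restore(). Whether the real bring-up path ever reads such a stale value is exactly the open question in the theory.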

So, if true, that would explain why my patch triggers a bug:
wakeup_long64() always[*] jumps to .Lresume_point, which my patch
affected.  Because of the FRAME_END, it would pop an extra value off the
stack.  So when restore_processor_state() returns, it would return to
whatever random address is on the stack after the real RIP.  Which is
consistent with the oops from the bug.  It had a bad instruction
pointer, which looked like a stack address.
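
The "extra pop" effect can be shown with a toy stack model in userspace C. This is a sketch under stated assumptions, not the real wakeup_64.S code: the stack layout, the three constants, and all function names are hypothetical, and the model only assumes that FRAME_END amounts to one extra pop before the return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 8

/* Made-up values, chosen only so the three slots are easy to tell apart. */
#define RETURN_ADDRESS   UINT64_C(0xffffffff81000123)
#define SAVED_RBP        UINT64_C(0xffff880012345000)
#define STACK_LIKE_VALUE UINT64_C(0xffff880012345678)

struct toy_stack {
	uint64_t slot[STACK_WORDS];
	size_t sp;	/* index of the current top; the stack grows downward */
};

static void push(struct toy_stack *s, uint64_t v)
{
	assert(s->sp > 0);
	s->slot[--s->sp] = v;
}

static uint64_t pop(struct toy_stack *s)
{
	assert(s->sp < STACK_WORDS);
	return s->slot[s->sp++];
}

/*
 * Build the stack as the caller left it, then run the epilogue.
 * frame_pushed: the saved stack includes a pushed frame pointer.
 * frame_end:    the resume path pops a frame pointer before returning.
 * Returns the address that the final "ret" would jump to.
 */
static uint64_t resume_target(int frame_pushed, int frame_end)
{
	struct toy_stack s = { .sp = STACK_WORDS };

	push(&s, STACK_LIKE_VALUE);	/* unrelated data above the return address */
	push(&s, RETURN_ADDRESS);	/* pushed by the call instruction */
	if (frame_pushed)
		push(&s, SAVED_RBP);	/* frame setup: push the frame pointer */

	if (frame_end)
		(void)pop(&s);		/* frame teardown: pop the frame pointer */
	return pop(&s);			/* ret: pop the return address */
}
```

When push and pop are balanced, resume_target(1, 1) and resume_target(0, 0) both yield RETURN_ADDRESS. In the mismatched case, resume_target(0, 1), the return address is consumed as if it were the frame pointer and the "ret" picks up STACK_LIKE_VALUE instead, which matches the shape of the reported oops.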

But then again, maybe there's a hole in that theory, because how could
hibernate after suspend/resume possibly even work today if the CPU boot
goes to wakeup_long64() instead of start_secondary?

So I could be missing something, or even completely off base.  But the
missing restore of those variables does seem like a pretty huge
oversight.  I wonder if the following patch would fix it?


diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index adb3eaf..cd76fc5 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -45,6 +45,12 @@ acpi_status asmlinkage __visible x86_acpi_enter_sleep_state(u8 state)
  */
 int x86_acpi_suspend_lowlevel(void)
 {
+#ifdef CONFIG_64BIT
+   unsigned long prev_initial_code;
+#ifdef CONFIG_SMP
+   unsigned long prev_stack_start, prev_gdt_address, prev_initial_gs;
+#endif
+#endif
    struct wakeup_header *header =
        (struct wakeup_header *) __va(real_mode_header->wakeup_header);
 
@@ -99,13 +105,18 @@ int x86_acpi_suspend_lowlevel(void)
    saved_magic = 0x12345678;
 #else /* CONFIG_64BIT */
 #ifdef CONFIG_SMP
+   prev_stack_start = stack_start;
+   prev_gdt_address = early_gdt_descr.address;
+   prev_initial_gs = initial_gs;
+

Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Borislav Petkov
On Tue, Jul 26, 2016 at 02:17:29PM -0700, Thomas Garnier wrote:
> I am sorry, there has been parallel work between KASLR memory
> randomization and hibernation support. That's why hibernation was not
> tested, it was not supported when the feature was created.
> Communication will be better next time.
> 
> I will work on identifying the problem and pushing a fix.

Would you please do me a favor and stop top-posting. It really disrupts
reading the thread.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
--


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Wed, Jul 27, 2016 at 12:42 AM, Rafael J. Wysocki  wrote:
> On Tuesday, July 26, 2016 04:53:19 PM Josh Poimboeuf wrote:
>> On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
>> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
>> > > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>> > > > Hi,
>> > > >
>> > > > The following commit:
>> > > >
>> > > > commit 13523309495cdbd57a0d344c0d5d574987af007f
>> > > > Author: Josh Poimboeuf 
>> > > > Date:   Thu Jan 21 16:49:21 2016 -0600
>> > > >
>> > > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>> > > >
>> > > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>> > > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>> > > >
>> > > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>> > > >
>> > > > is reported to cause a resume-from-hibernation regression due to an attempt
>> > > > to execute an NX page (we've seen quite a bit of that recently).
>> > > >
>> > > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
>> > > > need to revert the above I'm afraid.
>> >
>> > So the bug is still there in 4.7 and it goes away after reverting the above
>> > commit.  I guess I'll send a revert then.
>>
>> Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
>> why this change causes a panic.  Is it really causing the panic or is it
>> uncovering some other bug?
>
> It doesn't matter really.
>
> It surely interacts with something in a really odd way, but that only means
> that its impact goes far beyond what was expected when it was applied.  Its
> changelog is inadequate as a result and so on.
>
>> Maybe we should hold off on reverting until we understand the issue.
>
> Which very well may take forever.
>
> And AFAICS this is a fix for a theoretical issue and it *reliably* triggers a
> very practical kernel panic for this particular reporter.  I'd rather live
> with the theoretical issue unfixed to be honest.

Well, actually, the best part is that do_suspend_lowlevel() is not
even called during hibernation or resume from it.  It only is called
during suspend-to-RAM.

Question now is how the change made by the commit in question can
affect hibernation which is an unrelated code path.  We know for a
fact that it does affect it, but how?

Thanks,
Rafael


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Tuesday, July 26, 2016 04:53:19 PM Josh Poimboeuf wrote:
> On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> > > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > > > Hi,
> > > > 
> > > > The following commit:
> > > > 
> > > > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > > > Author: Josh Poimboeuf 
> > > > Date:   Thu Jan 21 16:49:21 2016 -0600
> > > > 
> > > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > > > 
> > > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > > 
> > > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > > > 
> > > > > is reported to cause a resume-from-hibernation regression due to an attempt
> > > > > to execute an NX page (we've seen quite a bit of that recently).
> > > > >
> > > > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > > > > need to revert the above I'm afraid.
> > 
> > So the bug is still there in 4.7 and it goes away after reverting the above
> > commit.  I guess I'll send a revert then.
> 
> Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
> why this change causes a panic.  Is it really causing the panic or is it
> uncovering some other bug?

It doesn't matter really.

It surely interacts with something in a really odd way, but that only means
that its impact goes far beyond what was expected when it was applied.  Its
changelog is inadequate as a result and so on.

> Maybe we should hold off on reverting until we understand the issue.

Which very well may take forever.

And AFAICS this is a fix for a theoretical issue and it *reliably* triggers a
very practical kernel panic for this particular reporter.  I'd rather live
with the theoretical issue unfixed to be honest.

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 10:15:39PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> > On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > > Hi,
> > > 
> > > The following commit:
> > > 
> > > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > > Author: Josh Poimboeuf 
> > > Date:   Thu Jan 21 16:49:21 2016 -0600
> > > 
> > > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > > 
> > > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > > 
> > > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > > 
> > > is reported to cause a resume-from-hibernation regression due to an attempt
> > > to execute an NX page (we've seen quite a bit of that recently).
> > >
> > > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > > need to revert the above I'm afraid.
> 
> So the bug is still there in 4.7 and it goes away after reverting the above
> commit.  I guess I'll send a revert then.

Hm, the code in wakeup_64.S seems quite magical, but I can't figure out
why this change causes a panic.  Is it really causing the panic or is it
uncovering some other bug?  Maybe we should hold off on reverting until
we understand the issue.

-- 
Josh


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Thomas Garnier
I am sorry, there has been parallel work between KASLR memory
randomization and hibernation support. That's why hibernation was not
tested: it was not supported when the feature was created.
Communication will be better next time.

I will work on identifying the problem and pushing a fix.

Thanks for the feedback and pointer,

On Tue, Jul 26, 2016 at 1:59 PM, Kees Cook  wrote:
> On Tue, Jul 26, 2016 at 1:53 PM, Rafael J. Wysocki  wrote:
>> On Tuesday, July 26, 2016 01:33:02 PM Kees Cook wrote:
>>> On Tue, Jul 26, 2016 at 1:24 PM, Rafael J. Wysocki wrote:
>>> > On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
>>> >> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>>> >> > Hi,
>>> >> >
>>> >> > The following commit:
>>> >> >
>>> >> > commit 13523309495cdbd57a0d344c0d5d574987af007f
>>> >> > Author: Josh Poimboeuf 
>>> >> > Date:   Thu Jan 21 16:49:21 2016 -0600
>>> >> >
>>> >> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>>> >> >
>>> >> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>>> >> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>>> >> >
>>> >> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>>> >> >
>>> >> > is reported to cause a resume-from-hibernation regression due to an attempt
>>> >> > to execute an NX page (we've seen quite a bit of that recently).
>>> >> >
>>> >> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
>>> >> > need to revert the above I'm afraid.
>>> >>
>>> >> So I can't resume properly from disk too, on the Intel laptop this time. Top
>>> >> commit is from tip/master:
>>> >>
>>> >> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
>>> >> Merge: a4823bbffc96 dd9506954539
>>> >> Author: Ingo Molnar 
>>> >> Date:   Mon Jul 25 08:39:43 2016 +0200
>>> >>
>>> >> Merge branch 'linus'
>>> >>
>>> >>
>>> >> So I thought it might be Josh's patch above and reverted it. No joy.
>>> >>
>>> >> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
>>> >> microcode loader breakage which we've been debugging. Turned that off
>>> >> and machine resumes fine again.
>>> >
>>> > Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
>>> > was no hope it would not break hibernation if you asked me.
>>> >
>>> >> It looks like
>>> >>
>>> >>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
>>> >>
>>> >> broke a bunch of things. Off the top of my head, we probably should make
>>> >> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
>>> >> was the case with ASLR previously, AFAIR.
>>> >
>>> > Please no.
>>> >
>>> > First off, it should be perfectly possible to make hibernation work along
>>> > with this new variant of ASLR.  Second, quite obviously, the author of these
>>> > ASLR changes had not done sufficient research to estimate the possible
>>> > impact of them.
>>>
>>> I think that's a bit unfair: Thomas did a lot of testing, and it has
>>> been living in -next for a while.
>>
>> Well, with all due respect, "a lot of testing" is not quite the same thing as
>> "sufficient research" IMO.
>>
>> It should be known (at least from experience) that hibernation on x86-64 doesn't
>> play well with ASLR quite as a rule, so it would be good to at least check that
>> particular thing or CC a relevant person (i.e. me).
>
> Fair enough: we need to practice considering a wider usage model.
>
>> Or even ask me on IRC for that matter.  Give me a heads up ahead of time.
>>
>> But no.  I'm still on the receiving end of the "hibernation doesn't work with
>> ASLR" story which was entirely avoidable this time around.  Sigh.
>
> I'll be sure to keep you in the loop for future x86 KASLR changes;
> sorry for the new pain. :(
>
>>> > Honestly, I don't think it is a good idea to introduce random Kconfig options
>>> > for working around cases in which the author of some changes cannot be bothered
>>> > with doing things right.  Even if that is security.
>>>
>>> I would agree: let's try to get this fixed soon.
>>>
>>> > So IMO, either we should fix the problem, or that whole new ASLR stuff should
>>> > be reverted.
>>> >
>>> > I think I know how to fix it, but I won't be able to get to that before the
>>> > next week.  I guess it can wait till then, though.
>>>
>>> Thomas, will you have some time to examine this and estimate the work for a fix?
>>
>> FWIW, my hunch ATM is that you need to look at the "Set up the direct mapping
>> from scratch" loop in set_up_temporary_mappings() and make it do the right
>> thing when the new ASLR stuff is enabled.
>
> Thanks for the pointer!
>
> -Kees
>
> --
> Kees Cook
> Chrome OS & Brillo Security


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Thomas Garnier
I am sorry, there has been parallel work between KASLR memory
randomization and hibernation support. That's why hibernation was not
tested, it was not supported when the feature was created.
Communication will be better next time.

I will work on identifying the problem and pushing a fix.

Thanks for the feedback and pointer,

On Tue, Jul 26, 2016 at 1:59 PM, Kees Cook  wrote:
> On Tue, Jul 26, 2016 at 1:53 PM, Rafael J. Wysocki  wrote:
>> On Tuesday, July 26, 2016 01:33:02 PM Kees Cook wrote:
>>> On Tue, Jul 26, 2016 at 1:24 PM, Rafael J. Wysocki  
>>> wrote:
>>> > On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
>>> >> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>>> >> > Hi,
>>> >> >
>>> >> > The following commit:
>>> >> >
>>> >> > commit 13523309495cdbd57a0d344c0d5d574987af007f
>>> >> > Author: Josh Poimboeuf 
>>> >> > Date:   Thu Jan 21 16:49:21 2016 -0600
>>> >> >
>>> >> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>>> >> >
>>> >> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>>> >> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>>> >> >
>>> >> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>>> >> >
>>> >> > is reported to cause a resume-from-hibernation regression due to an 
>>> >> > attempt
>>> >> > to execute an NX page (we've seen quite a bit of that recently).
>>> >> >
>>> >> > I'm asking the reporter to try 4.7, but if the problem is still there, 
>>> >> > we'll
>>> >> > need to revert the above I'm afraid.
>>> >>
>>> >> So I can't resume properly from disk too, on the Intel laptop this time. 
>>> >> Top
>>> >> commit is from tip/master:
>>> >>
>>> >> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
>>> >> Merge: a4823bbffc96 dd9506954539
>>> >> Author: Ingo Molnar 
>>> >> Date:   Mon Jul 25 08:39:43 2016 +0200
>>> >>
>>> >> Merge branch 'linus'
>>> >>
>>> >>
>>> >> So I thought it might be Josh's patch above and reverted it. No joy.
>>> >>
>>> >> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
>>> >> microcode loader breakage which we've been debugging. Turned that off
>>> >> and machine resumes fine again.
>>> >
>>> > Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
>>> > was no hope it would not break hibernation if you asked me.
>>> >
>>> >> It looks like
>>> >>
>>> >>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
>>> >>
>>> >> broke a bunch of things. Off the top of my head, we probably should make
>>> >> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
>>> >> was the case with ASLR previously, AFAIR.
>>> >
>>> > Please no.
>>> >
>>> > First off, it should be perfectly possible to make hibernation work along
>>> > with this new variant of ASLR.  Second, quite obviously, the author of these
>>> > ASLR changes had not done sufficient research to estimate the possible
>>> > impact of them.
>>>
>>> I think that's a bit unfair: Thomas did a lot of testing, and it has
>>> been living in -next for a while.
>>
>> Well, with all due respect, "a lot of testing" is not quite the same thing as
>> "sufficient research" IMO.
>>
>> It should be known (at least from experience) that hibernation on x86-64 doesn't
>> play well with ASLR quite as a rule, so it would be good to at least check that
>> particular thing or CC a relevant person (ie. me).
>
> Fair enough: we need to practice considering a wider usage model.
>
>> Or even ask me on IRC for that matter.  Give me a heads up ahead of time.
>>
>> But no.  I'm still on the receiving end of the "hibernation doesn't work with
>> ASLR" story which was entirely avoidable this time around.  Sigh.
>
> I'll be sure to keep you in the loop for future x86 KASLR changes;
> sorry for the new pain. :(
>
>>> > Honestly, I don't think it is a good idea to introduce random Kconfig options
>>> > for working around cases in which the author of some changes cannot be bothered
>>> > with doing things right.  Even if that is security.
>>>
>>> I would agree: let's try to get this fixed soon.
>>>
>>> > So IMO, either we should fix the problem, or that whole new ASLR stuff should
>>> > be reverted.
>>> >
>>> > I think I know how to fix it, but I won't be able to get to that before the
>>> > next week.  I guess it can wait till then, though.
>>>
>>> Thomas, will you have some time to examine this and estimate the work for a fix?
>>
>> FWIW, my hunch ATM is that you need to look at the "Set up the direct mapping
>> from scratch" loop in set_up_temporary_mappings() and make it do the right
>> thing when the new ASLR stuff is enabled.
>
> Thanks for the pointer!
>
> -Kees
>
> --
> Kees Cook
> Chrome OS & Brillo Security


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Kees Cook
On Tue, Jul 26, 2016 at 1:53 PM, Rafael J. Wysocki  wrote:
> On Tuesday, July 26, 2016 01:33:02 PM Kees Cook wrote:
>> On Tue, Jul 26, 2016 at 1:24 PM, Rafael J. Wysocki  wrote:
>> > On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
>> >> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>> >> > Hi,
>> >> >
>> >> > The following commit:
>> >> >
>> >> > commit 13523309495cdbd57a0d344c0d5d574987af007f
>> >> > Author: Josh Poimboeuf 
>> >> > Date:   Thu Jan 21 16:49:21 2016 -0600
>> >> >
>> >> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>> >> >
>> >> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>> >> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>> >> >
>> >> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>> >> >
>> >> > is reported to cause a resume-from-hibernation regression due to an attempt
>> >> > to execute an NX page (we've seen quite a bit of that recently).
>> >> >
>> >> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
>> >> > need to revert the above I'm afraid.
>> >>
>> >> So I can't resume properly from disk too, on the Intel laptop this time. Top
>> >> commit is from tip/master:
>> >>
>> >> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
>> >> Merge: a4823bbffc96 dd9506954539
>> >> Author: Ingo Molnar 
>> >> Date:   Mon Jul 25 08:39:43 2016 +0200
>> >>
>> >> Merge branch 'linus'
>> >>
>> >>
>> >> So I thought it might be Josh's patch above and reverted it. No joy.
>> >>
>> >> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
>> >> microcode loader breakage which we've been debugging. Turned that off
>> >> and machine resumes fine again.
>> >
>> > Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
>> > was no hope it would not break hibernation if you asked me.
>> >
>> >> It looks like
>> >>
>> >>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
>> >>
>> >> broke a bunch of things. Off the top of my head, we probably should make
>> >> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
>> >> was the case with ASLR previously, AFAIR.
>> >
>> > Please no.
>> >
>> > First off, it should be perfectly possible to make hibernation work along
>> > with this new variant of ASLR.  Second, quite obviously, the author of these
>> > ASLR changes had not done sufficient research to estimate the possible
>> > impact of them.
>>
>> I think that's a bit unfair: Thomas did a lot of testing, and it has
>> been living in -next for a while.
>
> Well, with all due respect, "a lot of testing" is not quite the same thing as
> "sufficient research" IMO.
>
> It should be known (at least from experience) that hibernation on x86-64 doesn't
> play well with ASLR quite as a rule, so it would be good to at least check that
> particular thing or CC a relevant person (ie. me).

Fair enough: we need to practice considering a wider usage model.

> Or even ask me on IRC for that matter.  Give me a heads up ahead of time.
>
> But no.  I'm still on the receiving end of the "hibernation doesn't work with
> ASLR" story which was entirely avoidable this time around.  Sigh.

I'll be sure to keep you in the loop for future x86 KASLR changes;
sorry for the new pain. :(

>> > Honestly, I don't think it is a good idea to introduce random Kconfig options
>> > for working around cases in which the author of some changes cannot be bothered
>> > with doing things right.  Even if that is security.
>>
>> I would agree: let's try to get this fixed soon.
>>
>> > So IMO, either we should fix the problem, or that whole new ASLR stuff should
>> > be reverted.
>> >
>> > I think I know how to fix it, but I won't be able to get to that before the
>> > next week.  I guess it can wait till then, though.
>>
>> Thomas, will you have some time to examine this and estimate the work for a fix?
>
> FWIW, my hunch ATM is that you need to look at the "Set up the direct mapping
> from scratch" loop in set_up_temporary_mappings() and make it do the right
> thing when the new ASLR stuff is enabled.

Thanks for the pointer!

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Tuesday, July 26, 2016 01:33:02 PM Kees Cook wrote:
> On Tue, Jul 26, 2016 at 1:24 PM, Rafael J. Wysocki  wrote:
> > On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
> >> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> >> > Hi,
> >> >
> >> > The following commit:
> >> >
> >> > commit 13523309495cdbd57a0d344c0d5d574987af007f
> >> > Author: Josh Poimboeuf 
> >> > Date:   Thu Jan 21 16:49:21 2016 -0600
> >> >
> >> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> >> >
> >> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> >> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> >> >
> >> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> >> >
> >> > is reported to cause a resume-from-hibernation regression due to an attempt
> >> > to execute an NX page (we've seen quite a bit of that recently).
> >> >
> >> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> >> > need to revert the above I'm afraid.
> >>
> >> So I can't resume properly from disk too, on the Intel laptop this time. Top
> >> commit is from tip/master:
> >>
> >> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
> >> Merge: a4823bbffc96 dd9506954539
> >> Author: Ingo Molnar 
> >> Date:   Mon Jul 25 08:39:43 2016 +0200
> >>
> >> Merge branch 'linus'
> >>
> >>
> >> So I thought it might be Josh's patch above and reverted it. No joy.
> >>
> >> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
> >> microcode loader breakage which we've been debugging. Turned that off
> >> and machine resumes fine again.
> >
> > Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
> > was no hope it would not break hibernation if you asked me.
> >
> >> It looks like
> >>
> >>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
> >>
> >> broke a bunch of things. Off the top of my head, we probably should make
> >> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
> >> was the case with ASLR previously, AFAIR.
> >
> > Please no.
> >
> > First off, it should be perfectly possible to make hibernation work along
> > with this new variant of ASLR.  Second, quite obviously, the author of these
> > ASLR changes had not done sufficient research to estimate the possible
> > impact of them.
> 
> I think that's a bit unfair: Thomas did a lot of testing, and it has
> been living in -next for a while.

Well, with all due respect, "a lot of testing" is not quite the same thing as
"sufficient research" IMO.

It should be known (at least from experience) that hibernation on x86-64 doesn't
play well with ASLR quite as a rule, so it would be good to at least check that
particular thing or CC a relevant person (ie. me).

Or even ask me on IRC for that matter.  Give me a heads up ahead of time.

But no.  I'm still on the receiving end of the "hibernation doesn't work with
ASLR" story which was entirely avoidable this time around.  Sigh.

> > Honestly, I don't think it is a good idea to introduce random Kconfig options
> > for working around cases in which the author of some changes cannot be bothered
> > with doing things right.  Even if that is security.
> 
> I would agree: let's try to get this fixed soon.
> 
> > So IMO, either we should fix the problem, or that whole new ASLR stuff should
> > be reverted.
> >
> > I think I know how to fix it, but I won't be able to get to that before the
> > next week.  I guess it can wait till then, though.
> 
> Thomas, will you have some time to examine this and estimate the work for a fix?

FWIW, my hunch ATM is that you need to look at the "Set up the direct mapping
from scratch" loop in set_up_temporary_mappings() and make it do the right
thing when the new ASLR stuff is enabled.

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Tuesday, July 26, 2016 01:31:00 PM Kees Cook wrote:
> On Tue, Jul 26, 2016 at 1:15 PM, Rafael J. Wysocki  wrote:
> > On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> >> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> >> > Hi,
> >> >
> >> > The following commit:
> >> >
> >> > commit 13523309495cdbd57a0d344c0d5d574987af007f
> >> > Author: Josh Poimboeuf 
> >> > Date:   Thu Jan 21 16:49:21 2016 -0600
> >> >
> >> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> >> >
> >> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> >> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> >> >
> >> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> >> >
> >> > is reported to cause a resume-from-hibernation regression due to an attempt
> >> > to execute an NX page (we've seen quite a bit of that recently).
> >> >
> >> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> >> > need to revert the above I'm afraid.
> >
> > So the bug is still there in 4.7 and it goes away after reverting the above
> > commit.  I guess I'll send a revert then.
> 
> To make sure I understand:
> 
> There are two separate bugs here that break hibernation?

Yes, there are.

The first one is the BZ 150021 as reported here.

The second one is the clash with new ASLR-related changes as reported by Boris.

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Kees Cook
On Tue, Jul 26, 2016 at 1:24 PM, Rafael J. Wysocki  wrote:
> On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
>> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>> > Hi,
>> >
>> > The following commit:
>> >
>> > commit 13523309495cdbd57a0d344c0d5d574987af007f
>> > Author: Josh Poimboeuf 
>> > Date:   Thu Jan 21 16:49:21 2016 -0600
>> >
>> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>> >
>> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>> >
>> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>> >
>> > is reported to cause a resume-from-hibernation regression due to an attempt
>> > to execute an NX page (we've seen quite a bit of that recently).
>> >
>> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
>> > need to revert the above I'm afraid.
>>
>> So I can't resume properly from disk too, on the Intel laptop this time. Top
>> commit is from tip/master:
>>
>> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
>> Merge: a4823bbffc96 dd9506954539
>> Author: Ingo Molnar 
>> Date:   Mon Jul 25 08:39:43 2016 +0200
>>
>> Merge branch 'linus'
>>
>>
>> So I thought it might be Josh's patch above and reverted it. No joy.
>>
>> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
>> microcode loader breakage which we've been debugging. Turned that off
>> and machine resumes fine again.
>
> Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
> was no hope it would not break hibernation if you asked me.
>
>> It looks like
>>
>>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
>>
>> broke a bunch of things. Off the top of my head, we probably should make
>> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
>> was the case with ASLR previously, AFAIR.
>
> Please no.
>
> First off, it should be perfectly possible to make hibernation work along
> with this new variant of ASLR.  Second, quite obviously, the author of these
> ASLR changes had not done sufficient research to estimate the possible
> impact of them.

I think that's a bit unfair: Thomas did a lot of testing, and it has
been living in -next for a while.

> Honestly, I don't think it is a good idea to introduce random Kconfig options
> for working around cases in which the author of some changes cannot be bothered
> with doing things right.  Even if that is security.

I would agree: let's try to get this fixed soon.

> So IMO, either we should fix the problem, or that whole new ASLR stuff should
> be reverted.
>
> I think I know how to fix it, but I won't be able to get to that before the
> next week.  I guess it can wait till then, though.

Thomas, will you have some time to examine this and estimate the work for a fix?

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Kees Cook
On Tue, Jul 26, 2016 at 1:15 PM, Rafael J. Wysocki  wrote:
> On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
>> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
>> > Hi,
>> >
>> > The following commit:
>> >
>> > commit 13523309495cdbd57a0d344c0d5d574987af007f
>> > Author: Josh Poimboeuf 
>> > Date:   Thu Jan 21 16:49:21 2016 -0600
>> >
>> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
>> >
>> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
>> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
>> >
>> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
>> >
>> > is reported to cause a resume-from-hibernation regression due to an attempt
>> > to execute an NX page (we've seen quite a bit of that recently).
>> >
>> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
>> > need to revert the above I'm afraid.
>
> So the bug is still there in 4.7 and it goes away after reverting the above
> commit.  I guess I'll send a revert then.

To make sure I understand:

There are two separate bugs here that break hibernation?

>> Hi Rafael,
>>
>> Is the oops output available somewhere?
>
> Not yet, but I've just asked the reporter to attach it to the $subject BZ entry.
>
> In any case, this is nasty stuff and the reason why we made that change was
> relatively weak IMO.
>
> Thanks,
> Rafael
>

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Tuesday, July 26, 2016 04:04:42 PM Borislav Petkov wrote:
> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > Hi,
> > 
> > The following commit:
> > 
> > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > Author: Josh Poimboeuf 
> > Date:   Thu Jan 21 16:49:21 2016 -0600
> > 
> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > 
> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > 
> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > 
> > is reported to cause a resume-from-hibernation regression due to an attempt
> > to execute an NX page (we've seen quite a bit of that recently).
> > 
> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > need to revert the above I'm afraid.
> 
> So I can't resume properly from disk too, on the Intel laptop this time. Top
> commit is from tip/master:
> 
> commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
> Merge: a4823bbffc96 dd9506954539
> Author: Ingo Molnar 
> Date:   Mon Jul 25 08:39:43 2016 +0200
> 
> Merge branch 'linus'
> 
> 
> So I thought it might be Josh's patch above and reverted it. No joy.
> 
> Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
> microcode loader breakage which we've been debugging. Turned that off
> and machine resumes fine again.

Well, I wasn't aware of *another* flavor of ASLR in the works.  And there
was no hope it would not break hibernation if you asked me.

> It looks like
> 
>   0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
> 
> broke a bunch of things. Off the top of my head, we probably should make
> suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
> was the case with ASLR previously, AFAIR.

Please no.

First off, it should be perfectly possible to make hibernation work along
with this new variant of ASLR.  Second, quite obviously, the author of these
ASLR changes had not done sufficient research to estimate the possible
impact of them.

Honestly, I don't think it is a good idea to introduce random Kconfig options
for working around cases in which the author of some changes cannot be bothered
with doing things right.  Even if that is security.

So IMO, either we should fix the problem, or that whole new ASLR stuff should
be reverted.

I think I know how to fix it, but I won't be able to get to that before the
next week.  I guess it can wait till then, though.

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Rafael J. Wysocki
On Tuesday, July 26, 2016 09:39:05 AM Josh Poimboeuf wrote:
> On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> > Hi,
> > 
> > The following commit:
> > 
> > commit 13523309495cdbd57a0d344c0d5d574987af007f
> > Author: Josh Poimboeuf 
> > Date:   Thu Jan 21 16:49:21 2016 -0600
> > 
> > x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> > 
> > do_suspend_lowlevel() is a callable non-leaf function which doesn't
> > honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> > 
> > Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> > 
> > is reported to cause a resume-from-hibernation regression due to an attempt
> > to execute an NX page (we've seen quite a bit of that recently).
> > 
> > I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> > need to revert the above I'm afraid.

So the bug is still there in 4.7 and it goes away after reverting the above
commit.  I guess I'll send a revert then.

> Hi Rafael,
> 
> Is the oops output available somewhere?

Not yet, but I've just asked the reporter to attach it to the $subject BZ entry.

In any case, this is nasty stuff and the reason why we made that change was
relatively weak IMO.

Thanks,
Rafael



Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> Hi,
> 
> The following commit:
> 
> commit 13523309495cdbd57a0d344c0d5d574987af007f
> Author: Josh Poimboeuf 
> Date:   Thu Jan 21 16:49:21 2016 -0600
> 
> x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> 
> do_suspend_lowlevel() is a callable non-leaf function which doesn't
> honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> 
> Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> 
> is reported to cause a resume-from-hibernation regression due to an attempt
> to execute an NX page (we've seen quite a bit of that recently).
> 
> I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> need to revert the above I'm afraid.

Hi Rafael,

Is the oops output available somewhere?

-- 
Josh


Re: Fwd: [Bug 150021] New: kernel panic: "kernel tried to execute NX-protected page" when resuming from hibernate to disk

2016-07-26 Thread Borislav Petkov
On Tue, Jul 26, 2016 at 01:32:28PM +0200, Rafael J. Wysocki wrote:
> Hi,
> 
> The following commit:
> 
> commit 13523309495cdbd57a0d344c0d5d574987af007f
> Author: Josh Poimboeuf 
> Date:   Thu Jan 21 16:49:21 2016 -0600
> 
> x86/asm/acpi: Create a stack frame in do_suspend_lowlevel()
> 
> do_suspend_lowlevel() is a callable non-leaf function which doesn't
> honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
> 
> Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
> 
> is reported to cause a resume-from-hibernation regression due to an attempt
> to execute an NX page (we've seen quite a bit of that recently).
> 
> I'm asking the reporter to try 4.7, but if the problem is still there, we'll
> need to revert the above I'm afraid.

So I can't resume properly from disk too, on the Intel laptop this time. Top
commit is from tip/master:

commit 516f48acf59722429acd323b3d283f74f02891fe (refs/remotes/tip/master)
Merge: a4823bbffc96 dd9506954539
Author: Ingo Molnar 
Date:   Mon Jul 25 08:39:43 2016 +0200

Merge branch 'linus'


So I thought it might be Josh's patch above and reverted it. No joy.

Then I remembered that I enabled CONFIG_RANDOMIZE_MEMORY for the
microcode loader breakage which we've been debugging. Turned that off
and machine resumes fine again.

It looks like

  0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")

broke a bunch of things. Off the top of my head, we probably should make
suspend to disk and CONFIG_RANDOMIZE_MEMORY mutually exclusive, like it
was the case with ASLR previously, AFAIR.

Adding more people to CC and leaving in the rest for reference.

> Date: Mon, 25 Jul 2016 21:16:29 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: r...@rjwysocki.net
> Subject: [Bug 150021] New: kernel panic: "kernel tried to execute
>  NX-protected page" when resuming from hibernate to disk
> Message-ID: 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=150021
> 
>         Bug ID: 150021
>        Summary: kernel panic: "kernel tried to execute NX-protected
>                 page" when resuming from hibernate to disk
>        Product: Power Management
>        Version: 2.5
> Kernel Version: 4.6.x
>       Hardware: All
>             OS: Linux
>           Tree: Mainline
>         Status: NEW
>       Severity: normal
>       Priority: P1
>      Component: Hibernation/Suspend
>       Assignee: r...@rjwysocki.net
>       Reporter: shuz...@mailbox.org
>     Regression: No
> 
> Created attachment 226381
>   --> https://bugzilla.kernel.org/attachment.cgi?id=226381=edit
> last working .config
> 
> Overview: 
> 
> When commit 13523309495cdbd57a0d344c0d5d574987af007f is applied to my kernel
> sources my kernel panics when trying to resume from hibernate to disk.
> 
> 
> Steps to Reproduce: 
> 
> 1. have a working hibernate/resume setup
> 2. compile 4.6.x kernel
> 3. boot and hibernate to disk
> 4. test various kernels using "git bisect".
> 
> 
> Actual Results: kernel panics when trying to resume from hibernate to disk.
> 
> Expected Results: Resume from hibernate to disk like kernels without commit
> 13523309495cdbd57a0d344c0d5d574987af007f did.
> 
> 
> I attached my working .config of my 4.5.7 kernel.
> 
> Any help will be appreciated. Thanks!
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.


-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
--

