Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-20 Thread Alex G.


On 04/20/2018 02:27 AM, James Morse wrote:
> Hi Alex,
> 
> On 04/16/2018 10:59 PM, Alex G. wrote:
>> On 04/13/2018 11:38 AM, James Morse wrote:
>>> This assumes a cache-invalidate will clear the error, which I don't
> think we're
>>> guaranteed on arm.
>>> It also destroys any adjacent data, "everyone's happy" includes the
> thread that
>>> got a chunk of someone-else's stack frame, I don't think it will be
> happy for
>>> very long!
>>
>> Hmm, no cache-line (or page) invalidation on arm64? How does
>> dma_map/unmap_*() work then? You may not guarantee to fix the error, but
> 
> There are cache-invalidate instructions, but I don't think 'solving' a
> RAS error with them is the right thing to do.

You seem to be putting RAS on a pedestal in a very cloudy and foggy day.
I admit that I fail to see the specialness of RAS in comparison to other
errors.

>> I don't buy into the "let's crash without trying" argument.
> 
> Our 'cache writeback granule' may be as large as 2K, so we may have to
> invalidate up to 2K of data to convince the hardware this address is
> okay again.

Eureka! OS can invalidate the entire page. 1:1 mapping with the memory
management data.

> All we've done here is differently-corrupt the data so that it no longer
> generates a RAS fault, it just gives you the wrong data instead.
> Cache-invalidation is destructive.
> 
> I don't think there is a one-size-fits-all solution here.

Of course there isn't. That's not the issue.

A cache corruption is a special case of a memory access issue, and that,
we already know how to handle. Triple-fault and cpu-on-fire concerns
apply wrt returning to the context which triggered the problem. We've
already figured that out.

There is a lot of opportunity here for using well tested code paths and
not crashing on first go. Why let firmware make this a problem again?

>>> (this is a side issue for AER though)
>>
>> Somebody muddled up AER with these tables, so we now have to worry about
>> it. :)
> 
> Eh? I see there is a v2, maybe I'll understand this comment once I read it.

I meant that somebody (the spec writers) decided to put ominous errors
(PCIe) on the same severity scale with "cpu is on fire" errors.

 How does FFS handle race conditions that can occur when accessing HW
 concurrently with the OS? I'm told it's the main reasons why BIOS
 doesn't release unused cores from SMM early.
>>>
>>> This is firmware's problem, it depends on whether there is any
> hardware that is
>>> shared with the OS. Some hardware can be marked 'secure' in which
> case only
>>> firmware can access it, alternatively firmware can trap or just
> disable the OS's
>>> access to the shared hardware.
>>
>> It's everyone's problem. It's the firmware's responsibility.
> 
> It depends on the SoC design. If there is no hardware that the OS and
> firmware both need to access to handle an error then I don't think
> firmware needs to do this.
> 
> 
>>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>>> registers. Firmware can disable these for the OS, so that it always
> reads 0 from
>>> them. Instead firmware takes the error via FF, reads the registers from
>>> firmware, and dumps CPER records into the OS's memory.
>>>
>>> If there is a shared hardware resource that both the OS and firmware
> may be
>>> accessing, yes firmware needs to pull the other CPUs in, but this
> depends on the
>>> SoC design, it doesn't necessarily happen.
>>
>> The problem with shared resources is just a problem. I've seen systems
>> where all 100 cores are held up for 300+ ms. In latency-critical
>> applications reliability drops exponentially. Am I correct in assuming
>> your answer would be to "hide" more stuff from the OS?
> 
> No, I'm not a fan of firmware cycle stealing. If you can design the SoC or
> firmware so that the 'all CPUs' stuff doesn't need to happen, then you
> won't get
> these issues. (I don't design these things, I'm sure they're much more
> complicated
> than I think!)
> 
> Because the firmware is SoC-specific, so it only needs to do exactly
> what is necessary.

Irrespective of the hardware design, there's devicetree, ACPI methods,
and a few other ways to inform the OS of non-standard bits. They don't
have the resource sharing problem. I'm confused as to why FFS is used
when there are concerns about resource conflicts instead of race-free
alternatives.

 I think the idea of firmware-first is broken. But it's there, it's
 shipping in FW, so we have to accommodate it in SW.
>>>
>>> Part of our different-views here is firmware-first is taking
> something away from
>>> you, whereas for me its giving me information that would otherwise be in
>>> secret-soc-specific registers.
>>
>> Under this interpretation, FFS is a band-aid to the problem of "secret"
>> registers. "Secret" hardware doesn't really fit well into the idea of an
>> OS [1].
> 
> Sorry, I'm being sloppy with my terminology, by secret-soc-specific I
> mean either Linux can't access them 

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-20 Thread James Morse

Hi Alex,

On 04/16/2018 10:59 PM, Alex G. wrote:
> On 04/13/2018 11:38 AM, James Morse wrote:
>> This assumes a cache-invalidate will clear the error, which I don't 
think we're

>> guaranteed on arm.
>> It also destroys any adjacent data, "everyone's happy" includes the 
thread that
>> got a chunk of someone-else's stack frame, I don't think it will be 
happy for

>> very long!
>
> Hmm, no cache-line (or page) invalidation on arm64? How does
> dma_map/unmap_*() work then? You may not guarantee to fix the error, but

There are cache-invalidate instructions, but I don't think 'solving' a 
RAS error with them is the right thing to do.



> I don't buy into the "let's crash without trying" argument.

Our 'cache writeback granule' may be as large as 2K, so we may have to 
invalidate up to 2K of data to convince the hardware this address is 
okay again.


All we've done here is differently-corrupt the data so that it no longer 
generates a RAS fault, it just gives you the wrong data instead. 
Cache-invalidation is destructive.


I don't think there is a one-size-fits-all solution here.


>> (this is a side issue for AER though)
>
> Somebody muddled up AER with these tables, so we now have to worry about
> it. :)

Eh? I see there is a v2, maybe I'll understand this comment once I read it.


>>> How does FFS handle race conditions that can occur when accessing HW
>>> concurrently with the OS? I'm told it's the main reasons why BIOS
>>> doesn't release unused cores from SMM early.
>>
>> This is firmware's problem, it depends on whether there is any 
hardware that is
>> shared with the OS. Some hardware can be marked 'secure' in which 
case only
>> firmware can access it, alternatively firmware can trap or just 
disable the OS's

>> access to the shared hardware.
>
> It's everyone's problem. It's the firmware's responsibility.

It depends on the SoC design. If there is no hardware that the OS and 
firmware both need to access to handle an error then I don't think 
firmware needs to do this.



>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>> registers. Firmware can disable these for the OS, so that it always 
reads 0 from

>> them. Instead firmware takes the error via FF, reads the registers from
>> firmware, and dumps CPER records into the OS's memory.
>>
>> If there is a shared hardware resource that both the OS and firmware 
may be
>> accessing, yes firmware needs to pull the other CPUs in, but this 
depends on the

>> SoC design, it doesn't necessarily happen.
>
> The problem with shared resources is just a problem. I've seen systems
> where all 100 cores are held up for 300+ ms. In latency-critical
> applications reliability drops exponentially. Am I correct in assuming
> your answer would be to "hide" more stuff from the OS?

No, I'm not a fan of firmware cycle stealing. If you can design the SoC or
firmware so that the 'all CPUs' stuff doesn't need to happen, then you 
won't get
these issues. (I don't design these things, I'm sure they're much more 
complicated

than I think!)

Because the firmware is SoC-specific, so it only needs to do exactly 
what is necessary.



>>> I think the idea of firmware-first is broken. But it's there, it's
>>> shipping in FW, so we have to accommodate it in SW.
>>
>> Part of our different-views here is firmware-first is taking 
something away from

>> you, whereas for me its giving me information that would otherwise be in
>> secret-soc-specific registers.
>
> Under this interpretation, FFS is a band-aid to the problem of "secret"
> registers. "Secret" hardware doesn't really fit well into the idea of an
> OS [1].

Sorry, I'm being sloppy with my terminology, by secret-soc-specific I 
mean either Linux can't access them (firmware privilege-level only) or 
Linux can't reasonably know where these registers are, as they're 
soc-specific and vary by manufacture.



>>> And linux can handle a wide subset of MCEs just fine, so the
>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>> execution to the actual handlers.
>>
>> For some classes of error we can't safely get there.
>
> Optimize for the common case.

At the expense of reliability?


Thanks,

James


Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-16 Thread Alex G.
On 04/13/2018 11:38 AM, James Morse wrote:
> Hi Alex,
> 
> On 09/04/18 19:11, Alex G. wrote:
>> On 04/06/2018 01:24 PM, James Morse wrote:
>> Do you have any ETA on when your SEA patches are going to make it
>> upstream? There's not much point in updating my patchset if it's going
>> to conflict with your work.
> 
> The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification
> type for ARMv8"). My series is moving it to use the estatus-queue in the same
> way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like
> notifications. I have no idea on the ETA, it depends on review feedback!

Okay. I'll get a v2 out soonish then.

(snip)
> This assumes a cache-invalidate will clear the error, which I don't think 
> we're
> guaranteed on arm.
> It also destroys any adjacent data, "everyone's happy" includes the thread 
> that
> got a chunk of someone-else's stack frame, I don't think it will be happy for
> very long!

Hmm, no cache-line (or page) invalidation on arm64? How does
dma_map/unmap_*() work then? You may not guarantee to fix the error, but
I don't buy into the "let's crash without trying" argument.

> (this is a side issue for AER though)

Somebody muddled up AER with these tables, so we now have to worry about
it. :)

(snip)
>> How does FFS handle race conditions that can occur when accessing HW
>> concurrently with the OS? I'm told it's the main reasons why BIOS
>> doesn't release unused cores from SMM early.
> 
> This is firmware's problem, it depends on whether there is any hardware that 
> is
> shared with the OS. Some hardware can be marked 'secure' in which case only
> firmware can access it, alternatively firmware can trap or just disable the 
> OS's
> access to the shared hardware.

It's everyone's problem. It's the firmware's responsibility.

> For example, with the v8.2 RAS Extensions, there are some per-cpu error
> registers. Firmware can disable these for the OS, so that it always reads 0 
> from
> them. Instead firmware takes the error via FF, reads the registers from
> firmware, and dumps CPER records into the OS's memory.
> 
> If there is a shared hardware resource that both the OS and firmware may be
> accessing, yes firmware needs to pull the other CPUs in, but this depends on 
> the
> SoC design, it doesn't necessarily happen.

The problem with shared resources is just a problem. I've seen systems
where all 100 cores are held up for 300+ ms. In latency-critical
applications reliability drops exponentially. Am I correct in assuming
your answer would be to "hide" more stuff from the OS?


(snip)
> Sure, we're quirking our behaviour based on a high level of mistrust for the
> firmware. My point here was we shouldn't duplicate the implementation because 
> we
> want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd 
> rather
> the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we 
> have
> extra code to spot deferrable errors, we should use it on both architectures.

It's a well earned and well deserved mistrust. Firmware is evil (*).

(*) sarcastic overstatement of facts


(snip)
> For AER we agree, these never mean 'the CPU is on fire'.

Sounds like a good marketing slogan:
"ACME's new turbo-encabulated CPU -- it's on fire!"


(snip)
>>> even if broken firmware thinks they are fatal.
> 
>> I think the idea of firmware-first is broken. But it's there, it's
>> shipping in FW, so we have to accommodate it in SW.
> 
> Part of our different-views here is firmware-first is taking something away 
> from
> you, whereas for me its giving me information that would otherwise be in
> secret-soc-specific registers.

Under this interpretation, FFS is a band-aid to the problem of "secret"
registers. "Secret" hardware doesn't really fit well into the idea of an
OS [1]. The irony of the solution is that the response is centered on
firmware extensions which were designed to stile interoperability [2].

You are right, FFS is a poopstorm and headache for me. It takes my
sanity away. I once wrote FW for a laptop where the only use of SMM was
to run the "Enable ACPI" call -- which only disabled further SMM. We
didn't need an iota of FFS because at that time AMD wasn't secretive,
and there was no need to have "secret" registers in the first place.

[1] https://www.youtube.com/watch?v=_36yNWw_07g
[2]
http://antitrust.slated.org/www.iowaconsumercase.org/011607/3000/PX03020.pdf

(snip)
>> And linux can handle a wide subset of MCEs just fine, so the
>> ghes_is_deferrable() logic would, under my argument, agree to pass
>> execution to the actual handlers.
> 
> For some classes of error we can't safely get there.

Optimize for the common case.

(snip)
>> Though in that case, the problem is generalized to "how to best handle
>> error Y", rather than "how to handle error Y in FFS".
> 
> (that's a good thing yes?) I assume x86 can take MCE errors out of IRQ-masked
> code. Sharing the handle_foo_error_nmi() code between the two paths would be a

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-13 Thread James Morse
Hi Alex,

On 09/04/18 19:11, Alex G. wrote:
> On 04/06/2018 01:24 PM, James Morse wrote:
> Do you have any ETA on when your SEA patches are going to make it
> upstream? There's not much point in updating my patchset if it's going
> to conflict with your work.

The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification
type for ARMv8"). My series is moving it to use the estatus-queue in the same
way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like
notifications. I have no idea on the ETA, it depends on review feedback!


>>> But if on arm64, you can return to firmware,
>>> then we have wildly different constraints.
>>
>> We can't return to firmware, but we can take the error again, causing another
>> trap to firmware.
>>
>> e.g.: If a program accesses a value in the cache and the cache location is
>> discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
>> exception, which firmware has configured to trap to firmware. Once there, it
>> generates CPER records and sets-up an external-abort-exception for Linux.
>> If linux returns from this external-abort, we return to whatever program
>> accessed the bad-cache-value in the first case, re-run the instruction that
>> generates the external abort, and the same thing happens again, but to 
>> firmware
>> it looks like a second fault at the same address.

> This is a very good example of why we should _not_ panic in NMI.
> Invalidate the cache-line, next load causes a memory round-trip, and
> everyone's happy. Of course, FFS can't do that because it doesn't know
> if it's valid to invalidate (pun intended). But the OS does. So you
> _want_ to reach the error handler downstream of NMI context, and you do
> _not_ want to crash.

This assumes a cache-invalidate will clear the error, which I don't think we're
guaranteed on arm.
It also destroys any adjacent data, "everyone's happy" includes the thread that
got a chunk of someone-else's stack frame, I don't think it will be happy for
very long!

(this is a side issue for AER though)


>>> On x86, you can't return to SMM. You can have a new SMI triggered while
>>> servicing the NMI, but you can't re-enter an SMI that you've returned
>>> from. SMM holds all the cores, and they all leave SMM at roughly the
>>> same time, so you don't deal with coexistence issues. This probably also
>>> avoids the multiple notifications that you are trying to implement on arm.
>>
>> its the new SMI case.
>>
>> (We don't necessarily have all cores pulled into firmware, (although firmware
>> could do that in software), so different CPUs taking NMI-like notifications 
>> is
>> one of the things we have to handle.)

> How does FFS handle race conditions that can occur when accessing HW
> concurrently with the OS? I'm told it's the main reasons why BIOS
> doesn't release unused cores from SMM early.

This is firmware's problem, it depends on whether there is any hardware that is
shared with the OS. Some hardware can be marked 'secure' in which case only
firmware can access it, alternatively firmware can trap or just disable the OS's
access to the shared hardware.

For example, with the v8.2 RAS Extensions, there are some per-cpu error
registers. Firmware can disable these for the OS, so that it always reads 0 from
them. Instead firmware takes the error via FF, reads the registers from
firmware, and dumps CPER records into the OS's memory.

If there is a shared hardware resource that both the OS and firmware may be
accessing, yes firmware needs to pull the other CPUs in, but this depends on the
SoC design, it doesn't necessarily happen.


>>> On quick thought, I think the best way to go for both series is to leave
>>> the entry points arch-specific, and prevent hax86 and harm64 from
>>> stepping on each other's toes. Thoughts?
>>
>> The behaviour should be driven from the standard. I agree there is some 
>> leeway,
>> which linux should handle on all architectures that support GHES.
> 
> "The standard" deals mostly with firmware behavior. While I'm all for
> following specifications in order to ensure smooth interaction and
> integration, Linux is an OS and it deals with a plethora of "standards".
> Its job is to serve user requests. When a "standard" deviates from this,
> such as mandating a crash when one is not required, it's the OS's job to
> _not_ follow this requirement.

Sure, we're quirking our behaviour based on a high level of mistrust for the
firmware. My point here was we shouldn't duplicate the implementation because we
want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd rather
the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we have
extra code to spot deferrable errors, we should use it on both architectures.


>> Once in ghes.c all the NMI-like notifications get merged together as the only
>> relevant thing about them is we have to use the estatus-queue, and can't call
>> any of the sleepy recovery helpers.
>>
>> I don't th

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-09 Thread Alex G.
Hi James,

On 04/06/2018 01:24 PM, James Morse wrote:

Do you have any ETA on when your SEA patches are going to make it
upstream? There's not much point in updating my patchset if it's going
to conflict with your work.

(snip)
>> On the contrary. No output, followed by a watchdog reboot is usually
>> what happens when the panic() is in NMI. One of the very few ways that
>> can be debugged is over a NULL modem cable.
> 
> (IPMI-SOL?)
> 
> Really? printk has NMI support, and panic()s call to bust_spinlocks() should
> unblank the screen before printk_safe_flush_on_panic() dumps the messages. 
> (I'm
> assuming some BMC-VGA exists to capture the 'screen').

Plausible. There are other mechanisms though that may reset the system
before it is practical to capture the screen.

(snip)
>> But if on arm64, you can return to firmware,
>> then we have wildly different constraints.
> 
> We can't return to firmware, but we can take the error again, causing another
> trap to firmware.
> 
> e.g.: If a program accesses a value in the cache and the cache location is
> discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
> exception, which firmware has configured to trap to firmware. Once there, it
> generates CPER records and sets-up an external-abort-exception for Linux.
> If linux returns from this external-abort, we return to whatever program
> accessed the bad-cache-value in the first case, re-run the instruction that
> generates the external abort, and the same thing happens again, but to 
> firmware
> it looks like a second fault at the same address.

This is a very good example of why we should _not_ panic in NMI.
Invalidate the cache-line, next load causes a memory round-trip, and
everyone's happy. Of course, FFS can't do that because it doesn't know
if it's valid to invalidate (pun intended). But the OS does. So you
_want_ to reach the error handler downstream of NMI context, and you do
_not_ want to crash.


(snip)
>> On x86, you can't return to SMM. You can have a new SMI triggered while
>> servicing the NMI, but you can't re-enter an SMI that you've returned
>> from. SMM holds all the cores, and they all leave SMM at roughly the
>> same time, so you don't deal with coexistence issues. This probably also
>> avoids the multiple notifications that you are trying to implement on arm.
> 
> its the new SMI case.
> 
> (We don't necessarily have all cores pulled into firmware, (although firmware
> could do that in software), so different CPUs taking NMI-like notifications is
> one of the things we have to handle.)

How does FFS handle race conditions that can occur when accessing HW
concurrently with the OS? I'm told it's the main reasons why BIOS
doesn't release unused cores from SMM early.


>> On quick thought, I think the best way to go for both series is to leave
>> the entry points arch-specific, and prevent hax86 and harm64 from
>> stepping on each other's toes. Thoughts?
> 
> The behaviour should be driven from the standard. I agree there is some 
> leeway,
> which linux should handle on all architectures that support GHES.

"The standard" deals mostly with firmware behavior. While I'm all for
following specifications in order to ensure smooth interaction and
integration, Linux is an OS and it deals with a plethora of "standards".
Its job is to serve user requests. When a "standard" deviates from this,
such as mandating a crash when one is not required, it's the OS's job to
_not_ follow this requirement.


> The entry points from the arch code are already separate, as these have
> different notification types in the HEST->GHES entries and different
> registration requirements.
> Once in ghes.c all the NMI-like notifications get merged together as the only
> relevant thing about them is we have to use the estatus-queue, and can't call
> any of the sleepy recovery helpers.
> 
> I don't think this is the issue though: An NMI notification could by any of 
> the
> eleven CPER notification record types.
> Your argument is Linux probably can handle PCIe-AER errors,

My argument is that Linux should do a best effort to recover from
hardware (or FFS) errors. In support of that argument, I presented
evidence that Linux already can handle a variety of errors. How those
errors are reported determines whether the OS crashes or not, and that's
not right. When errors are reported natively, we make it to the error
handlers before crashing. And the essence of my argument is that there
is no reason to have a different behavior with FFS.

> even if broken firmware thinks they are fatal.

I think the idea of firmware-first is broken. But it's there, it's
shipping in FW, so we have to accommodate it in SW.


> This is only one of the eleven notification record types. Returning to handle
> the error definitely doesn't work for some classes of fatal error, especially
> when interrupts are masked.

> I think its straightforward to test whether the CPER records contain only 
> errors
> where we can do better: (no

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-06 Thread James Morse
Hi Alex,

On 04/04/18 20:49, Alex G. wrote:
> On 04/04/2018 11:53 AM, James Morse wrote:
 How do we know we will survive this trip?
>>>
>>> We don't.
>>
>> Isn't that even worse than panic()ing? (No output, followed by a watchdog 
>> reboot
>> if we're lucky)

> On the contrary. No output, followed by a watchdog reboot is usually
> what happens when the panic() is in NMI. One of the very few ways that
> can be debugged is over a NULL modem cable.

(IPMI-SOL?)

Really? printk has NMI support, and panic()s call to bust_spinlocks() should
unblank the screen before printk_safe_flush_on_panic() dumps the messages. (I'm
assuming some BMC-VGA exists to capture the 'screen').


 On arm64 systems it may not be possible to return to the context we took 
 the NMI
 notification from: we may bounce back into firmware with the same "world 
 is on
 fire" error. Firmware can rightly assume the OS has made no attempt to 
 handle
 the error. 
>>>
>>> I'm not aware of the NMI path being used on arm64:
>>> $ git grep 'select HAVE_ACPI_APEI_NMI'
>>> arch/x86/Kconfig:   select HAVE_ACPI_APEI_NMI   if ACPI
>>> $
>>
>> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
>> that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).
>>
>> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the 
>> arch
>> code to implement register_nmi_handler() and (from memory) behave as a 
>> notifier
>> chain, which doesn't fit with how these things behave.
>>
>> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
>> The series to add SDEI support is on the list here:
>> https://www.spinics.net/lists/arm-kernel/msg642946.html
>> (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're 
>> there
>> yet.)

> From my understanding, your patch series tries to use ghes_notify_nmi()
> as a common entry point. 

It abstracts the NMI-like notifications to use a common helper into the
estatus-queue. ghes_notify_nmi() is the arch-code entry point for x86's
NOTIFY_NMI. We  have ghes_notify_sea() and ghes_sdei_callback() on arm64. They
are different as their registration requirements are different. They each call
_in_nmi_notify_one() to do the in_nmi() ghes work.


> But if on arm64, you can return to firmware,
> then we have wildly different constraints.

We can't return to firmware, but we can take the error again, causing another
trap to firmware.

e.g.: If a program accesses a value in the cache and the cache location is
discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
exception, which firmware has configured to trap to firmware. Once there, it
generates CPER records and sets-up an external-abort-exception for Linux.
If linux returns from this external-abort, we return to whatever program
accessed the bad-cache-value in the first case, re-run the instruction that
generates the external abort, and the same thing happens again, but to firmware
it looks like a second fault at the same address.

The same thing can happen for a processor error.


> On x86, you can't return to SMM. You can have a new SMI triggered while
> servicing the NMI, but you can't re-enter an SMI that you've returned
> from. SMM holds all the cores, and they all leave SMM at roughly the
> same time, so you don't deal with coexistence issues. This probably also
> avoids the multiple notifications that you are trying to implement on arm.

its the new SMI case.

(We don't necessarily have all cores pulled into firmware, (although firmware
could do that in software), so different CPUs taking NMI-like notifications is
one of the things we have to handle.)


> On quick thought, I think the best way to go for both series is to leave
> the entry points arch-specific, and prevent hax86 and harm64 from
> stepping on each other's toes. Thoughts?

The behaviour should be driven from the standard. I agree there is some leeway,
which linux should handle on all architectures that support GHES.

The entry points from the arch code are already separate, as these have
different notification types in the HEST->GHES entries and different
registration requirements.
Once in ghes.c all the NMI-like notifications get merged together as the only
relevant thing about them is we have to use the estatus-queue, and can't call
any of the sleepy recovery helpers.



I don't think this is the issue though: An NMI notification could by any of the
eleven CPER notification record types.
Your argument is Linux probably can handle PCIe-AER errors, even if broken
firmware thinks they are fatal.
This is only one of the eleven notification record types. Returning to handle
the error definitely doesn't work for some classes of fatal error, especially
when interrupts are masked.

I think its straightforward to test whether the CPER records contain only errors
where we can do better: (not tested/built, I haven't checked whether this is nmi
safe):
--

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-04 Thread Alex G.


On 04/04/2018 11:53 AM, James Morse wrote:
> Hi Alex,
(snip)
>>> How do we know we will survive this trip?
>>
>> We don't.
> 
> Isn't that even worse than panic()ing? (No output, followed by a watchdog 
> reboot
> if we're lucky)

On the contrary. No output, followed by a watchdog reboot is usually
what happens when the panic() is in NMI. One of the very few ways that
can be debugged is over a NULL modem cable.

>>> On arm64 systems it may not be possible to return to the context we took 
>>> the NMI
>>> notification from: we may bounce back into firmware with the same "world is 
>>> on
>>> fire" error. Firmware can rightly assume the OS has made no attempt to 
>>> handle
>>> the error. 
>>
>> I'm not aware of the NMI path being used on arm64:
>> $ git grep 'select HAVE_ACPI_APEI_NMI'
>> arch/x86/Kconfig:   select HAVE_ACPI_APEI_NMI   if ACPI
>> $
> 
> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
> that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).
> 
> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch
> code to implement register_nmi_handler() and (from memory) behave as a 
> notifier
> chain, which doesn't fit with how these things behave.
>
> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
> The series to add SDEI support is on the list here:
> https://www.spinics.net/lists/arm-kernel/msg642946.html
> (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're 
> there
> yet.)

>From my understanding, your patch series tries to use ghes_notify_nmi()
as a common entry point. But if on arm64, you can return to firmware,
then we have wildly different constraints.

On x86, you can't return to SMM. You can have a new SMI triggered while
servicing the NMI, but you can't re-enter an SMI that you've returned
from. SMM holds all the cores, and they all leave SMM at roughly the
same time, so you don't deal with coexistence issues. This probably also
avoids the multiple notifications that you are trying to implement on arm.

On quick thought, I think the best way to go for both series is to leave
the entry points arch-specific, and prevent hax86 and harm64 from
stepping on each other's toes. Thoughts?


(snip)
> Wouldn't reasonably-intelligent firmware be able to spot the same fault
> repeatedly in a loop? (last-fault-address == this-fault-address,
> last-fault-time==now)

I'm told that the BIOS on Dell machines will detect interrupt storms,
and disable the specific error source in such cases. On an x86 server.
50us of SMM eats up several milliseconds-core of CPU time, so they try
to be a little careful.

>>> Your 'not re-arming the error' example makes this worrying.
>>
>> Those are one class of bugs you get when you let firmware run the show.
>> It's been an issue since day one of EFI. The best you can do in such
>> cases is accept you may later lose the link to a device.
>> Side-note: We have a BIOS defect filed with against this particular
>> issue, and it only seems to affect PCIe SMIs.
> 
> This sounds like policy applied too early

That's firmware-first in one sentence.

> fixing the firmware to at least make this configurable would be preferable.

I can ask for "fixing the firmware" and we might even get the fix, but
the fact remains that there are machines in the field with the buggy
BIOSes, and there will continue to be such machines even after a fix is
released.


> My point was there is a class of reported-as-fatal error that really is fatal,
> and for these trying to defer the work to irq_work is worse than panic()ing as
> we get less information about what happened.

How do we get less information? We save the GHES structure on the
estatus_queue, and use it as the error source in either case. Is there a
class of errors where we can reasonable expect things to be okay in NMI,
but go horribly wrong on return from NMI?

(snip)
> Are all AER errors recoverable?  (is an OS:'fatal' AER error ever valid?)

PCIe "fatal" means that your link is gone and you might not get it back.
You lose all the downstream devices. You may be able to reset the link
and get it back. This sort of "fatal" has no impact on a machine's
ability to continue operation.

>From ACPI, FFS should only send a "fatal" error if a machine needs reset
[1], so it's clearly a firmware bug to mark any PCIe error "fatal". This
won't stop BIOS vendors from taking shortcuts and deciding to crash an
OS by sending the error as "fatal".

[1]ACPI 6.2: 18.1Hardware Errors and Error Sources


>> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
>> duplicate code between the two. But we can also get ghes_proc_irq_work()
>> as an entry point, and this easily turns in a maintenance nightmare.
> 
> We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep 
> and
> allocate memory. I think the trick would be to make the CPER parsing routines
> NMI-safe so that we can check t

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-04 Thread James Morse
Hi Alex,

On 04/04/18 16:33, Alex G. wrote:
> On 04/04/2018 02:18 AM, James Morse wrote:
>> On 03/04/18 18:08, Alexandru Gagniuc wrote:
>>> BIOSes like to send NMIs for a number of silly reasons often deemed
>>> to be "fatal". For example pin bounce during a PCIE hotplug/unplug
>>> might cause the link to go down and retrain, with fatal PCI errors
>>> being generated while the link is retraining.

>>> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
>>> context to see if they can be resolved.
>>
>> How do we know we will survive this trip?
> 
> We don't.

Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot
if we're lucky)


>> On arm64 systems it may not be possible to return to the context we took the 
>> NMI
>> notification from: we may bounce back into firmware with the same "world is 
>> on
>> fire" error. Firmware can rightly assume the OS has made no attempt to handle
>> the error. 
> 
> I'm not aware of the NMI path being used on arm64:
> $ git grep 'select HAVE_ACPI_APEI_NMI'
> arch/x86/Kconfig:   select HAVE_ACPI_APEI_NMI   if ACPI
> $

(I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).

CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch
code to implement register_nmi_handler() and (from memory) behave as a notifier
chain, which doesn't fit with how these things behave.

NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
The series to add SDEI support is on the list here:
https://www.spinics.net/lists/arm-kernel/msg642946.html
(NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there
yet.)


> A far as jumping back into firmware, the OS should clear the GHES block
> status before exiting the NMI. THis tells the firmware that the OS got
> the message [1]
> I'm not seeing any mechanism to say "handled/not handled", so I'm not
> sure how firmware can make such assumptions.

(I'm probably assuming GHESv2's ACK stuff).

Wouldn't reasonably-intelligent firmware be able to spot the same fault
repeatedly in a loop? (last-fault-address == this-fault-address,
last-fault-time==now)


>> Your 'not re-arming the error' example makes this worrying.
> 
> Those are one class of bugs you get when you let firmware run the show.
> It's been an issue since day one of EFI. The best you can do in such
> cases is accept you may later lose the link to a device.
> Side-note: We have a BIOS defect filed with against this particular
> issue, and it only seems to affect PCIe SMIs.

This sounds like policy applied too early, fixing the firmware to at least make
this configurable would be preferable.


My point was there is a class of reported-as-fatal error that really is fatal,
and for these trying to defer the work to irq_work is worse than panic()ing as
we get less information about what happened.


>>> With these change, PCIe error are handled by AER. Other far less
>>> common errors, such as machine check exceptions, still cause a panic()
>>> in their respective handlers.
>>
>> I agree AER is always going to be different. Could we take a look at the CPER
>> records while still in_nmi() to decide whether linux knows better than 
>> firmware?
>>
>> For non-standard or processor-errors I think we should always panic() if 
>> they're
>> marked as fatal.
>> For memory-errors we could split memory_failure() up to have
>> {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
>> the affected memory is kernel memory.
> 
> I'm trying to avoid splitting up the recovery logic based on _how_ we
> were notified of the error. 

I agree, last time I checked there were 11 of them. 'in_nmi()' or not is a
distinction the ghes code already has for ghes_copy_tofrom_phys(). We already
have to consider NMI-like notifications separately as we can't call the
recovery-helpers.


> The only sure way to know if you've handled
> an error is to try to handle it. Not handling _any_ fatal error still
> results in a panic(), and I was very careful to ensure that.

Are all AER errors recoverable?  (is an OS:'fatal' AER error ever valid?)


> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
> duplicate code between the two. But we can also get ghes_proc_irq_work()
> as an entry point, and this easily turns in a maintenance nightmare.

We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and
allocate memory. I think the trick would be to make the CPER parsing routines
NMI-safe so that we can check these 'fatal' records for cases where we're pretty
sure linux can do a better job.


>> For the PCI*/AER errors we should be able to try and handle it ... if we can 
>> get
>> back to process/IRQ context:
>> What happens if a PCIe driver has interrupts masked and is doing something to
>> cause this error? We can take the NMI and setup the irq-work, but it m

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-04 Thread Alex G.
Hi James,

On 04/04/2018 02:18 AM, James Morse wrote:
> Hi Alexandru,
> 
> On 03/04/18 18:08, Alexandru Gagniuc wrote:
>> BIOSes like to send NMIs for a number of silly reasons often deemed
>> to be "fatal". For example pin bounce during a PCIE hotplug/unplug
>> might cause the link to go down and retrain, with fatal PCI errors
>> being generated while the link is retraining.
> 
> Sounds fun!
> 
> 
>> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
>> context to see if they can be resolved.
> 
> How do we know we will survive this trip?

We don't.


> On arm64 systems it may not be possible to return to the context we took the 
> NMI
> notification from: we may bounce back into firmware with the same "world is on
> fire" error. Firmware can rightly assume the OS has made no attempt to handle
> the error. 

I'm not aware of the NMI path being used on arm64:
$ git grep 'select HAVE_ACPI_APEI_NMI'
arch/x86/Kconfig:   select HAVE_ACPI_APEI_NMI   if ACPI
$

A far as jumping back into firmware, the OS should clear the GHES block
status before exiting the NMI. THis tells the firmware that the OS got
the message [1]
I'm not seeing any mechanism to say "handled/not handled", so I'm not
sure how firmware can make such assumptions.


> Your 'not re-arming the error' example makes this worrying.

Those are one class of bugs you get when you let firmware run the show.
It's been an issue since day one of EFI. The best you can do in such
cases is accept you may later lose the link to a device.
Side-note: We have a BIOS defect filed with against this particular
issue, and it only seems to affect PCIe SMIs.


>> With these change, PCIe error are handled by AER. Other far less
>> common errors, such as machine check exceptions, still cause a panic()
>> in their respective handlers.
> 
> I agree AER is always going to be different. Could we take a look at the CPER
> records while still in_nmi() to decide whether linux knows better than 
> firmware?
> 
> For non-standard or processor-errors I think we should always panic() if 
> they're
> marked as fatal.
> For memory-errors we could split memory_failure() up to have
> {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
> the affected memory is kernel memory.

I'm trying to avoid splitting up the recovery logic based on _how_ we
were notified of the error. The only sure way to know if you've handled
an error is to try to handle it. Not handling _any_ fatal error still
results in a panic(), and I was very careful to ensure that.

The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
duplicate code between the two. But we can also get ghes_proc_irq_work()
as an entry point, and this easily turns in a maintenance nightmare.

> For the PCI*/AER errors we should be able to try and handle it ... if we can 
> get
> back to process/IRQ context:
> What happens if a PCIe driver has interrupts masked and is doing something to
> cause this error? We can take the NMI and setup the irq-work, but it may never
> run as we return to interrupts-masked context.

The way the recovery from NMI is set up is confusing. On NMIs,
ghes_proc_in_irq() is queued up to do further work. Despite the name, in
this case it runs as a task. That further queues up AER work,
Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
runs independent of drivers or interrupts:
 - Drivers that implement AER recovery services are expected to do the
right thing and shut down IO. To not do so, is a bug in the driver.
 - Drivers that do not implement AER have their devices taken offline.

It is not uncommon for PCIe errors to amplify. Imagine losing the link
right after you've fired off several DMA queues. You'd get a stream of
Unsupported Request errors. AER handles this fairly well.

> 
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 2c998125b1d5..7243a99ea57e 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
>>  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>>  {
>>  struct ghes *ghes;
>> -int sev, ret = NMI_DONE;
>> +int ret = NMI_DONE;
>>  
>>  if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>>  return ret;
>> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct 
>> pt_regs *regs)
>>  ret = NMI_HANDLED;
>>  }
>>  
>> -sev = ghes_severity(ghes->estatus->error_severity);
>> -if (sev >= GHES_SEV_PANIC) {
>> -oops_begin();
>> -ghes_print_queued_estatus();
>> -__ghes_panic(ghes);
>> -}
>> -
>>  if (!(ghes->flags & GHES_TO_CLEAR))
>>  continue;
> 
> For Processor-errors I think this is the wrong thing to do, but we should be
> able to poke around in the CPER records and find out what we 

Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

2018-04-04 Thread James Morse
Hi Alexandru,

On 03/04/18 18:08, Alexandru Gagniuc wrote:
> BIOSes like to send NMIs for a number of silly reasons often deemed
> to be "fatal". For example pin bounce during a PCIE hotplug/unplug
> might cause the link to go down and retrain, with fatal PCI errors
> being generated while the link is retraining.

Sounds fun!


> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
> context to see if they can be resolved.

How do we know we will survive this trip?

On arm64 systems it may not be possible to return to the context we took the NMI
notification from: we may bounce back into firmware with the same "world is on
fire" error. Firmware can rightly assume the OS has made no attempt to handle
the error. Your 'not re-arming the error' example makes this worrying.


> With these change, PCIe error are handled by AER. Other far less
> common errors, such as machine check exceptions, still cause a panic()
> in their respective handlers.

I agree AER is always going to be different. Could we take a look at the CPER
records while still in_nmi() to decide whether linux knows better than firmware?

For non-standard or processor-errors I think we should always panic() if they're
marked as fatal.
For memory-errors we could split memory_failure() up to have
{NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
the affected memory is kernel memory.

For the PCI*/AER errors we should be able to try and handle it ... if we can get
back to process/IRQ context:
What happens if a PCIe driver has interrupts masked and is doing something to
cause this error? We can take the NMI and setup the irq-work, but it may never
run as we return to interrupts-masked context.


> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2c998125b1d5..7243a99ea57e 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
>  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  {
>   struct ghes *ghes;
> - int sev, ret = NMI_DONE;
> + int ret = NMI_DONE;
>  
>   if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>   return ret;
> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct 
> pt_regs *regs)
>   ret = NMI_HANDLED;
>   }
>  
> - sev = ghes_severity(ghes->estatus->error_severity);
> - if (sev >= GHES_SEV_PANIC) {
> - oops_begin();
> - ghes_print_queued_estatus();
> - __ghes_panic(ghes);
> - }
> -
>   if (!(ghes->flags & GHES_TO_CLEAR))
>   continue;

For Processor-errors I think this is the wrong thing to do, but we should be
able to poke around in the CPER records and find out what we are dealing with.


Thanks,

James