Re: [PATCH] powerpc/mce: check if event info is valid

2021-10-15 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of October 7, 2021 10:09 pm:
> Ganesh  writes:
>> On 8/6/21 6:53 PM, Ganesh Goudar wrote:
>>
>>> Check if the event info is valid before printing the
>>> event information. When a fwnmi enabled nested kvm guest
>>> hits a machine check exception L0 and L2 would generate
>>> machine check event info, But L1 would not generate any
>>> machine check event info as it won't go through 0x200
>>> vector and prints some unwanted message.
>>>
>>> To fix this, 'in_use' variable in machine check event info is
>>> no more in use, rename it to 'valid' and check if the event
>>> information is valid before logging the event information.
>>>
>>> without this patch L1 would print following message for
>>> exceptions encountered in L2, as event structure will be
>>> empty in L1.
>>>
>>> "Machine Check Exception, Unknown event version 0".
>>>
>>> Signed-off-by: Ganesh Goudar 
>>> ---
>>
>> Hi mpe, Any comments on this patch.
> 
> The variable rename is a bit of a distraction.
> 
> But ignoring that, how do we end up processing a machine_check_event
> that has in_use = 0?
> 
> You don't give much detail on what call path you're talking about. I
> guess we're coming in via the calls in the KVM code?
> 
> In the definition of kvm_vcpu_arch we have:
> 
>   struct machine_check_event mce_evt; /* Valid if trap == 0x200 */
> 
> And you said we're not going via 0x200 in L1. But so shouldn't we be
> teaching the KVM code not to use mce_evt when trap is not 0x200?

I'm not sure we want the MCE to skip the L1 either. It should match the 
L0 hypervisor behaviour as closely as reasonably possible.

We might have to teach the KVM pseries path to do something about the
0x200 before the common HV guest exit handler (which is where the L1
message comes from).

Thanks,
Nick


Re: [PATCH] powerpc/mce: check if event info is valid

2021-10-07 Thread Michael Ellerman
Ganesh  writes:
> On 8/6/21 6:53 PM, Ganesh Goudar wrote:
>
>> Check if the event info is valid before printing the
>> event information. When a fwnmi enabled nested kvm guest
>> hits a machine check exception L0 and L2 would generate
>> machine check event info, But L1 would not generate any
>> machine check event info as it won't go through 0x200
>> vector and prints some unwanted message.
>>
>> To fix this, 'in_use' variable in machine check event info is
>> no more in use, rename it to 'valid' and check if the event
>> information is valid before logging the event information.
>>
>> without this patch L1 would print following message for
>> exceptions encountered in L2, as event structure will be
>> empty in L1.
>>
>> "Machine Check Exception, Unknown event version 0".
>>
>> Signed-off-by: Ganesh Goudar 
>> ---
>
> Hi mpe, Any comments on this patch.

The variable rename is a bit of a distraction.

But ignoring that, how do we end up processing a machine_check_event
that has in_use = 0?

You don't give much detail on what call path you're talking about. I
guess we're coming in via the calls in the KVM code?

In the definition of kvm_vcpu_arch we have:

struct machine_check_event mce_evt; /* Valid if trap == 0x200 */

And you said we're not going via 0x200 in L1. But so shouldn't we be
teaching the KVM code not to use mce_evt when trap is not 0x200?

cheers


Re: [PATCH] powerpc/mce: check if event info is valid

2021-09-17 Thread Ganesh

On 8/6/21 6:53 PM, Ganesh Goudar wrote:


Check if the event info is valid before printing the
event information. When a fwnmi enabled nested kvm guest
hits a machine check exception L0 and L2 would generate
machine check event info, But L1 would not generate any
machine check event info as it won't go through 0x200
vector and prints some unwanted message.

To fix this, 'in_use' variable in machine check event info is
no more in use, rename it to 'valid' and check if the event
information is valid before logging the event information.

without this patch L1 would print following message for
exceptions encountered in L2, as event structure will be
empty in L1.

"Machine Check Exception, Unknown event version 0".

Signed-off-by: Ganesh Goudar 
---


Hi mpe, Any comments on this patch.



[PATCH] powerpc/mce: check if event info is valid

2021-08-06 Thread Ganesh Goudar
Check if the event info is valid before printing the
event information. When a fwnmi enabled nested kvm guest
hits a machine check exception L0 and L2 would generate
machine check event info, But L1 would not generate any
machine check event info as it won't go through 0x200
vector and prints some unwanted message.

To fix this, 'in_use' variable in machine check event info is
no more in use, rename it to 'valid' and check if the event
information is valid before logging the event information.

without this patch L1 would print following message for
exceptions encountered in L2, as event structure will be
empty in L1.

"Machine Check Exception, Unknown event version 0".

Signed-off-by: Ganesh Goudar 
---
 arch/powerpc/include/asm/mce.h | 2 +-
 arch/powerpc/kernel/mce.c  | 7 +--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 331d944280b8..3646f53f228f 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -113,7 +113,7 @@ enum MCE_LinkErrorType {
 
 struct machine_check_event {
enum MCE_Versionversion:8;
-   u8  in_use;
+   u8  valid;
enum MCE_Severity   severity:8;
enum MCE_Initiator  initiator:8;
enum MCE_ErrorType  error_type:8;
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 47a683cd00d2..b778394a06b5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -114,7 +114,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->srr0 = nip;
mce->srr1 = regs->msr;
mce->gpr3 = regs->gpr[3];
-   mce->in_use = 1;
+   mce->valid = 1;
mce->cpu = get_paca()->paca_index;
 
/* Mark it recovered if we have handled it and MSR(RI=1). */
@@ -202,7 +202,7 @@ int get_mce_event(struct machine_check_event *mce, bool 
release)
if (mce)
*mce = *mc_evt;
if (release)
-   mc_evt->in_use = 0;
+   mc_evt->valid = 0;
ret = 1;
}
/* Decrement the count to free the slot. */
@@ -413,6 +413,9 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
"Probable Software error (some chance of hardware cause)",
};
 
+   if (!evt->valid)
+   return;
+
/* Print things out */
if (evt->version != MCE_V1) {
pr_err("Machine Check Exception, Unknown event version %d !\n",
-- 
2.31.1