Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-19 Thread Dan Carpenter
On Fri, Feb 16, 2024 at 05:17:20PM -0800, Dan Williams wrote:
> > commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab
> > Author: Ira Weiny 
> > Date:   Wed Feb 14 15:25:24 2024 -0800
> > 
> > Revert "acpi/ghes: Process CXL Component Events"
> > 
> > This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5.
> 
> Even reverts need changelogs, I can add one. I got conflicts trying to
> apply this to current fixes branch. I think I am going to just
> surgically backout the drivers/acpi/apei/ghes.c changes.

The `git revert` command comes from ancient times and it just sets
people up for failure...  :/

regards,
dan carpenter



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-16 Thread Dan Williams
Ira Weiny wrote:
> Ira Weiny wrote:
> > Jonathan Cameron wrote:
> > > On Wed, 14 Feb 2024 10:23:10 -0500
> > > Steven Rostedt  wrote:
> > > 
> > > > On Wed, 14 Feb 2024 12:11:53 +
> > > > Jonathan Cameron  wrote:
> > > > 
> > > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > > land and
> > > > > assume this will be resolved as well?  
> > > > 
> > > > That pretty much sums up what I was about to say ;-)
> > > > 
> > > > tp_printk is more of a hack and not to be used sparingly. With the right
> > > > trace events it can hang the machine.
> > > > 
> > > > So, you can use your internal patch locally, but I would recommend 
> > > > waiting
> > > > for the new printk changes to land.
> > 
> > Steven, Do you think that will land in 6.9?
> > 
> > > >
> > > > I'm really hoping that will be soon!
> > > > 
> > > > -- Steve
> > > 
> > > Thanks Steve,
> > > 
> > > Ira's fix is needed for other valid locking reasons - this was 'just 
> > > another'
> > > lock debugging report that came up whilst testing it.
> > > 
> > > For this patch (not a potential additional one that we aren't going to do 
> > > ;)
> > > 
> > > Tested-by: Jonathan Cameron 
> > > Reviewed-by: Jonathan Cameron 
> > 
> > Jonathan,
> > 
> > Again thanks for the testing!  However, Dan and I just discussed this and
> > he has an uneasy feeling about going forward with this for 6.8 final.
> > 
> > If we revert the following patch I can squash this fix and wait for the
> > tp_printk() fix to land in 6.9 and resubmit.
> > 
> > Dan here is the patch which backs out the actual bug:
> > 
> > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 
> 
> Unfortunately this is not the only patch.
> 
> We need to revert this too:
> 
> Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") 
> 
> And then revert ...
>   Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 
> 
> ... but there is a conflict.
> 
> Dan, below is the correct revert patch.  Let me know if you need more.
> 
> Ira
> 
> commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab
> Author: Ira Weiny 
> Date:   Wed Feb 14 15:25:24 2024 -0800
> 
> Revert "acpi/ghes: Process CXL Component Events"
> 
> This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5.

Even reverts need changelogs, I can add one. I got conflicts trying to
apply this to current fixes branch. I think I am going to just
surgically backout the drivers/acpi/apei/ghes.c changes.



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-16 Thread Dan Williams
Ira Weiny wrote:
[..]
> > As Steve said, tp_printk is a hack (a very useful one) and
> > hopefully no one runs it in production.
> 
> OMG...  I did not realize what tp_printk() was exactly.  I should have
> looked closer.
> 
> Do we have evidence of its use in production?
> 
> I would love to not have to revert/respin,

The revert is for 2 non-trivial fixes needed in one new feature, lets
just circle back and get it right for v6.9. The tp_printk() was not the
final straw for me.



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-15 Thread Ira Weiny
Jonathan Cameron wrote:
> On Wed, 14 Feb 2024 17:33:18 -0500
> Steven Rostedt  wrote:
> 
> > On Wed, 14 Feb 2024 14:19:19 -0800
> > Ira Weiny  wrote:
> > 
> > > > > Jonathan Cameron  wrote:
> > > > > 
> > > > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > > > land and
> > > > > > assume this will be resolved as well?  
> > > > > 
> > > > > That pretty much sums up what I was about to say ;-)
> > > > > 
> > > > > tp_printk is more of a hack and not to be used sparingly. With the 
> > > > > right
> > > > > trace events it can hang the machine.
> > > > > 
> > > > > So, you can use your internal patch locally, but I would recommend 
> > > > > waiting
> > > > > for the new printk changes to land.
> > > 
> > > Steven, Do you think that will land in 6.9?
> > >   
> > > > >
> > > > > I'm really hoping that will be soon!
> > > > >   
> > 
> > I may be like Jon Corbet predicting RT will land in mainline if I do.
> > 
> > -- Steve
> > 
> 
> 
> Agreed. Don't wait on printk fixes landing. (Well unless you are sure
> it's the year of the Linux desktop.) Reverting is fine for 6.8
> if you and Dan feel it's unwise to take this forwards (all the distros
> will backport it anyway and 6.8 isn't an LTS so no great rush)
> so fine to just queue it up again for 6.9 with this fix.
> 
> As Steve said, tp_printk is a hack (a very useful one) and
> hopefully no one runs it in production.

OMG...  I did not realize what tp_printk() was exactly.  I should have
looked closer.

Do we have evidence of its use in production?

I would love to not have to revert/respin,
Ira



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-15 Thread Jonathan Cameron
On Wed, 14 Feb 2024 17:33:18 -0500
Steven Rostedt  wrote:

> On Wed, 14 Feb 2024 14:19:19 -0800
> Ira Weiny  wrote:
> 
> > > > Jonathan Cameron  wrote:
> > > > 
> > > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > > land and
> > > > > assume this will be resolved as well?  
> > > > 
> > > > That pretty much sums up what I was about to say ;-)
> > > > 
> > > > tp_printk is more of a hack and not to be used sparingly. With the right
> > > > trace events it can hang the machine.
> > > > 
> > > > So, you can use your internal patch locally, but I would recommend 
> > > > waiting
> > > > for the new printk changes to land.
> > 
> > Steven, Do you think that will land in 6.9?
> >   
> > > >
> > > > I'm really hoping that will be soon!
> > > >   
> 
> I may be like Jon Corbet predicting RT will land in mainline if I do.
> 
> -- Steve
> 


Agreed. Don't wait on printk fixes landing. (Well unless you are sure
it's the year of the Linux desktop.) Reverting is fine for 6.8
if you and Dan feel it's unwise to take this forwards (all the distros
will backport it anyway and 6.8 isn't an LTS so no great rush)
so fine to just queue it up again for 6.9 with this fix.

As Steve said, tp_printk is a hack (a very useful one) and
hopefully no one runs it in production.

Jonathan



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Ira Weiny
Ira Weiny wrote:
> Jonathan Cameron wrote:
> > On Wed, 14 Feb 2024 10:23:10 -0500
> > Steven Rostedt  wrote:
> > 
> > > On Wed, 14 Feb 2024 12:11:53 +
> > > Jonathan Cameron  wrote:
> > > 
> > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > land and
> > > > assume this will be resolved as well?  
> > > 
> > > That pretty much sums up what I was about to say ;-)
> > > 
> > > tp_printk is more of a hack and not to be used sparingly. With the right
> > > trace events it can hang the machine.
> > > 
> > > So, you can use your internal patch locally, but I would recommend waiting
> > > for the new printk changes to land.
> 
> Steven, Do you think that will land in 6.9?
> 
> > >
> > > I'm really hoping that will be soon!
> > > 
> > > -- Steve
> > 
> > Thanks Steve,
> > 
> > Ira's fix is needed for other valid locking reasons - this was 'just 
> > another'
> > lock debugging report that came up whilst testing it.
> > 
> > For this patch (not a potential additional one that we aren't going to do ;)
> > 
> > Tested-by: Jonathan Cameron 
> > Reviewed-by: Jonathan Cameron 
> 
> Jonathan,
> 
> Again thanks for the testing!  However, Dan and I just discussed this and
> he has an uneasy feeling about going forward with this for 6.8 final.
> 
> If we revert the following patch I can squash this fix and wait for the
> tp_printk() fix to land in 6.9 and resubmit.
> 
> Dan here is the patch which backs out the actual bug:
> 
>   Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 

Unfortunately this is not the only patch.

We need to revert this too:

Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") 

And then revert ...
Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 

... but there is a conflict.

Dan, below is the correct revert patch.  Let me know if you need more.

Ira

commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab
Author: Ira Weiny 
Date:   Wed Feb 14 15:25:24 2024 -0800

Revert "acpi/ghes: Process CXL Component Events"

This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5.

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index fe825a432c5b..ab2a82cb1b0b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -674,52 +673,6 @@ static void ghes_defer_non_standard_event(struct 
acpi_hest_generic_data *gdata,
schedule_work(>work);
 }
 
-/*
- * Only a single callback can be registered for CXL CPER events.
- */
-static DECLARE_RWSEM(cxl_cper_rw_sem);
-static cxl_cper_callback cper_callback;
-
-static void cxl_cper_post_event(enum cxl_event_type event_type,
-   struct cxl_cper_event_rec *rec)
-{
-   if (rec->hdr.length <= sizeof(rec->hdr) ||
-   rec->hdr.length > sizeof(*rec)) {
-   pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n",
-  rec->hdr.length);
-   return;
-   }
-
-   if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) {
-   pr_err(FW_WARN "CXL CPER invalid event\n");
-   return;
-   }
-
-   guard(rwsem_read)(_cper_rw_sem);
-   if (cper_callback)
-   cper_callback(event_type, rec);
-}
-
-int cxl_cper_register_callback(cxl_cper_callback callback)
-{
-   guard(rwsem_write)(_cper_rw_sem);
-   if (cper_callback)
-   return -EINVAL;
-   cper_callback = callback;
-   return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_callback, CXL);
-
-int cxl_cper_unregister_callback(cxl_cper_callback callback)
-{
-   guard(rwsem_write)(_cper_rw_sem);
-   if (callback != cper_callback)
-   return -EINVAL;
-   cper_callback = NULL;
-   return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL);
-
 static bool ghes_do_proc(struct ghes *ghes,
 const struct acpi_hest_generic_status *estatus)
 {
@@ -754,22 +707,6 @@ static bool ghes_do_proc(struct ghes *ghes,
}
else if (guid_equal(sec_type, _SEC_PROC_ARM)) {
queued = ghes_handle_arm_hw_error(gdata, sev, sync);
-   } else if (guid_equal(sec_type, _SEC_CXL_GEN_MEDIA_GUID)) {
-   struct cxl_cper_event_rec *rec =
-   acpi_hest_get_payload(gdata);
-
-   cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec);
-   } else if (guid_equal(sec_type, _SEC_CXL_DRAM_GUID)) {
-   struct cxl_cper_event_rec *rec =
-   acpi_hest_get_payload(gdata);
-
-   cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec);
-   } else if (guid_equal(sec_type,
- _SEC_CXL_MEM_MODULE_GUID)) {
-   struct cxl_cper_event_rec *rec =
-   

Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Steven Rostedt
On Wed, 14 Feb 2024 14:19:19 -0800
Ira Weiny  wrote:

> > > Jonathan Cameron  wrote:
> > >   
> > > > So I'm thinking this is a won't fix - wait for the printk rework to 
> > > > land and
> > > > assume this will be resolved as well?
> > > 
> > > That pretty much sums up what I was about to say ;-)
> > > 
> > > tp_printk is more of a hack and not to be used sparingly. With the right
> > > trace events it can hang the machine.
> > > 
> > > So, you can use your internal patch locally, but I would recommend waiting
> > > for the new printk changes to land.  
> 
> Steven, Do you think that will land in 6.9?
> 
> > >
> > > I'm really hoping that will be soon!
> > > 

I may be like Jon Corbet predicting RT will land in mainline if I do.

-- Steve



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Ira Weiny
Jonathan Cameron wrote:
> On Wed, 14 Feb 2024 10:23:10 -0500
> Steven Rostedt  wrote:
> 
> > On Wed, 14 Feb 2024 12:11:53 +
> > Jonathan Cameron  wrote:
> > 
> > > So I'm thinking this is a won't fix - wait for the printk rework to land 
> > > and
> > > assume this will be resolved as well?  
> > 
> > That pretty much sums up what I was about to say ;-)
> > 
> > tp_printk is more of a hack and not to be used sparingly. With the right
> > trace events it can hang the machine.
> > 
> > So, you can use your internal patch locally, but I would recommend waiting
> > for the new printk changes to land.

Steven, Do you think that will land in 6.9?

> >
> > I'm really hoping that will be soon!
> > 
> > -- Steve
> 
> Thanks Steve,
> 
> Ira's fix is needed for other valid locking reasons - this was 'just another'
> lock debugging report that came up whilst testing it.
> 
> For this patch (not a potential additional one that we aren't going to do ;)
> 
> Tested-by: Jonathan Cameron 
> Reviewed-by: Jonathan Cameron 

Jonathan,

Again thanks for the testing!  However, Dan and I just discussed this and
he has an uneasy feeling about going forward with this for 6.8 final.

If we revert the following patch I can squash this fix and wait for the
tp_printk() fix to land in 6.9 and resubmit.

Dan here is the patch which backs out the actual bug:

Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 

Thanks everyone,
Ira



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Ira Weiny
Jonathan Cameron wrote:
> On Wed, 14 Feb 2024 10:23:10 -0500
> Steven Rostedt  wrote:
> 
> > On Wed, 14 Feb 2024 12:11:53 +
> > Jonathan Cameron  wrote:
> > 
> > > So I'm thinking this is a won't fix - wait for the printk rework to land 
> > > and
> > > assume this will be resolved as well?  
> > 
> > That pretty much sums up what I was about to say ;-)
> > 
> > tp_printk is more of a hack and not to be used sparingly. With the right
> > trace events it can hang the machine.
> > 
> > So, you can use your internal patch locally, but I would recommend waiting
> > for the new printk changes to land. I'm really hoping that will be soon!
> > 
> > -- Steve
> 
> Thanks Steve,
> 
> Ira's fix is needed for other valid locking reasons - this was 'just another'
> lock debugging report that came up whilst testing it.
> 
> For this patch (not a potential additional one that we aren't going to do ;)
> 
> Tested-by: Jonathan Cameron 
> Reviewed-by: Jonathan Cameron 

Thanks Jonathan!  I really appreciate the testing,
Ira



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Jonathan Cameron
On Wed, 14 Feb 2024 10:23:10 -0500
Steven Rostedt  wrote:

> On Wed, 14 Feb 2024 12:11:53 +
> Jonathan Cameron  wrote:
> 
> > So I'm thinking this is a won't fix - wait for the printk rework to land and
> > assume this will be resolved as well?  
> 
> That pretty much sums up what I was about to say ;-)
> 
> tp_printk is more of a hack and not to be used sparingly. With the right
> trace events it can hang the machine.
> 
> So, you can use your internal patch locally, but I would recommend waiting
> for the new printk changes to land. I'm really hoping that will be soon!
> 
> -- Steve

Thanks Steve,

Ira's fix is needed for other valid locking reasons - this was 'just another'
lock debugging report that came up whilst testing it.

For this patch (not a potential additional one that we aren't going to do ;)

Tested-by: Jonathan Cameron 
Reviewed-by: Jonathan Cameron 



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Ira Weiny
Ira Weiny wrote:
> Jonathan Cameron wrote:
> > On Tue, 06 Feb 2024 14:15:32 -0800
> > Ira Weiny  wrote:
> > 
> >

[snip]

> > 
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 9ff8a439d674..7ee45f22f56f 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -2957,7 +2957,7 @@ static void output_printk(struct trace_event_buffer 
> > *fbuffer)
> > iter->ent = fbuffer->entry;
> > event_call->event.funcs->trace(iter, 0, event);
> > trace_seq_putc(>seq, 0);
> > -   printk("%s", iter->seq.buffer);
> > +   printk_deferred("%s", iter->seq.buffer);
> > 
> > raw_spin_unlock_irqrestore(_iter_lock, flags);
> >  }
> > 
> > My assumption is similar views will apply here to Peter Zijlstra's comment 
> > on
> > not using printk_deferred.
> > 
> > https://lore.kernel.org/all/20231010141244.gm...@noisy.programming.kicks-ass.net/
> > 
> > Note I also tried the hacks Peter links to from there. They trip issues in 
> > the initial
> > CPER print - so I assume not appropriate here.
> > 
> > So I'm thinking this is a won't fix - wait for the printk rework to land and
> > assume this will be resolved as well?
> > 
> 
> Or could we avoid the situation entirely by using a static call?
> 
> The work queue still needs to be created because of the atomicness of
> ghes_do_proc() but it avoids cxl_cper_rw_sem.
> 
> I think the hardest thing may be writing the commit message to explain all
> this.  :-(
> 

Never mind, I already went down that path.  I was surprised I did not
mention it in this commit message.  I did in V1.  :-(

"A static call was considered but ARM does not select HAVE_STATIC_CALL
and in that case setting the function pointer uses a RW semaphore."
-- 
https://lore.kernel.org/all/20240202-cxl-cper-smatch-v1-1-7a4103c7f...@intel.com/

Should have carried that through.

So should we revert ...

Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") 

... and wait for the printk fix or just move forward with this patch?

Sorry for the noise,
Ira



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Ira Weiny
Jonathan Cameron wrote:
> On Tue, 06 Feb 2024 14:15:32 -0800
> Ira Weiny  wrote:
> 
> > Smatch caught that cxl_cper_post_event() is called with a spinlock held
> > or preemption disabled.[1]  The callback takes the device lock to
> > perform address translation and therefore might sleep.  The record data
> > is released back to BIOS in ghes_clear_estatus() which requires it to be
> > copied for use in the workqueue.
> > 
> > Copy the record to a lockless list and schedule a work item to process
> > the record outside of atomic context.
> > 
> > [1] 
> > https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/
> > 
> > Reported-by: Dan Carpenter 
> > Signed-off-by: Ira Weiny 
> 
> +CC tracing folk for the splat below (the second one as first one is fixed!)
> 
> Lock debugging is slow (on an emulated machine) :(
> Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch
> which is only one I've put out with the FW first injection patches so far
> 
> For reference without this patch I got a nice spat identifying the original 
> bug.
> So far so good.
> 
> With this patch (and tp_printk in command line and trace points enabled)
> [  193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware 
> Error Source: 1
> [  193.049636] {1}[Hardware Error]: event severity: recoverable
> [  193.050472] {1}[Hardware Error]:  Error 0, type: recoverable
> [  193.051337] {1}[Hardware Error]:   section type: unknown, 
> fbcd0a77-c260-417f-85a9-088b1621eba6
> [  193.052270] {1}[Hardware Error]:   section length: 0x90
> [  193.053402] {1}[Hardware Error]:   : 0090 0007  
> 0d938086  
> [  193.055036] {1}[Hardware Error]:   0010: 000e  0005 
>   
> [  193.058592] {1}[Hardware Error]:   0020: 0180  1626fa24 
> 17b3b158  $.&.X...
> [  193.062289] {1}[Hardware Error]:   0030:    
>   
> [  193.065959] {1}[Hardware Error]:   0040: 07d0  0fc00307 
> 05210300  ..!.
> [  193.069782] {1}[Hardware Error]:   0050: 7269 6d207361 00326d65 
>   ..iras mem2.
> [  193.072760] {1}[Hardware Error]:   0060:    
>   
> [  193.074062] {1}[Hardware Error]:   0070:    
>   
> [  193.075346] {1}[Hardware Error]:   0080:    
>   
> 
> I rebased after this so now we get the smaller print - but that's not really 
> relevant here.
> 
> [  193.086589] cxl_general_media: memdev=mem1 host=:0e:00.0 serial=5 
> log=Warning : time=1707903675590441508 
> uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 
> related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' 
> descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' 
> type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 
> 61 73 20 6d 65 6d 32 00 00 00 00 00 00 00 
> validity_flags='CHANNEL|RANK|DEVICE|COMPONENT'
> [  193.087181]
>   
>   
>   
> [  193.087361] =
> [  193.087399] [ BUG: Invalid wait context ]
> [  193.087504] 6.8.0-rc3 #1200 Not tainted
> [  193.087660] -
> [  193.087858] kworker/3:0/31 is trying to lock:
> [  193.087966] c0ce1898 (_lock_key){-.-.}-{3:3}, at: 
> pl011_console_write+0x148/0x1c8
> [  193.089754] other info that might help us debug this:
> [  193.089820] context-{5:5}[  193.089900] 8 locks held by kworker/3:0/31:
> [  193.089990]  #0: c0018738 ((wq_completion)events){+.+.}-{0:0}, at: 
> process_one_work+0x154/0x500
> [  193.090439]  #1: 800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: 
> process_one_work+0x154/0x500
> [  193.090718]  #2: 800082883310 (cxl_cper_rw_sem){}-{4:4}, at: 
> cxl_cper_work_fn+0x2c/0xb0
> [  193.091019]  #3: c0a7b1a8 (>mutex){}-{4:4}, at: 
> pci_dev_lock+0x28/0x48
> [  193.091431]  #4: 800082738f18 (tracepoint_iter_lock){}-{2:2}, at: 
> trace_event_buffer_commit+0xd8/0x2c8
> [  193.091772]  #5: 8000826b3ce8 (console_lock){+.+.}-{0:0}, at: 
> vprintk_emit+0x124/0x398
> [  193.092031]  #6: 8000826b3d40 (console_srcu){}-{0:0}, at: 
> console_flush_all+0x88/0x4b0
> [  193.092363]  #7: 8000826b3ef8 (console_owner){}-{0:0}, at: 
> console_flush_all+0x1bc/0x4b0
> [  193.092799] stack backtrace:
> [  193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200
> [  193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown
> [  193.093468] Workqueue: events cxl_cper_work_fn
> [  193.093782] Call trace:
> [  

Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Steven Rostedt
On Wed, 14 Feb 2024 12:11:53 +
Jonathan Cameron  wrote:

> So I'm thinking this is a won't fix - wait for the printk rework to land and
> assume this will be resolved as well?

That pretty much sums up what I was about to say ;-)

tp_printk is more of a hack and not to be used sparingly. With the right
trace events it can hang the machine.

So, you can use your internal patch locally, but I would recommend waiting
for the new printk changes to land. I'm really hoping that will be soon!

-- Steve



Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

2024-02-14 Thread Jonathan Cameron
On Tue, 06 Feb 2024 14:15:32 -0800
Ira Weiny  wrote:

> Smatch caught that cxl_cper_post_event() is called with a spinlock held
> or preemption disabled.[1]  The callback takes the device lock to
> perform address translation and therefore might sleep.  The record data
> is released back to BIOS in ghes_clear_estatus() which requires it to be
> copied for use in the workqueue.
> 
> Copy the record to a lockless list and schedule a work item to process
> the record outside of atomic context.
> 
> [1] 
> https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/
> 
> Reported-by: Dan Carpenter 
> Signed-off-by: Ira Weiny 

+CC tracing folk for the splat below (the second one as first one is fixed!)

Lock debugging is slow (on an emulated machine) :(
Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch
which is only one I've put out with the FW first injection patches so far

For reference without this patch I got a nice spat identifying the original bug.
So far so good.

With this patch (and tp_printk in command line and trace points enabled)
[  193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware 
Error Source: 1
[  193.049636] {1}[Hardware Error]: event severity: recoverable
[  193.050472] {1}[Hardware Error]:  Error 0, type: recoverable
[  193.051337] {1}[Hardware Error]:   section type: unknown, 
fbcd0a77-c260-417f-85a9-088b1621eba6
[  193.052270] {1}[Hardware Error]:   section length: 0x90
[  193.053402] {1}[Hardware Error]:   : 0090 0007  
0d938086  
[  193.055036] {1}[Hardware Error]:   0010: 000e  0005 
  
[  193.058592] {1}[Hardware Error]:   0020: 0180  1626fa24 
17b3b158  $.&.X...
[  193.062289] {1}[Hardware Error]:   0030:    
  
[  193.065959] {1}[Hardware Error]:   0040: 07d0  0fc00307 
05210300  ..!.
[  193.069782] {1}[Hardware Error]:   0050: 7269 6d207361 00326d65 
  ..iras mem2.
[  193.072760] {1}[Hardware Error]:   0060:    
  
[  193.074062] {1}[Hardware Error]:   0070:    
  
[  193.075346] {1}[Hardware Error]:   0080:    
  

I rebased after this so now we get the smaller print - but that's not really 
relevant here.

[  193.086589] cxl_general_media: memdev=mem1 host=:0e:00.0 serial=5 
log=Warning : time=1707903675590441508 
uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 
related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' 
descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' 
type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 61 
73 20 6d 65 6d 32 00 00 00 00 00 00 00 
validity_flags='CHANNEL|RANK|DEVICE|COMPONENT'
[  193.087181]  



[  193.087361] =
[  193.087399] [ BUG: Invalid wait context ]
[  193.087504] 6.8.0-rc3 #1200 Not tainted
[  193.087660] -
[  193.087858] kworker/3:0/31 is trying to lock:
[  193.087966] c0ce1898 (_lock_key){-.-.}-{3:3}, at: 
pl011_console_write+0x148/0x1c8
[  193.089754] other info that might help us debug this:
[  193.089820] context-{5:5}[  193.089900] 8 locks held by kworker/3:0/31:
[  193.089990]  #0: c0018738 ((wq_completion)events){+.+.}-{0:0}, at: 
process_one_work+0x154/0x500
[  193.090439]  #1: 800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: 
process_one_work+0x154/0x500
[  193.090718]  #2: 800082883310 (cxl_cper_rw_sem){}-{4:4}, at: 
cxl_cper_work_fn+0x2c/0xb0
[  193.091019]  #3: c0a7b1a8 (>mutex){}-{4:4}, at: 
pci_dev_lock+0x28/0x48
[  193.091431]  #4: 800082738f18 (tracepoint_iter_lock){}-{2:2}, at: 
trace_event_buffer_commit+0xd8/0x2c8
[  193.091772]  #5: 8000826b3ce8 (console_lock){+.+.}-{0:0}, at: 
vprintk_emit+0x124/0x398
[  193.092031]  #6: 8000826b3d40 (console_srcu){}-{0:0}, at: 
console_flush_all+0x88/0x4b0
[  193.092363]  #7: 8000826b3ef8 (console_owner){}-{0:0}, at: 
console_flush_all+0x1bc/0x4b0
[  193.092799] stack backtrace:
[  193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200
[  193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown
[  193.093468] Workqueue: events cxl_cper_work_fn
[  193.093782] Call trace:
[  193.094004]  dump_backtrace+0xa4/0x130
[  193.094145]  show_stack+0x20/0x38
[  193.094231]  dump_stack_lvl+0x60/0xb0
[  193.094315]  dump_stack+0x18/0x28
[  193.094395]  __lock_acquire+0x9e8/0x1ee8
[  193.094477]