Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Fri, Feb 16, 2024 at 05:17:20PM -0800, Dan Williams wrote: > > commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab > > Author: Ira Weiny > > Date: Wed Feb 14 15:25:24 2024 -0800 > > > > Revert "acpi/ghes: Process CXL Component Events" > > > > This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5. > > Even reverts need changelogs, I can add one. I got conflicts trying to > apply this to current fixes branch. I think I am going to just > surgically backout the drivers/acpi/apei/ghes.c changes. The `git revert` command comes from ancient times and it just sets people up for failure... :/ regards, dan carpenter
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote: > Ira Weiny wrote: > > Jonathan Cameron wrote: > > > On Wed, 14 Feb 2024 10:23:10 -0500 > > > Steven Rostedt wrote: > > > > > > > On Wed, 14 Feb 2024 12:11:53 + > > > > Jonathan Cameron wrote: > > > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > > land and > > > > > assume this will be resolved as well? > > > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > > > trace events it can hang the machine. > > > > > > > > So, you can use your internal patch locally, but I would recommend > > > > waiting > > > > for the new printk changes to land. > > > > Steven, Do you think that will land in 6.9? > > > > > > > > > > I'm really hoping that will be soon! > > > > > > > > -- Steve > > > > > > Thanks Steve, > > > > > > Ira's fix is needed for other valid locking reasons - this was 'just > > > another' > > > lock debugging report that came up whilst testing it. > > > > > > For this patch (not a potential additional one that we aren't going to do > > > ;) > > > > > > Tested-by: Jonathan Cameron > > > Reviewed-by: Jonathan Cameron > > > > Jonathan, > > > > Again thanks for the testing! However, Dan and I just discussed this and > > he has an uneasy feeling about going forward with this for 6.8 final. > > > > If we revert the following patch I can squash this fix and wait for the > > tp_printk() fix to land in 6.9 and resubmit. > > > > Dan here is the patch which backs out the actual bug: > > > > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") > > Unfortunately this is not the only patch. > > We need to revert this too: > > Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") > > And then revert ... > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") > > ... but there is a conflict. > > Dan, below is the correct revert patch. Let me know if you need more. > > Ira > > commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab > Author: Ira Weiny > Date: Wed Feb 14 15:25:24 2024 -0800 > > Revert "acpi/ghes: Process CXL Component Events" > > This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5. Even reverts need changelogs, I can add one. I got conflicts trying to apply this to current fixes branch. I think I am going to just surgically backout the drivers/acpi/apei/ghes.c changes.
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote: [..] > > As Steve said, tp_printk is a hack (a very useful one) and > > hopefully no one runs it in production. > > OMG... I did not realize what tp_printk() was exactly. I should have > looked closer. > > Do we have evidence of its use in production? > > I would love to not have to revert/respin, The revert is for 2 non-trivial fixes needed in one new feature, lets just circle back and get it right for v6.9. The tp_printk() was not the final straw for me.
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Jonathan Cameron wrote: > On Wed, 14 Feb 2024 17:33:18 -0500 > Steven Rostedt wrote: > > > On Wed, 14 Feb 2024 14:19:19 -0800 > > Ira Weiny wrote: > > > > > > > Jonathan Cameron wrote: > > > > > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > > > land and > > > > > > assume this will be resolved as well? > > > > > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > > > > > tp_printk is more of a hack and not to be used sparingly. With the > > > > > right > > > > > trace events it can hang the machine. > > > > > > > > > > So, you can use your internal patch locally, but I would recommend > > > > > waiting > > > > > for the new printk changes to land. > > > > > > Steven, Do you think that will land in 6.9? > > > > > > > > > > > > > I'm really hoping that will be soon! > > > > > > > > > I may be like Jon Corbet predicting RT will land in mainline if I do. > > > > -- Steve > > > > > Agreed. Don't wait on printk fixes landing. (Well unless you are sure > it's the year of the Linux desktop.) Reverting is fine for 6.8 > if you and Dan feel it's unwise to take this forwards (all the distros > will backport it anyway and 6.8 isn't an LTS so no great rush) > so fine to just queue it up again for 6.9 with this fix. > > As Steve said, tp_printk is a hack (a very useful one) and > hopefully no one runs it in production. OMG... I did not realize what tp_printk() was exactly. I should have looked closer. Do we have evidence of its use in production? I would love to not have to revert/respin, Ira
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Wed, 14 Feb 2024 17:33:18 -0500 Steven Rostedt wrote: > On Wed, 14 Feb 2024 14:19:19 -0800 > Ira Weiny wrote: > > > > > Jonathan Cameron wrote: > > > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > > land and > > > > > assume this will be resolved as well? > > > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > > > trace events it can hang the machine. > > > > > > > > So, you can use your internal patch locally, but I would recommend > > > > waiting > > > > for the new printk changes to land. > > > > Steven, Do you think that will land in 6.9? > > > > > > > > > > I'm really hoping that will be soon! > > > > > > I may be like Jon Corbet predicting RT will land in mainline if I do. > > -- Steve > Agreed. Don't wait on printk fixes landing. (Well unless you are sure it's the year of the Linux desktop.) Reverting is fine for 6.8 if you and Dan feel it's unwise to take this forwards (all the distros will backport it anyway and 6.8 isn't an LTS so no great rush) so fine to just queue it up again for 6.9 with this fix. As Steve said, tp_printk is a hack (a very useful one) and hopefully no one runs it in production. Jonathan
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote: > Jonathan Cameron wrote: > > On Wed, 14 Feb 2024 10:23:10 -0500 > > Steven Rostedt wrote: > > > > > On Wed, 14 Feb 2024 12:11:53 + > > > Jonathan Cameron wrote: > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > land and > > > > assume this will be resolved as well? > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > > trace events it can hang the machine. > > > > > > So, you can use your internal patch locally, but I would recommend waiting > > > for the new printk changes to land. > > Steven, Do you think that will land in 6.9? > > > > > > > I'm really hoping that will be soon! > > > > > > -- Steve > > > > Thanks Steve, > > > > Ira's fix is needed for other valid locking reasons - this was 'just > > another' > > lock debugging report that came up whilst testing it. > > > > For this patch (not a potential additional one that we aren't going to do ;) > > > > Tested-by: Jonathan Cameron > > Reviewed-by: Jonathan Cameron > > Jonathan, > > Again thanks for the testing! However, Dan and I just discussed this and > he has an uneasy feeling about going forward with this for 6.8 final. > > If we revert the following patch I can squash this fix and wait for the > tp_printk() fix to land in 6.9 and resubmit. > > Dan here is the patch which backs out the actual bug: > > Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") Unfortunately this is not the only patch. We need to revert this too: Fixes: dc97f6344f20 ("cxl/pci: Register for and process CPER events") And then revert ... Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") ... but there is a conflict. Dan, below is the correct revert patch. Let me know if you need more. Ira commit 807fbe9cac9b190dab83e3ff377a30d18859c8ab Author: Ira Weiny Date: Wed Feb 14 15:25:24 2024 -0800 Revert "acpi/ghes: Process CXL Component Events" This reverts commit 671a794c33c6e048ca5cedd5ad6af44d52d5d7e5. diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index fe825a432c5b..ab2a82cb1b0b 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -26,7 +26,6 @@ #include #include #include -#include #include #include #include @@ -674,52 +673,6 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata, schedule_work(>work); } -/* - * Only a single callback can be registered for CXL CPER events. - */ -static DECLARE_RWSEM(cxl_cper_rw_sem); -static cxl_cper_callback cper_callback; - -static void cxl_cper_post_event(enum cxl_event_type event_type, - struct cxl_cper_event_rec *rec) -{ - if (rec->hdr.length <= sizeof(rec->hdr) || - rec->hdr.length > sizeof(*rec)) { - pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n", - rec->hdr.length); - return; - } - - if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) { - pr_err(FW_WARN "CXL CPER invalid event\n"); - return; - } - - guard(rwsem_read)(_cper_rw_sem); - if (cper_callback) - cper_callback(event_type, rec); -} - -int cxl_cper_register_callback(cxl_cper_callback callback) -{ - guard(rwsem_write)(_cper_rw_sem); - if (cper_callback) - return -EINVAL; - cper_callback = callback; - return 0; -} -EXPORT_SYMBOL_NS_GPL(cxl_cper_register_callback, CXL); - -int cxl_cper_unregister_callback(cxl_cper_callback callback) -{ - guard(rwsem_write)(_cper_rw_sem); - if (callback != cper_callback) - return -EINVAL; - cper_callback = NULL; - return 0; -} -EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL); - static bool ghes_do_proc(struct ghes *ghes, const struct acpi_hest_generic_status *estatus) { @@ -754,22 +707,6 @@ static bool ghes_do_proc(struct ghes *ghes, } else if (guid_equal(sec_type, _SEC_PROC_ARM)) { queued = ghes_handle_arm_hw_error(gdata, sev, sync); - } else if (guid_equal(sec_type, _SEC_CXL_GEN_MEDIA_GUID)) { - struct cxl_cper_event_rec *rec = - acpi_hest_get_payload(gdata); - - cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec); - } else if (guid_equal(sec_type, _SEC_CXL_DRAM_GUID)) { - struct cxl_cper_event_rec *rec = - acpi_hest_get_payload(gdata); - - cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec); - } else if (guid_equal(sec_type, - _SEC_CXL_MEM_MODULE_GUID)) { - struct cxl_cper_event_rec *rec = -
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Wed, 14 Feb 2024 14:19:19 -0800 Ira Weiny wrote: > > > Jonathan Cameron wrote: > > > > > > > So I'm thinking this is a won't fix - wait for the printk rework to > > > > land and > > > > assume this will be resolved as well? > > > > > > That pretty much sums up what I was about to say ;-) > > > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > > trace events it can hang the machine. > > > > > > So, you can use your internal patch locally, but I would recommend waiting > > > for the new printk changes to land. > > Steven, Do you think that will land in 6.9? > > > > > > > I'm really hoping that will be soon! > > > I may be like Jon Corbet predicting RT will land in mainline if I do. -- Steve
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Jonathan Cameron wrote: > On Wed, 14 Feb 2024 10:23:10 -0500 > Steven Rostedt wrote: > > > On Wed, 14 Feb 2024 12:11:53 + > > Jonathan Cameron wrote: > > > > > So I'm thinking this is a won't fix - wait for the printk rework to land > > > and > > > assume this will be resolved as well? > > > > That pretty much sums up what I was about to say ;-) > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > trace events it can hang the machine. > > > > So, you can use your internal patch locally, but I would recommend waiting > > for the new printk changes to land. Steven, Do you think that will land in 6.9? > > > > I'm really hoping that will be soon! > > > > -- Steve > > Thanks Steve, > > Ira's fix is needed for other valid locking reasons - this was 'just another' > lock debugging report that came up whilst testing it. > > For this patch (not a potential additional one that we aren't going to do ;) > > Tested-by: Jonathan Cameron > Reviewed-by: Jonathan Cameron Jonathan, Again thanks for the testing! However, Dan and I just discussed this and he has an uneasy feeling about going forward with this for 6.8 final. If we revert the following patch I can squash this fix and wait for the tp_printk() fix to land in 6.9 and resubmit. Dan here is the patch which backs out the actual bug: Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") Thanks everyone, Ira
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Jonathan Cameron wrote: > On Wed, 14 Feb 2024 10:23:10 -0500 > Steven Rostedt wrote: > > > On Wed, 14 Feb 2024 12:11:53 + > > Jonathan Cameron wrote: > > > > > So I'm thinking this is a won't fix - wait for the printk rework to land > > > and > > > assume this will be resolved as well? > > > > That pretty much sums up what I was about to say ;-) > > > > tp_printk is more of a hack and not to be used sparingly. With the right > > trace events it can hang the machine. > > > > So, you can use your internal patch locally, but I would recommend waiting > > for the new printk changes to land. I'm really hoping that will be soon! > > > > -- Steve > > Thanks Steve, > > Ira's fix is needed for other valid locking reasons - this was 'just another' > lock debugging report that came up whilst testing it. > > For this patch (not a potential additional one that we aren't going to do ;) > > Tested-by: Jonathan Cameron > Reviewed-by: Jonathan Cameron Thanks Jonathan! I really appreciate the testing, Ira
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Wed, 14 Feb 2024 10:23:10 -0500 Steven Rostedt wrote: > On Wed, 14 Feb 2024 12:11:53 + > Jonathan Cameron wrote: > > > So I'm thinking this is a won't fix - wait for the printk rework to land and > > assume this will be resolved as well? > > That pretty much sums up what I was about to say ;-) > > tp_printk is more of a hack and not to be used sparingly. With the right > trace events it can hang the machine. > > So, you can use your internal patch locally, but I would recommend waiting > for the new printk changes to land. I'm really hoping that will be soon! > > -- Steve Thanks Steve, Ira's fix is needed for other valid locking reasons - this was 'just another' lock debugging report that came up whilst testing it. For this patch (not a potential additional one that we aren't going to do ;) Tested-by: Jonathan Cameron Reviewed-by: Jonathan Cameron
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Ira Weiny wrote: > Jonathan Cameron wrote: > > On Tue, 06 Feb 2024 14:15:32 -0800 > > Ira Weiny wrote: > > > > [snip] > > > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index 9ff8a439d674..7ee45f22f56f 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -2957,7 +2957,7 @@ static void output_printk(struct trace_event_buffer > > *fbuffer) > > iter->ent = fbuffer->entry; > > event_call->event.funcs->trace(iter, 0, event); > > trace_seq_putc(>seq, 0); > > - printk("%s", iter->seq.buffer); > > + printk_deferred("%s", iter->seq.buffer); > > > > raw_spin_unlock_irqrestore(_iter_lock, flags); > > } > > > > My assumption is similar views will apply here to Peter Zijlstra's comment > > on > > not using printk_deferred. > > > > https://lore.kernel.org/all/20231010141244.gm...@noisy.programming.kicks-ass.net/ > > > > Note I also tried the hacks Peter links to from there. They trip issues in > > the initial > > CPER print - so I assume not appropriate here. > > > > So I'm thinking this is a won't fix - wait for the printk rework to land and > > assume this will be resolved as well? > > > > Or could we avoid the situation entirely by using a static call? > > The work queue still needs to be created because of the atomicness of > ghes_do_proc() but it avoids cxl_cper_rw_sem. > > I think the hardest thing may be writing the commit message to explain all > this. :-( > Never mind, I already went down that path. I was surprised I did not mention it in this commit message. I did in V1. :-( "A static call was considered but ARM does not select HAVE_STATIC_CALL and in that case setting the function pointer uses a RW semaphore." -- https://lore.kernel.org/all/20240202-cxl-cper-smatch-v1-1-7a4103c7f...@intel.com/ Should have carried that through. So should we revert ... Fixes: 671a794c33c6 ("acpi/ghes: Process CXL Component Events") ... and wait for the printk fix or just move forward with this patch? Sorry for the noise, Ira
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
Jonathan Cameron wrote: > On Tue, 06 Feb 2024 14:15:32 -0800 > Ira Weiny wrote: > > > Smatch caught that cxl_cper_post_event() is called with a spinlock held > > or preemption disabled.[1] The callback takes the device lock to > > perform address translation and therefore might sleep. The record data > > is released back to BIOS in ghes_clear_estatus() which requires it to be > > copied for use in the workqueue. > > > > Copy the record to a lockless list and schedule a work item to process > > the record outside of atomic context. > > > > [1] > > https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/ > > > > Reported-by: Dan Carpenter > > Signed-off-by: Ira Weiny > > +CC tracing folk for the splat below (the second one as first one is fixed!) > > Lock debugging is slow (on an emulated machine) :( > Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch > which is only one I've put out with the FW first injection patches so far > > For reference without this patch I got a nice spat identifying the original > bug. > So far so good. > > With this patch (and tp_printk in command line and trace points enabled) > [ 193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware > Error Source: 1 > [ 193.049636] {1}[Hardware Error]: event severity: recoverable > [ 193.050472] {1}[Hardware Error]: Error 0, type: recoverable > [ 193.051337] {1}[Hardware Error]: section type: unknown, > fbcd0a77-c260-417f-85a9-088b1621eba6 > [ 193.052270] {1}[Hardware Error]: section length: 0x90 > [ 193.053402] {1}[Hardware Error]: : 0090 0007 > 0d938086 > [ 193.055036] {1}[Hardware Error]: 0010: 000e 0005 > > [ 193.058592] {1}[Hardware Error]: 0020: 0180 1626fa24 > 17b3b158 $.&.X... > [ 193.062289] {1}[Hardware Error]: 0030: > > [ 193.065959] {1}[Hardware Error]: 0040: 07d0 0fc00307 > 05210300 ..!. > [ 193.069782] {1}[Hardware Error]: 0050: 7269 6d207361 00326d65 > ..iras mem2. > [ 193.072760] {1}[Hardware Error]: 0060: > > [ 193.074062] {1}[Hardware Error]: 0070: > > [ 193.075346] {1}[Hardware Error]: 0080: > > > I rebased after this so now we get the smaller print - but that's not really > relevant here. > > [ 193.086589] cxl_general_media: memdev=mem1 host=:0e:00.0 serial=5 > log=Warning : time=1707903675590441508 > uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 > related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' > descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' > type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 > 61 73 20 6d 65 6d 32 00 00 00 00 00 00 00 > validity_flags='CHANNEL|RANK|DEVICE|COMPONENT' > [ 193.087181] > > > > [ 193.087361] = > [ 193.087399] [ BUG: Invalid wait context ] > [ 193.087504] 6.8.0-rc3 #1200 Not tainted > [ 193.087660] - > [ 193.087858] kworker/3:0/31 is trying to lock: > [ 193.087966] c0ce1898 (_lock_key){-.-.}-{3:3}, at: > pl011_console_write+0x148/0x1c8 > [ 193.089754] other info that might help us debug this: > [ 193.089820] context-{5:5}[ 193.089900] 8 locks held by kworker/3:0/31: > [ 193.089990] #0: c0018738 ((wq_completion)events){+.+.}-{0:0}, at: > process_one_work+0x154/0x500 > [ 193.090439] #1: 800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: > process_one_work+0x154/0x500 > [ 193.090718] #2: 800082883310 (cxl_cper_rw_sem){}-{4:4}, at: > cxl_cper_work_fn+0x2c/0xb0 > [ 193.091019] #3: c0a7b1a8 (>mutex){}-{4:4}, at: > pci_dev_lock+0x28/0x48 > [ 193.091431] #4: 800082738f18 (tracepoint_iter_lock){}-{2:2}, at: > trace_event_buffer_commit+0xd8/0x2c8 > [ 193.091772] #5: 8000826b3ce8 (console_lock){+.+.}-{0:0}, at: > vprintk_emit+0x124/0x398 > [ 193.092031] #6: 8000826b3d40 (console_srcu){}-{0:0}, at: > console_flush_all+0x88/0x4b0 > [ 193.092363] #7: 8000826b3ef8 (console_owner){}-{0:0}, at: > console_flush_all+0x1bc/0x4b0 > [ 193.092799] stack backtrace: > [ 193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200 > [ 193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown > [ 193.093468] Workqueue: events cxl_cper_work_fn > [ 193.093782] Call trace: > [
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Wed, 14 Feb 2024 12:11:53 + Jonathan Cameron wrote: > So I'm thinking this is a won't fix - wait for the printk rework to land and > assume this will be resolved as well? That pretty much sums up what I was about to say ;-) tp_printk is more of a hack and not to be used sparingly. With the right trace events it can hang the machine. So, you can use your internal patch locally, but I would recommend waiting for the new printk changes to land. I'm really hoping that will be soon! -- Steve
Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held
On Tue, 06 Feb 2024 14:15:32 -0800 Ira Weiny wrote: > Smatch caught that cxl_cper_post_event() is called with a spinlock held > or preemption disabled.[1] The callback takes the device lock to > perform address translation and therefore might sleep. The record data > is released back to BIOS in ghes_clear_estatus() which requires it to be > copied for use in the workqueue. > > Copy the record to a lockless list and schedule a work item to process > the record outside of atomic context. > > [1] > https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/ > > Reported-by: Dan Carpenter > Signed-off-by: Ira Weiny +CC tracing folk for the splat below (the second one as first one is fixed!) Lock debugging is slow (on an emulated machine) :( Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch which is only one I've put out with the FW first injection patches so far For reference without this patch I got a nice spat identifying the original bug. So far so good. With this patch (and tp_printk in command line and trace points enabled) [ 193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1 [ 193.049636] {1}[Hardware Error]: event severity: recoverable [ 193.050472] {1}[Hardware Error]: Error 0, type: recoverable [ 193.051337] {1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6 [ 193.052270] {1}[Hardware Error]: section length: 0x90 [ 193.053402] {1}[Hardware Error]: : 0090 0007 0d938086 [ 193.055036] {1}[Hardware Error]: 0010: 000e 0005 [ 193.058592] {1}[Hardware Error]: 0020: 0180 1626fa24 17b3b158 $.&.X... [ 193.062289] {1}[Hardware Error]: 0030: [ 193.065959] {1}[Hardware Error]: 0040: 07d0 0fc00307 05210300 ..!. [ 193.069782] {1}[Hardware Error]: 0050: 7269 6d207361 00326d65 ..iras mem2. [ 193.072760] {1}[Hardware Error]: 0060: [ 193.074062] {1}[Hardware Error]: 0070: [ 193.075346] {1}[Hardware Error]: 0080: I rebased after this so now we get the smaller print - but that's not really relevant here. [ 193.086589] cxl_general_media: memdev=mem1 host=:0e:00.0 serial=5 log=Warning : time=1707903675590441508 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 61 73 20 6d 65 6d 32 00 00 00 00 00 00 00 validity_flags='CHANNEL|RANK|DEVICE|COMPONENT' [ 193.087181] [ 193.087361] = [ 193.087399] [ BUG: Invalid wait context ] [ 193.087504] 6.8.0-rc3 #1200 Not tainted [ 193.087660] - [ 193.087858] kworker/3:0/31 is trying to lock: [ 193.087966] c0ce1898 (_lock_key){-.-.}-{3:3}, at: pl011_console_write+0x148/0x1c8 [ 193.089754] other info that might help us debug this: [ 193.089820] context-{5:5}[ 193.089900] 8 locks held by kworker/3:0/31: [ 193.089990] #0: c0018738 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x154/0x500 [ 193.090439] #1: 800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: process_one_work+0x154/0x500 [ 193.090718] #2: 800082883310 (cxl_cper_rw_sem){}-{4:4}, at: cxl_cper_work_fn+0x2c/0xb0 [ 193.091019] #3: c0a7b1a8 (>mutex){}-{4:4}, at: pci_dev_lock+0x28/0x48 [ 193.091431] #4: 800082738f18 (tracepoint_iter_lock){}-{2:2}, at: trace_event_buffer_commit+0xd8/0x2c8 [ 193.091772] #5: 8000826b3ce8 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x124/0x398 [ 193.092031] #6: 8000826b3d40 (console_srcu){}-{0:0}, at: console_flush_all+0x88/0x4b0 [ 193.092363] #7: 8000826b3ef8 (console_owner){}-{0:0}, at: console_flush_all+0x1bc/0x4b0 [ 193.092799] stack backtrace: [ 193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200 [ 193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown [ 193.093468] Workqueue: events cxl_cper_work_fn [ 193.093782] Call trace: [ 193.094004] dump_backtrace+0xa4/0x130 [ 193.094145] show_stack+0x20/0x38 [ 193.094231] dump_stack_lvl+0x60/0xb0 [ 193.094315] dump_stack+0x18/0x28 [ 193.094395] __lock_acquire+0x9e8/0x1ee8 [ 193.094477]