On 2020-06-10 10:21:29 [+0200], Stephen Berman wrote:
> (GMX put your email into my spam folder, so I didn't see it before I
> sent my followup about removing the wifi firmware.)

okay.

> On Tue, 9 Jun 2020 22:23:39 +0200 Sebastian Andrzej Siewior 
> <bige...@linutronix.de> wrote:
> >   scripts/decode_stacktrace.sh vmlinux $(pwd)/ kernel_power_off+0x3d/0x70
> >
> > should reveal that.
> 
> I saved the call trace lines to a file and did this (with `$(pwd)/' the
> script did not run):
> 
> steve [ /sources/linux-5.6.4 ]$ ./scripts/decode_stacktrace.sh vmlinux . 
> kernel_power_off+0x3d/0x70 < /home/steve/temp/call_trace 
> [1352.***] Call Trace:
> [1352.***] ? __schedule (core.c:?) 
> [1352.***] schedule (??:?) 
> [1352.***] schedule_timeout (??:?) 
> [1352.***] ? acpi_os_release_object (??:?) 
> [1352.***] ? acpi_ut_update_object_reference (??:?) 
> [1352.***] wait_for_completion (??:?) 
> [1352.***] ? wake_up_q (??:?) 
> [1352.***] flush_workqueue (??:?) 
> [1352.***] kernel_power_off (??:?) 
> [1352.***] __do_sys_reboot (reboot.c:?) 
> [1352.***] ? do_send_sig_info (??:?) 
> [1352.***] ? kill_pid_info (??:?) 
> [1352.***] ? kill_something_info (signal.c:?) 
> [1352.***] ? __x64_sys_kill (??:?) 
> [1352.***] do_syscall_64 (??:?) 
> [1352.***] entry_SYSCALL_64_after_hwframe (??:?) 
> [1352.***] RIP: 0033:0x7f95dd6992c3
> [1352.***] Code: Bad RIP value.
> objdump: '/tmp/tmp.BlpNDdSQpG.o': No such file
> 
> Did I do something wrong or do I need to pass certain compiler flags
> when building the kernel (when I build Emacs I pass CFLAGS='-Og -g3')?

the arguments last argument should not be kernel_power_off+0x3d/0x70. I
should have been entered after "enter" but you can also pipe the whole
file as you did.
If it gives you the ?? then might need
  CONFIG_DEBUG_INFO

> > The system shutdowns. So you are facing something different. I am
> > surprised that the workqueue stall detector did not yell here. The patch
> > at the bottom should get rid of the sr warnings.
> 
> But the cdrom seems to have nothing to do with power-off failing, since
> it happens even when the cdrom is detached (and hence there were no sr
> warnings in the log output), or am I misunderstanding you?

Yes. I think it has nothing todo with it, it is just something
irrelevant that I do understand now. Before that trace it looked like
AHCI is failing to continue.

> By the other patch do you mean the following?  (This email was also put
> into my spam by GMX and AFAICT has not yet shown up on the vger list.)

Yes. But now that I played a little with it, I made dis:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4e01c448b4b48..e09fc41e99fc3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2777,6 +2777,8 @@ void flush_workqueue(struct workqueue_struct *wq)
        lock_map_acquire(&wq->lockdep_map);
        lock_map_release(&wq->lockdep_map);
 
+       if (system_state > SYSTEM_RUNNING)
+               pr_err("Flushing %s\n", wq->name);
        mutex_lock(&wq->mutex);
 
        /*
@@ -2826,7 +2828,18 @@ void flush_workqueue(struct workqueue_struct *wq)
 
        mutex_unlock(&wq->mutex);
 
-       wait_for_completion(&this_flusher.done);
+       if (system_state > SYSTEM_RUNNING) {
+               int ret;
+try_again:
+               ret = wait_for_completion_timeout(&this_flusher.done, 3 * HZ);
+               pr_err("Flushing %s over, %d\n", wq->name, ret);
+               if (!ret) {
+                       show_workqueue_state();
+                       goto try_again;
+               }
+       } else {
+               wait_for_completion(&this_flusher.done);
+       }
 
        /*
         * Wake-up-and-cascade phase
@@ -3039,11 +3052,18 @@ static bool __flush_work(struct work_struct *work, bool 
from_cancel)
                lock_map_release(&work->lockdep_map);
        }
 
+       if (system_state > SYSTEM_RUNNING)
+               pr_err("Flush %pS\n", work->func);
+
        if (start_flush_work(work, &barr, from_cancel)) {
                wait_for_completion(&barr.done);
                destroy_work_on_stack(&barr.work);
+               if (system_state > SYSTEM_RUNNING)
+                       pr_err("Flush %pS done\n", work->func);
                return true;
        } else {
+               if (system_state > SYSTEM_RUNNING)
+                       pr_err("Flush %pS done\n", work->func);
                return false;
        }
 }

So you should have output like:
|[   14.131696] Flushing events_freezable_power_
the flush that hangs

|[   17.203246] Flushing events_freezable_power_ over, 0
It has been noticed (0 at the end)

|[   17.203849] Showing busy workqueues and worker pools:
|[   17.204400] workqueue events_freezable_power_: flags=0x84
|[   17.205009]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[   17.205774]     in-flight: 2147:disk_events_workfn
|[   17.206268] workqueue kblockd: flags=0x18
|[   17.206682]   pwq 13: cpus=6 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
|[   17.207428]     pending: blk_mq_requeue_work
|[   17.207958] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=2s workers=3 idle: 
535 42

A dump of all pending worqueue items.

|[   20.275239] Flushing events_freezable_power_ over, 0
|[   20.275812] Showing busy workqueues and worker pools:
|[   20.276385] workqueue events_freezable_power_: flags=0x84
|[   20.276972]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[   20.277712]     in-flight: 2147:disk_events_workfn
|[   20.278245] workqueue kblockd: flags=0x18
|[   20.278685]   pwq 13: cpus=6 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
|[   20.279471]     pending: blk_mq_requeue_work
|[   20.279989] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=5s workers=3 idle: 
535 42

Another atempt. As you see `blk_mq_requeue_work' is references in both
outputs but it makes progress (it is scheduled multiple times). The item
`disk_events_workfn' stucks.

Sebastian

Reply via email to