On 2020-06-10 10:21:29 [+0200], Stephen Berman wrote:
> (GMX put your email into my spam folder, so I didn't see it before I
> sent my followup about removing the wifi firmware.)
okay.

> On Tue, 9 Jun 2020 22:23:39 +0200 Sebastian Andrzej Siewior
> <bige...@linutronix.de> wrote:
> > scripts/decode_stacktrace.sh vmlinux $(pwd)/ kernel_power_off+0x3d/0x70
> >
> > should reveal that.
>
> I saved the call trace lines to a file and did this (with `$(pwd)/' the
> script did not run):
>
> steve [ /sources/linux-5.6.4 ]$ ./scripts/decode_stacktrace.sh vmlinux . kernel_power_off+0x3d/0x70 < /home/steve/temp/call_trace
> [1352.***] Call Trace:
> [1352.***]  ? __schedule (core.c:?)
> [1352.***]  schedule (??:?)
> [1352.***]  schedule_timeout (??:?)
> [1352.***]  ? acpi_os_release_object (??:?)
> [1352.***]  ? acpi_ut_update_object_reference (??:?)
> [1352.***]  wait_for_completion (??:?)
> [1352.***]  ? wake_up_q (??:?)
> [1352.***]  flush_workqueue (??:?)
> [1352.***]  kernel_power_off (??:?)
> [1352.***]  __do_sys_reboot (reboot.c:?)
> [1352.***]  ? do_send_sig_info (??:?)
> [1352.***]  ? kill_pid_info (??:?)
> [1352.***]  ? kill_something_info (signal.c:?)
> [1352.***]  ? __x64_sys_kill (??:?)
> [1352.***]  do_syscall_64 (??:?)
> [1352.***]  entry_SYSCALL_64_after_hwframe (??:?)
> [1352.***] RIP: 0033:0x7f95dd6992c3
> [1352.***] Code: Bad RIP value.
> objdump: '/tmp/tmp.BlpNDdSQpG.o': No such file
>
> Did I do something wrong or do I need to pass certain compiler flags
> when building the kernel (when I build Emacs I pass CFLAGS='-Og -g3')?

The last argument should not be kernel_power_off+0x3d/0x70. The script
reads the trace from stdin, so that line should have been typed after
pressing enter, but you can also pipe in the whole file as you did. If
the output only gives you the (??:?) then you might need
CONFIG_DEBUG_INFO enabled.
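For reference, the corrected invocation would look roughly like this (a
sketch, assuming the trace file path from your example and that you run
it from the kernel source tree; the scripts/config lines for turning on
CONFIG_DEBUG_INFO are an untested suggestion and need a rebuild
afterwards):

  # feed the saved trace on stdin; no trace argument on the command line
  ./scripts/decode_stacktrace.sh vmlinux . < /home/steve/temp/call_trace

  # if symbols still decode as (??:?), enable debug info and rebuild
  ./scripts/config --enable DEBUG_KERNEL --enable DEBUG_INFO
  make olddefconfig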
> > The system shuts down. So you are facing something different. I am
> > surprised that the workqueue stall detector did not yell here. The patch
> > at the bottom should get rid of the sr warnings.
>
> But the cdrom seems to have nothing to do with power-off failing, since
> it happens even when the cdrom is detached (and hence there were no sr
> warnings in the log output), or am I misunderstanding you?

Yes, I think it has nothing to do with it; it is just something
irrelevant that I now understand. Before that trace it looked like AHCI
was failing to continue.

> By the other patch do you mean the following? (This email was also put
> into my spam by GMX and AFAICT has not yet shown up on the vger list.)

Yes.

But now that I played a little with it, I made this:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4e01c448b4b48..e09fc41e99fc3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2777,6 +2777,8 @@ void flush_workqueue(struct workqueue_struct *wq)
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
 
+	if (system_state > SYSTEM_RUNNING)
+		pr_err("Flushing %s\n", wq->name);
 	mutex_lock(&wq->mutex);
 
 	/*
@@ -2826,7 +2828,18 @@ void flush_workqueue(struct workqueue_struct *wq)
 
 	mutex_unlock(&wq->mutex);
 
-	wait_for_completion(&this_flusher.done);
+	if (system_state > SYSTEM_RUNNING) {
+		int ret;
+try_again:
+		ret = wait_for_completion_timeout(&this_flusher.done, 3 * HZ);
+		pr_err("Flushing %s over, %d\n", wq->name, ret);
+		if (!ret) {
+			show_workqueue_state();
+			goto try_again;
+		}
+	} else {
+		wait_for_completion(&this_flusher.done);
+	}
 
 	/*
 	 * Wake-up-and-cascade phase
@@ -3039,11 +3052,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
 		lock_map_release(&work->lockdep_map);
 	}
 
+	if (system_state > SYSTEM_RUNNING)
+		pr_err("Flush %pS\n", work->func);
+
 	if (start_flush_work(work, &barr, from_cancel)) {
 		wait_for_completion(&barr.done);
 		destroy_work_on_stack(&barr.work);
+		if (system_state > SYSTEM_RUNNING)
+			pr_err("Flush %pS done\n", work->func);
 		return true;
 	} else {
+		if (system_state > SYSTEM_RUNNING)
+			pr_err("Flush %pS done\n", work->func);
 		return false;
 	}
 }

So you should get output like this:

|[ 14.131696] Flushing events_freezable_power_

	the flush that hangs

|[ 17.203246] Flushing events_freezable_power_ over, 0

	The hang has been noticed (the 0 at the end).

|[ 17.203849] Showing busy workqueues and worker pools:
|[ 17.204400] workqueue events_freezable_power_: flags=0x84
|[ 17.205009]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[ 17.205774]     in-flight: 2147:disk_events_workfn
|[ 17.206268] workqueue kblockd: flags=0x18
|[ 17.206682]   pwq 13: cpus=6 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
|[ 17.207428]     pending: blk_mq_requeue_work
|[ 17.207958] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=2s workers=3 idle: 535 42

	A dump of all pending workqueue items.

|[ 20.275239] Flushing events_freezable_power_ over, 0
|[ 20.275812] Showing busy workqueues and worker pools:
|[ 20.276385] workqueue events_freezable_power_: flags=0x84
|[ 20.276972]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
|[ 20.277712]     in-flight: 2147:disk_events_workfn
|[ 20.278245] workqueue kblockd: flags=0x18
|[ 20.278685]   pwq 13: cpus=6 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
|[ 20.279471]     pending: blk_mq_requeue_work
|[ 20.279989] pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=5s workers=3 idle: 535 42

	Another attempt.

As you can see, `blk_mq_requeue_work' is referenced in both outputs but
it makes progress (it is scheduled multiple times). The item
`disk_events_workfn' is stuck.

Sebastian