Re: [question] panic() during reboot -f (reboot syscall)

2019-03-13 Thread Peter Zijlstra
On Tue, Mar 12, 2019 at 04:29:25PM -0500, Eric W. Biederman wrote:
> I wonder if there is an easy way to get the scheduler to not schedule
> userspace processes once the reboot system call has started.  That
> sounds like the simple way to avoid this kind of confusion.

That sounds like adding code to a hotpath that is 'never' used.


Re: [question] panic() during reboot -f (reboot syscall)

2019-03-12 Thread Eric W. Biederman
Linus Torvalds writes:

> On Wed, Mar 6, 2019 at 5:29 AM Petr Mladek wrote:
>>
>> I wonder if it is "normal" to get panic() when the system is rebooted
>> using "reboot -f". It looks a bit weird to me.
>
> No, a panic is never normal (except possibly for test modules etc, of course).
>
>> Now, "reboot -f" just calls the reboot() syscall. I do not see
>> anything that would stop processes.
>
> There isn't supposed to be anything. It's meant for "things are
> screwed up, just reboot *now* without doing anything else".
>
> The "reboot now" is basically meant to be a poor man's power cycle.
>
>> But it shuts down devices very early, via:
>>
>>   + kernel_restart()
>>     + kernel_restart_prepare()
>>       + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
>>       + device_shutdown()
>
> The problem is that there are conflicting goals here, and the kernel
> doesn't even *know* if this is supposed to be a normal clean reboot,
> or a "reboot -f" that just shuts down everything.
>
> On a nice clean reboot (where init has shut everything down) we
> obviously _do_ want to shut devices down etc. Quite often you need to
> do it just to make sure they come up nicely again (because the
> firmware isn't even always re-initializing things properly on a soft
> reboot).
>
> But on a "reboot -f", user space _hasn't_ cleaned up, and just wants
> things to reboot. But the kernel doesn't really know. It just gets the
> reboot system call in both cases.
>
>> In other words, it looks like the panic() is possible by design.
>> But it looks a bit weird. Any opinion?
>
> It's definitely not "by design", but it might be unavoidable in this case.
>
> Of course, "unavoidable" is relative. There could be workarounds that
> are reasonably ok in practice.
>
> Like having the filesystem panic code see "oh, system_state isn't
> SYSTEM_RUNNING, so I shouldn't be panicking".

I wonder if there is an easy way to get the scheduler to not schedule
userspace processes once the reboot system call has started.  That
sounds like the simple way to avoid this kind of confusion.

Eric



Re: [question] panic() during reboot -f (reboot syscall)

2019-03-10 Thread Linus Torvalds
On Wed, Mar 6, 2019 at 5:29 AM Petr Mladek wrote:
>
> I wonder if it is "normal" to get panic() when the system is rebooted
> using "reboot -f". It looks a bit weird to me.

No, a panic is never normal (except possibly for test modules etc, of course).

> Now, "reboot -f" just calls the reboot() syscall. I do not see
> anything that would stop processes.

There isn't supposed to be anything. It's meant for "things are
screwed up, just reboot *now* without doing anything else".

The "reboot now" is basically meant to be a poor man's power cycle.

> But it shuts down devices very early, via:
>
>   + kernel_restart()
>     + kernel_restart_prepare()
>       + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
>       + device_shutdown()

The problem is that there are conflicting goals here, and the kernel
doesn't even *know* if this is supposed to be a normal clean reboot,
or a "reboot -f" that just shuts down everything.

On a nice clean reboot (where init has shut everything down) we
obviously _do_ want to shut devices down etc. Quite often you need to
do it just to make sure they come up nicely again (because the
firmware isn't even always re-initializing things properly on a soft
reboot).

But on a "reboot -f", user space _hasn't_ cleaned up, and just wants
things to reboot. But the kernel doesn't really know. It just gets the
reboot system call in both cases.

> In other words, it looks like the panic() is possible by design.
> But it looks a bit weird. Any opinion?

It's definitely not "by design", but it might be unavoidable in this case.

Of course, "unavoidable" is relative. There could be workarounds that
are reasonably ok in practice.

Like having the filesystem panic code see "oh, system_state isn't
SYSTEM_RUNNING, so I shouldn't be panicking".

Linus
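
The workaround suggested above can be sketched as a small userspace
model. This is not kernel code: the model_* names are hypothetical
stand-ins for system_state and for the errors= mount policy, and the
sketch assumes system_state has already left SYSTEM_RUNNING by the time
devices are shut down. What the filesystem should do instead of
panicking is left open, as it is in the thread.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for a few of the kernel's system_state values. */
enum model_system_state {
    MODEL_SYSTEM_BOOTING,
    MODEL_SYSTEM_RUNNING,
    MODEL_SYSTEM_RESTART,
};

/* Hypothetical stand-ins for the errors=continue|remount-ro|panic policy. */
enum model_errors_policy {
    MODEL_ERRORS_CONTINUE,
    MODEL_ERRORS_REMOUNT_RO,
    MODEL_ERRORS_PANIC,
};

/*
 * Decide whether a filesystem error should escalate to a panic.
 * The guard: even with errors=panic, only panic while the system is
 * in its normal running state, not while a reboot is in progress.
 */
bool model_should_panic(enum model_errors_policy policy,
                        enum model_system_state state)
{
    if (policy != MODEL_ERRORS_PANIC)
        return false;
    return state == MODEL_SYSTEM_RUNNING;
}
```

With this guard, an error seen in MODEL_SYSTEM_RESTART is not escalated,
while the same error in MODEL_SYSTEM_RUNNING still is.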


[question] panic() during reboot -f (reboot syscall)

2019-03-06 Thread Petr Mladek
Hello,

I wonder if it is "normal" to get panic() when the system is rebooted
using "reboot -f". It looks a bit weird to me.

In our case, the panic() was triggered from ext4 code, on a filesystem
that was mounted with "errors=panic":

  crash> bt
  PID: 3984   TASK: 887db1f6c180  CPU: 32  COMMAND: "bash"
  #0 [887e637bf9a8] machine_kexec at 81059c5c
  #1 [887e637bf9f8] __crash_kexec at 81119e0a
  #2 [887e637bfab8] panic at 81193c31
  #3 [887e637bfb30] ext4_handle_error at a0229faa [ext4]
  #4 [887e637bfb40] __ext4_error_inode at a022a12a [ext4]
  #5 [887e637bfbe0] __ext4_get_inode_loc at a02096a5 [ext4]
  #6 [887e637bfc40] ext4_iget at a020c028 [ext4]
  #7 [887e637bfcc0] ext4_lookup at a0216ca0 [ext4]
  #8 [887e637bfce8] lookup_real at 81218e3f
  #9 [887e637bfd00] __lookup_hash at 8121916f
  #10 [887e637bfd20] walk_component at 8121b50f
  #11 [887e637bfd70] path_lookupat at 8121ca30
  #12 [887e637bfd98] filename_lookup at 8121e58c
  #13 [887e637bfe98] vfs_fstatat at 81214549
  #14 [887e637bfed8] SYSC_newstat at 812149ca
  #15 [887e637bff50] entry_SYSCALL_64_fastpath at 8161de61
  RIP: 7f9db8d3ebe5  RSP: 7ffda081cf68  RFLAGS: 0246
  RAX: ffda  RBX:   RCX: 7f9db8d3ebe5
  RDX: 013c7fa0  RSI: 013c7fa0  RDI: 013c7f40
  RBP: 7f9db943bee0   R8: 013c7f40   R9: 000b
  R10: 7af2c337  R11: 0246  R12: 013c7fa0
  R13: 013c7fa0  R14: 0008  R15: 013c7f80
  ORIG_RAX: 0004  CS: 0033  SS: 002b


Now, "reboot -f" just calls the reboot() syscall. I do not see
anything that would stop processes. It does not even stop the
other CPUs, on purpose; see commit cf7df378aa4ff7da
("reboot: migrate shutdown/reboot to boot cpu").

But it shuts down devices very early, via:

  + kernel_restart()
    + kernel_restart_prepare()
      + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
      + device_shutdown()

As a result, processes are still running. Filesystem code returns
errors because the underlying disk device was removed. This causes
panic() because of the "errors=panic" mount option.
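
The sequence described above can be modelled in a few lines of
userspace C. This is only an illustrative sketch, not kernel code:
model_disk, model_device_shutdown(), model_read_inode() and
model_lookup_panics() are hypothetical names, and returning -EIO for a
missing device is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the block device under the filesystem. */
struct model_disk {
    bool present;
};

/* Models device_shutdown(): the disk goes away, tasks keep running. */
void model_device_shutdown(struct model_disk *disk)
{
    assert(disk != NULL);
    disk->present = false;
}

/* Models an inode read; -EIO for a missing device is an assumption. */
int model_read_inode(const struct model_disk *disk)
{
    assert(disk != NULL);
    return disk->present ? 0 : -EIO;
}

/*
 * A lookup issued by a still-running task escalates to a panic only
 * if the read fails and the filesystem was mounted with errors=panic.
 */
bool model_lookup_panics(const struct model_disk *disk, bool errors_panic)
{
    return model_read_inode(disk) != 0 && errors_panic;
}
```

Before model_device_shutdown() a lookup succeeds; afterwards the same
lookup fails, and with errors=panic that failure becomes a panic.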


My understanding is that userspace is responsible for a "clean" reboot.
The "reboot" command normally stops services, kills processes,
syncs disks, and unmounts filesystems before it calls the "reboot"
syscall.
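
For comparison, the forced path amounts to little more than the
following sketch. forced_reboot() and its dry_run flag are hypothetical
names added for illustration; reboot() and RB_AUTOBOOT are the glibc
interface from <sys/reboot.h>. Whether a particular "reboot"
implementation syncs before making the call is not modelled here.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/reboot.h>

/*
 * Hypothetical helper: go straight to the reboot syscall without
 * stopping services, killing processes or unmounting filesystems.
 * With dry_run set, it only reports what it would do.
 */
int forced_reboot(bool dry_run)
{
    if (dry_run) {
        puts("dry run: would call reboot(RB_AUTOBOOT)");
        return 0;
    }
    /* Needs CAP_SYS_BOOT; on success this call does not return. */
    return reboot(RB_AUTOBOOT);
}
```

Calling forced_reboot(true) is safe on a live system; forced_reboot(false)
restarts the machine immediately if the caller has the privilege.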

In other words, it looks like the panic() is possible by design.
But it looks a bit weird. Any opinion?

Best Regards,
Petr