bug#37757: Kernel panic upon shutdown

2019-12-09 Thread Ludovic Courtès
Hi,

Ludovic Courtès wrote:

> My plan is to:
>
>   1. push the patch below to the ‘stable-2.2’ branch of Guile;
>  done:
>  
> ;
>
>   2. use a patched Guile for the ‘shepherd’ package;

Done:
.

>   3. include the crash handler in the Shepherd.

Done:
.

I’m closing the bug.  Please reopen it if you notice anything wrong!

Ludo’.





bug#37757: Kernel panic upon shutdown

2019-12-09 Thread Ludovic Courtès
Hello,

[+Cc: Andy for a heads-up on the fix below.]

Ludovic Courtès wrote:

> It turns out the previous patch didn’t work; in short, we really have to
> use async-signal-safe functions only from the signal handler, so this
> has to be done in C.
>
> The attached patch does that.  I’ve tried it with ‘guix system
> container’ and it seems to dump core as expected, from what I can see.
>
> Let me know if you manage to reproduce the bug and to get a core dumped
> with this patch.

Good news!  The patch does indeed allow shepherd to dump core, and I
managed to grab the backtrace below on an x86_64 machine running Guix
System (from yesterday) with GNOME:

--8<---cut here---start->8---
Using host libthread_db library 
"/gnu/store/ahqgl4h89xqj695lgqvsaf6zh2nhy4pj-glibc-2.29/lib/libthread_db.so.1".
Core was generated by 
`/gnu/store/1mkkv2caiqbdbbd256c4dirfi4kwsacv-guile-2.2.6/bin/guile 
--no-auto-com'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  handle_crash (sig=11)
at /gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
43        * (int *) 0 = 42;
[Current thread is 1 (LWP 4635)]

[…]

Thread 1 (LWP 4635):
#0  handle_crash (sig=11) at 
/gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
infinity = {rlim_cur = 18446744073709551615, rlim_max = 
18446744073709551615}
pid = 
msg = "Shepherd crashed!\n"
pid = 
#1  <signal handler called>
No locals.
#2  handle_crash (sig=6) at 
/gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
infinity = {rlim_cur = 18446744073709551615, rlim_max = 
18446744073709551615}
pid = 
msg = "Shepherd crashed!\n"
pid = 
#3  <signal handler called>
No locals.
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
set = {__val = {0, 2314885530818445312, 0 }}
pid = 
tid = 
ret = 
#5  0x7f03eef40891 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0 , 139654877144192, 0, 
139654877624544}}, sa_flags = -279049286, sa_restorer = 0x7f03ef57e480 
}
sigs = {__val = {32, 0 }}
#6  0x7f03ef57e89a in finalization_thread_proc (unused=) at 
finalizers.c:228
data = {byte = -24 '\350', n = -1, err = 4}
#7  0x7f03ef56f35a in c_body (d=0x7f03ed152e50) at continuations.c:422
data = 0x7f03ed152e50
#8  0x7f03ef5f079f in vm_regular_engine (thread=0x2, vp=0x7f03eb1caea0, 
registers=0x0, resume=-286001158) at vm-engine.c:786
ret = 2
ip = 
sp = 
op = 10
jump_table_ = {…}
jump_table = 0x7f03ef64d8e0 

[…]

#19 scm_with_guile (func=, data=) at threads.c:710
No locals.
#20 0x7f03ef497015 in start_thread (arg=0x7f03ed153700) at 
pthread_create.c:486
ret = 
pd = 0x7f03ed153700
now = 
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139654839219968, 
-749312912628550421, 140727702524830, 140727702524831, 140727702524832, 
139654839219968, 837174519050892523, 837169745183601899}, mask_was_saved = 0}}, 
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, 
canceltype = 0}}}
not_first_call = 
#21 0x7f03eeffd91f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
--8<---cut here---end--->8---

So what happens is that ‘finalization_thread_proc’ in Guile receives
EINTR (data.err == 4) but then, despite EINTR, it goes on to check the
value of ‘data.byte’ and aborts because it’s neither 0 nor 1.
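
As an illustration only, the intended control flow can be sketched outside
of Guile as shown here.  The names ‘struct pipe_data’, ‘enum action’ and
‘classify’ are hypothetical; only the fields ‘byte’, ‘n’ and ‘err’ come
from the backtrace, and the meaning given to byte 1 is an assumption.  The
point is that ‘byte’ is looked at only when read(2) actually returned data:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the result of reading one byte from the
   finalization pipe; the field names follow the backtrace.  */
struct pipe_data
{
  char byte;   /* byte read; meaningful only when n > 0 */
  long n;      /* return value of read(2) */
  int err;     /* errno saved right after read(2) */
};

enum action
{
  ACTION_RETRY,           /* interrupted: read again */
  ACTION_FAIL,            /* real read error */
  ACTION_RUN_FINALIZERS,  /* byte 0 */
  ACTION_STOP,            /* byte 1 (assumed meaning) */
  ACTION_ABORT            /* any other byte */
};

/* Decide what to do with DATA.  'byte' is inspected only when a byte
   was actually read, so an EINTR can no longer lead to an abort.  */
static enum action
classify (const struct pipe_data *data)
{
  if (data->n <= 0)
    return data->err == EINTR ? ACTION_RETRY : ACTION_FAIL;

  switch (data->byte)
    {
    case 0:
      return ACTION_RUN_FINALIZERS;
    case 1:
      return ACTION_STOP;
    default:
      return ACTION_ABORT;
    }
}
```

With the values from frame #6 (n = -1, err = 4), this sketch returns
ACTION_RETRY instead of ever reaching the ‘switch’.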

My plan is to:

  1. push the patch below to the ‘stable-2.2’ branch of Guile;
 done:
 
;

  2. use a patched Guile for the ‘shepherd’ package;

  3. include the crash handler in the Shepherd.

Thoughts?

Thanks,
Ludo’.

diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index c5d69e8e3..94a6e6b0a 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -1,4 +1,4 @@
-/* Copyright (C) 2012, 2013, 2014 Free Software Foundation, Inc.
+/* Copyright (C) 2012, 2013, 2014, 2019 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -211,21 +211,26 @@ finalization_thread_proc (void *unused)
 
   scm_without_guile (read_finalization_pipe_data, &data);
   
-  if (data.n <= 0 && data.err != EINTR) 
+  if (data.n <= 0)
 {
-  perror ("error in finalization thread");
-  return NULL;
+  if (data.err != EINTR)
+{
+  perror ("error in finalization thread");
+  return NULL;
+}
 }
-
-  switch (data.byte)
+  else
 {
-case 0:
-  scm_run_finalizers ();
-  break;
-c

bug#37757: Kernel panic upon shutdown

2019-12-03 Thread Arne Babenhauserheide

Ludovic Courtès writes:
> To everyone reading this: if you’re experiencing shepherd crashes,
> please raise your hand :-)

\o

> and consider applying this patch so we can gather debugging info!

Can I do that without installing from a local checkout?

Best wishes,
Arne
--
Being apolitical
means being political
without noticing it




bug#37757: Kernel panic upon shutdown

2019-12-02 Thread Ludovic Courtès
Hi!

Ludovic Courtès wrote:

> Jesse (and anyone else experiencing this!), could you try to (1)
> reconfigure with this patch, (2) reboot, (3) try to halt the system to
> reproduce the crash, and (4) retrieve a backtrace from the ‘core’ file?
>
> For #4, you’ll have to do something along these lines once you’ve
> rebooted after the crash:
>
>   sudo gdb /run/current-system/profile/bin/guile /core
>
> and then type “thread apply all bt” at the GDB prompt.

It turns out the previous patch didn’t work; in short, we really have to
use async-signal-safe functions only from the signal handler, so this
has to be done in C.

The attached patch does that.  I’ve tried it with ‘guix system
container’ and it seems to dump core as expected, from what I can see.
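
To illustrate the async-signal-safety constraint in isolation, here is a
minimal sketch that is separate from the attached patch.  The names
‘on_signal’, ‘install_handler’ and ‘got_signal’ are hypothetical.  The
handler restricts itself to write(2), which is on the POSIX list of
async-signal-safe functions, and to storing into a ‘volatile
sig_atomic_t’ variable:

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; read by the rest of the program.  */
static volatile sig_atomic_t got_signal = 0;

static void
on_signal (int sig)
{
  static const char msg[] = "signal received\n";

  /* write(2) is async-signal-safe; printf(3) and malloc(3) are not.  */
  ssize_t ignored = write (STDERR_FILENO, msg, sizeof msg - 1);
  (void) ignored;

  got_signal = sig;
}

/* Install 'on_signal' for SIG.  Return 0 on success, -1 on error.  */
static int
install_handler (int sig)
{
  struct sigaction sa;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = on_signal;
  sigemptyset (&sa.sa_mask);
  return sigaction (sig, &sa, NULL);
}
```

After ‘install_handler (SIGUSR1)’, calling ‘raise (SIGUSR1)’ runs the
handler and leaves SIGUSR1 in ‘got_signal’.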

Let me know if you manage to reproduce the bug and to get a core dumped
with this patch.

To everyone reading this: if you’re experiencing shepherd crashes,
please raise your hand :-) and consider applying this patch so we can
gather debugging info!

Thanks,
Ludo’.

diff --git a/gnu/services/shepherd.scm b/gnu/services/shepherd.scm
index 08bb33039c..cf82ef0a4c 100644
--- a/gnu/services/shepherd.scm
+++ b/gnu/services/shepherd.scm
@@ -271,6 +271,23 @@ and return the resulting '.go' file."
  (compile-file #$file #:output-file #$output
#:env env))
 
+(define (crash-handler)
+  (define gcc-toolchain
+(module-ref (resolve-interface '(gnu packages commencement))
+'gcc-toolchain))
+
+  (define source
+(local-file "../system/aux-files/shepherd-crash-handler.c"))
+
+  (computed-file "crash-handler.so"
+ #~(begin
+ (setenv "PATH" #+(file-append gcc-toolchain "/bin"))
+ (setenv "CPATH" #+(file-append gcc-toolchain "/include"))
+ (setenv "LIBRARY_PATH"
+ #+(file-append gcc-toolchain "/lib"))
+ (system* "gcc" "-Wall" "-g" "-O3" "-fPIC"
+  "-shared" "-o" #$output #$source
+
 (define (shepherd-configuration-file services)
   "Return the shepherd configuration file for SERVICES."
   (assert-valid-graph services)
@@ -281,6 +298,9 @@ and return the resulting '.go' file."
   (use-modules (srfi srfi-34)
(system repl error-handling))
 
+  ;; Load the crash handler, which allows shepherd to dump core.
+  (dynamic-link #$(crash-handler))
+
   ;; Arrange to spawn a REPL if something goes wrong.  This is better
   ;; than a kernel panic.
   (call-with-error-handling
diff --git a/gnu/system/aux-files/shepherd-crash-handler.c b/gnu/system/aux-files/shepherd-crash-handler.c
new file mode 100644
index 00..6b2db10866
--- /dev/null
+++ b/gnu/system/aux-files/shepherd-crash-handler.c
@@ -0,0 +1,70 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include 
+
+static void
+handle_crash (int sig)
+{
+  static const char msg[] = "Shepherd crashed!\n";
+  write (2, msg, sizeof msg);
+
+#ifdef __sparc__
+  /* See 'raw_clone' in systemd.  */
+# error "SPARC uses a different 'clone' syscall convention"
+#endif
+
+  pid_t pid = syscall (SYS_clone, SIGCHLD, NULL);
+  if (pid < 0)
+abort ();
+
+  if (pid == 0)
+{
+  /* Restore the default signal handler to get a core dump.  */
+  signal (sig, SIG_DFL);
+
+  const struct rlimit infinity = { RLIM_INFINITY, RLIM_INFINITY };
+  setrlimit (RLIMIT_CORE, &infinity);
+  chdir ("/");
+
+  int pid = syscall (SYS_getpid);
+  kill (pid, sig);
+
+  /* As it turns out, 'kill' simply returns without doing anything, which
+	 is consistent with the "Notes" section of kill(2).  Thus, force a
+	 crash.  */
+  * (int *) 0 = 42;
+
+  _exit (254);
+}
+  else
+{
+  signal (sig, SIG_IGN);
+
+  int status;
+  waitpid (pid, &status, 0);
+
+  sync ();
+
+  _exit (255);
+}
+
+  _exit (253);
+}
+
+static void initialize_crash_handler (void)
+  __attribute__ ((constructor));
+
+static void
+initialize_crash_handler (void)
+{
+  signal (SIGSEGV, handle_crash);
+  signal (SIGABRT, handle_crash);
+}


bug#37757: Kernel panic upon shutdown

2019-11-28 Thread Ludovic Courtès
Hello!

The attached patch should allow shepherd (PID 1) to dump core when it
crashes (systemd does something similar).
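
For readers unfamiliar with the technique, here is a minimal C sketch of
the general idea for an ordinary process (not PID 1, where signals without
a handler behave differently): fork, let the child die from the signal
with its default disposition, and wait for it in the parent.
‘crash_in_child’ is a hypothetical helper and not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that dies from SIG with the default disposition, after
   setting its core-size limit to CORE_LIMIT.  Return the child's wait
   status, or -1 on error.  */
static int
crash_in_child (int sig, rlim_t core_limit)
{
  pid_t pid = fork ();
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      const struct rlimit limit = { core_limit, core_limit };

      /* Raising the limit can fail if it exceeds the hard limit; the
         result is deliberately ignored in this sketch.  */
      setrlimit (RLIMIT_CORE, &limit);

      /* Restore the default action so that the signal is fatal.  */
      signal (sig, SIG_DFL);
      raise (sig);

      _exit (254);   /* reached only if the signal was not fatal */
    }

  int status;
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return status;
}
```

Passing RLIM_INFINITY as CORE_LIMIT asks for an unlimited core file;
passing 0 suppresses it, which is convenient for trying the function
without leaving a ‘core’ file behind.  On Linux, ‘WCOREDUMP (status)’
tells whether a core was written.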

Jesse (and anyone else experiencing this!), could you try to (1)
reconfigure with this patch, (2) reboot, (3) try to halt the system to
reproduce the crash, and (4) retrieve a backtrace from the ‘core’ file?

For #4, you’ll have to do something along these lines once you’ve
rebooted after the crash:

  sudo gdb /run/current-system/profile/bin/guile /core

and then type “thread apply all bt” at the GDB prompt.

I’ll also try to do that on another machine where I’ve seen it happen.

Thanks in advance!

Ludo’.

diff --git a/gnu/services/shepherd.scm b/gnu/services/shepherd.scm
index 08bb33039c..ec49244cf6 100644
--- a/gnu/services/shepherd.scm
+++ b/gnu/services/shepherd.scm
@@ -277,45 +277,87 @@ and return the resulting '.go' file."
 
   (let ((files (map shepherd-service-file services)))
 (define config
-  #~(begin
-  (use-modules (srfi srfi-34)
-   (system repl error-handling))
+  (with-imported-modules '((guix build syscalls))
+#~(begin
+(use-modules (srfi srfi-34)
+ (system repl error-handling)
+ (guix build syscalls)
+ (system foreign))
 
-  ;; Arrange to spawn a REPL if something goes wrong.  This is better
-  ;; than a kernel panic.
-  (call-with-error-handling
-(lambda ()
-  (apply register-services
- (map load-compiled '#$(map scm->go files)
+(define signal
+  (let ((proc (pointer->procedure int
+  (dynamic-func "signal"
+(dynamic-link))
+  (list int '*
+(lambda (signum handler)
+  (proc signum
+(if (integer? handler);SIG_DFL, etc.
+(make-pointer handler)
+(procedure->pointer void handler (list int)))
 
-  ;; guix-daemon 0.6 aborts if 'PATH' is undefined, so work around
-  ;; it.
-  (setenv "PATH" "/run/current-system/profile/bin")
+(define (handle-crash sig)
+  (dynamic-wind
+(const #t)
+(lambda ()
+  (gc-disable)
+  (pk 'crash! sig)
+  ;; Fork and have the child dump core at the root.
+  (match (clone SIGCHLD)
+(0
+ (setrlimit 'core #f #f)
+ (chdir "/")
+ (signal sig SIG_DFL)
+ ;; Note: 'getpid' would return 1, hence this hack.
+ (kill (string->number (readlink "/proc/self"))
+   sig)
+ (primitive-_exit 253))
+(child
+ (waitpid child)
+ (sync)
+ ;; Hopefully at this point core has been dumped.
+ (pk 'done)
+ (sleep 3)
+ (primitive-_exit 255
+(lambda ()
+  (primitive-_exit 254
 
-  (format #t "starting services...~%")
-  (for-each (lambda (service)
-  ;; In the Shepherd 0.3 the 'start' method can raise
-  ;; '&action-runtime-error' if it fails, so protect
-  ;; against it.  (XXX: 'action-runtime-error?' is not
-  ;; exported is 0.3, hence 'service-error?'.)
-  (guard (c ((service-error? c)
- (format (current-error-port)
- "failed to start service '~a'~%"
- service)))
-(start service)))
-'#$(append-map shepherd-service-provision
-   (filter shepherd-service-auto-start?
-   services)))
+(signal SIGSEGV handle-crash)
 
-  ;; Hang up stdin.  At this point, we assume that 'start' methods
-  ;; that required user interaction on the console (e.g.,
-  ;; 'cryptsetup open' invocations, post-fsck emergency REPL) have
-  ;; completed.  User interaction becomes impossible after this
-  ;; call; this avoids situations where services wrongfully lead
-  ;; PID 1 to read from stdin (the console), which users may not
-  ;; have access to (see ).
-  (redirect-port (open-input-file "/dev/null")
- (current-input-port
+;; Arrange to spawn a REPL if something goes wrong.  This is better
+;; than a kernel panic.
+(call-

bug#37757: Kernel panic upon shutdown

2019-11-13 Thread Jan
Hi,
I encountered the same error today. I had run "sudo herd stop tor" and
then "sudo herd stop xorg-server" and it panicked.


Jan Wielkiewicz





bug#37757: Kernel panic upon shutdown

2019-11-13 Thread Ludovic Courtès
Ludovic Courtès wrote:

> I’ve just seen it on a laptop running GNOME and ‘%desktop-services’.
> The kernel panic appeared right after shutting down ModemManager (I
> don’t have ModemManager on my own laptop and I’ve never experienced the
> bug, but I don’t know if it’s significant.)
>
> Note that we see (roughly):
>
>   attempted to kill init! exit code=0x000b

[...]

> Is it reproducible for you in a VM built with ‘guix system vm’?  It
> would be helpful if we had that.

For the record, apparently I can’t reproduce it in a ‘guix system vm
gnu/system/examples/desktop.tmpl’ VM.

Ludo’.





bug#37757: Kernel panic upon shutdown

2019-10-28 Thread Ludovic Courtès
Hi,

Jesse Gibbons wrote:

> Attached is a picture of the kernel panic. It happened when I tried to shut
> down.
> I do not know what log to look at to get any details about what happened
> about that time. Of course, the panic itself is not in any of the logs in
> /var/log.
> This is not the first time there was a kernel panic during the shutdown
> process.

I’ve just seen it on a laptop running GNOME and ‘%desktop-services’.
The kernel panic appeared right after shutting down ModemManager (I
don’t have ModemManager on my own laptop and I’ve never experienced the
bug, but I don’t know if it’s significant.)

Note that we see (roughly):

  attempted to kill init! exit code=0x000b

which, unless I’m mistaken, means that PID 1 segfaulted (SIGSEGV = 11),
which is bad.
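
As a rough sketch of that reading, and assuming the code printed by the
kernel uses the same encoding as a wait(2) status (terminating signal in
the low 7 bits), the standard macros decode 0x0b as a death by signal 11.
‘terminating_signal’ is a hypothetical helper:

```c
#include <signal.h>
#include <sys/wait.h>

/* Return the number of the signal that terminated a process whose raw
   wait status is STATUS, or 0 if it was not killed by a signal.  */
static int
terminating_signal (int status)
{
  return WIFSIGNALED (status) ? WTERMSIG (status) : 0;
}
```

On Linux, ‘terminating_signal (0x0b)’ yields SIGSEGV, whereas a status
such as 0x0100 (normal exit with code 1) yields 0.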

According to reboot(2), the ‘reboot’ syscall doesn’t return in this
case, so the segfault must have happened before the ‘reboot’ call.

The problem appeared roughly after the ‘core-updates’ merge, but I don’t
see any change to the ‘reboot’ wrapper in glibc 2.29.

Is it reproducible for you in a VM built with ‘guix system vm’?  It
would be helpful if we had that.

Thanks,
Ludo’.