subject:"bug#31925\: 'guix substitutes' sometimes hangs on glibc 2.27"

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-06 Thread Ludovic Courtès

Hello Andy,

Andy Wingo  skribis:

> On Thu 05 Jul 2018 14:27, l...@gnu.org (Ludovic Courtès) writes:
>
>> Hello,
>>
>> Andy Wingo  skribis:
>>
>>> The signal thread is a possibility though in that case you'd get a
>>> warning; the signal-handling thread appears in scm_all_threads.  Do you
>>> see a warning?  If you do, that is a problem :)
>>
>> I don’t see a warning.
>>
>> But as a Guile user, I shouldn’t see a warning just because there’s a
>> signal thread anyway; it’s not a thread I spawned myself.
>
> I understand but it's how it works.  If we want to change this, probably
> we need a similar interface as we have with finalization.

Right, understood.

>> scheme@(guile-user)> (all-threads)
>> $1 = (# #> (1b06e00)>)
>> scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
>> ;; no warning
>
> Are you certain that this is the signal-handling thread and not the
> finalizer thread?  I suspect it is the finalizer thread, and that it
> gets properly shut down before the fork.

Oh, you must be right.

> Regarding seeing the warning: do you do make some other binding for the
> default warning port in Guix?

No.

Thanks,
Ludo’.

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Andy Wingo

On Thu 05 Jul 2018 14:27, l...@gnu.org (Ludovic Courtès) writes:

> Hello,
>
> Andy Wingo  skribis:
>
>> The signal thread is a possibility though in that case you'd get a
>> warning; the signal-handling thread appears in scm_all_threads.  Do you
>> see a warning?  If you do, that is a problem :)
>
> I don’t see a warning.
>
> But as a Guile user, I shouldn’t see a warning just because there’s a
> signal thread anyway; it’s not a thread I spawned myself.

I understand but it's how it works.  If we want to change this, probably
we need a similar interface as we have with finalization.

> The weird thing is that the signal thread always exists, and it’s
> surprising IMO that it shows up in ‘scm_all_threads’ because it’s not a
> “user thread”.  The other surprise is that the warning isn’t triggered:
>
> $ guile
> GNU Guile 2.2.4
> Copyright (C) 1995-2017 Free Software Foundation, Inc.
>
> Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
> This program is free software, and you are welcome to redistribute it
> under certain conditions; type `,show c' for details.
>
> Enter `,help' for help.
> scheme@(guile-user)> (all-threads)
> $1 = (# #)
> scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
> ;; no warning

Are you certain that this is the signal-handling thread and not the
finalizer thread?  I suspect it is the finalizer thread, and that it
gets properly shut down before the fork.

Regarding seeing the warning: do you do make some other binding for the
default warning port in Guix?

Andy

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Andy Wingo

Hi,

On Thu 05 Jul 2018 12:05, Mark H Weaver  writes:

> However, it's also the case that libgc uses 'pthread_atfork' (where
> available) to arrange to grab the GC allocation as well as the "mark
> locks" in the case where parallel marking is enabled.  See
> fork_prepare_proc, fork_parent_proc, and fork_child_proc in
> pthread_support.c.

I don't think this is enabled by default.  You have to configure your
libgc this way.  When investigating similar bugs, I proposed enabling it
by default a while ago:

  http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2012-February/004958.html

I ended up realizing that pthread_atfork was just a bogus interface.
For one, it turns out that POSIX clearly says that if a multithreaded
program forks, the behavior of the child after the fork is undefined if
it calls any non-async-signal-safe function before calling exec():

  https://lists.gnu.org/archive/html/guile-devel/2012-02/msg00157.html

But practically, the only reasonable thing to do with atfork is to grab
all of the locks, then release them after forking, in both child and
parent.  However you can't do this without deadlocking from a library,
as the total lock order is a property of the program and not something a
library can decide.

There are thus two solutions: either ensure that there are no other
threads when you fork, or only call async-signal-safe functions before
you exec().  open-process does the latter.  fork will warn if the former
is not the case.  When last I looked into this, I concluded that
pthread_atfork doesn't buy us anything, though I could be wrong!

>> Here's the body of primitive-fork fwiw:
>>
>> {
>>   int pid;
>>   scm_i_finalizer_pre_fork ();
>>   if (scm_ilength (scm_all_threads ()) != 1)
>
> I think there's a race here.  I think it's possible for the finalizer
> thread to be respawned after 'scm_i_finalizer_pre_fork' in two different
> ways:
>
> (1) 'scm_all_threads' performs allocation, which could trigger GC.
>
> (2) another thread could perform heap allocation and trigger GC after
> 'scm_i_finalizer_pre_fork' joins the thread.  it might then shut
> down before 'scm_all_thread' is called.
>
> However, these are highly unlikely scenarios, and most likely not the
> problem we're seeing here.
>
> Still, I think the 'scm_i_finalizer_pre_fork' should be moved after the
> 'if', to avoid this race.

Good point!  Probably we should use some non-allocating
 scm_i_is_multi_threaded() or something.  We can't move the pre-fork
thing though because one of the effects we are looking for is to reduce
the thread count!

Cheers,

Andy

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Ludovic Courtès

Hello,

Andy Wingo  skribis:

> On Thu 05 Jul 2018 05:33, Mark H Weaver  writes:

[...]

>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
>> you make work for it to do every second when the SIGALRM is delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

I don’t see a warning.

But as a Guile user, I shouldn’t see a warning just because there’s a
signal thread anyway; it’s not a thread I spawned myself.

The weird thing is that the signal thread always exists, and it’s
surprising IMO that it shows up in ‘scm_all_threads’ because it’s not a
“user thread”.  The other surprise is that the warning isn’t triggered:

--8<---cut here---start->8---
$ guile
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (all-threads)
$1 = (# #)
scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
;; no warning
--8<---cut here---end--->8---

Ludo’, surprised.  :-)

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Mark H Weaver

Hi,

Andy Wingo  writes:

> On Thu 05 Jul 2018 05:33, Mark H Weaver  writes:
>
>>> One problem I’ve noticed is that the child process that
>>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>>> allocation lock:
>>>
>>> So it seems quite clear that the thing has the alloc lock taken.  I
>>> suppose this can happen if one of the libgc threads runs right when we
>>> call fork and takes the alloc lock, right?
>>
>> Does libgc spawn threads that run concurrently with user threads?  If
>> so, that would be news to me.  My understanding was that incremental
>> marking occurs within GC allocation calls, and marking threads are only
>> spawned after all user threads have been stopped, but I could be wrong.
>
> I think Mark is correct.

Actually, looking at the libgc code more closely, it seems that Ludovic
is correct.  GC_init calls GC_start_mark_threads_inner if PARALLEL_MARK
is defined at compile time.

However, it's also the case that libgc uses 'pthread_atfork' (where
available) to arrange to grab the GC allocation as well as the "mark
locks" in the case where parallel marking is enabled.  See
fork_prepare_proc, fork_parent_proc, and fork_child_proc in
pthread_support.c.

As the libgc developers admit in the comment above 'fork_prepare_proc',
they are not strictly meeting the requirements for safe use of 'fork',
but they _are_ grabbing the allocation lock during 'fork', at least on
systems that support 'pthread_atfork'.

It looks like setting the GC_MARKERS environment variable to 1 should
result in 'available_markers_m1' being set to 0, and thus effectively
disable parallel marking.  In that case, if I understand the code
correctly, no marker threads will be spawned.

It would be interesting to see if this problem can be reproduced when
GC_MARKERS is set to 1.

>> The first idea that comes to my mind is that perhaps the finalization
>> thread is holding the GC allocation lock when 'fork' is called.
>
> So of course we agree you're only supposed to "fork" when there are no
> other threads running, I think.
>
> As far as the finalizer thread goes, "primitive-fork" calls
> "scm_i_finalizer_pre_fork" which should join the finalizer thread,
> before the fork.  There could be a bug obviously but the intention is
> for Guile to shut down its internal threads.

Ah, good!  I didn't know this.  So, I guess the problem is most likely
elsewhere.

We should probably arrange to join the signal delivery thread at the
same time, and then to respawn it in the parent and child if needed.
What do you think?

> Here's the body of primitive-fork fwiw:
>
> {
>   int pid;
>   scm_i_finalizer_pre_fork ();
>   if (scm_ilength (scm_all_threads ()) != 1)

I think there's a race here.  I think it's possible for the finalizer
thread to be respawned after 'scm_i_finalizer_pre_fork' in two different
ways:

(1) 'scm_all_threads' performs allocation, which could trigger GC.

(2) another thread could perform heap allocation and trigger GC after
'scm_i_finalizer_pre_fork' joins the thread.  it might then shut
down before 'scm_all_thread' is called.

However, these are highly unlikely scenarios, and most likely not the
problem we're seeing here.

Still, I think the 'scm_i_finalizer_pre_fork' should be moved after the
'if', to avoid this race.

>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
>> you make work for it to do every second when the SIGALRM is delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

Good point!

>>> If that is correct, the fix would be to call fork within
>>> ‘GC_call_with_alloc_lock’.
>>>
>>> How does that sound?
>>
>> Sure, sounds good to me.
>
> I don't think this is necessary.  I think the problem is that other
> threads are running.  If we solve that, then we solve this issue; if we
> don't solve that, we don't know what else those threads are doing, so we
> don't know what mutexes and other state they might have.

On second thought, I agree with Andy here.  Grabbing the allocation lock
shouldn't be needed, and moreover it's not sufficient.  I think we
really need to ensure that no other threads are running, in which case
grabbing the allocation lock is pointless.  Anyway, it seems that libgc
is already arranging to do this, at least on modern GNU/Linux systems.

 Thanks,
   Mark

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Ludovic Courtès

Hello Mark,

Thanks for chiming in!

Mark H Weaver  skribis:

> Does libgc spawn threads that run concurrently with user threads?  If
> so, that would be news to me.  My understanding was that incremental
> marking occurs within GC allocation calls, and marking threads are only
> spawned after all user threads have been stopped, but I could be wrong.

libgc launches mark threads as soon as it is initialized, I think.

> The first idea that comes to my mind is that perhaps the finalization
> thread is holding the GC allocation lock when 'fork' is called.  The
> finalization thread grabs the GC allocation lock every time it calls
> 'GC_invoke_finalizers'.  All ports backed by POSIX file descriptors
> (including pipes) register finalizers and therefore spawn the
> finalization thread and make work for it to do.

In 2.2 there’s scm_i_finalizer_pre_fork that takes care of shutting down
the finalization thread right before fork.  So the finalization thread
cannot be blamed, AIUI.

> Another possibility: both the finalization thread and the signal
> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
> which also temporarily grabs the GC allocation lock before calling the
> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
> you make work for it to do every second when the SIGALRM is delivered.

That’s definitely a possibility: the signal thread could be allocating
stuff, and thereby taking the alloc lock just at that time.

>> If that is correct, the fix would be to call fork within
>> ‘GC_call_with_alloc_lock’.
>>
>> How does that sound?
>
> Sure, sounds good to me.

Here’s a patch:

diff --git a/libguile/posix.c b/libguile/posix.c
index b0fcad5fd..088e75631 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1209,6 +1209,13 @@ SCM_DEFINE (scm_execle, "execle", 2, 0, 1,
 #undef FUNC_NAME
 
 #ifdef HAVE_FORK
+static void *
+do_fork (void *pidp)
+{
+  * (int *) pidp = fork ();
+  return NULL;
+}
+
 SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
 (),
 	"Creates a new \"child\" process by duplicating the current \"parent\" process.\n"
@@ -1236,7 +1243,13 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
 " further behavior unspecified.  See \"Processes\" in the\n"
 " manual, for more information.\n"),
scm_current_warning_port ());
-  pid = fork ();
+
+  /* Take the alloc lock to make sure it is released when the child
+ process starts.  Failing to do that the child process could start
+ in a state where the alloc lock is taken and will never be
+ released.  */
+  GC_call_with_alloc_lock (do_fork, &pid);
+
   if (pid == -1)
 SCM_SYSERROR;
   return scm_from_int (pid);

Thoughts?

Unfortunately my ‘call-with-decompressed-port’ reproducer doesn’t seem t
to reproduce much today so I can’t tell if this helps (I let it run more
than 5 minutes with the supposedly-buggy Guile and nothing happened…).

Thanks,
Ludo’.

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-05 Thread Andy Wingo

Hi!

On Thu 05 Jul 2018 05:33, Mark H Weaver  writes:

>> One problem I’ve noticed is that the child process that
>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>> allocation lock:
>>
>> So it seems quite clear that the thing has the alloc lock taken.  I
>> suppose this can happen if one of the libgc threads runs right when we
>> call fork and takes the alloc lock, right?
>
> Does libgc spawn threads that run concurrently with user threads?  If
> so, that would be news to me.  My understanding was that incremental
> marking occurs within GC allocation calls, and marking threads are only
> spawned after all user threads have been stopped, but I could be wrong.

I think Mark is correct.

> The first idea that comes to my mind is that perhaps the finalization
> thread is holding the GC allocation lock when 'fork' is called.

So of course we agree you're only supposed to "fork" when there are no
other threads running, I think.

As far as the finalizer thread goes, "primitive-fork" calls
"scm_i_finalizer_pre_fork" which should join the finalizer thread,
before the fork.  There could be a bug obviously but the intention is
for Guile to shut down its internal threads.  Here's the body of
primitive-fork fwiw:

{
  int pid;
  scm_i_finalizer_pre_fork ();
  if (scm_ilength (scm_all_threads ()) != 1)
/* Other threads may be holding on to resources that Guile needs --
   it is not safe to permit one thread to fork while others are
   running.

   In addition, POSIX clearly specifies that if a multi-threaded
   program forks, the child must only call functions that are
   async-signal-safe.  We can't guarantee that in general.  The best
   we can do is to allow forking only very early, before any call to
   sigaction spawns the signal-handling thread.  */
scm_display
  (scm_from_latin1_string
   ("warning: call to primitive-fork while multiple threads are 
running;\n"
" further behavior unspecified.  See \"Processes\" in the\n"
" manual, for more information.\n"),
   scm_current_warning_port ());
  pid = fork ();
  if (pid == -1)
SCM_SYSERROR;
  return scm_from_int (pid);
}

> Another possibility: both the finalization thread and the signal
> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
> which also temporarily grabs the GC allocation lock before calling the
> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
> you make work for it to do every second when the SIGALRM is delivered.

The signal thread is a possibility though in that case you'd get a
warning; the signal-handling thread appears in scm_all_threads.  Do you
see a warning?  If you do, that is a problem :)

>> If that is correct, the fix would be to call fork within
>> ‘GC_call_with_alloc_lock’.
>>
>> How does that sound?
>
> Sure, sounds good to me.

I don't think this is necessary.  I think the problem is that other
threads are running.  If we solve that, then we solve this issue; if we
don't solve that, we don't know what else those threads are doing, so we
don't know what mutexes and other state they might have.

Andy

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-04 Thread Mark H Weaver

Hi Ludovic,

l...@gnu.org (Ludovic Courtès) writes:

> (+Cc: Andy as the ultimate authority for all these things.  :-))
>
> l...@gnu.org (Ludovic Courtès) skribis:
>
>> (let loop ((files files)
>>(n 0))
>>   (match files
>> ((file . tail)
>>  (call-with-input-file file
>>(lambda (port)
>>  (call-with-decompressed-port 'gzip port
>>(lambda (port)
>>  (let loop ()
>>(unless (eof-object? (get-bytevector-n port 777))
>>  (loop)))
>>  ;; (pk 'loop n file)
>>  (display ".")
>>  (loop tail (+ n 1)
>
> One problem I’ve noticed is that the child process that
> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
> allocation lock:
>
> (gdb) bt
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x7f9fd8d5cb25 in __GI___pthread_mutex_lock (mutex=0x7f9fd91b3240 
> ) at ../nptl/pthread_mutex_lock.c:78
> #2  0x7f9fd8f8ef8f in GC_call_with_alloc_lock (fn=fn@entry=0x7f9fd92b0420 
> , client_data=client_data@entry=0x7ffe4b9a0d80) at 
> misc.c:1929
> #3  0x7f9fd92b1270 in copy_weak_entry (dst=0x7ffe4b9a0d70, src=0x759ed0) 
> at weak-set.c:124
> #4  weak_set_remove_x (closure=0x8850c0, pred=0x7f9fd92b0440 , 
> hash=3944337866184184181, set=0x70cf00) at weak-set.c:615
> #5  scm_c_weak_set_remove_x (set=set@entry=#, 
> raw_hash=, pred=pred@entry=0x7f9fd92b0440 , 
> closure=closure@entry=0x8850c0) at weak-set.c:791
> #6  0x7f9fd92b13b0 in scm_weak_set_remove_x (set=#, 
> obj=obj@entry=#) at weak-set.c:812
> #7  0x7f9fd926f72f in close_port (port=#, 
> explicit=) at ports.c:884
> #8  0x7f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 
> , vp=0x7adf30, registers=0x0, resume=-657049556) at 
> vm-engine.c:786
> #9  0x7f9fd92afb37 in scm_call_n (proc= Cannot access memory at address 0xd959b030>0x7f9fd959b030, 
> argv=argv@entry=0x7ffe4b9a1018, nargs=nargs@entry=1) at vm.c:1257
> #10 0x7f9fd9233017 in scm_primitive_eval (exp=, 
> exp@entry= 0xd5677cf8>0x855280) at eval.c:662
> #11 0x7f9fd9233073 in scm_eval (exp= Cannot access memory at address 0xd5677cf8>0x855280, 
> module_or_state=module_or_state@entry= access memory at address 0xd95580d8>0x83d140) at eval.c:696
> #12 0x7f9fd927e8d0 in scm_shell (argc=2, argv=0x7ffe4b9a1668) at 
> script.c:454
> #13 0x7f9fd9249a9d in invoke_main_func (body_data=0x7ffe4b9a1510) at 
> init.c:340
> #14 0x7f9fd922c28a in c_body (d=0x7ffe4b9a1450) at continuations.c:422
> #15 0x7f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 
> , vp=0x7adf30, registers=0x0, resume=-657049556) at 
> vm-engine.c:786
> #16 0x7f9fd92afb37 in scm_call_n (proc=proc@entry=# 795120>, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
> #17 0x7f9fd9231e69 in scm_call_0 (proc=proc@entry=# 795120>) at eval.c:481
> #18 0x7f9fd929e7b2 in catch (tag=tag@entry=#t, thunk=# 795120>, handler= address 0x4>0x7950c0, pre_unwind_handler= ERROR: Cannot access memory at address 0x4>0x7950a0) at throw.c:137
> #19 0x7f9fd929ea95 in scm_catch_with_pre_unwind_handler 
> (key=key@entry=#t, thunk=, handler=, 
> pre_unwind_handler=) at throw.c:254
> #20 0x7f9fd929ec5f in scm_c_catch (tag=tag@entry=#t, 
> body=body@entry=0x7f9fd922c280 , 
> body_data=body_data@entry=0x7ffe4b9a1450, 
> handler=handler@entry=0x7f9fd922c510 , 
> handler_data=handler_data@entry=0x7ffe4b9a1450, 
> pre_unwind_handler=pre_unwind_handler@entry=0x7f9fd922c370 
> , pre_unwind_handler_data=0x7a9bc0) at throw.c:377
> #21 0x7f9fd922c870 in scm_i_with_continuation_barrier 
> (body=body@entry=0x7f9fd922c280 , 
> body_data=body_data@entry=0x7ffe4b9a1450, 
> handler=handler@entry=0x7f9fd922c510 , 
> handler_data=handler_data@entry=0x7ffe4b9a1450, 
> pre_unwind_handler=pre_unwind_handler@entry=0x7f9fd922c370 
> , pre_unwind_handler_data=0x7a9bc0) at continuations.c:360
> #22 0x7f9fd922c905 in scm_c_with_continuation_barrier (func= out>, data=) at continuations.c:456
> #23 0x7f9fd929d3ec in with_guile (base=base@entry=0x7ffe4b9a14b8, 
> data=data@entry=0x7ffe4b9a14e0) at threads.c:661
> #24 0x7f9fd8f8efb8 in GC_call_with_stack_base (fn=fn@entry=0x7f9fd929d3a0 
> , arg=arg@entry=0x7ffe4b9a14e0) at misc.c:1949
> #25 0x7f9fd929d708 in scm_i_with_guile (dynamic_state=, 
> data=data@entry=0x7ffe4b9a14e0, func=func@entry=0x7f9fd9249a80 
> ) at threads.c:704
> #26 scm_with_guile (func=func@entry=0x7f9fd9249a80 , 
> data=data@entry=0x7ffe4b9a1510) at threads.c:710
> #27 0x7f9fd9249c32 in scm_boot_guile (argc=argc@entry=2, 
> argv=argv@entry=0x7ffe4b9a1668, main_func=main_func@entry=0x400cb0 
> , closure=closure@entry=0x0) at init.c:323
> #28 0x00400b70 in main (argc=2, argv=0x7ffe4b9a1668) at guile.c:101
> (gdb) info threads
>   Id   Target Id Frame 
> * 1Thread 0x7f9fd972eb80 (LWP 15573) "guile" __lll_lock_wait () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
>
> So i

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-04 Thread Ludovic Courtès

(+Cc: Andy as the ultimate authority for all these things.  :-))

l...@gnu.org (Ludovic Courtès) skribis:

> (let loop ((files files)
>(n 0))
>   (match files
> ((file . tail)
>  (call-with-input-file file
>(lambda (port)
>  (call-with-decompressed-port 'gzip port
>(lambda (port)
>  (let loop ()
>(unless (eof-object? (get-bytevector-n port 777))
>  (loop)))
>  ;; (pk 'loop n file)
>  (display ".")
>  (loop tail (+ n 1)

One problem I’ve noticed is that the child process that
‘call-with-decompressed-port’ spawns would be stuck trying to get the
allocation lock:

--8<---cut here---start->8---
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x7f9fd8d5cb25 in __GI___pthread_mutex_lock (mutex=0x7f9fd91b3240 
) at ../nptl/pthread_mutex_lock.c:78
#2  0x7f9fd8f8ef8f in GC_call_with_alloc_lock (fn=fn@entry=0x7f9fd92b0420 
, client_data=client_data@entry=0x7ffe4b9a0d80) at 
misc.c:1929
#3  0x7f9fd92b1270 in copy_weak_entry (dst=0x7ffe4b9a0d70, src=0x759ed0) at 
weak-set.c:124
#4  weak_set_remove_x (closure=0x8850c0, pred=0x7f9fd92b0440 , 
hash=3944337866184184181, set=0x70cf00) at weak-set.c:615
#5  scm_c_weak_set_remove_x (set=set@entry=#, 
raw_hash=, pred=pred@entry=0x7f9fd92b0440 , 
closure=closure@entry=0x8850c0) at weak-set.c:791
#6  0x7f9fd92b13b0 in scm_weak_set_remove_x (set=#, 
obj=obj@entry=#) at weak-set.c:812
#7  0x7f9fd926f72f in close_port (port=#, 
explicit=) at ports.c:884
#8  0x7f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 
, vp=0x7adf30, registers=0x0, resume=-657049556) at 
vm-engine.c:786
#9  0x7f9fd92afb37 in scm_call_n (proc=0x7f9fd959b030, 
argv=argv@entry=0x7ffe4b9a1018, nargs=nargs@entry=1) at vm.c:1257
#10 0x7f9fd9233017 in scm_primitive_eval (exp=, 
exp@entry=0x855280) at eval.c:662
#11 0x7f9fd9233073 in scm_eval (exp=0x855280, 
module_or_state=module_or_state@entry=0x83d140) at eval.c:696
#12 0x7f9fd927e8d0 in scm_shell (argc=2, argv=0x7ffe4b9a1668) at 
script.c:454
#13 0x7f9fd9249a9d in invoke_main_func (body_data=0x7ffe4b9a1510) at 
init.c:340
#14 0x7f9fd922c28a in c_body (d=0x7ffe4b9a1450) at continuations.c:422
#15 0x7f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 
, vp=0x7adf30, registers=0x0, resume=-657049556) at 
vm-engine.c:786
#16 0x7f9fd92afb37 in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#17 0x7f9fd9231e69 in scm_call_0 (proc=proc@entry=#) at eval.c:481
#18 0x7f9fd929e7b2 in catch (tag=tag@entry=#t, thunk=#, handler=0x7950c0, pre_unwind_handler=0x7950a0) at throw.c:137
#19 0x7f9fd929ea95 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, 
thunk=, handler=, pre_unwind_handler=) at throw.c:254
#20 0x7f9fd929ec5f in scm_c_catch (tag=tag@entry=#t, 
body=body@entry=0x7f9fd922c280 , 
body_data=body_data@entry=0x7ffe4b9a1450, handler=handler@entry=0x7f9fd922c510 
, handler_data=handler_data@entry=0x7ffe4b9a1450, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f9fd922c370 
, pre_unwind_handler_data=0x7a9bc0) at throw.c:377
#21 0x7f9fd922c870 in scm_i_with_continuation_barrier 
(body=body@entry=0x7f9fd922c280 , 
body_data=body_data@entry=0x7ffe4b9a1450, handler=handler@entry=0x7f9fd922c510 
, handler_data=handler_data@entry=0x7ffe4b9a1450, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f9fd922c370 
, pre_unwind_handler_data=0x7a9bc0) at continuations.c:360
#22 0x7f9fd922c905 in scm_c_with_continuation_barrier (func=, data=) at continuations.c:456
#23 0x7f9fd929d3ec in with_guile (base=base@entry=0x7ffe4b9a14b8, 
data=data@entry=0x7ffe4b9a14e0) at threads.c:661
#24 0x7f9fd8f8efb8 in GC_call_with_stack_base (fn=fn@entry=0x7f9fd929d3a0 
, arg=arg@entry=0x7ffe4b9a14e0) at misc.c:1949
#25 0x7f9fd929d708 in scm_i_with_guile (dynamic_state=, 
data=data@entry=0x7ffe4b9a14e0, func=func@entry=0x7f9fd9249a80 
) at threads.c:704
#26 scm_with_guile (func=func@entry=0x7f9fd9249a80 , 
data=data@entry=0x7ffe4b9a1510) at threads.c:710
#27 0x7f9fd9249c32 in scm_boot_guile (argc=argc@entry=2, 
argv=argv@entry=0x7ffe4b9a1668, main_func=main_func@entry=0x400cb0 
, closure=closure@entry=0x0) at init.c:323
#28 0x00400b70 in main (argc=2, argv=0x7ffe4b9a1668) at guile.c:101
(gdb) info threads
  Id   Target Id Frame 
* 1Thread 0x7f9fd972eb80 (LWP 15573) "guile" __lll_lock_wait () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
--8<---cut here---end--->8---

So it seems quite clear that the thing has the alloc lock taken.  I
suppose this can happen if one of the libgc threads runs right when we
call fork and takes the alloc lock, right?

If that is correct, the fix would be to call fork within
‘GC_call_with_alloc_lock’.

How does that sound?

As a workaround on the Guix side, we might achieve the

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-07-04 Thread Ludovic Courtès

l...@gnu.org (Ludovic Courtès) skribis:

> #0  0x7fbb34bf794d in __GI___pthread_timedjoin_ex 
> (threadid=140441961314048, thread_return=thread_return@entry=0x0, 
> abstime=abstime@entry=0x0, block=block@entry=true)
> at pthread_join_common.c:89
> #1  0x7fbb34bf773c in __pthread_join (threadid=, 
> thread_return=thread_return@entry=0x0) at pthread_join.c:24
> #2  0x7fbb350d7548 in stop_finalization_thread () at finalizers.c:265
> #3  0x7fbb350d7759 in scm_i_finalizer_pre_fork () at finalizers.c:290
> #4  0x7fbb3514f256 in scm_fork () at posix.c:1222

Here’s a reproducer that works quite well (it hangs within a couple of
minutes):

(use-modules (guix utils)
 (ice-9 ftw)
 (ice-9 match)
 (srfi srfi-1)
 (srfi srfi-26)
 (rnrs io ports))

(define infodir
  (string-append (getenv "HOME")
 "/.guix-profile/share/info/"))

(define files
  (apply circular-list
 (map (cut string-append infodir <>)
  (scandir infodir (lambda (file)
 (string-suffix? ".gz" file))

(sigaction SIGALRM
  (lambda _
(alarm 1)))
(alarm 1)

(let loop ((files files)
   (n 0))
  (match files
((file . tail)
 (call-with-input-file file
   (lambda (port)
 (call-with-decompressed-port 'gzip port
   (lambda (port)
 (let loop ()
   (unless (eof-object? (get-bytevector-n port 777))
 (loop)))
 ;; (pk 'loop n file)
 (display ".")
 (loop tail (+ n 1)

Ludo’.

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-06-21 Thread Ricardo Wurmus



Hi Ludo,

> When downloading a number of substitutes, ‘guix substitute’ sometimes
> hangs for me since the switch to glibc 2.27.  Anyone else experiencing
> this?  It’s relatively frequent for me.

It has never happened to me.

-- 
Ricardo

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

2018-06-21 Thread Ludovic Courtès

Hello Guix!

When downloading a number of substitutes, ‘guix substitute’ sometimes
hangs for me since the switch to glibc 2.27.  Anyone else experiencing
this?  It’s relatively frequent for me.

The backtrace shows this:

--8<---cut here---start->8---
(gdb) bt
#0  0x7fbb34bf794d in __GI___pthread_timedjoin_ex 
(threadid=140441961314048, thread_return=thread_return@entry=0x0, 
abstime=abstime@entry=0x0, block=block@entry=true)
at pthread_join_common.c:89
#1  0x7fbb34bf773c in __pthread_join (threadid=, 
thread_return=thread_return@entry=0x0) at pthread_join.c:24
#2  0x7fbb350d7548 in stop_finalization_thread () at finalizers.c:265
#3  0x7fbb350d7759 in scm_i_finalizer_pre_fork () at finalizers.c:290
#4  0x7fbb3514f256 in scm_fork () at posix.c:1222
#5  0x7fbb351477fd in vm_regular_engine (thread=0x7fbb313739d0, 
vp=0x1569f30, registers=0x52de, resume=884963661) at vm-engine.c:784
#6  0x7fbb3514ae5a in scm_call_n (proc=0x7fbb355c2030, 
argv=argv@entry=0x7fff856ae7f8, nargs=nargs@entry=1)
at vm.c:1257
#7  0x7fbb350ceff7 in scm_primitive_eval (exp=exp@entry=0x164fb00) at 
eval.c:662
#8  0x7fbb350cf053 in scm_eval (exp=0x164fb00, 
module_or_state=module_or_state@entry=0x162b140) at eval.c:696
#9  0x7fbb3511a0b0 in scm_shell (argc=10, argv=0x15f0390) at script.c:454
#10 0x7fbb350e5add in invoke_main_func (body_data=0x7fff856aed10) at 
init.c:340
#11 0x7fbb350c82da in c_body (d=0x7fff856aec50) at continuations.c:422
#12 0x7fbb351477fd in vm_regular_engine (thread=0x7fbb313739d0, 
vp=0x1569f30, registers=0x52de, resume=884963661) at vm-engine.c:784
#13 0x7fbb3514ae5a in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#14 0x7fbb350cdef9 in scm_call_0 (proc=proc@entry=#) at eval.c:481
#15 0x7fbb3513a026 in catch (tag=tag@entry=#t, thunk=#, handler=0x15641e0, 
pre_unwind_handler=0x15641c0) at throw.c:137
#16 0x7fbb3513a365 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, 
thunk=, handler=, pre_unwind_handler=) at throw.c:254
#17 0x7fbb3513a51f in scm_c_catch (tag=tag@entry=#t, 
body=body@entry=0x7fbb350c82d0 , 
body_data=body_data@entry=0x7fff856aec50, handler=handler@entry=0x7fbb350c8560 
, 
handler_data=handler_data@entry=0x7fff856aec50, 
pre_unwind_handler=pre_unwind_handler@entry=0x7fbb350c83c0 
, pre_unwind_handler_data=0x1564b60) at throw.c:377
#18 0x7fbb350c88c0 in scm_i_with_continuation_barrier 
(body=body@entry=0x7fbb350c82d0 , 
body_data=body_data@entry=0x7fff856aec50, handler=handler@entry=0x7fbb350c8560 
, 
handler_data=handler_data@entry=0x7fff856aec50, 
pre_unwind_handler=pre_unwind_handler@entry=0x7fbb350c83c0 
, pre_unwind_handler_data=0x1564b60) at continuations.c:360
#19 0x7fbb350c8955 in scm_c_with_continuation_barrier (func=, data=) at continuations.c:456
#20 0x7fbb35138c3c in with_guile (base=base@entry=0x7fff856aecb8, 
data=data@entry=0x7fff856aece0) at threads.c:661
#21 0x7fbb34e2afb8 in GC_call_with_stack_base (fn=fn@entry=0x7fbb35138bf0 
, arg=arg@entry=0x7fff856aece0) at misc.c:1949
#22 0x7fbb35138fd8 in scm_i_with_guile (dynamic_state=, 
data=data@entry=0x7fff856aece0, func=func@entry=0x7fbb350e5ac0 
) at threads.c:704
#23 scm_with_guile (func=func@entry=0x7fbb350e5ac0 , 
data=data@entry=0x7fff856aed10) at threads.c:710
#24 0x7fbb350e5c72 in scm_boot_guile (argc=argc@entry=7, 
argv=argv@entry=0x7fff856aee68, main_func=main_func@entry=0x400ce0 
, closure=closure@entry=0x0) at init.c:323
#25 0x00400b80 in main (argc=7, argv=0x7fff856aee68) at guile.c:101
(gdb) info threads
  Id   Target Id Frame 
* 1Thread 0x7fbb355cab80 (LWP 21207) "guix substitute" 0x7fbb34bf794d 
in __GI___pthread_timedjoin_ex (threadid=140441961314048, 
thread_return=thread_return@entry=0x0, 
abstime=abstime@entry=0x0, block=block@entry=true) at 
pthread_join_common.c:89
  2Thread 0x7fbb3342c700 (LWP 21208) ".guix-real" 0x7fbb34bfc552 in 
futex_wait_cancelable (private=, expected=0, 
futex_word=0x7fbb3504f6ec )
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3Thread 0x7fbb32c2b700 (LWP 21209) ".guix-real" 0x7fbb34bfc552 in 
futex_wait_cancelable (private=, expected=0, 
futex_word=0x7fbb3504f6ec )
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  4Thread 0x7fbb3242a700 (LWP 21210) ".guix-real" 0x7fbb34bfc552 in 
futex_wait_cancelable (private=, expected=0, 
futex_word=0x7fbb3504f6ec )
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  5Thread 0x7fbb2eb78700 (LWP 21212) ".guix-real" 0x7fbb34bffaac in 
__libc_read (fd=9, buf=buf@entry=0x7fbb2eb77540, nbytes=nbytes@entry=1) at 
../sysdeps/unix/sysv/linux/read.c:27
  6Thread 0x7fbb31373700 (LWP 21214) "guix substitute" 0x7fbb34bffaac 
in __libc_read (fd=5, buf=buf@entry=0x7fbb31372a30, nbytes=nbytes@entry=1) at 
../sysdeps/unix/sysv/linux/read.c:27
--8<---cut here---end

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

12 matches

Site Navigation

Mail list logo

Footer information