bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-07-02 Thread Ludovic Courtès
Hi,

l...@gnu.org (Ludovic Courtès) skribis:

> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> :
>
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure put-bytevector: Wrong type argument in position 1 
> (expecting open output port): #
> builder for 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
> with exit code 1
>
> [...]
>
> ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc 
> --list-failures)
> ludo@ribbon ~/src/guix/+core-updates$ guix build 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> La jena derivo estos konstruata:
>/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> @ build-started 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 
> x86_64-linux 
> /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure put-bytevector: Wrong type argument in position 1 
> (expecting open output port): #
> builder for 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
> with exit code 1
> @ build-failed 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 
> builder for 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
> with exit code 1
> guix build: error: build failed: build of 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
> ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc 
> --list-failures)
> ludo@ribbon ~/src/guix/+core-updates$ guix build 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> La jena derivo estos konstruata:
>/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
> @ build-started 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 
> x86_64-linux 
> /var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
> grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
> '/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
> ERROR: In procedure variable-ref: Not a variable: (194 195 196 197 198 199 
> 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 
> 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 
> 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255)
> builder for 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
> with exit code 1
> @ build-failed 
> /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 
> builder for 
> `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
> with exit code 1

Bad news: when using Guile 2.2.4 for grafting, I’m still getting similar
crashes, perhaps marginally less frequently.

Ludo’.





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-10 Thread Mark H Weaver
Mark H Weaver  writes:
> From these facts, I believe we can conclude that the error is happening
> within the body of 'begin-thread' in 'n-par-for-each', but outside of
> the body of 'proc' passed to 'n-par-for-each'.

It could also be happening within the 'call-with-backtrace', used here:

  
https://git.savannah.gnu.org/cgit/guile.git/tree/module/ice-9/threads.scm?h=v2.2.3#n159

More investigation is needed, but it's unlikely that I'll be able to
spend more time on this today.

   Mark





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-10 Thread Mark H Weaver
Hi,

I've been studying the backtrace provided by Ludovic, trying to
determine where, in the Scheme code, this error happened in thread 1.

l...@gnu.org (Ludovic Courtès) writes:
> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
> #0  0x7f7019db0d79 in scm_is_pair (x= Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
> #1  scm_ilength (sx=) at list.c:190
> #2  0x7f7019e0e2f6 in vm_regular_engine (thread=0x1425670, vp=0x144e6c0, 
> registers=0x0, resume=16) at vm-engine.c:909
> #3  0x7f7019e117da in scm_call_n (proc=proc@entry=#, 
> argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
> #4  0x7f7019d94879 in scm_call_0 (proc=proc@entry=#) at 
> eval.c:481
> #5  0x7f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=# 1555fe0>) at async.c:400
> #6  0x7f7019e0e17d in vm_regular_engine (thread=0x1425670, vp=0x144e6c0, 
> registers=0x0, resume=16) at vm-engine.c:784
> #7  0x7f7019e117da in scm_call_n (proc=#, 
> argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
> #8  0x7f7019d94879 in scm_call_0 (proc=) at eval.c:481
> #9  0x7f7019dff496 in really_launch (d=0x149e520) at threads.c:794

The call to 'scm_call_with_unblocked_asyncs' in frame #5 is an important
clue, because it is used in very few places.  I see only two such places
in Guile (and none in Guix):

  
https://git.savannah.gnu.org/cgit/guile.git/tree/module/ice-9/scm-style-repl.scm?h=v2.2.3#n80
  
https://git.savannah.gnu.org/cgit/guile.git/tree/module/ice-9/threads.scm?h=v2.2.3#n158

I assume that it must be the second one in this case.  So, I believe
frames 5-9 correspond to lines 147 through 158 of ice-9/threads.scm:

  
https://git.savannah.gnu.org/cgit/guile.git/tree/module/ice-9/threads.scm?h=v2.2.3#n147

which is within 'call-with-new-thread', called via the 'begin-thread'
macro used by 'n-par-for-each', here:

  
https://git.savannah.gnu.org/cgit/guile.git/tree/module/ice-9/threads.scm?h=v2.2.3#n320

However, I would expect to see a 'catch' frame deeper in the stack,
since the procedure passed to 'n-par-for-each' by graft.scm is wrapped
by 'exit-on-exception'.

>From these facts, I believe we can conclude that the error is happening
within the body of 'begin-thread' in 'n-par-for-each', but outside of
the body of 'proc' passed to 'n-par-for-each'.

There's not a lot of code in there.  My best guess at the moment is that
this error is happening within the (apply proc args) on line 335,
although I do not yet have an explanation of how that could happen.

To be continued...

   Mark





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-10 Thread Andy Wingo
On Thu 10 May 2018 08:50, Mark H Weaver  writes:

> Andy Wingo  writes:
>
>> On Wed 09 May 2018 02:32, Mark H Weaver  writes:
>>
>>> However, I think it's _far_ more likely that the NULL argument on the
>>> stack was copied from memory shared by multiple threads without proper
>>> thread synchronization.
>>
>> I think this is unlikely on x86 given its total-store-ordering memory
>> model.  I agree with you about the value of barriers, but I don't think
>> they are part of this bug that Ludo is seeing.
>
> I think you're forgetting about the C compiler.  It's true that x86
> machine code has a TSO memory model, but C does not.  In the absence of
> barriers, the C compiler may freely reorder stores to non-volatile,
> non-atomic objects.  In particular, it is free to reorder the
> initialization of an object with the write of that object's address.
>
> I admit that I haven't checked whether GCC 5.5.0 does this in practice.
> Do you have reason to believe that it never does so?

Oh I agree with you here as well, and compiler reordering could well be
happening here.  My suspicions are however that it's not happening.  In
libguile, we rarely mutate shared state, and in that case it's usually
within mutexes.  The main source of mutation in libguile is
initialization -- but there that's on a fresh object local to a thread,
and we try to avoid publishing that object to other threads without a
barrier (atomic or mutex), and in any case such publishing is usually
outside of the region that a compiler can work on.

There is the possibility of mutation via e.g. vector-set!, but hopefully
we aren't doing that on shared data; likewise in Scheme there are
barriers (the atomic box instructions and mutexes, both of which are
compiler barriers as well).  It's still possible to write Scheme
programs with races, of course, but I don't think that's what's
happening here.

I could be misunderstanding things of course!

Andy





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-09 Thread Mark H Weaver
Hi Andy,

Andy Wingo  writes:

> On Wed 09 May 2018 02:32, Mark H Weaver  writes:
>
>> However, I think it's _far_ more likely that the NULL argument on the
>> stack was copied from memory shared by multiple threads without proper
>> thread synchronization.
>
> I think this is unlikely on x86 given its total-store-ordering memory
> model.  I agree with you about the value of barriers, but I don't think
> they are part of this bug that Ludo is seeing.

I think you're forgetting about the C compiler.  It's true that x86
machine code has a TSO memory model, but C does not.  In the absence of
barriers, the C compiler may freely reorder stores to non-volatile,
non-atomic objects.  In particular, it is free to reorder the
initialization of an object with the write of that object's address.

I admit that I haven't checked whether GCC 5.5.0 does this in practice.
Do you have reason to believe that it never does so?

 Thanks,
   Mark





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-09 Thread Andy Wingo
On Wed 09 May 2018 02:32, Mark H Weaver  writes:

> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

I think this is unlikely on x86 given its total-store-ordering memory
model.  I agree with you about the value of barriers, but I don't think
they are part of this bug that Ludo is seeing.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

It could be.  It would zero out VM stack frames, and if GC is called
when/if vp->sp is out of date, then that would be possible.  However I
think vp->sp is never out of date, so that's probably not it.  The
things that can be out of date are the on-heap copy of the IP (vp->ip)
and the local register copy of the sp (sp).  It's more likely for the
local "sp" cache to be out of date -- if we recursed through Scheme in a
call out from the VM, eventualy causing stack expansion and relocation,
then on the return forgot to re-cache the sp value, that could be it.

Similarly, forgetting to set vp->ip before calling out to something that
could GC could likewise cause a problem because the stack map wouldn't
be right and the precise stack marker could clear a value by mistake.
This only happens for non-innermost frames though; the innermost frame
is marked conservatively.

The rules are: update vp->ip before something that can allocate, and
update local "sp" after returning from a C function that could have
recursively called Scheme.

I did find a couple places in the VM where we forgot to do one of these,
e.g. 07b7490f73fb4a6cb0c9577d082d37c8d9cee7b0 and just now
9a72e212622fa3bd118d7c02c4386601285b3224.  These two patches aren't
shipped yet fwiw.

Andy





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-09 Thread Ludovic Courtès
Hi Mark,

Mark H Weaver  skribis:

> l...@gnu.org (Ludovic Courtès) writes:
> [...]
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0  0x7f7019db0d79 in scm_is_pair (x=> Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
>> #1  scm_ilength (sx=) at list.c:190
> [...]
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread.  In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absense of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets without
any inter-thread communication whatsoever.

What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

I thought about it because that’s one way part of the VM stack could be
zeroed out.

>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack.  So I did this:
>>
>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>> code, or otherwise push anything on the stack, you will need to
>> CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>>  
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP()   \
>> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.

I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where gcc stores
‘sp’.  Here’s an example:

--8<---cut here---start->8---
(gdb) frame
#16 0x7fabf30af2ca in vm_regular_engine (thread=0x24e6000, vp=0x22de6c0, 
registers=0x0, resume=40) at vm-engine.c:785
785   ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax0x1  1
rbx0xa  10
rcx0x28 40
rdx0x0  0
rsi0x23f192037689632
rdi0x24e600038690816
rbp0x22de6c00x22de6c0
rsp0x7fabcce18660   0x7fabcce18660
r8 0x1  1
r9 0x1  1
r100x100256
r110x23f192037689632
r120x7fabf330b8c0   140376496191680
r130x7fabec158e30   140376376970800
r140x7fabf30c6d7c   140376493813116
r150x7fabf0fa7f28   140376459083560
rip0x7fabf30af2ca   0x7fabf30af2ca 
eflags 0x10246  [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0  0
es 0x0  0
fs 0x0  0
gs 0x0  0
--8<---cut here---end--->8---

My hypothesis was that such a bug could lead heap elements to be
reclaimed too early.  This is more likely to happen in a multi-threaded
context because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync ‘vp->sp’.

Does that make sense?

> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?

We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often as you might have noticed ;-)) I think it would be nice
to fix the issue.

Thanks,
Ludo’.





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-08 Thread Mark H Weaver
Hi Ludovic,

l...@gnu.org (Ludovic Courtès) writes:
[...]
> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
> #0  0x7f7019db0d79 in scm_is_pair (x= Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
> #1  scm_ilength (sx=) at list.c:190
[...]
> What this means is that Thread 1 gets NULL instead of a list as its
> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>
> How can arguments on the VM stack be zeroed?

I doubt that's what happened, because I expect that each VM stack is
dedicated to a single hardware thread.  In theory, if a single VM stack
is used by one thread, and then later used by another thread,
thread-safety issues on the VM stack could arise in the absense of
proper thread synchronization.

However, I think it's _far_ more likely that the NULL argument on the
stack was copied from memory shared by multiple threads without proper
thread synchronization.

On modern weakly-ordered memory architectures, writes by one thread may
be seen in a different order by another thread.  For example, if one
thread allocates a pair, initializes its CAR and CDR to non-NULL values,
and then writes the address of the pair somewhere, another thread could
first read the address of the pair, and then read NULL from its CAR or
CDR, unless proper thread synchronization is used.  At best, this
requires memory barriers in both the reader and writer which are
typically quite expensive.  On x86 processors the read barrier could
expand into a no-op in typical cases, but the write barrier cannot be
avoided, and must be placed after initializing the structure and before
writing its pointer.

I think it's most likely that something like this is happening, because
NULL is not a valid SCM value.  The only code that should normally write
a NULL to an SCM slot is Boehm GC, which clears memory before handing it
to the user.  So, if you read a NULL from a memory location that's meant
to hold an SCM, then it's most likely because the reading thread does
not yet see the writes that initialized it, because of the
weakly-ordered memory architecture.

> I commented out the MADV_DONTNEED call to be sure, but I can still
> reproduce the bug.

I strongly doubt that the MADV_DONTNEED is relevant to this issue.

> Then I thought vp->sp might be out-of-sync compared to the local
> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
> mark a few items on the tip of the stack.  So I did this:
>
> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
> index 9509cd643..1136b2271 100644
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -151,7 +151,8 @@
> code, or otherwise push anything on the stack, you will need to
> CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>  
> -#define SYNC_IP() vp->ip = (ip)
> +#define SYNC_IP()   \
> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)

I don't see how a change like this could be useful for any thread safety
problem.  Since stores by one thread may be seen in a different order by
other threads, the memory write corresponding to "vp->sp = (sp)" might
be delayed for some arbitrarily long period of time after the writes
that follow it, up until the next appropriate memory barrier.

For now, I would suggest avoiding multi-threaded code in Guix, or at
least to avoid loading any Scheme code from multiple threads.

How about using multiple processes instead?

  Mark





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-05-08 Thread Ludovic Courtès
Hello Andy & Mark,

l...@gnu.org (Ludovic Courtès) skribis:

> l...@gnu.org (Ludovic Courtès) skribis:
>
>> On current ‘core-updates’, the code in (guix build graft) triggers
>> random Guile crashes (GC issue? thread-safety issue?) when running on
>> Guile 2.2.2, as initially reported by Marius at
>> :
>
> The problem still shows up with Guile 2.2.3+.

Here’s a clearer backtrace:

--8<---cut here---start->8---
Core was generated by 
`/gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile t.scm'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f7019db0d79 in scm_is_pair (x=0x0) at ../libguile/pairs.h:159
159 ../libguile/pairs.h: Dosiero aŭ dosierujo ne ekzistas.
[Current thread is 1 (Thread 0x7f6fe6f5d700 (LWP 2856))]
(gdb) thread apply all bt

Thread 18 (Thread 0x7f7015859700 (LWP 2845)):
#0  0x7f70198c77af in futex_wait_cancelable (private=, 
expected=0, futex_word=0x12715a0) at 
../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, 
cond=cond@entry=0x1271578) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x1271578, 
mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c:655
#3  0x7f7019dffe45 in scm_pthread_cond_wait (cond=cond@entry=0x1271578, 
mutex=mutex@entry=0x129bd80) at threads.c:1621
#4  0x7f7019eb in block_self (
queue=((#) # # # # # 
# # # # # # #), mutex=mutex@entry=0x129bd80, 
waittime=waittime@entry=0x0) at threads.c:316
#5  0x7f7019e00157 in lock_mutex (current_thread=0x1271540, waittime=0x0, 
m=0x129bd80, kind=SCM_MUTEX_STANDARD) at threads.c:1037
#6  scm_timed_lock_mutex (mutex=#, timeout=) 
at threads.c:1098
#7  0x7f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, 
registers=0x0, resume=428636079) at vm-engine.c:784
#8  0x7f7019e117da in scm_call_n (proc=proc@entry=#, 
argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#9  0x7f7019d94879 in scm_call_0 (proc=proc@entry=#) at 
eval.c:481
#10 0x7f7019d85eb8 in scm_call_with_unblocked_asyncs (proc=#) at async.c:400
#11 0x7f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, 
registers=0x0, resume=428636079) at vm-engine.c:784
#12 0x7f7019e117da in scm_call_n (proc=#, 
argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#13 0x7f7019d94879 in scm_call_0 (proc=) at eval.c:481
#14 0x7f7019dff496 in really_launch (d=0x1633240) at threads.c:794
#15 0x7f7019d8ec5a in c_body (d=0x7f7015858e60) at continuations.c:422
#16 0x7f7019e0e17d in vm_regular_engine (thread=0x12715a0, vp=0x144ecf0, 
registers=0x0, resume=428636079) at vm-engine.c:784
#17 0x7f7019e117da in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1257
#18 0x7f7019d94879 in scm_call_0 (proc=proc@entry=#) at eval.c:481
#19 0x7f7019e009a6 in catch (tag=tag@entry=#t, thunk=#, 
handler=0x1423ce0, 
pre_unwind_handler=0x1423c60) at throw.c:137
#20 0x7f7019e00ce5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, 
thunk=, handler=, pre_unwind_handler=) at throw.c:254
#21 0x7f7019e00e9f in scm_c_catch (tag=tag@entry=#t, 
body=body@entry=0x7f7019d8ec50 , 
body_data=body_data@entry=0x7f7015858e60, 
handler=handler@entry=0x7f7019d8eee0 , 
handler_data=handler_data@entry=0x7f7015858e60, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 
, pre_unwind_handler_data=0x1311bc0) at throw.c:377
#22 0x7f7019d8f240 in scm_i_with_continuation_barrier 
(body=body@entry=0x7f7019d8ec50 , 
body_data=body_data@entry=0x7f7015858e60, 
handler=handler@entry=0x7f7019d8eee0 , 
handler_data=handler_data@entry=0x7f7015858e60, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f7019d8ed40 
, pre_unwind_handler_data=0x1311bc0) at continuations.c:360
#23 0x7f7019d8f2d5 in scm_c_with_continuation_barrier (func=, data=) at continuations.c:456
#24 0x7f7019dff5bc in with_guile (base=base@entry=0x7f7015858ec8, 
data=data@entry=0x7f7015858ef0) at threads.c:661
#25 0x7f7019af4758 in GC_call_with_stack_base (fn=fn@entry=0x7f7019dff570 
, arg=arg@entry=0x7f7015858ef0) at misc.c:1935
#26 0x7f7019dfec8d in scm_i_with_guile (dynamic_state=, 
data=0x1633240, func=0x7f7019dff420 ) at threads.c:704
#27 launch_thread (d=0x1633240) at threads.c:803
#28 0x7f70198c14d5 in start_thread (arg=0x7f7015859700) at 
pthread_create.c:465
#29 0x7f70182342cf in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 17 (Thread 0x7f701714f700 (LWP 2842)):
#0  0x7f70198c77af in futex_wait_cancelable (private=, 
expected=0, futex_word=0x1271ae0) at 
../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x129bd80, 
cond=cond@entry=0x1271ab8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x1271ab8, 
mutex=mutex@entry=0x129bd80) at pthread_cond_wait.c

bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2018-04-24 Thread Ludovic Courtès
l...@gnu.org (Ludovic Courtès) skribis:

> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> :

The problem still shows up with Guile 2.2.3+.  Some raw notes on clues I
have so far.

When running the grafting code under Helgrind, we get things like this:

--8<---cut here---start->8---
==30263== 
==30263== 
==30263==  Lock at 0x5168F60 was first observed
==30263==at 0x4C34943: pthread_mutex_init (in 
/gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263==by 0x4EF36C3: scm_threads_prehistory (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E9F64D: scm_i_init_guile (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EF2557: scm_i_init_thread_for_guile (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EF2588: with_guile (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x5185757: GC_call_with_stack_base (in 
/gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263==by 0x4EF2957: scm_with_guile (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E9F5F1: scm_boot_guile (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x400B5F: main (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/bin/guile)
==30263==  Address 0x5168f60 is 0 bytes inside data symbol "scm_i_misc_mutex"
==30263== 
==30263==  Lock at 0x53AA240 was first observed
==30263==at 0x4C30D6D: mutex_trylock_WRK (in 
/gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263==by 0x4C349E6: pthread_mutex_trylock (in 
/gnu/store/yiiz2cdfsx6rqvp1gvlz34pvrl0cjf5c-valgrind-3.13.0/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==30263==by 0x517C9C3: GC_register_disappearing_link_inner (in 
/gnu/store/c4jrwbv7qckvnqa7f3h7bd1hh8rbg72y-libgc-7.6.0/lib/libgc.so.1.0.3)
==30263==by 0x4F0585D: weak_set_add_x (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4F05E1F: scm_c_weak_set_add_x (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EF0964: scm_i_str2symbol (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EF1162: scm_from_utf8_symboln (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EAA950: scm_c_public_ref (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4ED39BF: scm_compile_shell_switches (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4ED3A24: scm_shell (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E9F45C: invoke_main_func (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E81C59: c_body (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==  Address 0x53aa240 is 0 bytes inside data symbol "GC_allocate_ml"
==30263== 
==30263== Possible data race during write of size 8 at 0xB349528 by thread #8
==30263== Locks held: 1, at address 0x5168F60
==30263==at 0x4E85E65: scm_strerror (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4573E4F: ???
==30263==by 0x4E85F12: scm_syserror (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E8F228: scm_mkdir (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4F0117C: vm_regular_engine (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4F047D9: scm_call_n (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4E78EB7: scm_call_with_unblocked_asyncs (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4F0117C: vm_regular_engine (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4F047D9: scm_call_n (in 
/gnu/store/b90y3swxlx3vw2yyacs8cz59b8cbpbw5-guile-2.2.3/lib/libguile-2.2.so.1.3.0)
==30263==by 0x4EF2495: really_launch (in 
/gnu/store/b90y3sw

bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2017-08-23 Thread Ludovic Courtès
l...@gnu.org (Ludovic Courtès) skribis:

> On current ‘core-updates’, the code in (guix build graft) triggers
> random Guile crashes (GC issue? thread-safety issue?) when running on
> Guile 2.2.2, as initially reported by Marius at
> :

Commit e4925e00ca420737556e2039b4fa1c40121ee567 works around it by using
Guile 2.0 for these derivations.

Ludo’.





bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2

2017-08-23 Thread Ludovic Courtès
On current ‘core-updates’, the code in (guix build graft) triggers
random Guile crashes (GC issue? thread-safety issue?) when running on
Guile 2.2.2, as initially reported by Marius at
:

--8<---cut here---start->8---
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
'/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure put-bytevector: Wrong type argument in position 1 
(expecting open output port): #
builder for 
`/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
with exit code 1

[...]

ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc 
--list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
La jena derivo estos konstruata:
   /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 
x86_64-linux 
/var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
'/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure put-bytevector: Wrong type argument in position 1 
(expecting open output port): #
builder for 
`/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
with exit code 1
@ build-failed 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder 
for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
with exit code 1
guix build: error: build failed: build of 
`/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc 
--list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
La jena derivo estos konstruata:
   /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 
x86_64-linux 
/var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
'/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
ERROR: In procedure variable-ref: Not a variable: (194 195 196 197 198 199 200 
201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
241 242 243 244 245 246 247 248 249 250 251 252 253 254 255)
builder for 
`/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
with exit code 1
@ build-failed 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 1 builder 
for `/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed 
with exit code 1
--8<---cut here---end--->8---

The problem does not show up when running on a single thread:

--8<---cut here---start->8---
guix build: error: build failed: build of 
`/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv' failed
ludo@ribbon ~/src/guix/+core-updates$ guix gc --clear-failures $(guix gc 
--list-failures)
ludo@ribbon ~/src/guix/+core-updates$ guix build 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv --cores=1
La jena derivo estos konstruata:
   /gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv
@ build-started 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv - 
x86_64-linux 
/var/log/guix/drvs/3c//rrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv.bz2
grafting '/gnu/store/i71kkrch1asnwvm0vm71w9aaza0n2m9q-icecat-52.1.0-gnu1' -> 
'/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1'...
@ build-succeeded 
/gnu/store/3crrls3ms9m3g860yvqib02rnm7akhf8-icecat-52.1.0-gnu1.drv -
/gnu/store/7w92kgcdcmf7lsc9nvs6b2ca7mk9422s-icecat-52.1.0-gnu1
--8<---cut here---end--->8---

Ludo’.