Re: Issues with debugging GC-related crashes #2

2018-04-16 Thread Matthias Klumpp via Digitalmars-d

On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:

[...]
The code uses std.typecons.scoped occasionally, does no GC 
allocations in destructors and does nothing to mess with the GC 
in general. There are a few calls to GC.add/removeRoot in the 
gir-to-d generated code (ObjectG.d), but those are very 
unlikely to cause issues (removing them did yield the same 
crash, and the same code is used by more projects).

[...]


Another thing to mention is that the software uses LMDB[1] and 
mmaps huge amounts of data into memory (gigabyte range).

Not sure if that information is relevant at all though.

[1]: https://symas.com/lmdb/technical/



Re: Issues with debugging GC-related crashes #2

2018-04-17 Thread Kagamin via Digitalmars-d

On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
The code uses std.typecons.scoped occasionally, does no GC 
allocations in destructors and does nothing to mess with the GC 
in general.


What do you use destructors for?


Re: Issues with debugging GC-related crashes #2

2018-04-17 Thread Kagamin via Digitalmars-d

Other stuff to try:
1. run application compiled on debian against ubuntu libs
2. can you mix dependencies from debian and ubuntu?


Re: Issues with debugging GC-related crashes #2

2018-04-17 Thread Matthias Klumpp via Digitalmars-d

On Tuesday, 17 April 2018 at 08:23:07 UTC, Kagamin wrote:

Other stuff to try:
1. run application compiled on debian against ubuntu libs
2. can you mix dependencies from debian and ubuntu?


I haven't tried that yet (next on my todo list), but if I run the 
program compiled with AddressSanitizer on Debian, I do get 
errors like:

```
AddressSanitizer:DEADLYSIGNAL
=
==25964==ERROR: AddressSanitizer: SEGV on unknown address 
0x7fac8db3f800 (pc 0x7fac9c433430 bp 0x0008 sp 
0x7ffc92be3dd0 T0)

==25964==The signal is caused by a READ memory access.
#0 0x7fac9c43342f in 
_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f)
#1 0x7fac9c433a2f in 
_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa1a2f)
#2 0x7fac9c459ad4 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ad4)
#3 0x7fac9c459ac6 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
#4 0x7fac9c459ac6 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
#5 0x7fac9c459ac6 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
#6 0x7fac9c459a51 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7a51)
#7 0x7fac9c430f26 in 
_D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9ef26)
#8 0x7fac9c431226 in 
_D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs18fullCollectNoStackMFNbZ2goFNbPSQEaQEaQDyQEj3GcxZmTQvZQDfMFNbKQBgZm (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9f226)
#9 0x7fac9c4355d0 in gc_term 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa35d0)
#10 0x7fac9c443ab2 in rt_term 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1ab2)
#11 0x7fac9c443e65 in 
_D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZv 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1e65)
#12 0x7fac9c443d0b in _d_run_main 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1d0b)
#13 0x7fac9b9cfa86 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x21a86)
#14 0x55acd1dbe1d9 in _start 
(/home/matthias/Development/AppStream/generator/build/src/asgen/appstream-generator+0xba1d9)


AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV 
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f) 
in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv

==25964==ABORTING
```
So, I don't think this bug is actually limited to Ubuntu, it just 
shows up there more often for some reason.


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Kagamin via Digitalmars-d
You can call GC.collect at some points in the program to see if 
they can trigger the crash 
https://dlang.org/library/core/memory/gc.collect.html
If you link against debug druntime, GC can check invariants for 
correctness of its structures. There's a number of debugging 
options for GC, though not sure which ones are enabled in default 
debug build of druntime: 
https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388
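
To illustrate the first suggestion, here is a minimal sketch of that
kind of bisection; the processing-loop names are placeholders, not
taken from asgen:

```
import core.memory : GC;

void processPackage(string pkg)
{
    // placeholder for one unit of real work
}

void processAll(string[] packages)
{
    foreach (pkg; packages)
    {
        processPackage(pkg);
        // If the heap was corrupted by the previous step, forcing a
        // collection right here makes the crash show up much closer
        // to its cause than waiting for the next automatic collection.
        GC.collect();
    }
}
```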


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Matthias Klumpp via Digitalmars-d

On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
You can call GC.collect at some points in the program to see if 
they can trigger the crash


I already do that, and indeed I get crashes. I could throw those 
calls into every function though, or make a minimal pool size, 
maybe that yields something...



https://dlang.org/library/core/memory/gc.collect.html
If you link against debug druntime, GC can check invariants for 
correctness of its structures. There's a number of debugging 
options for GC, though not sure which ones are enabled in 
default debug build of druntime: 
https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388


I get compile errors for the INVARIANT option, and I don't 
actually know how to deal with those properly:

```
src/gc/impl/conservative/gc.d(1396): Error: shared mutable method 
core.internal.spinlock.SpinLock.lock is not callable using a 
shared const object
src/gc/impl/conservative/gc.d(1396):Consider adding const 
or inout to core.internal.spinlock.SpinLock.lock
src/gc/impl/conservative/gc.d(1403): Error: shared mutable method 
core.internal.spinlock.SpinLock.unlock is not callable using a 
shared const object
src/gc/impl/conservative/gc.d(1403):Consider adding const 
or inout to core.internal.spinlock.SpinLock.unlock

```

Commenting out the locks (eww!!) yields no change in behavior 
though.


The crashes always appear in 
https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L1990


Meanwhile, I also tried to reproduce the crash locally in a 
chroot, with no result. All libraries used on the machine where 
the crashes occur and on my local machine were 100% identical; 
the only differences I am aware of are the hardware (AWS cloud 
vs. home workstation) and the Linux kernel (4.4.0 vs. 4.15.0).


The crash happens whether the program is built with LDC or DMD; 
the compiler doesn't influence the result. Copying over a binary 
from the working machine to the crashing one also results in the 
same errors.


I am completely out of ideas here. Since I think I can rule out a 
hardware fault at Amazon, I don't even know what else would make 
sense to try.


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread kinke via Digitalmars-d

On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
There's a number of debugging options for GC, though not sure 
which

ones are enabled in default debug build of druntime


Speaking for LDC, none are; they all need to be enabled 
explicitly. There's a whole bunch of them 
(https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious.
I'm not really surprised that there are compilation errors when 
enabling the debug options; that's the likely fate of untested 
code, unfortunately.


If possible, I'd give static linking a try.


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Johannes Pfau via Digitalmars-d
On Wed, 18 Apr 2018 17:40:56 +, Matthias Klumpp wrote:
> 
> The crashes always appear in
> https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/
gc.d#L1990
> 

The important point to note here is that this is not one of these 'GC 
collected something because it was not reachable' bugs. A crash in the GC 
mark routine means it somehow scans an invalid address range. Actually, 
I've seen this before...


> Meanwhile, I also tried to reproduce the crash locally in a chroot, with
> no result. All libraries used between the machine where the crashes
> occur and my local machine were 100% identical,
> the only differences I am aware of are obviously the hardware (AWS cloud
> vs. home workstation) and the Linux kernel (4.4.0 vs 4.15.0)
> 
> The crash happens when built with LDC or DMD, that doesn't influence the
> result. Copying over a binary from the working machine to the crashing
> one also results in the same errors.


Actually this sounds very familiar:
https://github.com/D-Programming-GDC/GDC/pull/236

It took us quite some time to reduce and debug this:

https://github.com/D-Programming-GDC/GDC/pull/236/commits/
5021b8d031fcacac52ee43d83508a5d2856606cd

So I wondered why I couldn't find this in the upstream druntime code. 
Turns out our pull request has never been merged:

https://github.com/dlang/druntime/pull/1678


-- 
Johannes


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread negi via Digitalmars-d

On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:

...


This reminds me of (otherwise unrelated) problems I had involving 
Linux 4.15.


If you feel out of ideas, I suggest you take a look at the 
kernels. It might be that Ubuntu is turning some security-related 
knob in a different direction than Debian. Or it might be some 
bug in 4.15 (I found it to be quite buggy, especially during the 
first few point releases; 4.15 was the first upstream release 
including large amounts of meltdown/spectre-related work).


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Matthias Klumpp via Digitalmars-d

On Wednesday, 18 April 2018 at 18:55:48 UTC, kinke wrote:

On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
There's a number of debugging options for GC, though not sure 
which

ones are enabled in default debug build of druntime


Speaking for LDC, none are, they all need to be enabled 
explicitly. There's a whole bunch of them 
(https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious.
I'm not really surprised that there are compilation errors when 
enabling the debug options, that's a likely fate of untested 
code unfortunately.


Yeah... Maybe making a CI build with "enable all the things" 
makes sense to combat that...



If possible, I'd give static linking a try.


I tried that, with at least linking druntime and phobos 
statically. I did not, however, link all the things statically.
That is something to try (at least statically linking all the D 
libraries).




Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Matthias Klumpp via Digitalmars-d
On Wednesday, 18 April 2018 at 20:40:52 UTC, Matthias Klumpp 
wrote:

[...]

If possible, I'd give static linking a try.


I tried that, with at least linking druntime and phobos 
statically. I did not, however, link all the things statically.
That is something to try (at least statically linking all the D 
libraries).


No luck...
```
#0  0x007f10e8 in 
_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., 
ptop=0x7fcf6a11b010, pbot=0x7fcf6951b010)

at src/gc/impl/conservative/gc.d:1990
p1 = 0x7fcf6951b010
p2 = 0x7fcf6a11b010
stackPos = 0
stack =
{{pbot = 0x7fffcc60, ptop = 0x7f15af 
<_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 
0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 
0x87b4118, ptop = 0x87b4118}, {pbot = 0x0, ptop = 0xc4cda0}, 
{pbot = 0x7fffcca0, ptop = 0x7f15af 
<_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 
0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 
0x87af258, ptop = 0x87af258}, {pbot = 0x0, ptop = 0xc4cda0}, 
{pbot = 0x7fffcce0, ptop = 0x7f15af 
<_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 
0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 
0x87af158, ptop = 0x87af158}, {pbot = 0x0, ptop = 0xc4cda0}, 
{pbot = 0x7fffcd20, ptop = 0x7f15af 
<_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 
0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 
0x87af0d8, ptop = 0x87af0d8}, {pbot = 0x0, ptop = 0xc4cda0}, 
{pbot = 0x7fdf6b265000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 
0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x7fdf6b265000}, 
{pbot = 0x0, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot 
= 0x1d, ptop = 0x1c}, {pbot = 0x1a44100, ptop = 0x1a4410}, {pbot 
= 0x1a44, ptop = 0x4}, {pbot = 0x7fdf6b355000, ptop = 0x69b96a0}, 
{pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop 
= 0x4ac0}, {pbot = 0x4a, ptop = 0x0}, {pbot = 0x1381d00, ptop = 
0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x4ac00, ptop = 
0x4ac0}, {pbot = 0x4a, ptop = 0x4}}

pcache = 0
pools = 0x69b96a0
highpool = 40
minAddr = 0x7fcf5951b000
memSize = 208820465664
base = 0xaef0
top = 0xae
p = 0x4618770
pool = 0x0
low = 110859936
high = 40
mid = 140528533483520
offset = 208820465664
biti = 8329709
pn = 142275872
bin = 1
offsetBase = 0
next = 0xc4cc80
next = {pbot = 0x7fffcbe0, ptop = 0x7f19ed 
<_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi+57>}

__r292 = 0x7fffd320
__key293 = 8376632
rng = @0x0: 
#1  0x007f19ed in 
_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (this=0x7fffd360, __applyArg0=...)

at src/gc/impl/conservative/gc.d:2188
range = {pbot = 0x7fcf6951b010, ptop = 0x7fcf6a11b010, ti 
= 0x0}
#2  0x007fd161 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (this=0x7fffd320, e=...) at src/rt/util/container/treap.d:47
#3  0x007fd539 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x80396c0) at src/rt/util/container/treap.d:221

result = 0
#4  0x007fd565 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x87c8140) at src/rt/util/container/treap.d:224

result = 0
#5  0x007fd516 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000950) at src/rt/util/container/treap.d:218

result = 16844032
#6  0x007fd516 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000a50) at src/rt/util/container/treap.d:218

result = 0
#7  0x007fd516 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000c50) at src/rt/util/container/treap.d:218

result = 0

[etc...]
#37 0x0077e889 in core.memory.GC.collect() () at 
src/core/memory.d:207
#38 0x006b4791 in asgen.engine.Engine.gcCollect() 
(this=0x77ee13c0) at ../src/asgen/engine.d:122

```




Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread kinke via Digitalmars-d

On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
Actually this sounds very familiar: 
https://github.com/D-Programming-GDC/GDC/pull/236


Interesting, but I don't think it applies here. Both start and 
end addresses are 16-byte aligned, and both cannot be accessed 
according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot 
access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 
<error: Cannot access memory at address 0x7fcf4e321010>`). That's 
quite interesting too: `memSize = 209153867776`. Don't know what 
exactly it is, but it's a pretty large number (~194 GB).


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Matthias Klumpp via Digitalmars-d

On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:

[...]

Actually this sounds very familiar: 
https://github.com/D-Programming-GDC/GDC/pull/236


it took us quite some time to reduce and debug this:

https://github.com/D-Programming-GDC/GDC/pull/236/commits/ 
5021b8d031fcacac52ee43d83508a5d2856606cd


So I wondered why I couldn't find this in the upstream druntime 
code. Turns out our pull request has never been merged


https://github.com/dlang/druntime/pull/1678


Just to be sure, I applied your patch, but unfortunately I still 
get the same result...


On Wednesday, 18 April 2018 at 20:38:20 UTC, negi wrote:

On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:

...


This reminds me of (otherwise unrelated) problems I had 
involving Linux 4.15.


If you feel out of ideas, I suggest you take a look at the 
kernels.  It might
be that Ubuntu is turning some security-related knob in a 
different direction
than Debian.  Or it might be some bug in 4.15 (I found it to be 
quite buggy,
specially during the first few point releases; 4.15 was the 
first upstream
release including large amounts of meltdown/spectre-related 
work).


All the crashes are happening on a 4.4 kernel though... I am 
currently pondering digging out a 4.4 kernel here to see if that 
makes me reproduce the crash locally.


Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Matthias Klumpp via Digitalmars-d

On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau 
wrote:
Actually this sounds very familiar: 
https://github.com/D-Programming-GDC/GDC/pull/236


Interesting, but I don't think it applies here. Both start and 
end addresses are 16-byte aligned, and both cannot be accessed 
according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot 
access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 
<error: Cannot access memory at address 0x7fcf4e321010>`). That's 
quite interesting too: `memSize = 209153867776`. Don't know what 
exactly it is, but it's a pretty large number (~194 GB).


size_t memSize = pooltable.maxAddr - minAddr;
(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982
 )
That wouldn't make sense for a pool size...

The machine this is running on has 16G memory, at the time of the 
crash the software was using ~2.1G memory, with 130G virtual 
memory due to LMDB memory mapping (I wonder what happens if I 
reduce that...)




Re: Issues with debugging GC-related crashes #2

2018-04-18 Thread Johannes Pfau via Digitalmars-d
On Wed, 18 Apr 2018 22:24:13 +, Matthias Klumpp wrote:

> On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
>> On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
>>> Actually this sounds very familiar:
>>> https://github.com/D-Programming-GDC/GDC/pull/236
>>
>> Interesting, but I don't think it applies here. Both start and end
>> addresses are 16-byte aligned, and both cannot be accessed according
>> to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory
>> at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access
>> memory at address 0x7fcf4e321010>`). That's quite interesting too:
>> `memSize = 209153867776`. Don't know what exactly it is, but it's a
>> pretty large number (~194 GB).
> 
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/
conservative/gc.d#L1982
> )
> That wouldn't make sense for a pool size...
> 
> The machine this is running on has 16G memory, at the time of the crash
> the software was using ~2.1G memory, with 130G virtual memory due to
> LMDB memory mapping (I wonder what happens if I reduce that...)

I see. Then I'd try to debug where the range originally comes from, try 
adding breakpoints in _d_dso_registry, registerGCRanges and similar 
functions here: https://github.com/dlang/druntime/blob/master/src/rt/
sections_elf_shared.d#L421

Generally if you produced a crash in gdb it should be reproducible if you 
restart the program in gdb. So once you have a crash, you should be able 
to restart the program and look at the _dso_registry and see the same 
addresses somewhere. If you then think you see memory corruption 
somewhere you could also use read or write watchpoints.

But just to be sure: you're not adding any GC ranges manually, right?
You could also try to compare the GC range to the address range layout 
in /proc/$PID/maps .
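
A small sketch of that comparison, assuming it is done from inside the 
process itself (so /proc/self/maps instead of /proc/$PID/maps):

```
import std.stdio : File, writeln;

// Dump the current process's mappings so they can be compared by eye
// against the [minAddr, maxAddr) range the GC reports when it crashes.
void dumpMappings()
{
    foreach (line; File("/proc/self/maps").byLine())
        writeln(line);
}
```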



-- 
Johannes


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Johannes Pfau via Digitalmars-d
On Thu, 19 Apr 2018 06:33:27 +, Johannes Pfau wrote:

> 
> Generally if you produced a crash in gdb it should be reproducible if
> you restart the program in gdb. So once you have a crash, you should be
> able to restart the program and look at the _dso_registry and see the
> same addresses somewhere. If you then think you see memory corruption
> somewhere you could also use read or write watchpoints.
> 
> But just to be sure: you're not adding any GC ranges manually, right?
> You could also try to compare the GC range to the address range layout
> in /proc/$PID/maps .

Of course, if this is a GC pool / heap range, adding breakpoints in the 
sections code won't be useful. Then I'd try to add a write watchpoint on 
pooltable.minAddr / maxAddr, restart the program in gdb and see where / 
why the values are set.

-- 
Johannes


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Johannes Pfau via Digitalmars-d
On Thu, 19 Apr 2018 07:04:14 +, Johannes Pfau wrote:

> On Thu, 19 Apr 2018 06:33:27 +, Johannes Pfau wrote:
> 
> 
>> Generally if you produced a crash in gdb it should be reproducible if
>> you restart the program in gdb. So once you have a crash, you should be
>> able to restart the program and look at the _dso_registry and see the
>> same addresses somewhere. If you then think you see memory corruption
>> somewhere you could also use read or write watchpoints.
>> 
>> But just to be sure: you're not adding any GC ranges manually, right?
>> You could also try to compare the GC range to the address range layout
>> in /proc/$PID/maps .
> 
> Of course, if this is a GC pool / heap range adding breakpoints in the
> sections code won't be useful. Then I'd try to add a write watchpoint on
> pooltable.minAddr / maxAddr, restart the programm in gdb and see where /
> why the values are set.

Having a quick look at https://github.com/ldc-developers/druntime/blob/
ldc/src/gc/pooltable.d: The GC seems to allocate multiple pools using 
malloc, but only keeps track of one minimum/maximum address for all 
pools. Now if there's some other memory area malloced in between these 
pools, you will end up with a huge memory block. When this will get 
scanned and if any of the memory in-between the GC pools is protected, 
you might see the GC crash. However, I don't really know anything about 
the GC code, so some GC expert would have to confirm this.
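
A tiny illustration of the arithmetic described above; the pool 
addresses and sizes are made up, not taken from the actual crash:

```
void main()
{
    import std.stdio : writefln;

    // Two hypothetical GC pools with a huge foreign mapping (e.g. the
    // 130G LMDB mmap) sitting between them in the address space.
    size_t poolLo     = 0x7fcf_5951_b000;
    size_t poolLoSize = 16 * 1024 * 1024;
    size_t poolHi     = 0x7ffc_0000_0000;
    size_t poolHiSize = 16 * 1024 * 1024;

    size_t minAddr = poolLo;
    size_t maxAddr = poolHi + poolHiSize;

    // A single [minAddr, maxAddr) span covers everything in between,
    // including memory the GC never allocated.
    writefln("memSize = %.1f GiB", (maxAddr - minAddr) / (1024.0 ^^ 3));
}
```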



-- 
Johannes


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Kagamin via Digitalmars-d
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp 
wrote:

On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
You can call GC.collect at some points in the program to see 
if they can trigger the crash


I already do that, and indeed I get crashes. I could throw 
those calls into every function though, or make a minimal pool 
size, maybe that yields something...


Can you narrow down the earliest point at which it starts to 
crash? That might identify if something in particular causes the 
crash.


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Kagamin via Digitalmars-d
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp 
wrote:
I get compile errors for the INVARIANT option, and I don't 
actually know how to deal with those properly:

```
src/gc/impl/conservative/gc.d(1396): Error: shared mutable 
method core.internal.spinlock.SpinLock.lock is not callable 
using a shared const object
src/gc/impl/conservative/gc.d(1396):Consider adding 
const or inout to core.internal.spinlock.SpinLock.lock
src/gc/impl/conservative/gc.d(1403): Error: shared mutable 
method core.internal.spinlock.SpinLock.unlock is not callable 
using a shared const object
src/gc/impl/conservative/gc.d(1403):Consider adding 
const or inout to core.internal.spinlock.SpinLock.unlock

```

Commenting out the locks (eww!!) yields no change in behavior 
though.


As a workaround:
(cast(shared)rangesLock).lock();


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Kagamin via Digitalmars-d
On Wednesday, 18 April 2018 at 22:24:13 UTC, Matthias Klumpp 
wrote:

size_t memSize = pooltable.maxAddr - minAddr;
(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982
 )
That wouldn't make sense for a pool size...

The machine this is running on has 16G memory, at the time of 
the crash the software was using ~2.1G memory, with 130G 
virtual memory due to LMDB memory mapping (I wonder what 
happens if I reduce that...)


If big LMDB mapping causes a problem, try a test like this:
---
import core.memory;

void testLMDB()
{
    //how do you use it?
}

void test1()
{
    void*[][] a;
    foreach(i;0..10)a~=new void*[1];
    void*[][] b;
    foreach(i;0..10)b~=new void*[1];
    b=null;
    GC.collect();

    testLMDB();

    GC.collect();
    foreach(i;0..10)a~=new void*[1];
    foreach(i;0..10)b~=new void*[1];
    b=null;
    GC.collect();
}
---


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Kagamin via Digitalmars-d

foreach(i;0..1)
10 is too much


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Matthias Klumpp via Digitalmars-d

On Thursday, 19 April 2018 at 08:30:45 UTC, Kagamin wrote:
On Wednesday, 18 April 2018 at 22:24:13 UTC, Matthias Klumpp 
wrote:

size_t memSize = pooltable.maxAddr - minAddr;
(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982
 )
That wouldn't make sense for a pool size...

The machine this is running on has 16G memory, at the time of 
the crash the software was using ~2.1G memory, with 130G 
virtual memory due to LMDB memory mapping (I wonder what 
happens if I reduce that...)


If big LMDB mapping causes a problem, try a test like this:
---
import core.memory;
void testLMDB()
{
//how do you use it?
}
void test1()
{
void*[][] a;
foreach(i;0..10)a~=new void*[1];
void*[][] b;
foreach(i;0..10)b~=new void*[1];
b=null;
GC.collect();

testLMDB();

GC.collect();
foreach(i;0..10)a~=new void*[1];
foreach(i;0..10)b~=new void*[1];
b=null;
GC.collect();
}
---


I tried something similar, with no effect.
Something that maybe is relevant though: I occasionally get the 
following SIGABRT crash in the tool on machines which have the 
SIGSEGV crash:

```
Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.

(gdb) bt
#0  0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54

#1  0x7504202a in __GI_abort () at abort.c:89
#2  0x00780ae0 in core.thread.Fiber.allocStack(ulong, 
ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at 
src/core/thread.d:4606
#3  0x007807fc in 
_D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf 
(this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)

at src/core/thread.d:4134
#4  0x006f9b31 in 
_D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
at 
/home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
#5  0x006e9467 in 
_D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
#6  0x006ea75a in 
_D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)

at ../src/asgen/handlers/iconhandler.d:392
#7  0x0082fdfa in core.thread.Fiber.run() 
(this=0x7fde07528580) at src/core/thread.d:4436
#8  0x0082fd5d in fiber_entryPoint () at 
src/core/thread.d:3665

#9  0x in  ()
```

This is in the constructor of a std.concurrency.Generator:
auto gen = new Generator!string (...)

I am not sure what to make of this yet though... This goes into 
DRuntime territory that I actually hoped to never have to deal 
with as much as I apparently need to now.
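
For context, the pattern that hits this code path looks roughly like 
the sketch below (a simplified stand-in, not the actual iconhandler.d 
code); every such Generator creates a new Fiber and therefore a fresh 
mmap()'d stack with a guard page:

```
import std.concurrency : Generator, yield;

void main()
{
    // Hypothetical stand-in for matchingIconFilenames(): each candidate
    // name is yielded lazily from a fiber owned by the Generator.
    auto gen = new Generator!string({
        foreach (name; ["48x48/foo.png", "64x64/foo.png"])
            yield(name);
    });

    foreach (iconName; gen)
    {
        // consume the lazily generated names
    }
}
```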




Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread kinke via Digitalmars-d

On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp wrote:
Something that maybe is relevant though: I occasionally get the 
following SIGABRT crash in the tool on machines which have the 
SIGSEGV crash:

```
Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.

(gdb) bt
#0  0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54

#1  0x7504202a in __GI_abort () at abort.c:89
#2  0x00780ae0 in core.thread.Fiber.allocStack(ulong, 
ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at 
src/core/thread.d:4606
#3  0x007807fc in 
_D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf 
(this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)

at src/core/thread.d:4134
#4  0x006f9b31 in 
_D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
at 
/home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
#5  0x006e9467 in 
_D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
#6  0x006ea75a in 
_D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)

at ../src/asgen/handlers/iconhandler.d:392
#7  0x0082fdfa in core.thread.Fiber.run() 
(this=0x7fde07528580) at src/core/thread.d:4436
#8  0x0082fd5d in fiber_entryPoint () at 
src/core/thread.d:3665

#9  0x in  ()
```


You probably already figured that the new Fiber seems to be 
allocating its 16 KB stack, with an additional 4 KB guard page at 
its bottom, via a 20 KB mmap() call. The abort seems to be 
triggered by mprotect() returning -1, i.e., a failure to disallow 
all access to the guard page; so checking `errno` should help.
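
A rough sketch of that check, assuming one patches the guard-page 
setup in core.thread.Fiber.allocStack; the function and parameter 
names below are illustrative, not the exact druntime ones:

```
import core.stdc.errno : errno;
import core.stdc.stdio : fprintf, stderr;
import core.sys.posix.sys.mman : mprotect, PROT_NONE;

void protectGuardPage(void* guardStart, size_t guardPageSize)
{
    if (mprotect(guardStart, guardPageSize, PROT_NONE) == -1)
    {
        // Print the reason before the existing abort() fires;
        // e.g. errno == ENOMEM (12) as observed later in this thread.
        fprintf(stderr, "mprotect failed, errno = %d\n", errno);
    }
}
```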


Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Matthias Klumpp via Digitalmars-d

On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:
On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp 
wrote:
Something that maybe is relevant though: I occasionally get 
the following SIGABRT crash in the tool on machines which have 
the SIGSEGV crash:

```
Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.

(gdb) bt
#0  0x75040428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54

#1  0x7504202a in __GI_abort () at abort.c:89
#2  0x00780ae0 in core.thread.Fiber.allocStack(ulong, 
ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at 
src/core/thread.d:4606
#3  0x007807fc in 
_D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf 
(this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)

at src/core/thread.d:4134
#4  0x006f9b31 in 
_D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
at 
/home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
#5  0x006e9467 in 
_D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
#6  0x006ea75a in 
_D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)

at ../src/asgen/handlers/iconhandler.d:392
#7  0x0082fdfa in core.thread.Fiber.run() 
(this=0x7fde07528580) at src/core/thread.d:4436
#8  0x0082fd5d in fiber_entryPoint () at 
src/core/thread.d:3665

#9  0x in  ()
```


You probably already figured that the new Fiber seems to be 
allocating its 16KB-stack, with an additional 4 KB guard page 
at its bottom, via a 20 KB mmap() call. The abort seems to be 
triggered by mprotect() returning -1, i.e., a failure to 
disallow all access to the the guard page; so checking `errno` 
should help.


Yup, I did that already; it just took a really long time to run 
because, when I made the change to print errno, I also enabled 
detailed GC profiling (via the PRINTF* debug options). Enabling 
the INVARIANT option for the GC is completely broken, by the way; 
I forced the compilation to work by casting to shared, with the 
result that the GC locks up forever at the start of the program.


Anyway, I think for a change I actually produced some useful 
information via the GC debug options:

Given the following crash:
```
#0  0x007f1d94 in 
_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., 
ptop=0x7fdfce7fc010, pbot=0x7fdfcdbfc010)

at src/gc/impl/conservative/gc.d:1990
p1 = 0x7fdfcdbfc010
p2 = 0x7fdfce7fc010
stackPos = 0
[...]
```
The scanned range seemed fairly odd to me, so I searched for it 
in the (very verbose!) GC debug output, which yielded:

```
235.25: 0xc4f090.Gcx::addRange(0x8264230, 0x8264270)
235.244460: 0xc4f090.Gcx::addRange(0x7fdfcdbfc010, 0x7fdfce7fc010)
235.253861: 0xc4f090.Gcx::addRange(0x8264300, 0x8264340)
235.253873: 0xc4f090.Gcx::addRange(0x8264390, 0x82643d0)
```
So, something is calling addRange explicitly there, causing the 
GC to scan a range that it shouldn't scan. My own code doesn't 
add ranges to the GC, and the generated code from girtod/GtkD 
very much looks fine to me, so I am currently looking into EMSI 
containers[1] as the possible culprit.
That library being the issue would also make perfect sense, 
because this problem only started to appear with such frequency 
after containers were added (there was a GC-related crash before, 
but that might have been a different one).


So, I will look into that addRange call next.

[1]: https://github.com/dlang-community/containers



Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Matthias Klumpp via Digitalmars-d

On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:

[...]
Jup, I did that already, it just took a really long time to run 
because when I made the change to print errno [...]


I forgot to mention that the error code was 12 (ENOMEM), so this 
is actually likely not a relevant issue after all.




Re: Issues with debugging GC-related crashes #2

2018-04-19 Thread Dmitry Olshansky via Digitalmars-d

On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:

On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:

[...]


Jup, I did that already, it just took a really long time to run 
because when I made the change to print errno I also enabled 
detailed GC profiling (via the PRINTF* debug options). Enabling 
the INVARIANT option for the GC is completely broken by the 
way, I enforced the compile to work by casting to shared, with 
the result of the GC locking up forever at the start of the 
program.


[...]


I think the order of operations is wrong; here is an example from 
containers:

    allocator.dispose(buckets);
    static if (useGC)
        GC.removeRange(buckets.ptr);

If the GC triggers between dispose and removeRange, it will likely 
segfault.
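
For comparison, a minimal self-contained sketch of the safe ordering, 
using plain malloc/free in place of the containers allocator; the point 
is that the range must be removed while the memory is still valid:

```
import core.memory : GC;
import core.stdc.stdlib : malloc, free;

void main()
{
    enum n = 64;
    auto buckets = (cast(void**) malloc(n * (void*).sizeof))[0 .. n];

    // Let the GC scan this non-GC memory because it will hold pointers
    // to GC-managed data.
    GC.addRange(buckets.ptr, buckets.length * (void*).sizeof);

    // ... use buckets ...

    GC.removeRange(buckets.ptr); // first: the GC forgets the range
    free(buckets.ptr);           // then: the memory may go away
}
```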



[1]: https://github.com/dlang-community/containers


Re: Issues with debugging GC-related crashes #2

2018-04-20 Thread Kagamin via Digitalmars-d

On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
#2  0x751341c8 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (e=...) at treap.d:47
dg = {context = 0x7fffc140 "\320\065\206", funcptr 
= 0x75121d10 
<_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi>}
#3  0x75134238 in 
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (node=0x7568700, dg=...) at treap.d:221


Indeed, this is iteration over the Treap!Range used to store ranges 
added with the addRange method.

https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L2182


Re: Issues with debugging GC-related crashes #2

2018-04-20 Thread Matthias Klumpp via Digitalmars-d

On Friday, 20 April 2018 at 05:32:32 UTC, Dmitry Olshansky wrote:

On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:

On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:

[...]


Jup, I did that already, it just took a really long time to 
run because when I made the change to print errno I also 
enabled detailed GC profiling (via the PRINTF* debug options). 
Enabling the INVARIANT option for the GC is completely broken 
by the way, I enforced the compile to work by casting to 
shared, with the result of the GC locking up forever at the 
start of the program.


[...]


I think the order of operations is wrong, here is an example 
from containers:


allocator.dispose(buckets);
static if (useGC)
GC.removeRange(buckets.ptr);

If GC triggers between dispose and removeRange, it will likely 
segfault.


Indeed! It's also the only place where this is shuffled around; 
all other parts of the containers library do this properly.
The thing I wonder about, though, is that the crash usually 
appeared in an explicit GC.collect() call when the application 
was not running multiple threads. At that point, the GC, as far 
as I know, couldn't have triggered after the buckets were 
disposed of and before the ranges were removed. But maybe I am 
wrong with that assumption.

This crash would be explained perfectly by that bug.



Re: Issues with debugging GC-related crashes #2

2018-04-20 Thread Matthias Klumpp via Digitalmars-d

On Friday, 20 April 2018 at 18:30:30 UTC, Matthias Klumpp wrote:
On Friday, 20 April 2018 at 05:32:32 UTC, Dmitry Olshansky 
wrote:
On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp 
wrote:

On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:

[...]

[...]


I think the order of operations is wrong, here is an example 
from containers:


allocator.dispose(buckets);
static if (useGC)
GC.removeRange(buckets.ptr);

If GC triggers between dispose and removeRange, it will likely 
segfault.


Indeed! It's also the only place where this is shuffled around, 
all other parts of the containers library do this properly.
The thing I wonder about is though, that the crash usually 
appeared in an explicit GC.collect() call when the application 
was not running multiple threads. At that point, the GC - as 
far as I know - couldn't have triggered after the buckets were 
disposed of and the ranges were removed. But maybe I am wrong 
with that assumption.

This crash would be explained perfectly by that bug.


Turns out that was indeed the case! I created a small testcase 
which managed to very reliably reproduce the issue on all 
machines that I tested it on. After reordering the 
dispose/removeRange, the crashes went away completely.
I submitted a pull request to the containers library to fix this 
issue: https://github.com/dlang-community/containers/pull/107


I will also try to get the patch into the components in Debian 
and Ubuntu, so we can maybe have a chance of updating the 
software center metadata for Ubuntu before 18.04 LTS releases 
next week.
Since asgen uses HashMaps for pretty much everything, and most of 
the time with GC-managed elements, this should improve the 
stability of the application greatly.


Thanks a lot for the help in debugging this, I learned a lot 
about DRuntime internals in the process. Also, it is no 
exaggeration to say that the appstream-generator project would 
not be written in D (there was a Rust prototype once...) and I 
would probably not be using D as much (or at all) without the 
helpful community around it.

Thank you :-)



Re: Issues with debugging GC-related crashes #2

2018-04-23 Thread Dmitry Olshansky via Digitalmars-d

On Friday, 20 April 2018 at 19:32:24 UTC, Matthias Klumpp wrote:

On Friday, 20 April 2018 at 18:30:30 UTC, Matthias Klumpp wrote:

[...]


Turns out that was indeed the case! I created a small testcase 
which managed to very reliably reproduce the issue on all 
machines that I tested it on. After reordering the 
dispose/removeRange, the crashes went away completely.
I submitted a pull request to the containers library to fix 
this issue: 
https://github.com/dlang-community/containers/pull/107


Partly dumb luck on my part, since I opened the hashmap file first 
just to see if there are some mistakes in GC.add/removeRange, and 
it was a hit. I just assumed it was wrong everywhere else ;)


Glad it was that simple. Thanks for fixing it for good.



Thanks a lot for the help in debugging this, I learned a lot 
about DRuntime internals in the process. Also, it is no 
exaggeration to say that the appstream-generator project would 
not be written in D (there was a Rust prototype once...) and I 
would probably not be using D as much (or at all) without the 
helpful community around it.

Thank you :-)