[valgrind] [Bug 377006] valgrind/memcheck segfaults under certain kernel versions (amd64) but not others.

zephyrus00jp Tue, 28 Feb 2017 02:09:48 -0800

https://bugs.kde.org/show_bug.cgi?id=377006


--- Comment #3 from zephyrus00jp <ishik...@yk.rim.or.jp> ---
Description of log directory archive.

[1] Description of log files under 4.7.0.1

There are a group of files. (N being 2,3,4,5)
- logN-main.txt,
- logN-gdb-dddd.txt,
- logN-gdb-dddd'.txt
- logN-stra.out.gz
- logN-crash.txt

I ran ~/bin/run-valgrind-test.sh which contains the following command

strace -f -o /tmp/stra.out valgrind --verbose --trace-signals=yes
--show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes
~/objdir-tb3/dist/bin/thunderbird-bin

strace is used to capture what system calls valgrind was issueing.

I used -vgdb=yes, and --vgdb-error=0 to attach to the thunderbird binary that
is monitied by thunderbird binary.
It turns out that thunderbird binary runs an external program (by fork()) about
the time it segfaulted and so I added --tracechildren=yes.
(However, I found that the crash occurs on the original process).

Anyway, logN-main.txt recorded the log output from the above command line.
One of the logN-dddd.txt and logN-dddd'.txt contains the first instance of gdb
that is attached to the thunderbird process under valgrind/memcheck.
I set a few breakpoints at main and at fork, and let it run.
I dumped the mmap layout by printing /proc/PID/maps (PID being the process id
of the thunderbird process).

logN-stra.out.gz contains the recoding from strace.
logN-crash.txt contains the last part of the strace output (logN-stra.out.gz)
and a few comments regarding the particular test run.

Under the particular version of 4.7.0.1 (which is Debian-supplied), valgrind
invokes another process by fork(), and when it happens, valgrind suggests that
I attach another gdb to the sibling process. (--vgdb=yes --vgdb=error), and I
do so and run the sibling process under gdb.
However, the sibling does not seem to be the culprit.
But the original parent process or rather the original valgrind that runs the
original thunderbird binary seemed to crash. (But note the comment about
gdb showing three SIGSEGV and the first two (2) are for stack fault and handled
by valgrind, and only the third one being the fatal SIGSEGV later in this post.
(In the discussion of 4-9-x-dir/log5-main.txt )

Anyway, you can see that valgrind experiences SIGSEGV eventually.

I did not include the case where I stepped through the parent
valgrind+thunderbird execution by stepping using "c" for continue and "fun" for
function level trace. In one case, the crash did not happen and thunderbird ran
to completion without experiencing the crash. So the issue is timing-related
and race condition exists somewhere in the code (on the valgrind side as well.)

[2] log under 4.9.x kernel test runs

There are files (N=1,2,3,4)
 - logN-main.txt
 - logN-error.txt
 - logN-stra.out.dddd.gz
 - logN-gdb-ddd.txt

logN-main.txt contains the log/print from running the following command line:
strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes
--show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes
/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

(Sorry, this is on a different machine with different file layout. this is on 4
core machine.)


logN-gdb-ddd.txt: logN-gdb-ddd.txt contains the log from the attaching gdb.
Since the SIGSEGV occurs way before the child process is forked(), there is
only one log from the instance of gdb that is attached to the thunderbird
binary monitored under valgrind/memcheck.

As in the case of kernel 4.7.0.1, I set breakpoint on main, fork and in a few
cases, dlopen, but under kernel 4.7.0.1, valgrind crashes way before
fork is called and thus I only dumped /proc/PID/maps at the breakpoint placed
at the entry of main().

logN-stra.out.dddd.gz: this contains the strace output.

logN-error.txt: 
(I am afraid I emptied log2-error.txt by mistake....)

This contains the last part of logN-stra.out.ddd.gz and a comment.
For example, log1-error.txt contains this.
--- begin quote
valgrind crashed even before I had chance to place a breakpoint, i.e.,
before --vgdb-error=0 took effect!?

gettid()                                = 2851
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid()                                = 2851
write(1027, "--2851-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++
--- end quote

So the situation is highly timing dependent / racy.

log5-main.txt: this is very interesting.
I ran valgrind under gdb.
And then issued to gdb
run --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes
--vgdb-error=0 --trace-children=yes
/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

With this log, you can see that valgrind DOES experience SIGSEGV, but
it uses SIGSEGV internally to catch stack over(?)/under(?) flow and adjust the
stack and continue.
(I came to recognize this after I saw the signal tracking log lines from
valgrind.)
Previously, as soon as I saw SIGSEGV, I gave up. (And the stack trace is that
of signal handler for SIGSEGV (???), and it is not printed as in the usual
case???. That prompted my giving up also.)
In this log file, log5-main.txt, 
for the first two SIGSEGV cases, I dumped stack and register (info register),
and then continued with "c" for cont.
To my surprise, valgrind happily continued.
Only at the third SIGSEGV, valgrind crashes.

This may give us some clue on the nature of the crash?

[3] 3.19.5 log from successful runs.

This was captured under 3.19.5 kernel.
This shows the success runs and what messages ought to be printed by successful
valgrind run.


There are files (N=1,2,3)
logN-main.txt: co
I ran the following command.
strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes
--show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes
/NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

Under the kernel 3.19.5 which I used, valgrind does not crash,
and thus the initial run of log1-main.txt
had to be interrupted midway because the strace output became too large: it had
to be removed from
the upload, too.

But for the two subsequent runs, log2-main.txt and log3-main.txt, I exited
thunderbird runs
after the initial profile selection menu comes up: I simply exited from there.
By that time, though, the sibling process to probe the graphics adaptor was
invoked and exited successfully.

logN-gdb-ddd.txt, logN-gdb-ddd'.txt:
I attach gdb to the initial thunderbird process and the subsequent sibling
process that is forked from the original process.

logN-stra.out.ddd.gz: the strace output for the initial valgrind process.
There are only two of them:
      3-19-5-dir/log2-stra.out.10447.gz
      3-19-5-dir/log3-stra.out.10973.gz

For the first test run, it became too large since I followed the
execution until initial windows came up. I had to drop it before uploading.

The immediate change you will notice is the slight change of mmap.

Somehow the initial process's stack is NOT at the end, but somewhere earlier.

The successful case shows many signal handling dump.
I wondery why we do not see them in the failed cases.
That may be related to the root cause(s) of the crash.


Hope this helps someone in figuring out the true nature of the bug
or in suggesting some tips for further debugging on my end.

TIA

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 377006] valgrind/memcheck segfaults under certain kernel versions (amd64) but not others.

Reply via email to