[valgrind] [Bug 377006] New: valgrind/memcheck segfaults under certain kernel versions (amd64) but not others.

zephyrus00jp Tue, 28 Feb 2017 00:22:49 -0800

https://bugs.kde.org/show_bug.cgi?id=377006


            Bug ID: 377006
           Summary: valgrind/memcheck segfaults under certain kernel
                    versions (amd64) but not others.
           Product: valgrind
           Version: unspecified
          Platform: Other
                OS: Linux
            Status: UNCONFIRMED
          Severity: normal
          Priority: NOR
         Component: memcheck
          Assignee: jsew...@acm.org
          Reporter: ishik...@yk.rim.or.jp
  Target Milestone: ---

Created attachment 104262
  --> https://bugs.kde.org/attachment.cgi?id=104262&action=edit
debug log under 4.7.0.1 (valgrind crashes)

System Debian GNU/Linux: amd64

I am trying to run mozilla's thunderbird which I compile locally under
valgrind.
It works rather well.
However, over the past couple of years, I found that valgrind + thunderbird
does not work very well under certain Debian-supplied (and my locally created)
linux kernel versions.

I think there is a reason for this.
But so far I have no clue as to which kernel configuration options affect
this.

I described the problem in the following mailing-list post; you can follow the
thread from there.

https://sourceforge.net/p/valgrind/mailman/message/35667483/

I have managed to glean 5 runs under 
- 4.7.0.1 (two core VM image). Failure cases.
- 4.9.x (4.9.6, manually created. four core VM image). Failure cases.
and a couple of runs for
- 3.19.5 (four core VM image). Success cases for comparison.

After five runs, I thought I would seek expert opinion on which direction I
should take in pursuing the issue.

I am going to upload a few attachments now.
One is for 4.7.0.1 test runs.
The second one is for 4.9.x test runs.
The last one is for 3.19.5 test runs for comparison.

Here is a quick observation (my wild guess based on what I observed).
I think people in the know can glean more information to refute my guess or
suggest how I can pursue the debugging still further.

You may want to read the post to the above URL first.
Here is the URL again:
https://sourceforge.net/p/valgrind/mailman/message/35667483/

--- comment as of now.
Please remember that valgrind + thunderbird runs just fine under
more or less vanilla 3.19.5 kernel.
But valgrind failed to run under certain Debian-supplied kernels and the
4.9.x series kernels I built myself.
Valgrind segfaults.

I obtained some logs from failure cases under 4.7.0.1 and 4.9.x,
and from successful cases under 3.19.5 for comparison.

There could be an issue with an mmap layout change and
with signal handler setup.

Debug story:

I set breakpoints in mozilla thunderbird
to figure out if there is any particular behavior of mozilla
thunderbird that triggers the valgrind segmentation fault on certain
linux kernel versions.

For successful cases under 3.19.5 and failed cases under 4.7.0.1,
I could set a breakpoint on fork() [which is used to call an external
program to check for graphics adaptor capability] and
at that point I could dump /proc/$pid/maps in addition to the dump
at the breakpoint placed on main().
Under the 4.9.x series kernel, valgrind segfaults way before this |fork|
and so I obtained /proc/$pid/maps only at the breakpoint at
main().

The different dumps of maps revealed a change near the stack of
valgrind under different kernel versions.

I show excerpts from near the end of each maps listing.

From 4.9.x series kernel: failure case
  ...
806203000-806334000 rwxp 00000000 00:00 0
806af9000-806ce2000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0                          [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) cont


From 4.7.0.1: failure case
     ...
805b29000-805c29000 rwxp 00000000 00:00 0
8063ee000-8067d7000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffcbe4d1000-7ffcbe4f3000 rw-p 00000000 00:00 0                          [stack]
7ffcbe5ee000-7ffcbe5f0000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

From 3.19.5: success case

Note that the stack of the first process that runs shows up in an earlier
mapping, [stack:9283].
...
802001000-802cdc000 rwxp 00000000 00:00 0 
802d8c000-802ea8000 rwxp 00000000 00:00 0 
802ea8000-802eaa000 ---p 00000000 00:00 0 
802eaa000-802faa000 rwxp 00000000 00:00 0                                [stack:9283]
802faa000-802fac000 ---p 00000000 00:00 0
802fac000-802fad000 rw-s 00000000 08:18 482                              /tmp/vgdb-pipe-shared-mem-vgdb-9283-by-ishikawa-on-???
802fad000-802fcd000 rwxp 00000000 00:00 0 
803081000-8034c4000 rwxp 00000000 00:00 0 
80356a000-805649000 rwxp 00000000 00:00 0 
805749000-805849000 rwxp 00000000 00:00 0 
805c0e000-805d0e000 rwxp 00000000 00:00 0 
805e0e000-806019000 rwxp 00000000 00:00 0 
806203000-806334000 rwxp 00000000 00:00 0 
806af9000-806ce2000 rwxp 00000000 00:00 0 
ffeffe000-fff001000 rw-p 00000000 00:00 0 
7fffe6088000-7fffe60aa000 rw-p 00000000 00:00 0 
7fffe616b000-7fffe616d000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) cont


The above shows the maps layout change I noticed.
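To compare such dumps mechanically rather than by eye, here is a minimal sketch in Python. It is only an illustration: the names Mapping, parse_maps, stack_labelled and containing are made up for this example, and the sample text is three lines copied from the 4.9.x excerpt with the wrapped [stack] label rejoined. It lists which mappings the kernel labelled as a stack and checks whether a given address is covered by any mapping in the dump.

```python
import re
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    name: str


# One line of /proc/<pid>/maps: start-end perms offset dev inode [name]
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$"
)


def parse_maps(text: str) -> List[Mapping]:
    """Parse maps text, skipping lines that do not look like mappings."""
    mappings = []
    for line in text.splitlines():
        m = MAPS_LINE.match(line.strip())
        if m:
            start, end, perms, name = m.groups()
            mappings.append(
                Mapping(int(start, 16), int(end, 16), perms, name.strip())
            )
    return mappings


def stack_labelled(mappings: List[Mapping]) -> List[Mapping]:
    """Mappings whose name starts with "[stack" ([stack] or [stack:<id>])."""
    return [m for m in mappings if m.name.startswith("[stack")]


def containing(mappings: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the mapping that covers addr, or None if addr is unmapped."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m
    return None


# Three lines from the 4.9.x excerpt (wrapped name rejoined by hand).
SAMPLE_4_9 = """\
806af9000-806ce2000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0                          [stack]
"""

if __name__ == "__main__":
    maps = parse_maps(SAMPLE_4_9)
    for m in stack_labelled(maps):
        print(f"stack-labelled: {m.start:#x}-{m.end:#x} {m.perms}")
    print("0xffeffa24c mapped:", containing(maps, 0xFFEFFA24C) is not None)
```

Feeding it the full dumps from the attachments (one per kernel) would show at a glance which mapping carries the stack label in each case. Note that the dumps were taken at the main() breakpoint, so an address that is unmapped there may well have been mapped later.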

Timing/race issue: Tough.

I have learned while debugging this way under the linux 4.7.0.1 kernel
that the issue is timing-dependent and is racy (!).  The problem did
not occur when I stepped in gdb using "s" and "fun": thunderbird
ran fine under valgrind then. Ugh...  (valgrind + thunderbird
segfaults if I let it run at full speed without attaching gdb to the
thunderbird being run under valgrind.)

So there is very little chance of narrowing the search down to a
particular behavior of thunderbird.  I doubt that we could narrow it
to a single line of code of mozilla thunderbird.

My current thinking: signal handler and mmap layout change?

Rather, I think it is the handling of signals, such as the virtual timer
events that seem to be used often inside C-C TB, that messes
with valgrind under certain kernels.

The reasoning: while I see many signal handling dumps in the successful
execution of valgrind + mozilla thunderbird under the 3.19.5 kernel,
I don't see them in the failure cases (segfault) under the 4.7.0.1 and 4.9
series kernels.
I suspect that the lack of signal handling dumps in the failed cases
suggests that there is a timing window where the signal handler is not
quite set up (inside valgrind?) when a memory error is detected.
(I am still trying to grapple with the situation where
valgrind fails to properly catch the SIGSEGV condition.
Is the signal handler not properly set up in those cases?)

On the one hand, valgrind seems to be capable of enlarging the stack in
response to a memory violation. On the other hand, the final SIGSEGV that
killed valgrind is not handled.

[I took the following snippets out from 4-9-x series testing's log2.]

Successful Stack fault handling
1st case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffd6a8 tid=1 ESP=0xffeffd6a0 seg=0xffe7b0000-0xffeffdfff
--3174--        -> extended stack base to 0xffeffd000

2nd case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffc678 tid=1 ESP=0xffeffc670 seg=0xffe7b0000-0xffeffcfff
--3174--        -> extended stack base to 0xffeffc000
--3174-- REDIR: 0x5b79750 (libc.so.6:bcmp) redirected to 0x4a26742 (_vgnU_ifunc_wrapper)

Final error in the 4.9.x series:

gettid()                                = 3174
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid()                                = 3174
write(1027, "--3174-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++


Wild guess: Is it possible that, due to mmap layout change(s),
the stack fault is reported as

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---

and valgrind did not notice that it was running out of stack or heap
or whatever, which valgrind could handle either by extending it or by printing
a warning and stopping, etc.? Maybe it did not recognize the SIGSEGV
as such and simply let the error through without catching it? (See
random observation [2].)

I am saying this since the reported si_addr=0xffeffa24c is rather near the
stack addresses reported in the stack faults above.
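To put a number on "rather near", here is a small Python sketch of the arithmetic. It rests on assumptions: 4 KiB pages, that the stack base at the time of each handled fault was the end of the logged seg= range plus one, and that no further stack extension happened between the second handled fault and the fatal one (the excerpts do not show whether one did). The names page_base, pages_below_base and EVENTS are made up for this illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual amd64 configuration


def page_base(addr: int) -> int:
    """Round addr down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def pages_below_base(stack_base: int, fault_addr: int) -> int:
    """How many pages below stack_base the faulting page starts (1 = adjacent)."""
    return (stack_base - page_base(fault_addr)) // PAGE_SIZE


# (label, stack base when the fault hit, faulting address)
# Bases for the handled faults are taken as "end of logged seg + 1";
# the base for the fatal fault is the last "extended stack base" value,
# which is an assumption, not something the log states.
EVENTS = [
    ("1st extension", 0xFFEFFDFFF + 1, 0xFFEFFD6A8),
    ("2nd extension", 0xFFEFFCFFF + 1, 0xFFEFFC678),
    ("fatal SIGSEGV", 0xFFEFFC000, 0xFFEFFA24C),
]

if __name__ == "__main__":
    for label, base, fault in EVENTS:
        print(
            f"{label}: fault {fault:#x} is {base - fault} bytes, "
            f"{pages_below_base(base, fault)} page(s) below base {base:#x}"
        )
```

Under these assumptions the two handled faults land in the page immediately below the stack base, while the fatal one lands two pages below it. This is only arithmetic on the logged values, not a diagnosis, but it may be a useful data point for people who know how valgrind decides whether a fault is a stack extension.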

Anyway, I would appreciate any feedback.

Random observations:

[1]. The error address in SIGSEGV seems to be repeatable under 4.9.x: this
may be due to the fact that the error occurs even before the control
reaches the breakpoint at |fork| call. It occurs before any
interaction.  OTOH, under 4.7.0.1, the error seems to occur around
|fork| call where I set a breakpoint. There *IS* a timing issue
involved. And the address reported is slightly different across the
log files. (It relates to the address on the stack then?)

[2]. Valgrind uses SIGSEGV for handling stack overflow.

Running valgrind under gdb gave me some clues on this.

I ran valgrind under gdb once under the 4.9.x series kernel.
(See log5-main.txt log file.)

gdb reports SIGSEGV three times when valgrind crashes.
Actually, the first two SIGSEGVs are the ones that valgrind uses
internally for handling stack faults. valgrind adjusts the stack
accordingly and chugs along happily.
Previously, I did not notice this, and when I saw SIGSEGV reported by
GDB, I gave up.
But this time, I wised up and "c"ontinued, and voila!
The first two times, valgrind continued.

Only at the third SIGSEGV did valgrind crash.

[3] Testing after a reboot.

I cheated a bit. I did not reboot the 4.9.x kernel every time I tested
the valgrind+thunderbird combination.
Under linux, a process should not be disturbed in such a
manner that debugging a user program becomes unrepeatable just because
a few independent processes ran before it.

Well, that is the principle.

I know that there have been cases where the linux kernel, especially its
fork()-related handling, harbored a few dormant bugs for a long time. That
is why I had a short-lived excitement when I noticed that the segfault
occurred near the fork call under 4.7.0.1. I thought I had hit upon a linux
kernel bug or a similar bug in valgrind. [Because of this observation, I
definitely think signal handler setup/release has something to do with it,
too.]

The fact is that there is no way for me to run an
X-based application under Debian WITHOUT RUNNING A COUPLE OF THOUSAND
PROCESSES before I get a chance to log in from the login manager and
get to the X desktop. I found that the PID started over 2000 when I tried
rebooting and began testing, and I am not even sure whether it is a
wrapped-around PID or not (modulo 2^16).
So "running a test afresh" is almost impossible under today's linux if one
wants to test an X-based application.
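On the wrap-around question: on Linux the point at which PIDs wrap is the value of /proc/sys/kernel/pid_max, which is not necessarily 2^16. A minimal Python sketch for checking it on the test machine follows; the function names and the example PID of 2000 are made up for this illustration.

```python
from pathlib import Path


def parse_pid_max(text: str) -> int:
    """Parse the contents of /proc/sys/kernel/pid_max."""
    return int(text.strip())


def read_pid_max(path: str = "/proc/sys/kernel/pid_max") -> int:
    """Read the PID wrap-around limit of the running kernel (Linux only)."""
    return parse_pid_max(Path(path).read_text())


def describe(first_pid: int, pid_max: int) -> str:
    """Summarise how far into the PID space the first observed PID lies."""
    share = first_pid / pid_max
    return f"first PID seen {first_pid}, pid_max {pid_max} ({share:.1%} of the PID space)"


if __name__ == "__main__":
    # 2000 is a stand-in for "the PID started over 2000" after a reboot.
    print(describe(2000, read_pid_max()))
```

A single PID cannot prove that no wrap-around happened, but if pid_max is far above the first PID seen after boot, a wrap would require many more process creations than the couple of thousand observed.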

TIA
