https://bugs.kde.org/show_bug.cgi?id=377006
Bug ID: 377006
Summary: valgrind/memcheck segfaults under certain kernel versions (amd64) but not others.
Product: valgrind
Version: unspecified
Platform: Other
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: NOR
Component: memcheck
Assignee: jsew...@acm.org
Reporter: ishik...@yk.rim.or.jp
Target Milestone: ---

Created attachment 104262
  --> https://bugs.kde.org/attachment.cgi?id=104262&action=edit
debug log under 4.7.0.1 (valgrind crashes)

System: Debian GNU/Linux, amd64

I am trying to run mozilla's thunderbird, which I compile locally, under valgrind. It works rather well. However, over the past couple of years, I have found that valgrind + thunderbird does not work very well under certain Debian-supplied (and my locally created) linux kernel versions.

I think there is a reason for this, but so far I have no clue as to which kernel configuration options affect it.

I described the buggy situation in the following mail post; you can see the thread there.
https://sourceforge.net/p/valgrind/mailman/message/35667483/

I have managed to glean 5 runs under
- 4.7.0.1 (two core VM image). Failure cases.
- 4.9.x (4.9.6, manually created, four core VM image). Failure cases.
and a couple of runs for
- 3.19.5 (four core VM image). Success cases for comparison.

After five runs, I thought I would seek experts' opinions on the direction in which I should spend time pursuing the issue.

I am going to upload a few attachments now. One is for the 4.7.0.1 test runs. The second one is for the 4.9.x test runs. The last one is for the 3.19.5 test runs, for comparison.

Here is a quick observation (my wild guess based on what I observed). I think people in the know can glean more information to refute my guess or suggest how I can pursue the debugging further.

You may want to read the post at the above URL first. Here is the URL again:
https://sourceforge.net/p/valgrind/mailman/message/35667483/

--- comment as of now.

Please remember that valgrind + thunderbird runs just fine under a more or less vanilla 3.19.5 kernel, but valgrind fails to run under certain Debian-supplied kernels and the 4.9.x series kernels I created. Valgrind segfaults.

I obtained some logs from failure cases under 4.7.0.1 and 4.9.x, and from successful cases under 3.19.5 for comparison. There could be an issue with an mmap layout change and with signal handler setup.

Debug story:

I set breakpoints in mozilla thunderbird to figure out whether any particular behavior of thunderbird triggers the valgrind segmentation fault on certain linux kernel versions.

For the successful cases under 3.19.5 and the failed cases under 4.7.0.1, I could set a breakpoint on fork() [which is used to call an external program to check for graphics adaptor capability], and at that point I could dump /proc/$pid/maps, in addition to the dump at the breakpoint placed on main().

Under the 4.9.x series kernel, valgrind segfaults way before this |fork|, so I obtained /proc/$pid/maps only at the breakpoint at main().

The different dumps of maps revealed a change near the stack of valgrind under different kernel versions. I show the excerpts near the end of the maps listing.

From the 4.9.x series kernel: failure case

...
806203000-806334000 rwxp 00000000 00:00 0
806af9000-806ce2000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0 [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
(gdb) cont

From 4.7.0.1: failure case

...
805b29000-805c29000 rwxp 00000000 00:00 0
8063ee000-8067d7000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffcbe4d1000-7ffcbe4f3000 rw-p 00000000 00:00 0 [stack]
7ffcbe5ee000-7ffcbe5f0000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

From 3.19.5: success case. Note that the stack of the first process that runs is at an earlier map, [stack:9283].

...
802001000-802cdc000 rwxp 00000000 00:00 0
802d8c000-802ea8000 rwxp 00000000 00:00 0
802ea8000-802eaa000 ---p 00000000 00:00 0
802eaa000-802faa000 rwxp 00000000 00:00 0 [stack:9283]
802faa000-802fac000 ---p 00000000 00:00 0
802fac000-802fad000 rw-s 00000000 08:18 482 /tmp/vgdb-pipe-shared-mem-vgdb-9283-by-ishikawa-on-???
802fad000-802fcd000 rwxp 00000000 00:00 0
803081000-8034c4000 rwxp 00000000 00:00 0
80356a000-805649000 rwxp 00000000 00:00 0
805749000-805849000 rwxp 00000000 00:00 0
805c0e000-805d0e000 rwxp 00000000 00:00 0
805e0e000-806019000 rwxp 00000000 00:00 0
806203000-806334000 rwxp 00000000 00:00 0
806af9000-806ce2000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7fffe6088000-7fffe60aa000 rw-p 00000000 00:00 0
7fffe616b000-7fffe616d000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
(gdb) cont

The above shows the maps layout change I noticed.

Timing/race issue:

Tough. While debugging this way under the linux 4.7.0.1 kernel, I learned that the issue is timing-dependent and racy (!). The problem did not occur when I stepped in gdb using "s" and "fun"; thunderbird ran fine under valgrind then. Ugh... (valgrind + thunderbird segfaults if I let it run at full speed without attaching gdb to the interpreted thunderbird.)

So there is only a very small chance of narrowing the search down to a particular behavior of thunderbird. I doubt that we could narrow it down to a single line of mozilla thunderbird code.

My current thinking: signal handler and mmap layout change?

Rather, I think it is the handling of signals, such as the virtual timer events that seem to be used often inside C-C TB, that messes with valgrind under certain kernels.

The reasoning: while I see many signal handling dumps in the successful execution of valgrind + mozilla thunderbird under the 3.19.5 kernel, I don't see them in the failure cases (segfault) under the 4.7.0.1 and 4.9 series kernels.

I suspect that the lack of signal handling dumps in the failed cases suggests that there is a timing window in which the signal handler is not quite set up (inside valgrind?) when a memory error is detected.

(I am still trying to grapple with the situation where valgrind fails to properly catch the SIGSEGV condition. Is the signal handler not properly set up in those cases? OTOH, valgrind seems to be capable of enlarging the stack in response to a memory violation. OTOH, the final SIGSEGV that killed valgrind is not handled.)

[I took the following snippets from the 4.9.x series testing's log2.]

Successful stack fault handling

1st case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffd6a8 tid=1 ESP=0xffeffd6a0 seg=0xffe7b0000-0xffeffdfff
--3174-- -> extended stack base to 0xffeffd000

2nd case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffc678 tid=1 ESP=0xffeffc670 seg=0xffe7b0000-0xffeffcfff
--3174-- -> extended stack base to 0xffeffc000
--3174-- REDIR: 0x5b79750 (libc.so.6:bcmp) redirected to 0x4a26742 (_vgnU_ifunc_wrapper)

Final error in the 4.9.x series:

gettid() = 3174
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid() = 3174
write(1027, "--3174-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++

Wild guess: Is it possible that, due to mmap layout change(s), the stack fault is reported as

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---

and valgrind did not notice that it was running out of stack or heap or
whatever, which valgrind could handle either by extending it or by printing a warning and stopping, etc.? Maybe it did not recognize the SIGSEGV as such and simply let the error through without catching it? (See random observation [2].)

I am saying this because the reported si_addr=0xffeffa24c is rather near the stack addresses reported in the stack faults above.

Anyway, I would appreciate any feedback.

Random observations:

[1] The error address in SIGSEGV seems to be repeatable under 4.9.x. This may be because the error occurs even before control reaches the breakpoint at the |fork| call; it occurs before any interaction. OTOH, under 4.7.0.1, the error seems to occur around the |fork| call where I set a breakpoint. There *IS* a timing issue involved, and the address reported is slightly different across the log files. (Does it relate to the address on the stack at that moment?)

[2] Valgrind uses SIGSEGV for handling stack overflow. Running valgrind under gdb gave me some clues on this. I ran valgrind under gdb once under the 4.9.x series kernel. (See the log5-main.txt log file.) gdb reports SIGSEGV three times when valgrind crashes. Actually, the first two SIGSEGVs are the ones that valgrind uses internally for checking stack faults. valgrind adjusts the stack accordingly and chugs along happily. Previously, I did not notice this, and when I saw SIGSEGV reported by GDB, I gave up. But this time, I wised up and "c"ontinued, and voila! The first two times, valgrind continued. Only at the third SIGSEGV did valgrind crash.

[3] Testing after a reboot. I cheated a bit: I did not reboot the 4.9.x kernel every time I tested the valgrind + thunderbird combination. Under linux, a process should not be disturbed in such a manner that debugging a user program becomes unrepeatable just because a few independent processes ran before it. Well, that is the principle. I know that there have been cases where the linux kernel, especially its fork()-related handling, harbored a few dormant bugs for a long time.
That is why I had a short-lived excitement when I noticed that the segfault occurred near the fork call under 4.7.0.1. I thought I had hit upon a linux kernel bug or a similar bug in valgrind. [Because of this observation, too, I definitely think signal handler setup/release has something to do with it.]

The fact is that there is no way for me to run an X-based application under Debian WITHOUT RUNNING A COUPLE OF THOUSAND PROCESSES before I get a chance to log in from the login manager and reach the X desktop. I found that the PID started above 2000 when I rebooted and began testing, and I am not even sure whether it is a wrapped-around PID or not (modulo 2^16). So "running a test afresh" is almost impossible under today's linux if one wants to test an X-based application.

TIA
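
For anyone who wants to check the "wild guess" about the fault address against a maps dump, here is a minimal Python sketch. It is not taken from the logs or from valgrind; the helper names parse_maps and locate are hypothetical, chosen only for illustration. It parses a /proc/$pid/maps excerpt and reports whether an address is mapped and, if not, how far it lies below the next mapping. The excerpt is the 4.9.x one quoted earlier, which was captured at the breakpoint at main(), so it may not match the layout at the moment of the crash.

```python
from typing import List, NamedTuple, Optional, Tuple

# 4.9.x excerpt quoted in this report (snapshot at main(), not at crash time).
MAPS_4_9 = """\
...
806203000-806334000 rwxp 00000000 00:00 0
806af9000-806ce2000 rwxp 00000000 00:00 0
ffeffe000-fff001000 rw-p 00000000 00:00 0
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0 [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
"""


class Mapping(NamedTuple):
    start: int   # inclusive
    end: int     # exclusive
    perms: str
    name: str


def parse_maps(text: str) -> List[Mapping]:
    """Parse /proc/<pid>/maps text; lines that are not mappings are skipped."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5 or "-" not in fields[0]:
            continue  # e.g. the "..." elision marker
        start_s, end_s = fields[0].split("-")
        name = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], name))
    return sorted(mappings)  # sorted by start address


def locate(mappings: List[Mapping], addr: int) -> Tuple[Optional[Mapping], Optional[int]]:
    """Return (mapping, gap).

    gap == 0    : addr lies inside the returned mapping.
    gap > 0     : addr is unmapped; the returned mapping is the next one above,
                  and gap is the distance from addr up to its start.
    (None, None): addr is above every mapping in the list.
    """
    for m in mappings:
        if m.start <= addr < m.end:
            return m, 0
        if addr < m.start:
            return m, m.start - addr
    return None, None


if __name__ == "__main__":
    fault = 0xFFEFFA24C  # si_addr from the final SIGSEGV in the 4.9.x log
    m, gap = locate(parse_maps(MAPS_4_9), fault)
    if m is None:
        print(f"{fault:#x} is above all listed mappings")
    elif gap == 0:
        print(f"{fault:#x} is inside {m.start:x}-{m.end:x} {m.name}")
    else:
        print(f"{fault:#x} is unmapped, {gap:#x} bytes below {m.start:x}-{m.end:x}")
```

With this excerpt, the fault address comes out as unmapped and 0x3db4 bytes below the start of the ffeffe000-fff001000 region. To use it on a live process, read /proc/$pid/maps at the breakpoint and pass its contents to parse_maps instead of the quoted excerpt.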