Bug#483811: [fam] fam segfaults
I experienced more core dumps, but haven't had time to examine them. In particular, at a time I had three core dumps in a row on Jul 14: -rw--- 1 ale staff 6180864 Jul 14 17:10 core_epoch=1531581032_pid=5409_file=!usr!sbin!famd -rw--- 1 ale staff 9302016 Jul 14 17:09 core_epoch=1531580943_pid=5317_file=!usr!sbin!famd -rw--- 1 ale staff 26234880 Jul 14 17:05 core_epoch=1531580714_pid=26593_file=!usr!sbin!famd The first of these only run for 1 minute. The "closure" in TCP_Client::unblock_handler (closure=0x295ca20) at TCP_Client.c++:270 is always a LocalClient with a bad to_be_scanned element. In the first (short) dump, it had: to_be_scanned = { , std::allocator >> = std::set with 251 elements = { Elements 0 to 102 were bad memory pointers. Elements 103 to 250 were entries of type DirEntry with a common parent corresponding to a courierimapkeywords directory of a Courier-MTA mail folder. The parent had 1170 entries, and this batch matched entries 12 to 159 in the parent. Since those file names contain Unix epoch, I could check that most --not all-- of them were in ascending order. One file was listed twice, and was out of order. Dates ranged from June to a few hours before crash. }, } Courier at times removes a bunch of those files and stores the corresponding IMAP keywords in a single ":list" file. I have to study the code better to understand how parent and clients can be notified of Interest* deletion. Maybe next month...
Bug#483811: [fam] fam segfaults
On sat 26 may 2018 it segfaulted again. Identical pattern: (gdb) bt #0 0x6811 in ?? () #1 0x00412b1c in TCP_Client::unblock_handler (closure=0xdb1870) at TCP_Client.c++:270 #2 0x004105dc in Scheduler::handle_io (fds=0xaf3eb0, fds@entry=0x7ffc71013620, iotype=::FDInfo::read, iotype@entry=::FDInfo::write) at Scheduler.c++:315 #3 0x00410811 in Scheduler::select () at Scheduler.c++:342 #4 0x00402fa5 in loop () at Scheduler.h:89 #5 main (argc=, argv=0x7ffc71013898) at main.c++:306 Clearly, the set contains a pointer to an object of type Interest which was deleted. Since ClientInterest issues a dequeue_from_scan when it is deleted, the culprit must be a DirEntry. In fact, a call to its scan() function would not be detectable from the core dump, because the code is tail-optimized: bool DirEntry::scan(Interest *ip) { assert(!ip); return parent->scan(this); } (gdb) info address DirEntry::scan Symbol "DirEntry::scan(Interest*)" is a function at address 0x404750. (gdb) disass 0x404750 Dump of assembler code for function DirEntry::scan(Interest*): 0x00404750 <+0>: test %rsi,%rsi 0x00404753 <+3>: jne0x40476b 0x00404755 <+5>: mov%rdi,%rax 0x00404758 <+8>: mov0xc8(%rdi),%rdi 0x0040475f <+15>:mov%rax,%rsi 0x00404762 <+18>:mov(%rdi),%rdx 0x00404765 <+21>:mov0x20(%rdx),%rdx 0x00404769 <+25>:jmpq *%rdx <--- this jumps to a deleted ClientInterest 0x0040476b <+27>:push %rax 0x0040476c <+28>:mov$0x414c80,%ecx 0x00404771 <+33>:mov$0x3b,%edx 0x00404776 <+38>:mov$0x414b8e,%esi 0x0040477b <+43>:mov$0x414b9b,%edi 0x00404780 <+48>:callq 0x4024b0 <__assert_fail@plt> End of assembler dump. Now I modified ClientInterest.h and Directory.c++, and further modified ClientInterest.c++. I have: --- orig/src/ClientInterest.h 2018-02-06 18:05:59.0 +0100 +++ ./src/ClientInterest.h 2018-06-16 10:25:18.0 +0200 @@ -73,6 +73,8 @@ virtual FileSystem * get_filesystem() { return myfilesystem; } +void dequeue_from_scan(Interest *ip); + private: enum { ACTIVE_STATE = 1 << 0 }; --- orig/src/ClientInterest.c++ 2003-01-18 15:18:12.0 +0100 +++ ./src/ClientInterest.c++2018-06-16 10:29:08.0 +0200 @@ -62,6 +62,8 @@ ClientInterest::~ClientInterest() { myfilesystem->cancel(this, fs_request); +if (myclient) +myclient->dequeue_from_scan(this); } void @@ -162,6 +164,13 @@ myclient->dequeue_from_scan(ip); } +void +ClientInterest::dequeue_from_scan(Interest *ip) +{ +if (myclient) +myclient->dequeue_from_scan(ip); +} + bool ClientInterest::do_scan() { --- orig/src/Directory.c++ 2003-01-18 15:18:12.0 +0100 +++ ./src/Directory.c++ 2018-06-16 10:17:32.0 +0200 @@ -77,6 +77,7 @@ { (void) chdir(); while (p) { q = p->next; + dequeue_from_scan(p); delete p; p = q; } Waiting for next core dump...
Bug#483811: [fam] fam segfaults
On Fri, 30 Mar 2018 12:47:15 +0200 I wrote: > That suggests that Set.h doesn't work as expected. No, that wasn't the case. A couple of days ago I got a core dump with exactly the same back trace (bt), even if using std::set. Core was generated by `/usr/sbin/famd -v -f -T 0'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x in ?? () (gdb) bt #0 0x in ?? () #1 0x00412aec in TCP_Client::unblock_handler (closure=0x1935d20) at TCP_Client.c++:270 #2 0x004105ac in Scheduler::handle_io (fds=0x1c028f0, fds@entry=0x7ffd7bd92ef0, iotype=::FDInfo::read, iotype@entry=::FDInfo::write) at Scheduler.c++:315 #3 0x004107e1 in Scheduler::select () at Scheduler.c++:342 #4 0x00402fa5 in loop () at Scheduler.h:89 #5 main (argc=, argv=0x7ffd7bd93168) at main.c++:306 Need to look more closely at ip. (gdb) hd Undefined command: "hd". Try "help". (gdb) define hd Type commands for definition of "hd". End with a line saying just "end". >dump binary memory dump.tmp $arg0 $arg0+$arg1 >shell hd dump.tmp >end (gdb) p sizeof(Interest) $11 = 200 (gdb) p sizeof(File) $43 = 240 (gdb) hd (char*)ip 240 30 c1 73 02 00 00 00 00 50 56 02 02 00 00 00 00 |0.s.PV..| 0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || 0020 e0 29 c0 01 00 00 00 00 36 30 30 30 36 38 31 31 |.)..60006811| 0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || * 00c0 7f 00 00 01 01 00 00 00 f0 44 91 01 00 00 00 00 |.D..| 00d0 10 2b c0 01 00 00 00 00 01 00 00 00 00 00 00 00 |.+..| 00e0 f0 00 00 00 00 00 00 00 60 00 00 00 00 00 00 00 |`...| We have the following layout (from Interest.h, ClientInterest.h and `p &((Interest*)0).'): offset Instance Variables content found _vtbl bad (virtual functions) 0008Interest *hashlink; ?? 0010dev_t dev; 0 0018ino_t ino; 0 0020char *const myname; mixed buffer (see MYNAME below) 0028ScanState scan_state: 1;"6000?6811I?" ExecState cur_exec_state: 1; ExecState old_exec_state: 1; 0029char ci_char; 002achar dir_char; 0030struct stat old_stat; 00c0in_addr myhost; 127.0.0.1 00c4bool mypath_exported_to_host; 1 (remaining 3 bytes could have been set before) 00c8Client *myclient; bad, 0x19144f0 (see CLIENT below) 00d0Request request; 00d8const Cred mycred; 00e0FileSystem *myfilesystem; 00e8Request fs_request; MYNAME: (gdb) hd 0x1c029e0 80 00 00 00 00 00 00 00 00 32 39 2e 4d 38 33 31 38 |29.M8318| 0010 34 37 50 32 37 31 36 38 56 30 30 30 30 30 30 30 |47P27168V000| 0020 30 30 30 30 30 36 38 31 31 49 30 30 30 30 30 30 |06811I00| 0030 30 30 30 30 34 45 35 33 37 32 5f 31 2e 6e 6f 72 |4E5372_1.nor| 0040 74 68 2c 53 3d 32 32 35 36 36 35 00 53 00 00 53 |th,S=225665.S..S| 0050 The same is true for other Interest pointers still in to_be_scanned, $tbs0, ...: (gdb) hd $tbs0 200 90 d6 73 02 00 00 00 00 c0 2c c0 01 00 00 00 00 |..s..,..| 0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || 0020 40 2a c0 01 00 00 00 00 36 30 30 30 36 38 31 31 |@*..60006811| 0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || * 00c0 7f 00 00 01 01 35 34 33 |.543| 00c8 (gdb) p ((Interest*)$tbs0)->myname $13 = 0x1c02a40 "" (gdb) hd 0x1c02a40 80 00 00 00 00 00 00 00 00 32 39 2e 4d 38 34 33 38 |29.M8438| 0010 38 38 50 32 37 31 37 30 56 30 30 30 30 30 30 30 |88P27170V000| 0020 30 30 30 30 30 36 38 31 31 49 30 30 30 30 30 30 |06811I00| 0030 30 30 30 30 34 45 35 33 38 43 5f 31 2e 6e 6f 72 |4E538C_1.nor| 0040 74 68 2c 53 3d 32 32 35 36 36 35 00 00 00 00 00 |th,S=225665.| 0050 b0 01 00 00 00 00 00 00 30 00 00 00 00 00 00 00 |0...| (gdb) hd $tbs1 200 00 2b c0 01 00 00 00 00 80 2e c0 01 00 00 00 00 |.+..| 0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || 0020 00 2c c0 01 00 00 00 00 36 30 30 30 36 38 31 31 |.,..60006811| 0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 || * 00c0 7f 00 00 01 01 35 34 33 |.543| 00c8 (gdb) p ((Interest*)$tbs1)->myname $14 = 0x1c02c00 "" (gdb) hd 0x1c02c00 80 00 00 00 00 00 00 00 00 34 34 2e 4d 34 35 37 31 |44.M4571| 0010 36 50 32 37 32 30 35 56 30 30 30 30 30 30 30 30 |6P27205V| 0020 30 30 30 30 36 38 31 31 49 30 30 30 30 30 30 30 |6811I000| 0030 30
Bug#483811: [fam] fam segfaults
Yesterday I got another core dump and, again, it was like the previous ones: #1 0x00412927 in TCP_Client::unblock_handler (closure=0x1ffa030) at TCP_Client.c++:270 I checked the assertions were indeed compiled in, by (gdb) disass ((TCP_Client*)$closure)->dequeue_from_scan, which ended with: 0x004129e4 <+84>:callq 0x402310 <__assert_fail@plt> Yet, those assertions didn't catch anything. So I must have extracted from client->to_be_scanned an ip with vtbl[4] == 0, which was never put into the set. That suggests that Set.h doesn't work as expected. To check the latter hypothesis, I modified TCP_Client.h so as to use a standard set instead: --- fam2/fam-2.7.0/build-tree/fam-2.7.0/src/TCP_Client.h2003-01-18 15:18:12.0 +0100 +++ fam/fam-2.7.0/build-tree/fam-2.7.0/src/TCP_Client.h 2018-03-30 10:53:03.0 +0200 @@ -25,9 +25,28 @@ #include "ClientConnection.h" #include "MxClient.h" -#include "Set.h" + +// use std::set instead +//#include "Set.h" +#include #include "Cred.h" +class stdset : public std::set +{ +public: +// Inherit insert(), size() +bool contains(const key_type e) {return find(e) != end();} +void remove(const key_type e) {erase(e);} +key_type first() const {return begin() == end()? NULL: *begin();} +key_type next(const key_type k) const { + const_iterator it = find(k); + if (it != end()) ++it; + if (it == end()) return NULL; + return *it; +} +// sizeofnode() method is not used +}; + struct sockaddr_un; // A TCP_Client is a client that connects to fam using the TCP/IP @@ -60,7 +79,8 @@ protected: private: -Set to_be_scanned; +// Set to_be_scanned; +stdset to_be_scanned; Scanner *my_scanner; ClientConnection conn; Activity a;// simply declaring it activates timer. Waiting for next core dump...
Bug#483811: [fam] fam segfaults
Yesterday it happened again, exactly the same pattern: Reading symbols from ./build-tree/fam-2.7.0/src/famd...done. [New LWP 18773] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/famd -v -f -T 0'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x in ?? () (gdb) bt #0 0x in ?? () #1 0x004128e7 in TCP_Client::unblock_handler (closure=0x10982a0) at TCP_Client.c++:270 #2 0x004103cc in Scheduler::handle_io (fds=0x1a4dd70, fds@entry=0x7ffce30d7a40, iotype=::FDInfo::read, iotype@entry=::FDInfo::write) at Scheduler.c++:315 #3 0x00410601 in Scheduler::select () at Scheduler.c++:342 #4 0x00402dc5 in loop () at Scheduler.h:89 #5 main (argc=, argv=0x7ffce30d7cb8) at main.c++:306 This confirms the diagnosis. I put a couple of assertions like so: --- src/TCP_Client.c++.orig 2003-01-18 15:18:12.0 +0100 +++ src/TCP_Client.c++ 2018-03-15 18:35:24.0 +0100 @@ -287,12 +287,14 @@ { if (!to_be_scanned.size()) conn.ready_for_input(false); +assert((*(long **)ip)[4] != 0L); // ip func[4] (scan) not virtual to_be_scanned.insert(ip); } void TCP_Client::dequeue_from_scan(Interest *ip) { +assert((*(long **)ip)[4] != 0L); // ip func[4] (scan) not virtual to_be_scanned.remove(ip); if (!to_be_scanned.size()) conn.ready_for_input(true); While recompiling, I noticed this warning: Listener.o: In function `Listener::create_local_client(TCP_Client&, unsigned int)': fam-2.7.0/build-tree/fam-2.7.0/src/Listener.c++:220: warning: the use of `tempnam' is dangerous, better use `mkstemp' I'm not sure how that is related to this bug, if at all. Waiting for next core dump...
Bug#483811: [fam] fam segfaults
Package: fam Version: 2.7.0-17.1 Followup-For: Bug #483811 I eventually managed to compile a non-stripped version of famd. In order to run it, I modified /etc/init.d/fam, which reportbug detected and automatically copied below (search "-ale:"). I'll attach /root/famd-wrapper in a moment. It's a useful workaround anyway. Now what I found in the dumped core file: Reading symbols from /usr/sbin/famd...done. [New LWP 27816] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/famd -v -f -T 0'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x in ?? () (gdb) bt #0 0x in ?? () #1 0x004128e7 in TCP_Client::unblock_handler (closure=0x1a9f4c0) at TCP_Client.c++:270 #2 0x004103cc in Scheduler::handle_io (fds=0x17438c0, fds@entry=0x7ffd982ee5b0, iotype=::FDInfo::read, iotype@entry=::FDInfo::write) at Scheduler.c++:315 #3 0x00410601 in Scheduler::select () at Scheduler.c++:342 #4 0x00402dc5 in loop () at Scheduler.h:89 #5 main (argc=, argv=0x7ffd982ee828) at main.c++:306 Scheduler::handle_io called the handler, presumably NetConnection::write_handler() which called NetConnection::flush() which called NetConnection::set_handlers(). All of those function return void and call the next function right before returning, so their stack frames are optimized out. The relevant code in frame #1 is: Interest *ip; while (client->ready_for_events() && (ip = client->to_be_scanned.first())) { client->to_be_scanned.remove(ip); ip->scan(); } The culprit value of ip: (gdb) info vtbl ip vtable for 'Interest' @ 0x1743530 (subobject @ 0x17438c0): [0]: 0x20 [1]: 0x311 [2]: 0x17431b0 [3]: 0x17438b0 [4]: 0x0 [5]: 0x0 [6]: 0x1743470 [7]: 0x1743396 [8]: 0x0 [9]: 0x0 ip->scan() corresponds to entry [4], callq *0x20(%rdx), hence segv. I checked the value of ip is correct. Although it was just removed from the set, the value is consistent with the registers. It is not an Interest*. In addition, I looked at the 6 key entries in the root node of the set, defined as Set to_be_scanned; I found: (gdb) set $i = 0 (gdb) while ($i < 6) >printf "%s\n", >typeid(*(*(*(TCP_Client*)0x1a9f4c0).to_be_scanned.root).key[$i]).__name >set $i = $i + 1 >end 8DirEntry ` 8Interest 8DirEntry 8DirEntry 8DirEntry So there is a pure abstract element and an unknown pointer. The diagnosis is garbage in the set. Now this could be a flaw in BTree.h or something strange in ClientInterest::scan(), which seems to be the only place where new entries are added to the set. Ideas? The key[0] entry is actually a mail server temporary file: (gdb) p (*(*(*(TCP_Client*)0x1a9f4c0).to_be_scanned.root).key[0]).myname $1 = 0x18479b0 ".5058547.1517362926.M462851P21244V", '0' , "6811I00566211_0.north,S=4981" (gdb) p (*(*(*(TCP_Client*)0x1a9f4c0).to_be_scanned.root).key[0]).old_stat.st_mtim $2 = {tv_sec = 1517563957, tv_nsec = 0} The tv_sec matches the creation of the core dump, four days ago. Note that it took weeks to get a core dump. Now I wait for the next. Ale -- System Information: Debian Release: 8.10 APT prefers oldstable-updates APT policy: (500, 'oldstable-updates'), (500, 'oldstable') Architecture: amd64 (x86_64) Foreign Architectures: i386 Kernel: Linux 3.16.0-4-amd64 (SMP w/8 CPU cores) Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) Shell: /bin/sh linked to /bin/bash Init: sysvinit (via /sbin/init) Versions of packages fam depends on: ii libc6 2.19-18+deb8u10 ii libgcc11:4.9.2-10 ii libstdc++6 4.9.2-10 ii lsb-base 4.1+Debian13+nmu1 ii rpcbind [portmap] 0.2.1-6+deb8u2 ii update-inetd 4.43 fam recommends no packages. fam suggests no packages. -- Configuration Files: /etc/init.d/fam changed: PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin DAEMON="/usr/sbin/famd" NAME="FAM" DESC="file alteration monitor" FAMOPTS="-T 0" test -x $DAEMON || exit 0 egrep -qs "^(sgi_fam|391002)" /etc/inetd.conf . /lib/lsb/init-functions set -e case "$1" in start) status_of_proc $DAEMON $NAME > /dev/null && exit 0 log_daemon_msg "Starting $DESC" "$NAME" # -ale: was:start-stop-daemon --start --quiet --exec $DAEMON -- $FAMOPTS < /dev/null (setsid /root/famd-wrapper& )& log_end_msg $? ;; stop) log_daemon_msg "Stopping $DESC" "$NAME" start-stop-daemon --stop --oknodo --quiet --exec $DAEMON log_end_msg $? ;; restart|force-reload) $0 stop sleep 1 $0 start ;; status) status_of_proc $DAEMON $NAME ;; *) echo "Usage: $0 {start|stop|restart|force-reload|status}" >&2 exit 1 ;; esac exit 0 -- no debconf information #! /bin/sh # called from /etc/init.d/fam, where I replaced the start-stop-daemon #
Bug#483811: [fam] fam segfaults
Package: fam Version: 2.7.0-13.2 Severity: normal --- Please enter the report below this line. --- no idea why, but i found this in my /var/log/messages: kernel: famd[3044]: segfault at e08 rip 414f45 rsp 7fffedd01c80 error 4 sorry, I don't have anything more. jürgen --- System information. --- Architecture: amd64 Kernel: Linux 2.6.24-5-1 Debian Release: lenny/sid 500 unstablewww.debian-multimedia.org 500 unstablemi.mirror.garr.it 500 unstableftp.tuke.sk 500 unstableftp.de.debian.org 500 unstableftp.at.debian.org 500 unstabledebian.inode.at 1 experimentaldebian.inode.at --- Package information. --- Depends(Version) | Installed -+- libc6 (= 2.7-1) | 2.7-11 libgcc1 (= 1:4.1.1-21) | 1:4.3.0-5 libstdc++6 (= 4.1.1-21) | 4.3.0-5 lsb-base (= 3.0-6) | 3.2-12 portmap | 6.0-5.1 update-inetd | 4.30 -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]