Bug#301511: sysklogd: hangs the whole system
On Sat, 14 May 2005 08:22:56, GOTO Masanori wrote: I have not reappeared sysklogd breakage yet, but IMHO this problem is potentially existed - I agreed Miquel's proposal. Miquel, did you confirm this problem using sysklogd? If this patch fixes this bug, I think we should do NMU for sarge. Yes, the reason I filed this bug was not theoretical - the servers I run regularly hung completely because lots of processes on the box blocked (cron calls syslog() ), and the process table got full. No way to login, not even on the serial console (login calls syslog() too!), no way to recover .. Mike.
Bug#301511: sysklogd: hangs the whole system
At Wed, 4 May 2005 14:11:38 +0200, Miquel van Smoorenburg wrote: Christian Hammers wrote: For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386 and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 10 of 10 attempts. I also re-ran the ctime-hang.c test program on i386 uniprocessor and SMP, and amd64 SMP (all up-to-date sarge and 2.6 kernel) and ctime-hang.c locked every time right away. I confirm that Miquel's ctime-hang.c stops its execution on 2.6 kernel + the latest glibc. Recent glibc switches to use NPTL instead of LinuxThreads when 2.6 kernel is used. If you set environment variable LD_ASSUME_KERNEL=2.4.1 and rerun his programs on 2.6 kernel, the problem is just disappeared (because LinuxThreads is used). Note that NPTL uses futex for mutex protection, instead LinuxThreads uses signal. SUSv3, aka POSIX, defines the signal handler safe functions as follows: http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04 Unfortunatelly, ctime() is not defined on this list. So, glibc does not guarantee the sane behavior when one uses ctime() in signal handler. BTW, I'm surprised that sysklogd calls some functions in signal handler. I have not reappeared sysklogd breakage yet, but IMHO this problem is potentially existed - I agreed Miquel's proposal. Miquel, did you confirm this problem using sysklogd? If this patch fixes this bug, I think we should do NMU for sarge. Regards, -- gotom -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#301511: sysklogd: hangs the whole system
Christian Hammers wrote: For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386 and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 10 of 10 attempts. I also re-ran the ctime-hang.c test program on i386 uniprocessor and SMP, and amd64 SMP (all up-to-date sarge and 2.6 kernel) and ctime-hang.c locked every time right away. Mike. -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#301511: sysklogd: hangs the whole system
For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386 and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 10 of 10 attempts. bye, -christian- $ strace -e trace=\!time ./a.out ... set_thread_area({entry_number:-1 - 6, base_addr:0x401522a0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 munmap(0x40018000, 20908) = 0 rt_sigaction(SIGALRM, {0x8048464, [], 0}, NULL, 8) = 0 setitimer(ITIMER_REAL, {it_interval={0, 1000}, it_value={0, 1000}}, NULL) = 0 brk(0) = 0x804a000 brk(0x806b000) = 0x806b000 brk(0) = 0x806b000 open(/etc/localtime, O_RDONLY)= 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=837, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000 --- SIGALRM (Alarm clock) @ 0 (0) --- futex(0x4014fcec, FUTEX_WAIT, 2, NULL unfinished ... -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#301511: sysklogd: hangs the whole system
I ran Miquel's ctime-hang.c, on a sarge machine with a 2.6 kernel, but it kept on running for many minutes (that is, the SIGALRM happened all the time). This didn't seem like a lockup. Did I do something wrong in my attempt to reproduce this? -- To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#301511: sysklogd: hangs the whole system
Miguel, I was looking through the sysklogd patch that you provided, and looked through the glibc code. I don't see any __libc_lock magic happening. Since ctime isn't reentrant, no effort is made to protect the coder against threaded programming. I also checked the 'time' function in case you were worried about that. Which glibc function were you worried about? I can see this code being included for completeness so that an entry isn't lost, but I don't think this is what's causing your breakage. TKs, Jeff Bailey
Bug#301511: sysklogd: hangs the whole system
On Thu, 2005-03-31 at 09:31 -0500, Jeff Bailey wrote: Miguel, I was looking through the sysklogd patch that you provided, and looked through the glibc code. I don't see any __libc_lock magic happening. ctime() - localtime() - __tz_convert() - __libc_lock_lock() Since ctime isn't reentrant, no effort is made to protect the coder against threaded programming. I also checked the 'time' function in case you were worried about that. Which glibc function were you worried about? I can see this code being included for completeness so that an entry isn't lost, but I don't think this is what's causing your breakage. It is, I tested it :) Compile the attached C program, run with sarge glibc on a 2.6 kernel. It's the most clear if you run it like this: $ strace -e trace=\!time ./a.out Lockup within a fraction of a second. Mike. #include sys/time.h #include time.h #include signal.h #include stdio.h #include stdlib.h #include unistd.h #include string.h volatile char *r; void handler(int sig) { time_t t; time(t); r = ctime(t); } int main() { struct itimerval it; struct sigaction sa; time_t t; memset(sa, 0, sizeof(sa)); sa.sa_handler = handler; sigaction(SIGALRM, sa, NULL); it.it_value.tv_sec = 0; it.it_value.tv_usec = 1000; it.it_interval.tv_sec = 0; it.it_interval.tv_usec = 1000; setitimer(ITIMER_REAL, it, NULL); while(1) { time(t); r = ctime(t); } return 0; }
Bug#301511: sysklogd: hangs the whole system
Package: sysklogd Version: 1.4.1-16 Severity: grave Justification: breaks the whole system References: http://lkml.org/lkml/2005/3/26/37 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=103392 http://lkml.org/lkml/2004/12/21/208 http://lkml.org/lkml/2004/11/2/17 Syslogd can hang if domark() is called while the main loop is just calling in the ctime() libc function. ctime() is not reentrant and will make syslogd hang (with recent glibc and 2.6 kernels, because glibc uses __libc_lock which uses a [fm]utex). Because AF_UNIX SOCK_DGRAM sockets are blocking by default under Linux, after a short while everything on the system that calls syslog() will hang as well. Which means you can't login anymore, and almost all other network services hang as well. Also cron will fork every now and again and call syslog() so the whole process table will fill up. The system is fubared. The original syslogd protects itself from this by blocking SIGHUP and SIGALRM in logmsg(), but that is under #ifndef SYSV. The attached patch fixes this by adding POSIX sigprocmask calls in logmsg(), as that is the simplest fix. Non-bug related comments: A better fix would be to just set a flag in domark() and check that flag every so often in the mainloop and do the MARKing there. Someone should take out all the #ifdef/#ifndef SYSV stuff and replace it with POSIX routines so that this syslogd compiles under all modern OSes. I think that it doesn't even compile without #define SYSV anymore, anyway. Mike. --- sysklogd-1.4.1/syslogd.c.ORIG 2005-03-25 22:04:07.360493000 +0100 +++ sysklogd-1.4.1/syslogd.c2005-03-26 13:28:17.650819246 +0100 @@ -1558,11 +1558,21 @@ int fac, prilev, lognum; int msglen; char *timestamp; +#ifdef linux + sigset_t mask, omask; +#endif dprintf(logmsg: %s, flags %x, from %s, msg %s\n, textpri(pri), flags, from, msg); -#ifndef SYSV +#ifdef linux /* POSIX, really. */ + sigemptyset(mask); + sigaddset(mask, SIGHUP); + sigaddset(mask, SIGALRM); + sigprocmask(SIG_BLOCK, mask, omask); +#else +# ifndef SYSV omask = sigblock(sigmask(SIGHUP)|sigmask(SIGALRM)); +# endif #endif /* @@ -1603,8 +1613,12 @@ (void) close(f-f_file); f-f_file = -1; } -#ifndef SYSV +#ifdef linux + sigprocmask(SIG_SETMASK, omask, NULL); +#else +# ifndef SYSV (void) sigsetmask(omask); +# endif #endif return; } @@ -1668,8 +1682,12 @@ } } } -#ifndef SYSV +#ifdef linux + sigprocmask(SIG_SETMASK, omask, NULL); +#else +# ifndef SYSV (void) sigsetmask(omask); +# endif #endif } #if FALSE