Bug#301511: sysklogd: hangs the whole system

2005-05-16 Thread Miquel van Smoorenburg
On Sat, 14 May 2005 08:22:56, GOTO Masanori wrote:
 I have not reappeared sysklogd breakage yet, but IMHO this problem is
 potentially existed - I agreed Miquel's proposal.  Miquel, did you
 confirm this problem using sysklogd?  If this patch fixes this bug, I
 think we should do NMU for sarge.

Yes, the reason I filed this bug was not theoretical - the
servers I run regularly hung completely because lots of processes
on the box blocked (cron calls syslog() ), and the process
table got full. No way to login, not even on the serial console
(login calls syslog() too!), no way to recover ..

Mike.




Bug#301511: sysklogd: hangs the whole system

2005-05-14 Thread GOTO Masanori
At Wed, 4 May 2005 14:11:38 +0200,
Miquel van Smoorenburg wrote:
 Christian Hammers wrote:
  For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386
  and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 
  10 of 10 attempts.
 
 I also re-ran the ctime-hang.c test program on i386 uniprocessor
 and SMP, and amd64 SMP (all up-to-date sarge and 2.6 kernel) and
 ctime-hang.c locked every time right away.

I confirm that Miquel's ctime-hang.c stops its execution on 2.6 kernel
+ the latest glibc.  Recent glibc switches to use NPTL instead of
LinuxThreads when 2.6 kernel is used.  If you set environment variable
LD_ASSUME_KERNEL=2.4.1 and rerun his programs on 2.6 kernel, the
problem is just disappeared (because LinuxThreads is used).  Note that
NPTL uses futex for mutex protection, instead LinuxThreads uses
signal.

SUSv3, aka POSIX, defines the signal handler safe functions as
follows:

  
http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04

Unfortunatelly, ctime() is not defined on this list.  So, glibc does
not guarantee the sane behavior when one uses ctime() in signal
handler.  BTW, I'm surprised that sysklogd calls some functions in
signal handler.

I have not reappeared sysklogd breakage yet, but IMHO this problem is
potentially existed - I agreed Miquel's proposal.  Miquel, did you
confirm this problem using sysklogd?  If this patch fixes this bug, I
think we should do NMU for sarge.

Regards,
-- gotom




-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#301511: sysklogd: hangs the whole system

2005-05-04 Thread Miquel van Smoorenburg
Christian Hammers wrote:
 For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386
 and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 
 10 of 10 attempts.

I also re-ran the ctime-hang.c test program on i386 uniprocessor
and SMP, and amd64 SMP (all up-to-date sarge and 2.6 kernel) and
ctime-hang.c locked every time right away.

Mike.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#301511: sysklogd: hangs the whole system

2005-04-24 Thread Christian Hammers
For what it's worth, I also tried Miguel's ctime-hang.c on both a Sarge i386
and a Sid amd64 machine with 2.6 kernels and can reproduce the hang in 
10 of 10 attempts.

bye,

-christian-

$ strace -e trace=\!time ./a.out
...
set_thread_area({entry_number:-1 - 6, base_addr:0x401522a0, limit:1048575, 
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, 
useable:1}) = 0
munmap(0x40018000, 20908)   = 0
rt_sigaction(SIGALRM, {0x8048464, [], 0}, NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={0, 1000}, it_value={0, 1000}}, NULL) = 0
brk(0)  = 0x804a000
brk(0x806b000)  = 0x806b000
brk(0)  = 0x806b000
open(/etc/localtime, O_RDONLY)= 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=837, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x40018000
--- SIGALRM (Alarm clock) @ 0 (0) ---
futex(0x4014fcec, FUTEX_WAIT, 2, NULL unfinished ...



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#301511: sysklogd: hangs the whole system

2005-04-18 Thread Lars Wirzenius
I ran Miquel's ctime-hang.c, on a sarge machine with a 2.6 kernel, but
it kept on running for many minutes (that is, the SIGALRM happened all
the time). This didn't seem like a lockup. Did I do something wrong in
my attempt to reproduce this?



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#301511: sysklogd: hangs the whole system

2005-03-31 Thread Jeff Bailey




Miguel,

I was looking through the sysklogd patch that you provided, and looked through the glibc code. I don't see any __libc_lock magic happening. Since ctime isn't reentrant, no effort is made to protect the coder against threaded programming. I also checked the 'time' function in case you were worried about that.

Which glibc function were you worried about?

I can see this code being included for completeness so that an entry isn't lost, but I don't think this is what's causing your breakage.

TKs,
Jeff Bailey





Bug#301511: sysklogd: hangs the whole system

2005-03-31 Thread Miquel van Smoorenburg
On Thu, 2005-03-31 at 09:31 -0500, Jeff Bailey wrote:
 Miguel,
 
 I was looking through the sysklogd patch that you provided, and looked
 through the glibc code.  I don't see any __libc_lock magic happening. 

ctime() - localtime() - __tz_convert() - __libc_lock_lock()

  Since ctime isn't reentrant, no effort is made to protect the coder
 against threaded programming.  I also checked the 'time' function in
 case you were worried about that.
 
 Which glibc function were you worried about?
 
 I can see this code being included for completeness so that an entry
 isn't lost, but I don't think this is what's causing your breakage.

It is, I tested it :) Compile the attached C program, run with sarge
glibc on a 2.6 kernel. It's the most clear if you run it like this:

$ strace -e trace=\!time ./a.out

Lockup within a fraction of a second.

Mike.
#include sys/time.h
#include time.h
#include signal.h
#include stdio.h
#include stdlib.h
#include unistd.h
#include string.h

volatile char *r;

void handler(int sig)
{
	time_t			t;

	time(t);
	r = ctime(t);
}

int main()
{
	struct itimerval	it;
	struct sigaction	sa;
	time_t			t;

	memset(sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sigaction(SIGALRM, sa, NULL);

	it.it_value.tv_sec = 0;
	it.it_value.tv_usec = 1000;
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = 1000;
	setitimer(ITIMER_REAL, it, NULL);

	while(1) {
		time(t);
		r = ctime(t);
	}

	return 0;
}



Bug#301511: sysklogd: hangs the whole system

2005-03-26 Thread Miquel van Smoorenburg
Package: sysklogd
Version: 1.4.1-16
Severity: grave
Justification: breaks the whole system

References:
  http://lkml.org/lkml/2005/3/26/37
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=103392
  http://lkml.org/lkml/2004/12/21/208
  http://lkml.org/lkml/2004/11/2/17

Syslogd can hang if domark() is called while the main loop is just
calling in the ctime() libc function. ctime() is not reentrant
and will make syslogd hang (with recent glibc and 2.6 kernels,
because glibc uses __libc_lock which uses a [fm]utex).

Because AF_UNIX SOCK_DGRAM sockets are blocking by default under
Linux, after a short while everything on the system that calls
syslog() will hang as well. Which means you can't login anymore,
and almost all other network services hang as well.
Also cron will fork every now and again and call syslog() so the
whole process table will fill up. The system is fubared.

The original syslogd protects itself from this by blocking SIGHUP
and SIGALRM in logmsg(), but that is under #ifndef SYSV.

The attached patch fixes this by adding POSIX sigprocmask calls
in logmsg(), as that is the simplest fix.

Non-bug related comments:

A better fix would be to just set a flag in domark() and check that
flag every so often in the mainloop and do the MARKing there.

Someone should take out all the #ifdef/#ifndef SYSV stuff and
replace it with POSIX routines so that this syslogd compiles
under all modern OSes. I think that it doesn't even compile
without #define SYSV anymore, anyway.

Mike.
--- sysklogd-1.4.1/syslogd.c.ORIG   2005-03-25 22:04:07.360493000 +0100
+++ sysklogd-1.4.1/syslogd.c2005-03-26 13:28:17.650819246 +0100
@@ -1558,11 +1558,21 @@
int fac, prilev, lognum;
int msglen;
char *timestamp;
+#ifdef linux
+   sigset_t mask, omask;
+#endif
 
dprintf(logmsg: %s, flags %x, from %s, msg %s\n, textpri(pri), flags, 
from, msg);
 
-#ifndef SYSV
+#ifdef linux /* POSIX, really. */
+   sigemptyset(mask);
+   sigaddset(mask, SIGHUP);
+   sigaddset(mask, SIGALRM);
+   sigprocmask(SIG_BLOCK, mask, omask);
+#else
+#  ifndef SYSV
omask = sigblock(sigmask(SIGHUP)|sigmask(SIGALRM));
+#  endif
 #endif
 
/*
@@ -1603,8 +1613,12 @@
(void) close(f-f_file);
f-f_file = -1;
}
-#ifndef SYSV
+#ifdef linux
+   sigprocmask(SIG_SETMASK, omask, NULL);
+#else
+#  ifndef SYSV
(void) sigsetmask(omask);
+#  endif
 #endif
return;
}
@@ -1668,8 +1682,12 @@
}
}
}
-#ifndef SYSV
+#ifdef linux
+   sigprocmask(SIG_SETMASK, omask, NULL);
+#else
+#  ifndef SYSV
(void) sigsetmask(omask);
+#  endif
 #endif
 }
 #if FALSE