syslogd stops logging - caught in the act

2000-03-25 Thread Sue Blake

Let's solve this once and for all.

I've run syslogd -d with its output sent to a file and waited for the
inevitable cessation of logging while syslogd is still running.
(Refer to PRs 2191 5548 6216 8847 8865 10553 and two or three threads
in -isp and/or -questions earlier this year that summarised the
problems and their scope but didn't reach the list archives.)

Now logging's stopped and I need to get it restarted again soon, but I'd
like to collect some useful information first. I need help to do that.

This has been reported for almost all -release and -stable versions since
early 2.2, and it's been hard to pin down what circumstances cause it
or to repeat it on unaffected machines.

The common facts are that syslogd is running and using CPU, but nothing
goes to the logs: no mark messages, no logger messages, nothing. One
exception: the logs dutifully rotate and log that they have rotated.
Sending a SIGHUP does not fix it; only completely killing and
restarting syslogd gets it going. Unless this is done, it will continue
with the same behaviour (running but not logging) until reboot.
All past speculation as to the cause has been met with counterexamples.

There are five FreeBSD machines that exhibit this problem, and I only
have access to them for another couple of days, so if anyone is
interested in solving this long-standing failure of syslogd please take
this opportunity to work with me on it.

These machines range from almost idle, very vanilla 3.3R workstations
with only sendmail running, up to 3.4-STABLE of January running many
daemons under reasonable load, for which reliable logging is
critical.

Replies to my email address would be appreciated.

-- 

Regards,
-*Sue*-
 
 


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message



Re: syslogd stops logging - caught in the act

2000-03-25 Thread Jonathan Lemon

I asked Sue to get a ktrace of the syslogd, and here's the
output:

 18869 syslogd  954045445.977145 PSIG  SIGALRM caught handler=0x804b068 mask=0x0 code=0x0
 18869 syslogd  954045445.977343 RET   poll -1 errno 4 Interrupted system call
 18869 syslogd  954045445.977366 CALL  gettimeofday(0xbfbfc5f0,0)
 18869 syslogd  954045445.977382 RET   gettimeofday 0
 18869 syslogd  954045445.977403 CALL  setitimer(0,0xbfbfc5e8,0xbfbfc5d8)
 18869 syslogd  954045445.977424 RET   setitimer 0
 18869 syslogd  954045445.977438 CALL  old.sigreturn(0xbfbfc624)
 18869 syslogd  954045445.977456 RET   old.sigreturn JUSTRETURN
 18869 syslogd  954045445.977476 CALL  poll(0xbfbfc6f0,0x1,0x9c40)
 18869 syslogd  954045475.987785 PSIG  SIGALRM caught handler=0x804b068 mask=0x0 code=0x0
 18869 syslogd  954045475.987859 RET   poll -1 errno 4 Interrupted system call
 18869 syslogd  954045475.987879 CALL  gettimeofday(0xbfbfc5f0,0)
 18869 syslogd  954045475.987895 RET   gettimeofday 0
 18869 syslogd  954045475.987917 CALL  setitimer(0,0xbfbfc5e8,0xbfbfc5d8)
 18869 syslogd  954045475.987938 RET   setitimer 0
 18869 syslogd  954045475.987952 CALL  old.sigreturn(0xbfbfc624)
 18869 syslogd  954045475.987969 RET   old.sigreturn JUSTRETURN
 18869 syslogd  954045475.987990 CALL  poll(0xbfbfc6f0,0x1,0x9c40)
 18869 syslogd  954045505.997954 PSIG  SIGALRM caught handler=0x804b068 mask=0x0 code=0x0
 18869 syslogd  954045505.998120 RET   poll -1 errno 4 Interrupted system call


The poll() calls are from libc/net/res_send, while the gettimeofday()
calls are from the alarm handler (in syslogd).  The res_send code does
roughly the following:

    msec = (timeout calculated based on # of tries)
  repeat:
    poll(pfd, 1, msec);
    if (errno == EINTR)
        goto repeat;

So what seems to be happening is that once the # of tries grows to a
certain point, the timeout passed to poll() is larger than the interval
between calls to the SIGALRM handler.  Since the poll() timeout is not
reduced after an interruption, poll() is restarted with the full timeout
every time and never gets to expire, which leads to an infinite loop.

In the traces above, the poll() timeout is 0x9c40 = 40000 msec (== 40 sec),
and the alarm handler is called every 30 sec.
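
To make the arithmetic concrete, here is a minimal C model of the retry
loop described above. It is only a sketch, not the res_send.c source: the
function name restarts_before_timeout and all of its parameters are made
up for illustration, and it assumes the alarm fires at a fixed interval
measured from each restart of poll(), as the trace suggests.

```c
/*
 * Model of the retry loop (illustrative only; the name and parameters
 * are hypothetical, not taken from res_send.c).
 *
 * poll_ms:      timeout passed to poll()
 * alarm_ms:     interval between SIGALRM deliveries
 * decrement:    nonzero to subtract the elapsed time after each EINTR
 * max_restarts: stop and report "never" after this many restarts
 *
 * Returns the number of restarts before poll() finally times out,
 * or -1 if it has still not timed out after max_restarts restarts.
 */
int restarts_before_timeout(long poll_ms, long alarm_ms,
                            int decrement, int max_restarts)
{
    long remaining = poll_ms;
    int restarts = 0;

    while (restarts <= max_restarts) {
        if (remaining < alarm_ms)
            return restarts;        /* timeout expires before the next alarm */
        /* SIGALRM arrives first, so poll() fails with EINTR */
        if (decrement)
            remaining -= alarm_ms;  /* only a corrected loop does this */
        restarts++;
    }
    return -1;
}
```

With the values from the trace, restarts_before_timeout(0x9c40, 30000, 0, 1000)
gives -1 (the wait never completes), while the same call with decrement
set to 1 completes after a single restart.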

The fix should probably be to change res_send.c so that it properly
decrements its timeout value after being interrupted.
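
One possible shape for such a fix, sketched in C. This is not a patch
against res_send.c: poll_with_deadline and now_ms are hypothetical helper
names, and the sketch assumes a POSIX system that provides
clock_gettime(CLOCK_MONOTONIC). The idea is to fix a deadline once and,
after every EINTR, restart poll() with only the time that is left.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current time in milliseconds on the monotonic clock. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: wait up to timeout_ms for events on fds.
 * If a signal interrupts poll(), restart it with the remaining time
 * instead of the full timeout, so periodic signals cannot keep the
 * wait alive forever.  Returns what poll() returns; 0 means timeout.
 */
int poll_with_deadline(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        long long left = deadline - now_ms();
        int n;

        if (left < 0)
            left = 0;
        n = poll(fds, nfds, (int)left);
        if (n >= 0 || errno != EINTR)
            return n;       /* events, timeout, or a real error */
        if (left == 0)
            return 0;       /* budget already used up: report timeout */
    }
}
```

With a 40 sec timeout and an alarm every 30 sec, the second poll() call
in this sketch would be given roughly 10 sec and would then time out
normally.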
--
Jonathan





Re: syslogd stops logging - caught in the act

2000-03-26 Thread Thomas Stromberg

My roommate and I saw a similar thing occurring when developing a Windows
NT EventLog -> syslogd forwarder (http://www.schizo.com/software/sislog/)
on an older 4.0-CURRENT machine and a 2.2.8-RELEASE machine.

We concluded that it appears to happen when the host sending the syslog
messages is unresolvable (in our case, the DNS server could not be
contacted); logging would then stop for us. I'm not sure about any other
situations.

This drove me nuts for quite a while, but since we have no home
connectivity I forgot about submitting a GNATS report later. Perhaps a
good thing to check...

-
> Thomas R. Stromberg  Senior Systems Administrator :
> smtp[[EMAIL PROTECTED]]Research Triangle Commerce, Inc. :
> http[afterthought.org]   pots[1.919.657.1317] :
> irc[helixblue]   FreeBSD Contributor, Perl Hacker :
-







Re: syslogd stops logging - caught in the act

2000-03-27 Thread Sue Blake

On Sun, Mar 26, 2000 at 03:25:20PM -0500, Thomas Stromberg wrote:
> My roommate and I saw a similar thing occurring when developing a Windows
> NT EventLog -> syslogd forwarder (http://www.schizo.com/software/sislog/)
> on an older 4.0-CURRENT machine and a 2.2.8-RELEASE machine.
> 
> We concluded that it appears to happen when the host sending the syslog
> messages is unresolvable (in our case, the DNS server could not be
> contacted); logging would then stop for us. I'm not sure about any other
> situations.
> 
> This drove me nuts for quite a while, but since we have no home
> connectivity I forgot about submitting a GNATS report later. Perhaps a
> good thing to check...

Yes, this seems to be exactly what is happening, but it also affects
machines which only log for and to themselves. The DNS angle explains
why at least five machines in the one building show the problem all the
time, while few others have seen it at all, and then only temporarily.

The primary name server for the affected machines is frequently
rebooted to freshen it up a bit (sic). As I'm leaving tomorrow, I have
left instructions that anyone who claims to have a valid reason for
rebooting the nameserver is obliged to kill and restart syslogd on up
to ten other machines immediately afterwards :-)

Still, it would be real nice to have a suitproof syslogd one day.
It looks like that won't be so far off, now that we know the cause.

Thanks to all who helped and encouraged.

-- 

Regards,
-*Sue*-
 
 

