Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
It didn't; it kept climbing ...

--On Tuesday, May 15, 2007 21:39:35 +0200 Ulrich Spoerlein <[EMAIL PROTECTED]> wrote:

> I'm slowly catching up on FreeBSD related mails and found this mail ...
>
> Marc G. Fournier wrote:
>> > > kern.ipc.numopensockets: 7400
>> > > kern.ipc.maxsockets: 12328
>> > >
>> > > ps looks like:
>> > >
>> > > 2368 p2 Is+ Sat01PM 0:00.03 /bin/tcsh
>> > > root 2112 0.0 0.1 5220 2360 p3 Ss+ Sat01PM 0:00.04 /bin/tcsh
>> > > root 91221 0.0 0.1 5140 2440 p4 Ss+ 11:49PM 0:00.12 -tcsh (tcsh)
>> >
>> > I don't think those processes should consume 7400 sockets.
>> > Indeed, this really looks like a leak in the kernel.
>>
>> Robert has sent me a suggestion to try that I'm in the process of putting
>> together right now, involving backing out some work on uipc_usrreq.c ...
>
> How did the backing out work for you?
>
> Ulrich Spoerlein
> --
> "The trouble with the dictionary is you have to know how the word is
> spelled before you can look it up to see how it is spelled."
> -- Will Cuppy

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED]
Yahoo . yscrappy Skype: hub.org ICQ . 7615664

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
On Tue, May 15, 2007 at 09:39:35PM +0200, Ulrich Spoerlein wrote:
> How did the backing out work for you?

Taken from another mail from Marc, since there are now multiple threads discussing this:

>> Did we determine whether backing out to before the unpcb socket
>> reference count change made any difference for you?
>
> The problem appeared to persist after backing it out ...

--
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
I'm slowly catching up on FreeBSD related mails and found this mail ...

Marc G. Fournier wrote:
> > > kern.ipc.numopensockets: 7400
> > > kern.ipc.maxsockets: 12328
> > >
> > > ps looks like:
> > >
> > > 2368 p2 Is+ Sat01PM 0:00.03 /bin/tcsh
> > > root 2112 0.0 0.1 5220 2360 p3 Ss+ Sat01PM 0:00.04 /bin/tcsh
> > > root 91221 0.0 0.1 5140 2440 p4 Ss+ 11:49PM 0:00.12 -tcsh (tcsh)
> >
> > I don't think those processes should consume 7400 sockets.
> > Indeed, this really looks like a leak in the kernel.
>
> Robert has sent me a suggestion to try that I'm in the process of putting
> together right now, involving backing out some work on uipc_usrreq.c ...

How did the backing out work for you?

Ulrich Spoerlein
--
"The trouble with the dictionary is you have to know how the word is
spelled before you can look it up to see how it is spelled."
-- Will Cuppy
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
--On Tuesday, May 08, 2007 15:14:29 +0200 Oliver Fromme <[EMAIL PROTECTED]> wrote:

> What kind of jails are those? What applications are
> running inside them? It's quite possible that the
> processes on one machine use 120 sockets per jail,
> while on a different machine they use only half that
> many per jail, on average. Of course, I can't tell
> for sure without knowing what is running in those
> jails.

They all run pretty much the same thing, on all the machines ... by default, standard syslog, sshd, cron, cyrus imapd, postfix and apache ... some run aolserver over top of that, or jdk/tomcat, or zope ... but they aren't specific to the server itself, as they get moved around ...

> > kern.ipc.numopensockets: 7400
> > kern.ipc.maxsockets: 12328
> >
> > ps looks like:
> >
> > 2368 p2 Is+ Sat01PM 0:00.03 /bin/tcsh
> > root 2112 0.0 0.1 5220 2360 p3 Ss+ Sat01PM 0:00.04 /bin/tcsh
> > root 91221 0.0 0.1 5140 2440 p4 Ss+ 11:49PM 0:00.12 -tcsh (tcsh)
>
> I don't think those processes should consume 7400 sockets.
> Indeed, this really looks like a leak in the kernel.

Robert has sent me a suggestion to try that I'm in the process of putting together right now, involving backing out some work on uipc_usrreq.c ...

> Maybe "sockstat -u" and/or "fstat | grep -w local" (both
> of those commands should basically list the same kind of
> information). My guess is that the output will be rather
> short, i.e. much shorter than 7355 lines. If that's true,
> it is another indication that the problem is caused by
> a kernel leak.

At the time I rebooted, with no processes, but 7400 sockets:

> wc -l sockstat.out.txt
12 sockstat.out.txt
> grep local fstat.out.txt | wc -l
7

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
Marc G. Fournier wrote:
> Oliver Fromme wrote:
> > If I remember correctly, you wrote that 11k sockets are
> > in use with 90 jails. That's about 120 sockets per jail,
> > which isn't out of the ordinary. Of course it depends on
> > what is running in those jails, but my guess is that you
> > just need to increase the limit on the number of sockets
> > (i.e. kern.ipc.maxsockets).
>
> The problem is that if I compare it to another server, running 2/3 as
> many jails, I'm finding it's using 1/4 as many sockets, after over 60
> days of uptime:
>
> kern.ipc.numopensockets: 3929
> kern.ipc.maxsockets: 12328

What kind of jails are those? What applications are running inside them? It's quite possible that the processes on one machine use 120 sockets per jail, while on a different machine they use only half that many per jail, on average. Of course, I can't tell for sure without knowing what is running in those jails.

> But, let's try what I think it was Matt suggested ...

Yes, that was a good suggestion.

> right now, I'm at just over 11k sockets on that machine, so I'm going
> to shutdown everything except bare minimum server (all jails shut
> off) and see where sockets drop to after that ...
>
> I'm down to ~7400 sockets:
>
> kern.ipc.numopensockets: 7400
> kern.ipc.maxsockets: 12328
>
> ps looks like:
>
> mars# ps aux
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> [kernel threads omitted]
> root 1 0.0 0.0 768 232 ?? ILs Sat12PM 3:22.01 /sbin/init --
> root 480 0.0 0.0 528 244 ?? Is Sat12PM 0:04.32 /sbin/devd
> root 539 0.0 0.0 1388 848 ?? Ss Sat12PM 0:07.21 /usr/sbin/syslogd -l /var/run/log -l /var/named/var/run/log -s -s
> daemon 708 0.0 0.0 1316 748 ?? Ss Sat12PM 0:02.49 /usr/sbin/rwhod
> root 749 0.0 0.0 3532 1824 ?? Is Sat12PM 0:07.60 /usr/sbin/sshd
> root 768 0.0 0.0 1412 920 ?? Is Sat12PM 0:02.23 /usr/sbin/cron -s
> root 2087 0.0 0.0 2132 1360 ?? Ss Sat01PM 0:04.73 screen -R
> root 88103 0.0 0.1 6276 2600 ?? Ss 11:41PM 0:00.62 sshd: [EMAIL PROTECTED] (sshd)
> root 91218 0.0 0.1 6276 2664 ?? Ss 11:49PM 0:00.24 sshd: [EMAIL PROTECTED] (sshd)
> root 813 0.0 0.0 1352 748 v0 Is+ Sat12PM 0:00.00 /usr/libexec/getty Pc ttyv0
> root 88106 0.0 0.1 5160 2516 p0 Ss 11:41PM 0:00.20 -tcsh (tcsh)
> root 97563 0.0 0.0 1468 804 p0 R+ 12:17AM 0:00.00 ps aux
> root 2088 0.0 0.1 5352 2368 p2 Is+ Sat01PM 0:00.03 /bin/tcsh
> root 2112 0.0 0.1 5220 2360 p3 Ss+ Sat01PM 0:00.04 /bin/tcsh
> root 91221 0.0 0.1 5140 2440 p4 Ss+ 11:49PM 0:00.12 -tcsh (tcsh)

I don't think those processes should consume 7400 sockets. Indeed, this really looks like a leak in the kernel.

> And netstat -n -funix shows 7355 lines similar to:
>
> d05f1000 stream 0 0 0 d05f1090 0 0
> d05f1090 stream 0 0 0 d05f1000 0 0
> cf1be000 stream 0 0 0 cf1bdea0 0 0
> cf1bdea0 stream 0 0 0 cf1be000 0 0
> cec42bd0 stream 0 0 0 cf2ac480 0 0
> cf2ac480 stream 0 0 0 cec42bd0 0 0
>
> with the final few associated with running processes:

How do you determine that? You _cannot_ tell from netstat which sockets are associated with running processes.

> I'm willing to shut everything down like this again the next time it happens
> (in 2-3 days) if someone has some other command / output they'd like for me
> to provide the output of?

Maybe "sockstat -u" and/or "fstat | grep -w local" (both of those commands should basically list the same kind of information). My guess is that the output will be rather short, i.e. much shorter than 7355 lines. If that's true, it is another indication that the problem is caused by a kernel leak.

> And, I have the following outputs as of the above, where everything is
> shutdown and it's running on minimal processes:
>
> # ls -lt
> total 532
> -rw-r--r-- 1 root wheel 11142 May 8 00:20 fstat.out
> -rw-r--r-- 1 root wheel 742 May 8 00:20 netstat_m.out
> -rw-r--r-- 1 root wheel 486047 May 8 00:20 netstat_na.out
> -rw-r--r-- 1 root wheel 735 May 8 00:20 sockstat.out

^^^ Aha. :-)

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht München,
HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"C++ is the only current language making COBOL look good."
-- Bertrand Meyer
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
On Tue, 8 May 2007, Marc G. Fournier wrote:

> So, over 7000 sockets with pretty much all processes shut down ...
> Shouldn't the garbage collector be cutting in somewhere here?
>
> I'm willing to shut everything down like this again the next time it happens
> (in 2-3 days) if someone has some other command / output they'd like for me
> to provide the output of?
>
> And, I have the following outputs as of the above, where everything is
> shutdown and it's running on minimal processes:

I think there may be a bug in the MFC of the UNIX domain socket reference count changes in RELENG_6:

revision 1.155.2.8
date: 2007/01/12 16:24:23; author: jhb; state: Exp; lines: +36 -7
MFC: Close a race between enumerating UNIX domain socket pcb structures
via sysctl and socket teardown. Note that we engage in a bit of trickery
to preserve the ABI of 'struct unpcb' in 6.x. We change the UMA zone to
hold a 'struct unpcb_wrapper' which holds a 6.x 'struct unpcb' followed by
the new reference count needed for handling the race. We then cast
'struct unpcb' pointers to 'struct unpcb_wrapper' pointers when we need to
access the reference count.

Submitted by: ups (including the ABI trickery)

Could you try backing this out locally and see if the problem goes away? I've forwarded the information you sent to me previously to Stephan so he can take a look.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
--On Monday, May 07, 2007 19:01:02 +0200 Oliver Fromme <[EMAIL PROTECTED]> wrote:

> If I remember correctly, you wrote that 11k sockets are
> in use with 90 jails. That's about 120 sockets per jail,
> which isn't out of the ordinary. Of course it depends on
> what is running in those jails, but my guess is that you
> just need to increase the limit on the number of sockets
> (i.e. kern.ipc.maxsockets).

The problem is that if I compare it to another server, running 2/3 as many jails, I'm finding it's using 1/4 as many sockets, after over 60 days of uptime:

kern.ipc.numopensockets: 3929
kern.ipc.maxsockets: 12328

But, let's try what I think it was Matt suggested ... right now, I'm at just over 11k sockets on that machine, so I'm going to shutdown everything except bare minimum server (all jails shut off) and see where sockets drop to after that ...

I'm down to ~7400 sockets:

kern.ipc.numopensockets: 7400
kern.ipc.maxsockets: 12328

ps looks like:

mars# ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 10 99.0 0.0 0 8 ?? RL Sat12PM 2527:55.02 [idle: cpu1]
root 11 99.0 0.0 0 8 ?? RL Sat12PM 2816:58.21 [idle: cpu0]
root 0 0.0 0.0 0 0 ?? WLs Sat12PM 0:00.00 [swapper]
root 1 0.0 0.0 768 232 ?? ILs Sat12PM 3:22.01 /sbin/init --
root 2 0.0 0.0 0 8 ?? DL Sat12PM 0:31.14 [g_event]
root 3 0.0 0.0 0 8 ?? DL Sat12PM 12:02.57 [g_up]
root 4 0.0 0.0 0 8 ?? DL Sat12PM 17:20.73 [g_down]
root 5 0.0 0.0 0 8 ?? DL Sat12PM 0:00.35 [thread taskq]
root 6 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [xpt_thrd]
root 7 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [kqueue taskq]
root 8 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [aic_recovery0]
root 9 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [aic_recovery0]
root 12 0.0 0.0 0 8 ?? WL Sat12PM 12:11.84 [swi1: net]
root 13 0.0 0.0 0 8 ?? WL Sat12PM 15:31.57 [swi4: clock]
root 14 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [swi3: vm]
root 15 0.0 0.0 0 8 ?? DL Sat12PM 1:10.54 [yarrow]
root 16 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [swi6: task queue]
root 17 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [swi6: Giant taskq]
root 18 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [swi5: +]
root 19 0.0 0.0 0 8 ?? WL Sat12PM 11:50.45 [swi2: cambio]
root 20 0.0 0.0 0 8 ?? WL Sat12PM 8:28.94 [irq20: fxp0]
root 21 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [irq21: fxp1]
root 22 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [irq25: ahc0]
root 23 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [aic_recovery1]
root 24 0.0 0.0 0 8 ?? WL Sat12PM 7:53.11 [irq26: ahc1]
root 25 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [aic_recovery1]
root 26 0.0 0.0 0 8 ?? WL Sat12PM 0:00.00 [irq1: atkbd0]
root 27 0.0 0.0 0 8 ?? DL Sat12PM 0:32.19 [pagedaemon]
root 28 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [vmdaemon]
root 29 0.0 0.0 0 8 ?? DL Sat12PM 38:04.73 [pagezero]
root 30 0.0 0.0 0 8 ?? DL Sat12PM 0:30.43 [bufdaemon]
root 31 0.0 0.0 0 8 ?? DL Sat12PM 11:38.76 [syncer]
root 32 0.0 0.0 0 8 ?? DL Sat12PM 0:57.76 [vnlru]
root 33 0.0 0.0 0 8 ?? DL Sat12PM 1:21.24 [softdepflush]
root 34 0.0 0.0 0 8 ?? DL Sat12PM 6:00.16 [schedcpu]
root 35 0.0 0.0 0 8 ?? DL Sat12PM 6:26.10 [g_mirror md1]
root 36 0.0 0.0 0 8 ?? DL Sat12PM 6:10.56 [g_mirror md2]
root 37 0.0 0.0 0 8 ?? DL Sat12PM 0:00.00 [g_mirror vm]
root 480 0.0 0.0 528 244 ?? Is Sat12PM 0:04.32 /sbin/devd
root 539 0.0 0.0 1388 848 ?? Ss Sat12PM 0:07.21 /usr/sbin/syslogd -l /var/run/log -l /var/named/var/run/log -s -s
daemon 708 0.0 0.0 1316 748 ?? Ss Sat12PM 0:02.49 /usr/sbin/rwhod
root 749 0.0 0.0 3532 1824 ?? Is Sat12PM 0:07.60 /usr/sbin/sshd
root 768 0.0 0.0 1412 920 ?? Is Sat12PM 0:02.23 /usr/sbin/cron -s
root 2087 0.0 0.0 2132 1360 ?? Ss Sat01PM 0:04.73 screen -R
root 88103 0.0 0.1 6276 2600 ?? Ss 11:41PM 0:00.62 sshd: [EMAIL PROTECTED] (sshd)
root 91218 0.0 0.1 6276 2664 ?? Ss 11:49PM 0:00.24 sshd: [EMAIL PROTECTED] (sshd)
root 813 0.0 0.0 1352 748 v0 Is+ Sat12PM 0:00.00 /usr/libexec/getty Pc ttyv0
root 88106 0.0 0.1 5160 2516 p0 Ss 11:41PM 0:00.20 -tcsh (tcsh)
root 97563 0.0 0.0 1468 804 p0 R+ 12:17AM 0:00.00 ps aux
root 2088 0.0 0.1 5352 2368 p2 Is+ Sat01PM
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
On Mon, May 07, 2007 at 07:01:02PM +0200, Oliver Fromme wrote:
> Marc G. Fournier wrote:
> > Now, that makes sense to me, I can understand that ... but, how would
> > that look as far as netstat -nA shows? Or, would it? For example, I
> > have:
>
> You should use "-na" to list all sockets, not "-nA".
>
> > mars# netstat -nA | grep c9655a20
> > c9655a20 stream 0 0 0 c95d63f0 0 0
> > c95d63f0 stream 0 0 0 c9655a20 0 0
> > mars# netstat -nA | grep c95d63f0
> > c9655a20 stream 0 0 0 c95d63f0 0 0
> > c95d63f0 stream 0 0 0 c9655a20 0 0
> >
> > They are attached to each other, but there appears to be no 'referencing
> > process'
>
> netstat doesn't show processes at all (sockstat, fstat
> and lsof list sockets by processes). The sockets above
> are probably from a socketpair(2) or a pipe (which is
> implemented with socketpair(2), AFAIK). That's perfectly
> normal.
>
> If I remember correctly, you wrote that 11k sockets are
> in use with 90 jails. That's about 120 sockets per jail,
> which isn't out of the ordinary. Of course it depends on
> what is running in those jails, but my guess is that you
> just need to increase the limit on the number of sockets
> (i.e. kern.ipc.maxsockets).

Yes, and if you have 11000 sockets in use under "normal" situations then you're likely to be pressing right up against the default limit anyway (e.g. on this machine with 8GB of RAM the default is 12328), so a slight increase in load will run out of space.

Kris
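
For a rough sense of the headroom being discussed in this message, the arithmetic can be sketched in Python. The figures are the approximate ones quoted in the thread (about 11k sockets, 90 jails, a limit of 12328); the variable names are only illustrative, and "just over 11k" is rounded down to 11000.

```python
# Figures quoted in the thread; "just over 11k" is rounded to 11000 here.
sockets_in_use = 11000
jails = 90
max_sockets = 12328  # kern.ipc.maxsockets as reported in the thread

per_jail = sockets_in_use / jails        # average sockets per jail
headroom = max_sockets - sockets_in_use  # sockets left before the limit
jails_of_headroom = headroom / per_jail  # headroom expressed in "average jails"

print(f"{per_jail:.0f} sockets per jail on average")
print(f"{headroom} sockets of headroom (~{jails_of_headroom:.1f} average jails' worth)")
```

Under these assumptions the remaining headroom is on the order of a tenth of the current usage, which is consistent with the remark that a slight increase in load could exhaust it.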
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
Marc G. Fournier wrote:
> Now, that makes sense to me, I can understand that ... but, how would
> that look as far as netstat -nA shows? Or, would it? For example, I
> have:

You should use "-na" to list all sockets, not "-nA".

> mars# netstat -nA | grep c9655a20
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
> mars# netstat -nA | grep c95d63f0
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
>
> They are attached to each other, but there appears to be no 'referencing
> process'

netstat doesn't show processes at all (sockstat, fstat and lsof list sockets by processes). The sockets above are probably from a socketpair(2) or a pipe (which is implemented with socketpair(2), AFAIK). That's perfectly normal.

If I remember correctly, you wrote that 11k sockets are in use with 90 jails. That's about 120 sockets per jail, which isn't out of the ordinary. Of course it depends on what is running in those jails, but my guess is that you just need to increase the limit on the number of sockets (i.e. kern.ipc.maxsockets).

> Again, if I'm reading / understanding things right, without the 'referencing
> process', it won't show up in sockstat -u, which is why my netstat -nA
> numbers keep growing, but sockstat -u numbers don't ... which also means
> that there is no way to figure out what process / program is leaving
> 'dangling sockets'? :(

Be careful here, sockstat's output is process-based and lists sockets multiple times. For example, the server sockets that httpd children inherit from their parent are listed for every single child, while you see it only once in the netstat output. On the other hand, sockstat doesn't show sockets that have been closed and are in TIME_WAIT state or similar.

Are you sure that UNIX domain sockets are causing the problem? Can you rule out other sockets (e.g. tcp)? In that case you should run "netstat -funix" to list only UNIX domain sockets (basically the same as the -u option to sockstat).

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

$ dd if=/dev/urandom of=test.pl count=1
$ file test.pl
test.pl: perl script text executable
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
--On Friday, May 04, 2007 12:05:11 +0100 Robert Watson <[EMAIL PROTECTED]> wrote:

> I think we should be careful to avoid prematurely drawing conclusions about
> the source of the problem. First question: have you confirmed that the
> resource limit on sockets is definitely what is causing the error you're
> seeing? I.e., does the number of sockets hit the maximum sockets?

'k, so, based on your other email this morning, about sockstat | stream, I'm now keeping an eye on:

# uptime ; netstat -nA | grep -c stream ; sockstat -u | grep -c stream ; sysctl kern.ipc.numopensockets ; sysctl kern.ipc.maxsockets
8:59AM up 1 day, 9:57, 7 users, load averages: 1.63, 4.92, 5.12
6877
2323
kern.ipc.numopensockets: 8463
kern.ipc.maxsockets: 12328

I'm at least 24 hours out from the error(s) starting to happen ...

> Second point: there are two kinds of resource leaks that seem likely
> candidates for a socket resource exhaustion problem. First, kernel bugs, in
> which the kernel maintains objects despite there being no application
> references, and second, application reference leaks, in which applications
> keep references to kernel objects despite no longer needing them. Our
> immediate goal is to determine which of these is the case: is it a kernel
> bug, or an application bug? Using tools like netstat and sockstat, we can
> try and determine if all kernel sockets are properly referenced. Experience
> suggests that it is an application bug, but we shouldn't rule out a kernel
> bug; the good news is that the tools to use in the debugging process are
> identical at this stage.

'k, in preparation for it starting, so that I can reboot as quickly as possible, but get max information ... do I just want to save the output of 'sockstat -u' and 'netstat -nA', or is there something else that will be useful?

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
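
The numbers printed by the one-liner in this message can be turned into a small summary. The following Python sketch is only an illustration: the function names `parse_sysctl` and `summarize` are made up here, and the input values are the ones reported above. Because sockstat can list the same socket for several processes, the "gap" it computes is a rough signal, not an exact count of unreferenced sockets.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines (as printed by sysctl) into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            values[name.strip()] = int(value)
    return values


def summarize(netstat_streams, sockstat_streams, sysctl_text):
    """Combine the two stream counts with the sysctl socket counters."""
    ctl = parse_sysctl(sysctl_text)
    used = ctl["kern.ipc.numopensockets"]
    limit = ctl["kern.ipc.maxsockets"]
    return {
        # Stream sockets seen by netstat but not listed by sockstat.
        "stream_gap": netstat_streams - sockstat_streams,
        "percent_of_limit": 100.0 * used / limit,
        "remaining": limit - used,
    }


# Values copied from the output quoted in the message.
sample = """kern.ipc.numopensockets: 8463
kern.ipc.maxsockets: 12328"""
summary = summarize(6877, 2323, sample)
print(summary)
```

Logging such a summary at regular intervals would show whether the gap keeps widening while the sockstat count stays flat.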
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
On Thu, 3 May 2007, Marc G. Fournier wrote:

> I'm trying to probe this as well as I can, but network stacks and sockets
> have never been my strong suit ...
>
> Robert had mentioned in one of his emails about a "Sockets can also exist
> without any referencing process (if the application closes, but there is
> still data draining on an open socket)."
>
> Now, that makes sense to me, I can understand that ... but, how would that
> look as far as netstat -nA shows? Or, would it? For example, I have:
>
> mars# netstat -nA | grep c9655a20
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
> mars# netstat -nA | grep c95d63f0
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
>
> They are attached to each other, but there appears to be no 'referencing
> process' ... it is now 10pm at night ... I saved a 'snapshot' of netstat -nA
> output at 6:45pm, over 3 hours ago, and it has the same entries as above:
>
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
>
> again, if I'm reading this right, there is no 'referencing process' ...
> first, of course, am I reading this right? second ... if I am reading this
> right, and, if I am understanding what Robert was saying about 'draining'
> (a lot of ifs, I know) ... isn't it odd for it to take >3 hours to drain?
>
> Again, if I'm reading / understanding things right, without the 'referencing
> process', it won't show up in sockstat -u, which is why my netstat -nA
> numbers keep growing, but sockstat -u numbers don't ... which also means
> that there is no way to figure out what process / program is leaving
> 'dangling sockets'? :(

I think we should be careful to avoid prematurely drawing conclusions about the source of the problem. First question: have you confirmed that the resource limit on sockets is definitely what is causing the error you're seeing? I.e., does the number of sockets hit the maximum sockets?

Second point: there are two kinds of resource leaks that seem likely candidates for a socket resource exhaustion problem. First, kernel bugs, in which the kernel maintains objects despite there being no application references, and second, application reference leaks, in which applications keep references to kernel objects despite no longer needing them. Our immediate goal is to determine which of these is the case: is it a kernel bug, or an application bug? Using tools like netstat and sockstat, we can try and determine if all kernel sockets are properly referenced. Experience suggests that it is an application bug, but we shouldn't rule out a kernel bug; the good news is that the tools to use in the debugging process are identical at this stage.

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
On Thu, 3 May 2007, Marc G. Fournier wrote:

> Robert had mentioned in one of his emails about a "Sockets can also exist
> without any referencing process (if the application closes, but there is
> still data draining on an open socket)."
[..]
> Again, if I'm reading / understanding things right, without the 'referencing
> process', it won't show up in sockstat -u, which is why my netstat -nA
> numbers keep growing, but sockstat -u numbers don't ... which also means
> that there is no way to figure out what process / program is leaving
> 'dangling sockets'? :(

Marc, I don't know if it may provide any more clues in this instance, but lsof -U also shows unix domain sockets with pid, command and fd.

Cheers, Ian
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
:*groan* why couldn't this be happening on a server that I have better remote
:access to? :(
:
:But, based on your explanation(s) above ... if I kill off all of the jail(s) on
:the machine, so that there are minimal processes running, shouldn't I see a
:significant drop in the number of sockets in use as well? or is there
:something special about single user mode vs just killing off all 'extra
:processes'?
:
:-
:Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)

Yes, you can. Nothing special about single user... just kill all the processes that might be using sockets. Killing the jails is a good start.

If you are running a lot of jails then I would strongly suspect that there is an issue with file descriptor passing over unix domain sockets. In particular, web servers, databases, and java or other applets could be the culprit.

Other possibilities... you could just be running out of file descriptors in the file descriptor table. Use vmstat -m and vmstat -z too... find out what allocates the socket memory and see what it reports. Check your mbuf allocation statistics too (netstat -m). Damn, I wish that information were collected on a per-jail basis but I don't think it is.

Look at all the memory statistics and check to see if anything is growing unbounded over a long period of time (versus just growing into a cache balance). Create a cron job that dumps memory statistics once a minute to a file, then break each report with a clear-screen sequence and cat it in a really big xterm window.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
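
The "growing unbounded versus growing into a cache balance" check suggested in this message can be sketched in Python once the per-minute dumps have been reduced to one counter value per statistic. Everything in this sketch is an assumption for illustration: the function name `steadily_growing`, the growth threshold, and the sample counter names and values are made up and are not taken from real vmstat or netstat output.

```python
def steadily_growing(samples, min_growth=1.5):
    """Return names whose value never decreases and ends at >= min_growth x its start.

    samples: list of {name: int} dicts in time order, one per dump.
    """
    if not samples:
        return []
    # Only consider counters present in every sample.
    names = set(samples[0])
    for sample in samples[1:]:
        names &= set(sample)

    flagged = []
    for name in sorted(names):
        series = [sample[name] for sample in samples]
        never_drops = all(b >= a for a, b in zip(series, series[1:]))
        if never_drops and series[0] > 0 and series[-1] >= min_growth * series[0]:
            flagged.append(name)
    return flagged


# Hypothetical counters: one keeps climbing, the other settles.
samples = [
    {"unpcb": 3000, "mbuf": 500},
    {"unpcb": 4500, "mbuf": 800},
    {"unpcb": 6000, "mbuf": 600},
    {"unpcb": 7400, "mbuf": 700},
]
print(steadily_growing(samples))
```

A real leak may still dip now and then, so a stricter or looser rule (for example, comparing hourly averages) may be more useful in practice; the point is only to separate counters that level off from those that do not.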
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
--On Thursday, May 03, 2007 18:26:30 -0700 Matthew Dillon <[EMAIL PROTECTED]> wrote:

> One thing you can do is drop into single user mode... kill all the
> processes on the system, and see if the sockets are recovered. That
> will give you a good idea as to whether it is a real leak or whether
> some process is directly or indirectly (by not draining a unix domain
> socket on which other sockets are being transferred) holding onto the
> socket.

*groan* why couldn't this be happening on a server that I have better remote access to? :(

But, based on your explanation(s) above ... if I kill off all of the jail(s) on the machine, so that there are minimal processes running, shouldn't I see a significant drop in the number of sockets in use as well? or is there something special about single user mode vs just killing off all 'extra processes'?

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Re: Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
:I'm trying to probe this as well as I can, but network stacks and sockets have
:never been my strong suit ...
:
:Robert had mentioned in one of his emails about a "Sockets can also exist
:without any referencing process (if the application closes, but there is still
:data draining on an open socket)."
:
:Now, that makes sense to me, I can understand that ... but, how would that look
:as far as netstat -nA shows? Or, would it? For example, I have:
:
:...

Netstat should show any sockets, whether they are attached to processes or not. Usually you can match up the address from netstat -nA with the addresses from sockets shown by fstat to figure out what processes the sockets are attached to.

There are three situations that you have to watch out for:

(1) The socket was close()'d and is still draining. The socket will timeout and terminate within ~1-5 minutes. It will not be referenced to a descriptor or process.

(2) The socket descriptor itself has been sent over a unix domain socket from one process to another and is currently in transit. The file pointer representing the descriptor is what is actually in transit, and will not be referenced by any processes while it is in transit.

    There is a garbage collector that figures out unreferenceable loops. I think it's called unp_gc or something like that.

(3) The socket is not closed, but is idle (like having a remote shell open and never typing in it). Service processes can get stuck waiting for data on such sockets. The socket WILL be referenced by some process.

    These are controlled by net.inet.tcp.keep* and net.inet.tcp.always_keepalive. I almost universally turn on net.inet.tcp.always_keepalive to ensure that dead idle connections get cleaned out.

    Note that keepalive only applies to idle connections. A socket that has been closed and needs to drain (either data or the FIN state) will timeout and clean itself up whether keepalive is turned on or off.

netstat -nA will give you the status of all your sockets. You can observe the state of any TCP sockets.

Unix domain sockets have no state and closure is governed simply by them being dereferenced, just like a pipe. In this case there are really only two situations:

(1) One end of the unix domain socket is still referenced by a process, or

(2) The socket has been sent over another unix domain socket and is 'in transit'. The socket will remain intact until it is either no longer in transit (read out from the other unix domain socket), or the garbage collector determines that the socket the descriptor is transiting over is not externally referenceable, and will destroy it and any in-transit sockets contained within.

Any sockets that don't fall into these categories are in trouble... either a timer has failed somewhere or (if unix domain) the garbage collector has failed to detect that it is in an unreferenceable loop.

-

One thing you can do is drop into single user mode... kill all the processes on the system, and see if the sockets are recovered. That will give you a good idea as to whether it is a real leak or whether some process is directly or indirectly (by not draining a unix domain socket on which other sockets are being transferred) holding onto the socket.

-Matt
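
The matching step described in this message (addresses from netstat -nA against addresses shown by fstat) can be sketched in Python over saved command output. This is a crude sketch under stated assumptions: the netstat rows use a column layout reconstructed from the flattened rows in this thread, the fstat line is a made-up illustration rather than the tool's exact format, and the function names are hypothetical. It only assumes that the socket address appears as a hex token on fstat lines containing the word "local".

```python
import re

# A PCB address as it appears in the thread, e.g. c9655a20.
HEX = re.compile(r"\b[0-9a-f]{8,16}\b")


def unix_socket_addresses(netstat_text):
    """Return the first-column addresses of stream/dgram rows."""
    addrs = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] in ("stream", "dgram") and HEX.fullmatch(fields[0]):
            addrs.append(fields[0])
    return addrs


def referenced_addresses(fstat_text):
    """Collect every hex token found on fstat lines that mention 'local'."""
    refs = set()
    for line in fstat_text.splitlines():
        if "local" in line.split():
            refs.update(HEX.findall(line))
    return refs


def unreferenced(netstat_text, fstat_text):
    """Addresses listed by netstat that no fstat 'local' line mentions."""
    refs = referenced_addresses(fstat_text)
    return [a for a in unix_socket_addresses(netstat_text) if a not in refs]


# Rows reconstructed from the thread (column layout is an assumption).
netstat_sample = """c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0
d05f1000 stream 0 0 0 d05f1090 0 0"""
# Hypothetical fstat line, for illustration only.
fstat_sample = "root sshd 749 3* local stream d05f1000 <-> d05f1090"

print(unreferenced(netstat_sample, fstat_sample))
```

Sockets reported this way are only candidates: an address can be missing from fstat because it is in transit over another unix domain socket, which is one of the legitimate cases listed above.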
Socket leak (Was: Re: What triggers "No Buffer Space Available"?)
I'm trying to probe this as well as I can, but network stacks and sockets have never been my strong suit ...

Robert had mentioned in one of his emails about a "Sockets can also exist without any referencing process (if the application closes, but there is still data draining on an open socket)."

Now, that makes sense to me, I can understand that ... but, how would that look as far as netstat -nA shows? Or, would it? For example, I have:

mars# netstat -nA | grep c9655a20
c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0
mars# netstat -nA | grep c95d63f0
c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0

They are attached to each other, but there appears to be no 'referencing process' ... it is now 10pm at night ... I saved a 'snapshot' of netstat -nA output at 6:45pm, over 3 hours ago, and it has the same entries as above:

c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0

again, if I'm reading this right, there is no 'referencing process' ... first, of course, am I reading this right? second ... if I am reading this right, and, if I am understanding what Robert was saying about 'draining' (a lot of ifs, I know) ... isn't it odd for it to take >3 hours to drain?

Again, if I'm reading / understanding things right, without the 'referencing process', it won't show up in sockstat -u, which is why my netstat -nA numbers keep growing, but sockstat -u numbers don't ... which also means that there is no way to figure out what process / program is leaving 'dangling sockets'? :(

Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)