Network hangs are insidious. [old fart story time] The headscratcher for me was the one in the 1990s at apple.com (when apple.com was a DEC VAX-8650 running 4.3BSD) that led me to discover TCP SYN attacks and report them to the CERT two years before panix.com was attacked in the same way. The problem: a far too limited initial TCP SYN queue length (5!), and when that short queue was full, any new TCP connection attempt to that port failed with "connection timed out" (the inbound SYN packet was dropped because the queue for that port was full), despite ping (ICMP) working fine.
Imagine:

	"telnet localhost 25" gives "connection timed out" (wait, what? How is that possible?)
	kill sendmail (yeah, we used sendmail back then)
	"telnet localhost 25" gives "connection refused" (OK, as expected)
	restart sendmail
	"telnet localhost 25" gives "connection timed out" (WTF?!!)

Rebooting the VAX didn't clear the problem either - same behavior afterwards. That's when I went looking at our routers to see if anything was wrong with the rest of our connections to the Internet. The source of my problem was warring "default" routes in a pair of our exterior-facing Cisco routers (round & round a class of outbound packets went until TTL exceeded), but because the routers carried about 2/3rds of the full "default-free" Internet routing table at the time, we didn't immediately notice that we couldn't talk to 1/3rd of the Internet.

Of course, they could all still send packets to us ... which is how the TCP SYN queue got full: our SYN-ACKs weren't getting out to that 1/3rd, and with the SYN queue full (and a two-minute timeout), suddenly SMTP stopped accepting any other connection attempts. Once I found the default route loop, I fixed it, and then watched the load on apple.com shoot up as the Internet started actually being able to speak to our SMTP server again.

My report to the CERT (then at CMU SEI) came out of first "how did this happen?", followed by, "wow, I could send five or six packets every two minutes with totally random non-responsive (non-existent!) IP source addresses to any particular host/TCP port combination and stop that host from being able to respond on that port! I could shut down E-mail at AOL! Moo hah hah! Oh, and, yeah, just try to trace & stop me, I dare you." [the CERT did nothing with my report, alas.
I quietly provided it to friends at SGI and a few other places.]

I also sent a somewhat oblique message to the IETF mailing list, asserting that a class of one-way (bidirectional communication not required) attacks existed, and that ISP ingress filtering of customer IP source addresses was the only way we'd be able to both forestall and trace them. That's a BCP now, but Phil Karn flamed me at the time for wanting to break one mode of Mobile IP. I wasn't graphic or explicit because that list was public, and I didn't want to provide a recipe for any would-be attackers until both the ingress filtering was deployed and the OS vendors had fixed their TCP implementations. This all got fixed a few years later, after panix.com was attacked (though nowhere near as elegantly - they were really massively flooded), with the TCP SYN queue system we now have in NetBSD and all other responsible OSes. The Internet is a pretty hostile network. [/old fart story time]

How this relates: as noted in PR/7285, we have a semantic problem with our errors from the networking code: ENOBUFS (55) is returned for BOTH mbuf exhaustion AND "network interface queue full" (see the IFQ_MAXLEN, IFQ_SET_MAXLEN(), and IF_QFULL() macros in /usr/include/net/if.h, and then the particular network interface driver you use).

TCP is well-behaved: it just backs off and retransmits when it hits a condition like that, and your application probably never hears about it - though it may experience the condition as a performance degradation as TCP backs off. UDP, not so much. If your UDP-based applications are reporting that error, they're probably not doing anything active/adaptive about it. Some human is expected to analyze the situation and "deal with it" somehow. Lucky you, human. It might be time for you to recapitulate the TCP congestion measurement and backoff algorithms in your UDP application (good luck with that well-trod path to tears). Or just convert to TCP. Or ... fix your network (stack?
interface? media? switches?), if you can figure out what's actually wrong.

The bad part is that without a distinct error code for "queue full", I can't tell you whether you really are running out of mbufs (though "netstat -m" will tell you if you've ever hit the limit, and "netstat -s" will tell you about some queues on a per-protocol basis, but I don't see counters for network interfaces in there, as there probably should be), or whether you're overrunning the network interface output queue limit, whatever that is. In both cases, your application should take such an error as a message to back off and retransmit "later" (like TCP does).

The trouble with a "network interface output queue full" error is that it could be that your application is just plain transmitting faster than the network interface can physically go (and good luck finding that datum from the Unix networking API), or your interface has been flow-controlled due to congestion (modern gigabit Ethernet switches do that now), or, worse, the driver really is hanging in some odd state "for a while" (missed interrupt, perhaps? other hardware hiccup?) and the packets are piling up until the queue is full. You seem to think it's that last, and it could well be - but I think you're going to have to instrument some code to catch it in the act to really figure this out and be sure of your analysis.

We really should fix PR/7285 properly with the required API change: a new error code, allocated at least amongst the BSDs, though we ought to get Linux on board too (I haven't looked, but I bet they have the same problem).

An aside: one of my favorite network heartbeat monitoring tools is the Network Time Protocol (NTP), because it (politely) polls its peers and keeps very careful track of both packet losses and transit times. Just looking at an "NTP billboard" (ntpq -p) can tell you quite a lot about the health of your network, depending upon which peers/servers you configure.
I hope this is of some use towards solving your problem,

	Erik <f...@netbsd.org>