On Thu, Dec 03, 2009 at 12:29:55AM -0500, Lincoln wrote:
> Hi Willy, I agree it's pretty confusing.
>
> I should have been clearer - the problem does not happen every time, it's
> very random. But when it happens it always follows that exact pattern -
> that's what I meant to say.

OK, that's what I understood at first, but I wanted confirmation.

> I actually have somaxconn set to 10000 so I don't think that's the issue.

Indeed.

> At this point I'm thinking about scrapping my EC2 instances and trying 2 new
> ones - you never know.

One large site I know about had problems with some instances that were a
lot slower than others and appeared to be randomly losing a lot of packets
(probably because they shared a physical machine with others saturating the
bandwidth). When they switched to other instances, they discovered that some
of them immediately came under attack, most likely because they had been
abandoned by sites that were being attacked. It seems that what works well
is already in use, and what you can find unused is probably bad... This site
finally moved away from there to solve their problems, which could not be
debugged in a virtualized environment.

> Just in case you have any other insights here's the output from the 3
> commands you mentioned. Thanks again for all your help!
>
> Lincoln
>
> r...@lb1:~$ uname -a
> Linux domU-12-31-39-0A-92-72 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36
> EST 2008 i686 i686 i386 GNU/Linux

I don't know if it's the latest Xen kernel available, but 2.6.21 does not
sound like one of the best kernels to me, so maybe that can explain things,
though I'm not specifically aware of issues in it. Don't you have anything
more recent for these boxes? This kernel was built almost 2 years ago, and
given the number of critical security vulnerabilities found since then,
there must have been updates.

> r...@lb1:~$ netstat -i
> Kernel Interface table
> Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP
> TX-OVR Flg
> eth0 1500 0 67999261 0 0 0 70299595 0 0
> 0 BMRU
> lo 16436 0 8045554 0 0 0 8045554 0 0
> 0 LRU

OK, no drops here.

> r...@lb1:~$ netstat -s
> Tcp:
> 15400091 active connections openings
> 1500044 passive connection openings
> 2110125 failed connection attempts

Is it expected that you have that many failed connection attempts?
Maybe one of your servers is down and it's just the health checks being
counted, but it looks large for health checks. It's possible that we have
the same problem on both sides.

> TcpExt:
> 2722 invalid SYN cookies received

Do you have SYN cookies enabled? If so, could you try disabling them?

> 1922 resets received for embryonic SYN_RECV sockets
> 712136 TCP sockets finished time wait in fast timer

That sounds like a lot. How many connections per second do you get on
average? And how many from a single IP address?

> 39530 passive connections rejected because of time stamp

Troubling! This looks like what you're experiencing. I don't know under
what conditions it can happen. Maybe the sender's clock goes backwards when
it reuses the same connection?

Willy
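
To compare these counters between two runs, or between the two sides, it
may help to pull them out of the `netstat -s` text programmatically. The
following is only a minimal sketch: the function names are hypothetical, the
sample text is copied from the output quoted above, and dividing failed
attempts by active plus passive openings is an assumption about what makes a
useful baseline, not an established metric. It only picks up lines of the
form "<number> <description>" and silently skips everything else.

```python
import re

# Sample counters copied from the `netstat -s` output quoted in this thread.
SAMPLE = """\
Tcp:
    15400091 active connections openings
    1500044 passive connection openings
    2110125 failed connection attempts
TcpExt:
    2722 invalid SYN cookies received
    1922 resets received for embryonic SYN_RECV sockets
    712136 TCP sockets finished time wait in fast timer
    39530 passive connections rejected because of time stamp
"""

# Matches lines such as "    2110125 failed connection attempts".
_COUNTER = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_netstat_counters(text):
    """Return {description: value} for every '<number> <description>' line."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def failed_fraction(counters):
    """Failed attempts divided by active + passive openings (assumed baseline)."""
    opened = (counters["active connections openings"]
              + counters["passive connection openings"])
    return counters["failed connection attempts"] / opened


if __name__ == "__main__":
    stats = parse_netstat_counters(SAMPLE)
    print(f"failed fraction: {failed_fraction(stats):.1%}")
    print("rejected because of time stamp:",
          stats["passive connections rejected because of time stamp"])
```

Running the same parser on two captures taken a known interval apart and
subtracting the values would give rates rather than totals since boot, which
is probably more telling for a problem that only shows up at random.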