Hi Willy, thanks for all your help with this issue. I upgraded to Ubuntu with a recent kernel and poof, the problem disappeared.

Thanks,
Lincoln

On Thu, Dec 3, 2009 at 12:59 AM, Willy Tarreau <w...@1wt.eu> wrote:

> On Thu, Dec 03, 2009 at 12:29:55AM -0500, Lincoln wrote:
> > Hi Willy, I agree it's pretty confusing.
> >
> > I should have been clearer - the problem does not happen every time, it's
> > very random. But when it happens it always follows that exact pattern -
> > that's what I meant to say.
>
> OK, that's what I understood first, but I wanted confirmation.
>
> > I actually have somaxconn set to 10000 so I don't think that's the issue.
>
> indeed.
>
> > At this point I'm thinking about scrapping my EC2 instances and trying 2 new
> > ones - you never know.
>
> One large site I know about had problems with some instances that were
> a lot slower than others, and looked like they were randomly losing a
> lot of packets (probably sharing the same machine as others saturating
> the bandwidth). When they switched to other instances, they discovered
> that some of them were immediately receiving attacks, most likely
> because they were abandoned by sites being attacked. It seems like
> what works well is already used and what you can find unused is probably
> bad... This site finally moved off there to solve their problems, which
> were undebuggable in virtualized environments.
>
> > Just in case you have any other insights here's the output from the 3
> > commands you mentioned. Thanks again for all your help!
> >
> > Lincoln
> >
> > r...@lb1:~$ uname -a
> > Linux domU-12-31-39-0A-92-72 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36
> > EST 2008 i686 i686 i386 GNU/Linux
>
> I don't know if it's the latest Xen kernel available, but 2.6.21 does not
> sound like one of the best kernels to me, so maybe that can explain things,
> though I'm not specifically aware of issues in it. Don't you have anything
> more recent for these boxes ? This kernel was built almost 2 years ago, and
> given the number of critical security vulnerabilities since, there must
> have been updates.
>
> > r...@lb1:~$ netstat -i
> > Kernel Interface table
> > Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> > eth0   1500   0 67999261      0      0      0 70299595      0      0      0 BMRU
> > lo    16436   0  8045554      0      0      0  8045554      0      0      0 LRU
>
> OK no drop here.
>
> > r...@lb1:~$ netstat -s
> > Tcp:
> >     15400091 active connections openings
> >     1500044 passive connection openings
> >     2110125 failed connection attempts
>
> Is it expected that you have that many failed connection
> attempts ? Maybe one of your servers is down and it's just
> the health checks count, but it looks large for a health
> check. It's possible that we have the same problem on both
> sides.
>
> > TcpExt:
> >     2722 invalid SYN cookies received
>
> Do you have SYN cookies enabled ? If so, could you try disabling
> them ?
>
> >     1922 resets received for embryonic SYN_RECV sockets
> >     712136 TCP sockets finished time wait in fast timer
>
> That sounds like a lot, how many connections per second do you get
> on average ? And from the same IP address ?
>
> >     39530 passive connections rejected because of time stamp
>
> Troubling ! Looks like what you're experiencing. I don't know
> under what condition it can happen. Maybe the sender's clock
> is going backwards when it reuses the same connection ?
>
> Willy
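
For anyone chasing a similar problem, here is a minimal sketch of how the counters questioned in the quoted reply could be pulled out of `netstat -s` output and compared between two runs. It is only an illustration, not something used in this thread: the function names and the `WATCHED` list are made up for the example, and the parser assumes each counter line has the shape "<number> <description>", so lines in any other shape are simply skipped.

```python
import re

# Counter descriptions singled out in the quoted reply, matched as substrings.
# This list is an assumption made for the example, not a standard set.
WATCHED = (
    "failed connection attempts",
    "invalid SYN cookies received",
    "passive connections rejected because of time stamp",
)


def parse_netstat_s(text):
    """Return {description: count} for lines shaped like '<number> <description>'."""
    counters = {}
    for line in text.splitlines():
        # Drop mail quote markers ('> > ') and surrounding whitespace.
        line = re.sub(r"^(?:>\s*)+", "", line.strip()).strip()
        match = re.match(r"^(\d+)\s+(.+)$", line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def watched_counters(counters):
    """Keep only the counters whose description contains a WATCHED phrase."""
    return {
        name: value
        for name, value in counters.items()
        if any(phrase in name for phrase in WATCHED)
    }


# Sample taken from the output quoted in this thread.
SAMPLE = """\
Tcp:
    15400091 active connections openings
    1500044 passive connection openings
    2110125 failed connection attempts
TcpExt:
    2722 invalid SYN cookies received
    1922 resets received for embryonic SYN_RECV sockets
    712136 TCP sockets finished time wait in fast timer
    39530 passive connections rejected because of time stamp
"""

if __name__ == "__main__":
    for name, value in watched_counters(parse_netstat_s(SAMPLE)).items():
        print(f"{value:>10}  {name}")
```

On a live box the text could come from `subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout`. Taking two snapshots a few seconds apart and subtracting them shows whether a counter is still growing, which says more than the absolute totals since boot.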