Hi Willy,

I agree it's pretty confusing. I should have been clearer: the problem does not happen every time, it's intermittent and seemingly random. But when it does happen, it always follows that exact pattern. That's what I meant to say.
I actually have somaxconn set to 10000 so I don't think that's the issue. At this point I'm thinking about scrapping my EC2 instances and trying 2 new ones - you never know. Just in case you have any other insights here's the output from the 3 commands you mentioned.

Thanks again for all your help!

Lincoln

r...@lb1:~$ uname -a
Linux domU-12-31-39-0A-92-72 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 i686 i386 GNU/Linux

r...@lb1:~$ netstat -i
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 67999261      0      0      0 70299595      0      0      0 BMRU
lo    16436   0  8045554      0      0      0  8045554      0      0      0 LRU

r...@lb1:~$ netstat -s
Ip:
    76004137 total packets received
    2 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    76004135 incoming packets delivered
    78424996 requests sent out
Icmp:
    1700441 ICMP messages received
    6 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 1485599
        echo requests: 74856
        echo replies: 139986
    1559234 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1484378
        echo replies: 74856
Tcp:
    15400091 active connections openings
    1500044 passive connection openings
    2110125 failed connection attempts
    646 connection resets received
    1 connections established
    72811429 segments received
    74607063 segments send out
    56735 segments retransmited
    1510 bad segments received.
    781257 resets sent
Udp:
    7887 packets received
    1484378 packets to unknown port received.
    0 packet receive errors
    1781057 packets sent
UdpLite:
TcpExt:
    2722 invalid SYN cookies received
    1922 resets received for embryonic SYN_RECV sockets
    712136 TCP sockets finished time wait in fast timer
    22808 time wait sockets recycled by time stamp
    851007 TCP sockets finished time wait in slow timer
    39530 passive connections rejected because of time stamp
    160 packets rejects in established connections because of timestamp
    585822 delayed acks sent
    23 delayed acks further delayed because of locked socket
    Quick ack mode was activated 6840 times
    19917 packets directly queued to recvmsg prequeue.
    338 packets directly received from prequeue
    15636688 packets header predicted
    26116808 acknowledgments not containing data received
    936810 predicted acknowledgments
    130 times recovered from packet loss due to fast retransmit
    7603 times recovered from packet loss due to SACK data
    3 bad SACKs received
    Detected reordering 22 times using FACK
    Detected reordering 6 times using SACK
    Detected reordering 13 times using reno fast retransmit
    Detected reordering 119 times using time stamp
    117 congestion windows fully recovered
    542 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 43
    14626 congestion windows recovered after partial ack
    6847 TCP data loss events
    60 timeouts after reno fast retransmit
    1965 timeouts after SACK recovery
    306 timeouts in loss state
    12099 fast retransmits
    3795 forward retransmits
    9935 retransmits in slow start
    23335 other TCP timeouts
    TCPRenoRecoveryFail: 74
    739 sack retransmits failed
    6890 DSACKs sent for old packets
    3367 DSACKs received
    19 DSACKs for out of order packets received
    643 connections reset due to unexpected data
    240 connections reset due to early user close
    201 connections aborted due to timeout

On Thu, Dec 3, 2009 at 12:16 AM, Willy Tarreau <w...@1wt.eu> wrote:

> On Wed, Dec 02, 2009 at 07:44:40PM -0500, Lincoln wrote:
> > Thanks Willy for offering to help us out with this.
> >
> > We are running on an Amazon EC2 m1small instance which is very common for a
> > load balancer machine.
> >
> > I changed /proc/sys/net/ipv4/tcp_timestamps to 1 - unfortunately to no
> > effect.
>
> OK.
>
> > Here are my iptables settings (nothing special here that I can see - I
> > haven't modified anything):
> > r...@lb1:~$ iptables -L
> > Chain INPUT (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain FORWARD (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain OUTPUT (policy ACCEPT)
> > target prot opt source destination
>
> OK so most likely it was not even loaded.
>
> > I would like to try accepting INVALIDs as you suggest - just to see if that
> > addresses the problem before digging deeper. Unfortunately I'm not very
> > familiar with iptables - could you show me what I should run to try that?
>
> You don't need to because you don't have any iptables rules, so those are
> implicitly allowed. The common case I was talking about was when people
> explicitly drop packets in invalid state.
>
> > If not that, perhaps something else about the EC2 infrastructure is using
> > sequence number randomization? Are there other things I can look for?
>
> If you don't have iptables, then your machine should have sent either a
> SYN/ACK or an ACK. If you really took the trace from the machine itself,
> then I have no explanation about the problem :-(
>
> You said that in every trace it was the same pattern, ie the first
> packet which was accepted was the SYN without timestamps. Are you
> absolutely sure it's *always* the case and it's not just random ?
> I'm asking because the system might refrain from sending a SYN/ACK
> when the TCP SYN backlog is full, which is completely independent
> from the SYN packet's shape. Your TCP parameter tuning was OK,
> but for the backlog you also need to set /proc/sys/net/core/somaxconn
> to a large value otherwise it serves as a max. By default it's very
> low (128). Try setting it to 10000 (you need to restart haproxy for
> the change to take effect).
>
> A "uname -a", "netstat -i" and "netstat -s" can help too.
>
> Regards,
> Willy
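
For anyone following the somaxconn point in the quoted message: on Linux the backlog a program passes to listen() is capped at /proc/sys/net/core/somaxconn, and the cap is applied when listen() is called, which is why the listening process has to be restarted after changing it. The rough Python sketch below only illustrates that clamp. The helper names read_somaxconn and effective_backlog are made up for this example and are not part of haproxy or any library, and the requested values are arbitrary.

```python
from pathlib import Path

# Linux-specific location of the limit discussed in the thread.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path=SOMAXCONN_PATH, default=128):
    """Return the somaxconn value from `path`, or `default` if it cannot be read.

    The fallback of 128 is the default quoted in the thread; it is only an
    assumption for systems where the file is missing or unreadable.
    """
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def effective_backlog(requested, somaxconn):
    """Hypothetical helper: the backlog that results once the cap is applied."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    limit = read_somaxconn()
    print("somaxconn:", limit)
    # Arbitrary example values for the backlog an application might request.
    for requested in (128, 1024, 10000):
        print(requested, "->", effective_backlog(requested, limit))
```

With somaxconn left at 128, a request for 10000 is cut down to 128; with somaxconn at 10000, as on the machine above, the requested value is what counts.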