2011/6/11 Matt Christiansen <ad...@nikore.net>:
> That's good to know. While 2000 concurrent connections is what we do
> right now, it will be closer to 10,000 concurrent connections come the
> holiday season, which is closer to 2.5 GB of RAM (still less than
> what's on the server).
>
> One thought I have is that our requests can be very large at times (big
> headers, super huge cookies), so it may not be packet loss that the
> bigger buffer is fixing but a better ability to buffer our large
> requests. That might explain why nginx wasn't showing this issue
> whereas haproxy was.
>
> We don't have any HP servers or Broadcom NICs (all Intel). I too have
> had a lot of issues in general with both HP and Broadcom, and chose
> hardware for our LB that didn't have those NICs.
>
> Our switches are new, but not super high quality (Netgears). It's
> possible they are not performing as well as we would like; I'll have to
> do some more tests on them.
I already experienced some negotiation problems with Netgears. Have you
tried forcing the media settings on the NICs?

Cheers,
Joris

> I'm working on creating a more production-like lab where I can test a
> number of different aspects of the LB to see what else I can do in
> terms of performance. I will make lots of use of halog -srv along with
> other tools to measure performance and to see if I can track down any
> issues in our current H/W setup.
>
> Thanks for all the help,
>
> Matt C
>
> On Thu, Jun 9, 2011 at 10:20 PM, Willy Tarreau <w...@1wt.eu> wrote:
>> On Thu, Jun 09, 2011 at 04:04:26PM -0700, Matt Christiansen wrote:
>>> I added in tune.bufsize 65536 and right away things got better. I
>>> doubled that to 131072 and all of the outliers went away. Set at
>>> that, my tests show haproxy is faster than nginx on 95% of responses
>>> and on par with nginx for the last 5%, which is fine with me =).
>>
>> Nice, at least we have a good indication of what may be wrong. I'm
>> pretty sure you're having an important packet loss rate.
>>
>>> What is the negative to setting this high like that? If it's just
>>> RAM usage, all of our LBs have 16GB of RAM (don't ask why), so if
>>> that's all, I don't think it will be an issue having it so high.
>>
>> Yes, it's just an impact on RAM. There are two buffers per connection,
>> so each connection consumes 256 kB of RAM in your case. If you do that
>> times 2000 concurrent connections, that's 512 MB, which is still small
>> compared to what is present in the machine :-)
>>
>> However, you should *really* try to spot what is causing the issue,
>> because right now you're just hiding it under the carpet, and it's not
>> completely hidden as retransmits still take some time to be sent.
>>
>> Many people have encountered the same problem with Broadcom
>> NetXtreme II network cards, which was particularly marked on those
>> shipped with a lot of HP machines (firmware 1.9.6).
>> The issue was a huge Tx drop rate
>> (which is not reported in netstat). A tcpdump on the machine and
>> another one on the next hop can show that some outgoing packets never
>> reach their destination.
>>
>> It is also possible that one piece of equipment is dying (e.g. a
>> switch port) and that the issue will get worse with time.
>>
>> You should pass "halog -srv" on your logs which exhibit the varying
>> times. It will output the average connection times and response times
>> per server. If you see that all servers are affected, you'll conclude
>> that the issue is closer to haproxy. If you see that just a group of
>> servers is affected, you'll conclude that the issue only lies around
>> them (maybe you'll identify a few older servers too).
>>
>> Regards,
>> Willy
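As a footnote to the thread: Willy's RAM figure is easy to recompute for
other connection counts. A minimal shell sketch, using the 131072-byte
tune.bufsize and the 2000 / 10,000 connection counts discussed above
(the loose MB rounding follows Willy's own arithmetic):

```shell
# Two buffers (request side and response side) are allocated per
# connection, each of tune.bufsize bytes, so buffer memory grows
# linearly with concurrency.
bufsize=131072                                # tune.bufsize (128 kB)

for conns in 2000 10000; do
    per_conn_kb=$((2 * bufsize / 1024))       # 256 kB per connection
    total_mb=$((per_conn_kb * conns / 1000))  # rough MB, as in the thread
    echo "$conns connections -> ~$total_mb MB of buffer RAM"
done
```

This reproduces both figures from the thread: ~512 MB at 2000
connections and ~2560 MB (Matt's "closer to 2.5 GB") at 10,000.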
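The per-server comparison Willy describes can be partly automated once
halog has run. The sketch below does NOT parse halog's real output; it
assumes you have already reduced "halog -srv" results to a simplified
two-column summary (server name, average response time in ms). The file
name, server names, numbers, and the 100 ms threshold are all
illustrative assumptions:

```shell
# Hypothetical per-server averages distilled from a halog -srv run.
cat > /tmp/srv_times.txt <<'EOF'
web1 45
web2 48
web3 310
web4 44
EOF

# If only a subset of servers trips the threshold, the problem lies near
# those servers; if all of them do, look closer to haproxy or its
# network path, as Willy suggests.
awk '$2 > 100 { print $1, "avg", $2 "ms" }' /tmp/srv_times.txt
```

With the sample data above, only web3 is flagged, pointing the
investigation at that server (or its switch port) rather than at the
load balancer.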