On Sat, Jan 30, 2010 at 1:04 AM, Willy Tarreau <w...@1wt.eu> wrote:
> Hi David,
>
> On Fri, Jan 29, 2010 at 03:58:09PM -0800, David Birdsong wrote:
>> I'm curious what others are doing to achieve high connection rates
>> -say 10K connections/second.
>>
>> We're serving objects averaging around 100KB, so 10K/sec is a fully
>> utilized 1G ethernet card.
>
> No, at 10k/sec you're at 1GB/s or approx 10 Gbps. 100 kB is huge for
> an average size. My experiences with common web sites are in the range
> from a few hundreds of bytes (buttons, icons, ...) to a few tens of kB
> (js, css, images). The more objects you have on a page, the smaller
> they are and the higher the hit rate too.
We serve media.  I double-checked the average size by reading
Content-Length values and averaging them over every 100 and every 1000
requests during a 5 minute period.

Yep, 85-95kB is what we serve mostly... all images.
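
For what it's worth, a quick pass over an access log gives the same
kind of number; roughly this, minus the per-100/per-1000 windowing
($10 is the bytes-sent field in a stock combined-format nginx log,
adjust for yours):

awk '{ sum += $10; n++ } END { if (n) print sum / n }' access.log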

>
>>  I'd like to safely hit 7-800 Mb/sec, but
>> interrupts are just eating the machine alive.
>
> you just did not configure your e1000 properly. I'm used to force
> InterruptThrottleRate between 5000 and 10000, not more. You have
> to set the value as many times as you have NICs.
>
I looked into this.  e1000e has some really good documentation for
setting InterruptThrottleRate.

I tried static values like 5000, 7000, and 10000 with little
improvement.  Then I wrote a script to measure the interrupt rate for
eth1's IRQ out of /proc/interrupts.
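
The script is nothing fancy; roughly this (it just sums eth1's per-CPU
counters in /proc/interrupts once a second and prints the delta):

prev=$(awk '/eth1/ { for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s += $i } END { print s }' /proc/interrupts)
while sleep 1; do
  cur=$(awk '/eth1/ { for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s += $i } END { print s }' /proc/interrupts)
  echo "$((cur - prev)) interrupts/sec"
  prev=$cur
done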

Simply moving the InterruptThrottleRate from the default of 'Dynamic
Conservative' to 'Dynamic' reduced hardware interrupts to less than
5K/sec.  I was surprised this had any effect at all, since the two
settings only differ in that 'Dynamic' allows the interrupt ceiling to
rise into the 70k/sec range if the driver detects traffic patterns that
fit the 'lowest latency' class.
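
I set it as a driver option and reloaded the module from the console;
roughly this (per the e1000e docs, 1 is 'Dynamic' and 3 is the default
'Dynamic Conservative', one value per port; the file name under
/etc/modprobe.d/ is just what I picked):

echo "options e1000e InterruptThrottleRate=1,1" > /etc/modprobe.d/e1000e.conf
rmmod e1000e && modprobe e1000e   # drops the links, so not over ssh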

This actually helped boost the bandwidth ceiling close to line rate;
the machine was pushing 800-900 Mbps, though system CPU was still very
high and unstable.


>> Before adjusting the ethernet irq to allow interrupts to be delivered
>> to either core instead of just cpu1, I was hitting a limit right
>> around 480Mb/sec, cpu1 was taxed out servicing both hardware and
>> software interrupts.
>
> Check that you have properly disabled irqbalance. It kills network
> performance because its switching rate is too low. You need something
> which balances a lot faster or does not balance at all, both of which
> are achieved by default by the hardware.
The irqbalance daemon is not running, if that's what you mean.
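
Quick sanity check, something like:

pgrep irqbalance || echo "irqbalance not running"
chkconfig --list irqbalance   # and confirm it's off at boot (if installed)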

Here is what we've done to allow IRQs for eth0 and eth1 to hit both
cpu0 and cpu1:

echo ffffffff,ffffffff > /proc/irq/16/smp_affinity
echo ffffffff,ffffffff > /proc/irq/17/smp_affinity

>
>> I adjusted the ethernet card's IRQ to remove its affinity to cpu1 and
>> now the limit is around 560Mb/sec before the machine starts dropping
>> packets.  I did this against the advice that this could cause cpu
>> cache misses.
>>
>> machine is: Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>
> On opterons, I'm used to bind haproxy to one core and the IRQs to
> the other one. On Core2, it's generally the opposite, I bind them
> to the same CPU. But at 1 Gbps, you should not be saturating a core
> with softirqs. Your experience sounds like a massive tx drop which
> causes heavy retransmits. Maybe your outgoing bandwidth is capped
> by a bufferless equipment (switch...), or maybe you haven't set
> enough tx descriptors for your e1000 NIC.
>
>> os: fedora 10 2.6.27.38-170.2.113.fc10.x86_64 #1 SMP
>>
>> card: Intel Corporation 82573L Gigabit Ethernet Controller
>>
>> I've had some ideas on cutting down interrupts:
>>
>>  - jumbo frames behind haproxy (inside my network)
>
> Be careful with jumbos and e1000, I often get page allocation
> failures with them. Setting them to 7kB is generally fine though,
> as the system only has to allocate 2, not 3 pages.

Instead of setting the MTU higher on the haproxy machine, maybe I could
set it higher on the cache server it's talking to.  The haproxy server
talks directly to the internet, so it'd be costly to send out larger
MTUs, cause router fragmentation farther down the path, and then have
to resize the MTU.

If I set it on the backend, then the haproxy machine should auto-resize
to the higher MTU, right?
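
Either way, the change itself is trivial; something like this on the
backend-facing interfaces (interface name is just for illustration):

ip link set dev eth1 mtu 7000   # 7000 rather than 9000, per your page-allocation note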


>
>>  - LRO enabled cards (not even sure what this is yet)
>
> it's Large Receive Offload. The Myricom 10GE NICs support that and it
> considerably boosts performance for high packet rates. But we're 10
> times above your load. This consists in recomposing large incoming
> TCP segments from many small ones, so that the TCP stack has less
> IP/TCP headers to process. This is an almost absolute requirement
> when processing 800k packets per second (10G @1.5kB). To be honest,
> at gig rate, I don't see a big benefit. Also, LRO cannot be enabled
> if you're doing ip forwarding on the same NIC.

I'm very interested in 10GE; a few of these could cut way down on the
proxy machine footprint.

>
>> I'm not even exactly sure which cards support either of these features yet.
>
> all e1000 that I know support jumbo frames. Recent kernels support
> GRO which is a software version of LRO but which still improves
> performance.
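
Something to check once we're on a newer kernel, presumably along the
lines of:

ethtool -k eth1          # look for generic-receive-offload in the list
ethtool -K eth1 gro on   # turn it on if the driver and kernel support it
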
>
>> Also, an MSI-X card sounds like it might reduce interrupts, but I'm
>> uncertain....might be trying these soonest.
>
> well, please adjust your NIC's settings first :-)
The MSI-X cards arrived, but I want to hold off and see whether I've
simply not tuned enough.

>
>> Here's some net kernel settings.
>> sysctl -A | grep net >  http://pastebin.com/m26e88d16
>
> # net.ipv4.tcp_wmem = 4096        65536   16777216
> # net.ipv4.tcp_rmem = 4096        87380   16777216
>
> Try reducing the middle value 2 or 4 fold. You may very well
> be lacking socket buffers.

Reduced; what does the middle number control, exactly?
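
My guess from Documentation/networking/ip-sysctl.txt is that it's the
default per-socket buffer size. Either way, here's roughly what I
applied (a 2-fold cut of the middle value):

sysctl -w net.ipv4.tcp_wmem="4096 32768 16777216"
sysctl -w net.ipv4.tcp_rmem="4096 43690 16777216"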

>
> # net.netfilter.nf_conntrack_max = 1310720
> # net.netfilter.nf_conntrack_buckets = 16384
>
> well, no wonder you're using a lot of CPU with conntrack enabled
> at these session rates. Also, the conntrack_buckets is low compared
> to the conntrack_max. I'm used to set it between 1/16 and 1/4 of
> the other one to limit the hash table length. But even better
> would be not to load the module at all.
>
It's statically compiled into the kernel; I haven't gotten around to
recompiling the kernel to compile it out yet.

I am using the NOTRACK target to bypass conntrack for all traffic, though:
 sudo iptables -L -t raw
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
NOTRACK    all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
NOTRACK    all  --  anywhere             anywhere
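
For reference, those rules were added with something along the lines of:

iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK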



> Also I suspect you ran it on an idle system since the conntrack_count
> is zero. On a live system it should be very high due to the large
> timeouts (especially the tcp_timeout_time_wait at 120 seconds).
>

Yep, the conntrack count is low; we're not tracking, thanks to the iptables NOTRACK rules.

>> I also have everything out of /proc/net/nestat graphed for the last
>> few weeks if anybody wants to see.
>>
>> Is this the best I can expect out of the card,  the machine and the
>> kernel?  Are there any amount of tuning that can alleviate this?
>
> Well, first please recheck your numbers, especially the average
> object size. The worst case are for objects between 5 and 20kB.
> They produce large numbers of sessions AND large numbers of bytes,
> which increase CPU usage and socket buffer usage. But that's not
> a reason for not sustaining the gig rate :-)

I am starting to think the limiting factor is soft interrupts, now that
I've actually measured the hardware interrupt rate.  There are a lot of
interrupts on the machine because haproxy gets its traffic from a
localhost nginx instance.  Chaining daemons together over loopback
doesn't cost a hardware interrupt, but it could be costing soft
interrupts.

I wonder if I could make the TCP buffers for loopback really big to cut
down on soft interrupts?
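
To pin that down, I'll probably start by watching the softirq share of
CPU time; something like:

mpstat -P ALL 1               # the %soft column is softirq load per CPU
cat /proc/net/softnet_stat    # per-CPU counters (hex); sample twice and diff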

>
> Regards,
> Willy
>
>
