Hi David,

On Sun, Jan 31, 2010 at 11:41:20AM -0800, David Birdsong wrote:
> we serve media.  i double checked the average size simply by reading
> content length values and averaging them for every 100 and 1000
> requests over a 5 minute period.
> 
> yep, 85-95kB is what we serve mostly....all images.

OK.

> > you just did not configure your e1000 properly. I'm used to force
> > InterruptThrottleRate between 5000 and 10000, not more. You have
> > to set the value as many times as you have NICs.
> >
> I looked into this.  e1000e has some really good documentation for
> setting InterruptThrottleRate.

yes indeed, intel's drivers have really good documentation.

> I tried setting a static value like 5000,7000, and 10000 with little
> improvement.  Then I wrote a script to measure interrupts for eth1's
> irq out of /proc/interrupts.
> 
> Simply moving the InterruptThrottleRate from the default of 'Dynamic
> Conservative' to 'Dynamic' reduced hardware interrupts to less
> than 5K/sec.  I was surprised this had any effect at all since the two
> settings only differ in that 'Dynamic' allows the interrupt ceiling to
> rise up into the 70k/sec range if the driver detects traffic patterns
> to fit the 'lowest latency' class.
> 
> This actually helped boost the bandwidth ceiling close to line rate,
> the machine was pushing 800-900 Mbps -though the system cpu was still
> very high and unstable.

Do you use TCP splicing ? You need a recent kernel (2.6.27.x or
newer, with x at least as recent as what your distro ships). Then
build haproxy with USE_LINUX_SPLICE=1.

In your config (e.g. in the frontend), add "option splice-response".
It could save a lot of CPU cycles by avoiding multiple copies and
segmentation/desegmentation in the kernel.
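For instance, something like this (the frontend/backend names and the
port are only placeholders) :

frontend www
        bind :80
        option splice-response   # splice server-to-client data in the kernel
        default_backend caches

And rebuild with something like "make TARGET=linux26 USE_LINUX_SPLICE=1".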

> >> Before adjusting the ethernet irq to allow interrupts to be delivered
> >> to either core instead of just cpu1, I was hitting a limit right
> >> around 480Mb/sec, cpu1 was taxed out servicing both hardware and
> >> software interrupts.
> >
> > Check that you have properly disabled irqbalance. It kills network
> > performance because its switching rate is too low. You need something
> > which balances a lot faster or does not balance at all, both of which
> > are achieved by default by the hardware.
> irqbalance, the daemon is not running if that's what you mean.

OK. Depending on the kernel version, this can be handled by a kernel
thread which requires a reboot to be disabled (by booting with the
"noirqbalance" command line parameter). You can easily check its
presence as it appears in "ps aux|grep balance".
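For instance :

ps aux | grep -i balance      # an irqbalance process or kernel thread shows up here
cat /proc/cmdline             # check whether "noirqbalance" was passed at boot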

> here is what we've done to allow irq's for eth0 and eth1 to hit both
> cpu0 and cpu1
> 
> echo ffffffff,ffffffff > /proc/irq/16/smp_affinity
> echo ffffffff,ffffffff > /proc/irq/17/smp_affinity

If your values are kept, it means irqbalance is not running; it
likes to change them all the time.

You're saying you have two NICs, care to indicate a bit more about
that ? Are they used in load balancing or one for input and the other
one for output ?

> >> os: fedora 10 2.6.27.38-170.2.113.fc10.x86_64 #1 SMP

Ah, this one is OK for splicing.

> > Be careful with jumbos and e1000, I often get page allocation
> > failures with them. Setting them to 7kB is generally fine though,
> > as the system only has to allocate 2, not 3 pages.
> 
> instead of setting the mtu on haproxy machine higher, maybe i could
> set it higher on the cache server it's talking to.  the haproxy server
> talks directly to the internet, so it'd be costly to send out higher
> mtu's and cause router fragmentation farther down the path ...and then
> have to resize mtu.
> 
> if i set it on the backend, then haproxy machine should auto-resize to
> a higher mtu right?

Exactly. You can use jumbos on your internal network and normal small
frames on the outside. Packet size does not matter much when sending,
it's only on the receive path that it's important to have large ones.
Also, you can set a per-socket MSS in haproxy's config (check the
"bind" options).

> I'm very interested in 10GE, a few of these could cut way down on
> proxy machine footprint.

Well, if you have two NICs, with a proper setup, you could already
push 2 gigs out. The principle consists in having two distinct
processes, each bound to one NIC and having each NIC working in
both directions. Then you bind each NIC to one single core and
each process to the same CPU core as the NIC it uses. You end up
with two distinct machines in a single one. But let's try to
enable splicing first.
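As a rough sketch of what I mean (the IRQ numbers, core numbers and
config paths are only examples, adjust them to your hardware) :

# pin each NIC's interrupts to a single core (hex CPU masks)
echo 1 > /proc/irq/16/smp_affinity    # eth0 -> cpu0
echo 2 > /proc/irq/17/smp_affinity    # eth1 -> cpu1

# run one haproxy process per core, each bound to its NIC's address
taskset -c 0 haproxy -f /etc/haproxy/haproxy-eth0.cfg
taskset -c 1 haproxy -f /etc/haproxy/haproxy-eth1.cfg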

> >> Also, an msi-x card sounds like it might reduce interrupts, but I'm
> >> uncertain....might be trying these soonest.
> >
> > well, please adjust your NIC's settings first :-)
> the msi-x cards arrived, but i want to hold off and see if I've simply
> not tuned enough.

I agree with you. All e1000's I have used till now had absolutely
no trouble forwarding at line rate on objects that large. I really
think that the presence of iptables is hitting you somewhat.

> >> Here's some net kernel settings.
> >> sysctl -A | grep net >  http://pastebin.com/m26e88d16
> >
> > # net.ipv4.tcp_wmem = 4096        65536   16777216
> > # net.ipv4.tcp_rmem = 4096        87380   16777216
> >
> > Try reducing the middle value 2 or 4 fold. You may very well
> > be lacking socket buffers.
> 
> reduced, what is the middle number?

The default socket buffer size. Every time the kernel creates a socket
(accept or connect), it allocates this size for each direction,
then lowers it if it lacks memory or increases it if the latency
is high and queues fill.
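For instance, to cut the default 4-fold while keeping your min and max
values :

sysctl -w net.ipv4.tcp_rmem="4096 21845 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"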

> > # net.netfilter.nf_conntrack_max = 1310720
> > # net.netfilter.nf_conntrack_buckets = 16384
> >
> > well, no wonder you're using a lot of CPU with conntrack enabled
> > at these session rates. Also, the conntrack_buckets is low compared
> > to the conntrack_max. I'm used to set it between 1/16 and 1/4 of
> > the other one to limit the hash table length. But even better
> > would be not to load the module at all.
> >
> it's statically in the kernel, i haven't gotten around to recompiling
> the kernel yet to compile it out.
>
> i am using the NOTRACK module to bypass all traffic around conntrack though.

What a shame :-(
Unless I'm mistaken, that means that a connection is created for each
incoming packet, then immediately destroyed using the NOTRACK target.
Then it's the same again for outgoing packets. So while lookups are
fast in an empty table, this still costs a lot of CPU. Also, there was
a discussion in the past about netfilter's counters causing cache
thrashing in SMP because they are updated for every packet. I don't
remember the details and I may even be wrong though.
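For reference, I suppose your bypass looks something like this in the
raw table (just a guess at your exact rules) :

iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK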

> I am starting to think the limiting factor is soft interrupts now that
> I've actually measured hardware interrupt rate.

Soft interrupts are just delayed work from hard interrupts. Typically
netfilter, TCP desegmentation, defragmentation, packet retransmits, ...
What you save on one side is consumed on the other one. Still you seem
to be doing too much kernel processing.

> There are a lot of
> interrupts on the machine as haproxy is getting traffic from a
> localhost nginx instance.

Ah interesting. One more reason to enable splicing then ! On the
loopback, the packets will be delivered almost directly from nginx
to the NIC, "flying over" haproxy. Don't forget to set a very large
MTU on your loopback if that's not already the case.

> This chaining of daemons together over
> loopback doesn't cost in a hardware interrupt, but could be costing in
> soft interrupts.

It should not cost much in soft interrupts either. On my machine
(C2D 3 GHz), at 20 Gbps on the loopback doing 20000 100k files per
second, I'm only at 25% softirq. So assuming that you're at 1 Gbps,
you should only spend 2.5% of softirq there.

Passing that through haproxy slows the traffic down to 6 Gbps with
haproxy at 100% CPU on one core. If I enable splicing, haproxy relaxes
to 75% CPU and the rate reaches 8 Gbps. The two other processes now
become the limiting factor.

> I wonder if I could make tcp buffers for loopback really big to cut down
> on soft interrupts?

Yes, double your loopback's MTU, or even make it a lot higher. I've just
set mine to 61440 and the rate now goes slightly over 10 Gbps, still CPU
bound by the two other progs.
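In case it helps, that's simply :

ip link set lo mtu 61440      # or: ifconfig lo mtu 61440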

In haproxy 1.4, you can set the per-process client and server recv
and send buffer sizes. If you're mostly pushing outgoing traffic, for
instance, setting your buffers this way can save a lot of memory :

global
        tune.rcvbuf.client 4096
        tune.sndbuf.client 61440
        tune.rcvbuf.server 61440
        tune.sndbuf.server 4096

Regards,
Willy

