On 1/26/15 11:33 PM, Pavel Odintsov wrote:
> Hello!
>
> Looks like somebody wants to build a Linux soft router! Nice idea for
> routing 10-30 Gbps. I route about 5+ Gbps on a Xeon E5-2620v2 with
> four Intel 82599 10GE cards and Debian Wheezy with kernel 3.2 (a
> really terrible kernel; everyone should use a modern kernel, 3.16 or
> later, because of the buggy Linux route cache). My current CPU load
> on this server is about 15%, so I can route about 15 GE on my Linux
> server.

I looked into the promise and limits of this approach pretty intensively
a few years back before abandoning the effort abruptly due to other
constraints. Underscoring what others have said: it's all about pps, not
aggregate throughput. Modern NICs can inject packets at line rate into
the kernel, and distribute them across per-processor queues, etc.
Payloads end up getting DMA-ed from NIC to RAM to NIC. There's really no
reason you shouldn't be able to push 80 Gb/s of traffic, or more,
through these boxes. As for routing protocol performance (BGP
convergence time, ability to handle multiple full tables, etc.): that's
just CPU and RAM.

The part that's hard (as in "can't be fixed without rethinking this
approach") is the per-packet routing overhead: the cost of reading the
packet header, looking up the destination in the routing table,
decrementing the TTL, and enqueueing the packet on the correct outbound
interface. At the time, I was able to convince myself that being able to
do this in 4 us, average, in the Linux kernel, was within reach. That's
not really very much time: you start asking things like "will the entire
routing table fit into the L2 cache?"
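
To make that per-packet work concrete, here is a toy sketch in Python
of the steps I mean: parse the destination, do a longest-prefix-match
lookup in a FIB, decrement the TTL, and enqueue on the right outbound
interface. It is purely a conceptual model (the FIB entries, interface
names, and packet format are made up for illustration), nothing like
the actual kernel fast path:

import ipaddress
from collections import deque

# Toy FIB: (network, outbound interface); longest prefix wins.
FIB = [
    (ipaddress.ip_network("10.0.0.0/8"), "eth1"),
    (ipaddress.ip_network("10.1.0.0/16"), "eth2"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),    # default route
]
FIB.sort(key=lambda e: e[0].prefixlen, reverse=True)

TX_QUEUES = {name: deque() for name in ("eth0", "eth1", "eth2")}

def forward(packet):
    # packet is a dict with 'dst' (dotted quad) and 'ttl' (int)
    if packet["ttl"] <= 1:
        return None                   # a real router sends ICMP time exceeded
    dst = ipaddress.ip_address(packet["dst"])
    for net, ifname in FIB:           # linear scan stands in for a trie lookup
        if dst in net:
            packet["ttl"] -= 1        # real code also updates the checksum
            TX_QUEUES[ifname].append(packet)
            return ifname
    return None                       # no route

print(forward({"dst": "10.1.2.3", "ttl": 64}))     # -> eth2
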
4 us to "think about" each packet comes out to 250Kpps per processor;
with 24 processors, it's 6Mpps (assuming zero concurrency/locking
overhead, which might be a little bit of an ... assumption). With
1500-byte packets, 6Mpps is 72 Gb/s of throughput -- not too shabby. But
with 40-byte packets, it's less than 2 Gb/s. Which means that your Xeon
ES-2620v2 will not cope well with a DDoS of 40-byte packets. That's not
necessarily a reason not to use this approach, depending on your
situation; but it's something to be aware of.
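
Spelling that arithmetic out so you can plug in your own numbers (the
4 us per-packet cost and 24 cores are the assumptions above, with
perfect scaling across cores):

per_packet_us = 4.0                     # assumed per-packet "think" time
cores = 24
pps_per_core = 1e6 / per_packet_us      # 250,000 pps per core
pps_total = pps_per_core * cores        # 6,000,000 pps, ignoring locking

for size_bytes in (1500, 40):
    gbps = pps_total * size_bytes * 8 / 1e9
    print("{}-byte packets: {:.1f} Gb/s".format(size_bytes, gbps))
# 1500-byte packets: 72.0 Gb/s
# 40-byte packets: 1.9 Gb/s
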
I ended up convincing myself that OpenFlow was the right general idea:
marry fast, dumb, and cheap switching hardware with fast, smart, and
cheap generic CPU for the complicated stuff.

My expertise, such as it ever was, is a bit stale at this point, and my
figures might be a little off. But I think the general principle
applies: think about the minimum number of x86 instructions, and the
minimum number of main memory accesses, to inspect a packet header, do a
routing table lookup, and enqueue the packet on an outbound interface. I
can't see that ever getting reduced to the point where a generic server
can handle 40-byte packets at line rate (for that matter, "line rate" is
increasing a lot faster than "speed of generic server" these days).
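
To put "line rate" in perspective, here is a back-of-the-envelope
budget for a single 10GE port (my example) carrying minimum-size
frames, i.e. 64 bytes on the wire plus 20 bytes of preamble and
inter-frame gap:

line_rate_bps = 10e9
wire_bytes = 64 + 20                    # min Ethernet frame + preamble/IFG
pps = line_rate_bps / (wire_bytes * 8)  # ~14.88 Mpps
ns_per_packet = 1e9 / pps               # ~67 ns to handle each packet
print("{:.2f} Mpps, {:.0f} ns per packet".format(pps / 1e6, ns_per_packet))

At 4 us per packet, even all 24 cores together top out around 6 Mpps,
well short of the ~14.88 Mpps a single 10GE port can deliver in small
packets.
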
Jim