> >The downside to NPUs is that they have to service every packet in a
> >fixed amount of time, so they can't do much.  They need to have
> >fixed-size state and fragment reassembly tables.  They also aren't
> >allowed to do much work per packet.  You will also be able to surf
> >Moore's law better with a normal x86 processor than with an NPU.
> Well, the only difference between that and the requirements for a 
> PF-style setup is that there's more room for buffering.  That is, you 
> don't strictly have to finish servicing a packet by the time the next 
> one arrives (or a hardware buffer by the time it would be filled 
> again), but if you want to saturate, you have to have a "service time" 
> per packet of less than the amount of time it takes for the packet to 
> transmit.  (If your "service time" is less than the duration of the 
> packet transmit, you get 100% saturation.  If it takes 5-6us to 
> transmit a 64-byte ethernet frame, and you take 50-60us to process it 
> and hand it off to the outgoing transceiver (if necessary), then you 
> only get 10% saturation.)
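
(For reference, here's the serial arithmetic being assumed above, with
the quoted 5us/50us figures plugged in; the numbers are the quote's
assumptions, not measurements:)

#include <stdio.h>

int
main(void)
{
	double wire_us    = 5.0;	/* 64-byte frame on a 100 Mbit wire */
	double service_us = 50.0;	/* time to process one packet */

	/* One packet at a time: saturation = transmit time / service time. */
	printf("saturation: %.0f%%\n", 100.0 * wire_us / service_us);
	return 0;
}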

That's not strictly true.  All of the NPUs out there are either highly
parallel or have extensive hardware offload units (or both).  Intel's
IXP NPU is an extreme example of parallelism; you get four simultaneous
threads per microengine and four microengines on the old IXP1200, and
the newer IXP2* line has even more.  Each thread can be processing a
packet in parallel.  With the hardware offload NPUs, the microcode will
tell the NPU to "go do a state lookup on this packet"; when that comes
back it will either route the packet or tell the hardware to do a
ruleset lookup.  In between each of those hardware operations the NPU
will be doing other things, like processing other packets.
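
To put rough numbers on that (illustrative assumptions, not IXP
benchmarks): with 16 contexts in flight, a 50us per-packet latency is
no longer a 50us per-packet cost.

#include <stdio.h>

int
main(void)
{
	double wire_us    = 5.0;	/* 64-byte frame on the wire */
	double service_us = 50.0;	/* latency to fully process one packet */
	int    contexts   = 4 * 4;	/* 4 microengines x 4 threads */

	/*
	 * With 'contexts' packets in flight the effective per-packet
	 * cost is service_us / contexts; anything at or over 100%
	 * saturation means the NPU keeps the wire full.
	 */
	printf("effective cost: %.2f us/packet\n", service_us / contexts);
	printf("saturation:     %.0f%% (capped at 100)\n",
	    100.0 * wire_us / (service_us / contexts));
	return 0;
}

The same latency that buries a serial CPU just disappears once there's
enough parallelism to hide it.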
 
> And of course, buffering really only makes things more complex: every 
> system bus is added overhead and potential bottleneck.  Is that packet 
> being stored in an on-chip cache, or did the OS copy it to off-chip 
> core?  Etc.  Add to that the fact that one processor has to do this for 
> all interfaces, and moving from 2 to 6 interfaces is going to give you, 
> what, perhaps 33% theoretical bandwidth (not even counting the extra 
> routing table overhead, or added rule complexity), because every one of 
> a pair of interfaces may be saturated with traffic that needs to be 
> serviced to meet theoretical maximum capacity.

In the traditional software firewall the packet will never leave L1
cache.  Even if we didn't have art@ doing cache coloring everywhere he
could, and we got a lot of cache collisions, the hardware victim buffer
would catch most of the line evictions.
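
For anyone who hasn't seen it, cache coloring just bins pages so that
hot pages don't all land in the same cache sets.  A minimal sketch
(the constants and names here are made up for illustration; this is
not OpenBSD's actual uvm code):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12			/* 4 KB pages (assumption) */
#define CACHE_SIZE	(64 * 1024)		/* 64 KB cache (assumption) */
#define CACHE_WAYS	2
#define NCOLORS		(CACHE_SIZE / CACHE_WAYS / (1 << PAGE_SHIFT))

/* Which cache "color" (set bin) a physical page falls into. */
static unsigned
page_color(uint64_t paddr)
{
	return (paddr >> PAGE_SHIFT) % NCOLORS;
}

int
main(void)
{
	/* An allocator that cycles colors keeps hot pages from colliding. */
	printf("colors: %d, page at 0x5000 -> color %u\n",
	    NCOLORS, page_color(0x5000));
	return 0;
}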

On an NPU the packet FIFO will be presented as local memory.  Worst
case, it will be hanging off the L3 interface.  Sometimes the first X
bytes will even be preloaded into registers.
 
> That doesn't even get into things like logging, or restrictions from
> the OS design.  If you want to be able to saturate AND log, you need
> to know what you're logging to.  Writing to disk, or sending messages
> to a loghost, even on a dedicated interface, adds extra system latency
> or traffic.  In theory, extreme logging on all 64-byte ethernet frames
> for some rule or another could generate MORE traffic than that which
> you are logging.

There is *ONLY* one product capable of logging at that data rate, and
its name totally eludes me at the moment.  Damn, it's on the tip of my
tongue.  Anyway, it's more of a glorified tcpdump that costs beaucoup
dollars.  Anyone who wants their firewall to log every packet on a
saturated wire needs to get their head re-examined.  Ahh, the company is
Niksun.

Firewalls always do logging out of band from the packet-processing code
path.
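
Roughly, "out of band" means the fast path copies a fixed-size record
into a ring and keeps going, and something else drains it later.  A
sketch of the idea (hypothetical names, single producer and single
consumer, memory-ordering details omitted for brevity; this is not how
pflog actually works):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_RING_SZ	8		/* power of two; tiny for the demo */

struct log_rec {
	uint64_t	ts;		/* timestamp */
	uint32_t	rule;		/* matching rule number */
	uint8_t		hdr[64];	/* leading bytes of the packet */
};

static struct log_rec	ring[LOG_RING_SZ];
static volatile uint32_t head, tail;	/* producer / consumer cursors */

/* Fast path: copy a record and move on; drop the log if the ring is full. */
static void
log_packet(uint64_t ts, uint32_t rule, const uint8_t *pkt, size_t len)
{
	uint32_t h = head;

	if (h - tail == LOG_RING_SZ)
		return;			/* lose the log entry, not the packet */
	ring[h & (LOG_RING_SZ - 1)].ts = ts;
	ring[h & (LOG_RING_SZ - 1)].rule = rule;
	memcpy(ring[h & (LOG_RING_SZ - 1)].hdr, pkt, len < 64 ? len : 64);
	head = h + 1;
}

/* Logger side: runs outside the packet path, e.g. in its own thread. */
static int
log_drain(struct log_rec *out)
{
	if (tail == head)
		return 0;
	*out = ring[tail & (LOG_RING_SZ - 1)];
	tail = tail + 1;
	return 1;
}

int
main(void)
{
	uint8_t frame[64] = { 0 };
	struct log_rec r;

	log_packet(1, 42, frame, sizeof(frame));
	while (log_drain(&r))
		printf("logged rule %u at ts %llu\n", r.rule,
		    (unsigned long long)r.ts);
	return 0;
}

Note that the record is already bigger than the 64-byte frame it
describes, which is exactly why logging every packet on a saturated
wire is a losing game.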

> All of that said, I wonder if there isn't some way to implement 
> something vaguely PF-ish in an FPGA that would allow more control over 
> the rulesets than an off-the-shelf ASIC.

Again, why?  FPGAs are expensive; network processors are cheap and
faster.  It seems like everyone and their brother is shipping a network
processor these days that can easily do lightweight firewalling.  I
wouldn't want to rewrite scrub, carp or ALTQ for them, though.  But
there's no way in hell I'd want to redesign those in VHDL for an FPGA
either.  Can you imagine having to pipeline those so the logic doesn't
blow through the clock on an FPGA?  Yikes.

.mike
frantzen@(nfr.com | cvs.openbsd.org | w4g.org)
PGP:  CC A4 E2 E8 0C F8 42 F0  BC 26 85 5B 6F 9E ED 28
