On Thu, Jul 21, 2005 at 05:55:55PM +0400, Alexey Lobanov scratched on the wall:

> Does anyone know a way for additional optimization of raw netflow
> records, by merging all events during the *specified* period (i.e., 1
> hour) having same src, dst and ports?

  Assuming you want to maintain the same definition of "flow", you'd
  need to match on a lot more than that...  src/dest IPs, protocol,
  src/dst ports (if applicable), and src/dst interfaces (very important
  in the spec) for starters.  You'd also want to look for things like
  TCP FIN flags, just like the exporter does.
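  To make that concrete, the minimum match key is something along these
  lines (a sketch only -- field names are hypothetical, assuming the raw
  records have been parsed into dicts):

```python
def merge_key(flow):
    """Key for deciding whether two records could belong to one flow.

    Matching on just src/dst addresses and ports is not enough to
    preserve the exporter's definition of a flow; the protocol and
    the src/dst interfaces (important in the spec) must match too.
    All field names here are hypothetical.
    """
    return (
        flow["src_ip"], flow["dst_ip"],
        flow["proto"],
        flow.get("src_port", 0), flow.get("dst_port", 0),  # if applicable
        flow["input_if"], flow["output_if"],
    )

f = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": 6,
     "src_port": 1234, "dst_port": 80, "input_if": 1, "output_if": 2}
key = merge_key(f)
```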

  So assuming you want to keep the same semantics for a flow, we need to
  look at why a flow is exported.  In general, there are four reasons
  why a flow is exported:

  1. Network transport protocol transaction ends (e.g. TCP FIN flags seen)
  2. Max-lifetime timeout
  3. No-activity timeout
  4. Netflow cache overflow (auto reduction of inactivity
        timer, on most exporters)

  We can ignore the first one, since a flow is a flow and it is
  supposed to end if that happened.  The last one is just hard luck and
  means you're not running a device configuration that is matched to
  your traffic patterns (or are under a big DOS).  While you could
  work around this, it would be better to add memory and/or upgrade
  your devices.
 
  That leaves the middle two issues: maximum cache entry lifetime
  timeout, and the no-activity timeout.

> The aim is to save disk space not
> loosing important information regarding traffic details. Actually, same
> operation is done inside of cisco box - but the aggregation time is too
> small in most cases. And further optimisation in a dedicated
> high-performance computer seems to be quite feasible.

  I think you can address this problem much more easily by adjusting
  the timeout values in your netflow exporters.  As you say, some of
  the aggregation is already done on the exporters; if you don't
  like the timeouts, it is easiest to just change them rather
  than trying to post-process around them.

  IIRC, by default on Cisco exporters, the max-lifetime timeout is 30
  min., and the no-activity timeout is on the order of 15 seconds.
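  On classic IOS, IIRC, those two knobs look like this (from memory --
  verify the syntax and ranges against your platform's documentation):

```
ip flow-cache timeout active 30     ! max-lifetime timer, in minutes
ip flow-cache timeout inactive 15   ! no-activity timer, in seconds
```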

  First, let's look at the max-lifetime timer.

  While I obviously can't speak for your traffic situation, in our flow
  records the max-lifetime timer is hit by considerably less than 1% of
  our flows (although a lot of this depends on your uplink speed--
  slower links will have a larger number of longer flows).  In
  addition, of those flows that do tend to hit the max-lifetime timer,
  a large percentage are connections that are constant and more or less
  never-ending (streaming data, research data transfer from off-site
  facilities, etc., in our case).  In other words, adjusting the
  max-lifetime timer up even higher to 60 or 90 minutes is not likely to
  significantly reduce the total number of flows that hit the
  max-lifetime timeout.  It may catch a few long-lived downloads
  (e.g. 40 min. downloading a new ISO image), but not some of the
  larger stuff.

  So two points: first, the total number of flows hitting this timer is
  usually very very small.  On top of that, adjusting the timer upwards
  is not likely to greatly reduce or zero out these timeouts (although
  adjusting it downwards to five min. or so will show a noticeable
  increase in the number of exported flows).

  The end result is very very little savings from "stitching" these
  types of flows back together (and a whole lot of work to do so).
  Even if you open up your window to 60 min., you're only seeing a
  savings of 50% in those (very) few flows that fall into this
  category.

  OK, that leaves just the no-activity timeouts.

  Now here you have something.  If you have a lot of stuttered traffic,
  there is some possibility to stitch these flows back together,
  although it will trip up statistics packages that make assumptions
  about the maximum size of a "hole" in a flow (i.e. periods of
  inactivity, which affect averaging of flows, bytes, and packets
  over the lifetime of the flow; that's some pretty advanced
  analysis, but we've got some people doing research along those lines).
  Adjustment of the timer up to something like 60 seconds is likely to
  catch a fair number of long-duration flows that are very start/stop,
  but not all of them.  Stuff like ssh sessions that may stay open for
  days, with just the occasional keep-alive packet, are always going to
  get broken up. 
  
  There's a catch to raising this timer, however-- IIRC, with the exception
  of TCP flows (which can look for the FIN flag), all other protocols use
  the inactivity timer to "end" and export the flow.  If you crank up
  the no-activity timer to something like 60 seconds, every DNS lookup
  (for example) will keep two flow records in the cache for 60 seconds.
  Taken from the default 15 seconds, this has the capability of
  quadrupling the resource requirements of the cache.  The end result
  is that you need a lot of memory in your netflow exporter, and if
  you've got a major uplink, you're going to need a HUGE on-device cache.
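  The back-of-the-envelope version of that (Little's Law: resident cache
  entries are roughly arrival rate times residency time; the rate below
  is made up for illustration):

```python
# Rough flow-cache sizing: entries held ~= new flows/sec * how long
# each entry lingers before it is flushed.  For short transactions
# like DNS, residency is essentially the inactivity timer itself.
def cache_entries(flows_per_sec, inactivity_timeout_sec):
    return flows_per_sec * inactivity_timeout_sec

rate = 2000  # hypothetical arrival rate of short UDP flows
at_default = cache_entries(rate, 15)  # default 15-second timer
at_raised = cache_entries(rate, 60)  # raised to 60 seconds: 4x entries
```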

  It is a bit more difficult to predict how much savings this would
  result in.  We can get a rough idea by looking at TCP flows, since we
  can look at the headers to see how many "complete" flows there are.
  That's not a strictly valid sub-sample, since protocols with very
  short expected lifetimes (e.g. DNS) are built on top of UDP-- on one
  hand, short stuff isn't expected to be affected by the inactivity
  timers, but on the other, UDP depends on the inactivity to flush from
  the cache.  Still, these numbers should give us some ballpark figures.

  So... let's look at just the TCP flows for some random hour:

  Total TCP flows:                              6469234

    (not filtering out flows with
     a duration of 30 min-- i.e. flows
     that hit the max lifetime timer--
     since this is less than 1%)

  Flows with SYN and FIN flags ("full" flows):  3486305  (53% of total)
  Flows with just SYN flag ("start" flows):     1602225  (25% of total)
  Flows without SYN or FIN ("middle" flows):    1023851  (16% of total)
  Flows with just FIN flag ("end" flows):        356853  ( 6% of total)
                                                         --------------
                                                         100% of total

  Flows with RST (reset) flag:                   716783  (11% of total)
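  The breakdown above is easy to reproduce from your own records with
  something like this (record format is hypothetical; the flag field is
  assumed to be the OR of all TCP flags seen over the flow's lifetime):

```python
from collections import Counter

# Standard TCP header flag bits
FIN, SYN, RST = 0x01, 0x02, 0x04

def classify(tcp_flags):
    """Bucket a TCP flow by which of SYN/FIN were seen, as in the
    table above: full, start, middle, or end."""
    syn, fin = bool(tcp_flags & SYN), bool(tcp_flags & FIN)
    if syn and fin:
        return "full"
    if syn:
        return "start"
    if fin:
        return "end"
    return "middle"

# Toy sample in place of an hour of real flows
flows = [SYN | FIN, SYN, 0, FIN, SYN | FIN | RST]
counts = Counter(classify(f) for f in flows)
```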

  What this says is that around 53% of flows are "complete"-- i.e. the
  TCP transaction runs start to end within one flow.  The rough balance
  is made up of "start", "middle", and "end" flows.  While this looks
  like it points to at best 22% savings for stitching all those middle
  and end flows together with their start flows, I'm not sure that's the
  case.  Things are muddled by the fact that the number of "start" and
  "end" flows don't match (even if you add resets to ends); further,
  the large number of reset flags mess with what is going on--
  especially since resets can be found in every combo of SYN and FIN
  flags.  Then again, this is pushing my understanding of TCP just a bit.

  TCP is roughly 66% of our flows.  I think it unlikely that this kind
  of stitching would reduce the UDP stuff very much.  The type of
  traffic patterns seen in UDP just aren't the same.

  So our rough numbers are that TCP will show something on the order of
  20% reduction, which might be about 15% total (since you'll get some
  UDP savings).  I'll be the first to say there's a lot of fluff in that
  number, but it shouldn't be radically off from the true value
  (assuming your traffic looks more or less like ours).  Returns will
  be further reduced if you use a static window (e.g. one hour, every
  hour) rather than a sliding window.  The static window is much easier
  to program, but it won't give you as much reduction as a sliding one.
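  For what it's worth, the sliding-window stitcher itself is conceptually
  simple-- the logic isn't the hard part, the memory is.  A minimal sketch
  (hypothetical record format; assumes records arrive sorted by start time):

```python
# Merge consecutive fragments with the same key whenever the gap
# between them is under `window` seconds (a sliding window, not a
# fixed hourly bucket).  `key` would be the full match key discussed
# earlier; here it is whatever callable the caller supplies.
def stitch(flows, key, window=60):
    open_flows = {}  # key -> current merged record
    for f in flows:
        k = key(f)
        cur = open_flows.get(k)
        if cur is not None and f["start"] - cur["end"] <= window:
            cur["end"] = max(cur["end"], f["end"])
            cur["bytes"] += f["bytes"]
            cur["packets"] += f["packets"]
        else:
            if cur is not None:
                yield cur  # gap too large: flush the old record
            open_flows[k] = dict(f)
    yield from open_flows.values()

frags = [{"k": 1, "start": 0, "end": 10, "bytes": 100, "packets": 2},
         {"k": 1, "start": 30, "end": 40, "bytes": 50, "packets": 1}]
merged = list(stitch(frags, key=lambda f: f["k"], window=60))
```

  The real memory cost is `open_flows`, which has to hold every flow
  that might still be extended-- which is exactly the in-memory problem
  discussed below.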

  Although numbers are difficult to predict, if you have a healthy
  amount of memory on your export device and your flow cache is not
  very full, you might try increasing the inactivity timer and see
  what kind of flow reductions you get from that.  You aren't going to
  see the full 15% savings (assuming that's a reasonable number), but
  you might see 10ish or 12ish percent.  Bumping up the timer may give
  you a better picture of what kind of returns you can get from a 
  stitching program without having to go out and build it.  I'm guessing
  you'll see most of any possible returns from that timer adjustment,
  and assuming it doesn't overload your export cache, that seems a lot
  easier than engineering some complex system to stitch flows back
  together.  Personally, the returns just don't seem worth it, unless
  you have some radically strange traffic patterns that generate a large
  number of "broken" flows.

  You might also consider that the only practical way of doing this for
  larger chunks of diverse flows is to hold the whole thing in memory.
  While I have no idea what your traffic patterns are like, we can
  easily see 5GB of raw binary flow data per hour.  Assuming data
  structure overhead, you're looking at around 8GB or more to hold the
  data structure required to stitch all that together.  I question if
  that can be done in virtual memory (since you need to do everything
  within an hour), so you're looking at perhaps 4 to 6GB of RAM (if not
  actually 8 or more).  That's a pretty serious investment of several
  thousand dollars (not only for the RAM, but a nice system that will
  hold that much), and for what?  Assuming your traffic is really
  strange and you save 50% of your disk space and have turned that
  5GB into 2.5GB, you've saved a whopping $5 or so in Fibre Channel RAID
  space, or about $2 in traditional hard drive space.   Obviously
  your data volume may not be that big or require that much memory,
  but if that's the case, your savings will be much smaller as well.
  In the whole chain of expenses-- the export router, the collection
  server, and the disk attached to it-- the cheapest item is usually
  going to be the disks, unless you get into a nicer and larger RAID.

  There's also the thought that if you're running things so close to
  the edge that you need to save 10% to 15% just to get it on the disk,
  you're in no condition to absorb a DOS or worm.  We can see huge
  fluctuations in our flow export rates when new worms hit the net, or
  when our own network is under a DOS (from the inside OR the outside).
  Unless you've got a very controlled network with some type of IDS/IPS
  in front of your netflow export device, you need the ability to
  absorb those kind of network events.  When things are running the way
  I want, our disks are always 20% or more free "just in case"-- and I'm
  talking about a setup that can store a fair fraction of a year's data.

  It is also worth saying that something as simple as gzip'ing the
  flow files will give you a ~70% reduction.  All that costs is a bit
  of CPU time and access speed.  Plus, it is simple and easy to automate.
  In our case we gzip flow files after two weeks, when they go into a
  quasi-archive state for a few months until we delete them.
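  The two-week gzip step is trivial to automate; a sketch (the directory
  layout is hypothetical, and gzip level/naming are whatever you prefer):

```python
# Compress flow files older than two weeks in place.  Anything
# already ending in .gz is skipped.  Paths here are made up.
import gzip
import os
import shutil
import time

CUTOFF = 14 * 24 * 3600  # two weeks, in seconds

def compress_old(directory):
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith(".gz") or not os.path.isfile(path):
            continue
        if now - os.path.getmtime(path) < CUTOFF:
            continue  # still in the "hot" window
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
```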

                *               *               *
  
  I'll be the first to admit these numbers are uniquely ours.  I've got
  a lot of netflow experience on our net, but I've not had much time to
  look at numbers and traffic patterns for a "typical" business.
  University networks, especially large ones, are unique beasts.  I'm
  not going to say you will NOT see a significant reduction in total
  flows by stitching, but I'd be doubtful.  Maybe you can run
  some of these numbers on your own data and get a better idea of
  how it would work out for you.

  We have actually looked at writing something that cross-correlated
  flows and tried to match flows from the same network transaction back
  together, but our goal in that had more to do with traffic pattern
  analysis and some heavy-duty research into traffic shapes and such
  nonsense.  In some cases, we actually wanted to reduce the inactivity
  timer to just a few seconds so that we could more clearly see "holes"
  in the traffic (and then use a stitching program to recreate the
  whole flow pattern).  We never wrote it; first, because the complexity
  and size of the problem was just too out of control for the amount of
  data we generate, and second, the people working on the research
  (unofficially) never really had the time to setup a proper research
  proposal.  But it did sound like interesting stuff.

  I'd love to see some of these numbers for a different style of
  network.

   -j
  
-- 
                     Jay A. Kreibich | CommTech, Emrg Net Tech Svcs
                        [EMAIL PROTECTED] | Campus IT & Edu Svcs
          <http://www.uiuc.edu/~jak> | University of Illinois at U/C
_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools