> On Oct 6, 2017, at 5:59 PM, Jim Mellander <[email protected]> wrote:
> 
> I particularly like the idea of an allocation pool that per-packet 
> information can be stored, and reused by the next packet.

Turns out bro does this most of the time, unless you use the new_packet 
event.  Normal connections use the sessions cache, which holds connection 
objects, but new_packet has its own code path that creates the IP header from 
scratch for each packet.  I tried to pre-allocate PortVal objects, but I think 
I was screwing something up with 'Ref', and bro would just segfault on the 2nd 
connection.


> There also are probably some optimizations of frequent operations now that 
> we're in a 64-bit world that could prove useful - the one's complement 
> checksum calculation in net_util.cc is one that comes to mind, especially 
> since it works effectively a byte at a time (and works with even byte counts 
> only).  Seeing as this is done per-packet on all tcp payload, optimizing this 
> seems reasonable.  Here's a discussion of doing the checksum calc in 64-bit 
> arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website 
> also has an x64 allocator that is claimed to be faster than tcmalloc, see: 
> https://locklessinc.com/benchmarks_allocator.shtml  (note: I haven't tried 
> anything from this source, but find it interesting).

I couldn't get this code to return the right checksums inside bro (some casting 
issue?), but if it is faster it should increase performance by a small 
percentage.  Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs (which 
should show what kind of performance increase we would get if that function 
took 0s to run) shows a decent chunk of time taken computing checksums.
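
For reference, here is a minimal sketch of the wide-accumulator idea.  It is 
not the code from the linked article and not what net_util.cc does; the 
function names are made up for illustration.  It adds big-endian 32-bit words 
into a 64-bit accumulator and folds the carries once at the end.  Building 
each word from individual bytes avoids alignment and host byte order 
questions.  The 16-bit version is only there as a reference to compare against.

```cpp
#include <cstddef>
#include <cstdint>

// Reference version (hypothetical name): add one 16-bit big-endian word
// at a time, then fold the carries back into the low 16 bits.
std::uint16_t ones_complement_sum16(const std::uint8_t* data, std::size_t len) {
    std::uint64_t sum = 0;
    while (len >= 2) {
        sum += (std::uint32_t(data[0]) << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)  // odd trailing byte is padded with a zero byte
        sum += std::uint32_t(data[0]) << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return std::uint16_t(sum);
}

// Wide version (hypothetical name): add 32-bit big-endian words into a
// 64-bit accumulator.  Since 2^16 == 1 (mod 0xffff), folding at the end
// gives the same result as adding 16-bit words one by one.
std::uint16_t ones_complement_sum_wide(const std::uint8_t* data, std::size_t len) {
    std::uint64_t sum = 0;
    while (len >= 4) {
        sum += (std::uint32_t(data[0]) << 24) | (std::uint32_t(data[1]) << 16) |
               (std::uint32_t(data[2]) << 8) | std::uint32_t(data[3]);
        data += 4;
        len -= 4;
    }
    if (len >= 2) {  // one leftover 16-bit word
        sum += (std::uint32_t(data[0]) << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)  // odd trailing byte is padded with a zero byte
        sum += std::uint32_t(data[0]) << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return std::uint16_t(sum);
}
```

Both functions return the folded sum; the checksum that goes on the wire is 
its bitwise complement.  For the bytes 00 01 f2 03 f4 f5 f6 f7 both return 
0xddf2.  Comparing the two on real packet data would be one way to check a 
faster version before wiring it in.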

> I'm guessing there are a number of such "small" optimizations that could 
> provide significant performance gains.

I've been trying to figure out the best way to profile bro.  So far, attempting 
to use Linux perf or Google perftools hasn't shed much light on anything.  I 
think the approach I was using to benchmark certain operations in the bro 
language is the better approach.

Instead of running bro and trying to profile it to figure out what is causing 
the most load, simply compare the execution of two bro runs with slightly 
different scripts/settings.  I think this will end up being the better approach 
because it answers real questions like "If I load this script or change this 
setting what is the performance impact on the bro process".  When I did this 
last I used this method to compare the performance from one bro commit to the 
next, but I never tried comparing bro with one set of scripts loaded to bro 
with a different set of scripts loaded.

For example, the simplest and most dramatic test I came up with so far:

$ time bro -r 2009-M57-day11-18.trace -b
real    0m2.434s
user    0m2.236s
sys     0m0.200s

$ cat np.bro
event new_packet(c: connection, p: pkt_hdr)
{

}

$ time bro -r 2009-M57-day11-18.trace -b np.bro
real    0m10.588s
user    0m10.392s
sys     0m0.204s

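We've been saying for a while that adding that event is expensive, but I don't 
know if it's ever been quantified.  Here an empty new_packet handler takes the 
run from 2.4s to 10.6s of wall-clock time, roughly 4.3x.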

The main thing I still need to figure out is how to do this type of test in a 
cluster environment while replaying a long pcap.



Somewhat related, came across this presentation yesterday:

https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be

CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance 
Trading Systems in C++”

Among other things, he mentions using a memory pool for objects instead of 
creating/deleting them.
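
A rough sketch of what such a pool could look like, to make the idea 
concrete.  This is not code from bro or from the talk: ObjectPool and 
PacketInfo are hypothetical names, and the sketch ignores reference counting 
entirely, so it says nothing about how Ref/Unref would have to interact with 
a pooled PortVal.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical per-packet record, standing in for whatever would be pooled.
struct PacketInfo {
    std::uint16_t src_port = 0;
    std::uint16_t dst_port = 0;
};

// Minimal free-list pool: objects are allocated once, handed out by
// acquire(), and returned with release() instead of being deleted.
template <typename T>
class ObjectPool {
public:
    // Reuse a released object if one is available, otherwise allocate.
    // A reused object still holds its old field values; the caller is
    // expected to overwrite them.
    T* acquire() {
        if (free_.empty()) {
            storage_.push_back(std::make_unique<T>());
            return storage_.back().get();
        }
        T* obj = free_.back();
        free_.pop_back();
        return obj;
    }

    // Put an object back on the free list.  The pool keeps ownership,
    // so the caller must not delete it or use it after this call.
    void release(T* obj) { free_.push_back(obj); }

    // Number of objects ever allocated by this pool.
    std::size_t allocated() const { return storage_.size(); }

private:
    std::vector<std::unique_ptr<T>> storage_;  // owns every object
    std::vector<T*> free_;                     // objects ready for reuse
};
```

With this, an acquire() that follows a release() returns the same pointer 
and allocated() stays at 1, so a per-packet path that releases its object 
at the end of each packet would allocate only once.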



-- 
Justin Azoff


_______________________________________________
bro-dev mailing list
[email protected]
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev