I particularly like the idea of an allocation pool in which per-packet information can be stored and then reused by the next packet.
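To make the pool idea a bit more concrete, here is a minimal sketch, assuming single-threaded packet processing and a C++14 compiler. PacketInfo, Pool, Acquire, Release and Idle are hypothetical names made up for illustration; none of them are existing Bro classes or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-packet state (NOT a Bro type).
struct PacketInfo {
    uint32_t len = 0;
    uint16_t src_port = 0;
    uint16_t dst_port = 0;
    std::vector<uint8_t> scratch;  // buffer whose capacity survives reuse

    // Return the object to a clean state without giving memory back.
    void Reset() {
        len = 0;
        src_port = 0;
        dst_port = 0;
        scratch.clear();  // size becomes 0, capacity is retained
    }
};

// Minimal free-list pool. Assumption: used from one thread only.
// T must provide a Reset() member.
template <typename T>
class Pool {
public:
    // Hand out an idle object if there is one, otherwise allocate.
    std::unique_ptr<T> Acquire() {
        if (free_.empty())
            return std::make_unique<T>();
        std::unique_ptr<T> obj = std::move(free_.back());
        free_.pop_back();
        return obj;
    }

    // Instead of freeing, reset the object and keep it for the next packet.
    void Release(std::unique_ptr<T> obj) {
        assert(obj != nullptr);
        obj->Reset();
        free_.push_back(std::move(obj));
    }

    // Number of objects currently waiting to be reused.
    std::size_t Idle() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<T>> free_;
};
```

In a packet loop this would be used as `auto info = pool.Acquire();` at the start of processing and `pool.Release(std::move(info));` at the end, so that after warm-up the steady state does no malloc/free for this type. A real version would also need a cap on the number of idle objects.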
There are also probably some optimizations of frequent operations that could
prove useful now that we're in a 64-bit world. The one's complement checksum
calculation in net_util.cc is one that comes to mind, especially since it
effectively works a byte at a time (and works with even byte counts only).
Seeing as this is done per packet on all TCP payload, optimizing it seems
reasonable. Here's a discussion of doing the checksum calculation in 64-bit
arithmetic: https://locklessinc.com/articles/tcp_checksum/

That website also has an x64 allocator that is claimed to be faster than
tcmalloc; see https://locklessinc.com/benchmarks_allocator.shtml (note: I
haven't tried anything from this source, but find it interesting).

I'm guessing there are a number of such "small" optimizations that could
provide significant performance gains.

Take care,
Jim

On Fri, Oct 6, 2017 at 7:26 AM, Azoff, Justin S <[email protected]> wrote:

> > On Oct 6, 2017, at 12:10 AM, Clark, Gilbert <[email protected]> wrote:
> >
> > I'll note that one of the challenges with profiling is that there are
> > the bro scripts, and then there is the bro engine. The scripting layer
> > has a completely different set of optimizations that might make sense
> > than the engine does: turning off / turning on / tweaking different
> > scripts can have a huge impact on Bro's relative performance depending
> > on the frequency with which those script fragments are executed. Thus,
> > one way to look at speeding things up might be to take a look at the
> > scripts that are run most often and seeing about ways to accelerate
> > core pieces of them ... possibly by moving pieces of those scripts to
> > builtins (as C methods).
>
> Re: scripts, I have some code I put together to do arbitrary benchmarks of
> templated bro scripts. I need to clean it up and publish it, but I found
> some interesting things. Function calls are relatively slow..
> so things like
>
>     ip in Site::local_nets
>
> Is faster than calling
>
>     Site::is_local_addr(ip);
>
> inlining short functions could speed things up a bit.
>
> I also found that things like
>
>     port == 22/tcp || port == 3389/tcp
>
> Is faster than checking if port in {22/tcp,3389/tcp}.. up to about 10
> ports.. Having the hash class fallback to a linear search when the hash
> only contains few items could speed things up there. Things like
> 'likely_server_ports' have 1 or 2 ports in most cases.
>
> > If I had to guess at one engine-related thing that would've sped things
> > up when I was profiling this stuff back in the day, it'd probably be
> > rebuilding the memory allocation strategy / management. From what I
> > remember, Bro does do some malloc / free in the data path, which hurts
> > quite a bit when one is trying to make things go fast. It also means
> > that the selection of a memory allocator and NUMA / per-node memory
> > management is going to be important. That's probably not going to
> > qualify as something *small*, though ...
>
> Ah! This reminds me of something I was thinking about a few weeks ago.
> I'm not sure to what extent bro uses memory allocation pools/interning for
> common immutable data structures. Like for port objects or small strings.
> There's no reason bro should be mallocing/freeing memory to create port
> objects when they are only 65536 times 2 (or 3?) port objects... but bro
> does things like
>
>     tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport), TRANSPORT_TCP));
>     tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport), TRANSPORT_TCP));
>
> For every packet. As well as allocating a ton of TYPE_COUNT vals for
> things like packet sizes and header lengths.. which will almost always be
> between 0 and 64k.
>
> For things that can't be interned, like ipv6 address, having an allocation
> pool could speed things up...
> Instead of freeing things like IPAddr objects they could just be returned
> to a pool, and then when a new IPAddr object is needed, an already
> initialized object could be grabbed from the pool and 'refreshed' with the
> new value.
>
> https://golang.org/pkg/sync/#Pool
>
> Talks about that sort of thing.
>
> > On a related note, a fun experiment is always to try running bro with a
> > different allocator and seeing what happens ...
>
> I recently noticed our boxes were using jemalloc instead of tcmalloc..
> Switching that caused malloc to drop a few places down in 'perf top' output.
>
> --
> Justin Azoff
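P.S. On the checksum point at the top of this message, here is a minimal sketch of what accumulating the one's complement sum in 64-bit words could look like. This is neither the code in net_util.cc nor the code from the locklessinc article; SumFold64 and SumRef16 are hypothetical names, and I have not benchmarked it. Both functions return the folded 16-bit sum in host byte order (the checksum field itself would be the bitwise complement), and unlike an even-length-only routine they zero-pad a trailing odd byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference: add 16-bit words one at a time, then fold the carries.
uint16_t SumRef16(const uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    uint64_t sum = 0;
    while (len >= 2) {
        uint16_t w;
        std::memcpy(&w, data, 2);  // memcpy avoids unaligned access issues
        sum += w;
        data += 2;
        len -= 2;
    }
    if (len == 1) {  // odd trailing byte, padded with a zero byte
        uint16_t w = 0;
        std::memcpy(&w, data, 1);
        sum += w;
    }
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}

// Sketch: add 64-bit words with end-around carry, then fold 64 -> 32 -> 16.
// This works because 2^16 is congruent to 1 modulo 0xffff, so the four
// 16-bit lanes of each 64-bit word contribute the same as if added singly.
uint16_t SumFold64(const uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    uint64_t sum = 0;
    while (len >= 8) {
        uint64_t w;
        std::memcpy(&w, data, 8);
        sum += w;
        if (sum < w)  // unsigned overflow: wrap the carry back in
            ++sum;
        data += 8;
        len -= 8;
    }
    if (len > 0) {  // 1 to 7 trailing bytes, zero-padded to a full word
        uint64_t w = 0;
        std::memcpy(&w, data, len);
        sum += w;
        if (sum < w)
            ++sum;
    }
    sum = (sum & 0xffffffffULL) + (sum >> 32);  // may still carry once
    sum = (sum & 0xffffffffULL) + (sum >> 32);  // now fits in 32 bits
    sum = (sum & 0xffff) + (sum >> 16);         // may still carry once
    sum = (sum & 0xffff) + (sum >> 16);         // now fits in 16 bits
    return static_cast<uint16_t>(sum);
}
```

Whether this actually beats the existing loop would need measuring on real traffic; the point is only that the arithmetic stays equivalent while touching memory an eighth as often.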
_______________________________________________
bro-dev mailing list
[email protected]
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
