Re: [Bro-Dev] Performance Enhancements

2017-10-14 Thread Jim Mellander
Yeah, the lockless implementation has a bug:

if (size)

should be

if (size & 1)

I ended up writing a checksum routine that sums 32 bits at a time into a
64-bit register, which avoids the need to check for overflow. It seems to
be faster than the full 64-bit implementation - will test with Bro and
report results.
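For concreteness, here is a sketch of a routine in that spirit (illustrative only, not Bro's actual net_util.cc code, and byte-order handling against the on-wire checksum is glossed over): 32-bit words are summed into a 64-bit accumulator, so no per-iteration overflow check is needed, and the trailing odd byte is exactly the `if (size & 1)` case from the fix above.

```cpp
#include <cstdint>
#include <cstring>

// Sketch: sum 32-bit words into a 64-bit accumulator. The accumulator can
// absorb ~2^32 word additions before it could overflow, so no per-step check.
uint16_t ones_complement_checksum32(const uint8_t* data, size_t size)
{
    uint64_t sum = 0;

    while ( size >= 4 ) {
        uint32_t word;
        std::memcpy(&word, data, 4);  // memcpy avoids unaligned loads
        sum += word;
        data += 4;
        size -= 4;
    }

    if ( size >= 2 ) {                // leftover 16-bit word
        uint16_t half;
        std::memcpy(&half, data, 2);
        sum += half;
        data += 2;
        size -= 2;
    }

    if ( size & 1 )                   // trailing odd byte (the fixed test)
        sum += *data;

    while ( sum >> 16 )               // fold carries back into 16 bits
        sum = (sum & 0xFFFF) + (sum >> 16);

    return static_cast<uint16_t>(~sum);
}
```

Whether this beats a full 64-bit-at-a-time loop depends on the compiler and CPU, which is exactly what the benchmark run against Bro should settle.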

On Thu, Oct 12, 2017 at 3:08 PM, Azoff, Justin S 
wrote:

>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander  wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of doing the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> -
>
> So I still haven't gotten this to work, but I did some more tests that I
> think show it is worthwhile to look into replacing this function.
>
> I generated a large pcap of a 3 minute iperf run:
>
> $ du -hs iperf.pcap
9.6G    iperf.pcap
> $ tcpdump  -n -r iperf.pcap |wc -l
> reading from file iperf.pcap, link-type EN10MB (Ethernet)
> 7497698
>
> Then ran either `bro -Cbr` or `bro -br` on it 5 times and track runtime as
> well as cpu instructions reported by `perf`:
>
> $ python2 bench.py 5 bro -Cbr iperf.pcap
> 15.19 49947664388
> 15.66 49947827678
> 15.74 49947853306
> 15.66 49949603644
> 15.42 49951191958
> elapsed
> Min 15.18678689
> Max 15.7425909042
> Avg 15.5343231678
>
> instructions
> Min 49947664388
> Max 49951191958
> Avg 49948828194
>
> $ python2 bench.py 5 bro -br iperf.pcap
> 20.82 95502327077
> 21.31 95489729078
> 20.52 95483242217
> 21.45 95499193001
> 21.32 95498830971
> elapsed
> Min 20.5184400082
> Max 21.4452238083
> Avg 21.083449173
>
> instructions
> Min 95483242217
> Max 95502327077
> Avg 95494664468
>
>
> So this shows that for every ~7,500,000 packets bro processes, almost 5
> seconds is spent computing checksums.
>
> According to https://locklessinc.com/articles/tcp_checksum/, they run
> their benchmark 2^24 times (16,777,216) which is about 2.2 times as many
> packets.
>
> Their runtime starts out at about 11s, which puts it in line with the
> current implementation in bro.  The other implementations they show are
> between 7 and 10x faster depending on packet size.  A 90% drop in time
> spent computing checksums would be a noticeable improvement.
>
>
> Unfortunately I couldn't get their implementation to work inside of bro
> and get the right result, and even if I could, it's not clear what the
> license for the code is.
>
>
>
>
>
> —
> Justin Azoff
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Performance Enhancements

2017-10-12 Thread Azoff, Justin S

> On Oct 6, 2017, at 5:59 PM, Jim Mellander  wrote:
> 
> I particularly like the idea of an allocation pool that per-packet 
> information can be stored, and reused by the next packet.
> 
> There also are probably some optimizations of frequent operations now that 
> we're in a 64-bit world that could prove useful - the one's complement 
> checksum calculation in net_util.cc is one that comes to mind, especially 
> since it works effectively a byte at a time (and works with even byte counts 
> only).  Seeing as this is done per-packet on all tcp payload, optimizing this 
> seems reasonable.  Here's a discussion of doing the checksum calc in 64-bit 
> arithmetic: https://locklessinc.com/articles/tcp_checksum/ -

So I still haven't gotten this to work, but I did some more tests that I think 
show it is worthwhile to look into replacing this function.

I generated a large pcap of a 3 minute iperf run:

$ du -hs iperf.pcap
9.6G    iperf.pcap
$ tcpdump  -n -r iperf.pcap |wc -l
reading from file iperf.pcap, link-type EN10MB (Ethernet)
7497698

Then ran either `bro -Cbr` or `bro -br` on it 5 times and track runtime as well 
as cpu instructions reported by `perf`:

$ python2 bench.py 5 bro -Cbr iperf.pcap
15.19 49947664388
15.66 49947827678
15.74 49947853306
15.66 49949603644
15.42 49951191958
elapsed
Min 15.18678689
Max 15.7425909042
Avg 15.5343231678

instructions
Min 49947664388
Max 49951191958
Avg 49948828194

$ python2 bench.py 5 bro -br iperf.pcap
20.82 95502327077
21.31 95489729078
20.52 95483242217
21.45 95499193001
21.32 95498830971
elapsed
Min 20.5184400082
Max 21.4452238083
Avg 21.083449173

instructions
Min 95483242217
Max 95502327077
Avg 95494664468


So this shows that for every ~7,500,000 packets bro processes, roughly 5.5 
seconds (21.08s vs. 15.53s average runtime) is spent computing checksums.

According to https://locklessinc.com/articles/tcp_checksum/, they run their 
benchmark 2^24 times (16,777,216) which is about 2.2 times as many packets.

Their runtime starts out at about 11s, which puts it in line with the current 
implementation in bro.  The other implementations they show are
between 7 and 10x faster depending on packet size.  A 90% drop in time spent 
computing checksums would be a noticeable improvement.


Unfortunately I couldn't get their implementation to work inside of bro and get 
the right result, and even if I could, it's not clear what the license for the 
code is.





— 
Justin Azoff




Re: [Bro-Dev] Performance Enhancements

2017-10-09 Thread Azoff, Justin S

> On Oct 9, 2017, at 4:57 PM, Jim Mellander  wrote:
> 
> Well, I found pathological behavior with BaseList::remove
> 
> Instrumenting it with a printf of num_entries & i after the for loop, running 
> against a test pcap then summarizing with awk gives:
> 
> Count, num_entries, i
> 
> 1 3583 3536
> 1 3584 3537
> 1 3623 3542
> 1 3624 3543
> 1 3628 3620
> ...
> 5 5379 5376
> 5 5380 5377
> 7 5499 5496
> 7 5500 5497
> 10 5497 5494
> 10 5498 5495
> 26 3 2
> 5861 3 1
> 13714 4 4
> 14130 4 1
> 34914 4 0
> 74299 3 3
> 1518194 2 1
> 2648755 2 2
> 8166358 3 0
> 13019625 2 0
> 62953139 0 0
> 71512938 1 1
> 104294506 1 0
> 
> there are 286 instances where the list has over 3000 entries, and the desired 
> entry is near the end...  That linear search has got to be killing 
> performance, even though it's uncommon  :-(
> 
> The case of num_entries being 0 can be optimized a bit, but is relatively 
> minor.
> 
> Now, I'll see if I can identify the offending List
> 
> Jim
> 

A for loop over an empty table/set causes the "0 0" entries.  Something related 
to the "cookies" it uses for iteration.

Not sure what causes the "1 1" cases.

What I also find interesting are the "1 0" entries.  I wonder how many of those 
cases are followed up by the list itself being destroyed. 

Something is allocating and then destroying 104,294,506 lists that only ever 
hold a single item.  That's a lot of work for nothing.

— 
Justin Azoff



Re: [Bro-Dev] Performance Enhancements

2017-10-09 Thread Azoff, Justin S

> On Oct 9, 2017, at 4:03 PM, Clark, Gilbert  wrote:
> 
> If you look in one of the subdirectories or another, in ages past there was a 
> little shell script to incrementally execute bro against a specific trace, 
> loading one script at a time to see the effect each of them had on the 
> overall runtime.  I can't remember what it was called offhand, but it was 
> useful for quick and dirty testing.
> 
> And yes, function calls in bro script land are pretty horrific in terms of 
> overhead.  Line per line, bro script in general isn't terribly efficient just 
> because it doesn't do any of the things a modern interpreter might (e.g. 
> just-in-time compilation).  That's not a criticism, it's only a note - most 
> folks rely on horizontal scaling to deal with faster ingress, which I think 
> makes a whole lot of sense.

From what my little microbenchmarks have been showing me, bro scripts are 
mostly fast, but there are some operations that may be slower than they should 
be.. like for loops or set/table lookups on small or empty tables.

> Just in the event it's useful, I've attached some profiling I did on the 
> script function overhead with the crap I wrote: these are some graphs on 
> which script functions call which other script functions, how many times that 
> happens, and how much time is spent in each function.  
> 
> avggraph is the average time spent per-call, and resgraph is the aggregate 
> time spent in each function across the entire run.  The numbers' formatting 
> needed some fixing, but never made it that far ...

Something like this from an hour's worth of bro runtime would be neat.  One 
potential issue: if a function is called a lot, it could mean either that it 
needs to be optimized so it runs faster, or that it needs to be refactored so 
it isn't called so often.

> I know Robin et al. were working on different approaches for next-generation 
> scripting kind of stuff, but haven't kept up well enough to really know where 
> those are.

http://www.icir.org/hilti/ I believe.

> One thing I played around with was using plugin hooks to integrate other 
> scripting languages into the bro fast path (luajit was my weapon of choice) 
> and seeing if conversion from bro script to one of those other languages 
> might improve the run time.  Other languages would still be less efficient 
> than C, and anything garbage collected would need to be *really* carefully 
> used, but ... it struck me as an idea that might be worth a look :)

I'm not sure how hard it would be to write a transpiler for bro scripts and 
convert them completely to something like lua.   Other than maybe ip address 
and subnets as data types, I think they overlap fairly well.

> And yeah, generally speaking, most of the toolkits I've played with for 
> software-based packet processing absolutely do use memory pools for the fast 
> path.  They also use burst fetch tricks (to amortize the cost of fetching 
> packets over X packets, rather than fetching one packet at a time), and I've 
> also seen quite a bit of prefetch / SIMD to try to keep things moving quickly 
> once the packets make it to the CPU.  
> 
> Things start to get pretty crazy as packet rates increase, though: once you 
> hit about 10 - 15 Gbps, even a *cache miss* on a modern system is enough to 
> force a drop ...

Data rate is just one part of it.. the number of packets per second and new 
sessions per second has a huge impact as well.  Handling 10Gbps of a few 
concurrent connections is easy, but 10Gbps of DNS will not work so well.

DNS is one analyzer/script that I think could benefit from being ported to C++. 
The script doesn't do much and there aren't many scripts out there that want 
the lower level dns events.. Having the analyzer merge the request/reply 
directly and bypass the scripts entirely could boost performance by quite a bit.

— 
Justin Azoff




Re: [Bro-Dev] Performance Enhancements

2017-10-09 Thread Jim Mellander
> than C, and anything garbage collected would need to be *really* carefully
> used, but ... it struck me as an idea that might be worth a look :)
>
>
> And yeah, generally speaking, most of the toolkits I've played with for
> software-based packet processing absolutely do use memory pools for the
> fast path.  They also use burst fetch tricks (to amortize the cost of
> fetching packets over X packets, rather than fetching one packet at a
> time), and I've also seen quite a bit of prefetch / SIMD to try to keep
> things moving quickly once the packets make it to the CPU.
>
>
> Things start to get pretty crazy as packet rates increase, though: once
> you hit about 10 - 15 Gbps, even a *cache miss* on a modern system is
> enough to force a drop ...
>
>
> For what it's worth ...
>
>
> -Gilbert
>
>
> --------------
> *From:* Azoff, Justin S <jaz...@illinois.edu>
> *Sent:* Monday, October 9, 2017 10:19:26 AM
> *To:* Jim Mellander
> *Cc:* Clark, Gilbert; bro-dev@bro.org
> *Subject:* Re: [Bro-Dev] Performance Enhancements
>
>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
>
> Turns out bro does this most of the time.. unless you use the next_packet
> event.  Normal connections use the sessions cache which holds connection
> objects, but new_packet has its own code path that creates the ip header
> from scratch for each packet.  I tried to pre-allocate PortVal objects, but
> I think I was screwing something up with 'Ref' and bro would just segfault
> on the 2nd connection.
>
>
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of doing the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> - this website also has an x64 allocator that is claimed to be faster than
> tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note:
> I haven't tried anything from this source, but find it interesting).
>
>
>
> I couldn't get this code to return the right checksums inside bro (some
> casting issue?), but if it is faster it should increase performance by a
> small percentage.  Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs
> (which should show what kind of performance increase we would get if that
> function took 0s to run) shows a decent chunk of time taken computing
> checksums.
>
> > I'm guessing there are a number of such "small" optimizations that could
> provide significant performance gains.
>
> I've been trying to figure out the best way to profile bro.  So far
> attempting to use linux perf, or google perftools hasn't been able to shed
> much light on anything.  I think the approach I was using to benchmark
> certain operations in the bro language is the better approach.
>
> Instead of running bro and trying to profile it to figure out what is
> causing the most load, simply compare the execution of two bro runs with
> slightly different scripts/settings.  I think this will end up being the
> better approach because it answers real questions like "If I load this
> script or change this setting what is the performance impact on the bro
> process".  When I did this last I used this method to compare the
> performance from one bro commit to the next, but I never tried comparing
> bro with one set of scripts loaded to bro with a different set of scripts
> loaded.
>
> For example, the simplest and most dramatic test I came up with so far:
>
> $ time bro -r 2009-M57-day11-18.trace -b
> real    0m2.434s
> user    0m2.236s
> sys 0m0.200s
>
> $ cat np.bro
> event new_packet(c: connection, p: pkt_hdr)
> {
>
> }
>
> $ time bro -r 2009-M57-day11-18.trace -b np.bro
> real    0m10.588s
> user    0m10.392s

Re: [Bro-Dev] Performance Enhancements

2017-10-09 Thread Azoff, Justin S

> On Oct 6, 2017, at 5:59 PM, Jim Mellander  wrote:
> 
> I particularly like the idea of an allocation pool that per-packet 
> information can be stored, and reused by the next packet.

Turns out bro does this most of the time.. unless you use the next_packet 
event.  Normal connections use the sessions cache which holds connection 
objects, but new_packet has its own code path that creates the ip header from 
scratch for each packet.  I tried to pre-allocate PortVal objects, but I think 
I was screwing something up with 'Ref' and bro would just segfault on the 2nd 
connection.


> There also are probably some optimizations of frequent operations now that 
> we're in a 64-bit world that could prove useful - the one's complement 
> checksum calculation in net_util.cc is one that comes to mind, especially 
> since it works effectively a byte at a time (and works with even byte counts 
> only).  Seeing as this is done per-packet on all tcp payload, optimizing this 
> seems reasonable.  Here's a discussion of doing the checksum calc in 64-bit 
> arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website 
> also has an x64 allocator that is claimed to be faster than tcmalloc, see: 
> https://locklessinc.com/benchmarks_allocator.shtml  (note: I haven't tried 
> anything from this source, but find it interesting).

I couldn't get this code to return the right checksums inside bro (some casting 
issue?), but if it is faster it should increase performance by a small 
percentage.  Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs (which 
should show what kind of performance increase we would get if that function 
took 0s to run) shows a decent chunk of time taken computing checksums.

> I'm guessing there are a number of such "small" optimizations that could 
> provide significant performance gains.

I've been trying to figure out the best way to profile bro.  So far attempting 
to use linux perf, or google perftools hasn't been able to shed much light on 
anything.  I think the approach I was using to benchmark certain operations in 
the bro language is the better approach.

Instead of running bro and trying to profile it to figure out what is causing 
the most load, simply compare the execution of two bro runs with slightly 
different scripts/settings.  I think this will end up being the better approach 
because it answers real questions like "If I load this script or change this 
setting what is the performance impact on the bro process".  When I did this 
last I used this method to compare the performance from one bro commit to the 
next, but I never tried comparing bro with one set of scripts loaded to bro 
with a different set of scripts loaded.

For example, the simplest and most dramatic test I came up with so far:

$ time bro -r 2009-M57-day11-18.trace -b
real    0m2.434s
user    0m2.236s
sys 0m0.200s

$ cat np.bro
event new_packet(c: connection, p: pkt_hdr)
{

}

$ time bro -r 2009-M57-day11-18.trace -b np.bro
real    0m10.588s
user    0m10.392s
sys 0m0.204s

We've been saying for a while that adding that event is expensive, but I don't 
know if it's even been quantified.

The main thing I still need to figure out is how to do this type of test in a 
cluster environment while replaying a long pcap.



Somewhat related, came across this presentation yesterday:

https://www.youtube.com/watch?v=NH1Tta7purM

CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance 
Trading Systems in C++”

Among other things, he mentions using a memory pool for objects instead of 
creating/deleting them.



— 
Justin Azoff




Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Jim Mellander
Interesting info.  The more-than-order-of-magnitude difference in time between
BaseList::remove & BaseList::remove_nth suggests the possibility that the
for loop in BaseList::remove is falling off the end in many cases (i.e.
attempting to remove an item that doesn't exist).  Maybe that's what's broken.



On Fri, Oct 6, 2017 at 3:49 PM, Azoff, Justin S  wrote:

>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander  wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of doing the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> - this website also has an x64 allocator that is claimed to be faster than
> tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note:
> I haven't tried anything from this source, but find it interesting).
> >
> > I'm guessing there are a number of such "small" optimizations that could
> provide significant performance gains.
> >
> > Take care,
> >
> > Jim
>
> I've been messing around with 'perf top', the one's complement function
> often shows up fairly high up.. that, PriorityQueue::BubbleDown, and
> BaseList::remove
>
> Something (on our configuration?) is doing a lot of
> PQ_TimerMgr::~PQ_TimerMgr... I don't think I've come across that class
> before in bro.. I think a script may be triggering something that is
> hurting performance.  I can't think of what it would be though.
>
> Running perf top on a random worker right now with -F 1 shows:
>
> Samples: 485K of event 'cycles', Event count (approx.): 26046568975
> Overhead  Shared Object Symbol
>   34.64%  bro   [.] BaseList::remove
>3.32%  libtcmalloc.so.4.2.6  [.] operator delete
>3.25%  bro   [.] PriorityQueue::BubbleDown
>2.31%  bro   [.] BaseList::remove_nth
>2.05%  libtcmalloc.so.4.2.6  [.] operator new
>1.90%  bro   [.] Attributes::FindAttr
>1.41%  bro   [.] Dictionary::NextEntry
>1.27%  libc-2.17.so  [.] __memcpy_ssse3_back
>0.97%  bro   [.] StmtList::Exec
>0.87%  bro   [.] Dictionary::Lookup
>0.85%  bro   [.] NameExpr::Eval
>0.84%  bro   [.] BroFunc::Call
>0.80%  libtcmalloc.so.4.2.6  [.] tc_free
>0.77%  libtcmalloc.so.4.2.6  [.] operator delete[]
>0.70%  bro   [.] ones_complement_checksum
>0.60%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
>0.60%  bro   [.] RecordVal::RecordVal
>0.53%  bro   [.] UnaryExpr::Eval
>0.51%  bro   [.] ExprStmt::Exec
>0.51%  bro   [.] iosource::Manager::FindSoonest
>0.50%  libtcmalloc.so.4.2.6  [.] operator new[]
>
>
> Which sums up to 59.2%
>
> BaseList::remove/BaseList::remove_nth seems particularly easy to
> optimize. Can't that loop be replaced by a memmove?
> I think something may be broken if it's being called that much though.
>
>
>
> —
> Justin Azoff
>
>


Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Azoff, Justin S

> On Oct 6, 2017, at 5:59 PM, Jim Mellander  wrote:
> 
> I particularly like the idea of an allocation pool that per-packet 
> information can be stored, and reused by the next packet.
> 
> There also are probably some optimizations of frequent operations now that 
> we're in a 64-bit world that could prove useful - the one's complement 
> checksum calculation in net_util.cc is one that comes to mind, especially 
> since it works effectively a byte at a time (and works with even byte counts 
> only).  Seeing as this is done per-packet on all tcp payload, optimizing this 
> seems reasonable.  Here's a discussion of doing the checksum calc in 64-bit 
> arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website 
> also has an x64 allocator that is claimed to be faster than tcmalloc, see: 
> https://locklessinc.com/benchmarks_allocator.shtml  (note: I haven't tried 
> anything from this source, but find it interesting).
> 
> I'm guessing there are a number of such "small" optimizations that could 
> provide significant performance gains.
> 
> Take care,
> 
> Jim

I've been messing around with 'perf top', the one's complement function often 
shows up fairly high up.. that, PriorityQueue::BubbleDown, and BaseList::remove

Something (on our configuration?) is doing a lot of 
PQ_TimerMgr::~PQ_TimerMgr... I don't think I've come across that class before 
in bro.. I think a script may be triggering something that is hurting 
performance.  I can't think of what it would be though.

Running perf top on a random worker right now with -F 1 shows:

Samples: 485K of event 'cycles', Event count (approx.): 26046568975
Overhead  Shared Object Symbol
  34.64%  bro   [.] BaseList::remove
   3.32%  libtcmalloc.so.4.2.6  [.] operator delete
   3.25%  bro   [.] PriorityQueue::BubbleDown
   2.31%  bro   [.] BaseList::remove_nth
   2.05%  libtcmalloc.so.4.2.6  [.] operator new
   1.90%  bro   [.] Attributes::FindAttr
   1.41%  bro   [.] Dictionary::NextEntry
   1.27%  libc-2.17.so  [.] __memcpy_ssse3_back
   0.97%  bro   [.] StmtList::Exec
   0.87%  bro   [.] Dictionary::Lookup
   0.85%  bro   [.] NameExpr::Eval
   0.84%  bro   [.] BroFunc::Call
   0.80%  libtcmalloc.so.4.2.6  [.] tc_free
   0.77%  libtcmalloc.so.4.2.6  [.] operator delete[]
   0.70%  bro   [.] ones_complement_checksum
   0.60%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   0.60%  bro   [.] RecordVal::RecordVal
   0.53%  bro   [.] UnaryExpr::Eval
   0.51%  bro   [.] ExprStmt::Exec
   0.51%  bro   [.] iosource::Manager::FindSoonest
   0.50%  libtcmalloc.so.4.2.6  [.] operator new[]


Which sums up to 59.2%

BaseList::remove/BaseList::remove_nth seems particularly easy to optimize. 
Can't that loop be replaced by a memmove?
I think something may be broken if it's being called that much though.
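As a sketch of the memmove idea (a simplified, hypothetical stand-in, not Bro's actual BaseList code): the element-by-element shift loop collapses to a single overlapping copy, which memmove is defined to handle.

```cpp
#include <cstring>

// Simplified stand-in for Bro's BaseList: a raw array of void* entries.
struct BaseList {
    void** entry;
    int num_entries;

    // remove_nth without the per-element loop: one memmove shifts
    // everything after position n down by one slot.
    void* remove_nth(int n)
    {
        if ( n < 0 || n >= num_entries )
            return nullptr;

        void* old = entry[n];
        std::memmove(&entry[n], &entry[n + 1],
                     (num_entries - n - 1) * sizeof(void*));
        --num_entries;
        return old;
    }
};
```

Of course, if remove is genuinely being called millions of times on long lists, the bigger win is not calling it so often in the first place.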



— 
Justin Azoff




Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Jim Mellander
I particularly like the idea of an allocation pool that per-packet
information can be stored, and reused by the next packet.

There also are probably some optimizations of frequent operations now that
we're in a 64-bit world that could prove useful - the one's complement
checksum calculation in net_util.cc is one that comes to mind, especially
since it works effectively a byte at a time (and works with even byte
counts only).  Seeing as this is done per-packet on all tcp payload,
optimizing this seems reasonable.  Here's a discussion of doing the checksum
calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ -
this website also has an x64 allocator that is claimed to be faster than
tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note: I
haven't tried anything from this source, but find it interesting).

I'm guessing there are a number of such "small" optimizations that could
provide significant performance gains.

Take care,

Jim




On Fri, Oct 6, 2017 at 7:26 AM, Azoff, Justin S  wrote:

>
> > On Oct 6, 2017, at 12:10 AM, Clark, Gilbert  wrote:
> >
> > I'll note that one of the challenges with profiling is that there are
> the bro scripts, and then there is the bro engine.  The scripting layer has
> a completely different set of optimizations that might make sense than the
> engine does: turning off / turning on / tweaking different scripts can have
> a huge impact on Bro's relative performance depending on the frequency with
> which those script fragments are executed.  Thus, one way to look at
> speeding things up might be to take a look at the scripts that are run most
> often and seeing about ways to accelerate core pieces of them ... possibly
> by moving pieces of those scripts to builtins (as C methods).
> >
>
> Re: scripts, I have some code I put together to do arbitrary benchmarks of
> templated bro scripts.  I need to clean it up and publish it, but I found
> some interesting things.  Function calls are relatively slow.. so things
> like
>
> ip in Site::local_nets
>
> Is faster than calling
>
> Site::is_local_addr(ip);
>
> inlining short functions could speed things up a bit.
>
> I also found that things like
>
> port == 22/tcp || port == 3389/tcp
>
> Is faster than checking if port in {22/tcp,3389/tcp}.. up to about 10
> ports.. Having the hash class fallback to a linear search when the hash
> only contains few items could speed things up there.  Things like
> 'likely_server_ports' have 1 or 2 ports in most cases.
>
>
> > If I had to guess at one engine-related thing that would've sped things
> up when I was profiling this stuff back in the day, it'd probably be
> rebuilding the memory allocation strategy / management.  From what I
> remember, Bro does do some malloc / free in the data path, which hurts
> quite a bit when one is trying to make things go fast.  It also means that
> the selection of a memory allocator and NUMA / per-node memory management
> is going to be important.  That's probably not going to qualify as
> something *small*, though ...
>
> Ah!  This reminds me of something I was thinking about a few weeks ago.
> I'm not sure to what extent bro uses memory allocation pools/interning for
> common immutable data structures.  Like for port objects or small strings.
> There's no reason bro should be mallocing/freeing memory to create port
> objects when they are only 65536 times 2 (or 3?) port objects... but bro
> does things like
>
> tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport),
> TRANSPORT_TCP));
> tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport),
> TRANSPORT_TCP));
>
> For every packet.  As well as allocating a ton of TYPE_COUNT vals for
> things like packet sizes and header lengths.. which will almost always be
> between 0 and 64k.
>
> For things that can't be interned, like ipv6 address, having an allocation
> pool could speed things up... Instead of freeing things like IPAddr objects
> they could just be returned to a pool, and then when a new IPAddr object is
> needed, an already initialized object could be grabbed from the pool and
> 'refreshed' with the new value.
>
> https://golang.org/pkg/sync/#Pool
>
> Talks about that sort of thing.
>
> > On a related note, a fun experiment is always to try running bro with a
> different allocator and seeing what happens ...
>
> I recently noticed our boxes were using jemalloc instead of tcmalloc..
> Switching that caused malloc to drop a few places down in 'perf top' output.
>
>
> —
> Justin Azoff
>
>
>


Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Azoff, Justin S

> On Oct 6, 2017, at 12:10 AM, Clark, Gilbert  wrote:
> 
> I'll note that one of the challenges with profiling is that there are the bro 
> scripts, and then there is the bro engine.  The scripting layer has a 
> completely different set of optimizations that might make sense than the 
> engine does: turning off / turning on / tweaking different scripts can have a 
> huge impact on Bro's relative performance depending on the frequency with 
> which those script fragments are executed.  Thus, one way to look at speeding 
> things up might be to take a look at the scripts that are run most often and 
> seeing about ways to accelerate core pieces of them ... possibly by moving 
> pieces of those scripts to builtins (as C methods).
> 

Re: scripts, I have some code I put together to do arbitrary benchmarks of 
templated bro scripts.  I need to clean it up and publish it, but I found some 
interesting things.  Function calls are relatively slow, so things like

ip in Site::local_nets

is faster than calling

Site::is_local_addr(ip);

Inlining short functions could speed things up a bit.

I also found that things like

port == 22/tcp || port == 3389/tcp

is faster than checking "port in {22/tcp, 3389/tcp}", up to about 10 ports.  
Having the hash class fall back to a linear search when it contains only a few 
items could speed things up there.  Things like 'likely_server_ports' have 
1 or 2 ports in most cases.
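The fallback idea could be sketched roughly as follows: keep members in a plain 
array and scan it linearly while the set is small, and only switch to a real 
hash table past a crossover threshold.  This is a generic C++ sketch, not Bro's 
actual hash class; all the names here are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Sketch: linear scan while small, real hash table once large.
// Illustrative only -- not Bro's hash class.
class SmallPortSet {
public:
    static const std::size_t kLinearMax = 10;  // crossover suggested by the timings above

    void Insert(uint16_t port)
        {
        if ( hash_ )
            {
            hash_->insert(port);
            return;
            }

        if ( std::find(vec_.begin(), vec_.end(), port) != vec_.end() )
            return;  // already present

        vec_.push_back(port);

        if ( vec_.size() > kLinearMax )
            {
            // Grew too big: switch to hashing from now on.
            hash_.reset(new std::unordered_set<uint16_t>(vec_.begin(), vec_.end()));
            vec_.clear();
            }
        }

    bool Contains(uint16_t port) const
        {
        if ( hash_ )
            return hash_->count(port) != 0;

        return std::find(vec_.begin(), vec_.end(), port) != vec_.end();
        }

private:
    std::vector<uint16_t> vec_;                            // used while small
    std::unique_ptr<std::unordered_set<uint16_t> > hash_;  // used once large
};
```

A 'likely_server_ports'-style set with 1 or 2 entries would never leave the 
linear path in this sketch.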


> If I had to guess at one engine-related thing that would've sped things up 
> when I was profiling this stuff back in the day, it'd probably be rebuilding 
> the memory allocation strategy / management.  From what I remember, Bro does 
> do some malloc / free in the data path, which hurts quite a bit when one is 
> trying to make things go fast.  It also means that the selection of a memory 
> allocator and NUMA / per-node memory management is going to be important.  
> That's probably not going to qualify as something *small*, though ...

Ah!  This reminds me of something I was thinking about a few weeks ago.  I'm 
not sure to what extent bro uses memory allocation pools or interning for 
common immutable data structures, like port objects or small strings.  There's 
no reason bro should be mallocing/freeing memory to create port objects when 
there are only 65536 times 2 (or 3?) possible port objects... but bro does 
things like

tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport), TRANSPORT_TCP));
tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport), TRANSPORT_TCP));

for every packet, as well as allocating a ton of TYPE_COUNT vals for things 
like packet sizes and header lengths, which will almost always be between 0 
and 64k.
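Interning could be sketched as a table of preallocated objects handed out by 
pointer, so the per-packet allocation goes away entirely.  PortVal and 
PortManager below are illustrative stand-ins, not Bro's real classes:

```cpp
#include <cstdint>

// Sketch of interning ports: every possible (proto, port) pair gets
// one preallocated object.  Stand-in types, not Bro's real classes.
enum TransportProto { TRANSPORT_TCP = 0, TRANSPORT_UDP = 1, TRANSPORT_ICMP = 2 };

struct PortVal {
    uint16_t port;
    TransportProto proto;
};

class PortManager {
public:
    PortManager()
        {
        for ( int proto = 0; proto < kNumProtos; ++proto )
            for ( int p = 0; p < kNumPorts; ++p )
                {
                table_[proto][p].port = (uint16_t) p;
                table_[proto][p].proto = (TransportProto) proto;
                }
        }

    // Always returns the same object for a given (port, proto):
    // no malloc/free in the packet path.
    const PortVal* Get(uint16_t port, TransportProto proto) const
        { return &table_[proto][port]; }

private:
    static const int kNumProtos = 3;
    static const int kNumPorts = 65536;

    PortVal table_[kNumProtos][kNumPorts];  // ~768 KB, allocated once
};
```

The one-time table costs well under a megabyte, in exchange for zero 
allocations per packet on this path.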

For things that can't be interned, like IPv6 addresses, an allocation pool 
could speed things up: instead of freeing things like IPAddr objects, they 
could be returned to a pool, and when a new IPAddr object is needed, an 
already-initialized object could be grabbed from the pool and 'refreshed' 
with the new value.

https://golang.org/pkg/sync/#Pool

Talks about that sort of thing.
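A minimal C++ pool in the same spirit as Go's sync.Pool might look like the 
following.  IPAddrish and AddrPool are hypothetical stand-ins for something 
like Bro's IPAddr, not real Bro code:

```cpp
#include <cstddef>
#include <vector>

// Free-list pool: Release() parks an object for reuse instead of
// deleting it; Acquire() refreshes a parked object in place when
// one is available.
struct IPAddrish {
    unsigned char bytes[16];  // big enough for IPv6

    void Refresh(const unsigned char* b, int len)
        {
        for ( int i = 0; i < 16; ++i )
            bytes[i] = (i < len) ? b[i] : 0;
        }
};

class AddrPool {
public:
    ~AddrPool()
        {
        for ( std::size_t i = 0; i < free_.size(); ++i )
            delete free_[i];
        }

    IPAddrish* Acquire(const unsigned char* b, int len)
        {
        IPAddrish* obj;

        if ( free_.empty() )
            obj = new IPAddrish();
        else
            {
            obj = free_.back();  // reuse a parked object, skipping malloc
            free_.pop_back();
            }

        obj->Refresh(b, len);
        return obj;
        }

    void Release(IPAddrish* obj)
        { free_.push_back(obj); }  // park instead of delete

private:
    std::vector<IPAddrish*> free_;
};
```

Once the pool is warm, the steady-state packet path does no allocation at all.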

> On a related note, a fun experiment is always to try running bro with a 
> different allocator and seeing what happens ...

I recently noticed our boxes were using jemalloc instead of tcmalloc.  
Switching that caused malloc to drop a few places down in 'perf top' output.


— 
Justin Azoff





Re: [Bro-Dev] Performance Enhancements

2017-10-05 Thread Clark, Gilbert
Howdy:


Not sure about the content of the BroCon talk ... but a few years back, I did a 
bit of work on this.  There was a plugin here:


https://github.com/cubic1271/bro-plugin-instrumentation


that allowed me to profile the execution of various bro scripts and figure out 
what was eating the most time.  It also added a pretty braindead mechanism to 
expose script variables through a REST interface, which I wrapped in an HTML5 
UI to get some real-time statistics ... though I have no idea where that code 
went.


I also threw together this:


https://github.com/cubic1271/pybrig


which was intended to benchmark Bro on a specific platform, the idea being to 
get results that were relatively consistent.  It could make some pretty 
pictures, which was kind of neat ... but I'd probably do things a lot 
differently if I had it to do over again :)

I'll note that one of the challenges with profiling is that there are the bro 
scripts, and then there is the bro engine.  The scripting layer has a 
completely different set of optimizations that might make sense than the engine 
does: turning off / turning on / tweaking different scripts can have a huge 
impact on Bro's relative performance depending on the frequency with which 
those script fragments are executed.  Thus, one way to look at speeding things 
up might be to take a look at the scripts that are run most often and seeing 
about ways to accelerate core pieces of them ... possibly by moving pieces of 
those scripts to builtins (as C methods).


If I had to guess at one engine-related thing that would've sped things up when 
I was profiling this stuff back in the day, it'd probably be rebuilding the 
memory allocation strategy / management.  From what I remember, Bro does do 
some malloc / free in the data path, which hurts quite a bit when one is trying 
to make things go fast.  It also means that the selection of a memory allocator 
and NUMA / per-node memory management is going to be important.  That's 
probably not going to qualify as something *small*, though ...


On a related note, a fun experiment is always to try running bro with a 
different allocator and seeing what happens ...


Another thing that (I found) got me a few percentage points for more-or-less 
free was profile-guided optimization: I ran bro first with profiling enabled 
against a representative data set, then rebuilt it against the profile I 
collected.  Of course, your mileage may vary ...


Anyway, hope something in there was useful.


Cheers,

Gilbert Clark



From: bro-dev-boun...@bro.org <bro-dev-boun...@bro.org> on behalf of Jim 
Mellander <jmellan...@lbl.gov>
Sent: Thursday, October 5, 2017 3:45:21 PM
To: bro-dev@bro.org
Subject: [Bro-Dev] Performance Enhancements

Hi all:

One item of particular interest to me from BroCon was this tidbit from 
Packetsled's lightning talk:

"Optimizing core loops (like net_run() ) with preprocessor branch prediction 
macros likely() and unlikely() for ~3% speedup. We optimize for maximum load."

After conversing with Leo Linsky of Packetsled, I wanted to initiate a 
conversation about performance improvements that may be within fairly easy 
reach:

1. Obviously, branch prediction, as mentioned above.  3% speedup for (almost) 
free is nothing to sneeze at.
2. Profiling bro to identify other hot spots that could benefit from 
optimization.
3. Best practices for compiling Bro (compiler options, etc.)
4. Data structure revisit (hash functions, perhaps?)
etc.
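For reference, the likely()/unlikely() macros mentioned in the quoted talk are 
conventionally thin wrappers around GCC/Clang's __builtin_expect; a minimal 
sketch (process() is a made-up hot-path function for illustration, not Bro's 
net_run()):

```cpp
// Conventional branch-prediction hint macros: GCC/Clang builtins,
// compiled out on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

// Hypothetical hot-path function: the error check is marked unlikely
// so the compiler lays out the common case as the fall-through path.
int process(const unsigned char* pkt, int len)
    {
    if ( unlikely(! pkt || len <= 0) )
        return -1;  // rare error path

    int sum = 0;

    for ( int i = 0; i < len; ++i )
        sum += pkt[i];

    return sum;
    }
```

The hints only pay off when the prediction matches reality, which is why they 
belong on genuinely rare paths like the error check above.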

Perhaps the Bro core team is working on some, all, or a lot more in this area.  
It might be nice to get the Bro community involved too.  Is anyone else 
interested?

Jim Mellander
ESNet






Re: [Bro-Dev] Performance Enhancements

2017-10-05 Thread Slagell, Adam J


On Oct 5, 2017, at 2:45 PM, Jim Mellander wrote:

1. Obviously, branch prediction, as mentioned above.  3% speedup for (almost) 
free is nothing to sneeze at.
2. Profiling bro to identify other hot spots that could benefit from 
optimization.
3. Best practices for compiling Bro (compiler options, etc.)
4. Data structure revisit (hash functions, perhaps?)


Jon Siwek was optimizing the main event loop last February, but I believe it 
could only go so far without the new Broker API being integrated. Also, I 
believe there is a need to move off of the select() function. Anyway, there is 
definitely a lot of optimization that could be made there.

--

Adam J. Slagell
Director, Cybersecurity & Networking Division
Chief Information Security Officer
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
www.slagell.info

"Under the Illinois Freedom of Information Act (FOIA), any written 
communication to or from University employees regarding University business is 
a public record and may be subject to public disclosure."










[Bro-Dev] Performance Enhancements

2017-10-05 Thread Jim Mellander
Hi all:

One item of particular interest to me from BroCon was this tidbit from
Packetsled's lightning talk:

"Optimizing core loops (like net_run() ) with preprocessor branch
prediction macros likely() and unlikely() for ~3% speedup. We optimize for
maximum load."


After conversing with Leo Linsky of Packetsled, I wanted to initiate a
conversation about performance improvements that may be within fairly
easy reach:


1. Obviously, branch prediction, as mentioned above.  3% speedup for
(almost) free is nothing to sneeze at.

2. Profiling bro to identify other hot spots that could benefit from
optimization.

3. Best practices for compiling Bro (compiler options, etc.)

4. Data structure revisit (hash functions, perhaps?)

etc.


Perhaps the Bro core team is working on some, all, or a lot more in this
area.  It might be nice to get the Bro community involved too.  Is anyone
else interested?


Jim Mellander

ESNet