Yeah, the lockless implementation has a bug:

    if (size)

should be

    if (size & 1)

I ended up writing a checksum routine that sums 32 bits at a time into a
64-bit register, which avoids the need to check for overflow. It seems to
be faster than the full 64-bit implementation. I will test with Bro and
report results.

On Thu, Oct 12, 2017 at 3:08 PM, Azoff, Justin S <[email protected]> wrote:

> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <[email protected]> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> > information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> > that we're in a 64-bit world that could prove useful - the one's
> > complement checksum calculation in net_util.cc is one that comes to
> > mind, especially since it works effectively a byte at a time (and
> > works with even byte counts only). Seeing as this is done per-packet
> > on all tcp payload, optimizing this seems reasonable. Here's a
> > discussion of doing the checksum calc in 64-bit arithmetic:
> > https://locklessinc.com/articles/tcp_checksum/
>
> So I still haven't gotten this to work, but I did some more tests that I
> think show it is worthwhile to look into replacing this function.
>
> I generated a large pcap of a 3 minute iperf run:
>
> $ du -hs iperf.pcap
> 9.6G iperf.pcap
> $ tcpdump -n -r iperf.pcap |wc -l
> reading from file iperf.pcap, link-type EN10MB (Ethernet)
> 7497698
>
> Then ran either `bro -Cbr` or `bro -br` on it 5 times and track runtime
> as well as cpu instructions reported by `perf`:
>
> $ python2 bench.py 5 bro -Cbr iperf.pcap
> 15.19 49947664388
> 15.66 49947827678
> 15.74 49947853306
> 15.66 49949603644
> 15.42 49951191958
> elapsed
> Min 15.18678689
> Max 15.7425909042
> Avg 15.5343231678
>
> instructions
> Min 49947664388
> Max 49951191958
> Avg 49948828194
>
> $ python2 bench.py 5 bro -br iperf.pcap
> 20.82 95502327077
> 21.31 95489729078
> 20.52 95483242217
> 21.45 95499193001
> 21.32 95498830971
> elapsed
> Min 20.5184400082
> Max 21.4452238083
> Avg 21.083449173
>
> instructions
> Min 95483242217
> Max 95502327077
> Avg 95494664468
>
> So this shows that for every ~7,500,000 packets bro processes, almost 5
> seconds is spent computing checksums.
>
> According to https://locklessinc.com/articles/tcp_checksum/, they run
> their benchmark 2^24 times (16,777,216) which is about 2.2 times as many
> packets.
>
> Their runtime starts out at about 11s, which puts it in line with the
> current implementation in bro. The other implementations they show are
> between 7 and 10x faster depending on packet size. A 90% drop in time
> spent computing checksums would be a noticeable improvement.
>
> Unfortunately I couldn't get their implementation to work inside of bro
> and get the right result, and even if I could, it's not clear what the
> license for the code is.
>
> --
> Justin Azoff
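
For reference, here is a minimal sketch of the idea mentioned at the top of this message: add 32-bit words into a 64-bit accumulator and fold the carries only once at the end. It is an illustration only, not the routine that was tested and not the code in net_util.cc. The function name ones_complement_sum32 is hypothetical. It assumes the caller wants the folded 16-bit sum in host memory order (so storing it yields network byte order) and will complement it separately.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: folded one's complement sum of `size` bytes.
 * The result is in host memory order, so writing it back to memory
 * yields the network-order value. The caller still has to complement it. */
uint16_t ones_complement_sum32(const void *data, size_t size)
{
    const unsigned char *p = data;
    uint64_t sum = 0;

    /* Main loop: 32 bits per step. The 64-bit accumulator has room for
     * 2^32 such additions, so no per-step carry check is needed. */
    while (size >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* safe for unaligned input */
        sum += w;
        p += 4;
        size -= 4;
    }

    /* Tail: at most 3 bytes remain. */
    if (size & 2) {
        uint16_t h;
        memcpy(&h, p, sizeof h);
        sum += h;
        p += 2;
    }
    if (size & 1) {                /* odd trailing byte, zero-padded */
        unsigned char pad[2] = { *p, 0 };
        uint16_t h;
        memcpy(&h, pad, sizeof h);
        sum += h;
    }

    /* Fold 64 -> 32 -> 16 bits with end-around carry. */
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Note that the last tail test is (size & 1), which is the same condition as the fix described above. The memcpy loads keep the sketch free of alignment assumptions; whether a compiler turns them into single loads is something to confirm by benchmarking.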
_______________________________________________ bro-dev mailing list [email protected] http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
