Sorry, I screwed up the command line (should wake up next time before
attempting this). Here are the log messages:
======= /tmp/log.r453 =======
get rid of code duplication from xrl performance tests
======= /tmp/log.r454 =======
add a test for measuring XRL delays
======= /tmp/log.r456 =======
Get rid of 3 system calls to time() - 6% improvement in XRL performance. Before
we would call time:
1) Top of event loop for sanity check
2) Top of event loop for time infrastructure.
3) When checking for expired timers.
4) After select.
5) Bottom of event loop for sanity check.
The first three calls return the same time and can be simplified to one call.
The 5th call isn't strictly necessary - we can either use reading #1 or #4. We
now call time twice:
1) Top of event loop.
2) After select.
We can probably get away with killing #1, or in some cases, #2 (e.g., small
select timeout).
======= /tmp/log.r457 =======
get rid of superfluous select calls [+4%]. Before we'd call select 3x for each
send / receive. Now only once. Furthermore, I call time() after select only
if select had a timeout greater than 0. It is common to call select with a
timeout of 0, so this often spares us a call to time(). To send / receive, we
now do the following syscalls:
1) time. Start of event loop.
2) select.
3) read / write
4) time, only if select had a timeout. This call is not performed when there is
data queued, and when there's a high throughput of XRLs, which is good
because that's exactly when we need spare cycles to do work.
Before it was:
1) time.
2) time.
3) time.
4) select.
5) select.
6) select.
7) time.
8) read / write.
9) time.
The overall syscall reduction got us 11%. From 10961 xrls/s, to 12202 xrls/s.
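
The new syscall sequence can be sketched roughly as below. This is only an illustration and not the actual eventloop code: LoopClock and run_once are made-up names, and gettimeofday() stands in for whatever the loop really uses to read the time.

```cpp
#include <sys/select.h>
#include <sys/time.h>

// Hypothetical helper for illustration; not XORP's actual clock class.
struct LoopClock {
    timeval now{};
    int reads = 0;  // number of gettimeofday() calls made so far

    void refresh() {
        gettimeofday(&now, nullptr);
        ++reads;
    }
};

// One pass of a simplified event loop. Returns select()'s result.
inline int run_once(LoopClock& clock, int nfds, fd_set* rfds, fd_set* wfds,
                    timeval timeout) {
    clock.refresh();  // 1) time, start of event loop
    const bool may_block = timeout.tv_sec != 0 || timeout.tv_usec != 0;
    const int ready = select(nfds, rfds, wfds, nullptr, &timeout);  // 2) select
    // 3) read / write on the ready descriptors would happen here.
    if (may_block)
        clock.refresh();  // 4) time, only if select had a non-zero timeout
    return ready;
}
```

With a zero timeout the pass reads the clock once; with a non-zero timeout it reads it twice.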
======= /tmp/log.r458 =======
Cache the result of Xrl::string_no_args [+13%]. I wanted to commit this
separately to show how much potential for improvement there is by eliminating
strings. Just by caching the result of an (unnecessarily) heavily used function,
we get +13%.
have fun.
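
A minimal sketch of the caching idea, assuming the string is built from a protocol, target and command. XrlSketch, its members and the "protocol://target/command" layout are stand-ins for illustration, not the real Xrl class.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for an XRL; the real Xrl class has more state.
class XrlSketch {
public:
    XrlSketch(std::string protocol, std::string target, std::string command)
        : _protocol(std::move(protocol)), _target(std::move(target)),
          _command(std::move(command)) {}

    // Builds "protocol://target/command" on first use, then returns the
    // cached copy on every later call.
    const std::string& string_no_args() const {
        if (_cached.empty()) {
            _cached = _protocol + "://" + _target + "/" + _command;
            ++_builds;
        }
        return _cached;
    }

    // How many times the string was actually built (for illustration).
    int builds() const { return _builds; }

private:
    std::string _protocol, _target, _command;
    mutable std::string _cached;  // empty means "not built yet"
    mutable int _builds = 0;
};
```

The cache would need invalidating if any of the three parts could change after construction; in this sketch they cannot.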
======= /tmp/log.r459 =======
Avoid constructing temporary XRL atoms, and counting the packed size twice
[+3%]. There are many XRL atom copies out there - we should get quite an
improvement if we get rid of them.
======= /tmp/log.r460 =======
Reuse XRL objects across calls rather than creating multiple ones per call
[+13%]. This mitigates string usage and caches results for subsequent uses.
======= /tmp/log.r462 =======
Have long-lived XRL atoms rather than new copies for each XRL [+5%].
======= /tmp/log.r464 =======
pre-increment operator for performance - spotted by pavlin
======= /tmp/log.r471 =======
cache xrls and atoms on the receive path too [+11%]. Can now do 18K XRLs/s.
======= /tmp/log.r475 =======
get rid of profiling stuff
======= /tmp/log.r476 =======
Speed up protocol check and xrl resolution; improve data structures; use
positioning to determine argument (rather than name). Name should still work.
Overall improvement: 8%. We can now do 20k XRLs / sec.
======= /tmp/log.r478 =======
fix obvious bug - allow sending xrls to different targets
======= /tmp/log.r480 =======
add more profiling support to libxorp. This is intended for fast measurements
that should not significantly slow down xorp.
======= /tmp/log.r484 =======
get rid of more superfluous calls to time() [+16%]. The second one may need to
stay. I think that the deal should be as follows:
1) current_time() should be used for obtaining the last best known time. This
isn't precise but will typically be within 1 second of error.
2) If any code needs critical timing information, it should call
advance_time() [slow!].
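
The proposed split could look something like the sketch below. Only the names current_time() and advance_time() come from the message above; the CachedClock class, the use of std::chrono and the read counter are assumptions made for illustration.

```cpp
#include <chrono>

// Hypothetical cached clock; not the actual libxorp implementation.
class CachedClock {
public:
    using TimePoint = std::chrono::steady_clock::time_point;

    CachedClock() { advance_time(); }

    // Cheap: returns the last known time without touching the OS clock.
    const TimePoint& current_time() const { return _now; }

    // Slow path: re-reads the OS clock and updates the cached value.
    const TimePoint& advance_time() {
        _now = std::chrono::steady_clock::now();
        ++_clock_reads;
        return _now;
    }

    // Number of real clock reads so far (for illustration).
    int clock_reads() const { return _clock_reads; }

private:
    TimePoint _now;
    int _clock_reads = 0;
};
```

The event loop would call advance_time() at well-defined points, and everything else would settle for current_time().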
======= /tmp/log.r485 =======
get rid of two signal calls per write [+1%]. Every little helps =P
======= /tmp/log.r486 =======
Fix a small "bug". Run timers that have expired exactly now - the rest of the
code assumes this will happen, but it actually doesn't. In practice it means
things are a bit slower because we need to execute the eventloop twice each time
a timer expires exactly. (This will be important for a future commit of mine.)
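
The fix boils down to an inclusive comparison against the current time. A rough sketch, with a made-up TimerQueue keyed on integer ticks rather than the real timer list:

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Ticks = long long;  // hypothetical time unit for the sketch

// Hypothetical timer queue; not the actual libxorp timer list.
class TimerQueue {
public:
    void schedule(Ticks expiry, int id) { _timers.emplace(expiry, id); }

    // Fires every timer with expiry <= now. upper_bound() returns the first
    // entry strictly later than now, so timers expiring exactly now are
    // included; lower_bound() here would leave them for the next loop run.
    std::vector<int> run_expired(Ticks now) {
        std::vector<int> fired;
        const auto end = _timers.upper_bound(now);
        for (auto it = _timers.begin(); it != end; ++it)
            fired.push_back(it->second);
        _timers.erase(_timers.begin(), end);
        return fired;
    }

    std::size_t pending() const { return _timers.size(); }

private:
    std::multimap<Ticks, int> _timers;  // expiry -> timer id
};
```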
======= /tmp/log.r487 =======
get rid of superfluous calls to select. tighten event loop (i.e., run event loop
until there is nothing left to do).
overall improvement +19%. We can now do 33K xrl/s. It's the first time that
we're actually spending time writing / reading and not figuring out what time it
is, selecting, or building XRLs =D. w00t.
======= /tmp/log.r488 =======
respect the priority of selectors & tasks. We now dispatch only 1 selector at a
time, and then recompute priorities. this got a 2% improvement in performance
by the way.
======= /tmp/log.r489 =======
relax eventloop a bit. Same performance, but safer. We need a better solution
though. I don't like the old event loop, but I don't like the new one either (I
prefer it slightly though =D). I'll think about it carefully tomorrow.
======= /tmp/log.r492 =======
All tests pass with new XRL code. The only hack was adding a 1 second sleep to
olsr's test_routing1.py test_c4_partial test. I need to understand this better
but I think that it has to do with the simulator trying to keep events
synchronized with time and assuming that each eventloop run is a "single
iteration" - this no longer holds with the aggressive eventloop. We need to
decide whether we want this new eventloop (+26% performance with XRL benchmark)
or whether we should keep the old one for simplicity and safety.
======= /tmp/log.r493 =======
support for XRL batching across eventloop runs. You can now do:
xrl_router.batch_start("target name");
xrl_router.send()
xrl_router.send()
...
xrl_router.batch_end("target name");
This will batch all the sends in one single writev call. (Note that this is
currently limited to 16 by asyncio.cc's max_coalesce parameter - it should be
increased if there's no real reason for the 16.)
The only caveat in the current implementation is that you must send a "standard"
XRL [not batched] to the target before you can use batching for that target.
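
For reference, coalescing queued sends into one writev call looks roughly like the sketch below. flush_batch and kMaxCoalesce are made-up names (the constant just mirrors the limit of 16 mentioned above), and this is not the asyncio.cc code.

```cpp
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical constant mirroring the coalescing limit of 16.
constexpr std::size_t kMaxCoalesce = 16;

// Writes up to kMaxCoalesce queued messages with a single writev() call.
// Returns the number of bytes written, or -1 on error. A real
// implementation would also handle short writes and any leftover messages.
inline ssize_t flush_batch(int fd, const std::vector<std::string>& queued) {
    const std::size_t n = std::min(queued.size(), kMaxCoalesce);
    if (n == 0)
        return 0;  // nothing queued, skip the syscall entirely

    iovec iov[kMaxCoalesce];
    for (std::size_t i = 0; i < n; ++i) {
        // writev() does not modify the buffers; the cast is needed only
        // because iov_base is a non-const void*.
        iov[i].iov_base = const_cast<char*>(queued[i].data());
        iov[i].iov_len = queued[i].size();
    }
    return writev(fd, iov, static_cast<int>(n));
}
```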
I've changed the XRL type field to 1 byte instead of 2. The other byte is now
"flags" and batching is an example of a flag.
======= /tmp/log.r494 =======
make the eventloop's "aggressiveness" a parameter. I've added a command line
argument to tune this in the XRL benchmark. Try running with -a 0 ["old"
eventloop] and -a 5 ["aggressive" eventloop]. It makes a 20% difference on my
box: from 27K XRL/s to 32K XRL/s.
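
One way to picture the parameter, purely as an assumption about its meaning: the loop always makes one pass, and may make up to "aggressiveness" extra passes while there is still work. run_aggressive and the run_once callback are made-up names.

```cpp
#include <functional>

// Runs one event loop pass, then up to `aggressiveness` extra passes for as
// long as the previous pass reported that it did some work. Returns the
// number of passes made. Hypothetical sketch; the real -a semantics may
// differ.
inline int run_aggressive(const std::function<bool()>& run_once,
                          int aggressiveness) {
    int passes = 0;
    bool did_work = true;
    do {
        did_work = run_once();
        ++passes;
    } while (did_work && passes <= aggressiveness);
    return passes;
}
```

Under this reading, 0 gives the "old" behaviour of a single pass per run, and larger values trade fairness to the caller for fewer trips around the outer loop.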
======= /tmp/log.r495 =======
by default, use the old eventloop.
======= /tmp/log.r496 =======
basic UNIX socket XRL transport. 6% improvement.
======= /tmp/log.r497 =======
try to use unix sockets if possible, then tcp
======= /tmp/log.r498 =======
allow one to choose the UNIX PF via the XORP_PF environment variable. This
is useful in case we want to disable unix sockets by default.
======= /tmp/log.r499 =======
the creation of unix sockets is now a parameter - on by default
_______________________________________________
Xorp-cvs mailing list
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-cvs