>> Okay. In the future, we probably need some form of
>> "serialization-free" batching mechanism to ship data more efficiently.
>
> Do you guys have a sense of how the load splits up between serialization
> and batching/communication? My hope has been that batching itself can
> take care of the performance issues, so that we'll be able to send
> logs as standard CAF messages, each one representing a batch of N log
> lines. The benchmark I created a little while ago to examine that
> wasn't able to get the necessary performance out of Broker/CAF
> (hence the fallback to Bro's old serialization of log messages for
> now, sent over CAF). But iirc, the conclusion was that there's still
> room for improvement in CAF that should make this feasible
> eventually. However, if you guys believe it's really CAF's
> serialization that's the bottleneck, then we'll need to come up with
> something else indeed.

I think there are a few orthogonal aspects being merged together here:
(1) memory-mapping, (2) batching, and (3) the performance of CAF's
serialization.

1) Matthias threw in memory-mapping, but I’m not sure this is actually
feasible for you. The main benefit would be a unified representation in
memory, on disk, and on the wire, yet I think you’re still going to keep
the ASCII log output format for Bro logs. A memory-mapped format would
also mean dropping the current broker::data API entirely, and my hunch
is that you would rather not break the API immediately after releasing
it to the public.
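
Just to make the idea concrete, here is a toy of what a
"serialization-free" flat record looks like. Everything in it is made up
(names, fields, sizes), and a real format would additionally need
versioning, endianness rules, and variable-size fields; the point is
only that the in-memory layout is also the disk and wire layout, so
there is no encode/decode step:

#include <cstdint>
#include <cstring>
#include <vector>

// Toy flat record: fixed size, no pointers, no heap. Made-up fields.
struct flat_record {
  std::uint64_t timestamp;
  std::uint32_t stream_id;
  char host[16];
};

int main() {
  flat_record r{1524000000, 7, "10.0.0.1"};
  // "Send": shipping the record is a plain byte copy ...
  std::vector<char> wire(sizeof r);
  std::memcpy(wire.data(), &r, sizeof r);
  // ... and "receiving" needs no deserialization step at all.
  flat_record copy;
  std::memcpy(&copy, wire.data(), sizeof copy);
}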

2) CAF already does batching, so ideally Broker should not need any
additional batching on top of it. In fact, doing the batching in user
code greatly diminishes the effectiveness of CAF's own batching: once
the elements are pre-batched, CAF can no longer break up chunks on its
own to make efficient use of resources.
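
As a toy stand-in for what the stream layer does internally (this is
not CAF's API, just an illustration): if each stream element is a
single log entry, the transport can cut batches of whatever size the
currently available credit allows, whereas if each element is already a
std::vector of entries, it can only forward whole vectors.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Made-up log record; the real Bro/Broker types differ.
struct log_entry {
  std::string line;
};

// Stand-in for internal re-chunking: cut batches sized to the credit
// currently available on the wire.
std::vector<std::vector<log_entry>>
rechunk(const std::vector<log_entry>& buf, std::size_t credit) {
  std::vector<std::vector<log_entry>> batches;
  for (std::size_t i = 0; i < buf.size(); i += credit) {
    auto n = std::min(credit, buf.size() - i);
    batches.emplace_back(buf.begin() + i, buf.begin() + i + n);
  }
  return batches;
}

int main() {
  std::vector<log_entry> buf(10, log_entry{"1524000000.123 conn ..."});
  // With one element per log line, any batch size works; with
  // pre-batched vectors as elements, this re-chunking is impossible
  // without copying.
  for (std::size_t credit : {std::size_t{3}, std::size_t{5}})
    std::cout << rechunk(buf, credit).size() << " batches at credit "
              << credit << '\n';
}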

3) Serialization itself should really not be a bottleneck. The costly
parts are shuffling bytes around in buffers and the heap allocations
when deserializing a broker::data, and there's no way around those two
costs. Do you still remember what showed up during your investigation
that made you go with the blob? Because what I see as a *much* bigger
issue is *copying* overhead, not serialization. CAF streams assume that
individual elements are cheap to copy, so a copy-on-write optimization
for broker::data would probably have a much higher impact on
performance (it's also straightforward to implement, and CAF has
re-usable pieces for that).
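
The core of such a wrapper is small. A minimal sketch using
std::shared_ptr (Broker would presumably build on CAF's re-usable
pieces rather than this exact shape):

#include <memory>
#include <string>
#include <utility>
#include <vector>

// Minimal copy-on-write wrapper sketch, not Broker's actual API.
template <class T>
class cow {
public:
  explicit cow(T value) : ptr_(std::make_shared<T>(std::move(value))) {}

  // Copying only bumps a reference count; the payload stays shared.
  cow(const cow&) = default;

  // Readers never trigger a copy.
  const T& get() const { return *ptr_; }

  // Writers clone the payload, and only if it is actually shared.
  T& unshared() {
    if (ptr_.use_count() > 1)
      ptr_ = std::make_shared<T>(*ptr_);
    return *ptr_;
  }

private:
  std::shared_ptr<T> ptr_;
};

int main() {
  cow<std::vector<std::string>> a(std::vector<std::string>{"conn", "dns"});
  auto b = a;                     // O(1): no deep copy on the hot path
  b.unshared().push_back("http"); // detaches b; a stays untouched
}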

If serialization still shows up with unreasonable costs in a profiler,
however, there are ways to speed things up. The customization point is
a specialized inspect() overload for broker::data that essentially
allows you to apply whatever optimizations you want (and that might be
used in Bro's framework).
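
For reference, the general shape of that customization point, shown
with a made-up record type and a toy inspector (the exact inspector
signature varies across CAF versions, so treat this as a sketch):

#include <cstdint>
#include <iostream>
#include <string>

// Made-up record standing in for broker::data (which is a variant).
struct my_data {
  std::uint64_t id;
  std::string payload;
};

// CAF picks up free inspect() overloads for user types. A specialized
// overload for broker::data could, e.g., emit the payload as one blob
// instead of recursing through the variant.
template <class Inspector>
typename Inspector::result_type inspect(Inspector& f, my_data& x) {
  return f(x.id, x.payload);
}

// Toy inspector that just prints the fields, to show the call shape.
struct print_inspector {
  using result_type = void;
  template <class... Ts>
  void operator()(const Ts&... xs) {
    ((std::cout << xs << ' '), ...);
    std::cout << '\n';
  }
};

int main() {
  my_data x{42, "conn.log line"};
  print_inspector f;
  inspect(f, x); // a serializer would call this during (de)serialization
}
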
I hope we’re not talking past each other. :)

An in-depth performance analysis of Broker's streaming layer has been
on my todo list for months at this point. I hope to get something done
before the Bro Workshop in Europe; then we can hopefully discuss this
with some reliable data in person.

Dominik