> the benchmark no longer
> terminates and the server quickly stops getting data, and I would like
> to know why.

I'll have a look at it.

> I've tried various parameters for the scheduler throughput, but they do
> not seem to make a difference. Would you mind taking a look at what's
> going on here?

The scheduler's throughput parameter does not apply to network inputs; it only 
changes how many integers the server processes per scheduler run. You could 
additionally try tweaking caf#middleman.max_consecutive_reads, which configures 
how many new_data_msg messages a broker receives from the backend in a single 
shot. It makes sense to keep the two separate, because one configures fairness 
in scheduling and the other fairness in connection multiplexing.
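
For completeness, here is a minimal sketch of how one might set both knobs 
programmatically. The option keys follow the names above, but the exact 
spelling and configuration API differ between CAF versions, so treat this as an 
illustration rather than copy-paste material:

    // Sketch only: wiring both options into an actor_system_config.
    // Option keys and the set()/parse() API vary across CAF versions.
    #include "caf/actor_system.hpp"
    #include "caf/actor_system_config.hpp"
    #include "caf/io/middleman.hpp"

    int main(int argc, char** argv) {
      caf::actor_system_config cfg;
      cfg.load<caf::io::middleman>();                   // enable the I/O module
      cfg.set("scheduler.max-throughput", 300u);        // messages per actor invocation
      cfg.set("middleman.max-consecutive-reads", 50u);  // new_data_msg per read cycle
      cfg.parse(argc, argv);                            // command line still wins
      caf::actor_system sys{cfg};
      // ... spawn the benchmark broker/actors here ...
    }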

> It looks like the "sender overload protection" you
> mentioned is not working as expected.

The new feature "merely" allows (CAF) brokers to receive messages from the 
backend whenever data has been transferred. This basically lifts TCP's 
backpressure up to the broker. If an application blindly throws messages at 
remote actors, there's nothing CAF can do about that. However, the new broker 
feedback will be one piece of the puzzle when implementing flow control in CAF 
later on.

> I'm also attaching a new gperftools profiler output from the client and
> server. The server is not too telling, because it was spinning idle for
> a bit until I ran the client, hence the high CPU load in nanosleep.
> Looking at the client, it seems that only 67.3% of time is spent in
> local_actor::resume, which would mean that the runtime adds 33.7%
> overhead.

The call to resume() happens in the BASP broker, which dumps the messages to its 
output buffer, so the 67% load includes serialization, etc. Of the remaining 
load, 28.3% accumulates in main().

> Still, why is intrusive_ptr::get consuming 27.9%?

The 27.9% is accumulating all load down the path, isn't it? intrusive_ptr::get 
itself simply returns a pointer: 
https://github.com/actor-framework/actor-framework/blob/d5f43de65c42a74afa4c979ae4f60292f71e371f/libcaf_core/caf/intrusive_ptr.hpp#L128
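
For reference, the linked line boils down to something like this (paraphrasing 
the header):

    // intrusive_ptr::get just hands out the stored raw pointer;
    // no reference count is touched here.
    pointer get() const { return ptr_; }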

> Looking on the left tree, it looks like this workload stresses the
> allocator heavily: 
> 
>    - 20.4% tc_malloc_skip_new_handler 
>    - 7% std::vector::insert in the BASP broker
>    - 13.5% CAF serialization (adding two out-edges from
>            basp::instance::write, 5.8 + 7.5)

Not really surprising. You are sending integers around. Each integer has to be 
wrapped in a heap-allocated message which gets enqueued to an actor's mailbox. 
By using many small messages, you basically maximize the messaging overhead.
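
As an intuition (hypothetical snippet, not code from the benchmark): sending 
every integer on its own pays the full message allocation and enqueue cost per 
value, while shipping a container of integers pays it once per batch.

    // Hypothetical illustration of per-message overhead vs. batching.
    // 'sink' is an assumed remote actor; assumes std::vector<int> is a
    // registered/announced message type in the CAF version at hand.
    #include "caf/all.hpp"
    #include <vector>

    void ship(caf::event_based_actor* self, const caf::actor& sink,
              const std::vector<int>& xs) {
      // pattern 1: one heap-allocated message + one mailbox enqueue per integer
      for (int x : xs)
        self->send(sink, x);
      // pattern 2: one message for the whole batch, overhead amortized
      self->send(sink, xs);
    }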

> Switching gears to your own performance measurements: it sounded like
> that you got gains at the order 400% when comparing just raw byte
> throughput (as opposed to message throughput). Can you give us an
> intuition how that relates to the throughput measurements we have been
> doing?

At the lowest level, a framework like CAF ultimately needs to efficiently 
manage the buffers and events provided by the OS, i.e., recv/send/poll/epoll 
and friends. That layer is what I was looking at, since you can't get good 
performance if you have problems down there (which, as it turned out, CAF had).
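
For illustration, the kind of machinery I mean is roughly this: a bare-bones 
epoll read loop, not CAF's actual multiplexer code.

    // Bare-bones epoll read loop: register one socket, wait for readiness,
    // drain whatever the kernel has buffered. Illustration only.
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void read_loop(int sockfd) {
      int epfd = epoll_create1(0);
      epoll_event ev{};
      ev.events = EPOLLIN;
      ev.data.fd = sockfd;
      epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);
      char buf[4096];
      for (;;) {
        epoll_event ready[16];
        int n = epoll_wait(epfd, ready, 16, -1);
        for (int i = 0; i < n; ++i) {
          if (ready[i].events & EPOLLIN) {
            ssize_t got = recv(ready[i].data.fd, buf, sizeof(buf), 0);
            if (got <= 0) {           // peer closed or error
              close(epfd);
              return;
            }
            // hand 'got' bytes to the next layer (framing, deserialization, ...)
          }
        }
      }
    }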

Moving a few layers up, some overhead is inherent in a messaging framework, for 
example stressing the heap when sending many small messages (see the 20% load 
in tc_malloc_skip_new_handler).

From the gperf output (just looking at the client), I don't see that much CPU 
time spent in CAF itself. If I sum up CPU load from std::vector (6.2%), 
tcmalloc (20.4%), atomics (8%) and serialization (12.4%), I'm already at 47% 
out of 70% total for the multiplexer (default_multiplexer::run).

Pattern matching (caf::detail::try_match) causes less than 6% CPU load, so that 
does not seem to be an issue. Serialization accounts for 12% CPU load, which 
probably results mostly from std::copy (the profile is unfortunately cut off 
after std::function). So I don't see many optimization opportunities in these 
components.

Tackling the "many small messages problem" isn't going to be easy. CAF could 
try to wrap multiple messages from the network into a single heap-allocated 
storage that is then shipped to an actor as a whole, but this optimization 
would have a high complexity.
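
To illustrate the idea only (purely hypothetical, none of this exists in CAF): 
the broker would drain many integers from the receive buffer into a single 
batch object and enqueue that once, instead of producing one message per value.

    // Purely hypothetical sketch of network-side batching; nothing here is
    // existing CAF API. Ignores endianness and framing for brevity.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<int32_t> drain_batch(const char* buf, std::size_t num_bytes) {
      std::vector<int32_t> batch(num_bytes / sizeof(int32_t));
      std::memcpy(batch.data(), buf, batch.size() * sizeof(int32_t));
      return batch; // one allocation for the whole batch instead of one per value
    }
    // The broker would then enqueue 'batch' to the destination actor as a
    // single message instead of num_bytes / 4 individual ones.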

That's of course just some thoughts after looking at the gperf output you 
provided. I'll hopefully have new insights after looking at the termination 
problem in detail.

    Dominik