Hi folks,

I'm writing here to share some thoughts related to the Artemis threading
model and how it affects broker scalability.

Currently (on 2.7.0) we rely on a shared thread pool, i.e.
ActiveMQThreadPoolExecutor backed by a LinkedBlockingQueue-ish queue, to
process tasks.
Thanks to the Actor abstraction we use a lock-free queue to serialize
tasks (or items), processing them in batches on the shared thread pool and
waking a consumer thread only if needed (the logic is contained in
ProcessorBase).
The wake-up operation (i.e. ProcessorBase::onAddedTaskIfNotRunning) will
execute on the shared thread pool a specific task to drain and execute a
batch of tasks only when necessary, not on every added task/item.
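
To make the mechanism clearer, here is a rough sketch of the idea (names
and structure are mine, it is not the actual ProcessorBase/Actor code):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Sketch of the "wake a consumer only if not already running" pattern.
final class ActorSketch<T> {
   private final Queue<T> items = new ConcurrentLinkedQueue<>(); // lock-free item queue
   private final AtomicBoolean scheduled = new AtomicBoolean();
   private final Executor sharedPool;
   private final Consumer<T> handler;

   ActorSketch(Executor sharedPool, Consumer<T> handler) {
      this.sharedPool = sharedPool;
      this.handler = handler;
   }

   void act(T item) {
      items.offer(item);
      // submit a drain task to the shared pool only on the idle -> running transition
      if (scheduled.compareAndSet(false, true)) {
         sharedPool.execute(this::drain);
      }
   }

   private void drain() {
      T item;
      while ((item = items.poll()) != null) {   // process the queued items in a batch
         handler.accept(item);
      }
      scheduled.set(false);
      // re-check: an item may have arrived after the last poll but before clearing the flag
      if (!items.isEmpty() && scheduled.compareAndSet(false, true)) {
         sharedPool.execute(this::drain);
      }
   }
}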

Looking at the contention graphs of the broker (i.e. the bar width is the
number of nanoseconds spent before entering a lock), the limitation of the
current implementation is quite clear:

[image: image.png]

In violet are shown the offer and poll operations on the
LinkedBlockingQueue of the shared thread pool, happening from any thread of
the pool (the thread is the base of each bar, in red).
LinkedBlockingQueue indeed uses ReentrantLocks to protect any operation on
the linked queue, and it is clear that putting a big lock in front of a
high-contention point won't scale.

The above graph has been obtained with a single producer/single
consumer/single queue/non-persistent run, but I don't have enough resources
to check what could happen with more and more producers/consumers/queues.
The critical part is the offering/polling of tasks on the shared thread
pool: in theory a maxed-out broker shouldn't have many idle threads to wake
up, but given that more producers/consumers/queues mean many different
Actors, in order to guarantee that each actor's tasks get executed the
shared thread pool will need to process many unnecessary "wake up" tasks,
creating a lot of contention on the blocking linked queue and slowing down
the entire broker.

In the past I've tried to replace the current shared thread pool
implementation with a ForkJoinPool or (the most recent attempt) with a
lock-free queue instead of the LinkedBlockingQueue, with no success (
https://github.com/apache/activemq-artemis/pull/2582).
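
Just to show the shapes involved (illustrative only, this is not the actual
Artemis wiring):

import java.util.concurrent.*;

public final class PoolVariants {
   public static void main(String[] args) {
      // current shared pool: tasks handed over through a lock-guarded LinkedBlockingQueue
      ExecutorService current = new ThreadPoolExecutor(
            30, 30, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>());
      // one of the attempted replacements; the other swapped the task queue
      // for a lock-free one (see the PR above)
      ExecutorService forkJoinAttempt = new ForkJoinPool(30);
      current.shutdown();
      forkJoinAttempt.shutdown();
   }
}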

Below is the contention graph using a lock-free queue in the shared thread pool:

[image: image.png]

In violet we now have QueueImpl::deliver and RefsOperation::afterCommit
contending on the QueueImpl lock, but the numbers for each bar are very
different: in the previous graph the contention on the shared thread pool
lock is around 600 ns, while here it is 20-80 ns, and it can scale with the
number of queues, which the previous version cannot.

All green, right? So why have I reverted the lock-free thread pool?

Because with low utilization of the broker (i.e. 1 producer/1 consumer/1
queue) the latencies and throughput were actually worse: CPU utilization
graphs showed that ProcessorBase::onAddedTaskIfNotRunning was spending most
of its time waking up the shared thread pool. The same was happening with a
ForkJoinPool, sadly.
It seems (and it is just a guess) that, given that tasks get consumed
faster (there is no lock preventing them from being polled and executed),
the thread pool goes idle sooner (the default thread pool size is 30 and I
have a machine with just 8 real cores), forcing every new task submission
to wake up one of the pool threads to process the incoming tasks.
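
To illustrate the guess (a toy example, not a measurement on the broker):
with tiny tasks, a pool much larger than the core count, and a producer
slow enough that the pool drains between submissions, nearly every
execute() has to signal a parked worker instead of handing work to an
already-running one:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public final class IdlePoolWakeups {
   public static void main(String[] args) throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(30); // far more threads than cores
      for (int i = 0; i < 10_000; i++) {
         pool.execute(() -> { });          // near-empty task: completes immediately
         LockSupport.parkNanos(10_000);    // producer pause: the pool goes idle again
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.MINUTES);
   }
}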

What are your thoughts on this?
TBH I don't want to trade away too much of the "low utilization"
performance for scaling, and that's why I've preferred to revert the change.
Note that other applications with scalability needs (e.g. Cassandra) have
moved away from a SEDA-based shared pool approach to a thread-per-core
architecture for this same reason.

Cheers,
Franz
