Hi Graham,

I have a lot of questions, in particular on the need for 32 brokers,
but those can wait!

This is tricky; redistribution based on remote demand when local
demand is low is very difficult to get right.
The fundamental problem is moving messages (data) to consumers, doing
it fast (which is what the broker is good at), while avoiding a
'herd' effect of moving too many messages to brokers whose consumers
may turn out to be slow.
The amount of data to track to do this well and efficiently is huge.
For example, every second: calculate the rate per consumer * the
number of consumers on each peer, factor in the message size, compare
that to the local consumers' rates, determine what portion of the
backlog to redistribute, and recalculate. If you get it wrong, you
want the other end to redistribute back, etc... it gets very messy
very fast!
Hence it is not implemented!
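
To give a feel for the arithmetic (numbers invented purely to
illustrate): suppose a peer reports 4 consumers draining ~50 msgs/sec
each (~200 msgs/sec total), local consumers are managing ~20 msgs/sec,
and the local backlog is 10,000 messages of ~1KB. A naive split would
ship roughly 200/(200+20) ~= 90% of the backlog (~9MB) to that peer,
then re-measure a second later and possibly find the peer's consumers
have slowed and those messages need to come back. Multiply that by
every queue and every peer in a 32 broker cluster and the bookkeeping
(and the churn when the estimates are wrong) adds up quickly.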

Recently I added the possibility to redistribute when there are no
local consumers, even when cluster load balancing is off - i.e. there
is no 'normal' routing to remote consumers.
When the last local consumer drops, or when no local consumers match
the selectors currently present on a queue, redistribution can kick
in if there are remote consumers (there is a rough config sketch just
after this paragraph).
This could be extended to try and account for a 'slow' consumer, but I
think this is going down a rabbit hole of sorts. As ever, it depends
on lots of factors; there may be a good use case for it, I don't know
for sure. One problem is that all of the messages would move!
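
For reference, roughly what the broker.xml side of that looks like (a
sketch only - names like 'my-cluster', 'netty-connector' and
'dg-group1' are placeholders, and the OFF_WITH_REDISTRIBUTION
load-balancing type is recent, so check the exact values against your
broker version and the clustering docs):

  <cluster-connections>
     <cluster-connection name="my-cluster">
        <connector-ref>netty-connector</connector-ref>
        <!-- no 'normal' forwarding to remote consumers, but redistribute
             when a queue has a backlog and no (matching) local consumers -->
        <message-load-balancing>OFF_WITH_REDISTRIBUTION</message-load-balancing>
        <max-hops>1</max-hops>
        <discovery-group-ref discovery-group-name="dg-group1"/>
     </cluster-connection>
  </cluster-connections>

  <address-settings>
     <address-setting match="#">
        <!-- redistribution is disabled (-1) by default; 0 means redistribute
             as soon as the conditions above are met -->
        <redistribution-delay>0</redistribution-delay>
     </address-setting>
  </address-settings>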

There is an alternative approach, but it is quite different.

The alternative approach is to partition the jobs (data) across some
set of brokers, or a single broker, distribute the consumers in the
normal way, and have them gravitate to the broker where the data is.
In this way, all the consumers directly compete for the full data set
and can be as fast or as slow as they like.

The broker balancer feature allows connections to be accepted or
rejected based on some criteria, so connections can be partitioned by
the data they will access.

Peek at the docs for some more context:
https://activemq.apache.org/components/artemis/documentation/latest/broker-balancers.html#data-gravity
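
Very roughly, the shape of the config is something like this (a
sketch from memory - the docs and the example below have the exact
schema; 'local-partition' and the filter value are just for
illustration). The CLIENT_ID presented at connection time is the
partition key, the local-target-filter decides which client ids this
particular broker will accept, and the acceptor references the
balancer so everything else is rejected and the client's URL list
takes it to the broker that owns its partition:

  <broker-balancers>
     <broker-balancer name="local-partition">
        <!-- partition on the client id presented at connection time -->
        <target-key>CLIENT_ID</target-key>
        <!-- accept connections with no client id plus the ids this broker owns -->
        <local-target-filter>NULL|.*A</local-target-filter>
     </broker-balancer>
  </broker-balancers>

  <acceptors>
     <acceptor name="artemis">tcp://0.0.0.0:61616?redirect-to=local-partition</acceptor>
  </acceptors>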

There is an example[1] that will help explain and demonstrate this,
based on partitioning durable subscriptions across brokers: each
subscription's client_id has an affinity to a single broker.

But this can be as general as necessary; in particular, using an
authenticated user's 'role' to partition can approach application-level
granularity.
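
For instance, if the broker version in use has a role-based target
key (again just a sketch - check which target keys your release
supports; 'app-.*' and 'app-7' are made-up role names):

  <broker-balancer name="by-role">
     <!-- partition on a role of the authenticated user instead of the client id -->
     <target-key>ROLE_NAME</target-key>
     <!-- only consider roles that look like application partitions -->
     <target-key-filter>app-.*</target-key-filter>
     <!-- the partition this particular broker owns -->
     <local-target-filter>app-7</local-target-filter>
  </broker-balancer>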

I think that long term, at scale and for the lowest latency, it is
best if data movement is limited to "producer app to broker, broker
to ultimate consumer".
Any additional overhead, whether to track demand across a cluster or
to move data to satisfy demand, is best minimised or removed
altogether.

I understand this is quite a different approach and, given that the
current setup works well in general, it may not be applicable, but it
could be worth considering.

Would love your feedback!

[1] 
https://github.com/apache/activemq-artemis/tree/main/examples/features/broker-balancer/symmetric-simple

On Tue, 25 Jan 2022 at 18:57, Graham Stewart <grahamdstew...@gmail.com> wrote:
>
> Hi,
> Firstly, thanks to everyone involved in the Artemis project, it's a key
> part of the system I work on.
>
> We have a symmetric cluster of 32 brokers split across 2 physical data
> centres (16 in each).
> All 32 hosts are VMs. Each host runs a JVM with an embedded Artemis broker
> and several other JVMs that connect to their local broker and produce
> and/or consume messages.
> Messages are distributed round-robin across the cluster.
>
> This approach has worked well for us in several other environments where
> we've used homogeneous physical hardware. It's also been fine in
> environments where we use VMs that are running on undersubscribed physicals.
> We're coming up against a problem with VMs on oversubscribed physicals due
> to variable performance in consuming/producing messages.
>
> An example may help:
>
> A "job" produces 32,000 messages.
> These are distributed round-robin across the 32 brokers - 1,000 messages on
> each.
> On each host there's a process consuming these 1,000 messages.
>
> When these consumer processes perform similarly, each processes its
> 1,000 messages in roughly the same time and the job completes. Great.
>
> However, should one of the processes slow down to let's say half speed, we
> are left waiting twice as long for the job to complete. Not only that but
> there are 31 consumer processes left idle.
>
> Do you have any ideas on how we can handle this better? I don't see how the
> slow consumer approach of setting the consumer window to zero would help
> here. We're really looking for something like the cluster detecting there
> are idle consumers and redistributing messages away from the broker with
> the slow consumers.
>
> Any pointers you can give me would be really appreciated.
>
> Many thanks
>
> Graham Stewart
