Hey Gary,

Thanks for all your help on this. I really appreciate it.
The approach of consumers competing for messages is where we need to get to.
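To put rough numbers on why (a back-of-envelope sketch; the per-consumer
rate here is a made-up figure):

```python
# Back-of-envelope: 32,000 messages, 32 consumers, one running at half speed.
MESSAGES = 32_000
CONSUMERS = 32
NORMAL_RATE = 100.0          # msgs/sec per consumer (made-up figure)
SLOW_RATE = NORMAL_RATE / 2  # one consumer on an oversubscribed VM

# Partitioned (today): each consumer owns a fixed 1,000-message slice,
# so the job finishes only when the slowest slice finishes.
per_consumer = MESSAGES / CONSUMERS
partitioned_time = max(per_consumer / NORMAL_RATE, per_consumer / SLOW_RATE)

# Competing consumers: all 32 drain one shared backlog, so the effective
# throughput is the sum of the individual rates.
aggregate_rate = (CONSUMERS - 1) * NORMAL_RATE + SLOW_RATE
competing_time = MESSAGES / aggregate_rate

print(partitioned_time)  # 20.0 seconds
print(competing_time)    # ~10.16 seconds
```

So with our current partitioning, one half-speed consumer doubles the job
time; with competing consumers it barely registers.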

We made a decision ages ago that producer and consumer processes on a given
host would all connect to a local Artemis broker running on the same host.
The reasoning behind this was that we wouldn't be making a network hop to
send or receive.
Producers/consumers would always be making just a local call and we'd leave
Artemis to do all the network distribution, which it does superbly.

From what you've said we'll need to revise that decision.

I'll take a look at the broker balancer features to see whether that fits.
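From a first skim of those docs, the shape of the configuration in
broker.xml looks roughly like this (a sketch only; the balancer name and
the CLIENT_ID filter are made-up examples, and element names may differ
between Artemis versions):

```xml
<broker-balancers>
   <broker-balancer>
      <!-- hypothetical balancer partitioning on client ID -->
      <name>job-partition</name>
      <target-key>CLIENT_ID</target-key>
      <!-- only accept clients whose CLIENT_ID lands in this broker's
           partition; everyone else is rejected and retries elsewhere -->
      <local-target-filter>^job-0[0-9]$</local-target-filter>
   </broker-balancer>
</broker-balancers>
```

and, if I'm reading the docs right, the balancer is then attached to an
acceptor (something like redirect-to=job-partition on the acceptor URL)
so non-matching connections are refused at connect time.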
Will let you know where we get with this. Meanwhile, ask away on the 32
brokers and anything else for that matter.

thanks

Graham




On Wed, 26 Jan 2022 at 10:47, Gary Tully <gary.tu...@gmail.com> wrote:

> Hi Graham,
>
> I have a lot of questions, in particular on the need for 32 brokers
> but those can wait!
>
> This is tricky: redistribution based on remote demand when local
> demand is low is very difficult to get right.
> The fundamental problem is moving messages (data) to consumers, trying
> to do it fast (which is what the broker is good at) and not having a
> 'herd' effect of moving too many messages to brokers where those
> consumers can get slow.
> The amount of data to track to do this well and efficiently is huge.
> For example, every second: calculate the rate per consumer * num
> consumers on each peer, factor in the message size, compare to the
> local consumers' rates, determine what portion of the backlog to
> redistribute, and recalculate. If you get it wrong, you want the other
> end to redistribute, etc... it gets very messy very fast!
> Hence it is not implemented!
>
> Recently I added the possibility to redistribute when there are no
> local consumers, even when cluster load balancing is off - i.e: there
> is no 'normal' routing to remote consumers.
> When the last local consumer drops or when there are no local
> consumers that match the current selectors present on a queue,
> redistribution can kick in if there are remote consumers.
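> For reference, the knob that controls this is redistribution-delay in
> the address-settings in broker.xml. A minimal sketch (the "jobs.#"
> address match is a made-up example):
>
> ```xml
> <address-settings>
>    <address-setting match="jobs.#">
>       <!-- -1 (the default) disables redistribution; 0 starts
>            redistributing as soon as there is no matching local
>            consumer -->
>       <redistribution-delay>0</redistribution-delay>
>    </address-setting>
> </address-settings>
> ```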
> This could be extended to try to account for a 'slow' consumer, but I
> think this is going down a rabbit hole of sorts. As ever, it depends
> on lots of factors; there may be a good use case for it, I don't know
> for sure. One problem is that all messages will move!
>
> There is an alternative approach, but it is quite different.
>
> The alternative approach is to partition the jobs (data) to some set
> of brokers or a single broker and distribute the consumers in the
> normal way but have them gravitate to the broker where the data is.
> In this way, all the consumers directly compete for the full data set
> and can be as fast or as slow as they like.
>
> The broker balancer feature allows connections to be accepted or
> rejected on some criteria, which allows connections to be partitioned
> on the data they will access.
>
> Peek at the docs for some more context:
>
> https://activemq.apache.org/components/artemis/documentation/latest/broker-balancers.html#data-gravity
>
> There is an example[1] that will help explain and demonstrate, based
> on partitioning durable subscriptions across brokers. Their client_id
> has an affinity to a single broker.
>
> But this can be as general as necessary; in particular, using an
> authenticated user's 'role' to partition can approach application-level
> granularity.
>
> I think long term, at scale and for lowest latency, it is best if data
> movement is limited to "producer app to broker, broker to ultimate
> consumer".
> Any additional overhead, to track demand across a cluster, or to move
> data to satisfy demand, is best minimised or removed altogether.
>
> I understand this is quite a different approach and considering the
> current setup works well in general, it may not be applicable, but it
> may be worth considering.
>
> would love your feedback!
>
> [1]
> https://github.com/apache/activemq-artemis/tree/main/examples/features/broker-balancer/symmetric-simple
>
> On Tue, 25 Jan 2022 at 18:57, Graham Stewart <grahamdstew...@gmail.com>
> wrote:
> >
> > Hi,
> > Firstly, thanks to everyone involved in the Artemis project, it's a key
> > part of the system I work on.
> >
> > We have a symmetric cluster of 32 brokers split across 2 physical data
> > centres (16 in each).
> > All 32 hosts are VMs. Each host runs a JVM with an embedded Artemis
> > broker and several other JVMs that connect to their local broker and
> > produce and/or consume messages.
> > Messages are distributed round-robin across the cluster.
> >
> > This approach has worked well for us in several other environments
> > where we've used homogeneous physical hardware. It's also been fine
> > in environments where we use VMs that are running on undersubscribed
> > physicals.
> > We're coming up against a problem with VMs on oversubscribed
> > physicals due to variable performance in consuming/producing messages.
> >
> > An example may help:
> >
> > A "job" produces 32,000 messages.
> > These are distributed round-robin across the 32 brokers - 1,000
> > messages on each.
> > On each host there's a process consuming these 1,000 messages.
> >
> > When these consumer processes perform similarly, each processes its
> > 1,000 messages in roughly the same time and the job completes. Great.
> >
> > However, should one of the processes slow down to, let's say, half
> > speed, we are left waiting twice as long for the job to complete. Not
> > only that, but there are 31 consumer processes left idle.
> >
> > Do you have any ideas on how we can handle this better? I don't see
> > how the slow-consumer approach of setting the consumer window to zero
> > would help here. We're really looking for something like the cluster
> > detecting there are idle consumers and redistributing messages away
> > from the broker with the slow consumers.
> >
> > Any pointers you can give me would be really appreciated.
> >
> > Many thanks
> >
> > Graham Stewart
>
