Hi James,

3 Consumers in a group means you are having 20 partitions per consumer (as
per your 60 partition and 1 CGroup setup), 5 means 12. There's nothing
special about these numbers as you also noticed.
Have you tried setting fetch.max.wait.ms = 0 and see whether that's making
a difference for you?

Thanks,


On Thu, 5 Mar 2020 at 03:43, James Olsen <ja...@inaseq.com> wrote:

> I’m seeing behaviour that I don’t understand when I have Consumers
> fetching from multiple Partitions from the same Topic.  There are two
> different conditions arising:
>
> 1. A subset of the Partitions allocated to a given Consumer not being
> consumed at all.  The Consumer appears healthy, the Thread is running and
> logging activity and is successfully processing records from some of the
> Partitions it has been assigned.  I don’t think this is due to the first
> Partition fetched filling a Batch (KIP-387).  The problem does not occur if
> we have a particular number of Consumers (3 in this case) but it has failed
> with a range of other larger values.  I don’t think there is anything
> special about 3 - it just happens to work OK with that value although it is
> the same as the Broker and Replica count.  When we tried 6, 5 Consumers
> were fine but 1 exhibited this issue.
>
> 2. Up to a half second delay between Producer sending and Consumer
> receiving a message.  This looks suspiciously like the fetch.max.wait.ms=500
> but we also have fetch.min.bytes=1 so should get messages as soon as
> something is available.  The only explanation I can think of is if the
> fetch.max.wait.ms is applied in full to the first Partition checked and
> it remains empty for the duration.  Then it moves on to a subsequent
> non-empty Partition and delivers messages from there.
>
> Our environment is AWS MSK (Kafka 2.2.1) and Kafka Java client 2.4.0.
>
> All environments appear healthy and under light load, e.g. clients only
> operating at a 1-2% CPU, Brokers (3) at 5-10% CPU.   No swap, no crashes,
> no dead threads etc.
>
> Typical scenario is a Topic with 60 Partitions, 3 Replicas and a single
> ConsumerGroup with 5 Consumers.  The Partitioning is for semantic purposes
> with the intention being to add more Consumers as the business grows and
> load increases.  Some of the Partitions are always empty due to using short
> string keys and the default Partitioner - we will probably implement a
> custom Partitioner to achieve better distribution in the near future.
>
> I don’t have access to the detailed JMX metrics yet but am working on that
> in the hope it will help diagnose.
>
> Thoughts and advice appreciated!

Reply via email to