Hi James,

With 60 partitions and a single consumer group, 3 consumers means 20 partitions per consumer, and 5 means 12. As you also noticed, there is nothing special about these numbers. Have you tried setting fetch.max.wait.ms=0 to see whether that makes a difference for you?
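If it helps, here is a minimal sketch of the consumer settings for that experiment. The class and method names (FetchWaitExperiment, consumerProps, partitionsPerConsumer), the broker address and the group id are placeholders I made up; only the property keys are real consumer configs. It assumes String keys and values, and it only builds a java.util.Properties that you would then pass to the KafkaConsumer constructor.

```java
import java.util.Properties;

// Hypothetical helper class, not part of the Kafka client.
class FetchWaitExperiment {

    // Consumer properties for the latency test. fetch.min.bytes=1 matches the
    // setup described in the thread; fetch.max.wait.ms=0 is the value to try.
    public static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumption: String keys and values; swap in your own deserializers.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("fetch.min.bytes", "1");
        props.setProperty("fetch.max.wait.ms", "0");
        return props;
    }

    // Ceiling division: the most partitions any one consumer holds, assuming
    // the assignor spreads a single topic's partitions evenly over the group.
    public static int partitionsPerConsumer(int partitions, int consumers) {
        return (partitions + consumers - 1) / consumers;
    }

    public static void main(String[] args) {
        // Placeholder address and group id, for illustration only.
        Properties props = consumerProps("localhost:9092", "example-group");
        System.out.println("fetch.max.wait.ms=" + props.getProperty("fetch.max.wait.ms"));
        System.out.println("3 consumers: " + partitionsPerConsumer(60, 3) + " partitions each");
        System.out.println("5 consumers: " + partitionsPerConsumer(60, 5) + " partitions each");
    }
}
```

With fetch.max.wait.ms=0 the broker should answer each fetch straight away even when there is nothing to return, so expect more fetch requests. I would treat this as a diagnostic setting rather than something to leave in place.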
Thanks,

On Thu, 5 Mar 2020 at 03:43, James Olsen <ja...@inaseq.com> wrote:

> I’m seeing behaviour that I don’t understand when I have Consumers fetching from multiple Partitions from the same Topic. There are two different conditions arising:
>
> 1. A subset of the Partitions allocated to a given Consumer not being consumed at all. The Consumer appears healthy, the Thread is running and logging activity and is successfully processing records from some of the Partitions it has been assigned. I don’t think this is due to the first Partition fetched filling a Batch (KIP-387). The problem does not occur if we have a particular number of Consumers (3 in this case) but it has failed with a range of other larger values. I don’t think there is anything special about 3 - it just happens to work OK with that value although it is the same as the Broker and Replica count. When we tried 6, 5 Consumers were fine but 1 exhibited this issue.
>
> 2. Up to a half second delay between Producer sending and Consumer receiving a message. This looks suspiciously like the fetch.max.wait.ms=500 but we also have fetch.min.bytes=1 so should get messages as soon as something is available. The only explanation I can think of is if the fetch.max.wait.ms is applied in full to the first Partition checked and it remains empty for the duration. Then it moves on to a subsequent non-empty Partition and delivers messages from there.
>
> Our environment is AWS MSK (Kafka 2.2.1) and Kafka Java client 2.4.0.
>
> All environments appear healthy and under light load, e.g. clients only operating at a 1-2% CPU, Brokers (3) at 5-10% CPU. No swap, no crashes, no dead threads etc.
>
> Typical scenario is a Topic with 60 Partitions, 3 Replicas and a single ConsumerGroup with 5 Consumers. The Partitioning is for semantic purposes with the intention being to add more Consumers as the business grows and load increases. Some of the Partitions are always empty due to using short string keys and the default Partitioner - we will probably implement a custom Partitioner to achieve better distribution in the near future.
>
> I don’t have access to the detailed JMX metrics yet but am working on that in the hope it will help diagnose.
>
> Thoughts and advice appreciated!