Hi,
We have been evaluating Managed Streaming for Kafka (MSK) on AWS for a use-case
that requires high-speed data ingestion on the order of millions of messages
(each ~1 KB in size) per second. We ran into some issues when testing this case.
Context:
To start with, we set up a single topic with 3 partitions on a 3-node MSK
cluster of m5.large brokers (2 cores, 8 GB RAM, 500 GB EBS) with encryption
enabled for inter-broker (intra-MSK) communication. Each broker is in a
separate AZ (3 AZs and 3 brokers in total) and has 10 network threads and 16 IO
threads.
With replication-factor = 2 and min.insync.replicas = 2 on the topic, and
acks = all on the publishers, sending 100+ million messages from 3 parallel
publishers intermittently results in the following error:
`Delivery failed: Broker: Not enough in-sync replicas`
As per the documentation, this error is thrown when in-sync replicas have been
lagging behind for more than a configured duration (replica.lag.time.max.ms,
30 seconds by default).
However, when we don't see this error, the throughput is around 90 K msgs/sec,
i.e. ~90 MB/sec. CPU usage is below 50% and disk usage is below 20%, so
apparently CPU, memory, and disk are not the issue?
If we instead set replication-factor = 1 and min.insync.replicas = 1 and/or
acks = 1, keeping everything else the same, there are no errors and throughput
is ~380 K msgs/sec, i.e. ~380 MB/sec, with CPU usage below 30%.
Question:
Without replication we were able to write 380 MB/sec, so disk, CPU, and memory
do not seem to be the limiting factor. What could cause the replicas to lag
behind at only 90 MB/sec? Could the total thread count (10 network + 16 IO) be
too high for a 2-core machine? But the same thread settings work fine without
replication. What could be the reason for (1) the lower throughput when
replication is turned on, and (2) the replicas lagging behind when replication
is turned on?
Thanks
Arti