RE: Kafka Streams - Producer attempted to produce with an old epoch.

2022-10-27 Thread Andrew Muraco
A detail I forgot to mention is that I am using EOS, so the streams application
was consistently getting this error, causing it to restart that task, which
obviously would bottleneck everything.
Once I increased the threads on the brokers, the epoch error subsided until the
volume increased.

Right now the streams application has 36 threads per box (5 boxes with 48
threads each), and when running normally it keeps up, but once this epoch error
starts it causes a cascade of slowness. The CPU is not even fully utilized
because this error happens every minute or so.
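
For concreteness, the Streams-side knobs involved here (the EOS mode, the
thread count per instance, and the two timeouts whose expiry ends with the
producer getting fenced) look roughly like the sketch below. This is only an
illustration: the application id, bootstrap server, and all values are
placeholders rather than recommendations, and it assumes a 3.x client (older
releases use the plain exactly_once guarantee).

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsEosConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        // Exactly-once semantics (EOS): commits go through a transactional producer.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // Processing threads per application instance (36 per box in this setup).
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 36);
        // If a transaction stays open longer than this, the coordinator aborts it and
        // bumps the epoch, which the task sees as InvalidProducerEpochException.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 60_000);
        // If a StreamThread goes longer than this between poll() calls, it is kicked
        // from the consumer group, its tasks migrate, and the old producer is fenced.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300_000);
        // props would then be handed to new KafkaStreams(topology, props).
    }
}

If either of those limits is exceeded under load, the result is a fenced
producer and a restarted task, which is the cascade described above.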

Perhaps some other tuning is needed too, but I'm at a loss as to what to look at.
I do have JMX connections to the broker and streams applications, if there's
any useful information I should be looking at.
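
For reference, the broker-side numbers that answer the "starved request
handlers" question are the idle-percent meters, which can be pulled over JMX
with something like the minimal sketch below (the host and port are
placeholders; the MBean names are the standard broker ones).

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerIdleCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on broker1:9999 (placeholder).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Fraction of time the io (request handler) threads are idle;
            // values near 0 mean the broker's request handlers are saturated.
            Object ioIdle = mbs.getAttribute(new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");
            // Same idea for the network threads.
            Object netIdle = mbs.getAttribute(new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent"),
                    "Value");
            System.out.println("io idle (1m): " + ioIdle + ", network idle: " + netIdle);
        }
    }
}

On the Streams side, the thread-level metrics (for example process-rate and
commit-latency-avg under kafka.streams:type=stream-thread-metrics) together
with the consumer's records-lag-max usually show whether individual
StreamThreads are falling behind.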

-Original Message-
From: Sophie Blee-Goldman  
Sent: Thursday, October 27, 2022 11:22 PM
To: users@kafka.apache.org
Subject: Re: Kafka Streams - Producer attempted to produce with an old epoch.

I'm not one of the real experts on the Producer, and even less so on broker
performance, so someone else may need to chime in for that, but I did have a
few questions:

What specifically are you unsatisfied with w.r.t. the performance? Are you
hoping for higher throughput from your Streams app's output, or is there
something about the brokers? I'm curious why you started by increasing the
broker threads, especially if the perf issue/bottleneck is with the Streams
app's processing (but maybe it is not). I would imagine that throwing more and
more threads at the machine could even make things worse; it definitely will
if the thread count gets high enough, though it's hard to say where/when it
might start to decline. The point is, if the brokers are eating up all the CPU
time with their own threads, then the clients embedded in Streams may be
getting starved out at times, causing that StreamThread/consumer to drop out
of the group and resulting in the producer getting fenced. Or it could be
blocking I/O for RocksDB, leading to write stalls, which could similarly get
that StreamThread kicked from the consumer group (if the application has
state, especially if it has quite a lot of it).
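
If it does turn out to be RocksDB write stalls, the hook for tuning the state
stores is a custom rocksdb.config.setter. A minimal sketch is below; the class
name and every number in it are illustrative only, not a recommendation.

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Register with: props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
//                          CustomRocksDBConfig.class);
public class CustomRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // More background threads for flushes/compactions can help when many
        // state stores share the same disks as the brokers.
        options.setIncreaseParallelism(4);             // illustrative
        // Allow a few more memtables per store before writes stall.
        options.setMaxWriteBufferNumber(4);            // illustrative
        options.setWriteBufferSize(16 * 1024 * 1024L); // 16 MB memtable, illustrative
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing allocated in setConfig() that needs explicit closing here.
    }
}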

How many StreamThreads did you give the app btw?

On Thu, Oct 27, 2022 at 8:01 PM Andrew Muraco  wrote:

> Hi,
> I have a Kafka Streams application deployed on 5 nodes, and with full
> traffic I am getting the error message:
>
> org.apache.kafka.common.errors.InvalidProducerEpochException: Producer 
> attempted to produce with an old epoch.
> Written offsets would not be recorded and no more records would be 
> sent since the producer is fenced, indicating the task may be migrated 
> out
>
> I have 5 x 24-CPU/48-core machines with 128 GB of RAM. These machines are
> the Kafka brokers, with 2x1TB disks for the Kafka logs, and they also run
> the Kafka Streams application.
> The topic has a replication factor of 2 and receives about 250k records per
> second. I have 2 aggregations in the topology writing to 2 output topics;
> the final output topics are in the 10s of thousands of records per second.
>
> I'm assuming I have a bottleneck somewhere. I increased the broker thread
> counts and observed that the frequency of this error went down, but it's
> still happening.
> Here's the broker configuration I'm using now, but I might be overshooting
> some of these values.
>
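> # (For reference, the stock defaults are num.network.threads=3, num.io.threads=8,
> # num.replica.fetchers=1, log.cleaner.threads=1, queued.max.requests=500.)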
> num.network.threads=48
> num.io.threads=48
> socket.send.buffer.bytes=512000
> socket.receive.buffer.bytes=512000
> replica.socket.receive.buffer.bytes=1024000
> socket.request.max.bytes=10485760
> num.replica.fetchers=48
> log.cleaner.threads=48
> queued.max.requests=48000
>
> I can't find good documentation on the effect of broker configuration 
> on performance.
>

