Hi,

We are using kafka-producer 0.8.2 in production. We configured it with
retries set to Integer.MAX_VALUE and buffer.memory set to 1GB.
With this setup we can tolerate all brokers being unavailable for around
one hour (taking into account our production traffic).
For example, when all brokers in a single DC/zone are down,
kafka-producer buffers all incoming messages in its accumulator until it
is full. When the brokers are available again, the producer sends all the
buffered messages to Kafka. This gives us some time for recovery, and we
don't lose any messages.
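
For reference, our 0.8.2 setup looks roughly like the sketch below. It only
builds the java.util.Properties that would be passed to the producer. The
class name and the broker address are placeholders made up for this mail,
and serializers and other settings are omitted.

```java
import java.util.Properties;

public class LegacyProducerConfig {
    // 1 GB expressed in bytes, the unit buffer.memory expects.
    static final long ONE_GB = 1024L * 1024L * 1024L;

    static Properties build() {
        Properties props = new Properties();
        // Placeholder address, not our real brokers.
        props.setProperty("bootstrap.servers", "broker1:9092");
        // Retry (practically) forever instead of giving up on a batch.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Accumulator size: how much can be buffered while brokers are down.
        props.setProperty("buffer.memory", String.valueOf(ONE_GB));
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting settings as key=value pairs.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```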

Now we would like to migrate to the newest kafka-producer, 0.10.1, but we
have a problem preserving the described behaviour because of changes
introduced to the producer library:

- proposal about adding request timeout to NetworkClient
https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient
- producer record can stay in RecordAccumulator forever if leader is not
available https://issues.apache.org/jira/browse/KAFKA-1788
- add a request timeout to NetworkClient
https://issues.apache.org/jira/browse/KAFKA-2120

These changes introduce the request.timeout.ms parameter, which is used for:

1. actual network RTT
2. server replication time
3. new mechanism for aborting expired batches

When brokers are unavailable for longer than request.timeout.ms,
kafka-producer starts dropping batches from the accumulator and passes a
TimeoutException to the callback with the message:

  "Batch containing " + recordCount + " record(s) expired due to timeout
  while requesting metadata from brokers for " + topicPartition

I see two possible ways to protect against unavailability of all brokers
with the newest kafka-producer:

- I could increase request.timeout.ms to one hour, so that batches would
only be dropped after that time, but this value is not reasonable for (1)
and (2)
- I could catch the TimeoutException and send the corresponding message to
kafka-producer again, but then I don’t have a guarantee that there will be
free space in the accumulator
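
The first option would amount to something like the sketch below (again
only java.util.Properties; the class name and broker address are
placeholders made up for this mail). The single value then also applies to
(1) and (2), which is exactly the problem.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class LongTimeoutProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        // Placeholder address, not our real brokers.
        props.setProperty("bootstrap.servers", "broker1:9092");
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // 1 GB accumulator, in bytes.
        props.setProperty("buffer.memory", String.valueOf(1024L * 1024L * 1024L));
        // One hour in milliseconds. This keeps batches in the accumulator
        // longer, but the same timeout is also used for (1) and (2).
        props.setProperty("request.timeout.ms",
                String.valueOf(TimeUnit.HOURS.toMillis(1)));
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting settings as key=value pairs.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```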

In my opinion the timeout for (3) should be independent of (1) and (2), or
dropping expired batches should be an optional feature.
What do you think about this issue? Do you have any suggestions or
solutions for this use case?

Best regards,
Luke Druminski
