Dmitry created KAFKA-10013:
------------------------------
Summary: Consumer hang-up in case of unclean leader election
Key: KAFKA-10013
URL: https://issues.apache.org/jira/browse/KAFKA-10013
Project: Kafka
Issue Type: Bug
Components: consumer
Affects Versions: 2.4.1, 2.5.0, 2.3.1, 2.4.0, 2.3.0
Reporter: Dmitry
Starting from kafka 2.3 new offset reset negotiation algorithm added
(org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync)
During this validation, Fetcher
`org.apache.kafka.clients.consumer.internals.SubscriptionState` is held in
`AWAIT_VALIDATION` fetch state.
This effectively means that fetch requests are not issued and consumption
stopped.
In case if unclean leader election is happening during this time,
`LogTruncationException` is thrown from future listener in method
`validateOffsetsAsync` (probably in order to turn on the logic defined by
`auto.offset.reset` parameter).
The main problem is that this exception (thrown from listener of future) is
effectively swallowed by
`org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`
by this part of code
} catch (RuntimeException e) {
if (!future.isDone()) {
future.raise(e);
}
}
In the end the result is: The only way to get out of AWAIT_VALIDATION and
continue consumption is to successfully finish validation, but it can not be
finished.
However - consumer is alive, but is consuming nothing. The only way to resume
consumption is to terminate consumer and start another one.
We discovered this situation by means of kstreams application, where valid
value of `auto.offset.reset` provided by our code is replaced by `None` value
for a purpose of position reset
(org.apache.kafka.streams.processor.internals.StreamThread#create).
And with kstreams it is even worse, as application may be working, logging warn
messages of format `Truncation detected for partition ...,` but data is not
generated for a long time and in the end is lost, making kstreams application
unreliable.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)