Sophie Blee-Goldman created KAFKA-10559:
-------------------------------------------
Summary: Don't shutdown the entire app upon TimeoutException
during internal topic validation
Key: KAFKA-10559
URL: https://issues.apache.org/jira/browse/KAFKA-10559
Project: Kafka
Issue Type: Bug
Components: streams
Reporter: Sophie Blee-Goldman
Fix For: 2.7.0
During some of the KIP-572 work, we made things pretty brittle by changing the
StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA` error
code and shut down the entire application if a TimeoutException is hit during
the internal topic creation/validation.
Internal topic validation occurs during every rebalance, and we have seen it
time out on topic discovery in unstable environments. So shutting down the
entire application seems like a step in the wrong direction, and antithetical
to the goal of KIP-572 (improving the resiliency of Streams in the face of
TimeoutExceptions).
I'm not totally sure what the previous behavior was, but it seems to me we have
three options:
# Rethrow the TimeoutException and allow it to kill the thread
# Swallow the TimeoutException and retry the rebalance indefinitely
# Some combination of the above: swallow the TimeoutException but don't retry
indefinitely:
## Start a timer and allow retrying rebalances for up to the configured
task.timeout.ms, the timeout config introduced in KIP-572
## Retry for some constant number of rebalances
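A minimal sketch of the timer variant of option 3, to make the bookkeeping concrete. The class and method names here are hypothetical and not part of Streams; the timeout value stands in for task.timeout.ms, and the clock is injected only so the logic can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Kafka Streams: bounds how long rebalances
// may keep failing with a timeout before the caller gives up.
final class RebalanceRetryDeadline {
    private static final long NO_FAILURE = -1L;

    private final long timeoutMs;      // stand-in for the task.timeout.ms value
    private final LongSupplier clock;  // injected so the logic can be tested
    private long firstFailureMs = NO_FAILURE;

    RebalanceRetryDeadline(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    // Call when validation times out. Returns true if another rebalance
    // should be attempted, false once the timeout has fully elapsed.
    boolean shouldRetryAfterTimeout() {
        long now = clock.getAsLong();
        if (firstFailureMs == NO_FAILURE) {
            firstFailureMs = now; // start the timer on the first failure
        }
        return now - firstFailureMs < timeoutMs;
    }

    // Call when validation succeeds, so the next failure starts a fresh timer.
    void reset() {
        firstFailureMs = NO_FAILURE;
    }
}
```

With something like this, the assignor would swallow the TimeoutException while shouldRetryAfterTimeout() returns true, and only escalate (kill the thread or shut down) once it returns false.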
I think if we go with option 3, then shutting down the entire application is
relatively more palatable, as we have given the environment a chance to
stabilize.
But killing the thread still seems preferable, given the two new features that
are coming out soon: the ability to start up new threads, and the improved
exception handler that allows the user to choose to shut down the entire
application if that's really what they want. Once users have this level of
control over the application, we should allow them to decide how they want to
handle exceptional cases like this, rather than forcing an option on them (e.g.
shutting down everything).
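To illustrate the kind of user-level choice meant here, a rough sketch follows. Everything in it is a hypothetical stand-in rather than the Streams API: the Response enum, the ThreadExceptionHandler interface, and the use of java.util.concurrent.TimeoutException in place of Kafka's own TimeoutException.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical stand-ins, not the Kafka Streams API.
final class HandlerSketch {
    // The reactions a user might pick from when a thread dies.
    enum Response { REPLACE_THREAD, SHUTDOWN_CLIENT, SHUTDOWN_APPLICATION }

    // User-supplied policy: maps the fatal error to a reaction.
    interface ThreadExceptionHandler {
        Response handle(Throwable error);
    }

    // Example policy: treat timeouts as transient and start a new thread,
    // but stop this client on anything else.
    static final ThreadExceptionHandler REPLACE_ON_TIMEOUT = error ->
        error instanceof TimeoutException
            ? Response.REPLACE_THREAD
            : Response.SHUTDOWN_CLIENT;
}
```

A user who really does want the old behavior could return SHUTDOWN_APPLICATION for timeouts instead, which is the point: the decision moves out of the assignor and into user code.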
Imo we should fix this before 2.7 comes out, even if it's just a partial fix
(e.g. we do option 1 in 2.7, but plan to implement option 3 eventually).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)