[jira] [Created] (KAFKA-10559) Don't shutdown the entire app upon TimeoutException during internal topic validation

Sophie Blee-Goldman (Jira) Wed, 30 Sep 2020 11:44:21 -0700

Sophie Blee-Goldman created KAFKA-10559:
-------------------------------------------


             Summary: Don't shutdown the entire app upon TimeoutException 
during internal topic validation
                 Key: KAFKA-10559
                 URL: https://issues.apache.org/jira/browse/KAFKA-10559
             Project: Kafka
          Issue Type: Bug
          Components: streams
            Reporter: Sophie Blee-Goldman
             Fix For: 2.7.0


During some of the KIP-572 work, we made things pretty brittle by changing the 
StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA` error 
code and shut down the entire application if a TimeoutException is hit during 
the internal topic creation/validation.

Internal topic validation occurs during every rebalance, and we have seen it 
time out on topic discovery in unstable environments. So shutting down the 
entire application seems like a step in the wrong direction, and antithetical 
to the goal of KIP-572 (improving the resiliency of Streams in the face of 
TimeoutExceptions).

I'm not totally sure what the previous behavior was, but it seems to me we have 
three options:
 # Rethrow the TimeoutException and allow it to kill the thread
 # Swallow the TimeoutException and retry the rebalance indefinitely
 # Some combination of the above: swallow the TimeoutException but don't retry 
indefinitely:
 ## Start a timer and allow retrying rebalances for up to the configured 
task.timeout.ms, the timeout config introduced in KIP-572
 ## Retry for some constant number of rebalances
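
As a rough illustration of option 3, the bookkeeping could look something like 
the sketch below. Everything in it is hypothetical: `ValidationTimeoutTracker`, 
`Decision` and the `maxAttempts` parameter are names made up for this example 
and are not existing Streams code, and `timeoutMs` merely stands in for 
task.timeout.ms. It combines both sub-options and gives up on whichever limit 
is hit first.

```java
/**
 * Hypothetical sketch of option 3 (not actual Kafka Streams code):
 * swallow timeouts across rebalances, but only until a deadline or an
 * attempt budget is exhausted.
 */
final class ValidationTimeoutTracker {

    /** What the caller should do after a timeout during topic validation. */
    enum Decision { RETRY_REBALANCE, GIVE_UP }

    private final long timeoutMs;   // assumed to be fed from task.timeout.ms
    private final int maxAttempts;  // assumed constant cap on consecutive timeouts

    private long firstTimeoutAtMs = -1L; // -1 means no timeout is outstanding
    private int consecutiveTimeouts = 0;

    ValidationTimeoutTracker(long timeoutMs, int maxAttempts) {
        if (timeoutMs < 0 || maxAttempts < 1) {
            throw new IllegalArgumentException("timeoutMs must be >= 0 and maxAttempts >= 1");
        }
        this.timeoutMs = timeoutMs;
        this.maxAttempts = maxAttempts;
    }

    /** Record a timeout observed at nowMs and decide whether to keep retrying. */
    Decision onTimeout(long nowMs) {
        if (firstTimeoutAtMs < 0) {
            firstTimeoutAtMs = nowMs; // start the timer on the first timeout
        }
        consecutiveTimeouts++;
        boolean deadlineExceeded = nowMs - firstTimeoutAtMs >= timeoutMs;
        boolean attemptsExhausted = consecutiveTimeouts >= maxAttempts;
        return (deadlineExceeded || attemptsExhausted)
                ? Decision.GIVE_UP
                : Decision.RETRY_REBALANCE;
    }

    /** Validation succeeded, so the environment has recovered: reset all state. */
    void onSuccess() {
        firstTimeoutAtMs = -1L;
        consecutiveTimeouts = 0;
    }
}
```

What GIVE_UP means is left to the caller on purpose: it could rethrow the 
TimeoutException and let the thread die (option 1), or fall back to shutting 
down the application.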

I think if we go with option 3, then shutting down the entire application is 
relatively more palatable, as we have given the environment a chance to 
stabilize.

But, killing the thread still seems preferable, given the two new features that 
are coming out soon: the ability to start up new threads, and the improved 
exception handler that allows the user to choose to shut down the entire 
application if that's really what they want. Once users have this level of 
control over the application, we should allow them to decide how they want to 
handle exceptional cases like this, rather than forcing an option on them (e.g. 
shutting down everything).

 

IMO we should fix this before 2.7 comes out, even if it's just a partial fix 
(e.g. we do option 1 in 2.7, but plan to implement option 3 eventually).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
