GitHub user hachikuji opened a pull request: https://github.com/apache/kafka/pull/4349
KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits during coordinator disconnect

When the coordinator is marked unknown, we explicitly disconnect its connection and cancel pending requests. Currently the disconnect happens before the coordinator state is set to null, which means that callbacks which inspect the coordinator state will still see it as active. This can lead to further requests being sent. In pathological cases, the disconnect itself is not able to return because new requests are sent to the coordinator before the disconnect can complete, which leads to the stack overflow error.

To fix the problem, I have reordered the disconnect to happen after the coordinator is set to null. I have added a basic test case to verify that callbacks for in-flight or unsent requests see the coordinator as unknown, which prevents them from attempting to resend. We may need additional test cases after we determine whether this is in fact what was happening in the reported ticket.

Note that I have also included some minor cleanups which I noticed along the way.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-6366

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4349.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4349

----
commit 488de3dca5be6111fd447980c8e79477259dc99a
Author: Jason Gustafson <jason@...>
Date:   2017-12-18T18:53:38Z

    KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits during coordinator disconnect

----
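For readers following the ordering issue described in this pull request, here is a minimal, self-contained sketch of the idea. It is not the actual Kafka code: `ToyClient`, `ToyCoordinator`, and `resendsDuringDisconnect` are hypothetical names invented for illustration, and the toy client fails pending requests in a loop with a send cap rather than reproducing the real stack overflow. It only shows why clearing the coordinator before disconnecting stops callbacks from resending.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CoordinatorDisconnectSketch {

    /** Hypothetical stand-in for a network client; not Kafka's real client. */
    static class ToyClient {
        final Deque<Runnable> pending = new ArrayDeque<>();
        int sent = 0;

        void send(Runnable onFailure) {
            sent++;
            pending.add(onFailure);
        }

        /** Fails every pending request; the callbacks run inside this call. */
        void disconnect() {
            while (!pending.isEmpty()) {
                pending.poll().run();
            }
        }
    }

    /** Hypothetical coordinator wrapper with a switchable ordering. */
    static class ToyCoordinator {
        final ToyClient client = new ToyClient();
        final boolean fixedOrder;
        final int maxSends; // cap so the old ordering terminates in this sketch
        String coordinator = "broker-1";

        ToyCoordinator(boolean fixedOrder, int maxSends) {
            this.fixedOrder = fixedOrder;
            this.maxSends = maxSends;
        }

        boolean coordinatorUnknown() {
            return coordinator == null;
        }

        /** Sends a commit whose failure callback retries if the coordinator still looks active. */
        void commitAsync() {
            if (coordinatorUnknown() || client.sent >= maxSends) {
                return;
            }
            client.send(this::commitAsync);
        }

        void markCoordinatorUnknown() {
            if (fixedOrder) {
                coordinator = null;  // new order: clear state first
                client.disconnect(); // callbacks now see the coordinator as unknown
            } else {
                client.disconnect(); // old order: callbacks still see an active coordinator
                coordinator = null;
            }
        }
    }

    /** Returns how many requests were sent while the disconnect was in progress. */
    static int resendsDuringDisconnect(boolean fixedOrder) {
        ToyCoordinator c = new ToyCoordinator(fixedOrder, 1000);
        c.commitAsync(); // one request in flight
        int before = c.client.sent;
        c.markCoordinatorUnknown();
        return c.client.sent - before;
    }

    public static void main(String[] args) {
        System.out.println("old order resends: " + resendsDuringDisconnect(false));
        System.out.println("new order resends: " + resendsDuringDisconnect(true));
    }
}
```

With the new ordering the callback sees the coordinator as unknown and sends nothing; with the old ordering the sketch keeps resending until it hits the artificial cap.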