GitHub user hachikuji opened a pull request:
https://github.com/apache/kafka/pull/4349
KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits
during coordinator disconnect
When the coordinator is marked unknown, we explicitly disconnect its
connection and cancel pending requests. Currently the disconnect happens before
the coordinator state is set to null, which means that callbacks which inspect
the coordinator state will see it still as active. This can lead to further
requests being sent. In pathological cases, the disconnect itself is not able
to return because new requests are sent to the coordinator before the
disconnect can complete, which leads to the stack overflow error. To fix the
problem, I have reordered the disconnect to happen after the coordinator is set
to null.
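
The ordering problem described above can be illustrated with a small standalone sketch. This is not the actual consumer code: the class `CoordinatorDisconnectSketch`, its methods, the `clearStateFirst` flag, and `RETRY_CAP` are all hypothetical names invented for illustration. The real failure is a stack overflow through nested calls; here it is modeled as a queue that keeps refilling while it is being drained, with a cap so the demo terminates.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CoordinatorDisconnectSketch {
    // Hypothetical guard so the problematic ordering terminates in this demo.
    static final int RETRY_CAP = 1000;

    // Failure callbacks of requests that have not completed yet.
    private final Deque<Runnable> pending = new ArrayDeque<>();
    // null stands for "coordinator unknown".
    private String coordinator = "broker-1";
    private int resends = 0;

    void sendCommit() {
        pending.add(this::onCommitFailed);
    }

    // Callback run when a pending request is cancelled by the disconnect.
    private void onCommitFailed() {
        // Resend only if the coordinator still looks active.
        if (coordinator != null && resends < RETRY_CAP) {
            resends++;
            sendCommit();
        }
    }

    // Cancels pending requests, invoking their callbacks synchronously.
    private void disconnect() {
        Runnable callback;
        while ((callback = pending.poll()) != null) {
            callback.run();
        }
    }

    void markCoordinatorUnknown(boolean clearStateFirst) {
        if (clearStateFirst) {
            // Reordered: callbacks fired by disconnect() see a null coordinator.
            coordinator = null;
            disconnect();
        } else {
            // Original order: callbacks still see an active coordinator.
            disconnect();
            coordinator = null;
        }
    }

    static int resendsAfterDisconnect(boolean clearStateFirst) {
        CoordinatorDisconnectSketch sketch = new CoordinatorDisconnectSketch();
        sketch.sendCommit();
        sketch.markCoordinatorUnknown(clearStateFirst);
        return sketch.resends;
    }

    public static void main(String[] args) {
        System.out.println("disconnect first:  " + resendsAfterDisconnect(false) + " resends");
        System.out.println("clear state first: " + resendsAfterDisconnect(true) + " resends");
    }
}
```

In this sketch, disconnecting first lets the callback resend until it hits the cap, while clearing the state first yields no resends because the callback sees the coordinator as unknown.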
I have added a basic test case to verify that callbacks for in-flight or
unsent requests see the coordinator as unknown, which prevents them from
attempting to resend. We may need additional test cases after we determine
whether this is in fact what is happening in the reported ticket.
Note that I have also included some minor cleanups which I noticed along
the way.
### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hachikuji/kafka KAFKA-6366
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/kafka/pull/4349.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4349
----
commit 488de3dca5be6111fd447980c8e79477259dc99a
Author: Jason Gustafson <jason@...>
Date: 2017-12-18T18:53:38Z
KAFKA-6366 [WIP]: Fix stack overflow in consumer due to fast offset commits
during coordinator disconnect
----