Javier,

Got it. The proposal from SO should work; the drawback is that you need one more full-fledged consumer instance to do it.

If you'd like to go a bit deeper, you can turn on DEBUG-level logging for the `o.a.k.clients.NetworkClient` class, which prints the following upon node disconnects:

```
Node {} disconnected.
```

Then you can have a very simple grep program that looks for this line and fires healthcheck actions whenever the count exceeds a limit within a sliding window, e.g.

Guozhang

On Thu, Feb 21, 2019 at 11:23 PM Javier Arias Losada <javier.ari...@gmail.com> wrote:

> Thank you for your responses!
>
> Guozhang, what you propose seems like a very good way to monitor the health of consumers externally; with this combination of metrics (offset advance + bytes-in/out) it can be deduced when a consumer is not working.
>
> What we are trying to accomplish is to detect this very same situation, but from inside the consumer process. The reason is that our consumer runs as a container task in AWS-ECS, and we have an HTTP healthcheck in the process so that whenever the process returns 'unhealthy', the cluster scheduler stops that instance.
>
> So our idea is to find the best way to notice, from inside the consumer, that we have lost the connection to the broker, so that we can mark the instance as unhealthy.
>
> We found on Stack Overflow a way to do it: have a consumer and periodically do a listTopics(timeout) call; whenever you lose the connection to the cluster, this raises an exception. What do you think? Are there any drawbacks with this approach other than one extra consumer? Is it better to reuse the same consumer, or to create a new consumer every time? It would be about every minute, which is the period for healthchecks in our cluster.
>
> Again, thanks.
>
>
>
> On Wed, Feb 20, 2019 at 18:54, Guozhang Wang (<wangg...@gmail.com>) wrote:
>
> > Hello Javier,
> >
> > Matthias is right, it is a known issue, not only in Streams but in the underlying producer / consumer clients.
> >
> > For your own healthcheck monitoring, I'd suggest you consider the following alternatives:
> >
> > 1) Monitor consumer offsets, and alert when they have not changed for a long time.
> >
> > 2) Obviously not all scenarios of 1) above are caused by a lost connection, so in addition you can also monitor the embedded consumer / producer's bytes-in / bytes-out rate, and alert when it drops to zero for some time.
> >
> > Combining 1) with 2): when both happen, it usually indicates a lost-connection situation.
> >
> >
> > Guozhang
> >
> >
> > On Wed, Feb 20, 2019 at 9:39 AM Matthias J. Sax <matth...@confluent.io> wrote:
> >
> > > It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6520
> > >
> > >
> > > On 2/20/19 3:25 AM, Javier Arias Losada wrote:
> > > > Hello Kafka users,
> > > >
> > > > working on a Kafka-Streams stateless application; we want to implement some healthchecks so that whenever the connection to Kafka is lost for more than a threshold, we mark the instance as unhealthy, so that our cluster manager (could be K8S or AWS-ECS) kills that instance and starts a new one.
> > > >
> > > > We have noticed that when the consumer is running and the connection is lost, it tries to reconnect and writes some logs, but we didn't find a way to programmatically check or subscribe to the connection status.
> > > >
> > > > Am I missing something?
> > > > Is this an intended feature? Why?
> > > > What are the best practices for healthchecking Kafka-Streams applications?
> > > >
> > > > I also found that with a plain Kafka consumer, no exception is raised on lost connectivity... how could we somehow check the connection status? How are other people solving this issue?
> > > >
> > > > Thank you very much.
> > > >
> > >
> > >
> >
> > --
> > -- Guozhang
>

--
-- Guozhang
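
For reference, the log-scanning idea at the top of this message (count `Node {} disconnected.` lines and act when they exceed a limit within a sliding window) could be sketched roughly as below. This is only an illustration in plain Java, not something shipped with Kafka: the class name `DisconnectWindow`, the 60-second window, and the limit of 3 are made-up values, and printing `UNHEALTHY` stands in for whatever healthcheck action you actually use. It assumes the log lines are piped to standard input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts "Node ... disconnected." log lines inside a sliding time window. */
final class DisconnectWindow {
    private final long windowMillis;  // hypothetical: length of the sliding window
    private final int limit;          // hypothetical: disconnects tolerated per window
    private final Deque<Long> events = new ArrayDeque<>();  // timestamps of recent disconnects

    DisconnectWindow(long windowMillis, int limit) {
        this.windowMillis = windowMillis;
        this.limit = limit;
    }

    /** True for lines such as "... Node -1 disconnected." (log pattern "Node {} disconnected."). */
    static boolean isDisconnectLine(String line) {
        return line.matches(".*Node \\S+ disconnected\\..*");
    }

    /** Feeds one log line seen at nowMillis; returns true while the limit is exceeded. */
    boolean record(String line, long nowMillis) {
        // Drop disconnects that have fallen out of the window.
        while (!events.isEmpty() && nowMillis - events.peekFirst() >= windowMillis) {
            events.removeFirst();
        }
        if (isDisconnectLine(line)) {
            events.addLast(nowMillis);
        }
        return events.size() > limit;
    }

    public static void main(String[] args) throws IOException {
        DisconnectWindow window = new DisconnectWindow(60_000L, 3);  // assumed values
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (window.record(line, System.currentTimeMillis())) {
                    System.out.println("UNHEALTHY");  // placeholder for the real healthcheck action
                }
            }
        }
    }
}
```

You would feed it the client log, for example by piping the output of `tail -F` on the log file into it. Note that it uses the arrival time of each line rather than the timestamp inside the log line, which is good enough when following a live log but not when replaying an old one.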