Thank you for your responses!

Guozhang, what you propose seems like a very good way to monitor consumer
health externally: with this combination of metrics (offset advance +
bytes-in/out), one can deduce when a consumer is not working.
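
As I understand it, the combination boils down to something like the
sketch below. It is only an illustration: the class and parameter names are
hypothetical, it tracks a single offset rather than one per partition, and
it assumes the monitor can sample the byte rate from a client metric such
as incoming-byte-rate.

```java
/** Flags a probable lost connection: offset stalled AND no bytes coming in. */
class StallDetector {
    private final long stallMillis;   // how long the offset may stay unchanged
    private long lastOffset = -1;     // last offset seen by the monitor
    private long lastProgressMillis;  // time the offset last changed

    StallDetector(long stallMillis, long startMillis) {
        this.stallMillis = stallMillis;
        this.lastProgressMillis = startMillis;
    }

    /**
     * @param offset      latest consumer offset observed by the monitor
     * @param bytesInRate bytes-in rate reported by the client (bytes/sec)
     * @param nowMillis   time of this sample
     * @return true when the offset has not moved for stallMillis or longer
     *         and the byte rate is zero at the same time
     */
    boolean looksDisconnected(long offset, double bytesInRate, long nowMillis) {
        if (offset != lastOffset) {
            // Progress was made: remember the new offset and reset the timer.
            lastOffset = offset;
            lastProgressMillis = nowMillis;
        }
        boolean offsetStalled = nowMillis - lastProgressMillis >= stallMillis;
        boolean noTraffic = bytesInRate == 0.0;
        return offsetStalled && noTraffic;
    }
}
```

Requiring both conditions avoids alerting on a topic that is simply idle
while the client is still exchanging bytes with the broker.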

What we are trying to accomplish is to detect this very same situation, but
from inside the consumer process. The reason is that our consumer runs as a
container task in AWS-ECS, and we have an HTTP healthcheck in the process
so that whenever the process reports 'unhealthy', the cluster scheduler
stops that instance.

So our idea is to find the best way to detect, from inside the consumer,
that we have lost the connection to the broker, so that we can mark the
instance as unhealthy.

We found a way to do it on Stack Overflow: keep a consumer and periodically
make a listTopics(timeout) call; whenever the connection to the cluster is
lost, the call raises an exception. What do you think? Are there any
drawbacks to this approach other than the one extra consumer? Is it better
to reuse the same consumer or to create a new one every time? The call
would run about once a minute, which is the healthcheck period in our
cluster.
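
To make the idea concrete, here is a minimal sketch of what we have in
mind. The class name BrokerConnectivityCheck and the failure threshold are
hypothetical. The probe is passed in as a Callable so that the sketch
compiles without the Kafka client on the classpath; in our case it would
wrap consumer.listTopics(Duration) on a dedicated consumer that only the
probe thread touches, since KafkaConsumer is not safe for multi-threaded
access.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Marks the instance unhealthy after N consecutive failed broker probes. */
class BrokerConnectivityCheck {
    private final Callable<?> probe;        // e.g. () -> consumer.listTopics(timeout)
    private final int maxFailures;          // hypothetical threshold
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    BrokerConnectivityCheck(Callable<?> probe, int maxFailures) {
        this.probe = probe;
        this.maxFailures = maxFailures;
    }

    /** Runs the probe once; any exception counts as a failed attempt. */
    boolean runProbe() {
        try {
            probe.call();
            consecutiveFailures.set(0);     // a success clears the failure streak
            return true;
        } catch (Exception e) {
            consecutiveFailures.incrementAndGet();
            return false;
        }
    }

    /** Value the HTTP healthcheck endpoint would report. */
    boolean isHealthy() {
        return consecutiveFailures.get() < maxFailures;
    }
}
```

With the Kafka client available, the wiring would look roughly like
new BrokerConnectivityCheck(() -> consumer.listTopics(Duration.ofSeconds(5)), 2),
with runProbe() called once per healthcheck period and isHealthy() read by
the HTTP handler. The 5-second timeout and the threshold of 2 are only
placeholders.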

Again, thanks.



On Wed, Feb 20, 2019 at 18:54, Guozhang Wang (<wangg...@gmail.com>)
wrote:

> Hello Javier,
>
> Matthias is right: it is a known issue, not only in Streams, but also in
> the underlying producer / consumer clients.
>
> For your own healthcheck monitoring, I'd suggest considering the
> following alternatives:
>
> 1) Monitor consumer offsets, and alert when they have not changed for a
> long time.
>
> 2) Obviously, not every occurrence of 1) above is caused by a lost
> connection, so in addition you can also monitor the embedded
> consumer / producer's bytes-in / bytes-out rate, and alert when it drops
> to zero for some time.
>
> Combining 1) with 2): when both happen, it usually indicates a lost
> connection.
>
>
> Guozhang
>
>
> On Wed, Feb 20, 2019 at 9:39 AM Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6520
> >
> >
> > On 2/20/19 3:25 AM, Javier Arias Losada wrote:
> > > Hello Kafka users,
> > >
> > > We are working on a stateless Kafka Streams application and want to
> > > implement some healthchecks so that whenever the connection to Kafka
> > > is lost for longer than a threshold, the instance is marked as
> > > unhealthy and our cluster manager (could be K8S or AWS-ECS) kills that
> > > instance and starts a new one.
> > >
> > > We have noticed that when the consumer is running and the connection
> > > is lost, it tries to reconnect and writes some logs, but we didn't
> > > find a way to programmatically check or subscribe to the connection
> > > status.
> > >
> > > Am I missing something?
> > > Is this an intended feature? Why?
> > > What are the best practices for healthchecking Kafka Streams
> > > applications?
> > >
> > > I also found that with a plain Kafka consumer, no exception is raised
> > > on lost connectivity... how could we somehow check the connection
> > > status? How are other people solving this issue?
> > >
> > > Thank you very much.
> > >
> >
> >
>
> --
> -- Guozhang
>
