Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-30 Thread Jose M
Hello Stanislav,

Thanks again for your comments.

I understand, and I'm happy to hear that my use case is rare. The reason
is that before going to real production, we are forced to build a
prototype with limited resources that is still resilient enough to pass
acceptance tests.

I agree that the whitelist in configuration does not look great, that
the cardinality exposed on the broker would be high, and that it is
better to avoid that. Currently many of my consumer groups only have
one consumer, but exposing the metric on the consumer side is a
trade-off I can accept, even if it means I will not have it in case of
a crash loop.

I understand the metric could be something like this on the consumer side:
```
metricMessagesLost += (firstAvailableOffset > currentOffset
    ? firstAvailableOffset - currentOffset : 0)
```
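
For illustration, a minimal sketch of that check on the consumer side, using
the public KafkaConsumer API (the class and method names below are only
illustrative, not part of the proposal):

```
import java.util.Map;
import java.util.Set;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Illustrative helper only: estimates how many messages were already deleted
// by retention before this consumer could read them.
public final class LostMessagesEstimator {

    public static long estimateLostMessages(KafkaConsumer<?, ?> consumer) {
        Set<TopicPartition> assignment = consumer.assignment();
        // firstAvailableOffset per partition == the current log start offset
        Map<TopicPartition, Long> beginningOffsets = consumer.beginningOffsets(assignment);

        long lost = 0L;
        for (TopicPartition tp : assignment) {
            long firstAvailableOffset = beginningOffsets.get(tp);
            long currentOffset = consumer.position(tp); // next offset this group will consume
            if (firstAvailableOffset > currentOffset) {
                lost += firstAvailableOffset - currentOffset;
            }
        }
        return lost;
    }
}
```

The same value could be published through the consumer's metrics reporter
instead of being returned.
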
I will update the KIP with your inputs.

Thanks a lot for your help and your time,


Jose M

On Mon, Jul 29, 2019 at 3:55 PM Stanislav Kozlovski
 wrote:
>
> Hey Jose,
>
> Thanks for sharing your use cases.
> From my experience, it is uncommon to run with a retention.ms setting small
> enough that it can make you lose messages when your consumers can't catch
> up. If you are concerned with data loss, I think the cost investment into
> hardware is generally worth it.
> I think your use case might benefit from setting `retention.bytes` to
> ensure you don't go over a specific size and a higher retention.ms. I
> assume that might be more deterministic as it is likely you have a better
> idea of how much data these files will be (and can get) rather than how
> long they'd take to process.
>
> In any case, I think it's an exception to have to manually configure and
> modify retention.ms in real time according to consumer lag. This metric (if
> enabled) would be the highest cardinality metric in the server, as it is
> per consumer group *and* partition. I know the current proposal suggests we
> enable it through a whitelist config, but I think that would not be intuitive
> to users and I'm not sure if it's a good idea to guard metrics according to
> configurations.
> In general, I believe we should aim to limit the raw number of metrics
> exposed from the broker when there is another way to solve the problem.
>
> I think the metric should go on the broker side, in case the consumer
> > is not even instantiated, or it is crashing in a loop.
>
> We would need *all* consumers in the consumer group to not be available in
> order to not have the information exposed. Also, it is generally expected
> to have your consumers run all the time (Kafka is a real-time streaming
> platform) and batch use cases are the exception.
> If they are all crashing in a loop, there is an outage to be solved and you
> should increase your retention if there is a chance it deleted unconsumed
> data.
> Because of the rareness of needing the information in real-time, I still
> think having it in the consumer is a good approach.
>
> Let me know if that makes sense.
>
> Thanks,
> Stanislav
>
> On Sun, Jul 28, 2019 at 5:00 PM Jose M  wrote:
>
> > Hello,
> >
> > Thanks for taking the time to review my KIP!
> >
> > I will describe some production scenarios I faced to better explain
> > the reasons for this KIP.
> >
> > * Usecase 1: batch processing of files.
> > A batch is producing huge files that must be processed. Each line of
> > the file will be a message produced to a topic. It means the topic
> > storing this messages will go from 0 lag to lets say 5 million lag, in
> > a few seconds. I will adjust the retention time on the topic based on
> > the processing rate on the consumer of this topic. Ex: 5 million
> > messages at 100 TPS needs ~14 hours retention time. In practice we set
> > up bigger retention time, just in case. If a second file arrives
> > before the first one has been processed and the processing ratio is
> > slower than I thought, I will lose the end of the first file, without
> > notice.
> >
> > * Usecase 2: application facing network errors.
> > The application consumes messages on input topic, process them and
> > push them to an external system (ex: webservice). If there are
> > connectivity problem between my kafka consumer and the external
> > webservice, the lag of the application will grow. As I have alerting
> > rules on records-max-lag, I will be aware the backlog of the topic is
> > above a limit. I will take action as in the previous example, and I
> > will adjust retention time on the topic based on the processing rate.
> > If the processing rate is not constant, due to the network
> > connectivity problem, the retention time may not be enough and I will
> > lose messages.
> >
> > In both cases, I don't know if Ive lost messages or not. I suspect
> > that yes but I can not give an accurate number of messages lost, or
> > guarantee I have not lost any of them.
> >
> > I could solve both use cases setting up oversized retention time for
> > the topics, but in practice I'm limited by the hardware resources.
> >
> 

Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-29 Thread Stanislav Kozlovski
Hey Jose,

Thanks for sharing your use cases.
From my experience, it is uncommon to run with a retention.ms setting small
enough that it can make you lose messages when your consumers can't catch
up. If you are concerned with data loss, I think the cost investment into
hardware is generally worth it.
I think your use case might benefit from setting `retention.bytes` to
ensure you don't go over a specific size and a higher retention.ms. I
assume that might be more deterministic as it is likely you have a better
idea of how much data these files will be (and can get) rather than how
long they'd take to process.
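
As a concrete illustration, a topic-level override along these lines (the
numbers below are placeholders, not recommendations) caps the size per
partition while keeping a generous time-based backstop:
```
# hypothetical topic-level overrides
retention.bytes=53687091200   # ~50 GB per partition; size-based deletion applies first
retention.ms=604800000        # 7 days as the time-based backstop
```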

In any case, I think it's an exception to have to manually configure and
modify retention.ms in real time according to consumer lag. This metric (if
enabled) would be the highest cardinality metric in the server, as it is
per consumer group *and* partition. I know the current proposal suggests we
enable it through a whitelist config, but I think that would not be intuitive
to users and I'm not sure if it's a good idea to guard metrics according to
configurations.
In general, I believe we should aim to limit the raw number of metrics
exposed from the broker when there is another way to solve the problem.

I think the metric should go on the broker side, in case the consumer
> is not even instantiated, or it is crashing in a loop.

We would need *all* consumers in the consumer group to not be available in
order to not have the information exposed. Also, it is generally expected
to have your consumers run all the time (Kafka is a real-time streaming
platform) and batch use cases are the exception.
If they are all crashing in a loop, there is an outage to be solved and you
should increase your retention if there is a chance it deleted unconsumed
data.
Because of the rareness of needing the information in real-time, I still
think having it in the consumer is a good approach.

Let me know if that makes sense.

Thanks,
Stanislav

On Sun, Jul 28, 2019 at 5:00 PM Jose M  wrote:

> Hello,
>
> Thanks for taking the time to review my KIP!
>
> I will describe some production scenarios I faced to better explain
> the reasons for this KIP.
>
> * Use case 1: batch processing of files.
> A batch job produces huge files that must be processed. Each line of
> the file will be a message produced to a topic. It means the topic
> storing these messages will go from 0 lag to, let's say, 5 million lag in
> a few seconds. I will adjust the retention time on the topic based on
> the processing rate of the consumer of this topic. Ex: 5 million
> messages at 100 TPS needs ~14 hours of retention time. In practice we set
> up a bigger retention time, just in case. If a second file arrives
> before the first one has been processed and the processing rate is
> slower than I thought, I will lose the end of the first file without
> notice.
>
> * Use case 2: application facing network errors.
> The application consumes messages from an input topic, processes them and
> pushes them to an external system (ex: a webservice). If there are
> connectivity problems between my Kafka consumer and the external
> webservice, the lag of the application will grow. As I have alerting
> rules on records-lag-max, I will be aware that the backlog of the topic is
> above a limit. I will take action as in the previous example, and I
> will adjust the retention time on the topic based on the processing rate.
> If the processing rate is not constant, due to the network
> connectivity problem, the retention time may not be enough and I will
> lose messages.
>
> In both cases, I don't know if I've lost messages or not. I suspect
> so, but I cannot give an accurate number of messages lost, or
> guarantee I have not lost any of them.
>
> I could solve both use cases by setting up an oversized retention time for
> the topics, but in practice I'm limited by the hardware resources.
>
> One of the reasons I've opened this KIP is that I think the
> implementation should be doable. The broker has all the information
> needed (expired offset and last consumed offset). Though I have
> questions about the impact on performance, which is why I hesitate
> between proposing this new metric as the default for all consumers and
> offering it only to the consumers that request it through configuration.
>
> As you said, the motivation is to know when, which, and how many
> messages a consumer has missed because they have been deleted. I
> think it is possible to return the exact number of messages missed due
> to the retention policy.
>
> I think the metric should go on the broker side, in case the consumer
> is not even instantiated, or it is crashing in a loop.
>
> Please let me know what you think.
>
>
> Thanks,
> Jose M
>
>
>
> On Sat, Jul 27, 2019 at 4:22 PM Stanislav Kozlovski
>  wrote:
> >
> > Hey Jose,
> >
> > Thanks for the KIP.
> >
> > I think that Colin was referring to an existing client metric called
> >
> 

Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-28 Thread Jose M
Hello,

Thanks for taking the time to review my KIP!

I will describe some production scenarios I faced to better explain
the reasons for this KIP.

* Use case 1: batch processing of files.
A batch job produces huge files that must be processed. Each line of
the file will be a message produced to a topic. It means the topic
storing these messages will go from 0 lag to, let's say, 5 million lag in
a few seconds. I will adjust the retention time on the topic based on
the processing rate of the consumer of this topic. Ex: 5 million
messages at 100 TPS needs ~14 hours of retention time. In practice we set
up a bigger retention time, just in case. If a second file arrives
before the first one has been processed and the processing rate is
slower than I thought, I will lose the end of the first file without
notice.

* Use case 2: application facing network errors.
The application consumes messages from an input topic, processes them and
pushes them to an external system (ex: a webservice). If there are
connectivity problems between my Kafka consumer and the external
webservice, the lag of the application will grow. As I have alerting
rules on records-lag-max, I will be aware that the backlog of the topic is
above a limit. I will take action as in the previous example, and I
will adjust the retention time on the topic based on the processing rate.
If the processing rate is not constant, due to the network
connectivity problem, the retention time may not be enough and I will
lose messages.

In both cases, I don't know if I've lost messages or not. I suspect
so, but I cannot give an accurate number of messages lost, or
guarantee I have not lost any of them.

I could solve both use cases by setting up an oversized retention time for
the topics, but in practice I'm limited by the hardware resources.

One of the reasons I've opened this KIP is that I think the
implementation should be doable. The broker has all the information
needed (expired offset and last consumed offset). Though I have
questions about the impact on performance, which is why I hesitate
between proposing this new metric as the default for all consumers and
offering it only to the consumers that request it through configuration.

As you said, the motivation is to know when, which, and how many
messages a consumer has missed because they have been deleted. I
think it is possible to return the exact number of messages missed due
to the retention policy.

I think the metric should go on the broker side, in case the consumer
is not even instantiated, or it is crashing in a loop.

Please let me know what you think.


Thanks,
Jose M



On Sat, Jul 27, 2019 at 4:22 PM Stanislav Kozlovski
 wrote:
>
> Hey Jose,
>
> Thanks for the KIP.
>
> I think that Colin was referring to an existing client metric called
> "kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",name=records-lag-max",
> exposed on the consumer application.
> You should be able to use that to get a sense of how far behind your
> consumer is.
>
> That may not solve the motivation for the KIP, though, which as far as I
> understand is to have a way to know when a (and which) consumer has missed
> messages because they have been deleted.
> I also assume that this is happening because consumers are not long-running
> but rather started every once in a while. Let me know if that is correct.
> The consumer code does
> ```
> return topicPartitionState.highWatermark == null ? null :
> topicPartitionState.highWatermark - topicPartitionState.position.offset;
> ```
> to calculate the partition lag. I believe,
> `topicPartitionState.position.offset` is the latest offset consumed by the
> consumer group.
> In the case of a start-up of a consumer group that has lagged behind and
> lost messages due to retention, the very first reading of that metric may
> be very high and perhaps misleading, iff
> `TopicPartitionState#logStartOffset > TopicPartitionState#position`
>
> It would be good to expand on the motivation on when/why a consumer would
> miss messages due to retention taking effect.
>
> In any case, I think this might be better suited as a client metric. The
> consumer should have all the necessary information, as it has the start
> offset of the log and the group's latest offset.
>
> Thanks,
> Stanislav
>
> On Wed, Jul 24, 2019 at 6:00 PM Jose M  wrote:
>
> > Hello Kamal,
> >
> > The compacted topics are excluded from the KIP, because users of compacted
> > topics are mainly interested on the last state for a certain key, and can
> > afford to miss intermediary states. Technically is possible to know if the
> > topic is compacted through "log.config.compact" attribute. Thanks a lot for
> > your feedback!
> >
> > Ive updated the KIP to precise:
> >
> >- compacted topics are excluded of the KIP.
> >- instead of logging on the broker, I propose to create a new metric,
> >following Colin's comment (thanks a lot!)
> >
> > Thanks,
> >
> > Jose
> >
> > On Tue, Jul 23, 2019 at 11:45 AM Kamal Chandraprakash <
> > 

Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-27 Thread Stanislav Kozlovski
Hey Jose,

Thanks for the KIP.

I think that Colin was referring to an existing client metric called
"kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",name=records-lag-max",
exposed on the consumer application.
You should be able to use that to get a sense of how far behind your
consumer is.

That may not solve the motivation for the KIP, though, which as far as I
understand is to have a way to know when (and which) consumer has missed
messages because they have been deleted.
I also assume that this is happening because consumers are not long-running
but rather started every once in a while. Let me know if that is correct.
The consumer code does
```
return topicPartitionState.highWatermark == null ? null :
topicPartitionState.highWatermark - topicPartitionState.position.offset;
```
to calculate the partition lag. I believe,
`topicPartitionState.position.offset` is the latest offset consumed by the
consumer group.
In the case of a start-up of a consumer group that has lagged behind and
lost messages due to retention, the very first reading of that metric may
be very high and perhaps misleading, iff
`TopicPartitionState#logStartOffset > TopicPartitionState#position`
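
To make that distinction concrete, a rough sketch (using simplified names for
the fields mentioned above; this is only illustrative, not the actual consumer
code):
```
// lag as reported today: distance to the high watermark, regardless of whether
// the missing records still exist on the broker
long lag = highWatermark - position;

// records that can never be consumed because retention already deleted them;
// non-zero exactly when logStartOffset > position
long lost = Math.max(0, logStartOffset - position);

// records that are still available to catch up on
long catchUp = highWatermark - Math.max(logStartOffset, position);
```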

It would be good to expand on the motivation on when/why a consumer would
miss messages due to retention taking effect.

In any case, I think this might be better suited as a client metric. The
consumer should have all the necessary information, as it has the start
offset of the log and the group's latest offset.

Thanks,
Stanislav

On Wed, Jul 24, 2019 at 6:00 PM Jose M  wrote:

> Hello Kamal,
>
> The compacted topics are excluded from the KIP, because users of compacted
> topics are mainly interested in the last state for a certain key, and can
> afford to miss intermediary states. Technically, it is possible to know if the
> topic is compacted through the "log.config.compact" attribute. Thanks a lot for
> your feedback!
>
> I've updated the KIP to clarify:
>
>    - compacted topics are excluded from the KIP.
>    - instead of logging on the broker, I propose to create a new metric,
>    following Colin's comment (thanks a lot!)
>
> Thanks,
>
> Jose
>
> On Tue, Jul 23, 2019 at 11:45 AM Kamal Chandraprakash <
> kamal.chandraprak...@gmail.com> wrote:
>
> > Jose,
> >
> > How do you differentiate the compaction topics from the time
> retention
> > topics? Deleting a message due to compaction policy is a valid case
> > and users won't be interested in monitoring/reading those deleted
> messages.
> >
> > Thanks,
> > Kamal
> >
> > On Tue, Jul 23, 2019 at 4:00 AM Jose M  wrote:
> >
> > > Hi Colin,
> > >
> > > Thanks a lot for your feedback. Please note that I only propose to log
> > when
> > > a message is lost this for a set of consumer groups, not as default
> > > behaviour for all consumer groups.
> > > But in fact, I agree with you that to log a line per message expired
> can
> > be
> > > quite lot, and that is not the better way do it. I can propose to add a
> > > dedicated JMX metric of type counter "expired messages" per consumer
> > group.
> > > What do you think ?
> > >
> > > About monitoring the lag to ensure that messages are not lost, I know
> > that
> > > is what clients do, to set up alerting when the lag is above a
> threshold.
> > > But even if the alert is triggered, we dont know if messages have been
> > lost
> > > or not. Implementing this KIP clients would know if something has been
> > > missed or not.
> > >
> > >
> > > Thanks,
> > >
> > >
> > > Jose
> > >
> > > On Mon, Jul 22, 2019 at 5:51 PM Colin McCabe 
> wrote:
> > >
> > > > Hi Jose,
> > > >
> > > > One issue that I see here is that the number of log messages could be
> > > > huge.  I've seen people create tens of thousands of consumer groups.
> > > > People can also have settings that create pretty small log files.  A
> > > > message per log file per group could be quite a lot of messages.
> > > >
> > > > A log message on the broker is also not that useful for detecting bad
> > > > client behavior.  People generally only look at the server logs after
> > > they
> > > > become aware that something is wrong through some other means.
> > > >
> > > > Perhaps the clients should just monitor their lag?  There is a JMX
> > metric
> > > > for this, which means it can be hooked into traditional metrics /
> > > reporting
> > > > systems.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > On Mon, Jul 22, 2019, at 03:12, Jose M wrote:
> > > > > Hello,
> > > > >
> > > > > I didn't get any feedback on this small KIP-490
> > > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > > > >.
> > > > > In summary, I propose a way to be noticed when messages are being
> > > > > removed
> > > > > due to retention policy, without being consumed by a given consumer
> > > > > group.
> > > > > It will be useful to realize that some important messages have been

Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-24 Thread Jose M
Hello Kamal,

The compacted topics are excluded from the KIP, because users of compacted
topics are mainly interested in the last state for a certain key, and can
afford to miss intermediary states. Technically, it is possible to know if the
topic is compacted through the "log.config.compact" attribute. Thanks a lot for
your feedback!
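
From a client's point of view, one way to check this is to read the topic's
cleanup.policy through the AdminClient; a rough sketch (bootstrap servers and
topic name are placeholders):
```
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

// Illustrative only: returns true if the topic uses log compaction.
public final class CompactionCheck {
    public static boolean isCompacted(String bootstrapServers, String topicName) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);
            // cleanup.policy can be "delete", "compact", or "compact,delete"
            return config.get("cleanup.policy").value().contains("compact");
        }
    }
}
```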

I've updated the KIP to clarify:

   - compacted topics are excluded from the KIP.
   - instead of logging on the broker, I propose to create a new metric,
   following Colin's comment (thanks a lot!)

Thanks,

Jose

On Tue, Jul 23, 2019 at 11:45 AM Kamal Chandraprakash <
kamal.chandraprak...@gmail.com> wrote:

> Jose,
>
> How do you differentiate the compaction topics from the time retention
> topics? Deleting a message due to compaction policy is a valid case
> and users won't be interested in monitoring/reading those deleted messages.
>
> Thanks,
> Kamal
>
> On Tue, Jul 23, 2019 at 4:00 AM Jose M  wrote:
>
> > Hi Colin,
> >
> > Thanks a lot for your feedback. Please note that I only propose to log
> when
> > a message is lost this for a set of consumer groups, not as default
> > behaviour for all consumer groups.
> > But in fact, I agree with you that to log a line per message expired can
> be
> > quite lot, and that is not the better way do it. I can propose to add a
> > dedicated JMX metric of type counter "expired messages" per consumer
> group.
> > What do you think ?
> >
> > About monitoring the lag to ensure that messages are not lost, I know
> that
> > is what clients do, to set up alerting when the lag is above a threshold.
> > But even if the alert is triggered, we dont know if messages have been
> lost
> > or not. Implementing this KIP clients would know if something has been
> > missed or not.
> >
> >
> > Thanks,
> >
> >
> > Jose
> >
> > On Mon, Jul 22, 2019 at 5:51 PM Colin McCabe  wrote:
> >
> > > Hi Jose,
> > >
> > > One issue that I see here is that the number of log messages could be
> > > huge.  I've seen people create tens of thousands of consumer groups.
> > > People can also have settings that create pretty small log files.  A
> > > message per log file per group could be quite a lot of messages.
> > >
> > > A log message on the broker is also not that useful for detecting bad
> > > client behavior.  People generally only look at the server logs after
> > they
> > > become aware that something is wrong through some other means.
> > >
> > > Perhaps the clients should just monitor their lag?  There is a JMX
> metric
> > > for this, which means it can be hooked into traditional metrics /
> > reporting
> > > systems.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Mon, Jul 22, 2019, at 03:12, Jose M wrote:
> > > > Hello,
> > > >
> > > > I didn't get any feedback on this small KIP-490
> > > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > > >.
> > > > In summary, I propose a way to be noticed when messages are being
> > > > removed
> > > > due to retention policy, without being consumed by a given consumer
> > > > group.
> > > > It will be useful to realize that some important messages have been
> > > > lost.
> > > >
> > > > As Im new to the codebase, I have technical questions about how to
> > > achieve
> > > > this, but before going deeper, I would like your feedback on the
> > feature.
> > > >
> > > > Thanks a lot,
> > > >
> > > >
> > > > Jose Morales
> > > >
> > > > On Sun, Jul 14, 2019 at 12:51 AM Jose M  wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I would like to know what do you think on KIP-490:
> > > > >
> > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > > > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+expired
> > > >
> > > > >
> > > > >
> > > > > Thanks a lot !
> > > > > --
> > > > > Jose M
> > > > >
> > > >
> > > >
> > > > --
> > > > J
> > > >
> > >
> >
> >
> > --
> > J
> >
>


-- 
J


Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-23 Thread Kamal Chandraprakash
Jose,

How do you differentiate the compacted topics from the time-based retention
topics? Deleting a message due to the compaction policy is a valid case,
and users won't be interested in monitoring/reading those deleted messages.

Thanks,
Kamal

On Tue, Jul 23, 2019 at 4:00 AM Jose M  wrote:

> Hi Colin,
>
> Thanks a lot for your feedback. Please note that I only propose to log when
> a message is lost for a set of consumer groups, not as the default
> behaviour for all consumer groups.
> But in fact, I agree with you that logging a line per expired message can be
> quite a lot, and that it is not the best way to do it. I can propose to add a
> dedicated JMX metric of type counter "expired messages" per consumer group.
> What do you think?
>
> About monitoring the lag to ensure that messages are not lost, I know that
> is what clients do, setting up alerting when the lag is above a threshold.
> But even if the alert is triggered, we don't know if messages have been lost
> or not. Implementing this KIP, clients would know whether something has been
> missed or not.
>
>
> Thanks,
>
>
> Jose
>
> On Mon, Jul 22, 2019 at 5:51 PM Colin McCabe  wrote:
>
> > Hi Jose,
> >
> > One issue that I see here is that the number of log messages could be
> > huge.  I've seen people create tens of thousands of consumer groups.
> > People can also have settings that create pretty small log files.  A
> > message per log file per group could be quite a lot of messages.
> >
> > A log message on the broker is also not that useful for detecting bad
> > client behavior.  People generally only look at the server logs after
> they
> > become aware that something is wrong through some other means.
> >
> > Perhaps the clients should just monitor their lag?  There is a JMX metric
> > for this, which means it can be hooked into traditional metrics /
> reporting
> > systems.
> >
> > best,
> > Colin
> >
> >
> > On Mon, Jul 22, 2019, at 03:12, Jose M wrote:
> > > Hello,
> > >
> > > I didn't get any feedback on this small KIP-490
> > > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > >.
> > > In summary, I propose a way to be noticed when messages are being
> > > removed
> > > due to retention policy, without being consumed by a given consumer
> > > group.
> > > It will be useful to realize that some important messages have been
> > > lost.
> > >
> > > As Im new to the codebase, I have technical questions about how to
> > achieve
> > > this, but before going deeper, I would like your feedback on the
> feature.
> > >
> > > Thanks a lot,
> > >
> > >
> > > Jose Morales
> > >
> > > On Sun, Jul 14, 2019 at 12:51 AM Jose M  wrote:
> > >
> > > > Hello,
> > > >
> > > > I would like to know what do you think on KIP-490:
> > > >
> > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > > > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+expired
> > >
> > > >
> > > >
> > > > Thanks a lot !
> > > > --
> > > > Jose M
> > > >
> > >
> > >
> > > --
> > > J
> > >
> >
>
>
> --
> J
>


Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-22 Thread Jose M
Hi Colin,

Thanks a lot for your feedback. Please note that I only propose to log when
a message is lost for a set of consumer groups, not as the default
behaviour for all consumer groups.
But in fact, I agree with you that logging a line per expired message can be
quite a lot, and that it is not the best way to do it. I can propose to add a
dedicated JMX metric of type counter "expired messages" per consumer group.
What do you think?
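
Purely as an illustration of the shape this could take (the MBean name and the
update rule below are invented for this discussion; nothing like this exists in
Kafka today):
```
kafka.server:type=group-coordinator-metrics,name=expired-messages-count,group={group-id}

// updated whenever the log start offset moves past a group's committed offset:
expiredMessages(group, partition) += newLogStartOffset - max(committedOffset, oldLogStartOffset)
```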

About monitoring the lag to ensure that messages are not lost, I know that
is what clients do, setting up alerting when the lag is above a threshold.
But even if the alert is triggered, we don't know if messages have been lost
or not. Implementing this KIP, clients would know whether something has been
missed or not.


Thanks,


Jose

On Mon, Jul 22, 2019 at 5:51 PM Colin McCabe  wrote:

> Hi Jose,
>
> One issue that I see here is that the number of log messages could be
> huge.  I've seen people create tens of thousands of consumer groups.
> People can also have settings that create pretty small log files.  A
> message per log file per group could be quite a lot of messages.
>
> A log message on the broker is also not that useful for detecting bad
> client behavior.  People generally only look at the server logs after they
> become aware that something is wrong through some other means.
>
> Perhaps the clients should just monitor their lag?  There is a JMX metric
> for this, which means it can be hooked into traditional metrics / reporting
> systems.
>
> best,
> Colin
>
>
> On Mon, Jul 22, 2019, at 03:12, Jose M wrote:
> > Hello,
> >
> > I didn't get any feedback on this small KIP-490
> > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> >.
> > In summary, I propose a way to be noticed when messages are being
> > removed
> > due to retention policy, without being consumed by a given consumer
> > group.
> > It will be useful to realize that some important messages have been
> > lost.
> >
> > As Im new to the codebase, I have technical questions about how to
> achieve
> > this, but before going deeper, I would like your feedback on the feature.
> >
> > Thanks a lot,
> >
> >
> > Jose Morales
> >
> > On Sun, Jul 14, 2019 at 12:51 AM Jose M  wrote:
> >
> > > Hello,
> > >
> > > I would like to know what do you think on KIP-490:
> > >
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+expired
> >
> > >
> > >
> > > Thanks a lot !
> > > --
> > > Jose M
> > >
> >
> >
> > --
> > J
> >
>


-- 
J


Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-22 Thread Colin McCabe
Hi Jose,

One issue that I see here is that the number of log messages could be huge.  
I've seen people create tens of thousands of consumer groups.  People can also 
have settings that create pretty small log files.  A message per log file per 
group could be quite a lot of messages.

A log message on the broker is also not that useful for detecting bad client 
behavior.  People generally only look at the server logs after they become 
aware that something is wrong through some other means.

Perhaps the clients should just monitor their lag?  There is a JMX metric for 
this, which means it can be hooked into traditional metrics / reporting systems.

best,
Colin


On Mon, Jul 22, 2019, at 03:12, Jose M wrote:
> Hello,
> 
> I didn't get any feedback on this small KIP-490
> .
> In summary, I propose a way to be notified when messages are being removed
> due to retention policy, without being consumed by a given consumer group.
> It will be useful to realize that some important messages have been lost.
>
> As I'm new to the codebase, I have technical questions about how to achieve
> this, but before going deeper, I would like your feedback on the feature.
> 
> Thanks a lot,
> 
> 
> Jose Morales
> 
> On Sun, Jul 14, 2019 at 12:51 AM Jose M  wrote:
> 
> > Hello,
> >
> > I would like to know what do you think on KIP-490:
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> > 
> >
> >
> > Thanks a lot !
> > --
> > Jose M
> >
> 
> 
> -- 
> J
>


Re: [DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-22 Thread Jose M
Hello,

I didn't get any feedback on this small KIP-490
.
In summary, I propose a way to be notified when messages are being removed
due to retention policy, without being consumed by a given consumer group.
It will be useful to realize that some important messages have been lost.

As I'm new to the codebase, I have technical questions about how to achieve
this, but before going deeper, I would like your feedback on the feature.

Thanks a lot,


Jose Morales

On Sun, Jul 14, 2019 at 12:51 AM Jose M  wrote:

> Hello,
>
> I would like to know what you think of KIP-490:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted
> 
>
>
> Thanks a lot !
> --
> Jose M
>


-- 
J


[DISCUSS] KIP-490: log when consumer groups lose a message because offset has been deleted

2019-07-13 Thread Jose M
Hello,

I would like to know what you think of KIP-490:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-490%3A+log+when+consumer+groups+lose+a+message+because+offset+has+been+deleted



Thanks a lot !
-- 
Jose M