Re: [DISCUSS] KIP-489 Kafka Consumer Record Latency Metric

Sean Glover Fri, 17 Jan 2020 13:15:24 -0800

Hi Habib,

With regards to your earlier question about timezones, I've updated the KIP
to remove the LatencyTime abstraction since it is no longer relevant.  I
added a note about epoch time as well.


Thanks,
Sean

On Wed, Jan 15, 2020 at 8:28 AM Habib Nahas <[email protected]> wrote:

> Hi Sean,
>
> Thats great, look forward to it.
>
> Thanks,
> Habib
>
> On Tue, Jan 14, 2020, at 2:55 PM, Sean Glover wrote:
> > Hi Habib,
> >
> > Thank you for the reminder. I'll update the KIP this week and address the
> > feedback from you and Gokul.
> >
> > Regards,
> > Sean
> >
> > On Tue, Jan 14, 2020 at 9:06 AM Habib Nahas <[email protected]> wrote:
> >
> > > Any chance of an update on the KIP? We are interested in seeing this
> move
> > > forward.
> > >
> > > Thanks,
> > > Habib
> > > Sr SDE, AWS
> > >
> > > On Wed, Dec 18, 2019, at 3:27 PM, Habib Nahas wrote:
> > > > Thanks Sean. Look forward to the updated KIP.
> > > >
> > > > Regards,
> > > > Habib
> > > >
> > > > On Fri, Dec 13, 2019, at 6:22 AM, Sean Glover wrote:
> > > > > Hi,
> > > > >
> > > > > After my last reply I had a nagging feeling something wasn't right,
> > > and I
> > > > > remembered that epoch time is UTC. This makes the discussion about
> > > > > timezone irrelevant, since we're always using UTC. This makes the
> need
> > > for
> > > > > the LatencyTime interface that I proposed in the design irrelevant
> as
> > > well,
> > > > > since I can no longer think about how that might be useful. I'll
> update
> > > > > the KIP. I'll also review KIP-32 to understand message timestamps
> > > better
> > > > > so I can explain the different types of latency results that could
> be
> > > > > reported with this metric.
> > > > >
> > > > > Regards,
> > > > > Sean
> > > > >
> > > > > On Thu, Dec 12, 2019 at 6:25 PM Sean Glover <
> [email protected]
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Habib,
> > > > > >
> > > > > > Thanks for question! If the consumer is in a different timezone
> than
> > > the
> > > > > > timezone used to produce messages to a partition then you can
> use an
> > > > > > implementation of LatencyTime to return the current time of that
> > > timezone.
> > > > > > The current design assumes that messages produced to a partition
> > > must all
> > > > > > be produced from the same timezone. If timezone metadata were
> > > encoded into
> > > > > > each message then it would be possible to automatically
> determine the
> > > > > > source timezone and calculate latency, however the current design
> > > will not
> > > > > > pass individual messages into LatencyTime to retrieve message
> > > metadata.
> > > > > > Instead, the LatencyTime.getWallClockTime method is only called
> once
> > > per
> > > > > > fetch request response per partition and then the metric is
> recorded
> > > once
> > > > > > the latency calculation is complete. This follows the same
> design as
> > > the
> > > > > > current consumer lag metric which calculates offset lag based on
> the
> > > last
> > > > > > message of the fetch request response for a partition. Since the
> > > metric is
> > > > > > just an aggregate (max/mean) over some time window we only need
> to
> > > > > > occasionally calculate latency, which will have negligible
> impact on
> > > the
> > > > > > performance of consumer polling.
> > > > > >
> > > > > > A simple implementation of LatencyTime that returns wall clock
> time
> > > for
> > > > > > the Asia/Singapore timezone for all partitions:
> > > > > >
> > > > > > import java.time.*;
> > > > > >
> > > > > > class SingaporeTime implements LatencyTime {
> > > > > > ZoneId zoneSingapore = ZoneId.of("Asia/Singapore");
> > > > > > Clock clockSingapore = Clock.system(zoneSingapore);
> > > > > >
> > > > > > @Override
> > > > > > public long getWallClockTime(TopicPartition tp) {
> > > > > > return clockSingapore.instant.getEpochSecond();
> > > > > > }
> > > > > >
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > Regards,
> > > > > > Sean
> > > > > >
> > > > > >
> > > > > > On Thu, Dec 12, 2019 at 6:18 AM Habib Nahas <[email protected]>
> wrote:
> > > > > >
> > > > > >> Hi Sean,
> > > > > >>
> > > > > >> Thanks for the KIP.
> > > > > >>
> > > > > >> As I understand it users are free to set their own timestamp on
> > > > > >> ProducerRecord. What is the recommendation for the proposed
> metric
> > > in a
> > > > > >> scenario where the user sets this timestamp in timezone A and
> > > consumes the
> > > > > >> record in timezone B. Its not clear to me if a custom
> > > implementation of
> > > > > >> LatencyTime will help here.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Habib
> > > > > >>
> > > > > >> On Wed, Dec 11, 2019, at 4:52 PM, Sean Glover wrote:
> > > > > >> > Hello again,
> > > > > >> >
> > > > > >> > There has been some interest in this KIP recently. I'm
> bumping the
> > > > > >> thread
> > > > > >> > to encourage feedback on the design.
> > > > > >> >
> > > > > >> > Regards,
> > > > > >> > Sean
> > > > > >> >
> > > > > >> > On Mon, Jul 15, 2019 at 9:01 AM Sean Glover <
> > > [email protected]>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > To hopefully spark some discussion I've copied the
> motivation
> > > section
> > > > > >> from
> > > > > >> > > the KIP:
> > > > > >> > >
> > > > > >> > > Consumer lag is a useful metric to monitor how many records
> are
> > > > > >> queued to
> > > > > >> > > be processed. We can look at individual lag per partition
> or we
> > > may
> > > > > >> > > aggregate metrics. For example, we may want to monitor what
> the
> > > > > >> maximum lag
> > > > > >> > > of any particular partition in our consumer subscription so
> we
> > > can
> > > > > >> identify
> > > > > >> > > hot partitions, caused by an insufficient producing
> partitioning
> > > > > >> strategy.
> > > > > >> > > We may want to monitor a sum of lag across all partitions
> so we
> > > have a
> > > > > >> > > sense as to our total backlog of messages to consume. Lag in
> > > offsets
> > > > > >> is
> > > > > >> > > useful when you have a good understanding of your messages
> and
> > > > > >> processing
> > > > > >> > > characteristics, but it doesn’t tell us how far behind *in
> > > time* we
> > > > > >> are.
> > > > > >> > > This is known as wait time in queueing theory, or more
> > > informally it’s
> > > > > >> > > referred to as latency.
> > > > > >> > >
> > > > > >> > > The latency of a message can be defined as the difference
> > > between when
> > > > > >> > > that message was first produced to when the message is
> received
> > > by a
> > > > > >> > > consumer. The latency of records in a partition correlates
> with
> > > lag,
> > > > > >> but a
> > > > > >> > > larger lag doesn’t necessarily mean a larger latency. For
> > > example, a
> > > > > >> topic
> > > > > >> > > consumed by two separate application consumer groups A and B
> > > may have
> > > > > >> > > similar lag, but different latency per partition.
> Application A
> > > is a
> > > > > >> > > consumer which performs CPU intensive business logic on each
> > > message
> > > > > >> it
> > > > > >> > > receives. It’s distributed across many consumer group
> members to
> > > > > >> handle the
> > > > > >> > > load quickly enough, but since its processing time is
> slower,
> > > it takes
> > > > > >> > > longer to process each message per partition. Meanwhile,
> > > Application
> > > > > >> B is
> > > > > >> > > a consumer which performs a simple ETL operation to land
> > > streaming
> > > > > >> data in
> > > > > >> > > another system, such as HDFS. It may have similar lag to
> > > Application
> > > > > >> A, but
> > > > > >> > > because it has a faster processing time its latency per
> > > partition is
> > > > > >> > > significantly less.
> > > > > >> > >
> > > > > >> > > If the Kafka Consumer reported a latency metric it would be
> > > easier to
> > > > > >> > > build Service Level Agreements (SLAs) based on
> non-functional
> > > > > >> requirements
> > > > > >> > > of the streaming system. For example, the system must never
> > > have a
> > > > > >> latency
> > > > > >> > > of greater than 10 minutes. This SLA could be used in
> monitoring
> > > > > >> alerts or
> > > > > >> > > as input to automatic scaling solutions.
> > > > > >> > >
> > > > > >> > > On Thu, Jul 11, 2019 at 12:36 PM Sean Glover <
> > > > > >> [email protected]>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > >> Hi kafka-dev,
> > > > > >> > >>
> > > > > >> > >> I've created KIP-489 as a proposal for adding latency
> metrics
> > > to the
> > > > > >> > >> Kafka Consumer in a similar way as record-lag metrics are
> > > > > >> implemented.
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >>
> > > > > >>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/489%3A+Kafka+Consumer+Record+Latency+Metric
> > > > > >> > >>
> > > > > >> > >> Regards,
> > > > > >> > >> Sean
> > > > > >> > >>
> > > > > >> > >> --
> > > > > >> > >> Principal Engineer, Lightbend, Inc.
> > > > > >> > >>
> > > > > >> > >> <http://lightbend.com>
> > > > > >> > >>
> > > > > >> > >> @seg1o <https://twitter.com/seg1o>, in/seanaglover
> > > > > >> > >> <https://www.linkedin.com/in/seanaglover/>
> > > > > >> > >>
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > > Principal Engineer, Lightbend, Inc.
> > > > > >> > >
> > > > > >> > > <http://lightbend.com>
> > > > > >> > >
> > > > > >> > > @seg1o <https://twitter.com/seg1o>, in/seanaglover
> > > > > >> > > <https://www.linkedin.com/in/seanaglover/>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Sean Glover
> > > > > Principal Engineer, Alpakka, Lightbend, Inc. <
> https://lightbend.com>
> > > > > @seg1o <https://twitter.com/seg1o>, in/seanaglover
> > > > > <https://www.linkedin.com/in/seanaglover/>
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-489 Kafka Consumer Record Latency Metric

Reply via email to