Re: [DISCUSS] FLIP-179: Expose Standardized Operator Metrics

Arvid Heise Fri, 23 Jul 2021 04:57:55 -0700

Hi Becket,

I still can't follow your view on the metric being a global variable or
your concern that it is confusing to users. Nevertheless, I like your
proposal with having an additional collect method.


I was thinking that
> SourceOutput is going to have an additional method of collect(T Record,
> Long timestamp, Long fetchTime). So people can just pass in the fetch time
> directly when they emit a record, regardless of using SourceReaderBase or
> not.
>

Can we think of other use cases for the fetchTime parameter beyond metrics
in the future? If so, it would make an even stronger case.

I'll update the PR with your proposals.

Best,

Arvid

On Fri, Jul 23, 2021 at 12:08 PM Becket Qin <becket....@gmail.com> wrote:

> Regarding the generic type v.s. class/subclasses of Metadata.
>
> I think generic types usually make sense if the framework/abstract class
> itself does not look into the instances, but just pass them from one user
> logic to another. Otherwise, interfaces or class/subclasses would be
> preferred.
>
> In our case, it depends on whether we expect the SourceReaderBase to look
> into the MetaData. At this point, it does not. But it seems possible that
> in the future it may look into MetaData. Therefore I think the class /
> subclass pattern would be better.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Fri, Jul 23, 2021 at 5:54 PM Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Arvid,
> >
> > > I'm not sure if I follow the global variable argument, could you
> > elaborate? Are you referring specifically to the SettableGauge? How is
> that
> > different from a Counter or Meter?
> > What I meant is that the fetch lag computing logic can either get the
> > information required from method argument or something like a static
> global
> > variable. We are essentially trying to reuse the metric as a static
> global
> > variable. It seems not a common pattern in most systems. It is a little
> > counterintuitive that a gauge reported to the metric system would be used
> > by the program main logic later on as a variable.
> >
> > > We could do that. That would remove the gauge from the MetricGroup,
> > right? The main downside is that sources that do not use SourceReaderBase
> > cannot set the metric anymore. So I'd rather keep the current way and
> > extend it with the metadata extension.
> > Yes, that would remove the gauge from the MetricGroup. I was thinking
> that
> > SourceOutput is going to have an additional method of collect(T Record,
> > Long timestamp, Long fetchTime). So people can just pass in the fetch
> time
> > directly when they emit a record, regardless of using SourceReaderBase or
> > not.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Jul 22, 2021 at 3:46 PM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> >> The only histogram implementation available to use are those by
> >> dropwizard, and they do some lock-free synchronization stuff that so
> far we
> >> wanted to keep out of hot paths (this applis to both reading and
> writing);
> >> we have however never made benchmarks.
> >> But it is reasonable to assume that they are way more expensive than the
> >> alternatives (in the ideal case just being a getter).
> >> You'd pay that cost irrespective of whether a reporter is enabled or
> not,
> >> which is another thing we so far wanted to prevent.
> >> Finally, histograms are problematic because they are 10x more expensive
> >> on the metric backend (because they are effectively 10 metrics), and
> should
> >> be used with extreme caution, in particular because we lack any switch
> to
> >> disable/enable metrics (and I think we are getting to a point where the
> >> metric system becomes unusable for heavy users because of that, where
> >> another histogram isn't helping).
> >>
> >> Overall, at this time I'm against using Histograms.
> >> Furthermore, I believe that we should be able to derive a Histogram from
> >> that supplier if we later one decide differently. We'd just need to poll
> >> the supplier more often.
> >>
> >> On 22/07/2021 09:36, Arvid Heise wrote:
> >>
> >> Hi all,
> >>
> >> @Steven Wu <stevenz...@gmail.com>
> >>
> >>> Regarding "lastFetchTime" latency metric, I found Gauge to be less
> >>> informative as it only captures the last sampling value for each metric
> >>> publish interval (e.g. 60s).
> >>> * Can we make it a histogram? Histograms are more expensive though.
> >>> * Timer [1, 2] is cheaper as it just tracks min, max, avg, count. but
> >>> there
> >>> is no such metric type in Flink
> >>> * Summary metric type [3] (from Prometheus) would be nice too
> >>>
> >> I'd also think that a histogram is much more expressive but the original
> >> FLIP-33 decided against it because of it's cost. @Chesnay Schepler
> >> <ches...@apache.org> could you shed some light on how much more
> >> expensive it is in comparison to a simple gauge? Does it depend on
> whether
> >> a reporter is actually using the metric?
> >> The current interface of this FLIP-179 would actually allow to switch
> the
> >> type of the metric later. But since the metric type is user-facing, we
> need
> >> to have an agreement now.
> >>
> >> @Becket Qin <becket....@gmail.com>
> >>
> >>> In that case, do we still need the metric here? It seems we are
> creating
> >>> a
> >>> "global variable" which users may potentially use. I am wondering how
> >>> much
> >>> additional convenience it provides because it seems easy for people to
> >>> simply pass the fetch time by themselves if they have decided to not
> use
> >>> SourceReaderBase. Also, it looks like we do not have an API pattern
> that
> >>> lets users get the value of a metric and derive another metric. So I
> >>> think
> >>> it is easier for people to understand if LastFetchTimeGauge() is just
> an
> >>> independent metric by itself, instead of being a part of the
> >>> eventTimeFetchLag computation.
> >>>
> >> I'm not sure if I follow the global variable argument, could you
> >> elaborate? Are you referring specifically to the SettableGauge? How is
> that
> >> different from a Counter or Meter?
> >>
> >> With the current design, we could very well add a LastFetchTime metric.
> >> The key point of the current abstraction is that a user gets the much
> >> harder eventTimeFetchLag metric for free, since we already need to
> extract
> >> the event time for other metrics. I think the JavaDoc makes it clear
> what
> >> the intent of the LastFetchTimeGauge is and if not we can improve it.
> >> Btw we have derived metrics already. For example, we have Meters for
> >> byteIn/Out and recordIn/Out. That's already part of FLIP-33.
> >>
> >> Would it make sense to have a more generic metadata type <T> associated
> >>> with the records batch? In some cases, it may be useful to allow the
> >>> Source
> >>> implementation to carry some additional information of the batch to the
> >>> RecordEmitter. For example, the split info of the batch, the sender of
> >>> the
> >>> batch etc. Because the RecordEmitter only takes one record at.a time,
> >>> currently such information needs to be put into each record, which may
> >>> involve a lot of wrapper object creation.
> >>>
> >> I like the idea of having more general metadata and I follow the
> example.
> >> I'm wondering if we could avoid a generic type (since that adds a bit of
> >> complexity to the mental model and usage) by simply encouraging to use a
> >> more specific MetaData subclass as a return type of the method.
> >>
> >> public interface RecordsWithSplitIds<E> {
> >>     @Nullable    default RecordMetadata getMetadata() {
> >>         return null;    }
> >>     ...
> >> }
> >>
> >> public interface RecordMetadata {
> >>     long getLastFetchTime(); // mandatory?}
> >>
> >> And using it as
> >>
> >> public class KafkaRecordMetadata implements RecordMetadata {}
> >>
> >> private static class KafkaPartitionSplitRecords<T> implements
> RecordsWithSplitIds<T> {
> >>     @Override    public KafkaRecordMetadata getMetadata() {
> >>         return metadata;    }
> >> }
> >>
> >> Or do we want to have the generic to explicitly pass it to the
> >> RecordEmitter? Would that metadata be a fourth parameter of
> >> RecordEmitter#emitRecord?
> >>
> >> It might be slightly better if we let the method accept a Supplier in
> this
> >>> case. However, it seems to introduce a parallel channel or a sidepath
> >>> between the user implementation and SourceOutput. I am not sure if this
> >>> is
> >>> the right way to go. Would it be more intuitive if we just add a new
> >>> method
> >>> to the SourceOutput, to allow the FetchTime to be passed in explicitly?
> >>> This would work well with the change I suggested above, which adds a
> >>> generic metadata type <T> to the RecordsWithSplits and passes that to
> the
> >>> RecordEmitter.emitRecord() as an argument.
> >>>
> >>
> >> We could do that. That would remove the gauge from the MetricGroup,
> >> right? The main downside is that sources that do not use
> SourceReaderBase
> >> cannot set the metric anymore. So I'd rather keep the current way and
> >> extend it with the metadata extension.
> >>
> >> Best,
> >>
> >> Arvid
> >>
> >>
> >> On Wed, Jul 21, 2021 at 1:38 PM Becket Qin <becket....@gmail.com>
> wrote:
> >>
> >>> Hey Chesnay,
> >>>
> >>> I think I got what that method was designed for now. Basically the
> >>> motivation is to let the SourceOutput to report the eventTimeFetchLag
> for
> >>> users. At this point, the SourceOutput only has the EventTime, so this
> >>> method provides a way for the users to pass the FetchTime to the
> >>> SourceOutput. This is essentially a context associated with each record
> >>> emitted to the SourceOutput.
> >>>
> >>> It might be slightly better if we let the method accept a Supplier in
> >>> this
> >>> case. However, it seems to introduce a parallel channel or a sidepath
> >>> between the user implementation and SourceOutput. I am not sure if this
> >>> is
> >>> the right way to go. Would it be more intuitive if we just add a new
> >>> method
> >>> to the SourceOutput, to allow the FetchTime to be passed in explicitly?
> >>> This would work well with the change I suggested above, which adds a
> >>> generic metadata type <T> to the RecordsWithSplits and passes that to
> the
> >>> RecordEmitter.emitRecord() as an argument.
> >>>
> >>> What do you think?
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Tue, Jul 20, 2021 at 2:50 PM Chesnay Schepler <ches...@apache.org>
> >>> wrote:
> >>>
> >>> > Would it be easier to understand if the method would accept a
> Supplier
> >>> > instead?
> >>> >
> >>> > On 20/07/2021 05:36, Becket Qin wrote:
> >>> > > In that case, do we still need the metric here? It seems we are
> >>> creating
> >>> > a
> >>> > > "global variable" which users may potentially use. I am wondering
> how
> >>> > much
> >>> > > additional convenience it provides because it seems easy for people
> >>> to
> >>> > > simply pass the fetch time by themselves if they have decided to
> not
> >>> use
> >>> > > SourceReaderBase. Also, it looks like we do not have an API pattern
> >>> that
> >>> > > lets users get the value of a metric and derive another metric. So
> I
> >>> > think
> >>> > > it is easier for people to understand if LastFetchTimeGauge() is
> >>> just an
> >>> > > independent metric by itself, instead of being a part of the
> >>> > > eventTimeFetchLag computation.
> >>> >
> >>> >
> >>> >
> >>>
> >>
> >>
>

Re: [DISCUSS] FLIP-179: Expose Standardized Operator Metrics

Reply via email to