The only histogram implementation available to use are those by dropwizard, and they do some lock-free synchronization stuff that so far we wanted to keep out of hot paths (this applis to both reading and writing); we have however never made benchmarks. But it is reasonable to assume that they are way more expensive than the alternatives (in the ideal case just being a getter). You'd pay that cost irrespective of whether a reporter is enabled or not, which is another thing we so far wanted to prevent. Finally, histograms are problematic because they are 10x more expensive on the metric backend (because they are effectively 10 metrics), and should be used with extreme caution, in particular because we lack any switch to disable/enable metrics (and I think we are getting to a point where the metric system becomes unusable for heavy users because of that, where another histogram isn't helping).

Overall, at this time I'm against using Histograms.
Furthermore, I believe that we should be able to derive a Histogram from that supplier if we later one decide differently. We'd just need to poll the supplier more often.

On 22/07/2021 09:36, Arvid Heise wrote:
Hi all,

@Steven Wu <mailto:stevenz...@gmail.com>

    Regarding "lastFetchTime" latency metric, I found Gauge to be less
    informative as it only captures the last sampling value for each
    metric
    publish interval (e.g. 60s).
    * Can we make it a histogram? Histograms are more expensive though.
    * Timer [1, 2] is cheaper as it just tracks min, max, avg, count.
    but there
    is no such metric type in Flink
    * Summary metric type [3] (from Prometheus) would be nice too

I'd also think that a histogram is much more expressive but the original FLIP-33 decided against it because of it's cost. @Chesnay Schepler <mailto:ches...@apache.org> could you shed some light on how much more expensive it is in comparison to a simple gauge? Does it depend on whether a reporter is actually using the metric? The current interface of this FLIP-179 would actually allow to switch the type of the metric later. But since the metric type is user-facing, we need to have an agreement now.

@Becket Qin <mailto:becket....@gmail.com>

    In that case, do we still need the metric here? It seems we are
    creating a
    "global variable" which users may potentially use. I am wondering
    how much
    additional convenience it provides because it seems easy for people to
    simply pass the fetch time by themselves if they have decided to
    not use
    SourceReaderBase. Also, it looks like we do not have an API
    pattern that
    lets users get the value of a metric and derive another metric. So
    I think
    it is easier for people to understand if LastFetchTimeGauge() is
    just an
    independent metric by itself, instead of being a part of the
    eventTimeFetchLag computation.

I'm not sure if I follow the global variable argument, could you elaborate? Are you referring specifically to the SettableGauge? How is that different from a Counter or Meter?

With the current design, we could very well add a LastFetchTime metric. The key point of the current abstraction is that a user gets the much harder eventTimeFetchLag metric for free, since we already need to extract the event time for other metrics. I think the JavaDoc makes it clear what the intent of the LastFetchTimeGauge is and if not we can improve it. Btw we have derived metrics already. For example, we have Meters for byteIn/Out and recordIn/Out. That's already part of FLIP-33.

    Would it make sense to have a more generic metadata type <T>
    associated
    with the records batch? In some cases, it may be useful to allow
    the Source
    implementation to carry some additional information of the batch
    to the
    RecordEmitter. For example, the split info of the batch, the
    sender of the
    batch etc. Because the RecordEmitter only takes one record at.a time,
    currently such information needs to be put into each record, which may
    involve a lot of wrapper object creation.

I like the idea of having more general metadata and I follow the example. I'm wondering if we could avoid a generic type (since that adds a bit of complexity to the mental model and usage) by simply encouraging to use a more specific MetaData subclass as a return type of the method.
public interface RecordsWithSplitIds<E> {
     @Nullable default RecordMetadatagetMetadata() {
         return null; }
     ...
}
public interface RecordMetadata {
     long getLastFetchTime(); // mandatory? }
And using it as
public class KafkaRecordMetadataimplements RecordMetadata {}
private static class KafkaPartitionSplitRecords<T> implements 
RecordsWithSplitIds<T> {
     @Override public KafkaRecordMetadatagetMetadata() {
         return metadata; }
}
Or do we want to have the generic to explicitly pass it to the RecordEmitter? Would that metadata be a fourth parameter of RecordEmitter#emitRecord?

    It might be slightly better if we let the method accept a Supplier
    in this
    case. However, it seems to introduce a parallel channel or a sidepath
    between the user implementation and SourceOutput. I am not sure if
    this is
    the right way to go. Would it be more intuitive if we just add a
    new method
    to the SourceOutput, to allow the FetchTime to be passed in
    explicitly?
    This would work well with the change I suggested above, which adds a
    generic metadata type <T> to the RecordsWithSplits and passes that
    to the
    RecordEmitter.emitRecord() as an argument.


We could do that. That would remove the gauge from the MetricGroup, right? The main downside is that sources that do not use SourceReaderBase cannot set the metric anymore. So I'd rather keep the current way and extend it with the metadata extension.

Best,

Arvid


On Wed, Jul 21, 2021 at 1:38 PM Becket Qin <becket....@gmail.com <mailto:becket....@gmail.com>> wrote:

    Hey Chesnay,

    I think I got what that method was designed for now. Basically the
    motivation is to let the SourceOutput to report the
    eventTimeFetchLag for
    users. At this point, the SourceOutput only has the EventTime, so this
    method provides a way for the users to pass the FetchTime to the
    SourceOutput. This is essentially a context associated with each
    record
    emitted to the SourceOutput.

    It might be slightly better if we let the method accept a Supplier
    in this
    case. However, it seems to introduce a parallel channel or a sidepath
    between the user implementation and SourceOutput. I am not sure if
    this is
    the right way to go. Would it be more intuitive if we just add a
    new method
    to the SourceOutput, to allow the FetchTime to be passed in
    explicitly?
    This would work well with the change I suggested above, which adds a
    generic metadata type <T> to the RecordsWithSplits and passes that
    to the
    RecordEmitter.emitRecord() as an argument.

    What do you think?

    Thanks,

    Jiangjie (Becket) Qin

    On Tue, Jul 20, 2021 at 2:50 PM Chesnay Schepler
    <ches...@apache.org <mailto:ches...@apache.org>> wrote:

    > Would it be easier to understand if the method would accept a
    Supplier
    > instead?
    >
    > On 20/07/2021 05:36, Becket Qin wrote:
    > > In that case, do we still need the metric here? It seems we
    are creating
    > a
    > > "global variable" which users may potentially use. I am
    wondering how
    > much
    > > additional convenience it provides because it seems easy for
    people to
    > > simply pass the fetch time by themselves if they have decided
    to not use
    > > SourceReaderBase. Also, it looks like we do not have an API
    pattern that
    > > lets users get the value of a metric and derive another
    metric. So I
    > think
    > > it is easier for people to understand if LastFetchTimeGauge()
    is just an
    > > independent metric by itself, instead of being a part of the
    > > eventTimeFetchLag computation.
    >
    >
    >


Reply via email to