Re: [DISCUSS] FLIP-386: Support adding custom metrics in Recovery Spans

Roman Khachatryan Tue, 21 Nov 2023 16:54:14 -0800

Thanks for clarifying, I see your points (although reporting metrics as
spans still seems counter intuitive to me).


As for the aggregation, I'm concerned that it might be unnecessarily
ambiguous: where the aggregation is performed (JM/TM);  across what
(tasks/time); and which aggregation should be used.

How about dropping it from the API and always using min, max, sum, avg? I
think we're interested in these aggregations for all the metrics, and there
is no penalty for reporting all of them because it's only for
initialization.

Regards,
Roman

On Mon, Nov 20, 2023, 8:42 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi Roman!
>
> > 1. why the existing MetricGroup interface can't be used? It already had
> > methods to add metrics and spans ...
>
> That's because of the need to:
> a) associate the spans to specifically Job's initialisation
> b) we need to logically aggregate the span's attributes across subtasks.
>
> `MetricGroup` doesn't have such capabilities and it's too generic an
> interface to introduce things like that IMO.
>
> Additionally for metrics:
> c) reporting initialization measurements as metrics is a flawed concept as
> described in the FLIP's-384 motivation
> Additionally for spans:
> d) as discussed in the FLIP's-384 thread, we don't want to report separate
> spans on the TMs. At least not right now
>
> Also having a specialized, dedicated for initialization metrics class to
> collect those numbers, makes the interfaces
> more lean and more specialized.
>
> > 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> > the backend report them as spans?
>
> As discussed in the FLIP's-384, initially we don't want to report spans on
> TMs. Later, optionally reporting
> individual subtask's checkpoint/recovery spans on the JM looks like a
> logical follow up.
>
> > 3. How is the implementation supposed to infer that some metric is a part
> > of initialization (and make the corresponding RPC to JM?). Should the
> > interfaces be more explicit about that?
>
> This FLIP proposes [1] to add
> `CustomInitializationMetrics
> KeyedStateBackendParameters#getCustomInitializationMetrics()`
> accessor to the `KeyedStateBackendParameters` argument that's passed to
> `createKeyedStateBackend(...)`
> method. That's pretty explicit I would say :)
>
> > 4. What do you think about using histogram or percentiles instead of
> > min/max/sum/avg? That would be more informative
>
> I would prefer to start with the simplest min/max/sum/avg, and let's see in
> which direction (if any) we need to evolve
> that. Alternative to percentiles is previously mentioned to report
> separately each subtask's initialisation/checkpointing span.
>
> Best,
> Piotrek
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans#FLIP386:SupportaddingcustommetricsinRecoverySpans-PublicInterfaces
>
> czw., 16 lis 2023 o 15:45 Roman Khachatryan <ro...@apache.org> napisał(a):
>
> > Thanks for the proposal,
> >
> > Can you please explain:
> > 1. why the existing MetricGroup interface can't be used? It already had
> > methods to add metrics and spans ...
> >
> > 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> > the backend report them as spans?
> >
> > 3. How is the implementation supposed to infer that some metric is a part
> > of initialization (and make the corresponding RPC to JM?). Should the
> > interfaces be more explicit about that?
> >
> > 4. What do you think about using histogram or percentiles instead of
> > min/max/sum/avg? That would be more informative
> >
> > I like the idea of introducing parameter objects for backend creation.
> >
> > Regards,
> > Roman
> >
> > On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
> >
> > > (Fixing topic)
> > >
> > > wt., 7 lis 2023 o 09:40 Piotr Nowojski <pnowoj...@apache.org>
> > napisał(a):
> > >
> > > > Hi all!
> > > >
> > > > I would like to start a discussion on a follow up of FLIP-384:
> > Introduce
> > > > TraceReporter and use it to create checkpointing and recovery traces
> > [1]:
> > > >
> > > > *FLIP-386: Support adding custom metrics in Recovery Spans [2]*
> > > >
> > > > This FLIP adds a functionality that will allow state backends to
> attach
> > > > custom metrics to the recovery/initialization traces. This requires
> > > changes
> > > > to the `@PublicEvolving` `StateBackend` API, and it will be initially
> > > used
> > > > in `RocksDBIncrementalRestoreOperation` to measure how long does it
> > take
> > > to
> > > > download remote files and separately how long does it take to load
> > those
> > > > files into the local RocksDB instance.
> > > >
> > > > Please let me know what you think!
> > > >
> > > > Best,
> > > > Piotr Nowojski
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > > [2]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-386: Support adding custom metrics in Recovery Spans

Reply via email to