Hi Roman!

> 1. why the existing MetricGroup interface can't be used? It already had
> methods to add metrics and spans ...

That's because of the need to:
a) associate the spans to specifically Job's initialisation
b) we need to logically aggregate the span's attributes across subtasks.

`MetricGroup` doesn't have such capabilities and it's too generic an
interface to introduce things like that IMO.

Additionally for metrics:
c) reporting initialization measurements as metrics is a flawed concept as
described in the FLIP's-384 motivation
Additionally for spans:
d) as discussed in the FLIP's-384 thread, we don't want to report separate
spans on the TMs. At least not right now

Also having a specialized, dedicated for initialization metrics class to
collect those numbers, makes the interfaces
more lean and more specialized.

> 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> the backend report them as spans?

As discussed in the FLIP's-384, initially we don't want to report spans on
TMs. Later, optionally reporting
individual subtask's checkpoint/recovery spans on the JM looks like a
logical follow up.

> 3. How is the implementation supposed to infer that some metric is a part
> of initialization (and make the corresponding RPC to JM?). Should the
> interfaces be more explicit about that?

This FLIP proposes [1] to add
`CustomInitializationMetrics
KeyedStateBackendParameters#getCustomInitializationMetrics()`
accessor to the `KeyedStateBackendParameters` argument that's passed to
`createKeyedStateBackend(...)`
method. That's pretty explicit I would say :)

> 4. What do you think about using histogram or percentiles instead of
> min/max/sum/avg? That would be more informative

I would prefer to start with the simplest min/max/sum/avg, and let's see in
which direction (if any) we need to evolve
that. Alternative to percentiles is previously mentioned to report
separately each subtask's initialisation/checkpointing span.

Best,
Piotrek

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans#FLIP386:SupportaddingcustommetricsinRecoverySpans-PublicInterfaces

czw., 16 lis 2023 o 15:45 Roman Khachatryan <ro...@apache.org> napisał(a):

> Thanks for the proposal,
>
> Can you please explain:
> 1. why the existing MetricGroup interface can't be used? It already had
> methods to add metrics and spans ...
>
> 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> the backend report them as spans?
>
> 3. How is the implementation supposed to infer that some metric is a part
> of initialization (and make the corresponding RPC to JM?). Should the
> interfaces be more explicit about that?
>
> 4. What do you think about using histogram or percentiles instead of
> min/max/sum/avg? That would be more informative
>
> I like the idea of introducing parameter objects for backend creation.
>
> Regards,
> Roman
>
> On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> > (Fixing topic)
> >
> > wt., 7 lis 2023 o 09:40 Piotr Nowojski <pnowoj...@apache.org>
> napisał(a):
> >
> > > Hi all!
> > >
> > > I would like to start a discussion on a follow up of FLIP-384:
> Introduce
> > > TraceReporter and use it to create checkpointing and recovery traces
> [1]:
> > >
> > > *FLIP-386: Support adding custom metrics in Recovery Spans [2]*
> > >
> > > This FLIP adds a functionality that will allow state backends to attach
> > > custom metrics to the recovery/initialization traces. This requires
> > changes
> > > to the `@PublicEvolving` `StateBackend` API, and it will be initially
> > used
> > > in `RocksDBIncrementalRestoreOperation` to measure how long does it
> take
> > to
> > > download remote files and separately how long does it take to load
> those
> > > files into the local RocksDB instance.
> > >
> > > Please let me know what you think!
> > >
> > > Best,
> > > Piotr Nowojski
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > >
> > >
> >
>

Reply via email to