Hi Roman!

> 1. why the existing MetricGroup interface can't be used? It already had
> methods to add metrics and spans ...

That's because we need to:
a) associate the spans specifically with the job's initialization
b) logically aggregate the span's attributes across subtasks

`MetricGroup` doesn't have such capabilities, and IMO it's too generic an interface to introduce things like that.

Additionally, for metrics:
c) reporting initialization measurements as metrics is a flawed concept, as described in the FLIP-384 motivation

Additionally, for spans:
d) as discussed in the FLIP-384 thread, we don't want to report separate spans on the TMs, at least not right now

Also, having a specialized class dedicated to collecting initialization metrics makes the interfaces leaner and more focused.

> 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> the backend report them as spans?

As discussed in the FLIP-384 thread, initially we don't want to report spans on the TMs. Optionally reporting individual subtasks' checkpoint/recovery spans on the JM later looks like a logical follow-up.

> 3. How is the implementation supposed to infer that some metric is a part
> of initialization (and make the corresponding RPC to JM?). Should the
> interfaces be more explicit about that?

This FLIP proposes [1] adding a `CustomInitializationMetrics KeyedStateBackendParameters#getCustomInitializationMetrics()` accessor to the `KeyedStateBackendParameters` argument that is passed to the `createKeyedStateBackend(...)` method. That's pretty explicit, I would say :)

> 4. What do you think about using histogram or percentiles instead of
> min/max/sum/avg? That would be more informative

I would prefer to start with the simplest option, min/max/sum/avg, and then see in which direction (if any) we need to evolve it. An alternative to percentiles is the previously mentioned option of reporting each subtask's initialization/checkpointing span separately.
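
To make the aggregation from b) and the min/max/sum/avg reporting a bit more concrete, here is a rough sketch in Java. It is only an illustration under assumptions: the nested `CustomInitializationMetrics` interface and its `addMetric(String, long)` method are simplified stand-ins and not necessarily the signatures from the FLIP, while `SubtaskMetrics`, `aggregate(...)` and the metric names used below are hypothetical and made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

final class InitializationMetricsSketch {

    /** Hypothetical stand-in for the collector handed to a state backend (assumption). */
    interface CustomInitializationMetrics {
        void addMetric(String name, long value);
    }

    /** Per-subtask collector; a repeated name overwrites the earlier value. */
    static final class SubtaskMetrics implements CustomInitializationMetrics {
        private final Map<String, Long> values = new HashMap<>();

        @Override
        public void addMetric(String name, long value) {
            values.put(name, value);
        }

        Map<String, Long> values() {
            return values;
        }
    }

    /** Folds the values of all subtasks into min/max/sum/avg per metric name. */
    static Map<String, LongSummaryStatistics> aggregate(List<SubtaskMetrics> subtasks) {
        Map<String, LongSummaryStatistics> result = new TreeMap<>();
        for (SubtaskMetrics subtask : subtasks) {
            for (Map.Entry<String, Long> entry : subtask.values().entrySet()) {
                long value = entry.getValue();
                result.computeIfAbsent(entry.getKey(), k -> new LongSummaryStatistics())
                        .accept(value);
            }
        }
        return result;
    }
}
```

For example, if three subtasks report a (hypothetical) `DownloadStateDurationMs` of 100, 250 and 400, the aggregated attribute values would be min=100, max=400, sum=750 and avg=250. Percentiles could not be derived from such a summary, which is why reporting each subtask's span separately is the natural alternative.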

Best,
Piotrek

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans#FLIP386:SupportaddingcustommetricsinRecoverySpans-PublicInterfaces

On Thu, 16 Nov 2023 at 15:45, Roman Khachatryan <ro...@apache.org> wrote:

> Thanks for the proposal,
>
> Can you please explain:
> 1. why the existing MetricGroup interface can't be used? It already had
> methods to add metrics and spans ...
>
> 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> the backend report them as spans?
>
> 3. How is the implementation supposed to infer that some metric is a part
> of initialization (and make the corresponding RPC to JM?). Should the
> interfaces be more explicit about that?
>
> 4. What do you think about using histogram or percentiles instead of
> min/max/sum/avg? That would be more informative
>
> I like the idea of introducing parameter objects for backend creation.
>
> Regards,
> Roman
>
> On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> > (Fixing topic)
> >
> > On Tue, 7 Nov 2023 at 09:40, Piotr Nowojski <pnowoj...@apache.org> wrote:
> >
> > > Hi all!
> > >
> > > I would like to start a discussion on a follow up of FLIP-384: Introduce
> > > TraceReporter and use it to create checkpointing and recovery traces [1]:
> > >
> > > *FLIP-386: Support adding custom metrics in Recovery Spans [2]*
> > >
> > > This FLIP adds a functionality that will allow state backends to attach
> > > custom metrics to the recovery/initialization traces. This requires changes
> > > to the `@PublicEvolving` `StateBackend` API, and it will be initially used
> > > in `RocksDBIncrementalRestoreOperation` to measure how long does it take to
> > > download remote files and separately how long does it take to load those
> > > files into the local RocksDB instance.
> > >
> > > Please let me know what you think!
> > >
> > > Best,
> > > Piotr Nowojski
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > [2]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans