The proposed ScopeFormat is still problematic for a few reasons.

Extending the set of ScopeFormats is problematic because it in practice it breaks the config if users actively rely on it, since there's now another key that they _must_ set for it to be consistent/compatible with their existing setup. Unfortunately due to how powerful scope formats are we can't derive a default value that matches their existing setup.
Hence we should try to do this as rarely as possible.

This FLIP does not adhere to that since it proposes a dedicated format for coordinators; next time we want to expose operator-specific metrics (e.g., in the scheduler) we'd have to add another one to support it.

Additionally, the configurable variables (operator name/id) are logically not attached to the coordinator, but operators, so to me it just doesn't make sense to structure it like this.

Another thing I'm concerned about is that, because we don't include tasks in the hierarchy, users wishing to collect all metrics for a particular task (in this case ==vertex) now have to go significantly out of their way to get them, since they can no longer just filter by the task ID but have to be filter for _all_ operators that are part of the task.

On 16/01/2023 03:09, Hang Ruan wrote:
Hi, @ches...@apache.org <ches...@apache.org> ,

Do you have time to help to review this FLIP again after the modification?
Looking forward to your reply.
This FLIP will add a new configuration for the
OperatorCoordinatorMetricGroup scope format. It provides an internal
implementation and is added as a component to the JobManagerJobMetricGroup.
If something doesn't make sense, could you provide some advice? It will be
very helpful. Thanks a lot for your help.

Best,
Hang

Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三 16:34写道:

Hi Hang,

I'm a bit surprised that this has gone to a vote, given that Chesnay
deliberately mentioned that he would vote against it as-is. I would expect
that before going to a vote, he has had the opportunity to participate in
this discussion.

Best regards,

Martijn

On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com> wrote:

Hi Dong,

Regarding “SplitEnumeratorContext#metricGroup”, my only concern is that
this is a core interface for the FLIP. It’s hard to tell how sources use
metric group without mentioning this interface. Even if this is an
existing
API, I think it’s worth introducing the interface again and declaring
that
we will implement the interface instead of a no-op method in this FLIP.

Anyway, this is a minor problem and shouldn’t block this FLIP. I’m +1 to
start a vote.

Best,
Jark


2023年1月3日 10:03,Hang Ruan <ruanhang1...@gmail.com> 写道:

Hi, Jark and Dong,

Thanks for your comments. Sorry for my late reply.

For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
another issue, and it is not contained in this FLIP. I will add some
description about this part.
For suggestion 2, changes about OperatorCoordinator#metricGroup has
already
been documented in the proposed change section.

Best,
Hang

Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道:

Let me chime-in and add comments regarding the public interface
section.
Please see my comments inline.

On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <imj...@gmail.com> wrote:

Hi Hang,

Thanks for driving this discussion. I think this is a very useful
feature
for connectors.

The FLIP looks quite good to me, and I just have two suggestions.

1. In the "Public Interface" section, mention that the implementation
behavior of "SplitEnumeratorContext#metricGroup" is changed from
returning
null to returning a concrete SplitEnumeratorMetricGroup instance.
Even
though the API is already there, the behavior change can also be
considered
a public change.

SplitEnumeratorContext#metricGroup is an interface and this FLIP does
not
seem to change its semantics/behavior. The FLIP does change the
implementation/behavior of SourceCoordinatorContext#metricGroup, which
is
marked @Internal.

Thus it might seem a bit weird to add in the public interface section
saying that we change the interface SplitEnumeratorContext#metricGroup
from
returning null to non-null object.

2. Mention the newly added interface of
"OperatorCoordinator#metricGroup"
in the "Proposed Changes" section or "Public Interface" section. As
the
FLIP said, OperatorCoordinator is widely used in many connectors.
Though
it
is still an @Internal API, I think it is worth mentioning the change
in
the
FLIP.

Since OperatorCoordinator is an internal API, it seems reasonable to
explain it in the proposed change section. The FLIP seems to have
documented this in point 5 of the proposed change section.

BTW, if we think there are @internal classes that are important enough
to
be added in the public interface section, it might be useful to
explicitly
discuss this topic and document it in the "*What are the "public
interfaces" of the project*" in this
<

https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
wiki.


Best,
Jark


On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ruanhang1...@gmail.com>
wrote:
Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
After combining everyone's comments, the main concerns and
corresponding
adjustments are as follows.

Q1: Common metrics are not quite useful.
numEventsIn and numEventsOut counters will be removed from the
OperatorCoordinatorMetricGroup. These common metrics do not provide
enough
information for users. The users are more willing to get the number
of
events of the specified type instead of the total number. And this
metric
is calculated differently. The implementation could register the
metric
by
themselves.

Q2: This FLIP is overly complicated.
This FLIP will become concise after these modifications.
OperatorCoordinatorMetricGroup has already been introduced into
Flink
by
FLIP-179<


https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
.
And
this FLIP will not change it. This FLIP only provides a new metric
option
and a new metric group scope. The changes in proposed changes
provide
the
details about the modifications for the internal classes, which
might
make
it look complicated.

Thanks for all the comments again. If there are no further comments,
we
plan to start the voting thread this week.

Best,
Hang

Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一 16:48写道:

Thanks for the FLIP, Hang!

This FLIP overall looks good to me. Actually I share the same
concern
with
Zhu that numEventsIn and numEventsOut counters are not quite useful
to
end
users. OperatorEvent is a quite low-level abstraction, which
requires
instantialization in order to be practical to users and developers,
so
maybe it's better to exclude them from the FLIP.

Best,
Qingsheng
Ververica (Alibaba)

On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <reed...@gmail.com>
wrote:
Hi Hang,
I still see no strong reason why we need numEventsIn/numEventsOut
metrics.
In the discussion in FLINK-29801, I can see the same concern from
others.
So I prefer to exclude them from this FLIP to avoid over-extending
the
scope.

Thanks,
Zhu

Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五 15:21写道:
Hi everyone,

As for the Zhu Zhu's problem, I think we should keep the common
metrics,
which will help to observe incoming and outgoing events. What do
you
think,
@Zhu Zhu ?
And @Chesnay, are there any other issues you are more concerned
about?
Looking forward to your reply.
Thanks for all the comments. If there are no further comments, we
plan
to start the voting thread next week.
Best,
Hang

Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四 16:49写道:
Hi, Zhu Zhu,

Thanks for your feedback!

The OperatorCoordinator implementations are different. And their
metrics are much different too. We try to find the common metrics
and
put
them in the OperatorCoordinatorMetricGroup. If most developers
think
we
do
not need these common metrics, removing the common metrics is also
acceptable.
Best,
Hang

Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三 22:09写道:
Hi Hang & MengYue,

Thanks for creating this FLIP!

I think it is very useful, mainly in two aspects:
1. Enables OperatorCoordinators to register metrics. Currently
the coordinators has no way to do this. And operator
coordinator
metric group further enables the SplitEnumerator to have access
to a registered metric group (via the existing public interface
SplitEnumeratorContext#metricGroup()), which is null at the
moment.
2. Defines the scope of operator coordinator metrics. A clear
definition
makes it easy for users to find their wanted metrics. The
definition
also helps to avoid conflicts of metrics from multiple
OperatorCoordinators
of the same kind. E.g. each SourceCoordinator may have its own
numSourceSplits metric, these metrics should not be directly
registered
to the job metric group.

What I'm a bit concerned is the necessity of the introduced
common
metrics
numEventsInCounter & numEventsOutCounter. If there any case
which
strongly
requires them?

Regarding the concerns of Chesnay,
A dedicated coordinator MG implementation is overkill
Directly using the job metric group can result in metric
conflicts,
as
mentioned
in above #2.

Thanks,
Zhu

Dong Lin <lindon...@gmail.com> 于2022年12月10日周六 14:16写道:

Hi Chesney,

Just to double check with you, OperatorCoordinatorMetricGroup
(annotated as
@PublicEvolving) has already been introduced into Flink by
FLIP-179
<
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
.
And that FLIP has got you +1.. Do you mean we should remove
this
OperatorCoordinatorMetricGroup?

Regards,
Dong

On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
ches...@apache.org>
wrote:
As a whole I feel like this FLIP is overly complicated. A
dedicated
coordinator MG implementation is overkill; it could just
re-use
the
existing Task/OperatorMGs to create the same structure we
have
on
TMs,
similar to what we did with the Job MG.

However, I'm not convinced that this is required anyway,
because
all the
example metrics you listed can be implemented on the TM
side
+
aggregating them in the external metrics backend.

Since I'm on holidays soon, just so no one tries to pull a
fast
one on
me, if this were to go to a vote as-is I'd be against it.


On 09/12/2022 15:30, Dong Lin wrote:
Hi Hang,

Thanks for the FLIP! The FLIP looks good and it is pretty
informative.
I have just two minor comments regarding names:
- Would it be useful to rename the config key as
*metrics.scope.jm.job.operator-coordinator* for
consistency
with
*metrics.scope.jm.job
*(which is not named as *jm-job)?
- Maybe rename the variable as
SCOPE_NAMING_OPERATOR_COORDINATOR
for
simplicity and consistency with SCOPE_NAMING_OPERATOR
(which
is
not named
as SCOPE_NAMING_TM_JOB_OPERATOR)?

Cheers,
Dong



On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
ruanhang1...@gmail.com>
wrote:
Hi all,

MengYue and I created FLIP-274[1] Introduce metric group
for
OperatorCoordinator. OperatorCoordinator is the
coordinator
for
runtime
operators and running on Job Manager. The coordination
mechanism is
operator events between OperatorCoordinator and its all
operators, the
coordination is more and more using in Flink, for
example
many
Sources
and
Sinks depend on the mechanism to assign splits and
coordinate
commits to
external systems. The OperatorCoordinator is widely
using
in
flink kafka
connector, flink pulsar connector, flink cdc connector,
flink
hudi
connector and so on.

But there is not a suitable metric group scope for the
OperatorCoordinator
and not an implementation for the interface
OperatorCoordinatorMetricGroup.
These metrics in OperatorCoordinator could be how many
splits/partitions
have been assigned to source readers, how many files
have
been
written
out
by sink writers, these metrics not only help users to
know
the
job
progress
but also make big job maintaining easier. Thus we
propose
the
FLIP-274
to
introduce a new metric group scope for
OperatorCoordinator
and
provide
an
internal implementation for
OperatorCoordinatorMetricGroup.
Could you help review this FLIP when you get time? Any
feedback
is
appreciated!

Best,
Hang

[1]


https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator



Reply via email to