Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Hang Ruan Thu, 02 Feb 2023 05:02:19 -0800

Hi everyone,

All issues about FLIP-274 have been resolved and consensus has been
reached. If there are no further problems, we plan to start the voting
thread.


Thanks a lot for participating in the vote discussion. Special thanks
to @Chesnay
Schepler <[email protected]> and @Jark Wu <[email protected]>, their
suggestions and discussion make this FLIP clearer and more reasonable.

Best,
Hang

Chesnay Schepler <[email protected]> 于2023年2月2日周四 18:44写道：

> Looks good from my side.
>
> On 01/02/2023 13:27, Hang Ruan wrote:
> > Hi all,
> >
> > I make a mistake about the default value of the  scope format
> > JobManagerOperatorScopeFormat. It should be
> > <host>.jobmanager.<job_name>.<operator_name>.
> > I have fixed it in the FLIP.
> > And the variables (task id and task name) have been added to the list.
> >
> > Best,
> > Hang
> >
> >
> > Hang Ruan <[email protected]> 于2023年2月1日周三 14:11写道：
> >
> >> Hi all,
> >>
> >>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> >> implementations (it never worked). There were no coordinator metrics
> >> implemented and registered. @Hang Ruan please correct me if I'm wrong.
> >> Yes,  we shouldn't have to worry about backwards-compatibility as
> Chesnay
> >> said.
> >>
> >> After the discussion, I think the FLIP should be modified as follow.
> >> 1.  Add new metric group JobManagerOperatorMetricGroup, whose scope
> >> format is JobManagerOperatorScopeFormat. The config key of
> JobManagerOperatorScopeFormat
> >> is metrics.scope.jm-operator, and the default value is
> >> <host>.jobmanager.<job_name>.<task_name>.<operator_name>.
> >> 2. JobManagerOperatorMetricGroup will be sub components of
> >> JobManagerJobMetricGroup.
> >> 3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
> >> which registered under JobManagerOperatorMetricGroup by
> >> addGroup("coordinator").
> >>
> >> I have updated the FLIP-274. @Chesnay Schepler <[email protected]>
> @Jark
> >> Wu <[email protected]> , please help to take a look at the modified FLIP
> >> and make sure that I do not miss  something.
> >>
> >> If there are no more problems, we plan to start the voting thread later.
> >>
> >> Best,
> >> Hang
> >>
> >> Chesnay Schepler <[email protected]> 于2023年1月31日周二 20:09写道：
> >>
> >>>> IIUC, there is only an OperatorCoordinatorMetricGroup interface
> without
> >>> implementations (it never worked). There were no coordinator metrics
> >>> implemented and registered. @Hang Ruan <[email protected]> please
> >>> correct me if I'm wrong.
> >>>
> >>> Indeed. There are interfaces, and theoretically a metric group is
> exposed
> >>> to user-code via the SplitEnumeratorContext, but the metric group is
> always
> >>> null in the SourceCoordinatorContext. So we shouldn't have to worry
> about
> >>> backwards-compatibility.
> >>>
> >>>> However, the config key "metrics.scope.jm.job" is deprecated and is
> >>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job"
> is
> >>> replaced with "metrics.scope.tm-job". That's why I though
> >>> "metrics.scope.jm-operator" might be more consistent.
> >>>
> >>> HA! Good catch. I only looked at the 1.16 docs where they weren't
> >>> deprecated yet.
> >>>
> >>> Then it should be "jm-operator" indeed.
> >>>
> >>> On 31/01/2023 10:51, Jark Wu wrote:
> >>>
> >>> Hi Chesnay,
> >>>
> >>>> The new option must default to the configured value of
> >>> metrics.scope.jm.job because otherwise it isn't backwards-compatible
> with
> >>> existing coordinator metrics.
> >>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> >>> implementations (it never worked). There were no coordinator metrics
> >>> implemented and registered. @Hang Ruan <[email protected]> please
> >>> correct me if I'm wrong.
> >>>
> >>>> Intuitively I think the config key should be
> >>> metrics.scope.jm.operator., similar to metrics.scope.jm.job.
> >>> However, the config key "metrics.scope.jm.job" is deprecated and is
> >>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job"
> is
> >>> replaced with "metrics.scope.tm-job". That's why I though
> >>> "metrics.scope.jm-operator" might be more consistent.
> >>>
> >>>> We would only support operator_name, operator_id, task_name and
> task_id
> >>> (+ everything else from the jm/job groups of course).
> >>> +1
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>
> >>>
> >>> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <[email protected]>
> >>> wrote:
> >>>
> >>>> The new option must default to the configured value of
> >>>> metrics.scope.jm.job because otherwise it isn't backwards-compatible
> with
> >>>> existing coordinator metrics.
> >>>> According to some earlier response in this thread there are already
> >>>> coordinator metrics.
> >>>>
> >>>> If that weren't the case you'd be correct, and we'd keep the default
> >>>> consistent with the default for tm operator metrics.
> >>>>
> >>>> Intuitively I think the config key should be
> metrics.scope.jm.operator.,
> >>>> similar to metrics.scope.jm.job.
> >>>> For symmetry we should also add new metrics.scope.tm.operator/task and
> >>>> eventually phase out metrics.scope.operator.
> >>>>
> >>>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
> >>>> I mean the coordinator doesn't run on specific tasks, so it doesn't
> have
> >>>> runtime information of tasks.
> >>>>
> >>>> That is true. I intentionally only referred to task id/name, as things
> >>>> like subtask index or attempt ids don't make sense for coordinators
> as you
> >>>> said.
> >>>> We would only support operator_name, operator_id, task_name and
> task_id
> >>>> (+ everything else from the jm/job groups of course).
> >>>> (Maybe in the future attempt_num could be interesting though)
> >>>>
> >>>> On 30/01/2023 10:39, Jark Wu wrote:
> >>>>
> >>>> Hi Chesnay,
> >>>>
> >>>> IIUC, our only disagreement is whether to support task info in the
> >>>> metric format. But we all agree with the default metric format
> >>>>
> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
> >>>> which doesn't include task info, right?
> >>>>
> >>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
> I
> >>>> mean the coordinator doesn't run on specific tasks, so it doesn't have
> >>>> runtime information of tasks. My concern about adding task info to the
> >>>> coordinator metric is that
> >>>> 1) if we register coordinator metrics for every subtask, the number of
> >>>> coordinator metrics will be exploded when task parallelism is large.
> This
> >>>> is an overhead to JM & metric systems, and the duplicated metrics are
> >>>> useless for users.
> >>>> 2) Correct me if I'm wrong, MetricGroup#counter only supports
> >>>> registering one metric. It seems it can't register a metric reporting
> for
> >>>> multiple metric names (e.g. multiple subtasks).
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I hadn't looked in detail in to how the task info can be introduced,
> >>>>> but
> >>>>> given that the coordinates are created by the execution graph, where
> we
> >>>>> only work on tasks, it should be possible and rather simple.
> >>>>>
> >>>>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex
> >>>>> from
> >>>>> which it can extract everything and create the task/operator groups.
> >>>>> Exposure to the coordinator could happen via the coordinator context
> as
> >>>>> originally planned.
> >>>>>
> >>>>>   > As we know, coordinator metrics are just like JMJobMetricGroup,
> >>>>> which
> >>>>> don't belong to any tasks/vertexes.
> >>>>>
> >>>>> I don't think this is true, every coordinator is scoped to a specific
> >>>>> operator, and every operator is associated with a particular vertex.
> >>>>>
> >>>>>   From the javadocs:
> >>>>>
> >>>>> " A coordinator for runtime operators. The OperatorCoordinator runs
> on
> >>>>> the master, associated with
> >>>>>    the job vertex of the operator. It communicates with operators via
> >>>>> sending operator events."
> >>>>>
> >>>>> On 28/01/2023 09:15, Jark Wu wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> IIUC, Chesnay means we should have a more general metric group for
> >>>>>> **operator**
> >>>>>> not for **coordinator** in JM, this would be useful to extend
> >>>>>> other operator-specific
> >>>>>> metrics in the future. That means the new scope format should be
> >>>>> designed
> >>>>>> for
> >>>>>> the operator,
> >>>>>> e.g.,
> >>>>>
> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> >>>>>> The coordinator metric is a subgroup (a constant "coordinator"
> >>>>> suffix) of
> >>>>>> the JMOperatorMG.
> >>>>>>
> >>>>>> I think this is a nice design. However, I have a question about
> >>>>> adding task
> >>>>>> id/name to this list.
> >>>>>> How to get the task id/name when reporting a coordinator metric? As
> we
> >>>>>> know, coordinator metrics
> >>>>>> are just like JMJobMetricGroup, which don't belong to any
> >>>>> tasks/vertexes.
> >>>>>> Do you mean reporting
> >>>>>> coordinator metrics for every task id under the operator?
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <[email protected]>
> >>>>> wrote:
> >>>>>>>    > First, I do not understand why users have to configure the new
> >>>>> scope
> >>>>>>> format, which has a default value.
> >>>>>>>
> >>>>>>> If you don't use scope formats, sure. If you do use scope formats,
> >>>>> e.g.
> >>>>>>> to add a common prefix (which is the case for datadog users for
> >>>>>>> example), then the current default in the FLIP is insufficient and
> >>>>>>> requires the user to update the configuration.
> >>>>>>>
> >>>>>>>    > I usually do not need to change these configuration of scope
> >>>>> formats
> >>>>>>> when submitting the flink job.
> >>>>>>>
> >>>>>>> Scope formats as a whole are quite a power-user feature, but that
> >>>>>>> doesn't mean we should ignore it.
> >>>>>>>
> >>>>>>>    > I try to let it extend the ComponentMetricGroup
> >>>>>>>
> >>>>>>> This isn't inherently required just because it is a "component".
> >>>>>>> Component metric groups should _only_ be used for cases where such
> a
> >>>>>>> component provides several bits of metadata.
> >>>>>>> For example, tasks provide vertex ids, task names, attempted IDs
> >>>>> etc.,
> >>>>>>> and we'd like users to have the option on how this metadata ends up
> >>>>>>> being used (via scope formats). This currently can't be built with
> >>>>> the
> >>>>>>> addGroup() methods, hence why the component groups exist.
> >>>>>>> However, the operator coordinator _itself_does not provide several
> >>>>> bits
> >>>>>>> of metadata. Logically, the _operator_ does, not the coordinator.
> >>>>>>>
> >>>>>>>    > This make me decide to add a new scope format.
> >>>>>>>
> >>>>>>> And this is fine; just don't add a new scope format for the
> >>>>> coordinator
> >>>>>>> but logically for operators on the JM side, such that we can extend
> >>>>> the
> >>>>>>> set of operator-specific metrics in the future without having to
> add
> >>>>> yet
> >>>>>>> another scope format.
> >>>>>>>
> >>>>>>>    > The relationship between the Operator and OperatorCoordinator
> is
> >>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the
> operator
> >>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will
> be
> >>>>>>> created here.
> >>>>>>>
> >>>>>>> That's all irrelevant.
> >>>>>>>
> >>>>>>>    > The registered metrics of OperatorCoordinator always are those
> >>>>> which
> >>>>>>> can not be aggregated from the tasks, like the number of unassigned
> >>>>>>> splits in the SourceCoordinator. Actually we have not encountered
> the
> >>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >>>>>>> should only contain the metrics for itself instead of its subtasks.
> >>>>>>>
> >>>>>>> You are misunderstanding the problem.
> >>>>>>>
> >>>>>>> This isn't about aggregating metrics on our side or exposing
> metrics
> >>>>>>> from TMs on the JM side or anything like that.
> >>>>>>>
> >>>>>>> It's purely about the available metadata for operator metrics on
> the
> >>>>> JM
> >>>>>>> side. Currently you suggest operator name / id; what I'm proposing
> >>>>> is to
> >>>>>>> add task(==vertex!) id / name to this list.
> >>>>>>>
> >>>>>>> The benefit here is simple. I can look up all metrics where
> >>>>> task_id==XXX
> >>>>>>> and get _everything_ related to that vertex, including all metrics
> >>>>>>> associated with operators that are part of that vertex, including
> the
> >>>>>>> coordinator.
> >>>>>>>
> >>>>>>>    > Using metrics.scope.jm.job is not enough to distinguish the
> >>>>> different
> >>>>>>> coordinators.
> >>>>>>>
> >>>>>>> I did not suggest just using metrics.scope.jm.job. I suggested that
> >>>>> as
> >>>>>>> the default for the jm operator scope format, that is used for the
> >>>>>>> proposed JobManagerOperatorMetricGroup, which will have some plain
> >>>>>>> metric groups as children for coordinators.
> >>>>>>> Aka, you have the JMOperatorMG, then you call
> >>>>> addGroup("coordinators")
> >>>>>>> and then addGroup(coordinator_name) for each coordinator.
> >>>>>>>
> >>>>>>>    > So there are two choice for the
> >>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> >>>>> scope
> >>>>>>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which
> >>>>> one
> >>>>>>> is better?
> >>>>>>>
> >>>>>>> There are more options than that.
> >>>>>>> I feel like there are a lot of misunderstandings in this
> discussion.
> >>>>>>> Please let me know if I could clear things up. If not I can also
> >>>>> provide
> >>>>>>> a PoC based on your PoC if that can speed things up. It's not that
> >>>>>>> different anyway.
> >>>>>>>
> >>>>>>> On 19/01/2023 07:39, Hang Ruan wrote:
> >>>>>>>> Hi, chesnay,
> >>>>>>>>
> >>>>>>>> Thanks for your reply. I still have some doubts about the
> questions
> >>>>>>>> you raised.
> >>>>>>>>
> >>>>>>>>       > Extending the set of ScopeFormats is problematic because
> it
> >>>>> in
> >>>>>>>>       practice
> >>>>>>>>       it breaks the config if users actively rely on it, since
> >>>>> there's now
> >>>>>>>>       another key that they _must_ set for it to be
> >>>>>>>>       consistent/compatible with
> >>>>>>>>       their existing setup.
> >>>>>>>>       Unfortunately due to how powerful scope formats are we can't
> >>>>> derive a
> >>>>>>>>       default value that matches their existing setup.
> >>>>>>>>       Hence we should try to do this as rarely as possible.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>       > This FLIP does not adhere to that since it proposes a
> >>>>> dedicated
> >>>>>>>>       format
> >>>>>>>>       for coordinators; next time we want to expose
> operator-specific
> >>>>>>>>       metrics
> >>>>>>>>       (e.g., in the scheduler) we'd have to add another one to
> >>>>> support it.
> >>>>>>>>
> >>>>>>>> First, I do not understand why users have to configure the new
> scope
> >>>>>>>> format, which has a default value. I usually do not need to change
> >>>>>>>> these configuration of scope formats when submitting the flink
> job.
> >>>>>>>> IMO, the OperatorCoordinator is a component running at the
> >>>>> JobMaster.
> >>>>>>>> I try to let it extend the ComponentMetricGroup, but I cannot find
> >>>>>>>> a suitable scope. This make me decide to add a new scope format.
> >>>>>>>>
> >>>>>>>>       > Additionally, the configurable variables (operator
> name/id)
> >>>>> are
> >>>>>>>>       logically not attached to the coordinator, but operators, so
> >>>>> to me it
> >>>>>>>>       just doesn't make sense to structure it like this.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The relationship between the Operator and OperatorCoordinator is
> >>>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the
> >>>>> operator
> >>>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will
> >>>>> be
> >>>>>>>> created here.
> >>>>>>>>
> >>>>>>>>       > Another thing I'm concerned about is that, because we
> don't
> >>>>> include
> >>>>>>>>       tasks in the hierarchy, users wishing to collect all metrics
> >>>>> for a
> >>>>>>>>       particular task (in this case ==vertex) now have to go
> >>>>>>>>       significantly out
> >>>>>>>>       of their way to get them, since they can no longer just
> filter
> >>>>> by the
> >>>>>>>>       task ID but have to be filter for _all_ operators that are
> >>>>> part of
> >>>>>>>>       the task.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The registered metrics of OperatorCoordinator always are those
> which
> >>>>>>>> can not be aggregated from the tasks, like the number of
> unassigned
> >>>>>>>> splits in the SourceCoordinator. Actually we have not encountered
> >>>>> the
> >>>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >>>>>>>> should only contain the metrics for itself instead of its
> subtasks.
> >>>>>>>>
> >>>>>>>> Using metrics.scope.jm.job is not enough to distinguish the
> >>>>> different
> >>>>>>>> coordinators. The operator id/name is necessary. Then
> >>>>>>>> the implementation should be like this. But users can not change
> the
> >>>>>>>> scope in this way. This is also acceptable.
> >>>>>>>> @Internal public class InternalOperatorCoordinatorMetricGroup
> >>>>>>>>            extends ProxyMetricGroup<MetricGroup>
> >>>>>>>>            implements OperatorCoordinatorMetricGroup {
> >>>>>>>>        public InternalOperatorCoordinatorMetricGroup(
> >>>>>>>>                JobManagerJobMetricGroup parent, OperatorID
> >>>>> operatorID,
> >>>>>>> String operatorName) {
> >>>>>>>>            super(parent.addGroup(operatorID +
> >>>>>>> operatorName).addGroup("coordinator")); }
> >>>>>>>> }
> >>>>>>>> So there are two choice for the
> >>>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> >>>>>>>> scope format; 2. use the metrics.scope.jm.job and
> ProxyMetricGroup.
> >>>>>>>> Which one is better?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Hang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Chesnay Schepler <[email protected]> 于2023年1月18日周三 17:03写道：
> >>>>>>>>
> >>>>>>>>       You're misunderstanding the problem.
> >>>>>>>>
> >>>>>>>>       Metric groups form a tree with each group providing certain
> >>>>> metadata.
> >>>>>>>>       E.g., on the taskmanager we have a TM metric group that
> >>>>> provides info
> >>>>>>>>       about the TM, that has child task metric groups, that have
> >>>>> child
> >>>>>>>>       operator metric groups etc. The operator metric group again
> has
> >>>>>>>>       children, where we are now mostly in the user/connector
> land,
> >>>>> like a
> >>>>>>>>       kafka metric group providing partition info or something.
> >>>>>>>>
> >>>>>>>>       What is being proposed is to create a coordinator metric
> group
> >>>>> on
> >>>>>>>>       the JM
> >>>>>>>>       side that provides operator metadata.
> >>>>>>>>
> >>>>>>>>       A more appropriate structure is to create an operator group
> on
> >>>>> the JM
> >>>>>>>>       side that provides this info, with the coordinator metric
> group
> >>>>>>>>       being a
> >>>>>>>>       child.
> >>>>>>>>
> >>>>>>>>       On 17/01/2023 19:56, Steven Wu wrote:
> >>>>>>>>       >> Additionally, the configurable variables (operator
> name/id)
> >>>>> are
> >>>>>>>>       > logically not attached to the coordinator, but operators,
> so
> >>>>> to
> >>>>>>>>       me it
> >>>>>>>>       > just doesn't make sense to structure it like this.
> >>>>>>>>       >
> >>>>>>>>       > Chesnay, maybe we should clarify the terminology. To me,
> >>>>>>>>       pperators (like
> >>>>>>>>       > FLIP-27 source) can have two parts (coordinator and
> >>>>>>>>       reader/subtask). I
> >>>>>>>>       > think it is fine to include operator name/id for
> coordinator
> >>>>>>>>       metrics.
> >>>>>>>>       >
> >>>>>>>>       > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
> >>>>>>>>       <[email protected]> wrote:
> >>>>>>>>       >
> >>>>>>>>       >> Slight correction: Using metrics.scope.jm.job as the
> default
> >>>>>>>>       should be
> >>>>>>>>       >> safe.
> >>>>>>>>       >>
> >>>>>>>>       >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >>>>>>>>       >>> The proposed ScopeFormat is still problematic for a few
> >>>>> reasons.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Extending the set of ScopeFormats is problematic because
> >>>>> it in
> >>>>>>>>       >>> practice it breaks the config if users actively rely on
> >>>>> it, since
> >>>>>>>>       >>> there's now another key that they _must_ set for it to
> be
> >>>>>>>>       >>> consistent/compatible with their existing setup.
> >>>>>>>>       >>> Unfortunately due to how powerful scope formats are we
> >>>>> can't
> >>>>>>>>       derive a
> >>>>>>>>       >>> default value that matches their existing setup.
> >>>>>>>>       >>> Hence we should try to do this as rarely as possible.
> >>>>>>>>       >>>
> >>>>>>>>       >>> This FLIP does not adhere to that since it proposes a
> >>>>>>>>       dedicated format
> >>>>>>>>       >>> for coordinators; next time we want to expose
> >>>>> operator-specific
> >>>>>>>>       >>> metrics (e.g., in the scheduler) we'd have to add
> another
> >>>>> one to
> >>>>>>>>       >>> support it.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Additionally, the configurable variables (operator
> >>>>> name/id) are
> >>>>>>>>       >>> logically not attached to the coordinator, but
> operators,
> >>>>> so
> >>>>>>>>       to me it
> >>>>>>>>       >>> just doesn't make sense to structure it like this.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Another thing I'm concerned about is that, because we
> don't
> >>>>>>>>       include
> >>>>>>>>       >>> tasks in the hierarchy, users wishing to collect all
> >>>>> metrics for
> >>>>>>> a
> >>>>>>>>       >>> particular task (in this case ==vertex) now have to go
> >>>>>>>>       significantly
> >>>>>>>>       >>> out of their way to get them, since they can no longer
> just
> >>>>>>>>       filter by
> >>>>>>>>       >>> the task ID but have to be filter for _all_ operators
> that
> >>>>> are
> >>>>>>>>       part of
> >>>>>>>>       >>> the task.
> >>>>>>>>       >>>
> >>>>>>>>       >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >>>>>>>>       >>>> Hi, @[email protected] <[email protected]> ,
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Do you have time to help to review this FLIP again
> after
> >>>>> the
> >>>>>>>>       >>>> modification?
> >>>>>>>>       >>>> Looking forward to your reply.
> >>>>>>>>       >>>> This FLIP will add a new configuration for the
> >>>>>>>>       >>>> OperatorCoordinatorMetricGroup scope format. It
> provides
> >>>>> an
> >>>>>>>>       internal
> >>>>>>>>       >>>> implementation and is added as a component to the
> >>>>>>>>       >>>> JobManagerJobMetricGroup.
> >>>>>>>>       >>>> If something doesn't make sense, could you provide some
> >>>>>>>>       advice? It
> >>>>>>>>       >>>> will be
> >>>>>>>>       >>>> very helpful. Thanks a lot for your help.
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Best,
> >>>>>>>>       >>>> Hang
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Martijn Visser <[email protected]>
> 于2023年1月11日周三
> >>>>>>>>       16:34写道：
> >>>>>>>>       >>>>
> >>>>>>>>       >>>>
> >>>>>>>>       >>>>> Hi Hang,
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> I'm a bit surprised that this has gone to a vote,
> given
> >>>>> that
> >>>>>>>>       Chesnay
> >>>>>>>>       >>>>> deliberately mentioned that he would vote against it
> >>>>> as-is.
> >>>>>>>>       I would
> >>>>>>>>       >>>>> expect
> >>>>>>>>       >>>>> that before going to a vote, he has had the
> opportunity
> >>>>> to
> >>>>>>>>       >>>>> participate in
> >>>>>>>>       >>>>> this discussion.
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> Best regards,
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> Martijn
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <
> >>>>> [email protected]>
> >>>>>>>>       wrote:
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>>> Hi Dong,
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Regarding “SplitEnumeratorContext#metricGroup���, my
> only
> >>>>>>>>       concern is
> >>>>>>>>       >>>>>> that
> >>>>>>>>       >>>>>> this is a core interface for the FLIP. It’s hard to
> >>>>> tell how
> >>>>>>>>       >>>>>> sources use
> >>>>>>>>       >>>>>> metric group without mentioning this interface. Even
> if
> >>>>>>>>       this is an
> >>>>>>>>       >>>>> existing
> >>>>>>>>       >>>>>> API, I think it’s worth introducing the interface
> again
> >>>>> and
> >>>>>>>>       declaring
> >>>>>>>>       >>>>> that
> >>>>>>>>       >>>>>> we will implement the interface instead of a no-op
> >>>>> method
> >>>>>>>>       in this
> >>>>>>>>       >>>>>> FLIP.
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Anyway, this is a minor problem and shouldn’t block
> this
> >>>>>>>>       FLIP. I’m
> >>>>>>>>       >>>>>> +1 to
> >>>>>>>>       >>>>>> start a vote.
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Best,
> >>>>>>>>       >>>>>> Jark
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>>> 2023年1月3日 10:03，Hang Ruan <[email protected]>
> 写道：
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Hi, Jark and Dong,
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Thanks for your comments. Sorry for my late reply.
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> For suggestion 1, I plan to implement the
> >>>>>>>>       >>>>>>> SplitEnumeratorMetricGroup in
> >>>>>>>>       >>>>>>> another issue, and it is not contained in this
> FLIP. I
> >>>>>>>>       will add some
> >>>>>>>>       >>>>>>> description about this part.
> >>>>>>>>       >>>>>>> For suggestion 2, changes about
> >>>>>>>>       OperatorCoordinator#metricGroup has
> >>>>>>>>       >>>>>> already
> >>>>>>>>       >>>>>>> been documented in the proposed change section.
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Best,
> >>>>>>>>       >>>>>>> Hang
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Dong Lin <[email protected]> 于2023年1月1日周日
> 09:45写道：
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>>> Let me chime-in and add comments regarding the
> public
> >>>>>>>>       interface
> >>>>>>>>       >>>>> section.
> >>>>>>>>       >>>>>>>> Please see my comments inline.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
> >>>>>>>>       <[email protected]> wrote:
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> Thanks for driving this discussion. I think this
> is a
> >>>>>>>>       very useful
> >>>>>>>>       >>>>>> feature
> >>>>>>>>       >>>>>>>>> for connectors.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> The FLIP looks quite good to me, and I just have
> two
> >>>>>>>>       suggestions.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> 1. In the "Public Interface" section, mention that
> >>>>> the
> >>>>>>>>       >>>>>>>>> implementation
> >>>>>>>>       >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup"
> is
> >>>>>>>>       changed from
> >>>>>>>>       >>>>>>>> returning
> >>>>>>>>       >>>>>>>>> null to returning a concrete
> >>>>> SplitEnumeratorMetricGroup
> >>>>>>>>       instance.
> >>>>>>>>       >>>>> Even
> >>>>>>>>       >>>>>>>>> though the API is already there, the behavior
> change
> >>>>> can
> >>>>>>>>       also be
> >>>>>>>>       >>>>>>>> considered
> >>>>>>>>       >>>>>>>>> a public change.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup is an interface
> and
> >>>>>>>>       this FLIP
> >>>>>>>>       >>>>>>>> does
> >>>>>>>>       >>>>>> not
> >>>>>>>>       >>>>>>>> seem to change its semantics/behavior. The FLIP
> does
> >>>>>>>>       change the
> >>>>>>>>       >>>>>>>> implementation/behavior of
> >>>>>>>>       SourceCoordinatorContext#metricGroup,
> >>>>>>>>       >>>>>>>> which
> >>>>>>>>       >>>>>> is
> >>>>>>>>       >>>>>>>> marked @Internal.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> Thus it might seem a bit weird to add in the public
> >>>>>>> interface
> >>>>>>>>       >>>>>>>> section
> >>>>>>>>       >>>>>>>> saying that we change the interface
> >>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup
> >>>>>>>>       >>>>>> from
> >>>>>>>>       >>>>>>>> returning null to non-null object.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> 2. Mention the newly added interface of
> >>>>>>>>       >>>>>> "OperatorCoordinator#metricGroup"
> >>>>>>>>       >>>>>>>>> in the "Proposed Changes" section or "Public
> >>>>> Interface"
> >>>>>>>>       section. As
> >>>>>>>>       >>>>> the
> >>>>>>>>       >>>>>>>>> FLIP said, OperatorCoordinator is widely used in
> many
> >>>>>>>>       connectors.
> >>>>>>>>       >>>>>> Though
> >>>>>>>>       >>>>>>>> it
> >>>>>>>>       >>>>>>>>> is still an @Internal API, I think it is worth
> >>>>>>>>       mentioning the
> >>>>>>>>       >>>>>>>>> change
> >>>>>>>>       >>>>> in
> >>>>>>>>       >>>>>>>> the
> >>>>>>>>       >>>>>>>>> FLIP.
> >>>>>>>>       >>>>>>>> Since OperatorCoordinator is an internal API, it
> seems
> >>>>>>>>       reasonable to
> >>>>>>>>       >>>>>>>> explain it in the proposed change section. The FLIP
> >>>>> seems
> >>>>>>>>       to have
> >>>>>>>>       >>>>>>>> documented this in point 5 of the proposed change
> >>>>> section.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> BTW, if we think there are @internal classes that
> are
> >>>>>>>>       important
> >>>>>>>>       >>>>>>>> enough
> >>>>>>>>       >>>>>> to
> >>>>>>>>       >>>>>>>> be added in the public interface section, it might
> be
> >>>>>>>>       useful to
> >>>>>>>>       >>>>>> explicitly
> >>>>>>>>       >>>>>>>> discuss this topic and document it in the "*What
> are
> >>>>> the
> >>>>>>>>       "public
> >>>>>>>>       >>>>>>>> interfaces" of the project*" in this
> >>>>>>>>       >>>>>>>> <
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>>>>>>       >>>>>>>> wiki.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>> Jark
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
> >>>>>>>>       <[email protected]>
> >>>>>>>>       >>>>>> wrote:
> >>>>>>>>       >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and
> Qingsheng.
> >>>>>>>>       >>>>>>>>>> After combining everyone's comments, the main
> >>>>> concerns and
> >>>>>>>>       >>>>>>>> corresponding
> >>>>>>>>       >>>>>>>>>> adjustments are as follows.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Q1: Common metrics are not quite useful.
> >>>>>>>>       >>>>>>>>>> numEventsIn and numEventsOut counters will be
> >>>>> removed
> >>>>>>>>       from the
> >>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
> >>>>> metrics do
> >>>>>>> not
> >>>>>>>>       >>>>>>>>>> provide
> >>>>>>>>       >>>>>>>>> enough
> >>>>>>>>       >>>>>>>>>> information for users. The users are more
> willing to
> >>>>>>>>       get the
> >>>>>>>>       >>>>>>>>>> number
> >>>>>>>>       >>>>> of
> >>>>>>>>       >>>>>>>>>> events of the specified type instead of the total
> >>>>>>>>       number. And this
> >>>>>>>>       >>>>>>>> metric
> >>>>>>>>       >>>>>>>>>> is calculated differently. The implementation
> could
> >>>>>>>>       register the
> >>>>>>>>       >>>>>> metric
> >>>>>>>>       >>>>>>>>> by
> >>>>>>>>       >>>>>>>>>> themselves.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Q2: This FLIP is overly complicated.
> >>>>>>>>       >>>>>>>>>> This FLIP will become concise after these
> >>>>> modifications.
> >>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
> >>>>>>>>       introduced into
> >>>>>>>>       >>>>> Flink
> >>>>>>>>       >>>>>>>> by
> >>>>>>>>       >>>>>>>>>> FLIP-179<
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>       >>>>>>>>>>> .
> >>>>>>>>       >>>>>>>>>> And
> >>>>>>>>       >>>>>>>>>> this FLIP will not change it. This FLIP only
> >>>>> provides a
> >>>>>>>>       new metric
> >>>>>>>>       >>>>>>>> option
> >>>>>>>>       >>>>>>>>>> and a new metric group scope. The changes in
> >>>>> proposed
> >>>>>>>>       changes
> >>>>>>>>       >>>>> provide
> >>>>>>>>       >>>>>>>> the
> >>>>>>>>       >>>>>>>>>> details about the modifications for the internal
> >>>>>>>>       classes, which
> >>>>>>>>       >>>>> might
> >>>>>>>>       >>>>>>>>> make
> >>>>>>>>       >>>>>>>>>> it look complicated.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Thanks for all the comments again. If there are
> no
> >>>>> further
> >>>>>>>>       >>>>>>>>>> comments,
> >>>>>>>>       >>>>>> we
> >>>>>>>>       >>>>>>>>>> plan to start the voting thread this week.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Qingsheng Ren <[email protected]>
> 于2022年12月26日周一
> >>>>>>>>       16:48写道：
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> Thanks for the FLIP, Hang!
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> This FLIP overall looks good to me. Actually I
> >>>>> share
> >>>>>>>>       the same
> >>>>>>>>       >>>>> concern
> >>>>>>>>       >>>>>>>>>> with
> >>>>>>>>       >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters
> are
> >>>>> not
> >>>>>>>>       quite
> >>>>>>>>       >>>>>>>>>>> useful
> >>>>>>>>       >>>>>>>> to
> >>>>>>>>       >>>>>>>>>> end
> >>>>>>>>       >>>>>>>>>>> users. OperatorEvent is a quite low-level
> >>>>> abstraction,
> >>>>>>>>       which
> >>>>>>>>       >>>>> requires
> >>>>>>>>       >>>>>>>>>>> instantialization in order to be practical to
> >>>>> users and
> >>>>>>>>       >>>>>>>>>>> developers,
> >>>>>>>>       >>>>>>>> so
> >>>>>>>>       >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>> Qingsheng
> >>>>>>>>       >>>>>>>>>>> Ververica (Alibaba)
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
> >>>>>>>>       <[email protected]>
> >>>>>>>>       >>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>>>> I still see no strong reason why we need
> >>>>>>>>       >>>>>>>>>>>> numEventsIn/numEventsOut
> >>>>>>>>       >>>>>>>>>>> metrics.
> >>>>>>>>       >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the
> >>>>> same
> >>>>>>>>       concern
> >>>>>>>>       >>>>>>>>>>>> from
> >>>>>>>>       >>>>>>>>>> others.
> >>>>>>>>       >>>>>>>>>>>> So I prefer to exclude them from this FLIP to
> >>>>> avoid
> >>>>>>>>       >>>>>>>>>>>> over-extending
> >>>>>>>>       >>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> scope.
> >>>>>>>>       >>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>> Thanks,
> >>>>>>>>       >>>>>>>>>>>> Zhu
> >>>>>>>>       >>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>> Hang Ruan <[email protected]>
> 于2022年12月23日周五
> >>>>>>>>       >>>>>>>>>>>> 15:21写道：
> >>>>>>>>       >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>       >>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we
> should
> >>>>> keep
> >>>>>>>>       the common
> >>>>>>>>       >>>>>>>>>>> metrics,
> >>>>>>>>       >>>>>>>>>>>> which will help to observe incoming and
> outgoing
> >>>>>>>>       events. What do
> >>>>>>>>       >>>>>>>> you
> >>>>>>>>       >>>>>>>>>>> think,
> >>>>>>>>       >>>>>>>>>>>> @Zhu Zhu ?
> >>>>>>>>       >>>>>>>>>>>>> And @Chesnay, are there any other issues you
> are
> >>>>>>>>       more concerned
> >>>>>>>>       >>>>>>>>>> about?
> >>>>>>>>       >>>>>>>>>>>> Looking forward to your reply.
> >>>>>>>>       >>>>>>>>>>>>> Thanks for all the comments. If there are no
> >>>>> further
> >>>>>>>>       >>>>>>>>>>>>> comments, we
> >>>>>>>>       >>>>>>>>>> plan
> >>>>>>>>       >>>>>>>>>>>> to start the voting thread next week.
> >>>>>>>>       >>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>> Hang Ruan <[email protected]>
> >>>>> 于2022年12月15日周四
> >>>>>>>>       >>>>>>>>>>>>> 16:49写道：
> >>>>>>>>       >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> Thanks for your feedback!
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> The OperatorCoordinator implementations are
> >>>>>>>>       different. And
> >>>>>>>>       >>>>>>>>>>>>>> their
> >>>>>>>>       >>>>>>>>>>>> metrics are much different too. We try to find
> the
> >>>>>>> common
> >>>>>>>>       >>>>>>>>>>>> metrics
> >>>>>>>>       >>>>>>>> and
> >>>>>>>>       >>>>>>>>>> put
> >>>>>>>>       >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If
> >>>>> most
> >>>>>>>>       developers
> >>>>>>>>       >>>>>>>> think
> >>>>>>>>       >>>>>>>>> we
> >>>>>>>>       >>>>>>>>>>> do
> >>>>>>>>       >>>>>>>>>>>> not need these common metrics, removing the
> common
> >>>>>>>>       metrics is
> >>>>>>>>       >>>>>>>>>>>> also
> >>>>>>>>       >>>>>>>>>>>> acceptable.
> >>>>>>>>       >>>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> Zhu Zhu <[email protected]> 于2022年12月14日周三
> >>>>>>>>       22:09写道：
> >>>>>>>>       >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> I think it is very useful, mainly in two
> >>>>> aspects:
> >>>>>>>>       >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
> >>>>> metrics.
> >>>>>>>>       >>>>>>>>>>>>>>> Currently
> >>>>>>>>       >>>>>>>>>>>>>>> the coordinators has no way to do this. And
> >>>>> operator
> >>>>>>>>       >>>>>>>> coordinator
> >>>>>>>>       >>>>>>>>>>>>>>> metric group further enables the
> >>>>> SplitEnumerator
> >>>>>>>>       to have
> >>>>>>>>       >>>>>>>>>>>>>>> access
> >>>>>>>>       >>>>>>>>>>>>>>> to a registered metric group (via the
> existing
> >>>>> public
> >>>>>>>>       >>>>>>>>>>>>>>> interface
> >>>>>>>>       >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()),
> which is
> >>>>>>>>       null at the
> >>>>>>>>       >>>>>>>>> moment.
> >>>>>>>>       >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
> >>>>>>>>       metrics. A clear
> >>>>>>>>       >>>>>>>>>>>> definition
> >>>>>>>>       >>>>>>>>>>>>>>> makes it easy for users to find their wanted
> >>>>>>>>       metrics. The
> >>>>>>>>       >>>>>>>>>> definition
> >>>>>>>>       >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics
> from
> >>>>>>> multiple
> >>>>>>>>       >>>>>>>>>>>> OperatorCoordinators
> >>>>>>>>       >>>>>>>>>>>>>>> of the same kind. E.g. each
> SourceCoordinator
> >>>>> may
> >>>>>>>>       have its
> >>>>>>>>       >>>>>>>>>>>>>>> own
> >>>>>>>>       >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should
> >>>>> not
> >>>>>>>>       be directly
> >>>>>>>>       >>>>>>>>>>> registered
> >>>>>>>>       >>>>>>>>>>>>>>> to the job metric group.
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of
> >>>>> the
> >>>>>>>>       introduced
> >>>>>>>>       >>>>>>>>> common
> >>>>>>>>       >>>>>>>>>>>> metrics
> >>>>>>>>       >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
> >>>>> there
> >>>>>>>>       any case
> >>>>>>>>       >>>>>>>> which
> >>>>>>>>       >>>>>>>>>>>> strongly
> >>>>>>>>       >>>>>>>>>>>>>>> requires them?
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >>>>>>>>       >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation
> is
> >>>>>>> overkill
> >>>>>>>>       >>>>>>>>>>>>>>> Directly using the job metric group can
> result
> >>>>> in
> >>>>>>>>       metric
> >>>>>>>>       >>>>>>>>> conflicts,
> >>>>>>>>       >>>>>>>>>>> as
> >>>>>>>>       >>>>>>>>>>>> mentioned
> >>>>>>>>       >>>>>>>>>>>>>>> in above #2.
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>       >>>>>>>>>>>>>>> Zhu
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Dong Lin <[email protected]>
> 于2022年12月10日周六
> >>>>>>>>       >>>>>>>>>>>>>>> 14:16写道：
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Hi Chesney,
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Just to double check with you,
> >>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >>>>>>>>       >>>>>>>>>>>> (annotated as
> >>>>>>>>       >>>>>>>>>>>>>>>> @PublicEvolving) has already been
> introduced
> >>>>> into
> >>>>>>>>       Flink by
> >>>>>>>>       >>>>>>>>>> FLIP-179
> >>>>>>>>       >>>>>>>>>>>>>>>> <
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>       >>>>>>>>>>>>> .
> >>>>>>>>       >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean
> we
> >>>>>>>>       should remove
> >>>>>>>>       >>>>>>>>> this
> >>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>       >>>>>>>>>>>>>>>> Dong
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
> >>>>> Schepler <
> >>>>>>>>       >>>>>>>>>>> [email protected]>
> >>>>>>>>       >>>>>>>>>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> >>>>>>>>       complicated. A
> >>>>>>>>       >>>>>>>>>>> dedicated
> >>>>>>>>       >>>>>>>>>>>>>>>>> coordinator MG implementation is
> overkill; it
> >>>>>>>>       could just
> >>>>>>>>       >>>>>>>>> re-use
> >>>>>>>>       >>>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the
> same
> >>>>>>>>       structure we
> >>>>>>>>       >>>>>>>>> have
> >>>>>>>>       >>>>>>>>>> on
> >>>>>>>>       >>>>>>>>>>>> TMs,
> >>>>>>>>       >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
> >>>>> required
> >>>>>>>>       anyway,
> >>>>>>>>       >>>>>>>>>> because
> >>>>>>>>       >>>>>>>>>>>> all the
> >>>>>>>>       >>>>>>>>>>>>>>>>> example metrics you listed can be
> >>>>> implemented on
> >>>>>>>>       the TM
> >>>>>>>>       >>>>>>>> side
> >>>>>>>>       >>>>>>>>> +
> >>>>>>>>       >>>>>>>>>>>>>>>>> aggregating them in the external metrics
> >>>>> backend.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
> >>>>> tries
> >>>>>>>>       to pull a
> >>>>>>>>       >>>>>>>>> fast
> >>>>>>>>       >>>>>>>>>>>> one on
> >>>>>>>>       >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is
> I'd be
> >>>>>>>>       against it.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good
> >>>>> and it
> >>>>>>>>       is pretty
> >>>>>>>>       >>>>>>>>>>>> informative.
> >>>>>>>>       >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
> >>>>> names:
> >>>>>>>>       >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
> >>>>> key as
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> *metrics.scope.jm.job.operator-coordinator*
> >>>>> for
> >>>>>>>>       >>>>>>>> consistency
> >>>>>>>>       >>>>>>>>>>> with
> >>>>>>>>       >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >>>>>>>>       >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>>>>>>       >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >>>>>>>>       >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>>>>>>       >>>>>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>>>>>>>> simplicity and consistency with
> >>>>>>>>       SCOPE_NAMING_OPERATOR
> >>>>>>>>       >>>>>>>>> (which
> >>>>>>>>       >>>>>>>>>> is
> >>>>>>>>       >>>>>>>>>>>> not named
> >>>>>>>>       >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Dong
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan
> <
> >>>>>>>>       >>>>>>>>>>> [email protected]>
> >>>>>>>>       >>>>>>>>>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1]
> Introduce
> >>>>>>>>       metric group
> >>>>>>>>       >>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator
> >>>>> is the
> >>>>>>>>       >>>>>>>>> coordinator
> >>>>>>>>       >>>>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>> runtime
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> operators and running on Job Manager.
> The
> >>>>>>>>       coordination
> >>>>>>>>       >>>>>>>>>>>> mechanism is
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> operator events between
> OperatorCoordinator
> >>>>>>>>       and its all
> >>>>>>>>       >>>>>>>>>>>> operators, the
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> coordination is more and more using in
> >>>>> Flink, for
> >>>>>>>>       >>>>>>>> example
> >>>>>>>>       >>>>>>>>>> many
> >>>>>>>>       >>>>>>>>>>>> Sources
> >>>>>>>>       >>>>>>>>>>>>>>>>> and
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
> >>>>> splits
> >>>>>>> and
> >>>>>>>>       >>>>>>>>>> coordinate
> >>>>>>>>       >>>>>>>>>>>> commits to
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> external systems. The
> OperatorCoordinator
> >>>>> is
> >>>>>>>>       widely
> >>>>>>>>       >>>>>>>> using
> >>>>>>>>       >>>>>>>>> in
> >>>>>>>>       >>>>>>>>>>>> flink kafka
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink
> >>>>> cdc
> >>>>>>>>       connector,
> >>>>>>>>       >>>>>>>>>> flink
> >>>>>>>>       >>>>>>>>>>>> hudi
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> connector and so on.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
> >>>>> scope
> >>>>>>>>       for the
> >>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinator
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> and not an implementation for the
> interface
> >>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator
> could
> >>>>> be
> >>>>>>>>       how many
> >>>>>>>>       >>>>>>>>>>>> splits/partitions
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> have been assigned to source readers,
> how
> >>>>> many
> >>>>>>>>       files
> >>>>>>>>       >>>>>>>> have
> >>>>>>>>       >>>>>>>>>> been
> >>>>>>>>       >>>>>>>>>>>> written
> >>>>>>>>       >>>>>>>>>>>>>>>>> out
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only
> >>>>> help
> >>>>>>>>       users to
> >>>>>>>>       >>>>>>>> know
> >>>>>>>>       >>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> job
> >>>>>>>>       >>>>>>>>>>>>>>>>> progress
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> but also make big job maintaining
> easier.
> >>>>> Thus we
> >>>>>>>>       >>>>>>>> propose
> >>>>>>>>       >>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> FLIP-274
> >>>>>>>>       >>>>>>>>>>>>>>>>> to
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >>>>>>>>       >>>>>>>> OperatorCoordinator
> >>>>>>>>       >>>>>>>>>> and
> >>>>>>>>       >>>>>>>>>>>> provide
> >>>>>>>>       >>>>>>>>>>>>>>>>> an
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> internal implementation for
> >>>>>>>>       >>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you
> >>>>> get
> >>>>>>>>       time? Any
> >>>>>>>>       >>>>>>>>>>> feedback
> >>>>>>>>       >>>>>>>>>>>> is
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to