Hi Chesnay,

> The new option must default to the configured value of
metrics.scope.jm.job because otherwise it isn't backwards-compatible with
existing coordinator metrics.
IIUC, there is only an OperatorCoordinatorMetricGroup interface without
implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan <ruanhang1...@gmail.com> please
correct me if I'm wrong.

> Intuitively I think the config key should be metrics.scope.jm.operator.,
similar to metrics.scope.jm.job.
However, the config key "metrics.scope.jm.job" is deprecated and is
replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
replaced with "metrics.scope.tm-job". That's why I though
"metrics.scope.jm-operator" might be more consistent.

> We would only support operator_name, operator_id, task_name and task_id
(+ everything else from the jm/job groups of course).
+1

Best,
Jark



On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ches...@apache.org> wrote:

> The new option must default to the configured value of
> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
> existing coordinator metrics.
> According to some earlier response in this thread there are already
> coordinator metrics.
>
> If that weren't the case you'd be correct, and we'd keep the default
> consistent with the default for tm operator metrics.
>
> Intuitively I think the config key should be metrics.scope.jm.operator.,
> similar to metrics.scope.jm.job.
> For symmetry we should also add new metrics.scope.tm.operator/task and
> eventually phase out metrics.scope.operator.
>
> > Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
> mean the coordinator doesn't run on specific tasks, so it doesn't have
> runtime information of tasks.
>
> That is true. I intentionally only referred to task id/name, as things
> like subtask index or attempt ids don't make sense for coordinators as you
> said.
> We would only support operator_name, operator_id, task_name and task_id (+
> everything else from the jm/job groups of course).
> (Maybe in the future attempt_num could be interesting though)
>
> On 30/01/2023 10:39, Jark Wu wrote:
>
> Hi Chesnay,
>
> IIUC, our only disagreement is whether to support task info in the metric
> format. But we all agree with the default metric format
> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
> which doesn't include task info, right?
>
> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
> mean the coordinator doesn't run on specific tasks, so it doesn't have
> runtime information of tasks. My concern about adding task info to the
> coordinator metric is that
> 1) if we register coordinator metrics for every subtask, the number of
> coordinator metrics will be exploded when task parallelism is large. This
> is an overhead to JM & metric systems, and the duplicated metrics are
> useless for users.
> 2) Correct me if I'm wrong, MetricGroup#counter only supports registering
> one metric. It seems it can't register a metric reporting for multiple
> metric names (e.g. multiple subtasks).
>
> Best,
> Jark
>
>
> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ches...@apache.org> wrote:
>
>> I hadn't looked in detail in to how the task info can be introduced, but
>> given that the coordinates are created by the execution graph, where we
>> only work on tasks, it should be possible and rather simple.
>>
>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from
>> which it can extract everything and create the task/operator groups.
>> Exposure to the coordinator could happen via the coordinator context as
>> originally planned.
>>
>>  > As we know, coordinator metrics are just like JMJobMetricGroup, which
>> don't belong to any tasks/vertexes.
>>
>> I don't think this is true, every coordinator is scoped to a specific
>> operator, and every operator is associated with a particular vertex.
>>
>>  From the javadocs:
>>
>> " A coordinator for runtime operators. The OperatorCoordinator runs on
>> the master, associated with
>>   the job vertex of the operator. It communicates with operators via
>> sending operator events."
>>
>> On 28/01/2023 09:15, Jark Wu wrote:
>> > Hi all,
>> >
>> > IIUC, Chesnay means we should have a more general metric group for
>> > **operator**
>> > not for **coordinator** in JM, this would be useful to extend
>> > other operator-specific
>> > metrics in the future. That means the new scope format should be
>> designed
>> > for
>> > the operator,
>> > e.g.,
>> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>> > The coordinator metric is a subgroup (a constant "coordinator" suffix)
>> of
>> > the JMOperatorMG.
>> >
>> > I think this is a nice design. However, I have a question about adding
>> task
>> > id/name to this list.
>> > How to get the task id/name when reporting a coordinator metric? As we
>> > know, coordinator metrics
>> > are just like JMJobMetricGroup, which don't belong to any
>> tasks/vertexes.
>> > Do you mean reporting
>> > coordinator metrics for every task id under the operator?
>> >
>> > Best,
>> > Jark
>> >
>> >
>> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org>
>> wrote:
>> >
>> >>   > First, I do not understand why users have to configure the new
>> scope
>> >> format, which has a default value.
>> >>
>> >> If you don't use scope formats, sure. If you do use scope formats, e.g.
>> >> to add a common prefix (which is the case for datadog users for
>> >> example), then the current default in the FLIP is insufficient and
>> >> requires the user to update the configuration.
>> >>
>> >>   > I usually do not need to change these configuration of scope
>> formats
>> >> when submitting the flink job.
>> >>
>> >> Scope formats as a whole are quite a power-user feature, but that
>> >> doesn't mean we should ignore it.
>> >>
>> >>   > I try to let it extend the ComponentMetricGroup
>> >>
>> >> This isn't inherently required just because it is a "component".
>> >> Component metric groups should _only_ be used for cases where such a
>> >> component provides several bits of metadata.
>> >> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>> >> and we'd like users to have the option on how this metadata ends up
>> >> being used (via scope formats). This currently can't be built with the
>> >> addGroup() methods, hence why the component groups exist.
>> >> However, the operator coordinator _itself_does not provide several bits
>> >> of metadata. Logically, the _operator_ does, not the coordinator.
>> >>
>> >>   > This make me decide to add a new scope format.
>> >>
>> >> And this is fine; just don't add a new scope format for the coordinator
>> >> but logically for operators on the JM side, such that we can extend the
>> >> set of operator-specific metrics in the future without having to add
>> yet
>> >> another scope format.
>> >>
>> >>   > The relationship between the Operator and OperatorCoordinator is
>> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> >> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> >> created here.
>> >>
>> >> That's all irrelevant.
>> >>
>> >>   > The registered metrics of OperatorCoordinator always are those
>> which
>> >> can not be aggregated from the tasks, like the number of unassigned
>> >> splits in the SourceCoordinator. Actually we have not encountered the
>> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> >> should only contain the metrics for itself instead of its subtasks.
>> >>
>> >> You are misunderstanding the problem.
>> >>
>> >> This isn't about aggregating metrics on our side or exposing metrics
>> >> from TMs on the JM side or anything like that.
>> >>
>> >> It's purely about the available metadata for operator metrics on the JM
>> >> side. Currently you suggest operator name / id; what I'm proposing is
>> to
>> >> add task(==vertex!) id / name to this list.
>> >>
>> >> The benefit here is simple. I can look up all metrics where
>> task_id==XXX
>> >> and get _everything_ related to that vertex, including all metrics
>> >> associated with operators that are part of that vertex, including the
>> >> coordinator.
>> >>
>> >>   > Using metrics.scope.jm.job is not enough to distinguish the
>> different
>> >> coordinators.
>> >>
>> >> I did not suggest just using metrics.scope.jm.job. I suggested that as
>> >> the default for the jm operator scope format, that is used for the
>> >> proposed JobManagerOperatorMetricGroup, which will have some plain
>> >> metric groups as children for coordinators.
>> >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>> >> and then addGroup(coordinator_name) for each coordinator.
>> >>
>> >>   > So there are two choice for the
>> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> scope
>> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
>> >> is better?
>> >>
>> >> There are more options than that.
>> >> I feel like there are a lot of misunderstandings in this discussion.
>> >> Please let me know if I could clear things up. If not I can also
>> provide
>> >> a PoC based on your PoC if that can speed things up. It's not that
>> >> different anyway.
>> >>
>> >> On 19/01/2023 07:39, Hang Ruan wrote:
>> >>> Hi, chesnay,
>> >>>
>> >>> Thanks for your reply. I still have some doubts about the questions
>> >>> you raised.
>> >>>
>> >>>      > Extending the set of ScopeFormats is problematic because it in
>> >>>      practice
>> >>>      it breaks the config if users actively rely on it, since there's
>> now
>> >>>      another key that they _must_ set for it to be
>> >>>      consistent/compatible with
>> >>>      their existing setup.
>> >>>      Unfortunately due to how powerful scope formats are we can't
>> derive a
>> >>>      default value that matches their existing setup.
>> >>>      Hence we should try to do this as rarely as possible.
>> >>>
>> >>>
>> >>>      > This FLIP does not adhere to that since it proposes a dedicated
>> >>>      format
>> >>>      for coordinators; next time we want to expose operator-specific
>> >>>      metrics
>> >>>      (e.g., in the scheduler) we'd have to add another one to support
>> it.
>> >>>
>> >>>
>> >>> First, I do not understand why users have to configure the new scope
>> >>> format, which has a default value. I usually do not need to change
>> >>> these configuration of scope formats when submitting the flink job.
>> >>> IMO, the OperatorCoordinator is a component running at the JobMaster.
>> >>> I try to let it extend the ComponentMetricGroup, but I cannot find
>> >>> a suitable scope. This make me decide to add a new scope format.
>> >>>
>> >>>      > Additionally, the configurable variables (operator name/id) are
>> >>>      logically not attached to the coordinator, but operators, so to
>> me it
>> >>>      just doesn't make sense to structure it like this.
>> >>>
>> >>>
>> >>> The relationship between the Operator and OperatorCoordinator is
>> >>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> >>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> >>> created here.
>> >>>
>> >>>      > Another thing I'm concerned about is that, because we don't
>> include
>> >>>      tasks in the hierarchy, users wishing to collect all metrics for
>> a
>> >>>      particular task (in this case ==vertex) now have to go
>> >>>      significantly out
>> >>>      of their way to get them, since they can no longer just filter
>> by the
>> >>>      task ID but have to be filter for _all_ operators that are part
>> of
>> >>>      the task.
>> >>>
>> >>>
>> >>> The registered metrics of OperatorCoordinator always are those which
>> >>> can not be aggregated from the tasks, like the number of unassigned
>> >>> splits in the SourceCoordinator. Actually we have not encountered the
>> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> >>> should only contain the metrics for itself instead of its subtasks.
>> >>>
>> >>> Using metrics.scope.jm.job is not enough to distinguish the different
>> >>> coordinators. The operator id/name is necessary. Then
>> >>> the implementation should be like this. But users can not change the
>> >>> scope in this way. This is also acceptable.
>> >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>> >>>           extends ProxyMetricGroup<MetricGroup>
>> >>>           implements OperatorCoordinatorMetricGroup {
>> >>>       public InternalOperatorCoordinatorMetricGroup(
>> >>>               JobManagerJobMetricGroup parent, OperatorID operatorID,
>> >> String operatorName) {
>> >>>           super(parent.addGroup(operatorID +
>> >> operatorName).addGroup("coordinator")); }
>> >>> }
>> >>> So there are two choice for the
>> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>> >>> Which one is better?
>> >>>
>> >>> Best,
>> >>> Hang
>> >>>
>> >>>
>> >>> Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道:
>> >>>
>> >>>      You're misunderstanding the problem.
>> >>>
>> >>>      Metric groups form a tree with each group providing certain
>> metadata.
>> >>>
>> >>>      E.g., on the taskmanager we have a TM metric group that provides
>> info
>> >>>      about the TM, that has child task metric groups, that have child
>> >>>      operator metric groups etc. The operator metric group again has
>> >>>      children, where we are now mostly in the user/connector land,
>> like a
>> >>>      kafka metric group providing partition info or something.
>> >>>
>> >>>      What is being proposed is to create a coordinator metric group on
>> >>>      the JM
>> >>>      side that provides operator metadata.
>> >>>
>> >>>      A more appropriate structure is to create an operator group on
>> the JM
>> >>>      side that provides this info, with the coordinator metric group
>> >>>      being a
>> >>>      child.
>> >>>
>> >>>      On 17/01/2023 19:56, Steven Wu wrote:
>> >>>      >> Additionally, the configurable variables (operator name/id)
>> are
>> >>>      > logically not attached to the coordinator, but operators, so to
>> >>>      me it
>> >>>      > just doesn't make sense to structure it like this.
>> >>>      >
>> >>>      > Chesnay, maybe we should clarify the terminology. To me,
>> >>>      pperators (like
>> >>>      > FLIP-27 source) can have two parts (coordinator and
>> >>>      reader/subtask). I
>> >>>      > think it is fine to include operator name/id for coordinator
>> >>>      metrics.
>> >>>      >
>> >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>> >>>      <ches...@apache.org> wrote:
>> >>>      >
>> >>>      >> Slight correction: Using metrics.scope.jm.job as the default
>> >>>      should be
>> >>>      >> safe.
>> >>>      >>
>> >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>> >>>      >>> The proposed ScopeFormat is still problematic for a few
>> reasons.
>> >>>      >>>
>> >>>      >>> Extending the set of ScopeFormats is problematic because it
>> in
>> >>>      >>> practice it breaks the config if users actively rely on it,
>> since
>> >>>      >>> there's now another key that they _must_ set for it to be
>> >>>      >>> consistent/compatible with their existing setup.
>> >>>      >>> Unfortunately due to how powerful scope formats are we can't
>> >>>      derive a
>> >>>      >>> default value that matches their existing setup.
>> >>>      >>> Hence we should try to do this as rarely as possible.
>> >>>      >>>
>> >>>      >>> This FLIP does not adhere to that since it proposes a
>> >>>      dedicated format
>> >>>      >>> for coordinators; next time we want to expose
>> operator-specific
>> >>>      >>> metrics (e.g., in the scheduler) we'd have to add another
>> one to
>> >>>      >>> support it.
>> >>>      >>>
>> >>>      >>> Additionally, the configurable variables (operator name/id)
>> are
>> >>>      >>> logically not attached to the coordinator, but operators, so
>> >>>      to me it
>> >>>      >>> just doesn't make sense to structure it like this.
>> >>>      >>>
>> >>>      >>> Another thing I'm concerned about is that, because we don't
>> >>>      include
>> >>>      >>> tasks in the hierarchy, users wishing to collect all metrics
>> for
>> >> a
>> >>>      >>> particular task (in this case ==vertex) now have to go
>> >>>      significantly
>> >>>      >>> out of their way to get them, since they can no longer just
>> >>>      filter by
>> >>>      >>> the task ID but have to be filter for _all_ operators that
>> are
>> >>>      part of
>> >>>      >>> the task.
>> >>>      >>>
>> >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>> >>>      >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
>> >>>      >>>>
>> >>>      >>>> Do you have time to help to review this FLIP again after the
>> >>>      >>>> modification?
>> >>>      >>>> Looking forward to your reply.
>> >>>      >>>> This FLIP will add a new configuration for the
>> >>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>> >>>      internal
>> >>>      >>>> implementation and is added as a component to the
>> >>>      >>>> JobManagerJobMetricGroup.
>> >>>      >>>> If something doesn't make sense, could you provide some
>> >>>      advice? It
>> >>>      >>>> will be
>> >>>      >>>> very helpful. Thanks a lot for your help.
>> >>>      >>>>
>> >>>      >>>> Best,
>> >>>      >>>> Hang
>> >>>      >>>>
>> >>>      >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三
>> >>>      16:34写道:
>> >>>      >>>>
>> >>>      >>>>
>> >>>      >>>>> Hi Hang,
>> >>>      >>>>>
>> >>>      >>>>> I'm a bit surprised that this has gone to a vote, given
>> that
>> >>>      Chesnay
>> >>>      >>>>> deliberately mentioned that he would vote against it as-is.
>> >>>      I would
>> >>>      >>>>> expect
>> >>>      >>>>> that before going to a vote, he has had the opportunity to
>> >>>      >>>>> participate in
>> >>>      >>>>> this discussion.
>> >>>      >>>>>
>> >>>      >>>>> Best regards,
>> >>>      >>>>>
>> >>>      >>>>> Martijn
>> >>>      >>>>>
>> >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com>
>> >>>      wrote:
>> >>>      >>>>>
>> >>>      >>>>>> Hi Dong,
>> >>>      >>>>>>
>> >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>> >>>      concern is
>> >>>      >>>>>> that
>> >>>      >>>>>> this is a core interface for the FLIP. It’s hard to tell
>> how
>> >>>      >>>>>> sources use
>> >>>      >>>>>> metric group without mentioning this interface. Even if
>> >>>      this is an
>> >>>      >>>>> existing
>> >>>      >>>>>> API, I think it’s worth introducing the interface again
>> and
>> >>>      declaring
>> >>>      >>>>> that
>> >>>      >>>>>> we will implement the interface instead of a no-op method
>> >>>      in this
>> >>>      >>>>>> FLIP.
>> >>>      >>>>>>
>> >>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
>> >>>      FLIP. I’m
>> >>>      >>>>>> +1 to
>> >>>      >>>>>> start a vote.
>> >>>      >>>>>>
>> >>>      >>>>>> Best,
>> >>>      >>>>>> Jark
>> >>>      >>>>>>
>> >>>      >>>>>>
>> >>>      >>>>>>> 2023年1月3日 10:03,Hang Ruan <ruanhang1...@gmail.com> 写道:
>> >>>      >>>>>>>
>> >>>      >>>>>>> Hi, Jark and Dong,
>> >>>      >>>>>>>
>> >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>> >>>      >>>>>>>
>> >>>      >>>>>>> For suggestion 1, I plan to implement the
>> >>>      >>>>>>> SplitEnumeratorMetricGroup in
>> >>>      >>>>>>> another issue, and it is not contained in this FLIP. I
>> >>>      will add some
>> >>>      >>>>>>> description about this part.
>> >>>      >>>>>>> For suggestion 2, changes about
>> >>>      OperatorCoordinator#metricGroup has
>> >>>      >>>>>> already
>> >>>      >>>>>>> been documented in the proposed change section.
>> >>>      >>>>>>>
>> >>>      >>>>>>> Best,
>> >>>      >>>>>>> Hang
>> >>>      >>>>>>>
>> >>>      >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道:
>> >>>      >>>>>>>
>> >>>      >>>>>>>> Let me chime-in and add comments regarding the public
>> >>>      interface
>> >>>      >>>>> section.
>> >>>      >>>>>>>> Please see my comments inline.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>> >>>      <imj...@gmail.com> wrote:
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
>> >>>      very useful
>> >>>      >>>>>> feature
>> >>>      >>>>>>>>> for connectors.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
>> >>>      suggestions.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> 1. In the "Public Interface" section, mention that the
>> >>>      >>>>>>>>> implementation
>> >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>> >>>      changed from
>> >>>      >>>>>>>> returning
>> >>>      >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>> >>>      instance.
>> >>>      >>>>> Even
>> >>>      >>>>>>>>> though the API is already there, the behavior change
>> can
>> >>>      also be
>> >>>      >>>>>>>> considered
>> >>>      >>>>>>>>> a public change.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>> >>>      this FLIP
>> >>>      >>>>>>>> does
>> >>>      >>>>>> not
>> >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>> >>>      change the
>> >>>      >>>>>>>> implementation/behavior of
>> >>>      SourceCoordinatorContext#metricGroup,
>> >>>      >>>>>>>> which
>> >>>      >>>>>> is
>> >>>      >>>>>>>> marked @Internal.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>> >> interface
>> >>>      >>>>>>>> section
>> >>>      >>>>>>>> saying that we change the interface
>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>> >>>      >>>>>> from
>> >>>      >>>>>>>> returning null to non-null object.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> 2. Mention the newly added interface of
>> >>>      >>>>>> "OperatorCoordinator#metricGroup"
>> >>>      >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>> >>>      section. As
>> >>>      >>>>> the
>> >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>> >>>      connectors.
>> >>>      >>>>>> Though
>> >>>      >>>>>>>> it
>> >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>> >>>      mentioning the
>> >>>      >>>>>>>>> change
>> >>>      >>>>> in
>> >>>      >>>>>>>> the
>> >>>      >>>>>>>>> FLIP.
>> >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>> >>>      reasonable to
>> >>>      >>>>>>>> explain it in the proposed change section. The FLIP
>> seems
>> >>>      to have
>> >>>      >>>>>>>> documented this in point 5 of the proposed change
>> section.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> BTW, if we think there are @internal classes that are
>> >>>      important
>> >>>      >>>>>>>> enough
>> >>>      >>>>>> to
>> >>>      >>>>>>>> be added in the public interface section, it might be
>> >>>      useful to
>> >>>      >>>>>> explicitly
>> >>>      >>>>>>>> discuss this topic and document it in the "*What are the
>> >>>      "public
>> >>>      >>>>>>>> interfaces" of the project*" in this
>> >>>      >>>>>>>> <
>> >>>      >>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>> >>>      >>>>>>>> wiki.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>> Best,
>> >>>      >>>>>>>>> Jark
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>> >>>      <ruanhang1...@gmail.com>
>> >>>      >>>>>> wrote:
>> >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>> >>>      >>>>>>>>>> After combining everyone's comments, the main
>> concerns and
>> >>>      >>>>>>>> corresponding
>> >>>      >>>>>>>>>> adjustments are as follows.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>> >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>> >>>      from the
>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics
>> do
>> >> not
>> >>>      >>>>>>>>>> provide
>> >>>      >>>>>>>>> enough
>> >>>      >>>>>>>>>> information for users. The users are more willing to
>> >>>      get the
>> >>>      >>>>>>>>>> number
>> >>>      >>>>> of
>> >>>      >>>>>>>>>> events of the specified type instead of the total
>> >>>      number. And this
>> >>>      >>>>>>>> metric
>> >>>      >>>>>>>>>> is calculated differently. The implementation could
>> >>>      register the
>> >>>      >>>>>> metric
>> >>>      >>>>>>>>> by
>> >>>      >>>>>>>>>> themselves.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>> >>>      >>>>>>>>>> This FLIP will become concise after these
>> modifications.
>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>> >>>      introduced into
>> >>>      >>>>> Flink
>> >>>      >>>>>>>> by
>> >>>      >>>>>>>>>> FLIP-179<
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >>>      >>>>>>>>>>> .
>> >>>      >>>>>>>>>> And
>> >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only provides
>> a
>> >>>      new metric
>> >>>      >>>>>>>> option
>> >>>      >>>>>>>>>> and a new metric group scope. The changes in proposed
>> >>>      changes
>> >>>      >>>>> provide
>> >>>      >>>>>>>> the
>> >>>      >>>>>>>>>> details about the modifications for the internal
>> >>>      classes, which
>> >>>      >>>>> might
>> >>>      >>>>>>>>> make
>> >>>      >>>>>>>>>> it look complicated.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Thanks for all the comments again. If there are no
>> further
>> >>>      >>>>>>>>>> comments,
>> >>>      >>>>>> we
>> >>>      >>>>>>>>>> plan to start the voting thread this week.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Best,
>> >>>      >>>>>>>>>> Hang
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一
>> >>>      16:48写道:
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>> >>>      the same
>> >>>      >>>>> concern
>> >>>      >>>>>>>>>> with
>> >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are
>> not
>> >>>      quite
>> >>>      >>>>>>>>>>> useful
>> >>>      >>>>>>>> to
>> >>>      >>>>>>>>>> end
>> >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level
>> abstraction,
>> >>>      which
>> >>>      >>>>> requires
>> >>>      >>>>>>>>>>> instantialization in order to be practical to users
>> and
>> >>>      >>>>>>>>>>> developers,
>> >>>      >>>>>>>> so
>> >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>> Qingsheng
>> >>>      >>>>>>>>>>> Ververica (Alibaba)
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>> >>>      <reed...@gmail.com>
>> >>>      >>>>> wrote:
>> >>>      >>>>>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>>>> I still see no strong reason why we need
>> >>>      >>>>>>>>>>>> numEventsIn/numEventsOut
>> >>>      >>>>>>>>>>> metrics.
>> >>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>> >>>      concern
>> >>>      >>>>>>>>>>>> from
>> >>>      >>>>>>>>>> others.
>> >>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>> >>>      >>>>>>>>>>>> over-extending
>> >>>      >>>>>>>>> the
>> >>>      >>>>>>>>>>>> scope.
>> >>>      >>>>>>>>>>>>
>> >>>      >>>>>>>>>>>> Thanks,
>> >>>      >>>>>>>>>>>> Zhu
>> >>>      >>>>>>>>>>>>
>> >>>      >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五
>> >>>      >>>>>>>>>>>> 15:21写道:
>> >>>      >>>>>>>>>>>>> Hi everyone,
>> >>>      >>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should
>> keep
>> >>>      the common
>> >>>      >>>>>>>>>>> metrics,
>> >>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
>> >>>      events. What do
>> >>>      >>>>>>>> you
>> >>>      >>>>>>>>>>> think,
>> >>>      >>>>>>>>>>>> @Zhu Zhu ?
>> >>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>> >>>      more concerned
>> >>>      >>>>>>>>>> about?
>> >>>      >>>>>>>>>>>> Looking forward to your reply.
>> >>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no
>> further
>> >>>      >>>>>>>>>>>>> comments, we
>> >>>      >>>>>>>>>> plan
>> >>>      >>>>>>>>>>>> to start the voting thread next week.
>> >>>      >>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四
>> >>>      >>>>>>>>>>>>> 16:49写道:
>> >>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> Thanks for your feedback!
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>> >>>      different. And
>> >>>      >>>>>>>>>>>>>> their
>> >>>      >>>>>>>>>>>> metrics are much different too. We try to find the
>> >> common
>> >>>      >>>>>>>>>>>> metrics
>> >>>      >>>>>>>> and
>> >>>      >>>>>>>>>> put
>> >>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>> >>>      developers
>> >>>      >>>>>>>> think
>> >>>      >>>>>>>>> we
>> >>>      >>>>>>>>>>> do
>> >>>      >>>>>>>>>>>> not need these common metrics, removing the common
>> >>>      metrics is
>> >>>      >>>>>>>>>>>> also
>> >>>      >>>>>>>>>>>> acceptable.
>> >>>      >>>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三
>> >>>      22:09写道:
>> >>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>> >>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>> metrics.
>> >>>      >>>>>>>>>>>>>>> Currently
>> >>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>> operator
>> >>>      >>>>>>>> coordinator
>> >>>      >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>> >>>      to have
>> >>>      >>>>>>>>>>>>>>> access
>> >>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing
>> public
>> >>>      >>>>>>>>>>>>>>> interface
>> >>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>> >>>      null at the
>> >>>      >>>>>>>>> moment.
>> >>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>> >>>      metrics. A clear
>> >>>      >>>>>>>>>>>> definition
>> >>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>> >>>      metrics. The
>> >>>      >>>>>>>>>> definition
>> >>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>> >> multiple
>> >>>      >>>>>>>>>>>> OperatorCoordinators
>> >>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>> >>>      have its
>> >>>      >>>>>>>>>>>>>>> own
>> >>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>> >>>      be directly
>> >>>      >>>>>>>>>>> registered
>> >>>      >>>>>>>>>>>>>>> to the job metric group.
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>> >>>      introduced
>> >>>      >>>>>>>>> common
>> >>>      >>>>>>>>>>>> metrics
>> >>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
>> there
>> >>>      any case
>> >>>      >>>>>>>> which
>> >>>      >>>>>>>>>>>> strongly
>> >>>      >>>>>>>>>>>>>>> requires them?
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>> >>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>> >> overkill
>> >>>      >>>>>>>>>>>>>>> Directly using the job metric group can result in
>> >>>      metric
>> >>>      >>>>>>>>> conflicts,
>> >>>      >>>>>>>>>>> as
>> >>>      >>>>>>>>>>>> mentioned
>> >>>      >>>>>>>>>>>>>>> in above #2.
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Thanks,
>> >>>      >>>>>>>>>>>>>>> Zhu
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六
>> >>>      >>>>>>>>>>>>>>> 14:16写道:
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Hi Chesney,
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Just to double check with you,
>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>> >>>      >>>>>>>>>>>> (annotated as
>> >>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced
>> into
>> >>>      Flink by
>> >>>      >>>>>>>>>> FLIP-179
>> >>>      >>>>>>>>>>>>>>>> <
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >>>      >>>>>>>>>>>>> .
>> >>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>> >>>      should remove
>> >>>      >>>>>>>>> this
>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Regards,
>> >>>      >>>>>>>>>>>>>>>> Dong
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>> Schepler <
>> >>>      >>>>>>>>>>> ches...@apache.org>
>> >>>      >>>>>>>>>>>> wrote:
>> >>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>> >>>      complicated. A
>> >>>      >>>>>>>>>>> dedicated
>> >>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>> >>>      could just
>> >>>      >>>>>>>>> re-use
>> >>>      >>>>>>>>>>> the
>> >>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>> >>>      structure we
>> >>>      >>>>>>>>> have
>> >>>      >>>>>>>>>> on
>> >>>      >>>>>>>>>>>> TMs,
>> >>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>> required
>> >>>      anyway,
>> >>>      >>>>>>>>>> because
>> >>>      >>>>>>>>>>>> all the
>> >>>      >>>>>>>>>>>>>>>>> example metrics you listed can be implemented
>> on
>> >>>      the TM
>> >>>      >>>>>>>> side
>> >>>      >>>>>>>>> +
>> >>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>> backend.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
>> tries
>> >>>      to pull a
>> >>>      >>>>>>>>> fast
>> >>>      >>>>>>>>>>>> one on
>> >>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>> >>>      against it.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>> >>>      >>>>>>>>>>>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and
>> it
>> >>>      is pretty
>> >>>      >>>>>>>>>>>> informative.
>> >>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
>> names:
>> >>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key
>> as
>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*
>> for
>> >>>      >>>>>>>> consistency
>> >>>      >>>>>>>>>>> with
>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>> >>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>> >>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>> >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>> >>>      >>>>>>>>>>>> for
>> >>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
>> >>>      SCOPE_NAMING_OPERATOR
>> >>>      >>>>>>>>> (which
>> >>>      >>>>>>>>>> is
>> >>>      >>>>>>>>>>>> not named
>> >>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> Cheers,
>> >>>      >>>>>>>>>>>>>>>>>> Dong
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>> >>>      >>>>>>>>>>> ruanhang1...@gmail.com>
>> >>>      >>>>>>>>>>>> wrote:
>> >>>      >>>>>>>>>>>>>>>>>>> Hi all,
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>> >>>      metric group
>> >>>      >>>>>>>>> for
>> >>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is
>> the
>> >>>      >>>>>>>>> coordinator
>> >>>      >>>>>>>>>>> for
>> >>>      >>>>>>>>>>>> runtime
>> >>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>> >>>      coordination
>> >>>      >>>>>>>>>>>> mechanism is
>> >>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>> >>>      and its all
>> >>>      >>>>>>>>>>>> operators, the
>> >>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>> Flink, for
>> >>>      >>>>>>>> example
>> >>>      >>>>>>>>>> many
>> >>>      >>>>>>>>>>>> Sources
>> >>>      >>>>>>>>>>>>>>>>> and
>> >>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
>> splits
>> >> and
>> >>>      >>>>>>>>>> coordinate
>> >>>      >>>>>>>>>>>> commits to
>> >>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>> >>>      widely
>> >>>      >>>>>>>> using
>> >>>      >>>>>>>>> in
>> >>>      >>>>>>>>>>>> flink kafka
>> >>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>> >>>      connector,
>> >>>      >>>>>>>>>> flink
>> >>>      >>>>>>>>>>>> hudi
>> >>>      >>>>>>>>>>>>>>>>>>> connector and so on.
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
>> scope
>> >>>      for the
>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
>> >>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>> >>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>> >>>      how many
>> >>>      >>>>>>>>>>>> splits/partitions
>> >>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how
>> many
>> >>>      files
>> >>>      >>>>>>>> have
>> >>>      >>>>>>>>>> been
>> >>>      >>>>>>>>>>>> written
>> >>>      >>>>>>>>>>>>>>>>> out
>> >>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>> >>>      users to
>> >>>      >>>>>>>> know
>> >>>      >>>>>>>>>> the
>> >>>      >>>>>>>>>>>> job
>> >>>      >>>>>>>>>>>>>>>>> progress
>> >>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>> Thus we
>> >>>      >>>>>>>> propose
>> >>>      >>>>>>>>>> the
>> >>>      >>>>>>>>>>>> FLIP-274
>> >>>      >>>>>>>>>>>>>>>>> to
>> >>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>> >>>      >>>>>>>> OperatorCoordinator
>> >>>      >>>>>>>>>> and
>> >>>      >>>>>>>>>>>> provide
>> >>>      >>>>>>>>>>>>>>>>> an
>> >>>      >>>>>>>>>>>>>>>>>>> internal implementation for
>> >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>> >>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>> >>>      time? Any
>> >>>      >>>>>>>>>>> feedback
>> >>>      >>>>>>>>>>>> is
>> >>>      >>>>>>>>>>>>>>>>>>> appreciated!
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> [1]
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>> >>>      >>
>> >>>
>>
>>
>

Reply via email to