Hi all,

Alex (cc'ed) and I discussed offline about the endpoint path and making it
easier to use. Focusing on the main clients of the Flink Metrics REST API,
the Flink UI and Flink Kubernetes Operator, we determined that exposing the
operator id is unnecessary. There are few considerations:

1. Integrate Flink UI to show source coordinator metrics

The Flink UI currently doesn't expose per operator metrics, only task
metrics. For operator metrics, metric reporters provide that extensibility
to expose operator granularity metrics. So, the operator id is unnecessary
for this case.

2. Integrate Flink Kubernetes Operator to read autoscaling metrics from
source enumerator

The K8s operator currently reads vertex metrics from the Flink Metric REST
API to perform autoscaling. In this situation, the operator id is
unnecessary as well and in fact, a vertex can only contain 1 source.
Therefore, we don't need a parameter for the operator id.

Due to these requirements, I recommend to use the path:

`/jobs/<jobid>/vertices/<vertexid>/coordinator-metrics`.
`coordinator-metrics` makes it obvious that the metrics are from the
OperatorCoordinator, not to be confused with something like
`operator-metrics` (which operator is it?).

In addition, I also propose to fix the metric scope [1]:

   - metrics.scope.operator
      - Default:
      <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
      - Applied to all metrics that were scoped to an operator.

The default should not contain the subtask index, since the coordinator
does not correspond to a subtask index. In addition, this configuration
could be renamed to `metrics.scope.coordinator` since `operator` is vague.
While we will point to the new config in the docs, backward compatibility
will be provided for the old config key.

I have updated the FLIP doc with these details.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope

Best,
Mason

On Tue, Jan 23, 2024 at 12:13 PM Mason Chen <mas.chen6...@gmail.com> wrote:

> Hi all,
>
> I hope that everyone has had sufficient time to review the discussion and
> the updates on the FLIP doc. If there are no objections, I'll start a
> voting thread in 2 days.
>
> Best,
> Mason
>
> On Thu, Jan 18, 2024 at 2:39 PM Mason Chen <mas.chen6...@gmail.com> wrote:
>
>> Hi Lijie,
>>
>> That's also a possibility but I would prefer to keep it consistent with
>> how the existing metric APIs are used.
>>
>> For example, in the current metric APIs [1], there is no way to figure
>> out the vertexid and subtaskindex without getting the job graph from
>> `/jobs/<jobid>/plan` and correspondingly there are no APIs to return a map
>> of all metrics for every vertex and to return a map of all metrics for
>> every subtask. Essentially, the plan API is already required to use the
>> finer-grained metric apis.
>>
>> In addition, keeping the design similar lends itself better for the
>> implementation. The metric handler utilities [2] assume a
>> ComponentMetricStore is returned rather than a Map.
>>
>> I've updated the FLIP doc (
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API)
>> with our discussion so far.
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#rest-api-integration
>> [2]
>> https://github.com/apache/flink/blob/a41229b24d82e8c561350c42d8a98dfb865c3f69/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/metrics/AbstractMetricsHandler.java#L109
>>
>> Best,
>> Mason
>>
>> On Wed, Jan 17, 2024 at 8:28 AM Lijie Wang <wangdachui9...@gmail.com>
>> wrote:
>>
>>> Hi Mason,
>>>
>>> Thanks for driving the discussion. +1 for the proposal.
>>>
>>> How about we return all operator metrics in a vertex? (the response is a
>>> map of operatorId/operatorName -> operator-metrics). Correspondingly, the
>>> url may be changed to /jobs/<jobid>/vertices/<vertexid>/operator-metrics
>>>
>>> In this way, users can skip the step of obtaining the operator id.
>>>
>>> Best,
>>> Lijie
>>>
>>> Hang Ruan <ruanhang1...@gmail.com> 于2024年1月17日周三 10:31写道:
>>>
>>> > Hi, Mason.
>>> >
>>> > The field `operatorName` in JobManagerOperatorQueryScopeInfo refers to
>>> the
>>> > fields in OperatorQueryScopeInfo and chooses the operatorName instead
>>> of
>>> > OperatorID.
>>> > It is fine by my side to change from opertorName to operatorID in this
>>> > FLIP.
>>> >
>>> > Best,
>>> > Hang
>>> >
>>> > Mason Chen <mas.chen6...@gmail.com> 于2024年1月17日周三 09:39写道:
>>> >
>>> > > Hi Xuyang and Hang,
>>> > >
>>> > > Thanks for your support and feedback! See my responses below:
>>> > >
>>> > > 1. IIRC, in a sense, operator ID and vertex ID are the same thing.
>>> The
>>> > > > operator ID can
>>> > > > be converted from the vertex ID[1]. Therefore, it is somewhat
>>> strange
>>> > to
>>> > > > have both vertex
>>> > > > ID and operator ID in a single URL.
>>> > > >
>>> > > I think Hang explained it perfectly. Essentially, a vertix may
>>> contain
>>> > one
>>> > > or more operators so the operator ID is required to distinguish this
>>> > case.
>>> > >
>>> > > 2. If I misunderstood the semantics of operator IDs here, then what
>>> is
>>> > the
>>> > > > relationship
>>> > > > between vertex ID and operator ID, and do we need a url like
>>> > > > `/jobs/<jobid>/vertices/<vertexid>/operators/`
>>> > > > to list all operator ids under this vertices?
>>> > > >
>>> > > Good question, we definitely need expose operator IDs through the
>>> REST
>>> > API
>>> > > to make this usable. I'm looking at how users would currently
>>> discover
>>> > the
>>> > > vertex id to query. From the supported REST APIs [1], you can
>>> currently
>>> > > obtain it from
>>> > >
>>> > > 1. `/jobs/<jobid>`
>>> > > 2. `/jobs/<jobid>/plan`
>>> > >
>>> > > From the response of both these APIs, they include the vertex ids
>>> (the
>>> > > vertices AND nodes fields), but not the operator ids. We would need
>>> to
>>> > add
>>> > > the logic to the plan generation [2]. The response is a little
>>> confusing
>>> > > because there is a field in the vertices called "operator name". I
>>> > propose
>>> > > to add a new field called "operators" to the vertex response object,
>>> > which
>>> > > would be a list of objects with the structure
>>> > >
>>> > > Operator
>>> > > {
>>> > >   "id": "THE-FLINK-GENERATED-ID"
>>> > > }.
>>> > >
>>> > > The JobManagerOperatorQueryScopeInfo has three fields: jobID,
>>> vertexID
>>> > and
>>> > > > operatorName. So we should use the operator name in the API.
>>> > > > If you think we should use the operator id, there need be more
>>> changes
>>> > > > about it.
>>> > > >
>>> > > I think we should use operator id since it uniquely identifies an
>>> > > operator--on the contrary, the operator name does not (it may be
>>> empty or
>>> > > repeated between operators by the user). I actually had a question on
>>> > that
>>> > > since you implemented the metric group. What's the reason we use
>>> operator
>>> > > name currently? Could it also use operator id so we can match
>>> against the
>>> > > id?
>>> > >
>>> > > [1]
>>> > >
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
>>> > > [2]
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/416cb7aaa02c176e01485ff11ab4269f76b5e9e2/flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/jsonplan/JsonPlanGenerator.java#L53
>>> > >
>>> > > Best,
>>> > > Mason
>>> > >
>>> > >
>>> > > On Thu, Jan 11, 2024 at 10:54 PM Hang Ruan <ruanhang1...@gmail.com>
>>> > wrote:
>>> > >
>>> > > > Hi, Mason.
>>> > > >
>>> > > > Thanks for driving this FLIP.
>>> > > >
>>> > > > The JobManagerOperatorQueryScopeInfo has three fields: jobID,
>>> vertexID
>>> > > and
>>> > > > operatorName. So we should use the operator name in the API.
>>> > > > If you think we should use the operator id, there need be more
>>> changes
>>> > > > about it.
>>> > > >
>>> > > > About the Xuyang's questions, we add both vertexID and operatorID
>>> > > > information because of the operator chain.
>>> > > > A operator chain has a vertexID and contains many different
>>> operators.
>>> > > The
>>> > > > operator information helps to distinguish them in the same operator
>>> > > chain.
>>> > > >
>>> > > > Best,
>>> > > > Hang
>>> > > >
>>> > > >
>>> > > > Xuyang <xyzhong...@163.com> 于2024年1月12日周五 10:21写道:
>>> > > >
>>> > > > > Hi, Mason.
>>> > > > > Thanks for driving this Flip. I think it's important for external
>>> > > system
>>> > > > > to be able to
>>> > > > > perceive the metric of the operator coordinator. +1 for it.
>>> > > > >
>>> > > > >
>>> > > > > I just have the following minor questions and am looking forward
>>> to
>>> > > your
>>> > > > > reply. Please forgive
>>> > > > > me if I have some misunderstandings.
>>> > > > >
>>> > > > >
>>> > > > > 1. IIRC, in a sense, operator ID and vertex ID are the same
>>> thing.
>>> > The
>>> > > > > operator ID can
>>> > > > > be converted from the vertex ID[1]. Therefore, it is somewhat
>>> strange
>>> > > to
>>> > > > > have both vertex
>>> > > > > ID and operator ID in a single URL.
>>> > > > >
>>> > > > >
>>> > > > > 2. If I misunderstood the semantics of operator IDs here, then
>>> what
>>> > is
>>> > > > the
>>> > > > > relationship
>>> > > > > between vertex ID and operator ID, and do we need a url like
>>> > > > > `/jobs/<jobid>/vertices/<vertexid>/operators/`
>>> > > > > to list all operator ids under this vertices?
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > [1]
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/7bebd2d9fac517c28afc24c0c034d77cfe2b43a6/flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/OperatorID.java#L40C27-L40C27
>>> > > > >
>>> > > > > --
>>> > > > >
>>> > > > >     Best!
>>> > > > >     Xuyang
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > At 2024-01-12 04:20:03, "Mason Chen" <mas.chen6...@gmail.com>
>>> wrote:
>>> > > > > >Hi Devs,
>>> > > > > >
>>> > > > > >I'm opening this thread to discuss a short FLIP for exposing
>>> > > > > >JobManagerOperatorMetrics via REST API [1].
>>> > > > > >
>>> > > > > >The current set of REST APIs make it impossible to query
>>> coordinator
>>> > > > > >metrics. This FLIP proposes a new REST API to query the
>>> > > > > >JobManagerOperatorMetrics.
>>> > > > > >
>>> > > > > >[1]
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API
>>> > > > > >
>>> > > > > >Best,
>>> > > > > >Mason
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>

Reply via email to