yunfengzhou-hub opened a new pull request, #26951:
URL: https://github.com/apache/flink/pull/26951
## What is the purpose of the change
This PR optimizes the latency of Flink REST handlers used to generate the
DAG in Flink UI.
In the current implementation, REST handlers like `JobDetailsHandler` iterate
through all vertices of a job and invoke
`MetricStore#getSubtaskAttemptMetricStore` during each iteration. Because this
is a synchronized method, each invocation may block until other threads have
finished invoking other synchronized methods. This blocking overhead
accumulates across the for loop, resulting in high latency when the Flink UI
tries to render the status of a Flink job through `JobDetailsHandler`.
To solve this problem, this PR reduces the number of synchronized invocations
in REST handlers. Each handler acquires a snapshot of the MetricStore jobs
once (so the synchronization overhead is paid only once), and then reuses the
snapshot inside its for loops. The snapshot is read-only, so it does not need
to be synchronized.
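The difference between the two access patterns can be illustrated with a minimal sketch. It is not the code in this PR: `SketchMetricStore`, `JobsSnapshot`, and their methods are hypothetical stand-ins for `MetricStore` and `MetricStore.MetricStoreJobs`, and the copy-on-snapshot strategy is an assumption made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a metric store; not Flink's actual MetricStore API. */
class SketchMetricStore {
    private final Map<String, Map<String, Long>> jobs = new HashMap<>();
    /** Counts how many synchronized read calls were made (for illustration only). */
    int lockAcquisitions = 0;

    synchronized void put(String jobId, String vertexId, long value) {
        jobs.computeIfAbsent(jobId, k -> new HashMap<>()).put(vertexId, value);
    }

    /** Old pattern: every lookup goes through a synchronized method. */
    synchronized Long getVertexMetric(String jobId, String vertexId) {
        lockAcquisitions++;
        Map<String, Long> job = jobs.get(jobId);
        return job == null ? null : job.get(vertexId);
    }

    /** New pattern: take the lock once and hand out an immutable copy. */
    synchronized JobsSnapshot snapshotJobs() {
        lockAcquisitions++;
        Map<String, Map<String, Long>> copy = new HashMap<>();
        jobs.forEach((id, vertices) -> copy.put(id, Map.copyOf(vertices)));
        return new JobsSnapshot(Map.copyOf(copy));
    }
}

/** Read-only view; its methods need no synchronization because nothing mutates it. */
final class JobsSnapshot {
    private final Map<String, Map<String, Long>> jobs;

    JobsSnapshot(Map<String, Map<String, Long>> jobs) {
        this.jobs = jobs;
    }

    Long getVertexMetric(String jobId, String vertexId) {
        Map<String, Long> job = jobs.get(jobId);
        return job == null ? null : job.get(vertexId);
    }
}

class MetricSnapshotSketch {
    public static void main(String[] args) {
        SketchMetricStore store = new SketchMetricStore();
        List<String> vertices = List.of("v1", "v2", "v3");
        for (String v : vertices) {
            store.put("job", v, 1L);
        }

        // Old pattern: one synchronized call per vertex.
        for (String v : vertices) {
            store.getVertexMetric("job", v);
        }
        int perVertexLocks = store.lockAcquisitions;

        // New pattern: one synchronized call, then lock-free reads.
        store.lockAcquisitions = 0;
        JobsSnapshot snapshot = store.snapshotJobs();
        for (String v : vertices) {
            snapshot.getVertexMetric("job", v);
        }
        int snapshotLocks = store.lockAcquisitions;

        System.out.println("per-vertex locks: " + perVertexLocks
                + ", snapshot locks: " + snapshotLocks);
    }
}
```

In this sketch the number of lock acquisitions in the handler loop drops from one per vertex to one per request. The trade-off is that the snapshot reflects the store at the moment it was taken, so updates arriving during the loop are not visible until the next request.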
As for benchmark results, we manually measured the time the Flink UI takes to
display the DAG of a complex Flink job in our company. Before the
optimization, the Flink UI needed more than 1 minute to finish rendering.
After the optimization, the latency decreased to less than 10 seconds.
## Brief change log
- Introduce `MetricStore.MetricStoreJobs` to manage a snapshot of all jobs in
the MetricStore. Unlike the original implementation, which operated on the
MetricStore jobs directly, the new class does not need the `synchronized`
keyword on its methods.
## Verifying this change
The correctness of this PR is covered by existing tests, such as
`JobDetailsHandlerTest` and `MetricStoreTest`.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable