wangxiaojing opened a new pull request, #27876:
URL: https://github.com/apache/flink/pull/27876
What is the purpose of the change
This pull request adds the TaskManager IP address to the per-subtask
checkpoint statistics displayed in the Flink Web UI and exposed via the REST
API.
When diagnosing slow or failing checkpoints, operators need to identify
which TaskManager host a particular subtask was running on. Without this
information, correlating checkpoint latency with specific nodes (e.g., machines
with disk I/O bottlenecks, network issues, or GC
pressure) requires cross-referencing subtask indices with the TaskManager
assignment through a separate UI path. This change surfaces the IP address
directly in the subtask checkpoint statistics table, reducing time-to-diagnose
for checkpoint-related incidents.
Brief change log
- SubtaskStateStats: Added ip field (nullable String) with getter to carry
the TaskManager IP address per subtask. Updated serialVersionUID to reflect the
serialization format change.
- PendingCheckpoint: When a subtask acknowledges a checkpoint, extract the
TaskManager's IP via TaskManagerLocation.getAddress().getHostAddress() and
store it in the resulting SubtaskStateStats.
- DefaultCheckpointStatsTracker: Pass null for ip on the internal
stats-only code path where no execution context is available.
- SubtaskCheckpointStatistics.CompletedSubtaskCheckpointStatistics: Added
optional ip JSON field to the REST API response. The field is backwards
compatible — clients that do not recognise it will ignore it.
- TaskCheckpointStatisticDetailsHandler: Propagate
SubtaskStateStats.getIp() into the REST response object.
- Web UI (job-checkpoints-subtask component): Added a sortable "IP
Address" column to the subtask checkpoint statistics table. The sort function
correctly normalises IPv4 segments for lexicographic ordering. Null/missing
values are displayed as -.
Verifying this change
This change added tests and can be verified as follows:
- SubtaskStateStatsTest: Extended existing getter tests to assert that
getIp() returns the value passed at construction time, and that the value
survives Java serialization round-trips.
- PendingCheckpointTest#testAcknowledgeTaskCapturesTaskManagerIp: New
test. Creates a mock ExecutionVertex backed by a real TaskManagerLocation (),
calls PendingCheckpoint.acknowledgeTask(), and asserts that
PendingCheckpointStats.getLatestAcknowledgedSubtaskStats().getIp() equals
"". This directly validates the extraction path through
location.getAddress().getHostAddress().
- TaskCheckpointStatisticsWithSubtaskDetailsTest: Updated JSON
marshalling/unmarshalling round-trip test to include a realistic IP value ,
validating that the ip field is correctly serialized and deserialized via
Jackson.
- DefaultCheckpointStatsTrackerTest, PendingCheckpointStatsTest,
TaskStateStatsTest, CompletedCheckpointTest: Updated all SubtaskStateStats
construction call sites to pass the new ip parameter with a consistent null
sentinel.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving): no
- The serializers: yes — SubtaskStateStats is Serializable; a new ip field
was added . Old serialized forms of SubtaskStateStats (checkpoint statistics
history) are not compatible with the new class.
This affects checkpoint stats display after a rolling upgrade but does not
affect checkpoint data correctness.
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes — the change
touches PendingCheckpoint.acknowledgeTask(), which is on the checkpoint
acknowledgement path in the JobManager. The change is read-only
with respect to checkpoint correctness (it only reads TaskManagerLocation
metadata already available on the ExecutionVertex) and adds no synchronisation
overhead.
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? not documented (the IP address
column is self-explanatory in the Web UI; no changes to the public REST API
contract beyond adding an optional backwards-compatible field)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]