Purushottam Sinha created FLINK-39566:
-----------------------------------------
Summary: [runtime-web] Add checkpoint duration Gantt view
Key: FLINK-39566
URL: https://issues.apache.org/jira/browse/FLINK-39566
Project: Flink
Issue Type: New Feature
Components: Runtime / Web Frontend
Reporter: Purushottam Sinha
Description
The current Checkpoints tab renders checkpoint statistics as nz-table rows:
four numeric columns (start_delay, alignment.duration, checkpoint.sync,
checkpoint.async) per subtask. Diagnosing a slow checkpoint requires manually
scanning these columns across N subtasks and mentally reconstructing which
phase dominated and which subtasks straggled.
This ticket adds a new Gantt tab next to the existing Overview | History |
Summary | Configuration tabs, with a two-tier visualization:
1. Recent-checkpoint strip — horizontal strip, one slim bar per checkpoint over
the last 60 checkpoints (capped by the JobManager's web.checkpoints.history
setting). Width ∝ end_to_end_duration; color by status (green completed,
purple savepoint, amber in-progress, red failed). The strip auto-refreshes at
the checkpoint interval (clamped to 1.5–15 s) so newly triggered checkpoints
appear without a page refresh.
2. Per-checkpoint Gantt — one row per subtask, grouped by operator vertex. Each
row is a stacked bar of the four checkpoint phases. Rows are sorted by total
duration descending so stragglers float to the top. Subtasks whose total
duration exceeds the outlier threshold (max(p95, 1.5 × median)), or that are
marked aborted, get a dashed red outline. Unaligned-checkpoint segments are
visually faded with an (unaligned)
tooltip label. Hover reveals exact phase ms, total duration, state_size,
checkpointed_size, full operator name, and subtask index.
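The sorting, outlier rule, and refresh clamp described above can be sketched as follows. This is only an illustration under stated assumptions: the SubtaskRow shape and all function names are hypothetical, and the ticket does not say which percentile definition is used, so linear interpolation is assumed here.

```typescript
// Hypothetical row shape; field names are illustrative, not the REST schema.
interface SubtaskRow {
  subtask: number;
  startDelay: number; // ms
  alignment: number;  // ms
  sync: number;       // ms
  async: number;      // ms
  aborted?: boolean;
}

const totalMs = (r: SubtaskRow): number =>
  r.startDelay + r.alignment + r.sync + r.async;

// Percentile with linear interpolation over an ascending-sorted array.
// Assumption: the ticket does not specify the percentile definition.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const pos = (p / 100) * (sortedAsc.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sortedAsc[lo] + (sortedAsc[hi] - sortedAsc[lo]) * (pos - lo);
}

// Outlier threshold from the ticket: max(p95, 1.5 × median).
function outlierThreshold(totals: number[]): number {
  const s = [...totals].sort((a, b) => a - b);
  return Math.max(percentile(s, 95), 1.5 * percentile(s, 50));
}

// Annotate rows with total and outlier flag, then sort stragglers first.
function rankRows(
  rows: SubtaskRow[]
): Array<SubtaskRow & { total: number; outlier: boolean }> {
  const threshold = outlierThreshold(rows.map(totalMs));
  return rows
    .map(r => {
      const total = totalMs(r);
      return { ...r, total, outlier: total > threshold || r.aborted === true };
    })
    .sort((a, b) => b.total - a.total);
}

// Clamp the strip's refresh period to the 1.5–15 s window.
function clampRefreshMs(checkpointIntervalMs: number): number {
  return Math.min(15000, Math.max(1500, checkpointIntervalMs));
}
```

The choice of percentile definition matters for small parallelism: under a nearest-rank definition, p95 equals the maximum for fewer than 20 subtasks, so no subtask could ever exceed the threshold.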
Interaction model:
- Click a strip bar (or any Gantt bar) → pins the Gantt to that checkpoint.
The Gantt then stays frozen even if the JobManager rotates that checkpoint
out of its history.
- Follow newest button → resumes auto-follow of the most recent checkpoint.
- Export PNG → composites strip + Gantt + title block into a single PNG for
incident write-ups and post-mortems.
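The pin and follow behavior above amounts to a small piece of selection state. A minimal sketch, assuming checkpoint IDs increase over time so that the newest checkpoint has the largest ID; the FollowState type and function names are hypothetical:

```typescript
// null means "follow the newest checkpoint"; a number means "pinned".
interface FollowState {
  pinnedId: number | null;
}

// Clicking a strip bar or Gantt bar pins the view to that checkpoint.
function pin(checkpointId: number): FollowState {
  return { pinnedId: checkpointId };
}

// The "Follow newest" button clears the pin.
function followNewest(): FollowState {
  return { pinnedId: null };
}

// Decide which checkpoint the Gantt should show after each poll.
// A pinned ID is returned even when it is no longer in the history,
// so the caller can keep rendering its cached detail data.
function selectCheckpointId(
  state: FollowState,
  historyIds: number[]
): number | null {
  if (state.pinnedId !== null) return state.pinnedId;
  return historyIds.length > 0 ? Math.max(...historyIds) : null;
}
```

Keeping the decision in a pure function like this makes the frozen-while-pinned rule easy to unit test independently of the polling code.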
Data: No backend changes. The view uses the existing endpoints
/jobs/:id/checkpoints, /jobs/:id/checkpoints/details/:n, and
/jobs/:id/checkpoints/details/:n/subtasks/:vertex. A new client-side helper,
loadCheckpointAllSubtaskDetails, issues parallel per-vertex fetches with
per-vertex error tolerance (a vertex with no detail entry, e.g. still pending
or rotated out, does not poison the batch).
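The per-vertex error tolerance can be illustrated with plain Promises. This is a sketch, not the proposed implementation: the actual JobService helper would presumably use Angular's HttpClient and RxJS, and the injected fetchVertex callback and the function name below are hypothetical.

```typescript
// Hypothetical fetcher: resolves with one vertex's subtask details,
// or rejects (e.g. on a 404 for a pending or rotated-out checkpoint).
type FetchVertex<T> = (vertexId: string) => Promise<T>;

// Fetch all vertices in parallel. A failed vertex is dropped from the
// result instead of rejecting the whole batch.
async function loadAllSubtaskDetails<T>(
  vertexIds: string[],
  fetchVertex: FetchVertex<T>
): Promise<Record<string, T>> {
  const settled = await Promise.all(
    vertexIds.map(id =>
      Promise.resolve()
        .then(() => fetchVertex(id))
        .then(
          (detail): { id: string; detail: T | null } => ({ id, detail }),
          (): { id: string; detail: T | null } => ({ id, detail: null })
        )
    )
  );
  const out: Record<string, T> = {};
  for (const entry of settled) {
    if (entry.detail !== null) out[entry.id] = entry.detail;
  }
  return out;
}
```

Each fetch converts its own rejection into a null entry before Promise.all sees it, which is why one missing vertex cannot fail the batch.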
Implementation: New standalone OnPush component
pages/job/checkpoints/gantt/job-checkpoints-gantt.component.{ts,html,less},
rendered via the existing @antv/g2 dependency (SVG renderer, so axis labels can
carry <title> tooltips for full operator names). One client-side service method
on JobService. A small adjustment to app.interceptor.ts to suppress error
toasts for 404s on /checkpoints/details/.
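The interceptor adjustment can be reduced to a single predicate. A minimal sketch, assuming the interceptor can see the response status and request URL before deciding to show a toast; the function name is hypothetical and the real structure of app.interceptor.ts is not shown in this ticket:

```typescript
// Returns true when an HTTP error should NOT raise an error toast:
// a 404 on a checkpoint-details URL is expected for checkpoints that
// are still pending or have been rotated out of the history.
function shouldSuppressErrorToast(status: number, url: string): boolean {
  return status === 404 && url.indexOf("/checkpoints/details/") !== -1;
}
```

All other errors, including non-404 failures on the same URLs, would still surface as before.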
Benefits
- Answers "why was checkpoint #147 slow?" at a glance instead of scanning
four numeric columns per subtask. The dominant phase and the straggler subtasks
are preattentive — no arithmetic required.
- Live-polling strip + auto-follow newest means the Gantt stays current
during incidents without any user action; pin freezes a snapshot for focused
investigation.
- PNG export captures incident state directly for tickets and write-ups, a
friction point that today drives screenshot-stitching workflows.
- Targets one of the most common Flink operational questions; slow checkpoints
drive job restarts, reprocessing, and missed SLAs.
- Zero backend cost, no new REST endpoints, no migration risk — existing
checkpoint tabs are untouched.
Prior art / competitor references
- Apache Spark UI — Stages tab "Event Timeline": per-task stacked timeline
decomposing each task into scheduler delay, deserialization, executor compute,
GC, shuffle read/write. The closest direct analog for
"stacked phases per parallel task as a Gantt." →
https://spark.apache.org/docs/latest/web-ui.html#stage-detail
- Google Cloud Dataflow monitoring interface: per-stage commit/snapshot
timing surfaced as phase-decomposed bars rather than tables; widely considered
the streaming UX benchmark. →
https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
- Apache Airflow — Gantt view: per-task Gantt within a DAG run; the canonical
"task durations on a shared time axis" idiom. →
https://airflow.apache.org/docs/apache-airflow/stable/ui.html#gantt
- Trino query timeline: per-stage timeline view for query-execution
diagnostics. → https://trino.io/docs/current/admin/web-interface.html
--
This message was sent by Atlassian Jira
(v8.20.10#820010)