[ 
https://issues.apache.org/jira/browse/FLINK-39566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39566:
-----------------------------------
    Labels: pull-request-available  (was: )

> [runtime-web] Add checkpoint duration Gantt view
> ------------------------------------------------
>
>                 Key: FLINK-39566
>                 URL: https://issues.apache.org/jira/browse/FLINK-39566
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Web Frontend
>            Reporter: Purushottam Sinha
>            Priority: Minor
>              Labels: pull-request-available
>
> Description
> The current Checkpoints tab renders checkpoint statistics as nz-table rows: 
> four numeric columns (start_delay, alignment.duration, checkpoint.sync, 
> checkpoint.async) per subtask. Diagnosing a slow checkpoint requires manually 
> scanning these columns across N subtasks and mentally reconstructing which 
> phase dominated and which subtasks straggled.
> This ticket adds a new Gantt tab next to the existing Overview | History | 
> Summary | Configuration tabs, with a two-tier visualization:
> 1. Recent-checkpoint strip: a horizontal strip with one slim bar per 
> checkpoint, covering the last 60 checkpoints (capped at the JobManager's 
> web.checkpoints.history setting). Bar width ∝ end_to_end_duration; color 
> encodes status (green completed, purple savepoint, amber in-progress, red 
> failed). The strip auto-refreshes on the checkpoint cadence (clamped to 
> 1.5–15 s) so newly triggered checkpoints appear without a page refresh.
> 2. Per-checkpoint Gantt: one row per subtask, grouped by operator vertex. 
> Each row is a stacked bar of the four checkpoint phases. Rows are sorted by 
> total duration, descending, so stragglers float to the top. Subtasks above 
> the outlier threshold (max(p95, 1.5 × median)) or marked aborted get a red 
> dashed outline. Unaligned-checkpoint segments are visually faded and carry an 
> (unaligned) tooltip label. Hovering reveals the exact per-phase durations in 
> ms, total duration, state_size, checkpointed_size, full operator name, and 
> subtask index.
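
The numeric rules in the two tiers above (refresh clamp, descending sort, outlier threshold) can be sketched as follows. This is a minimal illustration, not the code from the pull request: the `SubtaskPhases` shape, all function names, and the nearest-rank percentile definition are assumptions, since the ticket does not say how p95 is computed.

```typescript
// Hypothetical row shape: the four phase durations named in the ticket, in ms.
interface SubtaskPhases {
  subtask: number;
  startDelay: number;
  alignment: number;
  sync: number;
  async: number;
  aborted?: boolean;
}

function totalDuration(s: SubtaskPhases): number {
  return s.startDelay + s.alignment + s.sync + s.async;
}

// Nearest-rank percentile over an ascending-sorted array (an assumed definition).
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.min(sortedAsc.length - 1, Math.max(0, rank - 1))];
}

function median(sortedAsc: number[]): number {
  const n = sortedAsc.length;
  if (n === 0) return 0;
  const mid = Math.floor(n / 2);
  return n % 2 === 1 ? sortedAsc[mid] : (sortedAsc[mid - 1] + sortedAsc[mid]) / 2;
}

// Threshold from the ticket: max(p95, 1.5 × median) of the total durations.
function outlierThreshold(totals: number[]): number {
  const sorted = totals.slice().sort((a, b) => a - b);
  return Math.max(percentile(sorted, 95), 1.5 * median(sorted));
}

// Rows sorted by total duration descending; `flagged` marks rows that would
// get the red dashed outline (above threshold, or aborted).
function toGanttRows(subtasks: SubtaskPhases[]) {
  const threshold = outlierThreshold(subtasks.map(totalDuration));
  return subtasks
    .map(s => {
      const total = totalDuration(s);
      return { ...s, total, flagged: total > threshold || s.aborted === true };
    })
    .sort((a, b) => b.total - a.total);
}

// Strip refresh interval: follow the checkpoint cadence, clamped to 1.5–15 s.
function refreshIntervalMs(checkpointIntervalMs: number): number {
  return Math.min(15000, Math.max(1500, checkpointIntervalMs));
}
```

One consequence of the nearest-rank assumption: with fewer than 20 subtasks, p95 equals the maximum, so no subtask is strictly above the threshold and only aborted rows are flagged. An interpolating percentile would behave differently for small groups.
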
> Interaction model:
>   - Click a strip bar (or any Gantt bar) → pins the Gantt to that checkpoint. 
> The Gantt then stays frozen even if the JobManager rotates that checkpoint 
> out of its in-memory history.
>   - Follow newest button → resumes auto-follow of the most recent checkpoint.
>   - Export PNG → composites strip + Gantt + title block into a single PNG for 
> incident write-ups and post-mortems.
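
The pin / follow-newest behaviour above amounts to a two-state selection. A minimal sketch, assuming checkpoint IDs increase over time; the `GanttSelection` type and the function names are hypothetical and not taken from the pull request:

```typescript
// Hypothetical selection state for the Gantt tier.
type GanttSelection =
  | { mode: "follow" }
  | { mode: "pinned"; checkpointId: number };

// Clicking a strip bar or a Gantt bar pins that checkpoint.
function pinTo(checkpointId: number): GanttSelection {
  return { mode: "pinned", checkpointId };
}

// The "Follow newest" button returns to auto-follow.
function followNewest(): GanttSelection {
  return { mode: "follow" };
}

// Which checkpoint the Gantt should show after a strip refresh.
// A pinned selection ignores the refreshed history, so it survives rotation.
function displayedCheckpointId(
  selection: GanttSelection,
  historyIds: number[]
): number | undefined {
  if (selection.mode === "pinned") return selection.checkpointId;
  return historyIds.length > 0 ? Math.max(...historyIds) : undefined;
}
```

Keeping the pinned ID independent of the polled history is what lets the Gantt stay frozen after the checkpoint has been rotated out, provided its subtask data was already loaded.
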
> Data: No backend changes. The view reads from the existing endpoints 
> /jobs/:id/checkpoints, /jobs/:id/checkpoints/details/:n, and 
> /jobs/:id/checkpoints/details/:n/subtasks/:vertex. A new client-side helper, 
> loadCheckpointAllSubtaskDetails, fetches all vertices in parallel and 
> tolerates per-vertex errors: a vertex with no detail entry (e.g., still 
> pending or rotated out) does not poison the batch.
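
The per-vertex error tolerance described above can be sketched as below. This is an illustration under stated assumptions, not the actual helper: an Angular service would more likely return an RxJS Observable, whereas plain promises are used here to keep the example self-contained. The `VertexSubtaskDetails` type, the injected `fetchOne` callback, and the function name are all hypothetical.

```typescript
// Hypothetical payload shape; the real REST response carries more fields.
interface VertexSubtaskDetails {
  vertexId: string;
  subtasks: unknown[];
}

// Hypothetical fetcher for /jobs/:id/checkpoints/details/:n/subtasks/:vertex.
type FetchVertexDetails = (
  jobId: string,
  checkpointId: number,
  vertexId: string
) => Promise<VertexSubtaskDetails>;

// Fetch every vertex in parallel. A failed vertex resolves to null instead of
// rejecting, so one missing detail entry cannot fail the whole batch.
async function loadAllSubtaskDetailsSketch(
  jobId: string,
  checkpointId: number,
  vertexIds: string[],
  fetchOne: FetchVertexDetails
): Promise<VertexSubtaskDetails[]> {
  const results = await Promise.all(
    vertexIds.map(v =>
      fetchOne(jobId, checkpointId, v).catch((): null => null)
    )
  );
  return results.filter((d): d is VertexSubtaskDetails => d !== null);
}
```

Swallowing the error per vertex, rather than around the whole `Promise.all`, is the design choice that matters: the Gantt can still render the vertices that did respond.
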
> Implementation: New standalone OnPush component 
> pages/job/checkpoints/gantt/job-checkpoints-gantt.component.{ts,html,less}, 
> rendered via the existing @antv/g2 dependency (SVG renderer, so axis labels 
> can carry <title> tooltips for full operator names). One client-side service 
> method on JobService. A small adjustment to app.interceptor.ts to suppress 
> error toasts for 404s on /checkpoints/details/.
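
The interceptor adjustment boils down to a predicate on status and URL. A minimal sketch of that check in isolation; the function name is hypothetical, and how app.interceptor.ts actually wires it in is not shown in the ticket:

```typescript
// Returns true when an HTTP error should NOT raise an error toast:
// only 404s on checkpoint detail endpoints are suppressed.
function shouldSuppressErrorToast(status: number, url: string): boolean {
  return status === 404 && /\/checkpoints\/details\//.test(url);
}
```

Other statuses on the same path, and 404s elsewhere, would still surface to the user under this rule.
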
>
> Benefits
>   - Answers "why was checkpoint #147 slow?" at a glance instead of scanning 
> four numeric columns per subtask. The dominant phase and the straggler 
> subtasks are preattentive — no arithmetic required.
>   - Live-polling strip + auto-follow newest means the Gantt stays current 
> during incidents without any user action; pin freezes a snapshot for focused 
> investigation.
>   - PNG export captures incident state directly for tickets and write-ups, a 
> friction point that today drives screenshot-stitching workflows.
>   - Targets the single most common Flink operational question; slow 
> checkpoints drive job restarts, reprocessing, and missed SLAs.
>   - Zero backend cost, no new REST endpoints, no migration risk — existing 
> checkpoint tabs are untouched.
> Prior art / competitor references
>   - Apache Spark UI — Stages tab "Event Timeline": per-task stacked timeline 
> decomposing each task into scheduler delay, deserialization, executor 
> compute, GC, shuffle read/write. The closest direct analog for "stacked 
> phases per parallel task as a Gantt." → 
> https://spark.apache.org/docs/latest/web-ui.html#stage-detail
>   - Google Cloud Dataflow monitoring interface: per-stage commit/snapshot 
> timing surfaced as phase-decomposed bars rather than tables; widely 
> considered the streaming UX benchmark. → 
> https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
>   - Apache Airflow — Gantt view: per-task Gantt within a DAG run; the 
> canonical "task durations on a shared time axis" idiom. → 
> https://airflow.apache.org/docs/apache-airflow/stable/ui.html#gantt
>   - Trino query timeline: per-stage timeline view for query-execution 
> diagnostics. → https://trino.io/docs/current/admin/web-interface.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
