[ https://issues.apache.org/jira/browse/FLINK-39566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Purushottam Sinha updated FLINK-39566:
--------------------------------------
    Description: 
Problem
Today the Checkpoints tab is a wall of numbers. To find why a checkpoint was 
slow, you scroll through subtask rows and eyeball columns. It's hard to spot 
stragglers or compare checkpoints over time.

Proposal
Add a Gantt-style view to the Checkpoints tab:
- Recent strip: last 60 checkpoints as colored bars (width = duration, color = 
status: completed / savepoint / in-progress / failed). Auto-refreshes.
- Per-checkpoint Gantt: one row per subtask, grouped by operator, with stacked 
segments for the four checkpoint phases (start delay, alignment, sync, async). 
Rows are sorted by total duration so stragglers float to the top. Outliers and 
aborted subtasks are highlighted.

Interactions
- Click a bar in the strip → pins the Gantt to that checkpoint
- "Follow newest" → resume live tracking
- "Export PNG" → snapshot for incident reports

Scope
Frontend-only. Uses existing REST endpoints, renders via @antv/g2. No backend 
changes.

  was:
Description

The current Checkpoints tab renders checkpoint statistics as nz-table rows: 
four numeric columns (start_delay, alignment.duration, checkpoint.sync, 
checkpoint.async) per subtask. Diagnosing a slow checkpoint requires manually 
scanning these columns across N subtasks and mentally reconstructing which 
phase dominated and which subtasks straggled.

This ticket adds a new Gantt tab next to the existing Overview | History | 
Summary | Configuration tabs, with a two-tier visualization:
1. Recent-checkpoint strip — horizontal strip, one slim bar per checkpoint over 
the last 60 (capped against the JobManager's web.checkpoints.history). Width ∝ 
end_to_end_duration; color by status (green completed, purple savepoint, amber 
in-progress, red failed). The strip auto-refreshes on the checkpoint cadence 
(clamped to 1.5–15s) so newly-triggered checkpoints appear without a page 
refresh.
2. Per-checkpoint Gantt — one row per subtask, grouped by operator vertex. Each 
row is a stacked bar of the four checkpoint phases. Rows are sorted by total 
duration descending so stragglers float to the top. Subtasks above the outlier 
threshold (max(p95, 1.5 × median)) or marked aborted are outlined in red 
dashed. Unaligned-checkpoint segments are visually faded with an (unaligned) 
tooltip label. Hover reveals exact phase ms, total duration, state_size, 
checkpointed_size, full operator name, and subtask index.
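
The two numeric rules in the list above (the 1.5–15s refresh clamp and the max(p95, 1.5 × median) outlier threshold) can be sketched as follows. This is a minimal illustration, not the code from the patch: all function names are hypothetical, and the ticket does not say how p95 is computed, so linear interpolation between ranks is assumed here.

```typescript
// Hypothetical helpers for illustration; names are not taken from the patch.

/** Clamp a refresh interval (ms) to the 1.5-15 s window described above. */
function clampRefreshIntervalMs(checkpointIntervalMs: number): number {
  const MIN_MS = 1500;
  const MAX_MS = 15000;
  return Math.min(MAX_MS, Math.max(MIN_MS, checkpointIntervalMs));
}

/**
 * Percentile of an ascending-sorted array.
 * ASSUMPTION: linear interpolation between neighbouring ranks; the ticket
 * does not specify the percentile method.
 */
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) {
    return NaN;
  }
  const pos = (p / 100) * (sortedAsc.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sortedAsc[lo] + (pos - lo) * (sortedAsc[hi] - sortedAsc[lo]);
}

/** Outlier threshold: max(p95, 1.5 x median) of per-subtask total durations. */
function outlierThreshold(durationsMs: number[]): number {
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  return Math.max(percentile(sorted, 95), 1.5 * percentile(sorted, 50));
}

/** Indices of subtasks whose total duration is strictly above the threshold. */
function findOutlierIndices(durationsMs: number[]): number[] {
  const threshold = outlierThreshold(durationsMs);
  const indices: number[] = [];
  durationsMs.forEach((d, i) => {
    if (d > threshold) {
      indices.push(i);
    }
  });
  return indices;
}
```

The 1.5 × median term keeps a tightly clustered checkpoint from flagging anything: when all subtasks take about the same time, p95 is close to the median, so the threshold sits well above every row.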

Interaction model:
  - Click a strip bar (or any Gantt bar) → pins the Gantt to that checkpoint. 
The Gantt then stays frozen even if the JobManager rotates that checkpoint out 
of its history.
  - Follow newest button → resumes auto-follow of the most recent checkpoint.
  - Export PNG → composites strip + Gantt + title block into a single PNG for 
incident write-ups and post-mortems.

Data: No backend changes. The view uses the existing endpoints 
/jobs/:id/checkpoints, /jobs/:id/checkpoints/details/:n, and 
/jobs/:id/checkpoints/details/:n/subtasks/:vertex. A new client-side helper, 
loadCheckpointAllSubtaskDetails, fetches per-vertex details in parallel and 
tolerates per-vertex errors (a vertex with no detail entry, e.g., still pending 
or rotated out, does not fail the whole batch).
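
The per-vertex error tolerance described above could look roughly like the sketch below. It uses plain Promises and an injected fetch function to stay self-contained; the actual helper is a JobService method whose signature and response type are not shown in this ticket, so every name and shape here is an assumption.

```typescript
// Hypothetical shapes: the ticket does not show the real response type or
// the signature of loadCheckpointAllSubtaskDetails.
interface VertexSubtaskDetails {
  vertexId: string;
  subtasks: unknown[];
}

type VertexFetcher = (vertexId: string) => Promise<VertexSubtaskDetails>;

/**
 * Fetch details for all vertices in parallel. A failed fetch for one vertex
 * is mapped to null and skipped, so it cannot reject the whole batch.
 */
async function loadAllSubtaskDetails(
  vertexIds: string[],
  fetchVertex: VertexFetcher
): Promise<Map<string, VertexSubtaskDetails>> {
  const results = await Promise.all(
    vertexIds.map(id =>
      // Swallow the per-vertex error here instead of letting Promise.all reject.
      fetchVertex(id).catch((): VertexSubtaskDetails | null => null)
    )
  );

  const byVertex = new Map<string, VertexSubtaskDetails>();
  results.forEach((details, i) => {
    if (details !== null) {
      byVertex.set(vertexIds[i], details);
    }
  });
  return byVertex;
}
```

Note that this sketch swallows every error, not only 404s; a real implementation would probably want to distinguish "no detail entry" from genuine failures.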

Implementation: New standalone OnPush component 
pages/job/checkpoints/gantt/job-checkpoints-gantt.component.{ts,html,less}, 
rendered via the existing @antv/g2 dependency (SVG renderer, so axis labels can 
carry <title> tooltips for full operator names). One client-side service method 
on JobService. A small adjustment to app.interceptor.ts to suppress error 
toasts for 404s on /checkpoints/details/.
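
The interceptor adjustment amounts to a small predicate on the failed request. The sketch below is an illustration only: the function name is hypothetical, and the actual change to app.interceptor.ts is not shown in this ticket.

```typescript
// Hypothetical predicate; not the actual code in app.interceptor.ts.
// Returns true when an HTTP error should NOT raise an error toast:
// a 404 on a checkpoint-details URL is expected once the JobManager has
// rotated that checkpoint out of its history.
function shouldSuppressErrorToast(status: number, url: string): boolean {
  return status === 404 && /\/checkpoints\/details\//.test(url);
}
```

An interceptor would call this from its error handler and skip the notification when it returns true, while still propagating the error to the caller.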
Benefits
  - Answers "why was checkpoint #147 slow?" at a glance instead of scanning 
four numeric columns per subtask. The dominant phase and the straggler subtasks 
are preattentive — no arithmetic required.
  - Live-polling strip + auto-follow newest means the Gantt stays current 
during incidents without any user action; pin freezes a snapshot for focused 
investigation.
  - PNG export captures incident state directly for tickets and write-ups, a 
friction point that today drives screenshot-stitching workflows.
  - Targets a common Flink operational question: slow checkpoints drive job 
restarts, reprocessing, and missed SLAs.
  - Zero backend cost, no new REST endpoints, no migration risk — existing 
checkpoint tabs are untouched.

Prior art / competitor references
  - Apache Spark UI, Stages tab "Event Timeline": per-task stacked timeline 
decomposing each task into scheduler delay, deserialization, executor compute, 
GC, shuffle read/write. The closest direct analog for "stacked phases per 
parallel task as a Gantt." → 
https://spark.apache.org/docs/latest/web-ui.html#stage-detail
  - Google Cloud Dataflow monitoring interface: per-stage commit/snapshot 
timing surfaced as phase-decomposed bars rather than tables; widely considered 
the streaming UX benchmark. → 
https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
  - Apache Airflow — Gantt view: per-task Gantt within a DAG run; the canonical 
"task durations on a shared time axis" idiom. → 
https://airflow.apache.org/docs/apache-airflow/stable/ui.html#gantt
  - Trino query timeline: per-stage timeline view for query-execution 
diagnostics. → https://trino.io/docs/current/admin/web-interface.html


> [runtime-web] Add checkpoint duration Gantt view
> ------------------------------------------------
>
>                 Key: FLINK-39566
>                 URL: https://issues.apache.org/jira/browse/FLINK-39566
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Web Frontend
>            Reporter: Purushottam Sinha
>            Priority: Minor
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
