Hi everyone, I'd like to start a discussion on a small Web UI FLIP aimed at shortening the "which subtask should I look at?" loop when triaging a misbehaving job.
*Problem.* When a Flink job lags, slows down, or becomes unstable, operators today open the Web UI and click vertex-by-vertex, subtask-by-subtask to find the offender. For jobs with many vertices and high parallelism this is slow and error-prone, even though the underlying metrics (backpressure, busy time, FLIP-33 source lag, JVM CPU, JVM GC) are already available in the runtime. The gap is not the data; it's the surfacing.

*Proposal.* Add a new "Top N" tab to the job detail page of the Web UI, backed by a new REST endpoint GET /jobs/:jobid/metrics/top-n?top=N. The tab shows five ranked lists, grouped into two tiers.

Primary triage (per-subtask / per-vertex, expanded by default):
1. Top Backpressured Subtasks: from backPressuredTimeMsPerSecond, exposed as a ratio in [0, 1].
2. Top Busy Subtasks: from busyTimeMsPerSecond, same scaling.
3. Top Lagging Sources: any vertex exposing at least one FLIP-33 source metric (pendingRecords, currentFetchEventTimeLag, currentEmitEventTimeLag). Backlog is summed across subtasks; the two lag metrics are taken as max.

JVM diagnostics (TaskManager-scoped, collapsed by default):
4. Top CPU Consumers: from Status.JVM.CPU.Load.
5. Top GC-intensive TaskManagers: from Status.JVM.GarbageCollector.*.Time, summed across collectors.

*What this FLIP deliberately does not do.*
- No new metric is emitted by the runtime. Everything is derived from metrics that already exist.
- No new ConfigOption.
- No per-job attribution of CPU / GC. Those are JVM-process metrics; per-job attribution requires a slot-to-TaskManager mapping the metric store does not expose today. The FLIP calls this out explicitly, demotes the JVM tier visually, and leaves per-job attribution as future work.
- No Top N for checkpoints in this FLIP. Its data source is CheckpointStatsSnapshot, not MetricStore; it fits better as a separate FLIP / JIRA once this one lands.
- No changes to any existing endpoint, response body, or metric. The new wire fields are purely additive.
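To make the primary-tier ranking concrete, here is an illustrative Python sketch (the actual handler is Java in flink-runtime; the function and sample data below are hypothetical, not from the PR). It shows the normalization the proposal describes: a subtask reports how many milliseconds per wall-clock second it spent backpressured, so dividing by 1000 yields a ratio in [0, 1], and the top N subtasks are taken by that ratio.

```python
# Illustrative sketch only; not the actual Flink handler (which is Java).
# Each sample: (vertex_name, subtask_index, backPressuredTimeMsPerSecond).
def top_backpressured(samples, n=5):
    ranked = [
        # ms-per-second divided by 1000 gives a ratio in [0, 1];
        # clamp defensively in case of clock jitter.
        (vertex, idx, min(ms_per_sec / 1000.0, 1.0))
        for vertex, idx, ms_per_sec in samples
    ]
    ranked.sort(key=lambda r: r[2], reverse=True)
    return ranked[:n]

samples = [
    ("Source: Kafka", 0, 120.0),
    ("Map", 3, 870.0),
    ("Sink: ES", 1, 430.0),
]
print(top_backpressured(samples, n=2))
# → [('Map', 3, 0.87), ('Sink: ES', 1, 0.43)]
```

The same shape applies to busyTimeMsPerSecond; only the input metric differs.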
*Open points I'd most appreciate feedback on.*

1. *Scope of the JVM tier.* I think CPU/GC are still worth surfacing as TaskManager-scoped rather than dropping them, because they are what operators check next after the per-subtask tier. But I'm open to being convinced that we should gate the whole JVM tier behind per-job attribution landing first.
2. *"Source vertex" detection.* The handler treats any vertex that exposes one of the three FLIP-33 metrics as a source, instead of walking the job graph. This is simpler and extends automatically to every FLIP-33-compliant connector. Is there a case where this heuristic would mislabel a non-source vertex as a source? I couldn't come up with one, but it's the one design choice I'd like a second pair of eyes on.
3. *Missing-value convention.* For source lag I use -1 as a numeric sentinel so the JSON stays flat. The frontend renders "n/a". The alternative would be to keep the field nullable. Any preference?
4. *Ranking of topLaggingSources.* The current order is pendingRecords → fetchLag → emitLag. Happy to change if there's a stronger convention in the community.

*Prototype / PR.* A working end-to-end prototype is already up as a draft PR against apache/flink:master, so reviewers can try it out while reading the FLIP. It's meant to show feasibility and make the wire format concrete; it is *not* asking for review-to-merge before the FLIP is accepted.
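Open points 2 and 3 can be sketched in a few lines. This Python snippet is illustrative only (the real handler is Java; the helper names and sample vertices are hypothetical), but the three metric names are the actual FLIP-33 ones:

```python
# Sketch of the FLIP-33 source heuristic and the -1 sentinel convention.
# Metric names are real FLIP-33 names; everything else is hypothetical.
FLIP33_SOURCE_METRICS = {
    "pendingRecords",
    "currentFetchEventTimeLag",
    "currentEmitEventTimeLag",
}

def is_source_vertex(vertex_metrics):
    """A vertex counts as a source iff it exposes >= 1 FLIP-33 metric."""
    return bool(FLIP33_SOURCE_METRICS & vertex_metrics.keys())

def lag_or_sentinel(vertex_metrics, name):
    """Missing lag values become -1 so the JSON stays flat;
    the frontend renders -1 as "n/a"."""
    return vertex_metrics.get(name, -1)

kafka = {"pendingRecords": 1200, "currentFetchEventTimeLag": 350}
mapper = {"busyTimeMsPerSecond": 870.0}

print(is_source_vertex(kafka), is_source_vertex(mapper))  # → True False
print(lag_or_sentinel(kafka, "currentEmitEventTimeLag"))  # → -1
```

The nullable alternative from open point 3 would replace the `.get(name, -1)` default with `None`, at the cost of a nullable field in the wire format.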
- JIRA: https://issues.apache.org/jira/browse/FLINK-39489
- PR: https://github.com/apache/flink/pull/27774 ([FLINK-39489][Web Dashboard] Add Top N Metrics Dashboard, +1900 / -2 across 23 files, branch feature/FLINK-top-n-metrics-dashboard)
- FLIP: FLIP-XXX: Top-N Job Health Dashboard in the Web UI <https://drive.google.com/open?id=1Cq2G1zVWCWmjJNdtO9Cm4vzuDMHIUgyBTldBZZAxNR0>

*What the prototype already does* (matches the FLIP 1:1):

Backend (flink-runtime):
- New handler TopNMetricsHandler on GET /jobs/:jobid/metrics/top-n?top=N (default N=5, bounded to a reasonable cap), reading from the existing MetricStore; no new metric is emitted and no ConfigOption is added.
- Response body TopNMetricsResponseBody with five additive lists: topBackpressuredSubtasks, topBusySubtasks, topLaggingSources, topCpuConsumers, topGcIntensiveTaskManagers. Missing source lag is encoded as -1.
- backPressuredTimeMsPerSecond / busyTimeMsPerSecond are normalized to [0, 1]; source lag aggregates pendingRecords (sum) and the two FLIP-33 lag metrics (max) across subtasks; GC time is summed across collectors per TaskManager.
- Unit tests for TopNMetricsHandler covering ranking, capping, missing-metric fallbacks, and the FLIP-33 source-vertex heuristic.
- REST API docs and the API snapshot are regenerated; the new top query parameter is reflected there.

Frontend (flink-runtime-web):
- New "Top N" tab on the job detail page, wired to the real endpoint (no mock data left behind).
- Two-tier layout: the per-subtask tier (backpressure / busy / source lag) is expanded by default; the JVM tier (CPU / GC) is collapsed by default to reflect that it is TaskManager-scoped, not per-job.
- Each row links back to the corresponding vertex / subtask / TaskManager page so triage stays one click away from the existing UI; "n/a" is rendered for missing source lag.
- The earlier empty "Top N" placeholder on the Overview page has been removed; only the job-scoped tab remains.
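For readers who want the wire format at a glance before opening the FLIP: the five list names below are the ones from the prototype, but the per-entry field names are my illustrative guesses, not the actual TopNMetricsResponseBody schema.

```python
# Hypothetical shape of the top-n response; list names match the prototype,
# per-entry fields are illustrative only.
import json

response = {
    "topBackpressuredSubtasks": [
        {"vertexName": "Map", "subtask": 3, "ratio": 0.87},
    ],
    "topBusySubtasks": [
        {"vertexName": "Sink: ES", "subtask": 1, "ratio": 0.43},
    ],
    "topLaggingSources": [
        # -1 is the sentinel for a FLIP-33 metric the connector omits.
        {"vertexName": "Source: Kafka", "pendingRecords": 1200,
         "currentFetchEventTimeLag": 350, "currentEmitEventTimeLag": -1},
    ],
    "topCpuConsumers": [
        {"taskManagerId": "tm-1", "cpuLoad": 0.92},
    ],
    "topGcIntensiveTaskManagers": [
        {"taskManagerId": "tm-2", "gcTimeMs": 5400},
    ],
}

print(json.dumps(response, indent=2))
```

All five keys are additive; nothing in an existing response body changes.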
*How to try it locally.*

    git fetch https://github.com/apache/flink refs/pull/27774/head:flink-topn
    git checkout flink-topn
    ./mvnw -pl flink-runtime-web -Pskip-webui-build=false -DskipTests \
        clean install
    # then start any job and open:
    #   http://<jm-host>:8081/#/job/<jobid>/top-n
    # or hit the REST API directly:
    #   curl http://<jm-host>:8081/jobs/<jobid>/metrics/top-n?top=10

FLIP document (with sample JSON, test plan, and rejected alternatives):

Thanks in advance for the feedback. I'll give the thread a full week before calling a vote, and longer if discussion is active.

Best regards,
Feat Zhang
