Hi everyone,

I'd like to start a discussion on a small Web UI FLIP aimed at shortening
the "which subtask should I look at?" loop when triaging a misbehaving job.

*Problem.* When a Flink job lags, slows down, or becomes unstable, operators
today open the Web UI and click vertex-by-vertex, subtask-by-subtask to find
the offender. For jobs with many vertices and high parallelism this is slow
and error-prone, even though the underlying metrics (backpressure, busy
time, FLIP-33 source lag, JVM CPU, JVM GC) are already available in the
runtime. The gap is not data: it's surfacing.

*Proposal.* Add a new "Top N" tab to the job detail page of the Web UI,
backed by a new REST endpoint GET /jobs/:jobid/metrics/top-n?top=N. The tab
shows five ranked lists, grouped into two tiers:

Primary triage (per-subtask / per-vertex, expanded by default):

   1. Top Backpressured Subtasks — from backPressuredTimeMsPerSecond,
   exposed as a ratio in [0, 1].
   2. Top Busy Subtasks — from busyTimeMsPerSecond, same scaling.
   3. Top Lagging Sources — any vertex exposing at least one FLIP-33 source
   metric (pendingRecords, currentFetchEventTimeLag,
   currentEmitEventTimeLag). Backlog is summed across subtasks; the two
   lag metrics are taken as max.

JVM diagnostics (TaskManager-scoped, collapsed by default):

   4. Top CPU Consumers — from Status.JVM.CPU.Load.
   5. Top GC-intensive TaskManagers — from
   Status.JVM.GarbageCollector.*.Time summed across collectors.
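To make the source-lag aggregation in list 3 concrete, here is a minimal
Python sketch. This is illustrative only, not the actual handler (which is
Java in flink-runtime reading the MetricStore); the metric names are the
FLIP-33 ones, everything else is made up for the example. Backlog is
additive, so it is summed; the lag metrics describe the worst subtask, so
the max is taken; a metric no subtask reports becomes the -1 sentinel:

```python
# Illustrative sketch of the per-vertex source-lag aggregation.
# The real implementation lives in the Java REST handler.

def aggregate_source_lag(subtasks):
    """subtasks: list of dicts mapping FLIP-33 metric name -> value."""
    def collect(name):
        return [st[name] for st in subtasks if name in st]

    pending = collect("pendingRecords")
    fetch = collect("currentFetchEventTimeLag")
    emit = collect("currentEmitEventTimeLag")
    return {
        # Backlog is additive across subtasks -> sum.
        "pendingRecords": sum(pending) if pending else -1,
        # Lags describe the worst subtask -> max; -1 means "n/a".
        "currentFetchEventTimeLag": max(fetch) if fetch else -1,
        "currentEmitEventTimeLag": max(emit) if emit else -1,
    }
```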

*What this FLIP deliberately does not do.*

   - No new metric is emitted by the runtime. Everything is derived from
   metrics that already exist.
   - No new ConfigOption.
   - No per-job attribution of CPU / GC. Those are JVM-process metrics;
   per-job attribution requires a slot-to-TM mapping the metric store does
   not expose today. The FLIP calls this out explicitly, demotes the JVM
   tier visually, and leaves per-job attribution as future work.
   - No Top N for checkpoints in this FLIP. Its data source is
   CheckpointStatsSnapshot, not MetricStore; it fits better as a separate
   FLIP / JIRA once this one lands.
   - No changes to any existing endpoint, response body, or metric. The new
   wire fields are purely additive.

*Open points I'd most appreciate feedback on.*

   1. *Scope of the JVM tier.* I think CPU/GC are still worth surfacing as
   TaskManager-scoped rather than dropping them, because they're what
   operators check next after the per-subtask tier. But I'm open to being
   convinced that we should gate the whole JVM tier behind per-job
   attribution landing first.
   2. *"Source vertex" detection.* The handler treats any vertex that
   exposes one of the three FLIP-33 metrics as a source, instead of walking
   the job graph. This is simpler and extends to every FLIP-33-compliant
   connector automatically. Is there a case where this heuristic would
   mislabel a non-source vertex as a source? I couldn't come up with one,
   but it's the one design choice I'd like a second pair of eyes on.
   3. *Missing-value convention.* For source lag I use -1 as a numeric
   sentinel so the JSON stays flat. The frontend renders "n/a". An
   alternative would be to keep the field nullable. Any preference?
   4. *Ranking of topLaggingSources.* Current order is pendingRecords →
   fetchLag → emitLag. Happy to change if there's a stronger convention in
   the community.
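For point 2, the heuristic amounts to a one-line membership test. A
hypothetical Python rendering (the real check sits in the Java handler;
the set contents are exactly the three FLIP-33 metric names):

```python
# A vertex exposing any FLIP-33 source metric is treated as a source.
FLIP33_SOURCE_METRICS = {
    "pendingRecords",
    "currentFetchEventTimeLag",
    "currentEmitEventTimeLag",
}

def is_source_vertex(vertex_metric_names):
    """vertex_metric_names: set of metric names any subtask exposes."""
    return not FLIP33_SOURCE_METRICS.isdisjoint(vertex_metric_names)
```

The question above is whether a non-source operator could ever register one
of these three names and slip through this test.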

*Prototype / PR.* A working end-to-end prototype is already up as a draft
PR against apache/flink:master, so reviewers can try it out while reading
the FLIP. It's meant to show feasibility and make the wire format concrete;
it is *not* asking for review-to-merge before the FLIP is accepted.


   - JIRA: https://issues.apache.org/jira/browse/FLINK-39489
   - PR: https://github.com/apache/flink/pull/27774
   ([FLINK-39489][Web Dashboard] Add Top N Metrics Dashboard,
   +1900 / -2 across 23 files, branch
   feature/FLINK-top-n-metrics-dashboard)
   - FLIP: FLIP-XXX: Top-N Job Health Dashboard in the Web UI
   <https://drive.google.com/open?id=1Cq2G1zVWCWmjJNdtO9Cm4vzuDMHIUgyBTldBZZAxNR0>

*What the prototype already does* (matches the FLIP 1:1):

Backend (flink-runtime):

   - New handler TopNMetricsHandler on
   GET /jobs/:jobid/metrics/top-n?top=N (default N=5, bounded to a
   reasonable cap), reading from the existing MetricStore — no new
   metric is emitted, no ConfigOption is added.
   - Response body TopNMetricsResponseBody with five additive lists:
   topBackpressuredSubtasks, topBusySubtasks, topLaggingSources,
   topCpuConsumers, topGcIntensiveTaskManagers. Missing source lag is
   encoded as -1.
   - backPressuredTimeMsPerSecond / busyTimeMsPerSecond are normalized
   to [0, 1]; source lag aggregates pendingRecords (sum) and the two
   FLIP-33 lag metrics (max) across subtasks; GC time is summed across
   collectors per TaskManager.
   - Unit tests for TopNMetricsHandler covering ranking, capping,
   missing-metric fallbacks, and the FLIP-33 source-vertex heuristic.
   - REST API docs and the API snapshot are regenerated; the new topN
   query parameter is reflected there.
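The normalization and ranking described above can be sketched as follows.
Again this is illustrative Python, not the Java in TopNMetricsHandler; the
cap value of 100 is a made-up placeholder for the "reasonable cap" the
handler enforces:

```python
def busy_ratio(ms_per_second):
    """Normalize a *TimeMsPerSecond metric (0..1000 ms of a wall-clock
    second) to a ratio in [0, 1], clamping out-of-range readings."""
    return min(max(ms_per_second / 1000.0, 0.0), 1.0)

def top_n(entries, key, n=5, cap=100):
    """Rank entries descending by `key` and bound the list size.
    The cap of 100 here is illustrative, not the handler's actual value."""
    n = min(max(n, 1), cap)
    return sorted(entries, key=lambda e: e[key], reverse=True)[:n]
```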

Frontend (flink-runtime-web):

   - New "Top N" tab on the job detail page, wired to the real endpoint
   (no mock data left behind).
   - Two-tier layout: the per-subtask tier (backpressure / busy / source
   lag) is expanded by default; the JVM tier (CPU / GC) is collapsed by
   default to reflect that it is TaskManager-scoped, not per-job.
   - Each row links back to the corresponding vertex / subtask / TaskManager
   page so triage stays one click away from the existing UI; "n/a" is
   rendered for missing source lag.
   - The earlier empty "Top N" placeholder on the Overview page has been
   removed; only the job-scoped tab remains.

*How to try it locally.*

git fetch https://github.com/apache/flink refs/pull/27774/head:flink-topn
git checkout flink-topn
./mvnw -pl flink-runtime-web -P '!skip-webui-build' -DskipTests \
    clean install
# then start any job and open:
#   http://<jm-host>:8081/#/job/<jobid>/top-n
# or hit the REST API directly:
#   curl "http://<jm-host>:8081/jobs/<jobid>/metrics/top-n?top=10"

The FLIP document linked above also contains sample JSON, the test plan,
and the rejected alternatives.

Thanks in advance for the feedback. I'll give the thread a full week before
calling a vote, and longer if discussion is active.

Best regards,
Feat Zhang
