mrproliu opened a new pull request, #1200:
URL: https://github.com/apache/skywalking-banyandb/pull/1200
### Summary
Adds an automatic pprof-capture path to FODC: when a BanyanDB container's
RSS approaches its cgroup memory limit, the co-located FODC agent pulls heap
and goroutine profiles from the container's `:6060` pprof endpoint, stores them
on a shared volume, and exposes them for listing and download through the FODC
proxy's HTTP API. This gives operators the memory snapshot from the moment
right before an OOM — exactly the data that is otherwise lost because an OOM
kill (SIGKILL / exit 137) leaves no panic artifact for the existing
`/diagnostics` path to collect.
### Motivation
The existing FODC crash pipeline only covers panics and file corruption; an
OOM is a SIGKILL with no in-process hook, so there is no way to see what
allocated the memory. Capturing heap/goroutine profiles just before the limit
is reached closes that gap without adding any always-on profiling overhead —
the capture only fires under real pressure.
### How it works
The banyand memory protector now exposes the raw cgroup limit as a gauge
(`banyandb_memory_protector_cgroup_limit_bytes`); combined with the existing
`process_resident_memory_bytes`, the agent has both the usage and the bound it
needs. The agent's watchdog already scrapes the container's `/metrics` on a
fixed cadence, so pressure evaluation is driven off each poll completion
(`OnPollComplete`) rather than a separate timer — same freshness, no extra
goroutine. When `rss / cgroup_limit * 100 >= trigger_percent`, the agent fetches each
pprof target over HTTP, streaming the response straight to a file (never
buffering a whole profile in memory, so profiling under pressure cannot itself
trigger a second OOM), and finalizes the capture event with an atomically
written `meta.json`. A cooldown and an in-progress guard prevent overlapping or
runaway captures, and a retention policy keeps the highest-RSS events within
both an artifact-count and a total-disk bound.
The proxy holds only metadata (never the profile bytes): agents stream their
capture-event metadata over the existing gRPC control stream, and the proxy
serves the list over HTTP and proxies downloads by streaming the bytes from the
owning agent in bounded chunks. The proxy's list cache is authoritative per
agent — each list round is staged and swapped in atomically on a `ListComplete`
handshake, so events an agent has evicted from its disk drop out of the proxy
view instead of lingering as unservable entries. A download for a profile that
is no longer available returns `404` rather than an empty `200`.
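The staged-swap behavior of the proxy's list cache might look something like the sketch below. The `Cache` and `Event` types and their method names are hypothetical; the example only illustrates the idea that records accumulate in a per-agent staging area and replace the live view in one step on `ListComplete`. How the live view is treated on disconnect is not stated in the description, so the sketch leaves it untouched.

```go
package main

import "sync"

// Event is a hypothetical, trimmed-down capture-event record.
type Event struct {
	ProfileID string
	RSSBytes  uint64
}

// Cache holds the reader-visible view plus a per-agent staging area.
type Cache struct {
	mu      sync.RWMutex
	live    map[string][]Event // agent -> events readers can see
	staging map[string][]Event // agent -> events of the round in flight
}

func NewCache() *Cache {
	return &Cache{
		live:    make(map[string][]Event),
		staging: make(map[string][]Event),
	}
}

// Stage records one event of the current list round; readers do not see it yet.
func (c *Cache) Stage(agent string, ev Event) {
	c.mu.Lock()
	c.staging[agent] = append(c.staging[agent], ev)
	c.mu.Unlock()
}

// ListComplete promotes the staged round in a single swap. Events missing
// from the round (for example, evicted from the agent's disk) disappear.
func (c *Cache) ListComplete(agent string) {
	c.mu.Lock()
	c.live[agent] = c.staging[agent]
	delete(c.staging, agent)
	c.mu.Unlock()
}

// Disconnect drops a half-finished round so it can never be promoted.
func (c *Cache) Disconnect(agent string) {
	c.mu.Lock()
	delete(c.staging, agent)
	c.mu.Unlock()
}

// List returns a copy of the live view for one agent.
func (c *Cache) List(agent string) []Event {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]Event(nil), c.live[agent]...)
}
```

Since the swap happens under the write lock, a reader sees either the previous round or the new one, never a mixture of the two.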
### API (proxy HTTP)
- `GET /pressure-profiles` — list capture-event metadata across agents, with
optional `role` / `pod_name` filters.
- `GET /pressure-profiles/{pod_name}/{profile_id}/{type}` — stream-download
one profile (`type` is `heap` or `goroutine`); routes by the stable pod name so
it survives agent reconnects.
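A client for these two endpoints could be sketched as below. The helper names, the `ErrProfileGone` sentinel, and the example base URL are assumptions for illustration; the paths, the `role` / `pod_name` query parameters, the `heap` / `goroutine` types, and the `404` behavior are taken from the description. The response body format of the list endpoint is not specified here, so the sketch does not decode it.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// ErrProfileGone is a hypothetical sentinel for a 404 from the download route.
var ErrProfileGone = errors.New("profile no longer available")

// profileTypes mirrors the two types named in the description.
var profileTypes = map[string]bool{"heap": true, "goroutine": true}

// ListURL builds the list endpoint URL; empty filters are omitted.
func ListURL(base, role, podName string) string {
	q := url.Values{}
	if role != "" {
		q.Set("role", role)
	}
	if podName != "" {
		q.Set("pod_name", podName)
	}
	u := strings.TrimRight(base, "/") + "/pressure-profiles"
	if len(q) > 0 {
		u += "?" + q.Encode()
	}
	return u
}

// DownloadURL builds the download endpoint URL and rejects unknown types.
func DownloadURL(base, podName, profileID, typ string) (string, error) {
	if !profileTypes[typ] {
		return "", fmt.Errorf("unsupported profile type %q", typ)
	}
	return strings.TrimRight(base, "/") + "/pressure-profiles/" +
		url.PathEscape(podName) + "/" + url.PathEscape(profileID) + "/" + typ, nil
}

// Download streams one profile into dst without buffering it in memory.
func Download(client *http.Client, rawURL string, dst io.Writer) (int64, error) {
	resp, err := client.Get(rawURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		return io.Copy(dst, resp.Body)
	case http.StatusNotFound:
		return 0, ErrProfileGone
	default:
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
}
```

A caller would typically open a local file, pass it as `dst`, and then inspect the result with `go tool pprof`.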
### Agent flags
- `--pressure-profiler-enabled` (default `true`)
- `--pressure-profiler-trigger-percent` (default `75`)
- `--pressure-profiler-pprof-port` (default `6060`)
- `--pressure-profiler-cooldown` (default `5m`)
- `--pressure-profiler-dir` (default `/tmp/pressure-profiles`)
- `--pressure-profiler-max-artifacts` (default `16`)
- `--pressure-profiler-max-disk-bytes` (default 512MiB)
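To illustrate how the two retention bounds (`--pressure-profiler-max-artifacts` and `--pressure-profiler-max-disk-bytes`) could interact with the "keep the highest-RSS events" rule, here is one possible sketch. The `CaptureEvent` type, the `Retain` function, and the greedy selection order are assumptions; the actual policy in the PR may count artifacts or break ties differently.

```go
package main

import "sort"

// CaptureEvent is a hypothetical summary of one capture event on disk.
type CaptureEvent struct {
	ID        string
	RSSBytes  uint64 // RSS observed when the capture fired
	Artifacts int    // number of profile files in this event
	DiskBytes uint64 // total size of those files
}

// Retain splits events into those to keep and those to evict. It walks the
// events from highest to lowest RSS and keeps each one that still fits
// within both the artifact-count bound and the total-disk bound.
func Retain(events []CaptureEvent, maxArtifacts int, maxDiskBytes uint64) (keep, evict []CaptureEvent) {
	sorted := append([]CaptureEvent(nil), events...) // do not reorder the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].RSSBytes > sorted[j].RSSBytes
	})
	var artifacts int
	var disk uint64
	for _, ev := range sorted {
		fits := artifacts+ev.Artifacts <= maxArtifacts && disk+ev.DiskBytes <= maxDiskBytes
		if fits {
			keep = append(keep, ev)
			artifacts += ev.Artifacts
			disk += ev.DiskBytes
		} else {
			evict = append(evict, ev)
		}
	}
	return keep, evict
}
```

Ranking by RSS rather than by age means the snapshots taken closest to the limit survive longest, which matches the stated goal of preserving the data nearest to an OOM.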
### Metrics
- banyand: `banyandb_memory_protector_cgroup_limit_bytes`
- agent: `fodc_agent_pressure_capture_total`,
`fodc_agent_pressure_skipped_cooldown_total`,
`fodc_agent_pressure_failures_total{reason}`
### Notable design points
- The set of captured profile types lives in a single shared package
(`fodc/internal/pprofcapture`) so the agent's capture loop and the proxy's
download-type validation can never drift apart.
- Records arrive incrementally over the stream, so the proxy keeps a live
cache plus a per-agent staging area and promotes a round in one swap on
`ListComplete`; this is what makes the "evicted events disappear" behavior
atomic and race-free (readers never see a partial round).
- Capture is memory-bounded end to end: the agent streams the HTTP response
to disk, and the download path through the proxy forwards 1 MB chunks.
### Testing
Unit tests cover the agent (trigger/threshold/cooldown/missing-metrics,
finalize-on-fetch-failure, retention by count and by disk, path-traversal
rejection, storage self-check), the proxy aggregator (authoritative replacement
drops evicted entries, per-agent isolation, disconnect clears staging), and the
proxy gRPC list/fetch handshake (waiter ack/force-done,
zero/disconnected/send-failure agents, cleanup-on-disconnect). A dedicated Kind
e2e (`test/e2e-v2/cases/fodc-pressure/`, wired as the `fodc-pressure-kind` job)
runs the full chain against a real cluster: capture fires deterministically at
`trigger-percent=1`, the proxy lists the metadata, the download streams back,
and `go tool pprof` validates the bytes. The e2e was executed end-to-end and
passes (1 passed / 0 failed).
- [ ] If this pull request closes/resolves/fixes an existing issue, replace
the issue number. Fixes apache/skywalking#<issue number>.
- [x] Update the [`CHANGES`
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).