mrproliu opened a new pull request, #1200:
URL: https://github.com/apache/skywalking-banyandb/pull/1200

   ### Summary
   Adds an automatic pprof-capture path to FODC: when a BanyanDB container's 
RSS approaches its cgroup memory limit, the co-located FODC agent pulls heap 
and goroutine profiles from the container's `:6060` pprof endpoint, stores them 
on a shared volume, and exposes them for listing and download through the FODC 
proxy's HTTP API. This gives operators the memory snapshot from the moment 
right before an OOM — exactly the data that is otherwise lost because an OOM 
kill (SIGKILL / exit 137) leaves no panic artifact for the existing 
`/diagnostics` path to collect.
   
   ### Motivation
   The existing FODC crash pipeline only covers panics and file corruption. An 
OOM kill arrives as a SIGKILL, which the process cannot catch, so nothing 
records what allocated the memory. Capturing heap/goroutine profiles just before the limit 
is reached closes that gap without adding any always-on profiling overhead — 
the capture only fires under real pressure.
   
   ### How it works
   The banyand memory protector now exposes the raw cgroup limit as a gauge 
(`banyandb_memory_protector_cgroup_limit_bytes`); combined with the existing 
`process_resident_memory_bytes`, the agent has both the usage and the bound it 
needs. The agent's watchdog already scrapes the container's `/metrics` on a 
fixed cadence, so pressure evaluation is driven off each poll completion 
(`OnPollComplete`) rather than a separate timer — same freshness, no extra 
goroutine. When `rss / cgroup_limit * 100 >= trigger_percent`, the agent fetches each 
pprof target over HTTP, streaming the response straight to a file (never 
buffering a whole profile in memory, so profiling under pressure cannot itself 
trigger a second OOM), and finalizes the capture event with an atomically 
written `meta.json`. A cooldown and an in-progress guard prevent overlapping or 
runaway captures, and a retention policy keeps the highest-RSS events within 
both an artifact-count and a total-disk bound.
   
   The proxy holds only metadata (never the profile bytes): agents stream their 
capture-event metadata over the existing gRPC control stream, and the proxy 
serves the list over HTTP and proxies downloads by streaming the bytes from the 
owning agent in bounded chunks. The proxy's list cache is authoritative per 
agent — each list round is staged and swapped in atomically on a `ListComplete` 
handshake, so events an agent has evicted from its disk drop out of the proxy 
view instead of lingering as unservable entries. A download for a profile that 
is no longer available returns `404` rather than an empty `200`.
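
   The stage-then-swap behavior can be illustrated with a small sketch. The 
type and method names below (`profileCache`, `stage`, `listComplete`) are 
hypothetical and the metadata struct is reduced to two fields; the point is 
only to show why readers never observe a partial round and why evicted events 
drop out of the view.

```go
package main

import (
	"fmt"
	"sync"
)

// eventMeta is a hypothetical, reduced stand-in for capture-event metadata.
type eventMeta struct {
	ProfileID string
	RSSBytes  uint64
}

// profileCache holds the live view served to readers plus a per-agent
// staging area for the list round currently being received.
type profileCache struct {
	mu      sync.RWMutex
	live    map[string][]eventMeta
	staging map[string][]eventMeta
}

func newProfileCache() *profileCache {
	return &profileCache{
		live:    make(map[string][]eventMeta),
		staging: make(map[string][]eventMeta),
	}
}

// stage appends one streamed record to the agent's in-flight round.
func (c *profileCache) stage(agent string, ev eventMeta) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.staging[agent] = append(c.staging[agent], ev)
}

// listComplete promotes the staged round in one step. The previous view for
// this agent is replaced, so events missing from the new round disappear.
func (c *profileCache) listComplete(agent string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.live[agent] = c.staging[agent]
	delete(c.staging, agent)
}

// disconnect discards a half-received round for the agent.
func (c *profileCache) disconnect(agent string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.staging, agent)
}

// list returns a copy of the agent's live events.
func (c *profileCache) list(agent string) []eventMeta {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]eventMeta(nil), c.live[agent]...)
}

func main() {
	c := newProfileCache()
	c.stage("agent-1", eventMeta{ProfileID: "a", RSSBytes: 100})
	c.stage("agent-1", eventMeta{ProfileID: "b", RSSBytes: 200})
	c.listComplete("agent-1")
	fmt.Println(len(c.list("agent-1"))) // 2

	// Next round: the agent has evicted "a" and reports only "b".
	c.stage("agent-1", eventMeta{ProfileID: "b", RSSBytes: 200})
	fmt.Println(len(c.list("agent-1"))) // still 2 until the round completes
	c.listComplete("agent-1")
	fmt.Println(len(c.list("agent-1"))) // 1
}
```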
   
   ### API (proxy HTTP)
   - `GET /pressure-profiles` — list capture-event metadata across agents, with 
optional `role` / `pod_name` filters.
   - `GET /pressure-profiles/{pod_name}/{profile_id}/{type}` — stream-download 
one profile (`type` is `heap` or `goroutine`); routes by the stable pod name so 
it survives agent reconnects.
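
   A client for the two endpoints might look like the sketch below. The 
proxy address, pod name, and profile ID are placeholders, and the response 
body of the list endpoint is printed as-is because its schema is not described 
here. Only the paths and the `role` / `pod_name` query parameters come from the 
list above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// listURL builds the list endpoint URL with the optional filters.
func listURL(base, role, podName string) string {
	q := url.Values{}
	if role != "" {
		q.Set("role", role)
	}
	if podName != "" {
		q.Set("pod_name", podName)
	}
	u := base + "/pressure-profiles"
	if len(q) > 0 {
		u += "?" + q.Encode()
	}
	return u
}

// downloadURL builds the download URL; typ is "heap" or "goroutine".
func downloadURL(base, podName, profileID, typ string) string {
	return base + "/pressure-profiles/" + url.PathEscape(podName) +
		"/" + url.PathEscape(profileID) + "/" + url.PathEscape(typ)
}

// fetchTo streams the response body to w and treats any non-200 status,
// including the 404 for an evicted profile, as an error.
func fetchTo(client *http.Client, rawURL string, w io.Writer) error {
	resp, err := client.Get(rawURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %s", rawURL, resp.Status)
	}
	_, err = io.Copy(w, resp.Body)
	return err
}

func main() {
	// Placeholder values, not taken from the PR.
	base := "http://localhost:17913"
	pod, profileID := "banyandb-data-0", "example-profile-id"
	client := &http.Client{Timeout: 30 * time.Second}

	if err := fetchTo(client, listURL(base, "", pod), os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "list:", err)
	}

	out, err := os.Create("heap.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create:", err)
		return
	}
	defer out.Close()
	if err := fetchTo(client, downloadURL(base, pod, profileID, "heap"), out); err != nil {
		fmt.Fprintln(os.Stderr, "download:", err)
		os.Remove("heap.pprof") // do not leave an empty file behind
	}
}
```

   A successfully downloaded file can then be inspected with 
`go tool pprof heap.pprof`.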
   
   ### Agent flags
   Defaults are shown in parentheses:
   - `--pressure-profiler-enabled` (true)
   - `--pressure-profiler-trigger-percent` (75)
   - `--pressure-profiler-pprof-port` (6060)
   - `--pressure-profiler-cooldown` (5m)
   - `--pressure-profiler-dir` (/tmp/pressure-profiles)
   - `--pressure-profiler-max-artifacts` (16)
   - `--pressure-profiler-max-disk-bytes` (512MiB)
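
   The two retention flags above bound how many captures survive on disk. 
One way a "keep the highest-RSS events within both bounds" policy could be 
computed is sketched below. The `captureEvent` type and `retain` function are 
hypothetical, and the sketch assumes the artifact count is applied per capture 
event; the actual accounting in the agent may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// captureEvent is a hypothetical record of one capture on disk.
type captureEvent struct {
	ID        string
	RSSBytes  uint64 // RSS observed when the capture fired
	SizeBytes uint64 // total bytes the capture occupies on disk
}

// retain splits events into those to keep and those to evict. Events are
// considered from highest RSS to lowest; one is kept only if it fits within
// both the count bound and the remaining disk budget.
func retain(events []captureEvent, maxArtifacts int, maxDiskBytes uint64) (keep, evict []captureEvent) {
	sorted := append([]captureEvent(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].RSSBytes > sorted[j].RSSBytes
	})
	var used uint64
	for _, ev := range sorted {
		if len(keep) < maxArtifacts && used+ev.SizeBytes <= maxDiskBytes {
			keep = append(keep, ev)
			used += ev.SizeBytes
			continue
		}
		evict = append(evict, ev)
	}
	return keep, evict
}

func main() {
	events := []captureEvent{
		{ID: "a", RSSBytes: 10, SizeBytes: 100},
		{ID: "b", RSSBytes: 30, SizeBytes: 100},
		{ID: "c", RSSBytes: 20, SizeBytes: 100},
	}
	keep, evict := retain(events, 2, 1000)
	fmt.Println(keep)  // b and c, the two highest-RSS events
	fmt.Println(evict) // a
}
```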
   
   ### Metrics
   `banyandb_memory_protector_cgroup_limit_bytes` (banyand); 
`fodc_agent_pressure_capture_total`, 
`fodc_agent_pressure_skipped_cooldown_total`, 
`fodc_agent_pressure_failures_total{reason}` (agent).
   
   ### Notable design points
   - The set of captured profile types lives in a single shared package 
(`fodc/internal/pprofcapture`) so the agent's capture loop and the proxy's 
download-type validation can never drift apart.
   - Records arrive incrementally over the stream, so the proxy keeps a live 
cache plus a per-agent staging area and promotes a round in one swap on 
`ListComplete`; this is what makes the "evicted events disappear" behavior 
atomic and race-free (readers never see a partial round).
   - Capture is memory-bounded end to end (HTTP response streamed to disk on 
the agent; 1MB chunks on the download path through the proxy).
   
   ### Testing
   Unit tests cover the agent (trigger/threshold/cooldown/missing-metrics, 
finalize-on-fetch-failure, retention by count and by disk, path-traversal 
rejection, storage self-check), the proxy aggregator (authoritative replacement 
drops evicted entries, per-agent isolation, disconnect clears staging), and the 
proxy gRPC list/fetch handshake (waiter ack/force-done, 
zero/disconnected/send-failure agents, cleanup-on-disconnect). A dedicated Kind 
e2e (`test/e2e-v2/cases/fodc-pressure/`, wired as the `fodc-pressure-kind` job) 
runs the full chain against a real cluster: capture fires deterministically at 
`trigger-percent=1`, the proxy lists the metadata, the download streams back, 
and `go tool pprof` validates the bytes. The e2e was executed end-to-end and 
passes (1 passed / 0 failed).
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Fixes apache/skywalking#<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking-banyandb/blob/main/CHANGES.md).
   

