This is an automated email from the ASF dual-hosted git repository.
wusheng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
The following commit(s) were added to refs/heads/main by this push:
new 7dcd935a Add First Occurrence Data Collection (FODC) doc (#895)
7dcd935a is described below
commit 7dcd935ac8124ad801ca9b22b50c1487c849c7d0
Author: 吴晟 Wu Sheng <[email protected]>
AuthorDate: Thu Dec 11 16:25:56 2025 +0800
Add First Occurrence Data Collection (FODC) doc (#895)
---
CHANGES.md | 1 +
docs/menu.yml | 4 +
docs/operation/fodc/overview.md | 273 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 278 insertions(+)
diff --git a/CHANGES.md b/CHANGES.md
index 1922e3a0..2993a4e4 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -22,6 +22,7 @@ Release Notes.
### Document
- Add read write benchmark document for 0.9.0 release.
+- Add FODC overview doc.
## 0.9.0
diff --git a/docs/menu.yml b/docs/menu.yml
index 78391ba0..d3e13a48 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -149,6 +149,10 @@ catalog:
path: "/operation/mcp/inspector"
- name: "Build and Package"
path: "/operation/mcp/build"
+ - name: "First Occurrence Data Collection (FODC)"
+ catalog:
+ - name: "Overview"
+ path: "/operation/fodc/overview"
- name: "Property Background Repair"
path: "/concept/property-repair"
- name: "Benchmark"
diff --git a/docs/operation/fodc/overview.md b/docs/operation/fodc/overview.md
new file mode 100644
index 00000000..4beb1355
--- /dev/null
+++ b/docs/operation/fodc/overview.md
@@ -0,0 +1,273 @@
+# First Occurrence Data Collection (FODC)
+
+First Occurrence Data Collection (FODC) is an observability and diagnostics subsystem for BanyanDB.
+It continuously collects runtime parameters, performance indicators, node states, and configuration
+data from DB nodes, and also supports **on-demand** performance profiling and memory snapshots.
+
+FODC has two primary goals:
+
+1. Ensure the stable operation of DB nodes across all lifecycle stages (bootstrap, steady state, scaling, failure, etc.).
+2. Provide trustworthy, structured data that supports capacity planning, performance analysis, and troubleshooting.
+
+FODC adopts a **Proxy + Agent** deployment model and exposes a unified, ecosystem-friendly data interface to external systems
+(such as Prometheus and other observability platforms).
+
+---
+
+## Overview
+
+FODC provides multiple categories of data:
+
+1. **Metric collection and short-term caching** (small time window)
+2. **Node topology and status**, including runtime parameters and role states
+3. **Node configuration collection**
+4. **On-demand performance profiling and memory snapshots**
+
+To accomplish this, FODC is deployed as:
+
+- A **central Proxy** service
+- Multiple **Agents**, typically co-located with BanyanDB nodes (sidecar pattern)
+
+Agents connect to the Proxy via gRPC and register themselves. The Proxy then:
+
+- Aggregates and normalizes data from all agents
+- Exposes unified REST/Prometheus-style interfaces
+- Issues **on-demand diagnostic commands** (profiling, snapshots, config capture, etc.) to one or more agents
+
+---
+
+## Architecture
+
+### Deployment Model
+
+FODC uses a **one-to-one mapping** between an Agent and a BanyanDB node:
+
+- Each **liaison** node has one FODC Agent
+- Each **data node** (hot / warm / cold) has one FODC Agent
+- In manual deployment modes, the same 1:1 relationship must be preserved
+
+Agents are typically deployed as **sidecars** in the same pod or host as the corresponding BanyanDB node.
+
+### Proxy–Agent Relationship
+
+The Proxy acts as the **control plane and data aggregator**, while Agents act as the **data plane** local to each DB node.
+
+Key characteristics:
+
+- Agents connect to the Proxy using **gRPC** and a configured **Proxy domain name**.
+- The connection is **bi-directional**:
+ - Agents stream node metrics, status, and configuration to the Proxy.
+ - The Proxy sends on-demand diagnostic commands back to the Agents.
+
+#### ASCII Architecture Diagram
+
+```text
+                 +-----------------------------------------+
+                 |               FODC Proxy                |
+                 |-----------------------------------------|
+                 | - Agent registry                        |
+                 | - Cluster topology view                 |
+External         | - Aggregated metrics (/metrics          |
+Clients & <----  |   and /metrics-windows)                 |
+Ecosystem        | - Config view (/cluster/config)         |
+                 | - On-demand control APIs                |
+                 +---------------^-------------------------+
+                                 |
+                    gRPC bi-directional streams
+                                 |
+         +-----------------------+-----------------------+
+         |                       |                       |
+         v                       v                       v
++------------------+    +------------------+    +------------------+
+| FODC Agent       |    | FODC Agent       |    | FODC Agent       |
+| (sidecar with    |    | (sidecar with    |    | (sidecar with    |
+| liaison node)    |    | data node - hot) |    | data node - warm)|
+|------------------|    |------------------|    |------------------|
+| - Scrape local   |    | - Scrape local   |    | - Scrape local   |
+|   Prometheus     |    |   Prometheus     |    |   Prometheus     |
+|   metrics        |    |   metrics        |    |   metrics        |
+| - Collect OS &   |    | - KTM / OS obs   |    | - KTM / OS obs   |
+|   KTM telemetry  |    |   metrics        |    |   metrics        |
+| - Execute on-    |    | - Execute on-    |    | - Execute on-    |
+|   demand profile |    |   demand profile |    |   demand profile |
+|   & heap dump    |    |   & heap dump    |    |   & heap dump    |
++--------^---------+    +--------^---------+    +--------^---------+
+         |                       |                       |
+         |                       |                       |
+  +------+--------+       +------+--------+       +------+--------+
+  | BanyanDB      |       | BanyanDB      |       | BanyanDB      |
+  | liaison node  |       | data node     |       | data node     |
+  | (process)     |       | (hot tier)    |       | (warm tier)   |
+  +---------------+       +---------------+       +---------------+
+```
+
+> Additional data node tiers (e.g. `datanode-cold`) follow the same **BanyanDB node ↔ FODC Agent** 1:1 pattern.
+
+---
+
+## Metric Collection and Prometheus Integration
+
+### Data Sources
+
+Each FODC Agent collects metrics from:
+
+1. **The local BanyanDB node**
+ - Via its Prometheus `/metrics` HTTP endpoint
+ - Includes DB performance, internal queues, I/O stats, query latency, etc.
+2. **Kernel & OS-level telemetry**
+ - Through an integrated **Kernel Telemetry Module (KTM)** powered by eBPF
+ - Examples: OS page cache statistics, system I/O latency, CPU scheduling behavior
+
+### In-Memory Sliding Window Cache
+
+Agents maintain a **sliding window** of recent metric samples in memory:
+
+- A **wake-up queue** is used to buffer the last **N** collections.
+- The time window is **auto-tuned** at startup based on:
+ - Sample interval
+ - Number of metrics
+ - Available memory constraints
+- Target memory usage is kept low (around **30 MB** per Agent) while still supporting:
+ - Short-term trend analysis
+ - Correlation during incident triage (e.g. spikes around first occurrence)
+
+This design allows FODC to provide recent time-series context **without depending on an external TSDB**.
+
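As a rough illustration of the sliding window described above, the following Python sketch sizes a bounded buffer from a memory budget and drops the oldest collections automatically. The per-sample byte cost, the function `window_capacity`, and the class `SlidingWindow` are assumptions made for this example and are not part of FODC; only the 30 MB target comes from the text.

```python
from collections import deque

# Assumed figures for illustration only; they are not taken from FODC.
BYTES_PER_SAMPLE = 64            # rough cost of one stored metric value
MEMORY_BUDGET = 30 * 1024 ** 2   # the ~30 MB per-Agent target from the doc


def window_capacity(num_metrics, budget=MEMORY_BUDGET, per_sample=BYTES_PER_SAMPLE):
    """Return how many collections fit in the budget (always at least one)."""
    per_collection = max(1, num_metrics) * per_sample
    return max(1, budget // per_collection)


class SlidingWindow:
    """Keeps only the last N collections; older ones are evicted on append."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def add(self, timestamp, samples):
        # Copy the samples so later mutation by the caller cannot alter history.
        self._buf.append((timestamp, dict(samples)))

    def since(self, start):
        """Return all retained collections with timestamp >= start."""
        return [(ts, s) for ts, s in self._buf if ts >= start]

    def __len__(self):
        return len(self._buf)
```

With these assumed numbers, 5,000 metrics per collection leave room for 98 collections, which at a 15-second sample interval would be about 24.5 minutes of history.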
+### Agent Metric Exposure
+
+Each FODC Agent exposes a **Prometheus-compatible** endpoint:
+
+- `GET /metrics`
+ - Returns the **latest** scraped metrics and local telemetry
+ - Can be scraped directly by:
+ - The FODC Proxy
+ - External observability systems (if desired and authorized)
+
+### Proxy Metric Exposure
+
+The FODC Proxy provides aggregated / enriched metric endpoints:
+
+- `GET /metrics`
+ - Proxy’s own metrics (health, number of agents, RPC latency, etc.)
+- `GET /metrics-windows`
+ - Returns metrics **within the maintained time window** for all known agents
+ - Includes **additional node metadata**, such as:
+ - Node role (`liaison`, `datanode-hot`, `datanode-warm`, `datanode-cold`, etc.)
+ - Node IDs and cluster membership
+ - Location / shard information (if available)
+
+This makes FODC a **drop-in component** for Prometheus-based ecosystems, while preserving richer semantic context about each node.
+
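One way to picture the metadata enrichment described above is as extra labels merged into each sample before it is rendered in the Prometheus text format. The Python sketch below is only an illustration: `render_sample`, the metric name `banyandb_write_total`, and the label keys `node_id` and `role` are hypothetical and do not describe the actual FODC output.

```python
def _escape(value):
    """Escape a label value per the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value, node_meta):
    """Render one sample line, merging node metadata into its labels.

    Node metadata wins on key collisions so the Proxy's view of the node
    cannot be overridden by labels reported from the scraped endpoint.
    """
    merged = {**labels, **node_meta}
    body = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(merged.items()))
    return f"{name}{{{body}}} {value}" if body else f"{name} {value}"


line = render_sample(
    "banyandb_write_total",           # hypothetical metric name
    {"group": "g1"},
    42,
    {"node_id": "n1", "role": "datanode-hot"},
)
```

Here `line` is `banyandb_write_total{group="g1",node_id="n1",role="datanode-hot"} 42`, which any Prometheus-compatible scraper can parse while still seeing which node and tier produced the sample.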
+---
+
+## Cluster Topology, Roles, and Runtime State
+
+The FODC Proxy maintains an up-to-date view of cluster topology based on **Agent registration**:
+
+1. On startup, each Agent:
+ - Connects to the Proxy via gRPC
+ - Registers its:
+ - Node ID and role
+ - Basic runtime attributes and capabilities
+2. The Proxy aggregates these registrations into a **logical cluster hierarchy structure**.
+
+### Topology & Status API
+
+The Proxy exposes a unified cluster discovery endpoint:
+
+- `GET /cluster`
+ - Returns the list of all registered nodes
+ - Includes:
+ - Node identity (ID, name, address)
+ - Role:
+ - `liaison`
+ - `datanode-hot`
+ - `datanode-warm`
+ - `datanode-cold`
+ - (extensible for future roles)
+ - Agent status (online/offline, last heartbeat time)
+ - Key runtime indicators (optional: load, health, etc.)
+
+This simplifies integration with:
+
+- Cluster dashboards
+- Automated operations (e.g. scheduled checks before resharding / scaling)
+- Higher-level diagnostics tooling that needs a consistent cluster graph
+
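The registration and status tracking behind `GET /cluster` could be sketched roughly as follows. This is a minimal in-memory model under stated assumptions: the class `AgentRegistry`, its field names, and the heartbeat timeout are hypothetical, and the real Proxy receives registrations over gRPC rather than through direct method calls.

```python
import time


class AgentRegistry:
    """Hypothetical in-memory registry of Agents keyed by node ID."""

    def __init__(self, heartbeat_timeout=30.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout
        self._clock = clock          # injectable so tests can control time
        self._agents = {}

    def register(self, node_id, role, address):
        """Record a newly connected Agent; registration counts as a heartbeat."""
        self._agents[node_id] = {
            "role": role,
            "address": address,
            "last_heartbeat": self._clock(),
        }

    def heartbeat(self, node_id):
        self._agents[node_id]["last_heartbeat"] = self._clock()

    def cluster_view(self):
        """Return all registered nodes, marking silent ones as offline."""
        now = self._clock()
        view = []
        for node_id, info in sorted(self._agents.items()):
            online = (now - info["last_heartbeat"]) <= self._timeout
            view.append({
                "node_id": node_id,
                "role": info["role"],
                "address": info["address"],
                "status": "online" if online else "offline",
            })
        return view
```

Keeping offline nodes in the view, instead of deleting them, lets dashboards distinguish a node that has stopped reporting from one that was never part of the cluster.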
+---
+
+## Node Configuration Collection
+
+FODC also collects and exposes static and dynamic configuration for each node.
+
+### What Is Collected
+
+Typical configuration categories include:
+
+- **Startup parameters**
+ - Command-line flags
+ - Environment variables (where permitted)
+- **Effective configurations**
+ - Configuration values actually in effect on the database node at runtime.
+
+### Configuration API
+
+The FODC Proxy aggregates configuration from all Agents and exposes it via:
+
+- `GET /cluster/config`
+ - Returns configuration for all nodes in the cluster
+- Potential filters (implementation-dependent):
+ - `GET /cluster/config?node_id=<id>`
+ - `GET /cluster/config?role=datanode-hot`
+
+This allows:
+
+- Quickly verifying configuration consistency across nodes and tiers
+- Comparing pre- and post-incident configuration states
+- Supporting automated configuration audits and drift detection
+
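The consistency check mentioned above can be sketched as a comparison over per-node configuration maps. The function `find_drift` and the shape of its input are assumptions for illustration; the response format of `GET /cluster/config` is not specified here, so a caller would first have to reshape it into one flat dictionary per node.

```python
def find_drift(configs):
    """Report configuration keys whose values differ between nodes.

    configs maps node ID -> {key: value}. The result maps each drifting
    key -> {node ID: value}, using None where a node lacks the key.
    """
    keys = set()
    for cfg in configs.values():
        keys.update(cfg)

    drift = {}
    for key in sorted(keys):
        values = {node: cfg.get(key) for node, cfg in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical flag names and values, used only to exercise the function.
report = find_drift({
    "data-hot-0": {"flush-timeout": "5s", "log-level": "info"},
    "data-hot-1": {"flush-timeout": "10s", "log-level": "info"},
})
```

In this example `report` contains only `flush-timeout`, since `log-level` is identical on both nodes. Running the same comparison on snapshots taken before and after an incident gives the pre/post view described in the list above.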
+---
+
+## On-Demand Performance Profiling and Memory Snapshots
+
+On-demand diagnostics are the **first non-read-only capability** exposed by FODC.
+They enable deep performance analysis while carefully controlling overhead.
+
+### Design Principles
+
+- **Opt-in and controlled**
+ Diagnostic actions are only triggered through explicit API calls to the Proxy.
+- **Local execution, remote control**
+ Agents perform the heavy work (profiling, snapshots) on the local node; the Proxy only orchestrates.
+- **Low default footprint**
+ By default, Agents run in **low CPU / low memory** mode and do not perform expensive diagnostics.
+- **Burst resource usage when needed**
+ Extra CPU/memory budget is mainly consumed **only during active profiling or snapshot sessions**.
+
+### Typical On-Demand Actions
+
+Exact APIs may vary by implementation, but commonly include:
+
+- **CPU profiling**
+ - Short-term CPU usage profiling (e.g. pprof)
+ - Useful for identifying hot code paths under load
+- **Heap / memory snapshots**
+ - Captures heap allocation state for leak or fragmentation analysis
+- **I/O / lock contention profiling**
+ - Optional profiling of DB internal lock contention or I/O stalls
+- **Configuration snapshot on demand**
+ - Force a re-capture of configuration and runtime flags at a specific point in time
+- **Access control**
+ - Apply RBAC or other authorization controls on the Proxy APIs that trigger these actions
+
+---
+
+## Summary
+
+FODC provides:
+
+- **Unified, structured observability** for BanyanDB clusters (metrics, topology, configuration)
+- **Prometheus-friendly** interfaces for easy ecosystem integration
+- **On-demand deep diagnostics** (profiling, memory snapshots) orchestrated centrally but executed locally
+- A lightweight, extensible **Proxy + Agent** architecture that respects resource constraints
+
+This makes FODC a foundational component for reliable operation, performance analysis, and automated troubleshooting of BanyanDB deployments.
\ No newline at end of file