hanahmily commented on code in PR #903:
URL: https://github.com/apache/skywalking-banyandb/pull/903#discussion_r2630818465
##########
docs/design/ktm.md:
##########
@@ -50,69 +50,70 @@ Notes:
- Focus: page cache add/delete, fadvise() calls, I/O counters, and memory
reclaim signals.
- Attachment points: stable tracepoints where possible; fentry/fexit preferred
on newer kernels.
- Data path: kernel events -> BPF maps (monotonic counters) -> userspace
collector -> exporters.
-- Scoping: Fixed to the single, co-located BanyanDB process within the same
container/pod.
+- Scoping: Fixed to the single, co-located BanyanDB process within the same
container/pod, using cgroup membership first and a `banyand` comm-prefix
fallback.
## Metrics Model and Collection Strategy
-- Counters in BPF maps are monotonic and are not cleared by the userspace
collector (NoCleanup).
+- Counters in BPF maps are monotonic and are not cleared by the userspace
collector.
- Collection and push interval: 10 seconds by default.
- KTM periodically pushes collected metrics into the FODC Flight Recorder
through a Go-native interface at the configured interval (default 10s). The
push interval is exported through the `collector.interval` configuration
option. The Flight Recorder is responsible for any subsequent export,
persistence, or diagnostics workflows.
-- Downstream systems (for example, FODC Discovery Proxy or higher-level
exporters) should derive rates using `rate()`/`irate()` or equivalents; we
avoid windowed counters and map resets to preserve counter semantics.
+- Downstream systems derive rates (for example, Prometheus/PromQL
`rate()`/`irate()`); FODC/KTM only provides raw counters and does not compute
rates internally. We avoid windowed counters and map resets to preserve counter
semantics.
- int64 overflow is not a practical concern for our use cases; we accept
long-lived monotonic growth.
+- KTM exports only raw counters; any ratios/percentages are derived upstream
(see FODC operations/overview for exporter behavior).
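A minimal sketch of what such downstream rate derivation could look like, mirroring PromQL `rate()` semantics including counter-reset handling (all names are hypothetical, not from the KTM/FODC code base):

```go
// Hypothetical downstream rate derivation from two successive samples of a
// monotonic KTM counter; illustrative only.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	value uint64    // monotonic counter value read from a BPF map
	at    time.Time // scrape timestamp
}

// ratePerSecond returns the per-second increase between two samples.
// A decrease means the collector restarted (maps recreated), so the new
// lifecycle starts from zero and only the current value is counted.
func ratePerSecond(prev, cur sample) float64 {
	delta := cur.value
	if cur.value >= prev.value {
		delta = cur.value - prev.value
	}
	secs := cur.at.Sub(prev.at).Seconds()
	if secs <= 0 {
		return 0
	}
	return float64(delta) / secs
}

func main() {
	t0 := time.Now()
	prev := sample{value: 1000, at: t0}
	cur := sample{value: 1600, at: t0.Add(10 * time.Second)}
	fmt.Printf("%.1f events/s\n", ratePerSecond(prev, cur)) // 60.0 events/s
}
```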
Configuration surface (current):
- `collector.interval`: Controls the periodic push interval for metrics to
Flight Recorder. Defaults to 10s.
-- `collector.enable_cgroup_filter`, `collector.enable_mntns_filter`: default
on when in sidecar mode; can be toggled.
-- `collector.target_pid`/`collector.target_comm`: optional helpers for
discovering scoping targets.
-- `collector.target_comm_regex`: process matcher regular expression used
during target discovery (matches `/proc/<pid>/comm` and/or executable
basename). Defaults to `banyand`.
-- Cleanup strategy is effectively `no_cleanup` by design intent;
clear-after-read logic is deprecated for production metrics.
+- `collector.ebpf.cgroup_path` (optional): absolute or
`/sys/fs/cgroup`-relative path to the BanyanDB cgroup v2; if unset, KTM
autodetects by scanning `/proc/*/comm` for `banyand`.
+- Target discovery heuristic: match `/proc/<pid>/comm` prefix `banyand` to
locate BanyanDB and derive its cgroup; this also serves as the runtime fallback
if the cgroup filter is unset.
+- Cleanup strategy: none by design. Counters remain monotonic, downstream
derives rates, and KTM does not clear BPF maps during collection.
- Configuration is applied via the FODC sidecar; KTM does not define its own
standalone process-level configuration surface.
## Scoping and Filtering
- Scoping is not optional; KTM is designed exclusively to monitor the single
BanyanDB process it is co-located with in a sidecar deployment.
-- The target process is identified at startup, and eBPF programs are
instructed to filter events to only that process.
-- Primary filtering mechanism: cgroup v2. This ensures all events originate
from the correct container. PID and mount namespace filters are used as
supplementary checks.
+- The target container is identified at startup; eBPF programs filter events
by cgroup membership first. If the cgroup filter is absent or misses, a
comm-prefix match (`banyand`) is used as a narrow fallback.
- The design intentionally avoids multi-process or node-level (DaemonSet)
monitoring to keep the implementation simple and overhead minimal.
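A minimal sketch of the cgroup-first mechanism, assuming the filter is keyed by cgroup v2 ID; the ID is the cgroup directory's inode number, which matches what `bpf_get_current_cgroup_id()` returns in-kernel, while the surrounding names are hypothetical:

```go
// Hypothetical helper for programming the primary cgroup filter: resolve
// the cgroup v2 ID for the target container and hand it to the BPF side.
package ktm

import (
	"fmt"
	"os"
	"syscall"
)

// cgroupID returns the cgroup v2 ID for a cgroupfs directory (Linux only).
// The ID is the directory's inode number, the same value eBPF programs
// observe from bpf_get_current_cgroup_id().
func cgroupID(path string) (uint64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("unexpected stat type for %s", path)
	}
	return st.Ino, nil
}
```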
### Target Process Discovery (Pod / VM)
-KTM needs to resolve the single “target” BanyanDB process before enabling
filters and attaching eBPF programs. In both Kubernetes pods and VM/bare-metal
deployments, KTM uses a **process matcher** driven by a configurable regular
expression (`collector.target_comm_regex`, default `banyand`).
+KTM needs to resolve the single “target” BanyanDB process before enabling
filters and attaching eBPF programs. In both Kubernetes pods and VM/bare-metal
deployments, KTM uses a **process matcher** based on a fixed comm-prefix match
(`banyand`).
#### Kubernetes Pod (sidecar)
Preconditions:
-- The pod should be configured with `shareProcessNamespace: true` so the
monitor sidecar can see the target container’s `/proc` entries.
-- The monitor container should have cgroup v2 mounted (typically at
`/sys/fs/cgroup`).
+- The pod must be configured with `shareProcessNamespace: true` so the monitor
sidecar can see the target container’s `/proc` entries.
+- cgroup v2 mounted (typically at `/sys/fs/cgroup`) to enable the primary
cgroup filter.
+- If the target process cannot be discovered (for example,
`shareProcessNamespace` is off), KTM logs the error and disables the module.
Review Comment:
KTM should keep checking whether the process is still online.
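A minimal sketch of such a liveness re-check, assuming a hypothetical `rediscover` callback standing in for the comm-prefix scan (none of these names come from the PR):

```go
// Hypothetical liveness re-check: poll whether the discovered target PID
// still exists and re-run discovery when it is gone. With
// shareProcessNamespace enabled the sidecar can see /proc/<pid>.
package ktm

import (
	"os"
	"strconv"
	"time"
)

// pidAlive reports whether /proc/<pid> still exists.
func pidAlive(pid int) bool {
	_, err := os.Stat("/proc/" + strconv.Itoa(pid))
	return err == nil
}

// watchTarget re-checks the target every interval and calls rediscover
// (for example, the comm-prefix scan) when the process has gone away.
func watchTarget(pid int, interval time.Duration, rediscover func() (int, error), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if pidAlive(pid) {
				continue
			}
			if newPID, err := rediscover(); err == nil {
				pid = newPID // re-program cgroup/PID filters here
			}
		}
	}
}
```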
##########
docs/design/ktm.md:
##########
@@ -3,7 +3,7 @@
## Overview
-Kernel Telemetry Module (KTM) is an optional, modular kernel observability
component embedded inside the BanyanDB First Occurrence Data Collection (FODC)
sidecar. The first built-in module is an eBPF-based I/O monitor ("iomonitor")
that focuses on page cache behavior, fadvise() effectiveness, and memory
pressure signals and their impact on BanyanDB performance. KTM is not a
standalone agent or network-facing service; it runs as a sub-component of the
FODC sidecar ("black box") and exposes a Go-native interface to the Flight
Recorder for ingesting metrics. Collection scoping is configurable and defaults
to cgroup v2.
+Kernel Telemetry Module (KTM) is an optional, modular kernel observability
component embedded inside the BanyanDB First Occurrence Data Collection (FODC)
sidecar. The first built-in module is an eBPF-based I/O monitor ("iomonitor")
that focuses on page cache behavior, fadvise() effectiveness, and memory
pressure signals and their impact on BanyanDB performance. KTM is not a
standalone agent or network-facing service; it runs as a sub-component of the
FODC sidecar ("black box") and exposes a Go-native interface to the Flight
Recorder for ingesting metrics. Collection scoping defaults to the BanyanDB
container’s cgroup v2, with a `banyand` comm fallback.
Review Comment:
The name of the comm fallback should be configurable, defaulting to `banyand`.
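A minimal sketch of a configurable fallback, assuming a hypothetical config struct; `target_comm` is borrowed from the option name already present in the configuration surface:

```go
// Hypothetical configuration hook for the comm fallback, defaulting to
// "banyand" as the review comment suggests.
package ktm

import "strings"

const defaultTargetComm = "banyand"

// CollectorConfig is illustrative; the PR does not define this struct.
type CollectorConfig struct {
	// TargetComm is the comm prefix used for fallback target discovery.
	TargetComm string `mapstructure:"target_comm"`
}

// commMatches reports whether a /proc/<pid>/comm value matches the
// configured prefix, falling back to "banyand" when unset.
func (c CollectorConfig) commMatches(comm string) bool {
	prefix := c.TargetComm
	if prefix == "" {
		prefix = defaultTargetComm
	}
	return strings.HasPrefix(strings.TrimSpace(comm), prefix)
}
```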
##########
docs/operation/fodc/ktm_metrics.md:
##########
@@ -0,0 +1,349 @@
+# KTM Metrics — Semantics & Workload Interpretation
+
+This document defines the **semantic meaning** of kernel-level metrics
collected by the
+Kernel Telemetry Module (KTM) under different BanyanDB workloads.
+
+It serves as the **authoritative interpretation guide** for:
+- First Occurrence Data Capture (FODC)
+- Automated analysis and reporting by LLM agents
+- Self-healing and tuning recommendations
+
+This document does **not** describe kernel attachment points or implementation
details.
+Those are covered separately in the KTM design document.
+
+---
+
+## 1. Scope and Non-Goals
+
+### In Scope
+- Interpreting kernel metrics in the context of **LSM-style read + compaction
workloads**
+- Distinguishing **benign background activity** from **user-visible read-path
impact**
+- Providing **actionable, explainable signals** for automated analysis
+
+### Out of Scope
+- Device-level I/O profiling or per-disk attribution
+- SLA-grade performance accounting
+- Precise block-layer root cause isolation
+
+SLA-grade performance accounting is explicitly out of scope because
+eBPF-based sampling and histogram bucketing introduce statistical
+approximation, and kernel-level telemetry cannot capture application-
+or network-level queuing delays.
+
+KTM focuses on **user-visible impact first**, followed by kernel-side
explanations.
+
+---
+
+## 2. Core Metrics Overview
+
+### 2.1 Read / Pread Syscall Latency (Histogram)
+
+**Metric Type**
+- Histogram (bucketed latency)
+- Collected at syscall entry/exit for `read` and `pread64`
+
+**Semantic Meaning**
+This metric represents the **time BanyanDB threads spend blocked in the read
syscall path**.
+
+It is the **primary impact signal** in KTM.
+
+**Key Rule**
+> If syscall-level read latency does **not** increase, the situation is **not
considered an incident**, regardless of background cache or reclaim activity.
+
+**Why Histogram**
+- Captures long-tail latency (p95 / p99) reliably
+- More representative of user experience than averages
+- Suitable for LLM-based reasoning and reporting
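A rough illustration of bucketed latency, assuming log2 buckets (a common eBPF convention; the PR does not specify KTM's actual bucket layout):

```go
// Hypothetical log2 bucketing for a latency histogram: bucket i covers
// [2^i, 2^(i+1)) nanoseconds, cheap to compute on the BPF side while
// still preserving the long tail (p95/p99).
package ktm

import "math/bits"

// bucketIndex maps a latency in nanoseconds to its power-of-two bucket.
func bucketIndex(latencyNs uint64) int {
	if latencyNs == 0 {
		return 0
	}
	return bits.Len64(latencyNs) - 1
}
```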
+
+---
+
+### 2.2 fadvise Policy Actions
+
+**Metric Type**
+- Counter
+
+**Semantic Meaning**
+Records **explicit page cache eviction hints** issued by BanyanDB.
+
+This metric represents **policy intent**, not impact.
+
+**Interpretation Notes**
+- fadvise activity alone is not an anomaly
+- Must be correlated with read latency to assess impact
+
+---
+
+### 2.3 Page Cache Add / Fill Activity
+
+**Metric Type**
+- Counter
+
+**Semantic Meaning**
+Represents pages being added to the OS page cache due to:
+- Read misses
+- Sequential scans
+- Compaction activity
+
+High page cache add rates are **expected** under LSM workloads.
+
+**Note**
+Page cache add activity does not necessarily imply disk I/O or cache miss.
+It may increase due to readahead, sequential scans, or compaction reads,
+and should be treated as a **correlated signal**, not a causal indicator,
+unless accompanied by read latency degradation.
+
+---
+
+### 2.4 Memory Reclaim and Pressure Signals
+
+**Metrics**
+- LRU shrink activity
+- Direct reclaim entry events
+
+**Semantic Meaning**
+Indicates **kernel memory pressure** that may destabilize page cache residency.
+
+These metrics act as **root-cause hints**, not incident triggers.
+
+---
+
+## 3. Interpretation Principles
+
+### 3.1 Impact-First Gating
+
+All incident detection and analysis is gated on:
+
+> **Syscall-level read latency histogram**
+
+Other metrics are used **only to explain why latency increased**, not to
decide whether an incident occurred.
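A minimal sketch of latency-gated evaluation, assuming hypothetical bucket bounds, quantile math, and a degradation threshold (none specified by the PR):

```go
// Hypothetical impact-first gating: kernel-side signals are evaluated only
// after the read-latency histogram shows long-tail degradation.
package fodc

// histogram is a simplified bucketed latency histogram.
type histogram struct {
	bounds []float64 // bucket upper bounds in microseconds, ascending
	counts []uint64  // observations per bucket
}

// quantile returns an approximate quantile from the bucket counts.
func (h histogram) quantile(q float64) float64 {
	var total uint64
	for _, c := range h.counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := uint64(q * float64(total))
	var seen uint64
	for i, c := range h.counts {
		seen += c
		if seen >= target {
			return h.bounds[i]
		}
	}
	return h.bounds[len(h.bounds)-1]
}

// shouldAnalyze gates all further interpretation on p99 degradation
// against a baseline; cache churn alone never opens an incident.
func shouldAnalyze(baseline, current histogram, degradeFactor float64) bool {
	b, c := baseline.quantile(0.99), current.quantile(0.99)
	return b > 0 && c > b*degradeFactor
}
```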
+
+---
+
+### 3.2 Cache Churn Is Not an Incident
+
+High values of:
+- page cache add
+- reclaim
+- background scans
+
+are **normal** under LSM-style workloads and **must not** be treated as
incidents unless they result in read latency degradation.
+
+---
+
+## 4. Workload Semantics
+
+This section defines canonical workload patterns and how KTM metrics should be
interpreted.
+
+---
+
+> **Global Rule — Latency-Gated Evaluation**
+>
+> All workload patterns below are evaluated **only after syscall-level
+> read latency degradation has been detected** (e.g., p95/p99 bucket shift).
+> Kernel signals such as page cache activity, reclaim, or fadvise **must not**
+> be interpreted as incident triggers on their own.
+
+---
+
+### Workload 1 — Sequential Read / Background Compaction (Benign)
+
+**Typical Signals**
+- `page_cache_add ↑`
+- `lru_shrink ↑` (optional)
+- `read syscall latency stable`
+
+**Interpretation**
+Sequential scans and compaction naturally introduce cache churn.
+As long as read latency remains stable, this workload is benign.
+
+**Operational Decision**
+- Do not trigger FODC
+- No self-healing action required
+
+---
+
+### Workload 2 — High Page Cache Pressure, Foreground Sustained
+
+**Typical Signals**
+- `page_cache_add ↑`
+- `lru_shrink ↑`
+- occasional `direct_reclaim`
+- `read syscall latency stable`
+
+**Interpretation**
+System memory pressure exists, but foreground reads are not impacted.
+This indicates a tight but stable operating point.
+
+**Operational Decision**
+- No incident
+- Monitor trends only
+
+---
+
+### Workload 3 — Aggressive Cache Eviction or Reclaim Impact
+
+**Typical Signals**
+- `fadvise_calls ↑` or early reclaim activity
+- `page_cache_add ↑` (repeated refills)
+- `read syscall latency ↑` (long-tail buckets appear)
+
+**Interpretation**
+Hot pages are evicted too aggressively, causing read amplification.
+Foreground reads are directly impacted.
+
+**Operational Decision**
+- Trigger FODC
+- Recommend tuning eviction thresholds or rate-limiting background activity
+
+**Discriminator**
+Eviction-driven degradation is typically characterized by:
+- Elevated `fadvise` activity
+- Repeated page cache refills
+- Read latency degradation **without sustained compaction throughput
+ or disk I/O saturation**
+
+This pattern indicates policy-induced cache churn rather than workload
contention.
Review Comment:
It can also result from the query pattern: continuously scanning an extensive
time range.
##########
docs/design/ktm.md:
##########
@@ -179,15 +189,19 @@ This approach ensures that a failure within the
observability module does not im
## Restart Semantics
-On sidecar restart, BPF maps are recreated and all counters reset to zero.
Downstream systems (e.g., Prometheus via FODC integrations) should treat this
as a new counter lifecycle and continue deriving rates/derivatives normally.
+- On sidecar restart, BPF maps are recreated and all counters reset to zero.
Downstream systems (e.g., Prometheus via FODC integrations) should treat this
as a new counter lifecycle and continue deriving rates/derivatives normally.
+- If BanyanDB restarts (PID changes), the cgroup filter continues to match as
long as the container does not change; the comm fallback also still matches
`banyand`.
+- If the pod/container is recreated (cgroup path changes), KTM re-runs target
discovery, re-programs the cgroup filter, and starts counters from zero;
metrics from the old container are discarded without reconciliation.
+- KTM performs a lightweight health check during collection to ensure the
cgroup filter is still populated; if it is missing (for example, container
crash/restart), KTM re-detects and re-programs the filter automatically.
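A minimal sketch of that health check, assuming hypothetical discovery and map-programming callbacks (the PR does not define this API):

```go
// Hypothetical collection-time health check: verify the tracked cgroup
// path still exists and re-run discovery plus filter programming if not.
package ktm

import "os"

// checkCgroupFilter is called on each collection tick. If the cgroup
// directory vanished (container recreated), it re-detects the target and
// re-programs the BPF-side filter for the new cgroup.
func checkCgroupFilter(cgroupPath string, redetect func() (string, error), program func(path string) error) (string, error) {
	if _, err := os.Stat(cgroupPath); err == nil {
		return cgroupPath, nil // filter still valid
	}
	newPath, err := redetect() // for example, the comm-prefix scan over /proc
	if err != nil {
		return cgroupPath, err
	}
	return newPath, program(newPath)
}
```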
## Kernel Attachment Points (Current)
-- `ksys_fadvise64_64` → fentry/fexit (preferred) or syscall tracepoints with
kprobe fallback.
-- Page cache add/remove → `filemap_get_read_batch` and
`mm_filemap_add_to_page_cache` tracepoints, with kprobe fallbacks.
-- Memory reclaim → `mm_vmscan_lru_shrink_inactive` and
`mm_vmscan_direct_reclaim_begin` tracepoints.
+- `sys_enter_read`, `sys_exit_read`, `sys_enter_pread64`, `sys_exit_pread64`
(syscall-level I/O latency).
Review Comment:
Do you tend to implement them? If not, remove this line.
##########
docs/operation/fodc/ktm_metrics.md:
##########
@@ -0,0 +1,349 @@
+These discriminator signals are typically sourced from DB-level or system-level
+metrics outside KTM.
+
+---
+
+### Workload 4 — Compaction vs Foreground Read Contention
+
+**Typical Signals**
+- `page_cache_add ↑` (compaction scans)
+- `read syscall latency ↑`
+- reclaim may or may not be present
+
+**Interpretation**
+Latency degradation caused by workload-induced I/O contention.
+This is not necessarily a policy bug, but a scheduling and resource contention
issue.
+
+**Operational Decision**
+- Trigger FODC
+- Suggest reducing compaction concurrency or isolating foreground reads
+
+**Discriminator**
+Compaction-driven contention is typically characterized by:
+- Sustained page cache add activity
+- Read latency degradation
+- Concurrent high compaction throughput, background I/O pressure,
+ or elevated compaction thread utilization
+
+This pattern reflects workload-induced resource contention rather than
+explicit cache eviction policy.
+These discriminator signals are typically sourced from DB-level or system-level
+metrics outside KTM.
+
+---
+
+### Workload 5 — OS Memory Pressure–Driven Cache Drop
+
+**Typical Signals**
+- `direct_reclaim ↑`
+- `lru_shrink ↑`
+- `read syscall latency ↑`
+- `fadvise` may be absent
+
+**Interpretation**
+Cache eviction is driven by OS memory pressure rather than DB policy.
+Foreground reads stall due to synchronous reclaim.
+
+**Operational Decision**
+- Trigger FODC
+- Recommend adjusting memory limits or reducing background memory usage
+
+---
+
+### Workload 6 — DB Block Cache Miss → OS Fallback
+
+**Typical Signals**
+- DB-level block cache miss (external signal)
+- `page_cache_add ↑`
+- `read syscall latency ↑`
+- reclaim may be present
+
+**Interpretation**
+DB block cache degradation forces fallback to OS page cache and disk.
+Kernel-level read latency confirms user-visible impact.
+
+**Operational Decision**
+- Trigger FODC
+- Recommend tuning DB block cache size or access patterns
+
+**Note**
+This workload cannot be identified by KTM in isolation.
+Confirmation requires correlating kernel-level impact signals
+with database-level block cache metrics via the FODC Proxy.
Review Comment:
Do we have DB-level block cache metrics? I don't have any clues about it.
##########
docs/operation/fodc/overview.md:
##########
Review Comment:
Do not change this file
##########
docs/design/ktm.md:
##########
@@ -148,10 +149,14 @@ Prefix: metrics are currently emitted under the `ktm_`
namespace to reflect thei
- Memory
- `ktm_memory_lru_pages_scanned_total`
- `ktm_memory_lru_pages_reclaimed_total`
- - `ktm_memory_reclaim_efficiency_percent`
- `ktm_memory_direct_reclaim_processes`
-Semantics: all counters are monotonic; use Prometheus functions for
rates/derivatives; no map clearing between scrapes.
+Semantics: all counters are monotonic; use Prometheus functions for
rates/derivatives; no map clearing between scrapes. KTM does not emit
ratio/percentage metrics; derive them upstream.
+
+## Planned Metrics
Review Comment:
Please don't mention them if you don't want to implement right now.
##########
docs/design/ktm.md:
##########
@@ -164,6 +169,11 @@ Loading and managing eBPF programs requires elevated
privileges. The FODC sideca
- `CAP_BPF`: Allows loading, attaching, and managing eBPF programs and maps.
This is the preferred, more restrictive capability.
- `CAP_SYS_ADMIN`: A broader capability that also grants permission to perform
eBPF operations. It may be required on older kernels where `CAP_BPF` is not
fully supported.
+Operational prerequisites and observability:
Review Comment:
Copy the operational part to the `operation/fodc` folder