group model)

wusheng Wed, 10 Jun 2026 08:43:50 -0700

This is an automated email from the ASF dual-hosted git repository.

wu-sheng pushed a commit to branch swip-15-banyandb-so11y-rules
in repository https://gitbox.apache.org/repos/asf/skywalking.git


commit 6a96f7f19c27da296c494e90a990a06a2b600f41
Author: Wu Sheng <[email protected]>
AuthorDate: Wed Jun 10 23:43:27 2026 +0800

    SWIP-15: implement BanyanDB self-observability (cluster / container / group model)
    
    Rebuild otel-rules/banyandb around the cluster reality: Service = cluster,
    ServiceInstance = container (pod_name + container_name, with role/tier attributes),
    Endpoint = group. Add banyandb-endpoint.yaml; redesign service/instance rules to
    mirror the upstream FODC-proxy Grafana boards. Requires BanyanDB 0.11+.
    
    Rewrite the e2e to a no-FODC file-discovery cluster (1 liaison + 1 hot data node);
    the collector scrapes each node's :2121 directly and injects the identity labels.
    Operator docs rewritten to the cluster/container/group model.
    
    Validated: DSLClassGeneratorTest compiles all rules via the production path;
    the e2e passes 16/16 against BanyanDB 0.11.
    
    Co-Authored-By: Claude Fable 5 <[email protected]>
---
 docs/en/banyandb/dashboards-banyandb.md            | 176 +++++++++++++----
 docs/en/changes/changes.md                         |   8 +
 .../otel-rules/banyandb/banyandb-endpoint.yaml     |  96 +++++++++
 .../otel-rules/banyandb/banyandb-instance.yaml     | 217 ++++++++++++++-------
 .../otel-rules/banyandb/banyandb-service.yaml      | 113 +++++------
 test/e2e-v2/cases/banyandb/banyandb-cases.yaml     |  63 ++++--
 test/e2e-v2/cases/banyandb/docker-compose.yml      |  64 ++++--
 test/e2e-v2/cases/banyandb/e2e.yaml                |   5 +-
 .../metrics-has-label-value.yml}                   |  57 +++---
 .../{otel-collector-config.yaml => nodes.yaml}     |  39 +---
 .../cases/banyandb/otel-collector-config.yaml      |  30 ++-
 11 files changed, 589 insertions(+), 279 deletions(-)

diff --git a/docs/en/banyandb/dashboards-banyandb.md b/docs/en/banyandb/dashboards-banyandb.md
index 1d07018f41..33ee16606f 100644
--- a/docs/en/banyandb/dashboards-banyandb.md
+++ b/docs/en/banyandb/dashboards-banyandb.md
@@ -1,49 +1,141 @@
-# BanyanDB self observability dashboard
+# BanyanDB self-observability dashboard
 
-[BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/), as an observability database, aims to ingest, analyze and store Metrics, Tracing, and Logging data. It's designed to handle observability data generated by **Apache SkyWalking**，it also provides a dashboard to visualize the self-observability metrics.
+[Apache SkyWalking BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/) is the
+native storage for SkyWalking. A production deployment is one **cluster** made of many **nodes**, each
+running one or more **containers** with a role (`liaison` front door, `data` backend, and the `lifecycle`
+tier-migration sidecar), and data is organized into **groups**. SkyWalking models that reality directly
+and renders it on the `Layer: BANYANDB` dashboards in the Horizon UI:
+
+| SkyWalking entity | BanyanDB concept | Identity |
+| ----------------- | ---------------- | -------- |
+| `Service` | one BanyanDB **cluster** | the `cluster` label |
+| `ServiceInstance` | one **container** on a node | `pod_name` + `container_name` (joined by `@`) |
+| &nbsp;&nbsp;↳ attributes | role / tier | `container_name` (`liaison`/`data`/`lifecycle`), `node_type` (`hot`/`warm`/`cold`), `node_role`, `pod_name` |
+| `Endpoint` | one **group** (storage partition) | the `group` label (e.g. `sw_metricsMinute`) |
+
+> **Requires BanyanDB 0.11+.** This feature reads the FODC-proxy cluster-observability metric families
+> and the queue / lifecycle metric families that BanyanDB introduced after 0.10. Run a 0.11+ cluster
+> with the FODC proxy and the Prometheus metrics provider enabled.
 
 ## Data flow
-1. [BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/) collects metrics data internally and exposes a Prometheus http endpoint to retrieve the metrics.
-2. OpenTelemetry Collector fetches metrics from BanyanDB and pushes metrics to SkyWalking OAP Server via OpenTelemetry gRPC exporter.
-3. The SkyWalking OAP Server parses the expression with [MAL](../concepts-and-designs/mal.md) to filter/calculate/aggregate and store the results.
+
+1. Each BanyanDB container exposes its metrics; in a cluster the
+   [FODC proxy](https://skywalking.apache.org/docs/skywalking-banyandb/next/operation/fodc/overview/)
+   aggregates every container's Prometheus metrics onto a single `/metrics` endpoint (default `:17913`)
+   and stamps each sample with per-container identity labels (`pod_name`, `container_name`, `node_role`,
+   and `node_type` on data containers).
+2. An OpenTelemetry Collector scrapes the FODC proxy `/metrics` as the single Prometheus target, adds a
+   static `cluster: <name>` label (the only label SkyWalking must inject), and pushes via the
+   OpenTelemetry gRPC exporter to the SkyWalking OAP Server.
+3. The OAP Server parses the [MAL](../concepts-and-designs/mal.md) rules under `otel-rules/banyandb/` to
+   filter / calculate / aggregate and store the cluster, instance and group metrics.
 
 ## Set up
-1. Start [BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/),supporting both [Standalone Mode](https://skywalking.apache.org/docs/skywalking-banyandb/next/installation/standalone/) and [Cluster Mode](https://skywalking.apache.org/docs/skywalking-banyandb/next/installation/cluster/).
-2. Set up [OpenTelemetry Collector ](https://opentelemetry.io/docs/collector/getting-started/#docker). For details on Prometheus Receiver in OpenTelemetry Collector, refer to [here](../../../test/e2e-v2/cases/banyandb/otel-collector-config.yaml).
-3. Config SkyWalking [OpenTelemetry receiver](https://skywalking.apache.org/docs/main/next/en/setup/backend/opentelemetry-receiver/).
-
-## BanyanDB monitoring
-Self observability monitoring provides monitoring of the status and resources of the [BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/) server itself. `banyandb-server` is a `Service` in BanyanDB, and land on the `Layer: BANYANDB`.
-
-### Self observability metrics
-
-| Unit | Metric Name                                       | Description | Data Source |
-|------|---------------------------------------------------|-------------|-------------|
-| o/s | meter_banyandb_write_rate                        | Write Rate (Operations per Second) | BanyanDB |
-| GiB | meter_banyandb_total_memory                      | Total Memory | BanyanDB |
-| GiB | meter_banyandb_disk_usage                        | Disk Usage | BanyanDB |
-| r/s | meter_banyandb_query_rate                        | Query Rate (Requests per Second) | BanyanDB |
-| Count | meter_banyandb_total_cpu                        | Total CPU Cores | BanyanDB |
-| c/m | meter_banyandb_write_and_query_errors_rate      | Write and Query Errors Rate（Counts per Minute） | BanyanDB |
-| c/s | meter_banyandb_etcd_operation_rate               | Etcd Operation Rate（Counts per Second） | BanyanDB |
-| Count | meter_banyandb_active_instance                  | Active Instances | BanyanDB |
-| % | meter_banyandb_cpu_usage                        | CPU Usage Percentage | BanyanDB |
-| % | meter_banyandb_rss_memory_usage                 | RSS Memory Usage Percentage | BanyanDB |
-| % | meter_banyandb_disk_usage_all                   | Disk Usage Percentage | BanyanDB |
-| KiB/s | meter_banyandb_network_usage_recv               | Network Receive Rate | BanyanDB |
-| KiB/s | meter_banyandb_network_usage_sent               | Network Send Rate | BanyanDB |
-| o/s | meter_banyandb_storage_write_rate               | Storage Write Rate (Operations per Second) | BanyanDB |
-| s | meter_banyandb_query_latency                    | Query Latency (s) | BanyanDB |
-| Count | meter_banyandb_total_data                      | Total Data Elements | BanyanDB |
-| r/m | meter_banyandb_merge_file_data                 | Merge File Data Rate(Revolutions per Minute) | BanyanDB |
-| s | meter_banyandb_merge_file_latency              | Merge File Latency(s) | BanyanDB |
-| Count | meter_banyandb_merge_file_partitions          | Merge File Partitions | BanyanDB |
-| o/s | meter_banyandb_series_write_rate               | Series Write Rate (Operations per Second) | BanyanDB |
-| o/s | meter_banyandb_series_term_search_rate         | Series Term Search Rate (Operations per Second) | BanyanDB |
-| Count | meter_banyandb_total_series                   | Total Series Count | BanyanDB |
-| ops | meter_banyandb_stream_write_rate              | Stream Write Rate (Operations per Second) | BanyanDB |
-| ops | meter_banyandb_term_search_rate                | Term Search Rate (Operations per Second) | BanyanDB |
-| Count | meter_banyandb_total_document                 | Total Document Count | BanyanDB |
+
+1. Run a BanyanDB **0.11+** cluster (liaison + data nodes; data nodes may be tiered hot/warm/cold) with
+   the **FODC proxy** enabled and the Prometheus metrics provider on (default). Standalone mode is the
+   degenerate case — one cluster, one node, one `container_name=standalone`.
+2. Run an **OpenTelemetry Collector** whose `prometheus` receiver scrapes the FODC proxy `/metrics`
+   (`:17913`) as the single target and adds a static `cluster: <name>` label, exporting OTLP to OAP. For
+   a runnable example, see
+   [the e2e collector config](../../../test/e2e-v2/cases/banyandb/otel-collector-config.yaml).
+3. Enable SkyWalking's
+   [OpenTelemetry receiver](https://skywalking.apache.org/docs/main/next/en/setup/backend/opentelemetry-receiver/).
+   The `banyandb/*` rules are enabled by default in `enabledOtelMetricsRules`.
+4. Open the **Horizon UI** → `BanyanDB` layer.
+
+## Metrics
+
+The metric source expressions mirror the upstream BanyanDB Grafana boards, so the SkyWalking dashboards
+stay in lockstep with the BanyanDB catalog. The rule files are
+`otel-rules/banyandb/banyandb-service.yaml`, `banyandb-instance.yaml` and `banyandb-endpoint.yaml`.
+
+### Service scope — cluster summary (`meter_banyandb_*`)
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| w/s | `meter_banyandb_cluster_write_rate` | Cluster write rate across measure/stream/trace |
+| r/s | `meter_banyandb_cluster_query_rate` | Cluster query rate |
+| c/m | `meter_banyandb_cluster_error_rate` | Cluster error rate (counts/min) |
+| Count | `meter_banyandb_reporting_instances` | Live container count by role |
+| Count | `meter_banyandb_total_cpu_cores` | Cluster CPU capacity |
+| Bytes | `meter_banyandb_total_memory_used` | Cluster memory used |
+| Bytes | `meter_banyandb_total_disk_used` | Cluster disk used |
+
+### Instance scope — per container (`meter_banyandb_instance_*`)
+
+**All roles** (every container emits these):
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| s | `node_uptime` | Node uptime |
+| Cores | `cpu_usage` | CPU usage |
+| Bytes | `rss_memory` | Resident memory |
+| percentunit | `system_memory_percent` | System memory used fraction |
+| percentunit | `disk_usage_percent` | Disk used fraction (Σused/Σtotal) |
+| Bytes | `disk_used_by_path` / `disk_total_by_path` | Disk used / total by mount path |
+| percentunit | `disk_used_percent_by_path` | Disk used fraction by mount path |
+| Bytes/s | `network_recv` / `network_sent` | Network throughput by interface |
+| Count | `goroutines` | Go goroutines |
+| s | `gc_pause_avg` | Average GC pause |
+| Bytes | `heap_inuse` / `heap_next_gc` | Go heap in-use / next-GC threshold |
+| Bytes/s | `alloc_rate` | Go allocation rate |
+
+**Liaison** (front door; the dashboard gates these on `container_name == 'liaison'`):
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| r/s | `query_rate_by_service` | Query rate by data-model service |
+| c/m | `grpc_error_rate` | gRPC error rate |
+| r/s | `non_query_op_rate` | Registry / non-query operation rate |
+| w/s | `write_rate` | Write rate seen at the front door |
+| ops | `publish_throughput` | Tier-2 publish throughput by operation |
+| Bytes/s | `publish_bytes` | Publish bytes |
+| s | `publish_latency_p99` | Publish send latency p99 |
+| Count | `wqueue_pending` / `wqueue_file_parts` / `wqueue_mem_part` | Write-queue depth |
+
+**Data** (backend; the dashboard gates these on `container_name == 'data'`):
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| Count | `total_data` | Total stored data elements |
+| o/s | `merge_file_rate` | Merge-loop rate |
+| Count | `merge_file_partitions` | Avg parts merged per loop |
+| s | `merge_file_latency` | Avg file-merge latency |
+| o/s | `series_write_rate` / `series_term_search_rate` | Inverted-index write / term-search rate |
+| Count | `total_series` | Inverted-index documents |
+| o/s | `stream_tst_write_rate` / `stream_tst_term_search_rate` | Stream tst index write / term-search rate |
+| Count | `stream_tst_total_docs` | Stream tst index documents |
+| ops | `queue_sub_throughput` | Subscribe-queue throughput by operation |
+| s | `queue_sub_latency_p99` | Subscribe-queue latency p99 |
+| percent | `retention_measure_disk_usage_percent` / `retention_stream_disk_usage_percent` / `retention_trace_disk_usage_percent` | Retention disk-usage % per scope |
+
+**Lifecycle** (the tier-migration sidecar on hot/warm data pods; `container_name == 'lifecycle'`):
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| Count | `lifecycle_cycles` | Cumulative migration cycles |
+| s | `lifecycle_last_run` | Seconds since the last migration cycle started |
+| Status | `lifecycle_last_run_success` | Last cycle status (1 = OK, 0 = failed) |
+
+### Endpoint scope — per group (`meter_banyandb_endpoint_*`)
+
+| Unit | Metric | Description |
+| ---- | ------ | ----------- |
+| w/s | `write_rate` | Write rate for the group |
+| s | `query_latency` | Mean query latency for the group |
+| Count | `total_data` | Total stored data elements for the group |
+| o/s | `merge_file_rate` | Merge-loop rate for the group |
+| s | `merge_file_latency` | Avg file-merge latency for the group |
+| Count | `merge_file_partitions` | Avg parts merged per loop for the group |
+| o/s | `series_write_rate` | Inverted-index write rate for the group |
+| Count | `total_series` | Inverted-index documents for the group |
+| ops | `queue_throughput` | Subscribe-queue throughput by operation for the group |
+| s | `queue_latency_p99` | Publish-queue latency p99 for the group |
+| Bytes/s | `publish_bytes` | Publish bytes for the group |
 
 ## Customizations
-You can customize your own metrics/expression/dashboard panel.The metrics definition and expression rules are found in `/config/otel-rules/banyandb`.The [BanyanDB](https://skywalking.apache.org/docs/skywalking-banyandb/next/readme/) dashboard panel configurations ship from the SkyWalking Horizon UI bundle (apache/skywalking-horizon-ui); the OAP backend no longer hosts UI dashboard JSONs.
+
+You can customize your own metrics / expressions. The metric definitions and expression rules are in
+`/config/otel-rules/banyandb`. The dashboard panel configurations ship from the SkyWalking Horizon UI
+bundle (apache/skywalking-horizon-ui); the OAP backend does not host UI dashboard JSONs.
diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md
index d548d25cd5..f6dd1e414a 100644
--- a/docs/en/changes/changes.md
+++ b/docs/en/changes/changes.md
@@ -242,6 +242,14 @@
   admin-host only" entry above for the public REST retirement.
 
 #### OAP Server
+* SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model
+  (requires BanyanDB 0.11+). `otel-rules/banyandb/` now models a BanyanDB cluster as one `Service`
+  (`service(['cluster'])`), each container as a `ServiceInstance` keyed on `pod_name` + `container_name`
+  (with `node_role` / `node_type` / `container_name` / `pod_name` as instance attributes), and each
+  storage group as an `Endpoint`. New `banyandb-endpoint.yaml`; `banyandb-service.yaml` and
+  `banyandb-instance.yaml` redesigned to mirror the upstream FODC-proxy Grafana boards. The stale
+  single-node `host_name` model and the removed `etcd_operation_rate` / `up`-derived `active_instance`
+  metrics are gone.
 * Runtime MAL/LAL hot-update rules can declare `layerDefinitions:` to introduce new
   layers. Ordinals are operator-pinned in the `100_000+` tier; the layer is
   refcount-tracked and unregistered when the last declaring rule is removed. See
diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-endpoint.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-endpoint.yaml
new file mode 100644
index 0000000000..ef61c460c8
--- /dev/null
+++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-endpoint.yaml
@@ -0,0 +1,96 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# SWIP-15 section 3.3 Endpoint scope: a BanyanDB `group` (storage group, e.g. sw_metricsMinute,
+# sw_trace) is modeled as an Endpoint under the cluster Service. The `cluster` label is the
+# single static label the OTel collector injects per scrape job (it is NOT on the raw FODC
+# wire); `group` is carried natively by every family referenced below. Every metric here is
+# aggregated across all cluster nodes per group, so each rule's .sum() collapses the per-node /
+# per-seg / per-shard / per-operation / per-remote dimensions down to ['cluster','group']
+# before any rate/histogram/division. MAL arithmetic ('+', '/') inner-joins on exact label
+# equality, so every operand is reduced to the identical ['cluster','group'] (or
+# ['cluster','group','le'] for histograms) label set first.
+# Source expressions mirror the upstream BanyanDB Grafana "Workload" board
+# (docs/operation/grafana-fodc-workload.json).
+filter: "{ tags -> tags.job_name == 'banyandb-monitoring' }"
+expSuffix: endpoint(['cluster'], ['group'], Layer.BANYANDB)
+metricPrefix: meter_banyandb_endpoint
+metricsRules:
+  # writes/s for the group, across the three data-model scopes (measure, stream, trace). The
+  # write counter carries `group` regardless of which role records it, so the by-group roll-up
+  # is exact.
+  - name: write_rate
+    exp: (banyandb_measure_total_written.sum(['cluster', 'group']).rate('PT1M') + banyandb_stream_tst_total_written.sum(['cluster', 'group']).rate('PT1M') + banyandb_trace_tst_total_written.sum(['cluster', 'group']).rate('PT1M'))
+
+  # mean query latency (ms) for the group = sum(latency) / sum(count). liaison_grpc_total_latency
+  # and _started are BOTH counters (not a histogram), so this is a ratio of cumulative counters,
+  # not a percentile. Both filtered to method='query' and reduced to ['cluster','group']
+  # (collapsing the `service` data-model facet) before the division joins on equal labels.
+  - name: query_latency
+    exp: (banyandb_liaison_grpc_total_latency.tagEqual('method', 'query').sum(['cluster', 'group']) / banyandb_liaison_grpc_total_started.tagEqual('method', 'query').sum(['cluster', 'group'])) * 1000
+
+  # current total stored data elements for the group (gauge). Dimensioned by seg+shard+node_type
+  # across data nodes; .sum(['cluster','group']) collapses them into one per-group total.
+  - name: total_data
+    exp: (banyandb_measure_total_file_elements.sum(['cluster', 'group']) + banyandb_stream_tst_total_file_elements.sum(['cluster', 'group']) + banyandb_trace_tst_total_file_elements.sum(['cluster', 'group']))
+
+  # merge-loop iterations/min for the group (matches the upstream "Merge File Rate" rotrpm panel,
+  # which is rate(merge_loop_started) * 60). merge_loop_started carries node_type (NOT a `type`
+  # label), so no type filter applies here.
+  - name: merge_file_rate
+    exp: (banyandb_measure_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M')) * 60
+
+  # mean file-merge latency (ms) per merge loop for the group. merge_latency carries a `type`
+  # label (file/hot/mem); type='file' selects on-disk merges and is DATA-only on the wire
+  # (liaison emits only type='mem'). Divide accumulated merge-seconds by merge loops, both
+  # type/scope-aligned to ['cluster','group']. Matches the upstream "Merge File Latency" panel.
+  - name: merge_file_latency
+    exp: ((banyandb_measure_total_merge_latency.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1M') / banyandb_measure_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M')) + (banyandb_stream_tst_total_merge_latency.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1M') / banyandb_stream_tst_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M')) + (banyandb_trace_tst_total_merge_latency.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1 [...]
+
+  # avg parts merged per merge loop on the on-disk merge path for the group (matches the upstream
+  # "Merge File Partitions" panel = rate(merged_parts{type=file}) / rate(merge_loop_started)).
+  # merged_parts carries `type`; type='file' is DATA-only (liaison emits only type='mem').
+  - name: merge_file_partitions
+    exp: ((banyandb_measure_total_merged_parts.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1M') / banyandb_measure_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M')) + (banyandb_stream_tst_total_merged_parts.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1M') / banyandb_stream_tst_total_merge_loop_started.sum(['cluster', 'group']).rate('PT1M')) + (banyandb_trace_tst_total_merged_parts.tagEqual('type', 'file').sum(['cluster', 'group']).rate('PT1M') [...]
+
+  # inverted-index updates/s for the group. NOTE: *_inverted_index_total_updates is # TYPE=gauge
+  # though cumulative; rate() over a cumulative gauge yields a per-window delta (updates/s). Stream
+  # uses two index scopes -- both stream_storage_* and stream_tst_* are summed in. Data-only family.
+  - name: series_write_rate
+    exp: (banyandb_measure_inverted_index_total_updates.sum(['cluster', 'group']).rate('PT1M') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster', 'group']).rate('PT1M') + banyandb_stream_tst_inverted_index_total_updates.sum(['cluster', 'group']).rate('PT1M'))
+
+  # total inverted-index documents (series proxy) for the group (gauge, direct read, no rate).
+  # Both stream index scopes summed. Dimensioned by seg across data nodes; sum collapses to group.
+  - name: total_series
+    exp: (banyandb_measure_inverted_index_total_doc_count.sum(['cluster', 'group']) + banyandb_stream_storage_inverted_index_total_doc_count.sum(['cluster', 'group']) + banyandb_stream_tst_inverted_index_total_doc_count.sum(['cluster', 'group']))
+
+  # subscribe-side queue throughput (msgs/s) for the group, broken out by operation. queue_sub is
+  # emitted on BOTH data (operations batch-write/control/file-sync/query) and liaison
+  # (batch-write only); `operation` is kept in the group-by so the dashboard can split per op.
+  - name: queue_throughput
+    exp: banyandb_queue_sub_total_finished.sum(['cluster', 'group', 'operation']).rate('PT1M')
+
+  # publish-side queue p99 latency for the group. queue_pub_total_latency IS a histogram on the
+  # wire (_bucket carries le); keep le + group + operation in the .sum() group-by, then
+  # .histogram().histogram_percentile([99]). queue_pub is liaison-only. Precedent:
+  # oap.yaml / nginx-endpoint.yaml histogram idiom.
+  - name: queue_latency_p99
+    exp: banyandb_queue_pub_total_latency.sum(['le', 'cluster', 'group', 'operation']).histogram().histogram_percentile([99])
+
+  # publish bytes/s for the group. Wire family is banyandb_queue_pub_sent_bytes -- NO `total`
+  # infix (unlike queue_pub_total_started/_finished). Liaison-only; sum collapses
+  # operation/remote_node/remote_role/remote_tier before the rate.
+  - name: publish_bytes
+    exp: banyandb_queue_pub_sent_bytes.sum(['cluster', 'group']).rate('PT1M')
diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml
index 21955331f3..c3f728b139 100644
--- a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml
+++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml
@@ -13,74 +13,157 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This will parse a textual representation of a duration. The formats
-# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
-# with days considered to be exactly 24 hours.
-# <p>
-# Examples:
-# <pre>
-#    "PT20.345S" -- parses as "20.345 seconds"
-#    "PT15M"     -- parses as "15 minutes" (where a minute is 60 seconds)
-#    "PT10H"     -- parses as "10 hours" (where an hour is 3600 seconds)
-#    "P2D"       -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
-#    "P2DT3H4M"  -- parses as "2 days, 3 hours and 4 minutes"
-#    "P-6H3M"    -- parses as "-6 hours and +3 minutes"
-#    "-P6H3M"    -- parses as "-6 hours and -3 minutes"
-#    "-P-6H+3M"  -- parses as "+6 hours and -3 minutes"
-# </pre>
+# SWIP-15: BanyanDB self-observability, ServiceInstance scope = one container on a node.
+# The instance identity is pod_name + container_name (a data hot/warm pod co-hosts a data and a
+# lifecycle container under one pod_name), joined by '@'. role (container_name) and tier (node_type)
+# ride as instance attributes via the 6-arg instance() properties closure; node_type Elvis-defaults
+# to 'n/a' off data containers (it is absent on liaison samples, present on every ROLE_DATA sample).
+#
+# Every rule that aggregates keeps ['cluster','pod_name','container_name','node_role','node_type'] in
+# its .sum()/.avg()/.max() group-by: SampleFamily.aggregate() drops labels not in the group-by, and
+# the properties closure reads them from the post-aggregation sample (SampleFamily.java:810). node_type
+# rides on every ROLE_DATA sample (system_*, go_*, process_* included), so a data instance resolves a
+# stable tier across all rules; liaison families carry none, so liaison resolves 'n/a' consistently.
+#
+# Source expressions mirror the upstream BanyanDB Grafana "Nodes" board
+# (docs/operation/grafana-fodc-nodes.json) plus the liaison/data rows of the "Workload" board, so the
+# SkyWalking instance dashboard stays in lockstep with the upstream catalog.
 filter: "{ tags -> tags.job_name == 'banyandb-monitoring' }"
-expSuffix:  tag({tags -> tags.host_name = 'banyandb::' + tags.host_name}).service(['host_name'] , Layer.BANYANDB).instance(['host_name'], ['service_instance_id'], Layer.BANYANDB)
-metricPrefix: meter_banyandb
+expSuffix: |-
+  service(['cluster'], Layer.BANYANDB)
+  .instance(['cluster'], '::', ['pod_name', 'container_name'], '@', Layer.BANYANDB, { tags -> ['node_role': tags.node_role, 'node_type': tags.node_type ?: 'n/a', 'pod_name': tags.pod_name, 'container_name': tags.container_name] })
+metricPrefix: meter_banyandb_instance
 metricsRules:
-  - name: instance_write_rate
-    exp: banyandb_measure_total_written.rate('PT15S')+banyandb_stream_tst_total_written.rate('PT15S')
-  - name: instance_total_memory
-    exp: banyandb_system_memory_state.tagEqual('kind','total')
-  - name: instance_disk_usage
-    exp: banyandb_system_disk.tagEqual('kind','used').sum(['host_name','service_instance_id'])
-  - name: instance_query_rate
-    exp: banyandb_liaison_grpc_total_started.sum(['method','host_name','service_instance_id'])
-  - name: instance_total_cpu
-    exp: banyandb_system_cpu_num
-  - name: instance_write_and_query_errors_rate
-    exp: banyandb_liaison_grpc_total_err.tagEqual('method','query').sum(['method','host_name','service_instance_id']).rate('PT15S')*60 + banyandb_liaison_grpc_total_stream_msg_sent_err.sum(['host_name','service_instance_id']).rate('PT15S')*60 + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['host_name','service_instance_id']).rate('PT15S')*60 + banyandb_queue_sub_total_msg_sent_err.sum(['host_name','service_instance_id']).rate('PT15S')*60
-  - name: instance_etcd_operation_rate
-    exp: banyandb_liaison_grpc_total_registry_started.sum(['host_name','service_instance_id']).rate('PT15S') + banyandb_liaison_grpc_total_started.sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: instance_active_instance
-    exp: up.sum(['host_name','service_instance_id']).downsampling(MIN)
-  - name: instance_cpu_usage
-    exp: (((process_cpu_seconds_total.sum(['host_name','service_instance_id']).rate('PT15S') / banyandb_system_cpu_num.sum(['host_name','service_instance_id']))).max(['host_name','service_instance_id']))*1000
-  - name: instance_rss_memory_usage
-    exp: ((process_resident_memory_bytes.sum(['host_name','service_instance_id']).downsampling(MAX) / banyandb_system_memory_state.tagEqual('kind','total').sum(['host_name','service_instance_id'])).max(['host_name','service_instance_id']))*1000
-  - name: instance_disk_usage_all
-    exp: ((banyandb_system_disk.tagEqual('kind','used').sum(['host_name','service_instance_id']) / banyandb_system_memory_state.tagEqual('kind','total').sum(['host_name','service_instance_id'])).max(['host_name','service_instance_id']))*1000
-  - name: instance_network_usage_recv
-    exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: instance_network_usage_sent
-    exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: instance_storage_write_rate
-    exp: banyandb_measure_total_written.sum(['group','host_name','service_instance_id']).rate('PT15S')*1000
-  - name: instance_query_latency
-    exp: (banyandb_liaison_grpc_total_latency.tagEqual('method','query').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: instance_total_data
-    exp: banyandb_measure_total_file_elements.sum(['group','host_name','service_instance_id'])
-  - name: instance_merge_file_data
-    exp: banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S') * 60 *1000
-  - name: instance_merge_file_latency
-    exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: instance_merge_file_partitions
-    exp: (banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: instance_series_write_rate
-    exp: (banyandb_measure_inverted_index_total_updates.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: instance_series_term_search_rate
-    exp: banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['group','host_name','service_instance_id']).rate('PT15S')
-  - name: instance_total_series
-    exp: banyandb_measure_inverted_index_total_doc_count.sum(['group','host_name','service_instance_id'])
-  - name: instance_stream_write_rate
-    exp: banyandb_stream_tst_inverted_index_total_updates.sum(['group','host_name','service_instance_id']).rate('PT15S')
-  - name: instance_term_search_rate
-    exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['group','host_name','service_instance_id']).rate('PT15S')* 1000
-  - name: instance_total_document
-    exp: banyandb_stream_tst_inverted_index_total_doc_count.sum(['group','host_name','service_instance_id'])
+  # ---- All roles: Resources / Disk by Path / Go Runtime (every container emits these) ----
+  # node uptime (s). Raw gauge; ABSENT on lifecycle containers (their binary runs the metric service
+  # without the system collector), so the lifecycle instance shows no uptime.
+  - name: node_uptime
+    exp: banyandb_system_up_time
+  # CPU usage (cores). process_* rides on every container including lifecycle.
+  - name: cpu_usage
+    exp: process_cpu_seconds_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  # resident memory (bytes). Raw gauge, present on all containers.
+  - name: rss_memory
+    exp: process_resident_memory_bytes
+  # system memory used %. kind='used_percent' is emitted directly (a 0-1 fraction; source divides by 100).
+  - name: system_memory_percent
+    exp: banyandb_system_memory_state.tagEqual('kind','used_percent')
+  # disk used % = Σused / Σtotal across the node's data paths (matches the Grafana "Disk Usage %" panel).
+  - name: disk_usage_percent
+    exp: banyandb_system_disk.tagEqual('kind','used').sum(['cluster','pod_name','container_name','node_role','node_type']) / banyandb_system_disk.tagEqual('kind','total').sum(['cluster','pod_name','container_name','node_role','node_type'])
+  # disk used / total / used% broken out per mount path.
+  - name: disk_used_by_path
+    exp: banyandb_system_disk.tagEqual('kind','used').sum(['cluster','pod_name','container_name','node_role','node_type','path'])
+  - name: disk_total_by_path
+    exp: banyandb_system_disk.tagEqual('kind','total').sum(['cluster','pod_name','container_name','node_role','node_type','path'])
+  - name: disk_used_percent_by_path
+    exp: banyandb_system_disk.tagEqual('kind','used').sum(['cluster','pod_name','container_name','node_role','node_type','path']) / banyandb_system_disk.tagEqual('kind','total').sum(['cluster','pod_name','container_name','node_role','node_type','path'])
+  # network throughput (bytes/s) by interface name.
+  - name: network_recv
+    exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S')
+  - name: network_sent
+    exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S')
+  # Go runtime.
+  - name: goroutines
+    exp: go_goroutines
+  # average GC pause (s) = rate(Σpause) / rate(Σcount). go_gc_duration_seconds is a summary (no buckets),
+  # so this ratio of _sum/_count is the only valid average — do not apply 
histogram_percentile to it.
+  - name: gc_pause_avg
+    exp: go_gc_duration_seconds_sum.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') / go_gc_duration_seconds_count.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: heap_inuse
+    exp: go_memstats_heap_inuse_bytes
+  - name: heap_next_gc
+    exp: go_memstats_next_gc_bytes
+  - name: alloc_rate
+    exp: go_memstats_alloc_bytes_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+
+  # ---- Liaison only (front door; the dashboard gates these on container_name == liaison) ----
+  # query rate (req/s) by data-model service (measure/stream/trace/property). method literal is "query".
+  - name: query_rate_by_service
+    exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type','service']).rate('PT15S')
+  # gRPC errors/min. Three liaison-side error families (mirrors the Grafana "gRPC Error Rate" panel,
+  # which sums total_err + registry_err + stream_msg_received_err). All lazily registered -> empty on a
+  # healthy cluster; each pre-aggregated to the same label set before '+'.
+  - name: grpc_error_rate
+    exp: (banyandb_liaison_grpc_total_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_registry_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) * 60
+  # non-query operation rate (req/s): registry ops + any non-query unary call. total_started is
+  # query-only on the wire, so tagNotEqual('method','query') is empty today; registry_started carries it.
+  - name: non_query_op_rate
+    exp: banyandb_liaison_grpc_total_registry_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_started.tagNotEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  # write rate (writes/s) seen at the liaison front door. group label dropped (instance-level total).
+  - name: write_rate
+    exp: banyandb_measure_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  # tier-2 publish pipeline (liaison -> data): throughput by operation, bytes/s, and p99 send latency.
+  - name: publish_throughput
+    exp: banyandb_queue_pub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S')
+  - name: publish_bytes
+    exp: banyandb_queue_pub_sent_bytes.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: publish_latency_p99
+    exp: banyandb_queue_pub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99])
+  # write-queue (wqueue) depth: pending records, on-disk file parts, in-memory parts. On the liaison
+  # these reflect the write buffer; the same families on data containers reflect storage parts (the
+  # dashboard gates on container_name). Gauges, summed to the instance.
+  - name: wqueue_pending
+    exp: banyandb_measure_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type'])
+  - name: wqueue_file_parts
+    exp: banyandb_measure_total_file_parts.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_total_file_parts.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_total_file_parts.sum(['cluster','pod_name','container_name','node_role','node_type'])
+  - name: wqueue_mem_part
+    exp: banyandb_measure_total_mem_part.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_total_mem_part.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_total_mem_part.sum(['cluster','pod_name','container_name','node_role','node_type'])
 
+  # ---- Data only (backend; the dashboard gates these on container_name == data) ----
+  # total stored data elements (gauge).
+  - name: total_data
+    exp: banyandb_measure_total_file_elements.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_total_file_elements.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_total_file_elements.sum(['cluster','pod_name','container_name','node_role','node_type'])
+  # merge-loop iterations/s.
+  - name: merge_file_rate
+    exp: banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  # avg parts merged per merge loop on the file path (matches Grafana = rate(merged_parts{type=file}) /
+  # rate(merge_loop_started)). type='file' is data-only on the wire (liaison emits only type='mem').
+  - name: merge_file_partitions
+    exp: 
(banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
 / 
banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S'))
 + 
(banyandb_stream_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
 / banyandb_stream_tst_total_merge_loop_started.sum(['cluster', [...]
+  # avg file-merge latency (ms) per merge loop.
+  - name: merge_file_latency
+    exp: 
((banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
 / 
banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S'))
 + 
(banyandb_stream_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
 / banyandb_stream_tst_total_merge_loop_started.sum(['cluste [...]
+  # inverted-index (series) write rate / term-search rate / total docs. *_inverted_index_total_* are
+  # # TYPE=gauge but cumulative, so rate() yields a per-window delta. Stream's series index is the
+  # storage scope (stream_storage_*); the tst scope is reported separately below.
+  - name: series_write_rate
+    exp: banyandb_measure_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: series_term_search_rate
+    exp: banyandb_measure_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: total_series
+    exp: banyandb_measure_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_storage_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type'])
+  # stream time-series-table (tst) index, distinct from the stream series (storage) index above.
+  - name: stream_tst_write_rate
+    exp: banyandb_stream_tst_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: stream_tst_term_search_rate
+    exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')
+  - name: stream_tst_total_docs
+    exp: banyandb_stream_tst_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type'])
+  # subscribe-side queue (data receives from liaison): throughput by operation + p99 latency.
+  - name: queue_sub_throughput
+    exp: banyandb_queue_sub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S')
+  - name: queue_sub_latency_p99
+    exp: banyandb_queue_sub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99])
+  # retention disk-usage % per data-model scope (0-100 gauge). Kept per scope rather than summed (a sum
+  # of three percentages is meaningless). Not in the upstream Grafana boards; a SkyWalking addition.
+  - name: retention_measure_disk_usage_percent
+    exp: banyandb_storage_retention_measure_disk_usage_percent
+  - name: retention_stream_disk_usage_percent
+    exp: banyandb_storage_retention_stream_disk_usage_percent
+  - name: retention_trace_disk_usage_percent
+    exp: banyandb_storage_retention_trace_disk_usage_percent
 
+  # ---- Lifecycle only (the tier-migration sidecar on hot/warm data pods; container_name == lifecycle) ----
+  # cumulative migration cycles. Compiled into build #1166 (BanyanDB #1164) but lazily registered:
+  # emits no series until the first migration cycle fires. Dashboard renders absent-as-0.
+  - name: lifecycle_cycles
+    exp: banyandb_lifecycle_cycles_total
+  # seconds since the last migration cycle started = now - epoch. time() is the MAL ingest-time scalar
+  # (MQE has no current-time function), computed when the rule runs. BUILD-GATED: the source gauge is
+  # BanyanDB #1167+, absent on build #1166 -> no series until the cluster runs >= #1167 AND a cycle ends.
+  - name: lifecycle_last_run
+    exp: (time() - banyandb_lifecycle_last_run_timestamp_seconds.max(['cluster','pod_name','container_name','node_role','node_type']))
+  # last cycle status (1 = OK, 0 = failed). Same #1167 build gate as lifecycle_last_run.
+  - name: lifecycle_last_run_success
+    exp: banyandb_lifecycle_last_run_success
diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml
index 566f893cc4..97c6cac8f6 100644
--- a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml
+++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml
@@ -13,74 +13,51 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This will parse a textual representation of a duration. The formats
-# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
-# with days considered to be exactly 24 hours.
-# <p>
-# Examples:
-# <pre>
-#    "PT20.345S" -- parses as "20.345 seconds"
-#    "PT15M"     -- parses as "15 minutes" (where a minute is 60 seconds)
-#    "PT10H"     -- parses as "10 hours" (where an hour is 3600 seconds)
-#    "P2D"       -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
-#    "P2DT3H4M"  -- parses as "2 days, 3 hours and 4 minutes"
-#    "P-6H3M"    -- parses as "-6 hours and +3 minutes"
-#    "-P6H3M"    -- parses as "-6 hours and -3 minutes"
-#    "-P-6H+3M"  -- parses as "+6 hours and -3 minutes"
-# </pre>
+# SWIP-15: BanyanDB self-observability, Service scope = one BanyanDB cluster.
+# The FODC proxy is the single scrape target; the collector injects one static label
+# `cluster` (the only label SkyWalking must add). Every BanyanDB-native family carries
+# the `banyandb_` prefix; only the Go-runtime / process exporter families (go_* / process_*)
+# are bare. All cluster KPIs collapse to the single `cluster` series via `.sum(['cluster'])`,
+# which is also what makes the heterogeneous error families joinable by MAL `+`
+# (MAL arithmetic inner-joins on exact label equality).
+# Source expressions mirror the upstream BanyanDB Grafana boards
+# (docs/operation/grafana-fodc-workload.json) so the SkyWalking dashboards stay in lockstep.
 filter: "{ tags -> tags.job_name == 'banyandb-monitoring' }"
-expSuffix:  tag({tags -> tags.host_name = 'banyandb::' + tags.host_name}).service(['host_name'] , Layer.BANYANDB)
+expSuffix: service(['cluster'], Layer.BANYANDB)
 metricPrefix: meter_banyandb
 metricsRules:
-  - name: write_rate
-    exp: (banyandb_measure_total_written.sum(['host_name','service_instance_id']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['host_name','service_instance_id']).rate('PT15S'))
-  - name: total_memory
-    exp: banyandb_system_memory_state.tagEqual('kind','total').sum(['host_name'])
-  - name: disk_usage
-    exp: banyandb_system_disk.tagEqual('kind','used').sum(['host_name','service_instance_id'])
-  - name: query_rate
-    exp: banyandb_liaison_grpc_total_started.sum(['method','host_name','service_instance_id'])
-  - name: total_cpu
-    exp: banyandb_system_cpu_num.sum(['method','host_name','service_instance_id'])
-  - name: write_and_query_errors_rate
-    exp: banyandb_liaison_grpc_total_err.tagEqual('method','query').sum(['method','host_name','service_instance_id']).rate('PT15S')*60 + banyandb_liaison_grpc_total_stream_msg_sent_err.sum(['host_name','service_instance_id']).rate('PT15S')*60 + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['host_name','service_instance_id']).rate('PT15S')*60 + banyandb_queue_sub_total_msg_sent_err.sum(['host_name','service_instance_id']).rate('PT15S')*60
-  - name: etcd_operation_rate
-    exp: banyandb_liaison_grpc_total_registry_started.sum(['host_name','service_instance_id']).rate('PT15S') + banyandb_liaison_grpc_total_started.sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: active_instance
-    exp: up.sum(['host_name','service_instance_id']).downsampling(MIN)
-  - name: cpu_usage
-    exp: (((process_cpu_seconds_total.sum(['host_name','service_instance_id']).rate('PT15S') / banyandb_system_cpu_num.sum(['host_name','service_instance_id']))).max(['host_name','service_instance_id']))*1000
-  - name: rss_memory_usage
-    exp: ((process_resident_memory_bytes.sum(['host_name','service_instance_id']).downsampling(MAX) / banyandb_system_memory_state.tagEqual('kind','total').sum(['host_name','service_instance_id'])).max(['host_name','service_instance_id']))*1000
-  - name: disk_usage_all
-    exp: ((banyandb_system_disk.tagEqual('kind','used').sum(['host_name','service_instance_id']) / banyandb_system_memory_state.tagEqual('kind','total').sum(['host_name','service_instance_id'])).max(['host_name','service_instance_id']))*1000
-  - name: network_usage_recv
-    exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: network_usage_sent
-    exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['host_name','service_instance_id']).rate('PT15S')
-  - name: storage_write_rate
-    exp: banyandb_measure_total_written.sum(['group','host_name','service_instance_id']).rate('PT15S')*1000
-  - name: query_latency
-    exp: (banyandb_liaison_grpc_total_latency.tagEqual('method','query').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: total_data
-    exp: banyandb_measure_total_file_elements.sum(['group','host_name','service_instance_id'])
-  - name: merge_file_data
-    exp: banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S') * 60 *1000
-  - name: merge_file_latency
-    exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: merge_file_partitions
-    exp: (banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['group','host_name','service_instance_id']).rate('PT15S') / banyandb_measure_total_merge_loop_started.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: series_write_rate
-    exp: (banyandb_measure_inverted_index_total_updates.sum(['group','host_name','service_instance_id']).rate('PT15S'))*1000
-  - name: series_term_search_rate
-    exp: banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['group','host_name','service_instance_id']).rate('PT15S')
-  - name: total_series
-    exp: banyandb_measure_inverted_index_total_doc_count.sum(['group','host_name','service_instance_id'])
-  - name: stream_write_rate
-    exp: banyandb_stream_tst_inverted_index_total_updates.sum(['group','host_name','service_instance_id']).rate('PT15S')
-  - name: term_search_rate
-    exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['group','host_name','service_instance_id']).rate('PT15S')* 1000
-  - name: total_document
-    exp: banyandb_stream_tst_inverted_index_total_doc_count.sum(['group','host_name','service_instance_id'])
-
-
+  # cluster writes/s across the three data-model scopes (measure, stream, trace). Each scope's
+  # write counter is collapsed to one per-cluster series before `+`.
+  - name: cluster_write_rate
+    exp: (banyandb_measure_total_written.sum(['cluster']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster']).rate('PT15S'))
+  # cluster queries/s. `service` on this family is BanyanDB's data-model facet
+  # (measure/stream/trace/property), not a SkyWalking service; method literal is "query".
+  - name: cluster_query_rate
+    exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster']).rate('PT15S')
+  # cluster errors/min. The seven liaison-side error families mirror the upstream Grafana
+  # "Error Rate" stat (grafana-fodc-workload.json). Each is pre-aggregated to ['cluster']
+  # BEFORE `+` because their wire label sets differ (stream_msg_received_err carries
+  # group/method/service, registry_err carries method/service, sync_loop_err carries group)
+  # and MAL `+` joins on exact label equality. On a healthy cluster most of these are lazily
+  # registered and emit no series; MAL treats an empty operand as the additive identity, so the
+  # sum emits from whatever has fired and renders absent-as-0 when nothing has.
+  - name: cluster_error_rate
+    exp: (banyandb_liaison_grpc_total_err.sum(['cluster']).rate('PT15S') + 
banyandb_liaison_grpc_total_registry_err.sum(['cluster']).rate('PT15S') + 
banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster']).rate('PT15S')
 + banyandb_queue_pub_total_err.sum(['cluster']).rate('PT15S') + 
banyandb_measure_total_sync_loop_err.sum(['cluster']).rate('PT15S') + 
banyandb_stream_tst_total_sync_loop_err.sum(['cluster']).rate('PT15S') + 
banyandb_trace_tst_total_sync_loop_err.sum(['cluster' [...]
+  # live container count by role. count(['cluster','container_name','pod_name']) groups by all
+  # three then re-groups excluding the last key (pod_name), yielding one sample per
+  # (cluster, container_name) whose value = distinct pod_name count -> data=N, liaison=M.
+  # Mirrors the upstream "Nodes by Role" stat (count(banyandb_system_up_time) by container_name).
+  # CAVEAT: banyandb_system_up_time has NO lifecycle series (the lifecycle sidecar runs its
+  # metric service without the system collector), so this never emits a lifecycle row.
+  - name: reporting_instances
+    exp: banyandb_system_up_time.count(['cluster','container_name','pod_name'])
+  # cluster CPU capacity = sum of per-container visible core counts (no lifecycle series).
+  - name: total_cpu_cores
+    exp: banyandb_system_cpu_num.sum(['cluster'])
+  # cluster memory used (bytes). kind='used' is a real wire value (kind in total/used/used_percent).
+  - name: total_memory_used
+    exp: banyandb_system_memory_state.tagEqual('kind','used').sum(['cluster'])
+  # cluster disk used (bytes). system_disk carries a `path` label with multiple data roots;
+  # .sum(['cluster']) collapses all paths into one cluster total.
+  - name: total_disk_used
+    exp: banyandb_system_disk.tagEqual('kind','used').sum(['cluster'])
diff --git a/test/e2e-v2/cases/banyandb/banyandb-cases.yaml b/test/e2e-v2/cases/banyandb/banyandb-cases.yaml
index dfc901f490..83367a20c9 100644
--- a/test/e2e-v2/cases/banyandb/banyandb-cases.yaml
+++ b/test/e2e-v2/cases/banyandb/banyandb-cases.yaml
@@ -13,30 +13,59 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This file contains BanyanDB instance metrics queries, referencing
-# oap-server/server-starter/src/main/resources/otel-rules/banyandb.yaml
-
+# SWIP-15 BanyanDB self-observability metrics, the cluster / container / group model. References
+# oap-server/server-starter/src/main/resources/otel-rules/banyandb/{banyandb-service,banyandb-instance,banyandb-endpoint}.yaml
+# Entity identities come from the collector's injected labels (otel-collector-config.yaml):
+#   Service  = cluster                       -> e2e-banyandb
+#   Instance = pod_name '@' container_name    -> banyandb-liaison-0@liaison, banyandb-data-hot-0@data
+#   Endpoint = group                          -> sw_metricsMinute (an OAP self-telemetry group)
+# Expected templates:
+#   metrics-has-value.yml        — single unlabeled series (service/endpoint 
metrics summed to the entity key)
+#   metrics-has-label-value.yml  — labeled series. Instance metrics retain 
node_role/node_type labels
+#                                  (kept in the .sum() group-by so the instance properties closure resolves
+#                                  role/tier); reporting_instances is labeled by container_name; queue
+#                                  metrics are labeled by operation.
+# This minimal cluster (1 liaison + 1 hot data, no FODC) intentionally does NOT cover: lifecycle
+# (no migration sidecar / cycle), warm & cold tiers, error counters (lazily registered -> empty on a
+# healthy cluster), or multi-node aggregation.
 cases:
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_total_memory --service-name=banyandb::server
-    expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_write_rate --service-name=banyandb::server --instance-name=banyandb:2121
-    expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_total_memory --service-name=banyandb::server --instance-name=banyandb:2121
+  # ---- Service scope (cluster KPIs) ----
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_cluster_write_rate --service-name=e2e-banyandb
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_total_cpu --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_cluster_query_rate --service-name=e2e-banyandb
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_etcd_operation_rate --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_total_cpu_cores --service-name=e2e-banyandb
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_active_instance --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_total_memory_used --service-name=e2e-banyandb
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_cpu_usage --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_total_disk_used --service-name=e2e-banyandb
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_rss_memory_usage --service-name=banyandb::server --instance-name=banyandb:2121
+  # live container count by role (labeled by container_name)
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_reporting_instances --service-name=e2e-banyandb
+    expected: expected/metrics-has-label-value.yml
+
+  # ---- Instance scope (labeled by node_role/node_type, and operation where applicable) ----
+  # data node (banyandb-data-hot-0@data)
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_node_uptime --service-name=e2e-banyandb --instance-name=banyandb-data-hot-0@data
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_disk_usage_all --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_cpu_usage --service-name=e2e-banyandb --instance-name=banyandb-data-hot-0@data
+    expected: expected/metrics-has-label-value.yml
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_total_data --service-name=e2e-banyandb --instance-name=banyandb-data-hot-0@data
+    expected: expected/metrics-has-label-value.yml
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_queue_sub_throughput --service-name=e2e-banyandb --instance-name=banyandb-data-hot-0@data
+    expected: expected/metrics-has-label-value.yml
+  # liaison node (banyandb-liaison-0@liaison)
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_node_uptime --service-name=e2e-banyandb --instance-name=banyandb-liaison-0@liaison
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_network_usage_recv --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_write_rate --service-name=e2e-banyandb --instance-name=banyandb-liaison-0@liaison
+    expected: expected/metrics-has-label-value.yml
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_publish_throughput --service-name=e2e-banyandb --instance-name=banyandb-liaison-0@liaison
+    expected: expected/metrics-has-label-value.yml
+
+  # ---- Endpoint scope (storage group sw_metricsMinute) ----
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_endpoint_write_rate --service-name=e2e-banyandb --endpoint-name=sw_metricsMinute
     expected: expected/metrics-has-value.yml
-  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_instance_network_usage_sent --service-name=banyandb::server --instance-name=banyandb:2121
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_endpoint_total_data --service-name=e2e-banyandb --endpoint-name=sw_metricsMinute
     expected: expected/metrics-has-value.yml
-
+  - query: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_banyandb_endpoint_queue_throughput --service-name=e2e-banyandb --endpoint-name=sw_metricsMinute
+    expected: expected/metrics-has-label-value.yml
diff --git a/test/e2e-v2/cases/banyandb/docker-compose.yml b/test/e2e-v2/cases/banyandb/docker-compose.yml
index 2b0d618a81..fd8468552b 100644
--- a/test/e2e-v2/cases/banyandb/docker-compose.yml
+++ b/test/e2e-v2/cases/banyandb/docker-compose.yml
@@ -13,25 +13,44 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+# SWIP-15 BanyanDB self-observability e2e: a minimal CLUSTER (1 liaison + 1 hot data node), scraped
+# WITHOUT the FODC proxy. The OTel collector scrapes each node's own :2121 Prometheus endpoint
+# directly and injects the FODC-equivalent identity labels
+# (cluster / container_name / node_role / node_type / pod_name) as static per-scrape-job labels
+# (see otel-collector-config.yaml). The MAL rules read those tags regardless of origin, so the
+# cluster / instance / endpoint rule set is exercised without a FODC deployment.
+#
+# BanyanDB 0.11+ is required: the FODC-proxy cluster observability and the queue / lifecycle metric
+# families this feature reads were introduced after 0.10, and 0.11 uses file/DNS node discovery
+# (no etcd). The image is pinned per-case to the latest 0.11-dev build the public demo runs
+# (commit 8a1936ce9); the repo-wide ${SW_BANYANDB_COMMIT} is an older build that predates the
+# queue/lifecycle metric families.
 services:
-  oap:
+  data-hot:
     extends:
       file: ../../script/docker-compose/base-compose.yml
-      service: oap
-    expose:
-      - 11800
-    ports:
-      - "11800:11800"
-      - "12800:12800"
+      service: banyandb-data
+    image: "ghcr.io/apache/skywalking-banyandb:8a1936ce96653e89d3d13250a42abc6e3d42fae7-testing"
+    hostname: data-hot
+    command: data --node-discovery-mode=file --node-discovery-file-path=/etc/banyandb/nodes.yaml --node-labels type=hot
+    volumes:
+      - ./nodes.yaml:/etc/banyandb/nodes.yaml
     networks:
       - e2e
-  banyandb:
+
+  liaison:
     extends:
       file: ../../script/docker-compose/base-compose.yml
-      service: banyandb
-    ports:
-      - "17913:17913"
-      - "2121:2121"
+      service: liaison
+    image: "ghcr.io/apache/skywalking-banyandb:8a1936ce96653e89d3d13250a42abc6e3d42fae7-testing"
+    command: liaison --node-discovery-mode=file --node-discovery-file-path=/etc/banyandb/nodes.yaml --data-node-selector type=hot
+    volumes:
+      - ./nodes.yaml:/etc/banyandb/nodes.yaml
+    depends_on:
+      data-hot:
+        condition: service_healthy
+    networks:
+      - e2e
 
   otel-collector:
     image: otel/opentelemetry-collector:${OTEL_COLLECTOR_VERSION}
@@ -42,6 +61,27 @@ services:
       - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
     expose:
       - 55678
+    depends_on:
+      liaison:
+        condition: service_healthy
+      data-hot:
+        condition: service_healthy
+
+  oap:
+    extends:
+      file: ../../script/docker-compose/base-compose.yml
+      service: oap
+    environment:
+      SW_STORAGE: banyandb
+      SW_STORAGE_BANYANDB_TARGETS: "liaison:17912"
+    ports:
+      - "11800:11800"
+      - "12800:12800"
+    networks:
+      - e2e
+    depends_on:
+      liaison:
+        condition: service_healthy
 
 networks:
   e2e:
diff --git a/test/e2e-v2/cases/banyandb/e2e.yaml b/test/e2e-v2/cases/banyandb/e2e.yaml
index 965f8da653..96b4ba2371 100644
--- a/test/e2e-v2/cases/banyandb/e2e.yaml
+++ b/test/e2e-v2/cases/banyandb/e2e.yaml
@@ -44,7 +44,7 @@ cleanup:
     on: failure
     output-dir: $SW_INFRA_E2E_LOG_DIR/banyandb-data
     items:
-      - service: banyandb
+      - service: data-hot
         paths:
           - /tmp/trace/
           - /tmp/stream/
@@ -52,3 +52,6 @@ cleanup:
           - /tmp/property/
           - /tmp/schema-property/
           - /tmp/accesslog/
+      - service: liaison
+        paths:
+          - /tmp/accesslog/
diff --git a/test/e2e-v2/cases/banyandb/docker-compose.yml b/test/e2e-v2/cases/banyandb/expected/metrics-has-label-value.yml
similarity index 52%
copy from test/e2e-v2/cases/banyandb/docker-compose.yml
copy to test/e2e-v2/cases/banyandb/expected/metrics-has-label-value.yml
index 2b0d618a81..6fc7bffccc 100644
--- a/test/e2e-v2/cases/banyandb/docker-compose.yml
+++ b/test/e2e-v2/cases/banyandb/expected/metrics-has-label-value.yml
@@ -13,35 +13,28 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-services:
-  oap:
-    extends:
-      file: ../../script/docker-compose/base-compose.yml
-      service: oap
-    expose:
-      - 11800
-    ports:
-      - "11800:11800"
-      - "12800:12800"
-    networks:
-      - e2e
-  banyandb:
-    extends:
-      file: ../../script/docker-compose/base-compose.yml
-      service: banyandb
-    ports:
-      - "17913:17913"
-      - "2121:2121"
-
-  otel-collector:
-    image: otel/opentelemetry-collector:${OTEL_COLLECTOR_VERSION}
-    networks:
-      - e2e
-    command: [ "--config=/etc/otel-collector-config.yaml" ]
-    volumes:
-      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
-    expose:
-      - 55678
-
-networks:
-  e2e:
+# For labeled metric series (e.g. reporting_instances by container_name, queue throughput by
+# operation): at least one result carries a non-empty label and at least one non-null value bucket.
+debuggingtrace: null
+type: TIME_SERIES_VALUES
+results:
+  {{- contains .results }}
+  - metric:
+      labels:
+        {{- contains .metric.labels }}
+        - key: {{ notEmpty .key }}
+          value: {{ notEmpty .value }}
+        {{- end }}
+    values:
+      {{- contains .values }}
+      - id: {{ notEmpty .id }}
+        value: {{ notEmpty .value }}
+        traceid: null
+        owner: null
+      - id: {{ notEmpty .id }}
+        value: null
+        traceid: null
+        owner: null
+      {{- end }}
+  {{- end}}
+error: null
diff --git a/test/e2e-v2/cases/banyandb/otel-collector-config.yaml b/test/e2e-v2/cases/banyandb/nodes.yaml
similarity index 59%
copy from test/e2e-v2/cases/banyandb/otel-collector-config.yaml
copy to test/e2e-v2/cases/banyandb/nodes.yaml
index 30d45627d2..9c0602822a 100644
--- a/test/e2e-v2/cases/banyandb/otel-collector-config.yaml
+++ b/test/e2e-v2/cases/banyandb/nodes.yaml
@@ -13,36 +13,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-receivers:
-  prometheus:
-    config:
-      scrape_configs:
-        - job_name: "banyandb-monitoring" 
-          scrape_interval: 5s
-          static_configs:
-            - targets: ["banyandb:2121"]
-              labels:
-                host_name: server
-
-processors:
-  batch:
-
-exporters:
-  otlp:
-    endpoint: oap:11800
-    tls:
-      insecure: true
-  debug:  
-    verbosity: detailed  
-
-service:
-  pipelines:
-    metrics:
-      receivers:
-        - prometheus
-      processors:
-        - batch
-      exporters:
-        - otlp
-
-
+# Static node-discovery file for the BanyanDB 0.11 cluster (file discovery mode; no etcd).
+nodes:
+  - name: data-hot
+    grpc_address: data-hot:17912
+  - name: liaison
+    grpc_address: liaison:17912
diff --git a/test/e2e-v2/cases/banyandb/otel-collector-config.yaml b/test/e2e-v2/cases/banyandb/otel-collector-config.yaml
index 30d45627d2..74016d2303 100644
--- a/test/e2e-v2/cases/banyandb/otel-collector-config.yaml
+++ b/test/e2e-v2/cases/banyandb/otel-collector-config.yaml
@@ -13,16 +13,36 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+# No FODC proxy in this e2e: the collector scrapes each BanyanDB node's own :2121 Prometheus
+# endpoint directly and injects the identity labels the SWIP-15 MAL rules key on. On a real
+# FODC deployment these are stamped by the proxy; here they are static per-target scrape labels
+# (the same mechanism the previous single-node e2e used to inject host_name, extended to two
+# targets and the full identity set). The OTel Prometheus receiver maps the Prometheus `job` to
+# service.name, which OAP's receiver maps back to the `job_name` tag the rules filter on; all
+# other static labels arrive as datapoint attributes (tags). Hence:
+#   - one job_name "banyandb-monitoring" (matches the filter on all three rule files)
+#   - per-target: cluster (service key), container_name (role discriminator + instance key),
+#     node_role, pod_name, and node_type (data only; liaison Elvis-defaults it to 'n/a').
 receivers:
   prometheus:
     config:
       scrape_configs:
-        - job_name: "banyandb-monitoring" 
+        - job_name: "banyandb-monitoring"
           scrape_interval: 5s
           static_configs:
-            - targets: ["banyandb:2121"]
+            - targets: ["liaison:2121"]
               labels:
-                host_name: server
+                cluster: e2e-banyandb
+                container_name: liaison
+                node_role: ROLE_LIAISON
+                pod_name: banyandb-liaison-0
+            - targets: ["data-hot:2121"]
+              labels:
+                cluster: e2e-banyandb
+                container_name: data
+                node_role: ROLE_DATA
+                node_type: hot
+                pod_name: banyandb-data-hot-0
 
 processors:
   batch:
@@ -32,8 +52,6 @@ exporters:
     endpoint: oap:11800
     tls:
       insecure: true
-  debug:  
-    verbosity: detailed  
 
 service:
   pipelines:
@@ -44,5 +62,3 @@ service:
         - batch
       exporters:
         - otlp
-
-
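
Reviewer note: the sketch below illustrates how the static scrape labels in otel-collector-config.yaml appear to line up with the entity names the e2e queries use. It is not the MAL rule implementation. The label values are copied from this commit; the helper name entity_names and the "pod_name@container_name" join are assumptions inferred from the --instance-name values in banyandb-cases.yaml, and the 'n/a' fallback mirrors the comment about the liaison's missing node_type.

```python
# Static per-target labels, copied from otel-collector-config.yaml in this commit.
TARGETS = {
    "liaison:2121": {
        "cluster": "e2e-banyandb",
        "container_name": "liaison",
        "node_role": "ROLE_LIAISON",
        "pod_name": "banyandb-liaison-0",
    },
    "data-hot:2121": {
        "cluster": "e2e-banyandb",
        "container_name": "data",
        "node_role": "ROLE_DATA",
        "node_type": "hot",
        "pod_name": "banyandb-data-hot-0",
    },
}


def entity_names(labels):
    """Hypothetical helper: derive (service, instance, node_type) from scrape labels.

    Assumption: the instance name is pod_name and container_name joined by '@',
    as suggested by the e2e instance names; the real rules may build it differently.
    """
    service = labels["cluster"]
    instance = f'{labels["pod_name"]}@{labels["container_name"]}'
    # The liaison target carries no node_type label, so fall back to 'n/a'.
    node_type = labels.get("node_type") or "n/a"
    return service, instance, node_type


if __name__ == "__main__":
    for target, labels in TARGETS.items():
        print(target, entity_names(labels))
```

If the assumption holds, the two targets resolve to the service e2e-banyandb and the instances banyandb-liaison-0@liaison and banyandb-data-hot-0@data, which are the names passed to swctl in the e2e cases.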
