(skywalking) branch master updated: SWIP-15: container-level instances and post-review corrections (#13901)

wusheng Wed, 10 Jun 2026 05:56:43 -0700

This is an automated email from the ASF dual-hosted git repository.

wu-sheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking.git



The following commit(s) were added to refs/heads/master by this push:
     new d30d022354 SWIP-15: container-level instances and post-review corrections (#13901)
d30d022354 is described below

commit d30d022354e297079e40389235a0527d13c99cb8
Author: 吴晟 Wu Sheng <[email protected]>
AuthorDate: Wed Jun 10 20:55:55 2026 +0800

    SWIP-15: container-level instances and post-review corrections (#13901)
---
 docs/en/swip/SWIP-15.md | 470 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 326 insertions(+), 144 deletions(-)

diff --git a/docs/en/swip/SWIP-15.md b/docs/en/swip/SWIP-15.md
index c72248558e..3ad4853fb8 100644
--- a/docs/en/swip/SWIP-15.md
+++ b/docs/en/swip/SWIP-15.md
@@ -28,31 +28,36 @@ Three things changed underneath it:
 3. **SkyWalking replaced the bundled booster UI with the Horizon UI.** The OAP backend no longer ships
    dashboard JSON (dropped in #13877); BanyanDB has not yet been ported to Horizon UI at all. Horizon
    UI is config-driven, has a real **Service → Instance → Endpoint** hierarchy, surfaces per-instance
-   attributes, and can hide panels that have no data — and, with a small enhancement, can drive panel
-   visibility from instance **attributes** (role / tier).
+   attributes, and gates widget visibility through structured, server-evaluated `visibleWhen`
+   predicates — data presence and instance-**attribute** equality ship today
+   ([horizon-ui #46](https://github.com/apache/skywalking-horizon-ui/pull/46)); a small
+   extension (membership / negation operators) completes role/tier-driven dashboards.
 
 The current feature does none of this. It models **each node as its own `Service`**
 (`service(['host_name'], Layer.BANYANDB)`), so a cluster appears as a pile of unrelated services; it
-never models the cluster, the node role, the tier, or the group; and it still references metrics that
-BanyanDB removed (an `etcd`-era operation rate, a Prometheus `up`-derived "active instances", and the
-pre-refactor `queue_sub_total_msg_sent_err` family).
+never models the cluster, the node role, the tier, or the group; and it still ships stale or misleading
+metrics (an operation rate still named after the retired `etcd` registry, a Prometheus `up`-derived
+"active instances" that under the FODC proxy would describe the proxy rather than any node, and the
+`queue_sub_total_msg_sent_err` family, which BanyanDB removed).
 
 **This SWIP proposes to discard that model and rebuild BanyanDB self-observability around the cluster /
 node / group reality**, matching the upstream FODC-proxy metric catalog, and to design the Horizon UI
-side — a net-new BanyanDB layer dashboard whose node view **adapts to the selected node's role and
-tier** — including the one Horizon UI enhancement that makes attribute-driven dashboards possible.
+side — a net-new BanyanDB layer dashboard whose instance view **adapts to the selected container's
+role and tier** — including the small Horizon UI entity-gate extension that completes attribute-driven
+dashboards.
 
 ### Goals
 
 - Model a BanyanDB **cluster** as a single SkyWalking `Service`.
-- Model each **node** as a `ServiceInstance`, carrying its **role** and **tier** as instance
-  attributes, so the UI can show "what this node is".
+- Model each **container** (`pod_name` + `container_name`) as a `ServiceInstance`, carrying its
+  **role** and **tier** as instance attributes, so the UI can show "what this container is".
 - Model each **group** as an `Endpoint` of the cluster.
 - Mirror the upstream FODC-proxy metric catalog faithfully (the two-dashboard split becomes the
   Instance and Endpoint views).
-- Make the **node dashboard dynamic** — a liaison node shows ingestion/queue/publish panels, a data
-  node shows storage/index/subscribe panels, and the tier refines the data view — first via the
-  data-presence mechanism that already exists, then via a proposed attribute predicate.
+- Make the **instance dashboard dynamic** — a liaison container shows ingestion/queue/publish panels, a
+  data container shows storage/index/subscribe panels, a lifecycle container shows migration panels, and
+  the tier refines the data view — via the structured `visibleWhen` gates Horizon UI already evaluates
+  (data presence + attribute equality), completed by a proposed membership/negation extension.
 
 ### Non-goals
 
@@ -72,18 +77,18 @@ tier** — including the one Horizon UI enhancement that makes attribute-driven
 ────────────────                         ─────────────                         ──────────
 ┌─ liaison node ─┐  FODC agent ┐                                       BANYANDB layer
 │  :2121 /metrics │ (sidecar)  │                                       ┌───────────────────────┐
- └────────────────┘            │                                       │ Root: cluster list    │
- ┌─ data hot ─────┐  FODC agent ├─► FODC proxy ──► OTel Collector ──►  │ Service: cluster KPIs │
- │  :2121 /metrics │ (sidecar)  │   :17913          (prometheus recv,   │ Instance: node, panels│
- └────────────────┘            │   /metrics         adds `cluster`      │   adapt to role/tier  │
+ └────────────────┘            │                                        │ Root: cluster list    │
+ ┌─ data hot ─────┐  FODC agent ├─► FODC proxy ──► OTel Collector ──►   │ Service: cluster KPIs │
+ │  :2121 /metrics │ (sidecar)  │   :17913          (prometheus recv,   │ Instance: container,  │
+ └────────────────┘            │   /metrics         adds `cluster`      │   adapts to role/tier │
  ┌─ data warm ────┐  FODC agent │   single target,  label) ──OTLP──►    │ Endpoint: group       │
- │  :2121 /metrics │ (sidecar)  │   per-node labels      │              └───────────────────────┘
+ │  :2121 /metrics │ (sidecar)  │   identity labels      │              └───────────────────────┘
  └────────────────┘            ┘   node_role/pod_name/   │                        ▲
  ┌─ data cold ────┐                container_name/        ▼                        │ MQE over
  │  :2121 /metrics │                node_type     receiver-otel ──► MAL            │ GraphQL
  └────────────────┘                              otel-rules/banyandb/*  ───────────┘ execExpression
                                                   ├ banyandb-service.yaml   → Service  (cluster)
-                                                  ├ banyandb-instance.yaml  → Instance (node + attrs)
+                                                  ├ banyandb-instance.yaml  → Instance (container + attrs)
                                                   └ banyandb-endpoint.yaml  → Endpoint (group)
                                                           │
                                                           ▼
@@ -100,41 +105,62 @@ identity.
 
 ### 1. Entity model
 
-| SkyWalking entity            | BanyanDB concept                              | Identity source (label)                 |
-| ---------------------------- | --------------------------------------------- | --------------------------------------- |
-| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**                      | `cluster` (injected by the collector)   |
-| `ServiceInstance`            | one **node**                                  | `pod_name` (e.g. `banyandb-data-hot-0`) |
-| &nbsp;&nbsp;↳ attribute `node_role` | node **role**                          | `container_name` (`liaison` / `data`)   |
-| &nbsp;&nbsp;↳ attribute `node_type` | data-node **tier**                     | `node_type` (`hot` / `warm` / `cold`)   |
-| `Endpoint`                   | one **group** (storage partition)             | `group` (`measure-default`, …)          |
-
-A standalone BanyanDB is the degenerate case: one cluster, one node whose role is `standalone` (all
-roles co-resident) and no tier.
-
-**Why role/tier are instance attributes, not separate services or endpoints.** A node's identity is
-its `pod_name`; its role and tier are *properties of that node*, which is exactly what
-`InstanceTraffic.properties` (the UI "Attributes" panel) is for. Keeping the cluster as the single
-service means the node list, the group list, and cluster-wide KPIs all live under one entity the
-operator can reason about — and it makes the node dashboard able to adapt to the selected node's
-attributes.
+| SkyWalking entity            | BanyanDB concept                          | Identity source (label)                                |
+| ---------------------------- | ----------------------------------------- | ------------------------------------------------------ |
+| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**                  | `cluster` (injected by the collector)                  |
+| `ServiceInstance`            | one **container** on a node               | `pod_name` + `container_name` (composite)              |
+| &nbsp;&nbsp;↳ attribute `container_name` | container **role** (discriminator) | `liaison` / `data` / `lifecycle`                  |
+| &nbsp;&nbsp;↳ attribute `node_type` | data-node **tier**                 | `hot` / `warm` / `cold` (data containers only; `n/a` elsewhere) |
+| &nbsp;&nbsp;↳ attribute `node_role` | role enum (coarse)                 | `ROLE_LIAISON` / `ROLE_DATA`                           |
+| &nbsp;&nbsp;↳ attribute `pod_name`  | host pod (sibling key)             | `demo-banyandb-data-hot-0`                             |
+| `Endpoint`                   | one **group** (storage partition)         | `group` (`sw_metricsMinute`, …)                        |
+
+All four labels are attached as instance attributes **verbatim** (not renamed), because the Horizon UI
+deployment/topology component groups the intra-cluster instance graph by them: `clusterBy` =
+`node_role` + `node_type`, `siblingBy` = `pod_name`, `roleBy` = `container_name`. Emitting the raw
+label names keeps the OAP attribute bag and the UI grouping config in lockstep.
+
+**Why the instance is a container, not a `pod_name`.** `pod_name` is **not unique per metrics
+emitter**: a data hot/warm pod co-hosts a `lifecycle` migration sidecar that reports under the *same*
+`pod_name` (verified on the live cluster — `demo-banyandb-data-hot-0` emits both `container_name=data`
+and `container_name=lifecycle`). Keying the instance on `pod_name` alone would silently merge the two
+series. The instance identity is therefore `pod_name` + `container_name`, and `container_name` — not
+`node_role` — is the role discriminator: `node_role` carries only `ROLE_LIAISON` / `ROLE_DATA` on a
+healthy cluster (it stays `ROLE_DATA` on the lifecycle sidecar, and the FODC agent maps unresolved or
+meta-only nodes to a transient `ROLE_UNSPECIFIED`), whereas `container_name` cleanly separates
+`liaison` / `data` / `lifecycle`. A standalone BanyanDB is the degenerate case: one cluster, one node,
+one `container_name=standalone`, no tier.
+
+**Why container/tier are instance attributes, not separate services or endpoints.** A container's role
+and tier are *properties of that instance*, which is exactly what `InstanceTraffic.properties` (the UI
+"Attributes" panel) is for. Keeping the cluster as the single service means the instance list, the
+group list, and cluster-wide KPIs all live under one entity the operator can reason about — and it
+makes the instance dashboard able to adapt to the selected container's attributes.
 
 ### 2. Scrape source and label scheme (FODC proxy only)
 
 SkyWalking scrapes the **FODC proxy `/metrics`** (default `:17913`) as the single Prometheus target.
-The proxy aggregates every node's metrics and stamps four identity labels onto each sample (verified in
-the FODC agent's `ParseWithNodeLabels`):
+The proxy aggregates every container's metrics and stamps four identity labels onto each sample
+(verified in the FODC agent's `ParseWithNodeLabels` and against the live cluster):
+
+| Label            | Value                                          | Used for                                              |
+| ---------------- | ---------------------------------------------- | ----------------------------------------------------- |
+| `pod_name`       | node identity, e.g. `banyandb-data-hot-0`      | instance name (part 1) — **not unique**, see below    |
+| `container_name` | `liaison` / `data` / `lifecycle`               | instance name (part 2) + attribute `container_name` (the role discriminator) |
+| `node_role`      | raw enum `ROLE_LIAISON` / `ROLE_DATA` (transiently `ROLE_UNSPECIFIED`) | **not** the discriminator — coarser than `container_name`, stays `ROLE_DATA` on the lifecycle sidecar |
+| `node_type`      | `hot` / `warm` / `cold` (data containers only) | instance attribute `node_type` (tier)                 |
 
-| Label            | Value                                        | Used for                          |
-| ---------------- | -------------------------------------------- | --------------------------------- |
-| `pod_name`       | full node identity, e.g. `banyandb-data-hot-0` | instance name                   |
-| `container_name` | `liaison` / `data` (the role discriminator)  | instance attribute `node_role`    |
-| `node_role`      | raw enum `ROLE_LIAISON` / `ROLE_DATA`        | (available; `container_name` preferred for clean values) |
-| `node_type`      | `hot` / `warm` / `cold` (data nodes only)    | instance attribute `node_type` (tier) |
+`pod_name` alone does **not** identify an instance: on the live cluster the four data hot/warm pods
+each run two containers (`data` + `lifecycle`) under one `pod_name`, so the instance key is
+`pod_name` + `container_name`.
 
 All original BanyanDB labels are preserved on every sample: `group`, `service`, `method`, `operation`,
-`remote_node`, `remote_role`, `remote_tier`, `error_type`, `kind`, `path`, `type`, `name`, `le`, …. The
-Prometheus-synthesized `instance` / `job` / `up` describe the **proxy**, not individual nodes — node
-liveness is derived from the always-present per-node gauge `banyandb_system_up_time`, never from `up`.
+`remote_node`, `remote_role`, `remote_tier`, `error_type`, `kind`, `path`, `type`, `seg`, `shard`,
+`le`, …. Note `service` is BanyanDB's internal **data-model module** (`measure` / `stream` / `trace` /
+`property` / `group`) — a workload facet, **never** a SkyWalking service identity. The
+Prometheus-synthesized `instance` / `job` / `up` describe the **proxy**, not individual containers —
+node liveness is derived from the always-present per-container gauge `banyandb_system_up_time`, never
+from `up`.
 
 **Collector scrape job (illustrative — operator configuration, not a shipped file):**
 
@@ -168,26 +194,64 @@ filter: "{ tags -> tags.job_name == 'banyandb-monitoring' }"
 # banyandb-service.yaml  → cluster
 expSuffix: service(['cluster'], Layer.BANYANDB)
 
-# banyandb-instance.yaml → node, with role + tier as attributes
+# banyandb-instance.yaml → container (a node may run >1 container), role + tier as attributes
 expSuffix: |-
   service(['cluster'], Layer.BANYANDB)
-  .instance(['cluster'], '::', ['pod_name'], '', Layer.BANYANDB,
-            { tags -> ['node_role': tags.container_name, 'node_type': tags.node_type ?: 'n/a'] })
+  .instance(['cluster'], '::', ['pod_name', 'container_name'], '@', Layer.BANYANDB,
+            { tags -> ['node_role':      tags.node_role,
+                       'node_type':      tags.node_type ?: 'n/a',
+                       'pod_name':       tags.pod_name,
+                       'container_name': tags.container_name] })
 
 # banyandb-endpoint.yaml → group
 expSuffix: endpoint(['cluster'], ['group'], Layer.BANYANDB)
 ```
 
-The 6-argument `.instance(...)` overload's properties closure is the standard, precedented mechanism for
+The instance key is the pair `['pod_name', 'container_name']` joined by `'@'` (signature
+`instance(serviceKeys, serviceDelimiter, instanceKeys, instanceDelimiter, layer, propertiesExtractor)`),
+so the four `data` hot/warm pods surface as distinct `…@data` and `…@lifecycle` instances rather than
+colliding. The 6-argument overload's properties closure is the standard, precedented mechanism for
 attaching labels as instance attributes (the same shape used by `k8s-instance.yaml`). The attributes
-ride entirely on the scraped labels — no separate update API.
+ride entirely on the scraped labels — no separate update API. (Two implementation notes: the MAL v2
+grammar supports the Elvis operator inside a map-literal value, but no shipped rule combines the two
+yet — the implementation PR should pin this exact closure shape with a compile test. And `language` is
+the one reserved property key — the instance query maps it to the language field instead of an
+attribute; none of these four labels collides with it.)
 
 ### 3. Metric catalog → MAL rules
 
 The redesigned rules mirror the upstream FODC-proxy catalog. The two upstream Grafana boards map onto
-two SkyWalking scopes — **Nodes → Instance** (per `pod_name`), **Workload → Endpoint** (per `group`) —
-plus a small **Service** summary for cluster KPIs. Source metric names below are verified against
-BanyanDB `origin/main` (the base of the upstream observability PR).
+two SkyWalking scopes — **Nodes → Instance** (per `pod_name` + `container_name`), **Workload →
+Endpoint** (per `group`) — plus a small **Service** summary for cluster KPIs. Source metric names
+below are verified against the **live demo cluster** — which runs upstream `main` builds (the
+validation pull used the showcase-pinned `main` image of 2026-06-09) — and against BanyanDB
+`origin/main` source. The upstream observability PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) (open; docs and Grafana dashboards
+only, no metric code) documents the same catalog and defines the two boards this design mirrors.
+
+> **Metric-name prefix (build-critical).** The sketches below drop a common prefix for readability.
+> On the wire **every BanyanDB-native family carries the `banyandb_` prefix** (`banyandb_measure_total_written`,
+> `banyandb_liaison_grpc_total_started`, `banyandb_system_disk`, …) — the MAL rules must use the full
+> prefixed name. The **only** exceptions are the standard Go-runtime and process exporter families
+> `go_*` / `process_*`, which are **bare** (no prefix) and are referenced as-is. Every error counter
+> this catalog references is lazily registered and emits nothing until the first error fires
+> (`banyandb_liaison_grpc_total_err`, `banyandb_liaison_grpc_total_stream_msg_received_err`,
+> `banyandb_queue_pub_total_err`, the `*_total_sync_loop_err` family), and the lifecycle last-run
+> gauges (`banyandb_lifecycle_last_run_*`, BanyanDB #1167) post-date the build the demo pull validated;
+> every other cited family was present in that pull.
+>
+> **Sketch notation (PromQL-flavored).** Source expressions are written PromQL-style for readability;
+> the MAL forms differ mechanically. **(1)** No `or vector(0)` guard exists in MAL — nor is one
+> needed: an unfired family resolves to the empty sample family, MAL's `+` treats an empty operand as
+> identity, and a rule is skipped only when *all* referenced families are absent — so an error sum
+> emits as soon as any one term fires, and a fully healthy cluster shows no series at all (dashboards
+> should render absent as 0). **(2)** MAL arithmetic joins samples on exact label equality, so each
+> term must be aggregated to the same label set (e.g. `.sum(['cluster'])`) before `+`. **(3)**
+> `count(...) by (...)` maps to MAL's multi-label `count([...])`; `histogram_quantile(0.99, …_bucket)`
+> maps to `.histogram().histogram_percentile([99])` on the `le`-labeled base family (no `_bucket`
+> suffix remains after OTLP conversion); and `time() - <metric>` is computed at **ingest** in the MAL
+> rule — MAL ships `time()` (the shipped `envoy-ca.yaml` cert-staleness metric is the precedent),
+> while MQE has no current-time function, so it cannot be computed at query time.
 
 #### 3.1 Service scope — cluster summary (`banyandb-service.yaml`)
 
@@ -195,15 +259,15 @@ BanyanDB `origin/main` (the base of the upstream observability PR).
 | --------------------------- | ------------------------ | ------------------------------------------------------------------------------------------ |
 | `cluster_write_rate`        | cluster writes/s         | `rate(measure_total_written) + rate(stream_tst_total_written) + rate(trace_tst_total_written)` |
 | `cluster_query_rate`        | cluster queries/s        | `rate(liaison_grpc_total_started{method='query'})`                                          |
-| `cluster_error_rate`        | cluster errors/min       | `liaison_grpc_total_err + _stream_msg_received_err + schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err` (×60, each `or vector(0)`) |
-| `reporting_nodes`           | live node count by role  | `count(system_up_time) by (container_name)`                                                 |
+| `cluster_error_rate`        | cluster errors/min       | `liaison_grpc_total_err + liaison_grpc_total_stream_msg_received_err + schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err` (×60; all lazily registered — see sketch notation above) |
+| `reporting_instances`       | live container count by role | `count(system_up_time) by (container_name)`                                              |
 | `total_cpu_cores`           | cluster CPU capacity     | `sum(system_cpu_num)`                                                                       |
 | `total_memory_used`         | cluster memory used      | `sum(system_memory_state{kind='used'})`                                                     |
 | `total_disk_used`           | cluster disk used        | `sum(system_disk{kind='used'})`                                                             |
 
-#### 3.2 Instance scope — per node (`banyandb-instance.yaml`)
+#### 3.2 Instance scope — per container (`banyandb-instance.yaml`)
 
-**All roles** (every node emits these — the "Nodes" board):
+**All roles** (every container emits these — the "Nodes" board):
 
 | Metric (`meter_banyandb_instance_*`) | Source                                                            |
 | ------------------------------------ | ---------------------------------------------------------------- |
@@ -218,26 +282,44 @@ BanyanDB `origin/main` (the base of the upstream observability PR).
 | `gc_pause_avg`                      | `rate(go_gc_duration_seconds_sum) / rate(go_gc_duration_seconds_count)` |
 | `heap_inuse` / `heap_next_gc` / `alloc_rate` | `go_memstats_heap_inuse_bytes` / `go_memstats_next_gc_bytes` / `rate(go_memstats_alloc_bytes_total)` |
 
-**Liaison-only** (front door; hidden on data nodes — see [§4](#4-dynamic-metrics-by-role-and-tier)):
+**Liaison-only** (front door; hidden on data containers — see [dynamic metrics by role and tier](#4-dynamic-metrics-by-role-and-tier)):
 
 | Metric (`meter_banyandb_instance_*`)  | Source                                                                  |
 | ------------------------------------- | ----------------------------------------------------------------------- |
 | `query_rate_by_service`               | `rate(liaison_grpc_total_started{method='query'}) by (service)`         |
-| `grpc_error_rate`                     | `rate(liaison_grpc_total_err) by (service, method)` (+ `_stream_msg_received_err`; both lazily registered) |
+| `grpc_error_rate`                     | `rate(liaison_grpc_total_err) by (service, method)` (+ `liaison_grpc_total_stream_msg_received_err`; both lazily registered) |
 | `non_query_op_rate`                   | `rate(liaison_grpc_total_started{method!='query'}) by (method)` |
 | `write_rate`                          | `rate({measure,stream_tst,trace_tst}_total_written)`                    |
 | `publish_throughput` / `publish_latency_p99` | `rate(queue_pub_total_finished) by (operation)` / `histogram_quantile(0.99, …queue_pub_total_latency_bucket)` |
 | `wqueue_file_parts` / `wqueue_mem_part` / `wqueue_pending` | `{measure,stream_tst,trace_tst}_total_file_parts` / `_total_mem_part` / `_pending_data_count` |
 
-**Data-only** (backend; hidden on liaison nodes):
+**Data-only** (backend; hidden on liaison containers):
 
 | Metric (`meter_banyandb_instance_*`)            | Source                                                              |
 | ----------------------------------------------- | ------------------------------------------------------------------ |
 | `total_data`                                    | `{measure,stream_tst,trace_tst}_total_file_elements`               |
 | `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` | `rate(*_total_merge_loop_started)` / `…_merge_latency{type='file'}` / `…_merged_parts{type='file'}` |
-| `series_write_rate` / `series_term_search_rate` / `total_series` | `measure_inverted_index_total_updates` / `_term_searchers_started` / `_doc_count`; `stream_storage_inverted_index_*` |
+| `series_write_rate` / `series_term_search_rate` / `total_series` | `measure_inverted_index_total_updates` / `_total_term_searchers_started` / `_total_doc_count`; `stream_storage_inverted_index_*` |
 | `stream_tst_write_rate` / `stream_tst_term_search_rate` / `stream_tst_total_docs` | `stream_tst_inverted_index_*` |
 | `queue_sub_throughput` / `queue_sub_latency_p99` (per `operation`) | `rate(queue_sub_total_started/finished) by (operation)` / `histogram_quantile(0.99, …queue_sub_total_latency_bucket) by (operation)` |
+| `retention_disk_usage_percent` / `retention_cooldown` | `storage_retention_{measure,stream,trace}_disk_usage_percent` / `_forced_retention_cooldown_seconds` |
+
+**Lifecycle-only** (the tier-migration sidecar co-located on `hot`/`warm` data pods; `container_name == 'lifecycle'`):
+
+| Metric (`meter_banyandb_instance_*`) | Source                                                              |
+| ------------------------------------ | ------------------------------------------------------------------ |
+| `lifecycle_cycles`                   | `lifecycle_cycles_total` (cumulative migration cycles)            |
+| `lifecycle_last_run`                 | `lifecycle_last_run_timestamp_seconds` — epoch of the last cycle's start; "time since last sync" = `time() - <metric>`, computed at ingest in the MAL rule (MQE has no `time()`) |
+| `lifecycle_last_run_success`         | `lifecycle_last_run_success` (`1` = last cycle OK, `0` = failed)  |
+
+> **Lifecycle last-run signals.** The two gauges above were added in BanyanDB
+> [#1167](https://github.com/apache/skywalking-banyandb/pull/1167) (merged to `main` on 2026-06-09,
+> post-dating the build the demo pull validated) — both are
+> stamped on every cycle end (success, error, or panic-recovered), so they drive a "time since last
+> sync" staleness panel and a "last sync OK?" status panel directly. They emit only **after the first
+> migration runs**, so the staleness panel must guard the never-run case. The same PR also stamps the
+> lifecycle's sender identity onto its migration publisher, so a destination data node's `queue_sub`
+> `remote_node` / `remote_role` / `remote_tier` now identify the migration source (were empty before).
 
 #### 3.3 Endpoint scope — per group (`banyandb-endpoint.yaml`)
 
@@ -250,7 +332,7 @@ nodes per group):
 | `query_latency`                      | `rate(liaison_grpc_total_latency{method='query'}) / rate(…_started{method='query'}) by (group)` |
 | `total_data`                         | `{measure,stream_tst,trace_tst}_total_file_elements by (group)`    |
 | `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` | the merge family `by (group)`                       |
-| `series_write_rate` / `total_series` | inverted-index `_total_updates` / `_doc_count` `by (group)`        |
+| `series_write_rate` / `total_series` | inverted-index `_total_updates` / `_total_doc_count` `by (group)`  |
 | `queue_throughput` / `queue_latency_p99` | `queue_sub` / `queue_pub` `by (operation, group)`             |
 | `publish_bytes`                      | `rate(queue_pub_sent_bytes) by (group)`                            |
 
@@ -262,98 +344,168 @@ nodes per group):
 
 ### 4. Dynamic metrics by role and tier
 
-Different roles expose different metrics, so the **node (Instance) dashboard must adapt to the selected
-node**. Two mechanisms, layered:
+Different roles expose different metrics, so the **instance dashboard must adapt to the selected
+container**. Horizon UI's widget `visibleWhen` is a structured, **server-evaluated** gate (the BFF
+resolves it against data presence or the selected instance's attributes and returns gated-out widgets
+as hidden; legacy free-text predicate strings are no longer parsed and degrade to ungated). Two gate
+kinds, layered:
 
-**(a) Data-presence gating — available today, no UI code.** Horizon UI already supports
-`visibleWhen: "<metric> has value"` on a widget; a panel whose metric returns all-null self-hides. Each
-MAL rule only produces samples for nodes that emit its source metric, so liaison-only metrics are simply
-absent on data instances and vice-versa. This gives correct adaptive behavior out of the box:
+**(a) Data-presence gating — available today, no UI code.** The `mqe`-kind gate hides a widget whose
+expression returns no data. Each MAL rule only produces samples for containers that emit its source
+metric, so liaison-only metrics are simply absent on data instances and vice-versa. This gives correct
+adaptive behavior out of the box:
 
 ```jsonc
 { "id": "wqueue", "title": "Write Queue (wqueue)", "type": "line",
   "expressions": ["meter_banyandb_instance_wqueue_pending"],
-  "visibleWhen": "meter_banyandb_instance_wqueue_pending has value" }
+  "visibleWhen": { "kind": "mqe", "expression": "meter_banyandb_instance_wqueue_pending", "op": "exists" } }
 ```
 
-**(b) Attribute predicate — proposed enhancement (see [§6](#6-horizon-ui-enhancement-entity-attribute-predicate)).**
+**(b) Attribute gating — equality ships today; membership is the proposed extension (see
+[entity-gate membership operators](#6-horizon-ui-enhancement-entity-gate-membership-operators)).**
 Data-presence can't distinguish "wrong role" from "idle but right role", and it still issues the query.
-An attribute predicate keys panel visibility directly on the node's `node_role` / `node_type`
-attributes:
+The `entity`-kind gate keys panel visibility directly on the selected instance's `container_name` /
+`node_type` attributes (meaningful on the Instance scope only):
 
 ```jsonc
-{ "id": "wqueue", "visibleWhen": "#entity.node_role == 'liaison'" }
-{ "id": "cold_tier_note", "visibleWhen": "#entity.node_type == 'cold'" }
+{ "id": "wqueue", "visibleWhen": { "kind": "entity", "attribute": "container_name", "op": "eq", "value": "liaison" } }
+{ "id": "cold_tier_note", "visibleWhen": { "kind": "entity", "attribute": "node_type", "op": "eq", "value": "cold" } }
 ```
 
 This is the precise, declarative form, and it is the natural way to express tier-specific panels (a
-`hot` data node merges constantly; a `cold` node is mostly static).
+`hot` data container merges constantly; a `cold` container is mostly static). The landed gate supports
+`exists` and case-insensitive `eq`; tier *sets* need the proposed `in` operator — until it lands they
+are expressible as duplicated `eq`-gated widget variants.
 
 Role/tier scoping of the catalog:
 
-| Bucket          | Panels                                                                | Predicate                         |
-| --------------- | --------------------------------------------------------------------- | --------------------------------- |
-| **All roles**   | system resources, disk-by-path, network, Go runtime, node uptime      | (always shown)                    |
-| **Liaison**     | gRPC query & errors, non-query ops, write rate, publish throughput & latency, wqueue depth | `#entity.node_role == 'liaison'` |
-| **Data**        | storage totals, merge/compaction, inverted index, subscribe queue     | `#entity.node_role == 'data'`     |
-| **Data + tier** | tier-specific merge/retention hints                                   | `#entity.node_type in (hot,warm)` |
+| Bucket          | Panels                                                                | Entity gate                        |
+| --------------- | --------------------------------------------------------------------- | ---------------------------------- |
+| **All roles**   | system resources, disk-by-path, network, Go runtime, node uptime      | (always shown)                     |
+| **Liaison**     | gRPC query & errors, non-query ops, write rate, publish throughput & latency, wqueue depth | `container_name eq liaison` |
+| **Data**        | storage totals, merge/compaction, inverted index, subscribe queue, retention | `container_name eq data`     |
+| **Data + tier** | tier-specific merge/retention hints                                   | `node_type in (hot, warm)` †       |
+| **Lifecycle**   | migration cycles, last-run time + status                              | `container_name eq lifecycle`      |
+
+† `in` is the proposed extension of [section 6](#6-horizon-ui-enhancement-entity-gate-membership-operators);
+until it lands, use two `eq`-gated widget variants.
 
 ### 5. Dashboards (Horizon UI BANYANDB layer template)
 
 A net-new layer template `apps/bff/src/bundled_templates/layers/banyandb.json` 
(config-driven JSON, one
-file per layer, per-scope widget arrays, MQE expression strings). The design 
mirrors the upstream two
-boards across the SkyWalking hierarchy:
+file per layer keyed by its `key` field — `BANYANDB`, filename lowercased — 
with per-scope widget
+arrays and MQE expression strings). One menu touchpoint exists: Horizon UI 
currently hard-codes the
+`BANYANDB` layer out of the sidebar (`HIDDEN_LAYERS`);
+[horizon-ui #47](https://github.com/apache/skywalking-horizon-ui/pull/47) 
replaces that with a
+config-driven `layers.excluded` list that un-hides BanyanDB — this SWIP rides 
on that change (or an
+equivalent one-line un-hide). The design mirrors the upstream two boards 
across the SkyWalking
+hierarchy:
 
 ```
 BANYANDB layer
-├─ Root            → cluster list (ServiceList), showGroup=false
+├─ Root            → cluster list (the layer landing's service-list picker: header columns + sort)
 ├─ Service (cluster)
 │   └─ Overview KPIs + "Cluster Workload Summary" + "Fleet Overview" capacity
 │       (cluster_write_rate, cluster_query_rate, cluster_error_rate,
-│        reporting_nodes by role, total_cpu/memory/disk)
-├─ Instance (node)   ← the "Nodes" board, made dynamic
+│        reporting_instances by role, total_cpu/memory/disk)
+├─ Instance (container)   ← the "Nodes" board, made dynamic; instance = pod_name@container_name
 │   ├─ All roles: Resources (CPU/RSS/mem%/disk%), Disk by Path, Network, Go Runtime
-│   ├─ Liaison (visibleWhen role==liaison): Ingestion/Query, Registry, Errors,
+│   ├─ Liaison (entity gate container_name eq liaison): Ingestion/Query, Registry, Errors,
 │   │     Publish throughput & p99, Write Queue (wqueue) depth
-│   └─ Data (visibleWhen role==data): Storage totals, Merge, Inverted Index,
-│         Subscribe Queue (per operation: query/file-sync/batch-write/control)
+│   ├─ Data (entity gate container_name eq data): Storage totals, Merge, Inverted Index, Retention,
+│   │     Subscribe Queue (per operation: query/file-sync/batch-write/control)
+│   └─ Lifecycle (entity gate container_name eq lifecycle): migration cycles, last-run time + status
 └─ Endpoint (group)  ← the "Workload" board, by group
     └─ Write rate, Query latency, Total data, Merge, Inverted index, Queue, Publish bytes
 ```
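
The `pod_name@container_name` naming in the tree can be illustrated with a small sketch of why the pod name alone is not a unique key. The `SampleLabels` type and the `instanceName` / `distinct` helpers are hypothetical names used for illustration; the real naming is done by the `banyandb/*` MAL rules, not by TypeScript.

```typescript
// Identity labels relevant to instance naming, as listed in this SWIP.
interface SampleLabels {
  pod_name: string;
  container_name: string;
}

// Instance name = pod_name@container_name (the scheme shown in the tree above).
function instanceName(labels: SampleLabels): string {
  return `${labels.pod_name}@${labels.container_name}`;
}

// Small helper: distinct values, preserving first-seen order.
function distinct(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// One data pod running two containers: the `data` server and the `lifecycle` sidecar.
const samples: SampleLabels[] = [
  { pod_name: "demo-banyandb-data-hot-0", container_name: "data" },
  { pod_name: "demo-banyandb-data-hot-0", container_name: "lifecycle" },
];

const keyedByPod = distinct(samples.map((s) => s.pod_name));
const keyedByInstance = distinct(samples.map((s) => instanceName(s)));
```

Keyed by `pod_name` the two containers collapse into one entry; keyed by the combined name they stay separate, which is what lets the `lifecycle` gate target the sidecar on its own.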
 
 Panel **types/units** follow the upstream Grafana boards for fidelity (stat 
for KPIs; timeseries for
-rates/latencies; table for the per-node health row; `bytes` / `percentunit` / 
`s` / `reqps` / `wps`
-units; disk% and memory% turn red at 80%). The per-node "health table" 
(uptime, CPU cores, RSS, mem%,
-disk%) becomes the Instance-list columns on the Service view.
+rates/latencies; `bytes` / `percentunit` / `s` / `reqps` / `wps` units; disk% 
and memory% turn red at
+80%). The upstream per-node "health table" (uptime, CPU cores, RSS, mem%, 
disk%) maps onto the
+all-roles Resources widgets of the Instance view — Horizon UI's instance list 
deliberately shows only
+name + attributes (the role/tier chips), and per-instance metric columns are 
not assumed by this
+design; if embedded health columns prove necessary later, that is an additive 
Horizon UI enhancement.
 
 This is **design only** — the production `banyandb.json` and its exact widget 
grid are deliberately left
 to the implementation PR in the Horizon UI repository.
 
-### 6. Horizon UI enhancement: `#entity` attribute predicate
+### 6. Horizon UI enhancement: entity-gate membership operators
 
-Horizon UI's widget `visibleWhen` already parses two predicate forms but only 
one is implemented:
+When this SWIP was first drafted, Horizon UI parsed `visibleWhen` as free text 
and stubbed the
+entity-attribute form. That is no longer the upstream state: horizon-ui PR #46 
(merged 2026-06-08)
+replaced the free-text parser with a structured, **BFF-evaluated** union —
 
-- `"<metric> has value"` — implemented (client-side data-presence gating).
-- `"#entity.<key>"` — **parsed but stubbed**: the renderer's `isVisible` 
currently returns `true`
-  unconditionally for any `#entity.*` predicate, with the comment 
*"Entity-attribute predicates need an
-  attributes feed we don't surface yet. Render the widget unconditionally for 
now."*
+- `{ "kind": "mqe", "expression": "<expr>", "op": "exists" }` — data-presence 
gating;
+- `{ "kind": "entity", "attribute": "<key>", "op": "exists" }` /
+  `{ "kind": "entity", "attribute": "<key>", "op": "eq", "value": "<v>" }` — 
entity-attribute gating
+  against the selected instance's attribute feed (`eq` compares 
case-insensitively; meaningful on the
+  Instance scope only, a no-op elsewhere)
 
-The data is already on the wire: the instance list the UI fetches carries each 
instance's
-`attributes [{name,value}]`. The enhancement is to **wire those attributes 
into the predicate
-evaluator** and give the predicate a small comparison grammar:
+— so the attribute feed and the evaluator this section originally proposed 
**already exist upstream**:
+the BFF fetches the selected instance's `attributes [{name,value}]` and 
returns gated-out widgets as
+hidden. Legacy free-text predicates (`"<metric> has value"`, 
`"#entity.<key>"`) are no longer parsed
+and degrade to ungated.
 
-| Predicate form                          | Meaning                                            |
-| --------------------------------------- | -------------------------------------------------- |
-| `#entity.<key>`                         | attribute present and truthy                       |
-| `#entity.<key> == '<v>'` / `!= '<v>'`   | equals / not-equals a literal                      |
-| `#entity.<key> in (<v1>,<v2>)`          | membership                                         |
+What remains for this design is only **membership and negation**:
 
-Scope of the enhancement (design): (1) pass the selected instance's 
`attributes` into the
-`LayerDashboardsView` predicate context; (2) implement the `#entity.*` branch 
of `isVisible` to read
-that context; (3) extend the predicate parser with `==` / `!=` / `in`; (4) 
document it in the Horizon UI
-layer-template authoring docs. It is generic — any layer (K8s node roles, 
gateway tiers, …) benefits;
+| Proposed gate                                                                        | Meaning              |
+| ------------------------------------------------------------------------------------ | -------------------- |
+| `{ "kind": "entity", "attribute": "<key>", "op": "neq", "value": "<v>" }`             | not-equals a literal |
+| `{ "kind": "entity", "attribute": "<key>", "op": "in", "values": ["<v1>", "<v2>"] }`  | membership           |
+
+Scope of the enhancement (design): (1) add the two operator arms to the BFF 
`visibleWhen` schema and
+its entity-gate evaluator; (2) document them in the Horizon UI layer-template 
authoring docs. Until it
+lands, a tier set like `node_type in (hot, warm)` is expressible as two 
`eq`-gated widget variants —
+`in` removes the duplication. It is generic — any layer (K8s node roles, 
gateway tiers, …) benefits;
 BanyanDB is the first consumer. The exact code lands in the Horizon UI 
repository.
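
To make the operator semantics concrete, here is a minimal sketch of an entity-gate evaluator covering the two landed operators and the two proposed ones. It is not the BFF's implementation: the `evalEntityGate` name is hypothetical, the `mqe` kind is left out because it needs a query, and three behaviors are assumptions made for the sketch (that `neq` and `in` compare case-insensitively like `eq`, that a missing attribute satisfies `neq`, and that a missing attribute fails `in`).

```typescript
// Entity gates: `exists` / `eq` are landed (horizon-ui #46); `neq` / `in` are the proposal.
type EntityGate =
  | { kind: "entity"; attribute: string; op: "exists" }
  | { kind: "entity"; attribute: string; op: "eq"; value: string }
  | { kind: "entity"; attribute: string; op: "neq"; value: string }
  | { kind: "entity"; attribute: string; op: "in"; values: string[] };

// Attribute feed entry, matching the `attributes [{name,value}]` shape named above.
interface Attribute {
  name: string;
  value: string;
}

// Hypothetical evaluator: true means the widget stays visible.
function evalEntityGate(gate: EntityGate, attributes: Attribute[]): boolean {
  const found = attributes.filter((a) => a.name === gate.attribute)[0];
  const actual = found?.value.toLowerCase();
  switch (gate.op) {
    case "exists":
      return actual !== undefined;
    case "eq":
      return actual === gate.value.toLowerCase();
    case "neq": // assumption: an absent attribute counts as "not equal"
      return actual !== gate.value.toLowerCase();
    case "in": // assumption: an absent attribute is never a member
      return actual !== undefined && gate.values.some((v) => v.toLowerCase() === actual);
  }
}

const hotData: Attribute[] = [
  { name: "container_name", value: "data" },
  { name: "node_type", value: "hot" },
];
const showTierHint = evalEntityGate(
  { kind: "entity", attribute: "node_type", op: "in", values: ["hot", "warm"] },
  hotData,
);
```

With `in` available, the single gate above replaces the two `eq`-gated widget variants that the tier set needs today.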
 
+### 7. Intra-cluster instance topology (the "deployment" component)
+
+Beyond the per-instance dashboards, the BanyanDB layer adds a **deployment 
view**: the
+container-to-container call graph *within* the single BanyanDB cluster service 
— liaison↔data writes,
+the hot→warm→cold lifecycle migration chain, and inter-liaison gossip. The 
legacy booster UI only ever
+drew instance topology *between two services*; this is a net-new Horizon UI 
component for the
+**one-service** case (landing via horizon-ui PR #47).
+
+**Data path — no query API change.** The component calls
+`getServiceInstanceTopology(clientServiceId, serverServiceId, duration)` with 
the **same** service id
+on both sides. OAP's relation filter is symmetric, so `client == server == 
svc` collapses to
+`source_service_id == dest_service_id == svc`, returning exactly the 
intra-cluster instance relations
+(verified across the BanyanDB / JDBC / ES topology DAOs). Per-node metrics 
evaluate under
+`{ scope: ServiceInstance }`; per-edge metrics under `ServiceInstanceRelation` 
(server + client
+families) — both ordinary MQE.
+
+**Grouping contract.** The component lays the graph out from the instance 
attributes this SWIP emits
+([entity model](#1-entity-model)):
+
+| Config key  | Attribute(s)              | Effect                                                         |
+| ----------- | ------------------------- | -------------------------------------------------------------- |
+| `clusterBy` | `node_role` + `node_type` | one box per role/tier — liaison, data hot/warm/cold            |
+| `siblingBy` | `pod_name`                | a pod = main container + sibling containers (data + lifecycle)  |
+| `roleBy`    | `container_name`          | per-role node metrics (`liaison` / `data` / `lifecycle`)        |
+
+Per-role node MQE binds to the `meter_banyandb_instance_*` metrics from the 
catalog above — e.g.
+liaison → `query_rate_by_service`, data → `write_rate` / `disk_usage_percent`, 
lifecycle →
+`lifecycle_cycles` / `lifecycle_last_run_success`. Only `container_name` ∈
+{`liaison`, `data`, `lifecycle`} exists on the wire — there is **no `fodc` 
container** (the FODC agent
+publishes no self-metrics through the proxy), so a `fodc` role is not modeled.
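
The grouping contract can be sketched as plain key-based bucketing over instance attributes. This is an illustration only, not the horizon-ui component: the `Instance` type, the `groupKey` / `groupNames` helpers, the `n/a` placeholder, the liaison pod name, and modeling every config key as a list of attribute names are all assumptions. The `lifecycle` sample deliberately omits `node_type`, since whether the sidecar carries it is not assumed here.

```typescript
// Instance as a topology component might see it: a name plus flat attributes.
interface Instance {
  name: string;
  attributes: Record<string, string>;
}

// Config keys and attribute names from the grouping-contract table.
const grouping = {
  clusterBy: ["node_role", "node_type"],
  siblingBy: ["pod_name"],
  roleBy: ["container_name"],
};

// Join the values of the given attributes; "n/a" for a missing one is a placeholder chosen here.
function groupKey(instance: Instance, keys: string[]): string {
  return keys.map((k) => instance.attributes[k] ?? "n/a").join("/");
}

// Bucket instance names by the joined attribute key.
function groupNames(instances: Instance[], keys: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const instance of instances) {
    const key = groupKey(instance, keys);
    const members = groups[key] ?? [];
    members.push(instance.name);
    groups[key] = members;
  }
  return groups;
}

const instances: Instance[] = [
  { name: "demo-banyandb-liaison-0@liaison",
    attributes: { pod_name: "demo-banyandb-liaison-0", container_name: "liaison", node_role: "ROLE_LIAISON" } },
  { name: "demo-banyandb-data-hot-0@data",
    attributes: { pod_name: "demo-banyandb-data-hot-0", container_name: "data", node_role: "ROLE_DATA", node_type: "hot" } },
  { name: "demo-banyandb-data-hot-0@lifecycle",
    attributes: { pod_name: "demo-banyandb-data-hot-0", container_name: "lifecycle", node_role: "ROLE_DATA" } },
];

const pods = groupNames(instances, grouping.siblingBy);  // sibling containers per pod
const roles = groupNames(instances, grouping.roleBy);    // per-role metric binding
const boxes = groupNames(instances, grouping.clusterBy); // role/tier boxes
```

Here `pods` places the `data` and `lifecycle` containers under the same pod, while `roles` keeps them apart so each can bind its own per-role metrics.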
+
+**Open dependency — a MAL `SERVICE_INSTANCE_RELATION` scope.** This feature is 
MAL-only: every BanyanDB
+entity, metric, and attribute here is produced by the `banyandb/*` MAL rules. 
MAL builds relations
+through `MeterEntity` / `ScopeType`, which ships `SERVICE_RELATION` and 
`PROCESS_RELATION` (the latter
+already powers the eBPF process topology via `network-profiling.yaml`) — but 
it has **no
+`SERVICE_INSTANCE_RELATION` scope** and no 
`SampleFamily.instanceRelation(...)` builder. So MAL cannot
+emit the instance-relation metric that `getServiceInstanceTopology` reads, and 
on a metrics-only
+BanyanDB the deployment graph is **empty** — the Horizon UI component 
(horizon-ui PR #47) renders that empty
+state by design until the scope lands (its earlier preview mock has been 
dropped).
+Closing the gap means adding that third relation scope (a 
`SERVICE_INSTANCE_RELATION` `ScopeType` +
+`MeterEntity` factory + `instanceRelation(...)` builder + entity description, 
mirroring the two that
+ship), fed by the queue `remote_node` / `remote_role` / `remote_tier` labels 
(now carrying the
+lifecycle sender identity per BanyanDB #1167). That is MAL-**engine** code 
(`server-core` +
+`meter-analyzer`), which exceeds this SWIP's [config-only 
non-goals](#non-goals); it is tracked under
+[future work](#future-work). The component, the query path, and the grouping 
contract above are ready
+the moment that scope lands.
+
 ## Feasibility and precedent
 
 Verified against the OAP and Horizon UI source — **no OAP core / MAL / 
receiver change is required**:
@@ -367,21 +519,35 @@ Verified against the OAP and Horizon UI source — **no OAP 
core / MAL / receive
   `EndpointTraffic` whenever the endpoint name is non-empty; `EndpointTraffic` 
is `supportUpdate=true`
   and is listed by GraphQL `findEndpoint` (empty keyword ⇒ list all), which 
the BanyanDB metadata DAO
   serves from the traffic table without touching any trace data.
-- **Layer.** `Layer.BANYANDB` (ordinal 43) already exists; layer dashboards 
are auto-discovered by the
-  UI from the template's own `layer` field — no menu code change.
+- **Layer.** `Layer.BANYANDB` (ordinal 43) already exists; layer dashboards 
are auto-discovered from
+  the template's own `key` field. The one menu touchpoint: Horizon UI's 
hard-coded hidden-layers set
+  currently drops `BANYANDB` from the sidebar — un-hidden by horizon-ui PR 
#47's config-driven
+  `layers.excluded` (see 
[Dashboards](#5-dashboards-horizon-ui-banyandb-layer-template)).
 
 ## Live validation
 
 The entity scheme and the metric catalog above were validated against a **live 
7-node BanyanDB
 cluster** — the public SkyWalking demo's FODC proxy `/metrics` (2 liaison + 5 
data: `hot×2`, `warm×2`,
-`cold×1`). Findings:
-
-- **All four identity labels are present and exactly as designed.** Every 
sample carries `pod_name`
-  (e.g. `demo-banyandb-data-hot-0`), `node_role` (`ROLE_LIAISON` / 
`ROLE_DATA`), `container_name`
-  (`liaison` / `data`), and — on **data nodes only** — `node_type` (`hot` / 
`warm` / `cold`). Liaison
-  nodes carry no `node_type`, so the instance closure defaults the tier 
attribute (`tags.node_type ?:
-  'n/a'`). This validates Service = `cluster`, Instance = `pod_name`, 
attributes `node_role` /
-  `node_type`.
+`cold×1`), running an upstream `main` build (the showcase-pinned image of 
2026-06-09; upstream PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) — open, docs 
and Grafana dashboards
+only — documents the same catalog). The live `/metrics` pull is the authoritative wire reference;
+it exposed 393 metric families. Findings:
+
+- **Instance must be `pod_name` + `container_name`, not `pod_name`.** Every 
sample carries `pod_name`,
+  `node_role` (`ROLE_LIAISON` / `ROLE_DATA` observed; the FODC agent stamps a 
transient
+  `ROLE_UNSPECIFIED` for unresolved or meta-only nodes), `container_name`
+  (`liaison` / `data` / **`lifecycle`**), and — on **data containers only** — 
`node_type`
+  (`hot` / `warm` / `cold`). Crucially, the four `data` hot/warm pods each run 
**two containers under
+  one `pod_name`** (`…@data` and `…@lifecycle`), so `pod_name` is not a unique 
instance key and
+  `node_role` is not the discriminator (it reads `ROLE_DATA` on the lifecycle 
sidecar). This validates
+  Service = `cluster`, Instance = `pod_name` + `container_name`, attributes 
`container_name` / `node_type`.
+- **The `lifecycle` migrator surfaces as its own container instance.** It 
co-locates on the `hot`/`warm`
+  data pods and emits `banyandb_lifecycle_cycles_total` plus the shared 
`system_*` / `go_*` /
+  `process_*` runtime families — 50 families under `container_name=lifecycle` 
in the demo pull. The
+  `last_run_timestamp_seconds` / `last_run_success` gauges (BanyanDB #1167) 
post-date the demo's
+  deployed build, so they were absent from that pull but are present on `main` 
and emit once a migration
+  cycle runs (the showcase has since pinned the BanyanDB #1167 merge SHA, so a 
redeployed demo will
+  expose them).
 - **The queue model is confirmed verbatim.** `banyandb_queue_sub_*` / 
`queue_pub_*` carry
   `operation` ∈ {`batch-write`, `control`, `file-sync`, `query`}, plus 
`group`, `remote_node`,
   `remote_role` (`liaison` / `data`) and `remote_tier` (`hot` / …); 
`total_latency` is a histogram. The
@@ -391,16 +557,23 @@ cluster** — the public SkyWalking demo's FODC proxy 
`/metrics` (2 liaison + 5
   `liaison_grpc_total_started{group,method,service}`, `*_total_written{group}`,
   `*_inverted_index_*{group,seg,node_type}`. Data-node metrics also carry 
`node_type`, so the by-group
   endpoint view can be refined by tier.
-- **One reconciliation vs. the upstream doc.** Schema/registry operations are 
**not** exposed as
-  `banyandb_liaison_grpc_total_registry_*` (those series do not exist on the 
live cluster) — they are a
-  **separate `banyandb_schema_server_grpc_*` scope** (`total_started{method}`, 
`_finished`, `_latency`,
-  `_err`), running on the nodes hosting the metadata/schema server. The tables 
above use the
-  `schema_server_grpc_*` names accordingly.
+- **Two registry/schema scopes coexist (corrected).** The live cluster exposes 
**both**
+  `banyandb_liaison_grpc_total_registry_*` (`group`, `service`, `method`; on 
liaison containers) **and**
+  a separate `banyandb_schema_server_grpc_*` scope (`total_started{method}`, 
`_finished`, `_latency`,
+  `_err`; on the data container hosting the metadata/schema server). The 
`cluster_error_rate` and
+  registry panels should pick one deliberately — they are different layers, 
not aliases. (An earlier
+  draft claimed the `liaison_grpc_total_registry_*` series were absent; 
BanyanDB `main` has emitted
+  them since BanyanDB #517.)
+- **`storage_retention_*` is a real data-only family** not in earlier drafts:
+  `storage_retention_{measure,stream,trace}_disk_usage_percent{service}` and
+  `_forced_retention_cooldown_seconds{service}` — the source for the 
data-container retention panels.
 - **Error counters are absent on a healthy cluster, by design.** 
`liaison_grpc_total_err`,
-  `*_total_sync_loop_err` and `queue_pub_total_err` are label-dimensioned 
counters that emit no series
-  until the first error — so the rules must guard each error term with `or 
vector(0)`, exactly as the
-  upstream "Error Rate" panel does. Their non-error siblings (`_started` / 
`_finished` / `_latency` /
-  `_bytes`) are all present.
+  `liaison_grpc_total_stream_msg_received_err`, `*_total_sync_loop_err` and 
`queue_pub_total_err` are
+  label-dimensioned counters that emit no series until the first error. The 
upstream Grafana "Error
+  Rate" panel guards each term with PromQL's `or vector(0)`; the MAL rules 
need no guard — an absent
+  family is the identity for MAL's `+` (see the sketch-notation note in the 
metric catalog, section 3)
+  — the summed metric simply has no series until the first error fires. Their 
non-error siblings
+  (`_started` / `_finished` / `_latency` / `_bytes`) are all present.
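
The "absent family is the identity for `+`" point in the last finding can be pictured with a toy model. This is not MAL's implementation: the `Series` type, the `sumFamilies` helper and the label-set strings are hypothetical, and a family that has emitted nothing is modeled as `undefined`.

```typescript
// One metric family: values keyed by a label-set string (hypothetical encoding).
type Series = Record<string, number>;

// Sum families per label set; an absent family contributes nothing, so no guard term is needed.
function sumFamilies(families: Array<Series | undefined>): Series {
  const total: Series = {};
  for (const family of families) {
    if (family === undefined) continue; // absent family: identity for the sum
    for (const labels of Object.keys(family)) {
      total[labels] = (total[labels] ?? 0) + (family[labels] ?? 0);
    }
  }
  return total;
}

// Healthy cluster: neither error family has emitted, so the summed metric has no series.
const healthy = sumFamilies([undefined, undefined]);

// After the first errors: only the families that exist contribute.
const afterErrors = sumFamilies([{ "group=a": 2 }, undefined, { "group=a": 3 }]);
```

In this model the healthy case yields an empty result rather than a zero-valued series, which matches the finding's statement that the summed metric has no series until the first error fires.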
 
 ## Imported Dependencies libs and their licenses
 
@@ -415,10 +588,11 @@ This is a **breaking change** to the BanyanDB 
self-observability feature (an int
 feature, not a public protocol/storage contract):
 
 - **Entity model.** A BanyanDB cluster that previously appeared as *N* 
services (one per node) now
-  appears as *one* service with *N* instances. Old per-node `Service` entities 
and their
-  `meter_banyandb_*` / `meter_banyandb_instance_*` metric series are 
superseded; the new series use the
-  cluster/node/group identities and a partly new metric set. Historical data 
under the old model is not
-  migrated.
+  appears as *one* service with one instance **per container** (`pod_name` + 
`container_name`, so a
+  data hot/warm pod yields both a `data` and a `lifecycle` instance). Old 
per-node `Service` entities
+  and their `meter_banyandb_*` / `meter_banyandb_instance_*` metric series are 
superseded; the new
+  series use the cluster/container/group identities and a partly new metric 
set. Historical data under
+  the old model is not migrated.
 - **Scrape target.** Cluster deployments must scrape the **FODC proxy 
`:17913`** (single target) and
   inject a `cluster` label. The legacy per-pod `:2121` collector config is 
replaced. Direct per-pod
   scraping is **out of scope** for this redesign (a standalone node still 
reports through its FODC
@@ -430,9 +604,9 @@ feature, not a public protocol/storage contract):
   Horizon UI bundle.
 - **OAP rule loading** is unchanged: `enabledOtelMetricsRules` already globs 
`banyandb/*`, so the new
   `banyandb-endpoint.yaml` is picked up without an `application.yml` change.
-- **Horizon UI predicate enhancement is backward compatible** — `#entity.*` 
only ever returned `true`
-  before, so implementing it can only *add* hiding behavior to templates that 
opt in; existing templates
-  are unaffected.
+- **The Horizon UI entity-gate extension is backward compatible** — `neq` / 
`in` are additive arms of
+  the structured `visibleWhen` union (horizon-ui #46); templates that don't 
use them are unaffected,
+  and legacy free-text predicates already degrade to ungated rather than 
erroring.
 
 ## General usage docs
 
@@ -451,17 +625,25 @@ This is a preliminary usage sketch to help reviewers; the 
final operator docs (r
 **What the operator sees**
 
 - A **cluster** as a single service, with cluster-wide write/query/error rates 
and capacity.
-- A **node list** where each node shows its **role** (`liaison` / `data`) and 
**tier**
-  (`hot` / `warm` / `cold`) as attributes; selecting a node shows a dashboard 
**scoped to what that node
-  actually does** — ingestion/queue/publish for liaison, 
storage/index/subscribe for data, refined by
-  tier.
+- An **instance (container) list** where each entry shows its **container** 
role
+  (`liaison` / `data` / `lifecycle`) and **tier** (`hot` / `warm` / `cold`) as 
attributes; selecting one
+  shows a dashboard **scoped to what that container actually does** — 
ingestion/queue/publish for
+  liaison, storage/index/subscribe/retention for data, migration cycles + 
last-run time/status for
+  lifecycle, refined by tier.
 - A **group list** (Endpoints) with per-group throughput, latency, storage, 
index and queue health.
 
 ## Future work
 
-- **Topology / lifecycle.** Fuse FODC `/cluster/topology` (node inventory + 
roles + tiers) and the queue
-  `remote_node` / `remote_role` / `remote_tier` labels into a node-to-node 
call graph, and surface FODC
-  `/cluster/lifecycle` group settings (shards / segment interval / TTL) on the 
Endpoint view.
+- **A MAL `SERVICE_INSTANCE_RELATION` scope for the deployment component.** 
Add the third relation scope
+  (`ScopeType` + `MeterEntity` factory + `SampleFamily.instanceRelation(...)` 
+ entity description,
+  mirroring the shipping `serviceRelation` / `processRelation`) so the
+  [intra-cluster instance 
topology](#7-intra-cluster-instance-topology-the-deployment-component) renders
+  live instead of showing the empty state, fed by the queue `remote_node` / `remote_role` / `remote_tier` labels
+  (verified reconstructable from the live data; BanyanDB #1167 also populates 
the lifecycle migration
+  sender identity, so hot→warm→cold tier-migration edges are distinguishable). 
This is MAL-engine code,
+  beyond this SWIP's config-only scope. Also
+  surface FODC `/cluster/topology` and `/cluster/lifecycle` group settings 
(shards / segment interval /
+  TTL) on the Endpoint view.
 - **Alerting.** Ship default alarm rules for the upstream "Key Signals to 
Watch" (query p99, error rate,
   disk > 85%, memory near the protector limit, sustained wqueue / `queue_pub` 
backlog).
 - **Direct-scrape variant** for standalone / non-FODC deployments, if demand 
warrants.
