This is an automated email from the ASF dual-hosted git repository.
wu-sheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking.git
The following commit(s) were added to refs/heads/master by this push:
new d30d022354 SWIP-15: container-level instances and post-review corrections (#13901)
d30d022354 is described below
commit d30d022354e297079e40389235a0527d13c99cb8
Author: 吴晟 Wu Sheng <[email protected]>
AuthorDate: Wed Jun 10 20:55:55 2026 +0800
SWIP-15: container-level instances and post-review corrections (#13901)
---
docs/en/swip/SWIP-15.md | 470 +++++++++++++++++++++++++++++++++---------------
1 file changed, 326 insertions(+), 144 deletions(-)
diff --git a/docs/en/swip/SWIP-15.md b/docs/en/swip/SWIP-15.md
index c72248558e..3ad4853fb8 100644
--- a/docs/en/swip/SWIP-15.md
+++ b/docs/en/swip/SWIP-15.md
@@ -28,31 +28,36 @@ Three things changed underneath it:
3. **SkyWalking replaced the bundled booster UI with the Horizon UI.** The OAP
backend no longer ships
dashboard JSON (dropped in #13877); BanyanDB has not yet been ported to
Horizon UI at all. Horizon
UI is config-driven, has a real **Service → Instance → Endpoint**
hierarchy, surfaces per-instance
- attributes, and can hide panels that have no data — and, with a small
enhancement, can drive panel
- visibility from instance **attributes** (role / tier).
+ attributes, and gates widget visibility through structured,
server-evaluated `visibleWhen`
+ predicates — data presence and instance-**attribute** equality ship today
+ ([horizon-ui
#46](https://github.com/apache/skywalking-horizon-ui/pull/46)); a small
+ extension (membership / negation operators) completes role/tier-driven
dashboards.
The current feature does none of this. It models **each node as its own
`Service`**
(`service(['host_name'], Layer.BANYANDB)`), so a cluster appears as a pile of
unrelated services; it
-never models the cluster, the node role, the tier, or the group; and it still
references metrics that
-BanyanDB removed (an `etcd`-era operation rate, a Prometheus `up`-derived
"active instances", and the
-pre-refactor `queue_sub_total_msg_sent_err` family).
+never models the cluster, the node role, the tier, or the group; and it still
ships stale or misleading
+metrics (an operation rate still named after the retired `etcd` registry, a
Prometheus `up`-derived
+"active instances" that under the FODC proxy would describe the proxy rather
than any node, and the
+`queue_sub_total_msg_sent_err` family, which BanyanDB removed).
**This SWIP proposes to discard that model and rebuild BanyanDB
self-observability around the cluster /
node / group reality**, matching the upstream FODC-proxy metric catalog, and
to design the Horizon UI
-side — a net-new BanyanDB layer dashboard whose node view **adapts to the
selected node's role and
-tier** — including the one Horizon UI enhancement that makes attribute-driven
dashboards possible.
+side — a net-new BanyanDB layer dashboard whose instance view **adapts to the
selected container's
+role and tier** — including the small Horizon UI entity-gate extension that
completes attribute-driven
+dashboards.
### Goals
- Model a BanyanDB **cluster** as a single SkyWalking `Service`.
-- Model each **node** as a `ServiceInstance`, carrying its **role** and
**tier** as instance
- attributes, so the UI can show "what this node is".
+- Model each **container** (`pod_name` + `container_name`) as a
`ServiceInstance`, carrying its
+ **role** and **tier** as instance attributes, so the UI can show "what this
container is".
- Model each **group** as an `Endpoint` of the cluster.
- Mirror the upstream FODC-proxy metric catalog faithfully (the two-dashboard
split becomes the
Instance and Endpoint views).
-- Make the **node dashboard dynamic** — a liaison node shows
ingestion/queue/publish panels, a data
- node shows storage/index/subscribe panels, and the tier refines the data
view — first via the
- data-presence mechanism that already exists, then via a proposed attribute
predicate.
+- Make the **instance dashboard dynamic** — a liaison container shows
ingestion/queue/publish panels, a
+ data container shows storage/index/subscribe panels, a lifecycle container
shows migration panels, and
+ the tier refines the data view — via the structured `visibleWhen` gates
Horizon UI already evaluates
+ (data presence + attribute equality), completed by a proposed
membership/negation extension.
### Non-goals
@@ -72,18 +77,18 @@ tier** — including the one Horizon UI enhancement that
makes attribute-driven
──────────────── ─────────────
──────────
┌─ liaison node ─┐ FODC agent ┐
BANYANDB layer
│ :2121 /metrics │ (sidecar) │
┌───────────────────────┐
- └────────────────┘ │ │ Root:
cluster list │
- ┌─ data hot ─────┐ FODC agent ├─► FODC proxy ──► OTel Collector ──► │
Service: cluster KPIs │
- │ :2121 /metrics │ (sidecar) │ :17913 (prometheus recv, │
Instance: node, panels│
- └────────────────┘ │ /metrics adds `cluster` │
adapt to role/tier │
+ └────────────────┘ │ │
Root: cluster list │
+ ┌─ data hot ─────┐ FODC agent ├─► FODC proxy ──► OTel Collector ──► │
Service: cluster KPIs │
+ │ :2121 /metrics │ (sidecar) │ :17913 (prometheus recv, │
Instance: container, │
+ └────────────────┘ │ /metrics adds `cluster` │
adapts to role/tier │
┌─ data warm ────┐ FODC agent │ single target, label) ──OTLP──► │
Endpoint: group │
- │ :2121 /metrics │ (sidecar) │ per-node labels │
└───────────────────────┘
+ │ :2121 /metrics │ (sidecar) │ identity labels │
└───────────────────────┘
└────────────────┘ ┘ node_role/pod_name/ │
▲
┌─ data cold ────┐ container_name/ ▼
│ MQE over
│ :2121 /metrics │ node_type receiver-otel ──► MAL
│ GraphQL
└────────────────┘ otel-rules/banyandb/*
───────────┘ execExpression
├ banyandb-service.yaml →
Service (cluster)
- ├ banyandb-instance.yaml →
Instance (node + attrs)
+ ├ banyandb-instance.yaml →
Instance (container + attrs)
└ banyandb-endpoint.yaml →
Endpoint (group)
│
▼
@@ -100,41 +105,62 @@ identity.
### 1. Entity model
-| SkyWalking entity | BanyanDB concept
| Identity source (label) |
-| ---------------------------- | ---------------------------------------------
| --------------------------------------- |
-| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**
| `cluster` (injected by the collector) |
-| `ServiceInstance` | one **node**
| `pod_name` (e.g. `banyandb-data-hot-0`) |
-| ↳ attribute `node_role` | node **role**
| `container_name` (`liaison` / `data`) |
-| ↳ attribute `node_type` | data-node **tier**
| `node_type` (`hot` / `warm` / `cold`) |
-| `Endpoint` | one **group** (storage partition)
| `group` (`measure-default`, …) |
-
-A standalone BanyanDB is the degenerate case: one cluster, one node whose role
is `standalone` (all
-roles co-resident) and no tier.
-
-**Why role/tier are instance attributes, not separate services or endpoints.**
A node's identity is
-its `pod_name`; its role and tier are *properties of that node*, which is
exactly what
-`InstanceTraffic.properties` (the UI "Attributes" panel) is for. Keeping the
cluster as the single
-service means the node list, the group list, and cluster-wide KPIs all live
under one entity the
-operator can reason about — and it makes the node dashboard able to adapt to
the selected node's
-attributes.
+| SkyWalking entity | BanyanDB concept | Identity source (label) |
+| ---------------------------- | ----------------------------------------- | ------------------------------------------------------ |
+| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster** | `cluster` (injected by the collector) |
+| `ServiceInstance` | one **container** on a node | `pod_name` + `container_name` (composite) |
+| ↳ attribute `container_name` | container **role** (discriminator) | `liaison` / `data` / `lifecycle` |
+| ↳ attribute `node_type` | data-node **tier** | `hot` / `warm` / `cold` (data containers only; `n/a` elsewhere) |
+| ↳ attribute `node_role` | role enum (coarse) | `ROLE_LIAISON` / `ROLE_DATA` |
+| ↳ attribute `pod_name` | host pod (sibling key) | `demo-banyandb-data-hot-0` |
+| `Endpoint` | one **group** (storage partition) | `group` (`sw_metricsMinute`, …) |
+
+All four labels are attached as instance attributes **verbatim** (not
renamed), because the Horizon UI
+deployment/topology component groups the intra-cluster instance graph by them:
`clusterBy` =
+`node_role` + `node_type`, `siblingBy` = `pod_name`, `roleBy` =
`container_name`. Emitting the raw
+label names keeps the OAP attribute bag and the UI grouping config in lockstep.
+
+**Why the instance is a container, not a `pod_name`.** `pod_name` is **not
unique per metrics
+emitter**: a data hot/warm pod co-hosts a `lifecycle` migration sidecar that
reports under the *same*
+`pod_name` (verified on the live cluster — `demo-banyandb-data-hot-0` emits
both `container_name=data`
+and `container_name=lifecycle`). Keying the instance on `pod_name` alone would
silently merge the two
+series. The instance identity is therefore `pod_name` + `container_name`, and
`container_name` — not
+`node_role` — is the role discriminator: `node_role` carries only
`ROLE_LIAISON` / `ROLE_DATA` on a
+healthy cluster (it stays `ROLE_DATA` on the lifecycle sidecar, and the FODC
agent maps unresolved or
+meta-only nodes to a transient `ROLE_UNSPECIFIED`), whereas `container_name`
cleanly separates
+`liaison` / `data` / `lifecycle`. A standalone BanyanDB is the degenerate
case: one cluster, one node,
+one `container_name=standalone`, no tier.
+
+**Why container/tier are instance attributes, not separate services or
endpoints.** A container's role
+and tier are *properties of that instance*, which is exactly what
`InstanceTraffic.properties` (the UI
+"Attributes" panel) is for. Keeping the cluster as the single service means
the instance list, the
+group list, and cluster-wide KPIs all live under one entity the operator can
reason about — and it
+makes the instance dashboard able to adapt to the selected container's
attributes.
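
The collision argument in this section can be illustrated with a minimal Python sketch. It is not OAP or MAL code: the `Sample` type, the `group_by` helper, and the sample values are hypothetical, and only the label names and the `demo-banyandb-data-hot-0` pod with its `data` and `lifecycle` containers come from the text above.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    """One scraped sample, reduced to the two identity labels (hypothetical type)."""
    pod_name: str
    container_name: str
    value: float


# Two emitters sharing one pod_name, as described for the live cluster.
# The numeric values are made up for illustration.
samples = [
    Sample("demo-banyandb-data-hot-0", "data", 12.0),
    Sample("demo-banyandb-data-hot-0", "lifecycle", 3.0),
]


def group_by(items, key):
    """Group sample values by an instance-key function."""
    groups = defaultdict(list)
    for s in items:
        groups[key(s)].append(s.value)
    return dict(groups)


# Keying on pod_name alone merges the two series into one instance.
by_pod = group_by(samples, lambda s: s.pod_name)

# Keying on (pod_name, container_name) keeps them as distinct instances.
by_container = group_by(samples, lambda s: (s.pod_name, s.container_name))
```

With the pod-only key both values land under a single instance; with the composite key the `data` and `lifecycle` series stay separate, which is the behavior the entity model asks for.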
### 2. Scrape source and label scheme (FODC proxy only)
SkyWalking scrapes the **FODC proxy `/metrics`** (default `:17913`) as the
single Prometheus target.
-The proxy aggregates every node's metrics and stamps four identity labels onto
each sample (verified in
-the FODC agent's `ParseWithNodeLabels`):
+The proxy aggregates every container's metrics and stamps four identity labels
onto each sample
+(verified in the FODC agent's `ParseWithNodeLabels` and against the live
cluster):
+
+| Label | Value | Used for |
+| ---------------- | ---------------------------------------------- | ----------------------------------------------------- |
+| `pod_name` | node identity, e.g. `banyandb-data-hot-0` | instance name (part 1) — **not unique**, see below |
+| `container_name` | `liaison` / `data` / `lifecycle` | instance name (part 2) + attribute `container_name` (the role discriminator) |
+| `node_role` | raw enum `ROLE_LIAISON` / `ROLE_DATA` (transiently `ROLE_UNSPECIFIED`) | **not** the discriminator — coarser than `container_name`, stays `ROLE_DATA` on the lifecycle sidecar |
+| `node_type` | `hot` / `warm` / `cold` (data containers only) | instance attribute `node_type` (tier) |
-| Label | Value | Used for
|
-| ---------------- | -------------------------------------------- |
--------------------------------- |
-| `pod_name` | full node identity, e.g. `banyandb-data-hot-0` | instance
name |
-| `container_name` | `liaison` / `data` (the role discriminator) | instance
attribute `node_role` |
-| `node_role` | raw enum `ROLE_LIAISON` / `ROLE_DATA` |
(available; `container_name` preferred for clean values) |
-| `node_type` | `hot` / `warm` / `cold` (data nodes only) | instance
attribute `node_type` (tier) |
+`pod_name` alone does **not** identify an instance: on the live cluster the
four data hot/warm pods
+each run two containers (`data` + `lifecycle`) under one `pod_name`, so the
instance key is
+`pod_name` + `container_name`.
All original BanyanDB labels are preserved on every sample: `group`,
`service`, `method`, `operation`,
-`remote_node`, `remote_role`, `remote_tier`, `error_type`, `kind`, `path`,
`type`, `name`, `le`, …. The
-Prometheus-synthesized `instance` / `job` / `up` describe the **proxy**, not
individual nodes — node
-liveness is derived from the always-present per-node gauge
`banyandb_system_up_time`, never from `up`.
+`remote_node`, `remote_role`, `remote_tier`, `error_type`, `kind`, `path`,
`type`, `seg`, `shard`,
+`le`, …. Note `service` is BanyanDB's internal **data-model module**
(`measure` / `stream` / `trace` /
+`property` / `group`) — a workload facet, **never** a SkyWalking service
identity. The
+Prometheus-synthesized `instance` / `job` / `up` describe the **proxy**, not
individual containers —
+node liveness is derived from the always-present per-container gauge
`banyandb_system_up_time`, never
+from `up`.
**Collector scrape job (illustrative — operator configuration, not a shipped
file):**
@@ -168,26 +194,64 @@ filter: "{ tags -> tags.job_name == 'banyandb-monitoring'
}"
# banyandb-service.yaml → cluster
expSuffix: service(['cluster'], Layer.BANYANDB)
-# banyandb-instance.yaml → node, with role + tier as attributes
+# banyandb-instance.yaml → container (a node may run >1 container), role + tier as attributes
expSuffix: |-
service(['cluster'], Layer.BANYANDB)
- .instance(['cluster'], '::', ['pod_name'], '', Layer.BANYANDB,
- { tags -> ['node_role': tags.container_name, 'node_type':
tags.node_type ?: 'n/a'] })
+ .instance(['cluster'], '::', ['pod_name', 'container_name'], '@',
Layer.BANYANDB,
+ { tags -> ['node_role': tags.node_role,
+ 'node_type': tags.node_type ?: 'n/a',
+ 'pod_name': tags.pod_name,
+ 'container_name': tags.container_name] })
# banyandb-endpoint.yaml → group
expSuffix: endpoint(['cluster'], ['group'], Layer.BANYANDB)
```
-The 6-argument `.instance(...)` overload's properties closure is the standard,
precedented mechanism for
+The instance key is the pair `['pod_name', 'container_name']` joined by `'@'`
(signature
+`instance(serviceKeys, serviceDelimiter, instanceKeys, instanceDelimiter,
layer, propertiesExtractor)`),
+so the four `data` hot/warm pods surface as distinct `…@data` and
`…@lifecycle` instances rather than
+colliding. The 6-argument overload's properties closure is the standard,
precedented mechanism for
attaching labels as instance attributes (the same shape used by
`k8s-instance.yaml`). The attributes
-ride entirely on the scraped labels — no separate update API.
+ride entirely on the scraped labels — no separate update API. (Two
implementation notes: the MAL v2
+grammar supports the Elvis operator inside a map-literal value, but no shipped
rule combines the two
+yet — the implementation PR should pin this exact closure shape with a compile
test. And `language` is
+the one reserved property key — the instance query maps it to the language
field instead of an
+attribute; none of these four labels collides with it.)
### 3. Metric catalog → MAL rules
The redesigned rules mirror the upstream FODC-proxy catalog. The two upstream
Grafana boards map onto
-two SkyWalking scopes — **Nodes → Instance** (per `pod_name`), **Workload →
Endpoint** (per `group`) —
-plus a small **Service** summary for cluster KPIs. Source metric names below
are verified against
-BanyanDB `origin/main` (the base of the upstream observability PR).
+two SkyWalking scopes — **Nodes → Instance** (per `pod_name` +
`container_name`), **Workload →
+Endpoint** (per `group`) — plus a small **Service** summary for cluster KPIs.
Source metric names
+below are verified against the **live demo cluster** — which runs upstream
`main` builds (the
+validation pull used the showcase-pinned `main` image of 2026-06-09) — and
against BanyanDB
+`origin/main` source. The upstream observability PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) (open; docs
and Grafana dashboards
+only, no metric code) documents the same catalog and defines the two boards
this design mirrors.
+
+> **Metric-name prefix (build-critical).** The sketches below drop a common
prefix for readability.
+> On the wire **every BanyanDB-native family carries the `banyandb_` prefix**
(`banyandb_measure_total_written`,
+> `banyandb_liaison_grpc_total_started`, `banyandb_system_disk`, …) — the MAL
rules must use the full
+> prefixed name. The **only** exceptions are the standard Go-runtime and
process exporter families
+> `go_*` / `process_*`, which are **bare** (no prefix) and are referenced
as-is. Every error counter
+> this catalog references is lazily registered and emits nothing until the
first error fires
+> (`banyandb_liaison_grpc_total_err`,
`banyandb_liaison_grpc_total_stream_msg_received_err`,
+> `banyandb_queue_pub_total_err`, the `*_total_sync_loop_err` family), and the
lifecycle last-run
+> gauges (`banyandb_lifecycle_last_run_*`, BanyanDB #1167) post-date the build
the demo pull validated;
+> every other cited family was present in that pull.
+>
+> **Sketch notation (PromQL-flavored).** Source expressions are written
PromQL-style for readability;
+> the MAL forms differ mechanically. **(1)** No `or vector(0)` guard exists in
MAL — nor is one
+> needed: an unfired family resolves to the empty sample family, MAL's `+`
treats an empty operand as
+> identity, and a rule is skipped only when *all* referenced families are
absent — so an error sum
+> emits as soon as any one term fires, and a fully healthy cluster shows no
series at all (dashboards
+> should render absent as 0). **(2)** MAL arithmetic joins samples on exact
label equality, so each
+> term must be aggregated to the same label set (e.g. `.sum(['cluster'])`)
before `+`. **(3)**
+> `count(...) by (...)` maps to MAL's multi-label `count([...])`;
`histogram_quantile(0.99, …_bucket)`
+> maps to `.histogram().histogram_percentile([99])` on the `le`-labeled base
family (no `_bucket`
+> suffix remains after OTLP conversion); and `time() - <metric>` is computed
at **ingest** in the MAL
+> rule — MAL ships `time()` (the shipped `envoy-ca.yaml` cert-staleness metric
is the precedent),
+> while MQE has no current-time function, so it cannot be computed at query
time.
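
Notes (1) and (2) of the sketch notation can be modelled with a small Python toy. This is a hedged illustration of the stated rules, not MAL's implementation: the `labels`, `sum_by`, and `add` helpers, the pod names, and the counter values are all hypothetical, and dropping samples that have no exact label match is an assumption of this sketch.

```python
from collections import defaultdict


def labels(**kv):
    """Build a hashable label set for one sample."""
    return frozenset(kv.items())


def sum_by(family, keep):
    """Aggregate a sample family down to the label names in `keep`."""
    out = defaultdict(float)
    for key, value in family.items():
        reduced = frozenset((k, v) for k, v in key if k in keep)
        out[reduced] += value
    return dict(out)


def add(left, right):
    """Toy '+': an empty operand is the identity; otherwise join on exact label equality."""
    if not left:
        return dict(right)
    if not right:
        return dict(left)
    # Assumption of this sketch: samples without an exact match are dropped.
    return {k: left[k] + right[k] for k in left.keys() & right.keys()}


# One error family fired on a liaison, one on a data container, one never fired.
grpc_err = {labels(cluster="c1", pod_name="liaison-0"): 2.0}
pub_err = {}  # lazily registered and still unfired: the empty sample family
sync_err = {labels(cluster="c1", pod_name="data-hot-0"): 1.0}

# Without aggregation the label sets differ, so nothing joins.
unaligned = add(grpc_err, sync_err)

# Aggregating every term to ['cluster'] first makes the terms line up.
total = add(
    add(sum_by(grpc_err, {"cluster"}), sum_by(pub_err, {"cluster"})),
    sum_by(sync_err, {"cluster"}),
)
```

In this model the unfired `pub_err` term leaves the sum unchanged, and the per-cluster total appears only because each term was first reduced to the same label set.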
#### 3.1 Service scope — cluster summary (`banyandb-service.yaml`)
@@ -195,15 +259,15 @@ BanyanDB `origin/main` (the base of the upstream
observability PR).
| --------------------------- | ------------------------ |
------------------------------------------------------------------------------------------
|
| `cluster_write_rate` | cluster writes/s |
`rate(measure_total_written) + rate(stream_tst_total_written) +
rate(trace_tst_total_written)` |
| `cluster_query_rate` | cluster queries/s |
`rate(liaison_grpc_total_started{method='query'})`
|
-| `cluster_error_rate` | cluster errors/min |
`liaison_grpc_total_err + _stream_msg_received_err +
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err`
(×60, each `or vector(0)`) |
-| `reporting_nodes` | live node count by role |
`count(system_up_time) by (container_name)`
|
+| `cluster_error_rate` | cluster errors/min |
`liaison_grpc_total_err + liaison_grpc_total_stream_msg_received_err +
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err`
(×60; all lazily registered — see sketch notation above) |
+| `reporting_instances` | live container count by role |
`count(system_up_time) by (container_name)`
|
| `total_cpu_cores` | cluster CPU capacity |
`sum(system_cpu_num)`
|
| `total_memory_used` | cluster memory used |
`sum(system_memory_state{kind='used'})`
|
| `total_disk_used` | cluster disk used |
`sum(system_disk{kind='used'})`
|
-#### 3.2 Instance scope — per node (`banyandb-instance.yaml`)
+#### 3.2 Instance scope — per container (`banyandb-instance.yaml`)
-**All roles** (every node emits these — the "Nodes" board):
+**All roles** (every container emits these — the "Nodes" board):
| Metric (`meter_banyandb_instance_*`) | Source
|
| ------------------------------------ |
---------------------------------------------------------------- |
@@ -218,26 +282,44 @@ BanyanDB `origin/main` (the base of the upstream
observability PR).
| `gc_pause_avg` | `rate(go_gc_duration_seconds_sum) /
rate(go_gc_duration_seconds_count)` |
| `heap_inuse` / `heap_next_gc` / `alloc_rate` |
`go_memstats_heap_inuse_bytes` / `go_memstats_next_gc_bytes` /
`rate(go_memstats_alloc_bytes_total)` |
-**Liaison-only** (front door; hidden on data nodes — see
[§4](#4-dynamic-metrics-by-role-and-tier)):
+**Liaison-only** (front door; hidden on data containers — see [dynamic metrics
by role and tier](#4-dynamic-metrics-by-role-and-tier)):
| Metric (`meter_banyandb_instance_*`) | Source
|
| ------------------------------------- |
----------------------------------------------------------------------- |
| `query_rate_by_service` |
`rate(liaison_grpc_total_started{method='query'}) by (service)` |
-| `grpc_error_rate` | `rate(liaison_grpc_total_err) by
(service, method)` (+ `_stream_msg_received_err`; both lazily registered) |
+| `grpc_error_rate` | `rate(liaison_grpc_total_err) by
(service, method)` (+ `liaison_grpc_total_stream_msg_received_err`; both lazily
registered) |
| `non_query_op_rate` |
`rate(liaison_grpc_total_started{method!='query'}) by (method)` |
| `write_rate` |
`rate({measure,stream_tst,trace_tst}_total_written)` |
| `publish_throughput` / `publish_latency_p99` |
`rate(queue_pub_total_finished) by (operation)` / `histogram_quantile(0.99,
…queue_pub_total_latency_bucket)` |
| `wqueue_file_parts` / `wqueue_mem_part` / `wqueue_pending` |
`{measure,stream_tst,trace_tst}_total_file_parts` / `_total_mem_part` /
`_pending_data_count` |
-**Data-only** (backend; hidden on liaison nodes):
+**Data-only** (backend; hidden on liaison containers):
| Metric (`meter_banyandb_instance_*`) | Source
|
| ----------------------------------------------- |
------------------------------------------------------------------ |
| `total_data` |
`{measure,stream_tst,trace_tst}_total_file_elements` |
| `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` |
`rate(*_total_merge_loop_started)` / `…_merge_latency{type='file'}` /
`…_merged_parts{type='file'}` |
-| `series_write_rate` / `series_term_search_rate` / `total_series` |
`measure_inverted_index_total_updates` / `_term_searchers_started` /
`_doc_count`; `stream_storage_inverted_index_*` |
+| `series_write_rate` / `series_term_search_rate` / `total_series` |
`measure_inverted_index_total_updates` / `_total_term_searchers_started` /
`_total_doc_count`; `stream_storage_inverted_index_*` |
| `stream_tst_write_rate` / `stream_tst_term_search_rate` /
`stream_tst_total_docs` | `stream_tst_inverted_index_*` |
| `queue_sub_throughput` / `queue_sub_latency_p99` (per `operation`) |
`rate(queue_sub_total_started/finished) by (operation)` /
`histogram_quantile(0.99, …queue_sub_total_latency_bucket) by (operation)` |
+| `retention_disk_usage_percent` / `retention_cooldown` |
`storage_retention_{measure,stream,trace}_disk_usage_percent` /
`_forced_retention_cooldown_seconds` |
+
+**Lifecycle-only** (the tier-migration sidecar co-located on `hot`/`warm` data
pods; `container_name == 'lifecycle'`):
+
+| Metric (`meter_banyandb_instance_*`) | Source |
+| ------------------------------------ | ------------------------------------------------------------------ |
+| `lifecycle_cycles` | `lifecycle_cycles_total` (cumulative migration cycles) |
+| `lifecycle_last_run` | `lifecycle_last_run_timestamp_seconds` — epoch of the last cycle's start; "time since last sync" = `time() - <metric>`, computed at ingest in the MAL rule (MQE has no `time()`) |
+| `lifecycle_last_run_success` | `lifecycle_last_run_success` (`1` = last cycle OK, `0` = failed) |
+
+> **Lifecycle last-run signals.** The two gauges above were added in BanyanDB
+> [#1167](https://github.com/apache/skywalking-banyandb/pull/1167) (merged to
`main` on 2026-06-09,
+> post-dating the build the demo pull validated) — both are
+> stamped on every cycle end (success, error, or panic-recovered), so they
drive a "time since last
+> sync" staleness panel and a "last sync OK?" status panel directly. They emit
only **after the first
+> migration runs**, so the staleness panel must guard the never-run case. The
same PR also stamps the
+> lifecycle's sender identity onto its migration publisher, so a destination
data node's `queue_sub`
+> `remote_node` / `remote_role` / `remote_tier` now identify the migration
source (were empty before).
#### 3.3 Endpoint scope — per group (`banyandb-endpoint.yaml`)
@@ -250,7 +332,7 @@ nodes per group):
| `query_latency` |
`rate(liaison_grpc_total_latency{method='query'}) /
rate(…_started{method='query'}) by (group)` |
| `total_data` |
`{measure,stream_tst,trace_tst}_total_file_elements by (group)` |
| `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` | the
merge family `by (group)` |
-| `series_write_rate` / `total_series` | inverted-index `_total_updates` /
`_doc_count` `by (group)` |
+| `series_write_rate` / `total_series` | inverted-index `_total_updates` /
`_total_doc_count` `by (group)` |
| `queue_throughput` / `queue_latency_p99` | `queue_sub` / `queue_pub` `by
(operation, group)` |
| `publish_bytes` | `rate(queue_pub_sent_bytes) by
(group)` |
@@ -262,98 +344,168 @@ nodes per group):
### 4. Dynamic metrics by role and tier
-Different roles expose different metrics, so the **node (Instance) dashboard
must adapt to the selected
-node**. Two mechanisms, layered:
+Different roles expose different metrics, so the **instance dashboard must
adapt to the selected
+container**. Horizon UI's widget `visibleWhen` is a structured,
**server-evaluated** gate (the BFF
+resolves it against data presence or the selected instance's attributes and
returns gated-out widgets
+as hidden; legacy free-text predicate strings are no longer parsed and degrade
to ungated). Two gate
+kinds, layered:
-**(a) Data-presence gating — available today, no UI code.** Horizon UI already
supports
-`visibleWhen: "<metric> has value"` on a widget; a panel whose metric returns
all-null self-hides. Each
-MAL rule only produces samples for nodes that emit its source metric, so
liaison-only metrics are simply
-absent on data instances and vice-versa. This gives correct adaptive behavior
out of the box:
+**(a) Data-presence gating — available today, no UI code.** The `mqe`-kind
gate hides a widget whose
+expression returns no data. Each MAL rule only produces samples for containers
that emit its source
+metric, so liaison-only metrics are simply absent on data instances and
vice-versa. This gives correct
+adaptive behavior out of the box:
```jsonc
{ "id": "wqueue", "title": "Write Queue (wqueue)", "type": "line",
"expressions": ["meter_banyandb_instance_wqueue_pending"],
- "visibleWhen": "meter_banyandb_instance_wqueue_pending has value" }
+ "visibleWhen": { "kind": "mqe", "expression":
"meter_banyandb_instance_wqueue_pending", "op": "exists" } }
```
-**(b) Attribute predicate — proposed enhancement (see
[§6](#6-horizon-ui-enhancement-entity-attribute-predicate)).**
+**(b) Attribute gating — equality ships today; membership is the proposed
extension (see
+[entity-gate membership
operators](#6-horizon-ui-enhancement-entity-gate-membership-operators)).**
Data-presence can't distinguish "wrong role" from "idle but right role", and
it still issues the query.
-An attribute predicate keys panel visibility directly on the node's
`node_role` / `node_type`
-attributes:
+The `entity`-kind gate keys panel visibility directly on the selected
instance's `container_name` /
+`node_type` attributes (meaningful on the Instance scope only):
```jsonc
-{ "id": "wqueue", "visibleWhen": "#entity.node_role == 'liaison'" }
-{ "id": "cold_tier_note", "visibleWhen": "#entity.node_type == 'cold'" }
+{ "id": "wqueue", "visibleWhen": { "kind": "entity", "attribute":
"container_name", "op": "eq", "value": "liaison" } }
+{ "id": "cold_tier_note", "visibleWhen": { "kind": "entity", "attribute":
"node_type", "op": "eq", "value": "cold" } }
```
This is the precise, declarative form, and it is the natural way to express
tier-specific panels (a
-`hot` data node merges constantly; a `cold` node is mostly static).
+`hot` data container merges constantly; a `cold` container is mostly static).
The landed gate supports
+`exists` and case-insensitive `eq`; tier *sets* need the proposed `in`
operator — until it lands they
+are expressible as duplicated `eq`-gated widget variants.
Role/tier scoping of the catalog:
-| Bucket | Panels
| Predicate |
-| --------------- |
--------------------------------------------------------------------- |
--------------------------------- |
-| **All roles** | system resources, disk-by-path, network, Go runtime, node
uptime | (always shown) |
-| **Liaison** | gRPC query & errors, non-query ops, write rate, publish
throughput & latency, wqueue depth | `#entity.node_role == 'liaison'` |
-| **Data** | storage totals, merge/compaction, inverted index,
subscribe queue | `#entity.node_role == 'data'` |
-| **Data + tier** | tier-specific merge/retention hints
| `#entity.node_type in (hot,warm)` |
+| Bucket | Panels | Entity gate |
+| --------------- | --------------------------------------------------------------------- | ---------------------------------- |
+| **All roles** | system resources, disk-by-path, network, Go runtime, node uptime | (always shown) |
+| **Liaison** | gRPC query & errors, non-query ops, write rate, publish throughput & latency, wqueue depth | `container_name eq liaison` |
+| **Data** | storage totals, merge/compaction, inverted index, subscribe queue, retention | `container_name eq data` |
+| **Data + tier** | tier-specific merge/retention hints | `node_type in (hot, warm)` † |
+| **Lifecycle** | migration cycles, last-run time + status | `container_name eq lifecycle` |
+
+† `in` is the proposed extension of [section
6](#6-horizon-ui-enhancement-entity-gate-membership-operators);
+until it lands, two `eq`-gated widget variants.
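
The relationship between the landed `eq` operator and the proposed `in` operator can be sketched in Python. This is an illustrative model only, not the Horizon UI BFF code: the `entity_gate_visible` function, the `values` field used for `in`, the reading of `exists` as "attribute is present", and the choice to raise on an unknown operator are all assumptions of this sketch. Only the `kind` / `attribute` / `op` / `value` gate shape and the case-insensitive `eq` come from the text above.

```python
def entity_gate_visible(gate, attributes):
    """Evaluate one entity-kind gate against an instance's attribute bag."""
    actual = attributes.get(gate["attribute"])
    op = gate["op"]
    if op == "exists":
        # Assumed meaning: the attribute is present on the instance.
        return actual is not None
    if op == "eq":
        # Case-insensitive equality, as described for the landed gate.
        return actual is not None and actual.lower() == gate["value"].lower()
    if op == "in":
        # Proposed extension; the `values` list is a hypothetical shape.
        allowed = {v.lower() for v in gate["values"]}
        return actual is not None and actual.lower() in allowed
    raise ValueError(f"unsupported op: {op}")


def tier_gate(op, **rest):
    """Helper for building node_type gates in this example."""
    return {"kind": "entity", "attribute": "node_type", "op": op, **rest}


# Today: one widget variant per tier, each gated with `eq`.
eq_variants = [tier_gate("eq", value="hot"), tier_gate("eq", value="warm")]

# Proposed: a single widget gated with `in`.
in_gate = tier_gate("in", values=["hot", "warm"])

warm_data = {"container_name": "data", "node_type": "warm"}
cold_data = {"container_name": "data", "node_type": "cold"}

warm_shown_today = any(entity_gate_visible(g, warm_data) for g in eq_variants)
warm_shown_proposed = entity_gate_visible(in_gate, warm_data)
cold_shown_today = any(entity_gate_visible(g, cold_data) for g in eq_variants)
cold_shown_proposed = entity_gate_visible(in_gate, cold_data)
```

Under these assumptions the duplicated `eq` variants and the single `in` gate show the panel for the same instances; the extension removes the duplication rather than changing what is visible.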
### 5. Dashboards (Horizon UI BANYANDB layer template)
A net-new layer template `apps/bff/src/bundled_templates/layers/banyandb.json`
(config-driven JSON, one
-file per layer, per-scope widget arrays, MQE expression strings). The design
mirrors the upstream two
-boards across the SkyWalking hierarchy:
+file per layer keyed by its `key` field — `BANYANDB`, filename lowercased —
with per-scope widget
+arrays and MQE expression strings). One menu touchpoint exists: Horizon UI
currently hard-codes the
+`BANYANDB` layer out of the sidebar (`HIDDEN_LAYERS`);
+[horizon-ui #47](https://github.com/apache/skywalking-horizon-ui/pull/47)
replaces that with a
+config-driven `layers.excluded` list that un-hides BanyanDB — this SWIP rides
on that change (or an
+equivalent one-line un-hide). The design mirrors the upstream two boards
across the SkyWalking
+hierarchy:
```
BANYANDB layer
-├─ Root → cluster list (ServiceList), showGroup=false
+├─ Root → cluster list (the layer landing's service-list picker: header columns + sort)
├─ Service (cluster)
│ └─ Overview KPIs + "Cluster Workload Summary" + "Fleet Overview" capacity
│ (cluster_write_rate, cluster_query_rate, cluster_error_rate,
-│ reporting_nodes by role, total_cpu/memory/disk)
-├─ Instance (node) ← the "Nodes" board, made dynamic
+│ reporting_instances by role, total_cpu/memory/disk)
+├─ Instance (container) ← the "Nodes" board, made dynamic; instance = pod_name@container_name
│ ├─ All roles: Resources (CPU/RSS/mem%/disk%), Disk by Path, Network, Go Runtime
-│ ├─ Liaison (visibleWhen role==liaison): Ingestion/Query, Registry, Errors,
+│ ├─ Liaison (entity gate container_name eq liaison): Ingestion/Query, Registry, Errors,
│ │ Publish throughput & p99, Write Queue (wqueue) depth
-│ └─ Data (visibleWhen role==data): Storage totals, Merge, Inverted Index,
-│ Subscribe Queue (per operation: query/file-sync/batch-write/control)
+│ ├─ Data (entity gate container_name eq data): Storage totals, Merge, Inverted Index, Retention,
+│ │ Subscribe Queue (per operation: query/file-sync/batch-write/control)
+│ └─ Lifecycle (entity gate container_name eq lifecycle): migration cycles, last-run time + status
└─ Endpoint (group) ← the "Workload" board, by group
└─ Write rate, Query latency, Total data, Merge, Inverted index, Queue, Publish bytes
```
Panel **types/units** follow the upstream Grafana boards for fidelity (stat
for KPIs; timeseries for
-rates/latencies; table for the per-node health row; `bytes` / `percentunit` /
`s` / `reqps` / `wps`
-units; disk% and memory% turn red at 80%). The per-node "health table"
(uptime, CPU cores, RSS, mem%,
-disk%) becomes the Instance-list columns on the Service view.
+rates/latencies; `bytes` / `percentunit` / `s` / `reqps` / `wps` units; disk%
and memory% turn red at
+80%). The upstream per-node "health table" (uptime, CPU cores, RSS, mem%,
disk%) maps onto the
+all-roles Resources widgets of the Instance view — Horizon UI's instance list
deliberately shows only
+name + attributes (the role/tier chips), and per-instance metric columns are
not assumed by this
+design; if embedded health columns prove necessary later, that is an additive
Horizon UI enhancement.
This is **design only** — the production `banyandb.json` and its exact widget
grid are deliberately left
to the implementation PR in the Horizon UI repository.
-### 6. Horizon UI enhancement: `#entity` attribute predicate
+### 6. Horizon UI enhancement: entity-gate membership operators
-Horizon UI's widget `visibleWhen` already parses two predicate forms but only
one is implemented:
+When this SWIP was first drafted, Horizon UI parsed `visibleWhen` as free text
and stubbed the
+entity-attribute form. That is no longer the upstream state: horizon-ui PR #46
(merged 2026-06-08)
+replaced the free-text parser with a structured, **BFF-evaluated** union —
-- `"<metric> has value"` — implemented (client-side data-presence gating).
-- `"#entity.<key>"` — **parsed but stubbed**: the renderer's `isVisible`
currently returns `true`
- unconditionally for any `#entity.*` predicate, with the comment
*"Entity-attribute predicates need an
- attributes feed we don't surface yet. Render the widget unconditionally for
now."*
+- `{ "kind": "mqe", "expression": "<expr>", "op": "exists" }` — data-presence
gating;
+- `{ "kind": "entity", "attribute": "<key>", "op": "exists" }` /
+ `{ "kind": "entity", "attribute": "<key>", "op": "eq", "value": "<v>" }` —
entity-attribute gating
+ against the selected instance's attribute feed (`eq` compares
case-insensitively; meaningful on the
+ Instance scope only, a no-op elsewhere)
-The data is already on the wire: the instance list the UI fetches carries each
instance's
-`attributes [{name,value}]`. The enhancement is to **wire those attributes
into the predicate
-evaluator** and give the predicate a small comparison grammar:
+— so the attribute feed and the evaluator this section originally proposed
**already exist upstream**:
+the BFF fetches the selected instance's `attributes [{name,value}]` and
returns gated-out widgets as
+hidden. Legacy free-text predicates (`"<metric> has value"`,
`"#entity.<key>"`) are no longer parsed
+and degrade to ungated.
-| Predicate form                          | Meaning                                            |
-| --------------------------------------- | -------------------------------------------------- |
-| `#entity.<key>`                         | attribute present and truthy                       |
-| `#entity.<key> == '<v>'` / `!= '<v>'`   | equals / not-equals a literal                      |
-| `#entity.<key> in (<v1>,<v2>)`          | membership                                         |
+What remains for this design is only **membership and negation**:
-Scope of the enhancement (design): (1) pass the selected instance's
`attributes` into the
-`LayerDashboardsView` predicate context; (2) implement the `#entity.*` branch
of `isVisible` to read
-that context; (3) extend the predicate parser with `==` / `!=` / `in`; (4)
document it in the Horizon UI
-layer-template authoring docs. It is generic — any layer (K8s node roles,
gateway tiers, …) benefits;
+| Proposed gate                                                                        | Meaning              |
+| ------------------------------------------------------------------------------------ | -------------------- |
+| `{ "kind": "entity", "attribute": "<key>", "op": "neq", "value": "<v>" }`            | not-equals a literal |
+| `{ "kind": "entity", "attribute": "<key>", "op": "in", "values": ["<v1>", "<v2>"] }` | membership           |
+
+Scope of the enhancement (design): (1) add the two operator arms to the BFF
`visibleWhen` schema and
+its entity-gate evaluator; (2) document them in the Horizon UI layer-template
authoring docs. Until it
+lands, a tier set like `node_type in (hot, warm)` is expressible as two
`eq`-gated widget variants —
+`in` removes the duplication. It is generic — any layer (K8s node roles,
gateway tiers, …) benefits;
BanyanDB is the first consumer. The exact code lands in the Horizon UI
repository.
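
As an illustration only, the entity-gate semantics described in this section could be sketched as below. The gate field names (`kind`, `attribute`, `op`, `value`, `values`) follow the shapes quoted above; the type names, the function names, and the choice to treat `neq` / `in` case-insensitively and to let `neq` pass on a missing attribute are assumptions of this sketch, not the Horizon UI implementation.

```typescript
// Hypothetical types for illustration; not the Horizon UI BFF schema.
type Attribute = { name: string; value: string };

type EntityGate =
  | { kind: "entity"; attribute: string; op: "exists" }
  | { kind: "entity"; attribute: string; op: "eq"; value: string }
  | { kind: "entity"; attribute: string; op: "neq"; value: string } // proposed arm
  | { kind: "entity"; attribute: string; op: "in"; values: string[] }; // proposed arm

// Look up one attribute in the instance's attributes [{name,value}] feed.
function findAttribute(attrs: Attribute[], key: string): string | undefined {
  for (const a of attrs) {
    if (a.name === key) return a.value;
  }
  return undefined;
}

// Returns true when the widget guarded by `gate` should stay visible.
function entityGatePasses(gate: EntityGate, attrs: Attribute[]): boolean {
  const actual = findAttribute(attrs, gate.attribute);
  switch (gate.op) {
    case "exists":
      return actual !== undefined;
    case "eq":
      // The text above states that eq compares case-insensitively.
      return actual !== undefined && actual.toLowerCase() === gate.value.toLowerCase();
    case "neq":
      // Assumption: a missing attribute counts as "not equal".
      return actual === undefined || actual.toLowerCase() !== gate.value.toLowerCase();
    case "in": {
      if (actual === undefined) return false;
      const needle = actual.toLowerCase();
      // Assumption: membership reuses eq's case-insensitive comparison.
      return gate.values.some((v) => v.toLowerCase() === needle);
    }
  }
}

// Example: a hot-tier data container and a tier-membership gate.
const hotDataAttrs: Attribute[] = [
  { name: "container_name", value: "data" },
  { name: "node_type", value: "hot" },
];
const tierGate: EntityGate = {
  kind: "entity",
  attribute: "node_type",
  op: "in",
  values: ["hot", "warm"],
};
const liaisonOnlyGate: EntityGate = {
  kind: "entity",
  attribute: "container_name",
  op: "eq",
  value: "liaison",
};
```

With these inputs the tier gate keeps its widget visible for the hot data container, while the liaison-only gate hides its widget. A single `in` gate stands in for the two `eq`-gated widget variants mentioned above.
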
+### 7. Intra-cluster instance topology (the "deployment" component)
+
+Beyond the per-instance dashboards, the BanyanDB layer adds a **deployment
view**: the
+container-to-container call graph *within* the single BanyanDB cluster service
— liaison↔data writes,
+the hot→warm→cold lifecycle migration chain, and inter-liaison gossip. The
legacy booster UI only ever
+drew instance topology *between two services*; this is a net-new Horizon UI
component for the
+**one-service** case (landing via horizon-ui PR #47).
+
+**Data path — no query API change.** The component calls
+`getServiceInstanceTopology(clientServiceId, serverServiceId, duration)` with
the **same** service id
+on both sides. OAP's relation filter is symmetric, so `client == server ==
svc` collapses to
+`source_service_id == dest_service_id == svc`, returning exactly the
intra-cluster instance relations
+(verified across the BanyanDB / JDBC / ES topology DAOs). Per-node metrics
evaluate under
+`{ scope: ServiceInstance }`; per-edge metrics under `ServiceInstanceRelation`
(server + client
+families) — both ordinary MQE.
+
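
The collapse described in the data-path paragraph can be pictured with a small in-memory sketch. It assumes that "symmetric" means a relation row matches the filter in either direction; the row shape, function names, service ids, and instance names are hypothetical and do not reflect OAP's storage schema or DAO code.

```typescript
// Hypothetical relation row; field names only echo the source/dest ids above.
type InstanceRelation = {
  sourceServiceId: string;
  destServiceId: string;
  sourceInstance: string;
  destInstance: string;
};

// Assumption: a symmetric filter accepts client->server and server->client rows.
function matchesRelationFilter(
  r: InstanceRelation,
  clientServiceId: string,
  serverServiceId: string
): boolean {
  const forward = r.sourceServiceId === clientServiceId && r.destServiceId === serverServiceId;
  const reverse = r.sourceServiceId === serverServiceId && r.destServiceId === clientServiceId;
  return forward || reverse;
}

// Passing the same id on both sides leaves only rows that start and end
// inside that one service.
function intraServiceRelations(rows: InstanceRelation[], serviceId: string): InstanceRelation[] {
  return rows.filter((r) => matchesRelationFilter(r, serviceId, serviceId));
}

const sampleRows: InstanceRelation[] = [
  {
    sourceServiceId: "banyandb-demo", // hypothetical cluster service id
    destServiceId: "banyandb-demo",
    sourceInstance: "liaison-0@liaison",
    destInstance: "data-hot-0@data",
  },
  {
    sourceServiceId: "some-app", // hypothetical external caller
    destServiceId: "banyandb-demo",
    sourceInstance: "app-0",
    destInstance: "liaison-0@liaison",
  },
];
const intraRows = intraServiceRelations(sampleRows, "banyandb-demo");
```

Only the liaison-to-data row survives; the row from the external caller is dropped because its source service differs.
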
+**Grouping contract.** The component lays the graph out from the instance
attributes this SWIP emits
+([entity model](#1-entity-model)):
+
+| Config key  | Attribute(s)              | Effect                                                         |
+| ----------- | ------------------------- | -------------------------------------------------------------- |
+| `clusterBy` | `node_role` + `node_type` | one box per role/tier — liaison, data hot/warm/cold            |
+| `siblingBy` | `pod_name`                | a pod = main container + sibling containers (data + lifecycle) |
+| `roleBy`    | `container_name`          | per-role node metrics (`liaison` / `data` / `lifecycle`)       |
+
+Per-role node MQE binds to the `meter_banyandb_instance_*` metrics from the
catalog above — e.g.
+liaison → `query_rate_by_service`, data → `write_rate` / `disk_usage_percent`,
lifecycle →
+`lifecycle_cycles` / `lifecycle_last_run_success`. Only `container_name` ∈
+{`liaison`, `data`, `lifecycle`} exists on the wire — there is **no `fodc`
container** (the FODC agent
+publishes no self-metrics through the proxy), so a `fodc` role is not modeled.
+
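
To make the grouping contract concrete, here is a rough sketch of how instances might be bucketed by the `clusterBy` and `siblingBy` attributes. The attribute names come from the table above; the types, helper names, liaison pod name, and the decision to leave `node_type` off the lifecycle example are assumptions made for illustration, not the component's actual code.

```typescript
// Hypothetical instance shape: a name plus a flat attribute map.
type NodeInstance = { name: string; attributes: { [key: string]: string } };

// Generic bucketing helper: group instance names by a derived key.
function groupNames(
  instances: NodeInstance[],
  keyOf: (i: NodeInstance) => string
): { [key: string]: string[] } {
  const groups: { [key: string]: string[] } = {};
  for (const inst of instances) {
    const key = keyOf(inst);
    const bucket = groups[key];
    if (bucket) bucket.push(inst.name);
    else groups[key] = [inst.name];
  }
  return groups;
}

// clusterBy: node_role + node_type (tier omitted when the attribute is absent).
const clusterKey = (i: NodeInstance): string => {
  const role = i.attributes["node_role"] || "unknown";
  const tier = i.attributes["node_type"];
  return tier ? role + "/" + tier : role;
};

// siblingBy: containers that share a pod_name are drawn as siblings.
const siblingKey = (i: NodeInstance): string => i.attributes["pod_name"] || i.name;

const demoInstances: NodeInstance[] = [
  {
    name: "demo-banyandb-liaison-0@liaison", // hypothetical pod name
    attributes: { pod_name: "demo-banyandb-liaison-0", container_name: "liaison", node_role: "ROLE_LIAISON" },
  },
  {
    name: "demo-banyandb-data-hot-0@data",
    attributes: { pod_name: "demo-banyandb-data-hot-0", container_name: "data", node_role: "ROLE_DATA", node_type: "hot" },
  },
  {
    // Whether the lifecycle sidecar carries node_type is left open in this sketch.
    name: "demo-banyandb-data-hot-0@lifecycle",
    attributes: { pod_name: "demo-banyandb-data-hot-0", container_name: "lifecycle", node_role: "ROLE_DATA" },
  },
];

const clusterBoxes = groupNames(demoInstances, clusterKey);
const podSiblings = groupNames(demoInstances, siblingKey);
```

In this example the hot data pod ends up with two sibling containers under one `pod_name`, which is the case the `siblingBy` key exists to handle.
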
+**Open dependency — a MAL `SERVICE_INSTANCE_RELATION` scope.** This feature is
MAL-only: every BanyanDB
+entity, metric, and attribute here is produced by the `banyandb/*` MAL rules.
MAL builds relations
+through `MeterEntity` / `ScopeType`, which ships `SERVICE_RELATION` and
`PROCESS_RELATION` (the latter
+already powers the eBPF process topology via `network-profiling.yaml`) — but
it has **no
+`SERVICE_INSTANCE_RELATION` scope** and no
`SampleFamily.instanceRelation(...)` builder. So MAL cannot
+emit the instance-relation metric that `getServiceInstanceTopology` reads, and
on a metrics-only
+BanyanDB the deployment graph is **empty** — the Horizon UI component
(horizon-ui PR #47) renders that empty
+state by design until the scope lands (its earlier preview mock has been
dropped).
+Closing the gap means adding that third relation scope (a
`SERVICE_INSTANCE_RELATION` `ScopeType` +
+`MeterEntity` factory + `instanceRelation(...)` builder + entity description,
mirroring the two that
+ship), fed by the queue `remote_node` / `remote_role` / `remote_tier` labels
(now carrying the
+lifecycle sender identity per BanyanDB #1167). That is MAL-**engine** code
(`server-core` +
+`meter-analyzer`), which exceeds this SWIP's [config-only
non-goals](#non-goals); it is tracked under
+[future work](#future-work). The component, the query path, and the grouping
contract above are ready
+the moment that scope lands.
+
## Feasibility and precedent
Verified against the OAP and Horizon UI source — **no OAP core / MAL /
receiver change is required**:
@@ -367,21 +519,35 @@ Verified against the OAP and Horizon UI source — **no OAP core / MAL / receive
`EndpointTraffic` whenever the endpoint name is non-empty; `EndpointTraffic`
is `supportUpdate=true`
and is listed by GraphQL `findEndpoint` (empty keyword ⇒ list all), which
the BanyanDB metadata DAO
serves from the traffic table without touching any trace data.
-- **Layer.** `Layer.BANYANDB` (ordinal 43) already exists; layer dashboards
are auto-discovered by the
- UI from the template's own `layer` field — no menu code change.
+- **Layer.** `Layer.BANYANDB` (ordinal 43) already exists; layer dashboards
are auto-discovered from
+ the template's own `key` field. The one menu touchpoint: Horizon UI's
hard-coded hidden-layers set
+ currently drops `BANYANDB` from the sidebar — un-hidden by horizon-ui PR
#47's config-driven
+ `layers.excluded` (see
[Dashboards](#5-dashboards-horizon-ui-banyandb-layer-template)).
## Live validation
The entity scheme and the metric catalog above were validated against a **live
7-node BanyanDB
cluster** — the public SkyWalking demo's FODC proxy `/metrics` (2 liaison + 5
data: `hot×2`, `warm×2`,
-`cold×1`). Findings:
-
-- **All four identity labels are present and exactly as designed.** Every
sample carries `pod_name`
- (e.g. `demo-banyandb-data-hot-0`), `node_role` (`ROLE_LIAISON` /
`ROLE_DATA`), `container_name`
- (`liaison` / `data`), and — on **data nodes only** — `node_type` (`hot` /
`warm` / `cold`). Liaison
- nodes carry no `node_type`, so the instance closure defaults the tier
attribute (`tags.node_type ?:
- 'n/a'`). This validates Service = `cluster`, Instance = `pod_name`,
attributes `node_role` /
- `node_type`.
+`cold×1`), running an upstream `main` build (the showcase-pinned image of
2026-06-09; upstream PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) — open, docs
and Grafana dashboards
+only — documents the same catalog). The live `/metrics` pull is the
authoritative wire reference.
+393 metric families. Findings:
+
+- **Instance must be `pod_name` + `container_name`, not `pod_name`.** Every
sample carries `pod_name`,
+ `node_role` (`ROLE_LIAISON` / `ROLE_DATA` observed; the FODC agent stamps a
transient
+ `ROLE_UNSPECIFIED` for unresolved or meta-only nodes), `container_name`
+ (`liaison` / `data` / **`lifecycle`**), and — on **data containers only** —
`node_type`
+ (`hot` / `warm` / `cold`). Crucially, the four `data` hot/warm pods each run
**two containers under
+ one `pod_name`** (`…@data` and `…@lifecycle`), so `pod_name` is not a unique
instance key and
+ `node_role` is not the discriminator (it reads `ROLE_DATA` on the lifecycle
sidecar). This validates
+ Service = `cluster`, Instance = `pod_name` + `container_name`, attributes
`container_name` / `node_type`.
+- **The `lifecycle` migrator surfaces as its own container instance.** It
co-locates on the `hot`/`warm`
+ data pods and emits `banyandb_lifecycle_cycles_total` plus the shared
`system_*` / `go_*` /
+ `process_*` runtime families — 50 families under `container_name=lifecycle`
in the demo pull. The
+ `last_run_timestamp_seconds` / `last_run_success` gauges (BanyanDB #1167)
post-date the demo's
+ deployed build, so they were absent from that pull but are present on `main`
and emit once a migration
+ cycle runs (the showcase has since pinned the BanyanDB #1167 merge SHA, so a
redeployed demo will
+ expose them).
- **The queue model is confirmed verbatim.** `banyandb_queue_sub_*` /
`queue_pub_*` carry
`operation` ∈ {`batch-write`, `control`, `file-sync`, `query`}, plus
`group`, `remote_node`,
`remote_role` (`liaison` / `data`) and `remote_tier` (`hot` / …);
`total_latency` is a histogram. The
@@ -391,16 +557,23 @@ cluster** — the public SkyWalking demo's FODC proxy `/metrics` (2 liaison + 5
`liaison_grpc_total_started{group,method,service}`, `*_total_written{group}`,
`*_inverted_index_*{group,seg,node_type}`. Data-node metrics also carry
`node_type`, so the by-group
endpoint view can be refined by tier.
-- **One reconciliation vs. the upstream doc.** Schema/registry operations are
**not** exposed as
- `banyandb_liaison_grpc_total_registry_*` (those series do not exist on the
live cluster) — they are a
- **separate `banyandb_schema_server_grpc_*` scope** (`total_started{method}`,
`_finished`, `_latency`,
- `_err`), running on the nodes hosting the metadata/schema server. The tables
above use the
- `schema_server_grpc_*` names accordingly.
+- **Two registry/schema scopes coexist (corrected).** The live cluster exposes
**both**
+ `banyandb_liaison_grpc_total_registry_*` (`group`, `service`, `method`; on
liaison containers) **and**
+ a separate `banyandb_schema_server_grpc_*` scope (`total_started{method}`,
`_finished`, `_latency`,
+ `_err`; on the data container hosting the metadata/schema server). The
`cluster_error_rate` and
+ registry panels should pick one deliberately — they are different layers,
not aliases. (An earlier
+ draft claimed the `liaison_grpc_total_registry_*` series were absent;
BanyanDB `main` has emitted
+ them since BanyanDB #517.)
+- **`storage_retention_*` is a real data-only family** not in earlier drafts:
+ `storage_retention_{measure,stream,trace}_disk_usage_percent{service}` and
+ `_forced_retention_cooldown_seconds{service}` — the source for the
data-container retention panels.
- **Error counters are absent on a healthy cluster, by design.**
`liaison_grpc_total_err`,
- `*_total_sync_loop_err` and `queue_pub_total_err` are label-dimensioned
counters that emit no series
- until the first error — so the rules must guard each error term with `or
vector(0)`, exactly as the
- upstream "Error Rate" panel does. Their non-error siblings (`_started` /
`_finished` / `_latency` /
- `_bytes`) are all present.
+ `liaison_grpc_total_stream_msg_received_err`, `*_total_sync_loop_err` and
`queue_pub_total_err` are
+ label-dimensioned counters that emit no series until the first error. The
upstream Grafana "Error
+ Rate" panel guards each term with PromQL's `or vector(0)`; the MAL rules
need no guard — an absent
+ family is the identity for MAL's `+` (see the sketch-notation note in the
metric catalog, section 3)
+ — the summed metric simply has no series until the first error fires. Their
non-error siblings
+ (`_started` / `_finished` / `_latency` / `_bytes`) are all present.
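
The first finding above (a data pod running two containers under one `pod_name`) can be illustrated with a short sketch. It assumes the `pod_name@container_name` composition named in the dashboard tree; the sample type, helper names, and sample values are hypothetical and are not taken from the MAL rules.

```typescript
// Hypothetical subset of the identity labels carried by each sample.
type SampleLabels = { pod_name: string; container_name: string };

// Compose the instance name as pod_name@container_name.
function instanceName(labels: SampleLabels): string {
  return `${labels.pod_name}@${labels.container_name}`;
}

// Order-preserving de-duplication without relying on Set.
function distinct(values: string[]): string[] {
  const out: string[] = [];
  for (const v of values) {
    if (out.indexOf(v) === -1) out.push(v);
  }
  return out;
}

// Two samples from the same pod: the data container and its lifecycle sidecar.
const podSamples: SampleLabels[] = [
  { pod_name: "demo-banyandb-data-hot-0", container_name: "data" },
  { pod_name: "demo-banyandb-data-hot-0", container_name: "lifecycle" },
];

const byPodOnly = distinct(podSamples.map((s) => s.pod_name));
const byPodAndContainer = distinct(podSamples.map(instanceName));
```

Keying on `pod_name` alone merges the two containers into one entry, while the composite key keeps them apart as separate instances.
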
## Imported Dependencies libs and their licenses
@@ -415,10 +588,11 @@ This is a **breaking change** to the BanyanDB self-observability feature (an int
feature, not a public protocol/storage contract):
- **Entity model.** A BanyanDB cluster that previously appeared as *N*
services (one per node) now
- appears as *one* service with *N* instances. Old per-node `Service` entities
and their
- `meter_banyandb_*` / `meter_banyandb_instance_*` metric series are
superseded; the new series use the
- cluster/node/group identities and a partly new metric set. Historical data
under the old model is not
- migrated.
+ appears as *one* service with one instance **per container** (`pod_name` +
`container_name`, so a
+ data hot/warm pod yields both a `data` and a `lifecycle` instance). Old
per-node `Service` entities
+ and their `meter_banyandb_*` / `meter_banyandb_instance_*` metric series are
superseded; the new
+ series use the cluster/container/group identities and a partly new metric
set. Historical data under
+ the old model is not migrated.
- **Scrape target.** Cluster deployments must scrape the **FODC proxy
`:17913`** (single target) and
inject a `cluster` label. The legacy per-pod `:2121` collector config is
replaced. Direct per-pod
scraping is **out of scope** for this redesign (a standalone node still
reports through its FODC
@@ -430,9 +604,9 @@ feature, not a public protocol/storage contract):
Horizon UI bundle.
- **OAP rule loading** is unchanged: `enabledOtelMetricsRules` already globs
`banyandb/*`, so the new
`banyandb-endpoint.yaml` is picked up without an `application.yml` change.
-- **Horizon UI predicate enhancement is backward compatible** — `#entity.*`
only ever returned `true`
- before, so implementing it can only *add* hiding behavior to templates that
opt in; existing templates
- are unaffected.
+- **The Horizon UI entity-gate extension is backward compatible** — `neq` /
`in` are additive arms of
+ the structured `visibleWhen` union (horizon-ui #46); templates that don't
use them are unaffected,
+ and legacy free-text predicates already degrade to ungated rather than
erroring.
## General usage docs
@@ -451,17 +625,25 @@ This is a preliminary usage sketch to help reviewers; the final operator docs (r
**What the operator sees**
- A **cluster** as a single service, with cluster-wide write/query/error rates
and capacity.
-- A **node list** where each node shows its **role** (`liaison` / `data`) and
**tier**
- (`hot` / `warm` / `cold`) as attributes; selecting a node shows a dashboard
**scoped to what that node
- actually does** — ingestion/queue/publish for liaison,
storage/index/subscribe for data, refined by
- tier.
+- An **instance (container) list** where each entry shows its **container**
role
+ (`liaison` / `data` / `lifecycle`) and **tier** (`hot` / `warm` / `cold`) as
attributes; selecting one
+ shows a dashboard **scoped to what that container actually does** —
ingestion/queue/publish for
+ liaison, storage/index/subscribe/retention for data, migration cycles +
last-run time/status for
+ lifecycle, refined by tier.
- A **group list** (Endpoints) with per-group throughput, latency, storage,
index and queue health.
## Future work
-- **Topology / lifecycle.** Fuse FODC `/cluster/topology` (node inventory +
roles + tiers) and the queue
- `remote_node` / `remote_role` / `remote_tier` labels into a node-to-node
call graph, and surface FODC
- `/cluster/lifecycle` group settings (shards / segment interval / TTL) on the
Endpoint view.
+- **A MAL `SERVICE_INSTANCE_RELATION` scope for the deployment component.**
Add the third relation scope
+ (`ScopeType` + `MeterEntity` factory + `SampleFamily.instanceRelation(...)`
+ entity description,
+ mirroring the shipping `serviceRelation` / `processRelation`) so the
+ [intra-cluster instance
topology](#7-intra-cluster-instance-topology-the-deployment-component) renders
+ live instead of mock-backed, fed by the queue `remote_node` / `remote_role`
/ `remote_tier` labels
+ (verified reconstructable from the live data; BanyanDB #1167 also populates
the lifecycle migration
+ sender identity, so hot→warm→cold tier-migration edges are distinguishable).
This is MAL-engine code,
+ beyond this SWIP's config-only scope. Also
+ surface FODC `/cluster/topology` and `/cluster/lifecycle` group settings
(shards / segment interval /
+ TTL) on the Endpoint view.
- **Alerting.** Ship default alarm rules for the upstream "Key Signals to
Watch" (query p99, error rate,
disk > 85%, memory near the protector limit, sustained wqueue / `queue_pub`
backlog).
- **Direct-scrape variant** for standalone / non-FODC deployments, if demand
warrants.