Copilot commented on code in PR #13901:
URL: https://github.com/apache/skywalking/pull/13901#discussion_r3386769297
##########
docs/en/swip/SWIP-15.md:
##########
@@ -72,18 +77,18 @@ tier** — including the one Horizon UI enhancement that
makes attribute-driven
──────────────── ─────────────
──────────
┌─ liaison node ─┐ FODC agent ┐
BANYANDB layer
│ :2121 /metrics │ (sidecar) │
┌───────────────────────┐
- └────────────────┘ │ │ Root:
cluster list │
- ┌─ data hot ─────┐ FODC agent ├─► FODC proxy ──► OTel Collector ──► │
Service: cluster KPIs │
- │ :2121 /metrics │ (sidecar) │ :17913 (prometheus recv, │
Instance: node, panels│
- └────────────────┘ │ /metrics adds `cluster` │
adapt to role/tier │
+ └────────────────┘ │ │
Root: cluster list │
+ ┌─ data hot ─────┐ FODC agent ├─► FODC proxy ──► OTel Collector ──► │
Service: cluster KPIs │
+ │ :2121 /metrics │ (sidecar) │ :17913 (prometheus recv, │
Instance: container, │
+ └────────────────┘ │ /metrics adds `cluster` │
adapts to role/tier │
┌─ data warm ────┐ FODC agent │ single target, label) ──OTLP──► │
Endpoint: group │
│ :2121 /metrics │ (sidecar) │ per-node labels │
└───────────────────────┘
Review Comment:
The architecture diagram still says the proxy stamps "per-node labels", but
this SWIP now defines the metrics-emitter identity at the container level
(pod_name + container_name). The diagram should match that terminology to avoid
confusion.
##########
docs/en/swip/SWIP-15.md:
##########
@@ -100,41 +105,62 @@ identity.
### 1. Entity model
-| SkyWalking entity | BanyanDB concept
| Identity source (label) |
-| ---------------------------- | ---------------------------------------------
| --------------------------------------- |
-| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**
| `cluster` (injected by the collector) |
-| `ServiceInstance` | one **node**
| `pod_name` (e.g. `banyandb-data-hot-0`) |
-| ↳ attribute `node_role` | node **role**
| `container_name` (`liaison` / `data`) |
-| ↳ attribute `node_type` | data-node **tier**
| `node_type` (`hot` / `warm` / `cold`) |
-| `Endpoint` | one **group** (storage partition)
| `group` (`measure-default`, …) |
-
-A standalone BanyanDB is the degenerate case: one cluster, one node whose role
is `standalone` (all
-roles co-resident) and no tier.
-
-**Why role/tier are instance attributes, not separate services or endpoints.**
A node's identity is
-its `pod_name`; its role and tier are *properties of that node*, which is
exactly what
-`InstanceTraffic.properties` (the UI "Attributes" panel) is for. Keeping the
cluster as the single
-service means the node list, the group list, and cluster-wide KPIs all live
under one entity the
-operator can reason about — and it makes the node dashboard able to adapt to
the selected node's
-attributes.
+| SkyWalking entity | BanyanDB concept |
Identity source (label) |
+| ---------------------------- | ----------------------------------------- |
------------------------------------------------------ |
+| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster** |
`cluster` (injected by the collector) |
+| `ServiceInstance` | one **container** on a node |
`pod_name` + `container_name` (composite) |
+| ↳ attribute `container_name` | container **role**
(discriminator) | `liaison` / `data` / `lifecycle` |
+| ↳ attribute `node_type` | data-node **tier** |
`hot` / `warm` / `cold` (absent off data) |
Review Comment:
In the entity model table, the note for `node_type` says "absent off data",
which reads as the opposite of what is described later (node_type is present on
data containers only, and the illustrative MAL closure defaults it). This
should be clarified to avoid misinterpreting when the attribute exists/what
values it can take.
##########
docs/en/swip/SWIP-15.md:
##########
@@ -218,26 +282,44 @@ BanyanDB `origin/main` (the base of the upstream
observability PR).
| `gc_pause_avg` | `rate(go_gc_duration_seconds_sum) /
rate(go_gc_duration_seconds_count)` |
| `heap_inuse` / `heap_next_gc` / `alloc_rate` |
`go_memstats_heap_inuse_bytes` / `go_memstats_next_gc_bytes` /
`rate(go_memstats_alloc_bytes_total)` |
-**Liaison-only** (front door; hidden on data nodes — see
[§4](#4-dynamic-metrics-by-role-and-tier)):
+**Liaison-only** (front door; hidden on data containers — see [dynamic metrics
by role and tier](#4-dynamic-metrics-by-role-and-tier)):
| Metric (`meter_banyandb_instance_*`) | Source
|
| ------------------------------------- |
----------------------------------------------------------------------- |
| `query_rate_by_service` |
`rate(liaison_grpc_total_started{method='query'}) by (service)` |
-| `grpc_error_rate` | `rate(liaison_grpc_total_err) by
(service, method)` (+ `_stream_msg_received_err`; both lazily registered) |
+| `grpc_error_rate` | `rate(liaison_grpc_total_err) by
(service, method)` (+ `liaison_grpc_total_stream_msg_received_err`; both lazily
registered) |
| `non_query_op_rate` |
`rate(liaison_grpc_total_started{method!='query'}) by (method)` |
| `write_rate` |
`rate({measure,stream_tst,trace_tst}_total_written)` |
| `publish_throughput` / `publish_latency_p99` |
`rate(queue_pub_total_finished) by (operation)` / `histogram_quantile(0.99,
…queue_pub_total_latency_bucket)` |
| `wqueue_file_parts` / `wqueue_mem_part` / `wqueue_pending` |
`{measure,stream_tst,trace_tst}_total_file_parts` / `_total_mem_part` /
`_pending_data_count` |
-**Data-only** (backend; hidden on liaison nodes):
+**Data-only** (backend; hidden on liaison containers):
| Metric (`meter_banyandb_instance_*`) | Source
|
| ----------------------------------------------- |
------------------------------------------------------------------ |
| `total_data` |
`{measure,stream_tst,trace_tst}_total_file_elements` |
| `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` |
`rate(*_total_merge_loop_started)` / `…_merge_latency{type='file'}` /
`…_merged_parts{type='file'}` |
-| `series_write_rate` / `series_term_search_rate` / `total_series` |
`measure_inverted_index_total_updates` / `_term_searchers_started` /
`_doc_count`; `stream_storage_inverted_index_*` |
+| `series_write_rate` / `series_term_search_rate` / `total_series` |
`measure_inverted_index_total_updates` / `_total_term_searchers_started` /
`_total_doc_count`; `stream_storage_inverted_index_*` |
| `stream_tst_write_rate` / `stream_tst_term_search_rate` /
`stream_tst_total_docs` | `stream_tst_inverted_index_*` |
| `queue_sub_throughput` / `queue_sub_latency_p99` (per `operation`) |
`rate(queue_sub_total_started/finished) by (operation)` /
`histogram_quantile(0.99, …queue_sub_total_latency_bucket) by (operation)` |
+| `retention_disk_usage_percent` / `retention_cooldown` |
`storage_retention_{measure,stream,trace}_disk_usage_percent` /
`_forced_retention_cooldown_seconds` |
+
+**Lifecycle-only** (the tier-migration sidecar co-located on `hot`/`warm` data
pods; `container_name == 'lifecycle'`):
+
+| Metric (`meter_banyandb_instance_*`) | Source
|
+| ------------------------------------ |
------------------------------------------------------------------ |
+| `lifecycle_cycles` | `banyandb_lifecycle_cycles_total`
(cumulative migration cycles) |
+| `lifecycle_last_run` |
`banyandb_lifecycle_last_run_timestamp_seconds` — epoch of the last cycle's
start; "time since last sync" = `time() - <metric>`, computed at ingest in the
MAL rule (MQE has no `time()`) |
+| `lifecycle_last_run_success` | `banyandb_lifecycle_last_run_success`
(`1` = last cycle OK, `0` = failed) |
Review Comment:
This section says the catalog tables drop the common `banyandb_` prefix in
sketch expressions, but the lifecycle-only rows use the fully-prefixed family
names. Align these with the rest of the sketch notation (the build-critical
note already covers that MAL must use the fully-prefixed names).
##########
docs/en/swip/SWIP-15.md:
##########
@@ -168,42 +194,80 @@ filter: "{ tags -> tags.job_name == 'banyandb-monitoring'
}"
# banyandb-service.yaml → cluster
expSuffix: service(['cluster'], Layer.BANYANDB)
-# banyandb-instance.yaml → node, with role + tier as attributes
+# banyandb-instance.yaml → container (a node may run >1 container), role +
tier as attributes
expSuffix: |-
service(['cluster'], Layer.BANYANDB)
- .instance(['cluster'], '::', ['pod_name'], '', Layer.BANYANDB,
- { tags -> ['node_role': tags.container_name, 'node_type':
tags.node_type ?: 'n/a'] })
+ .instance(['cluster'], '::', ['pod_name', 'container_name'], '@',
Layer.BANYANDB,
+ { tags -> ['node_role': tags.node_role,
+ 'node_type': tags.node_type ?: 'n/a',
+ 'pod_name': tags.pod_name,
+ 'container_name': tags.container_name] })
# banyandb-endpoint.yaml → group
expSuffix: endpoint(['cluster'], ['group'], Layer.BANYANDB)
```
-The 6-argument `.instance(...)` overload's properties closure is the standard,
precedented mechanism for
+The instance key is the pair `['pod_name', 'container_name']` joined by `'@'`
(signature
+`instance(serviceKeys, serviceDelimiter, instanceKeys, instanceDelimiter,
layer, propertiesExtractor)`),
+so the four `data` hot/warm pods surface as distinct `…@data` and
`…@lifecycle` instances rather than
+colliding. The 6-argument overload's properties closure is the standard,
precedented mechanism for
attaching labels as instance attributes (the same shape used by
`k8s-instance.yaml`). The attributes
-ride entirely on the scraped labels — no separate update API.
+ride entirely on the scraped labels — no separate update API. (Two
implementation notes: the MAL v2
+grammar supports the Elvis operator inside a map-literal value, but no shipped
rule combines the two
+yet — the implementation PR should pin this exact closure shape with a compile
test. And `language` is
+the one reserved property key — the instance query maps it to the language
field instead of an
+attribute; none of these four labels collides with it.)
### 3. Metric catalog → MAL rules
The redesigned rules mirror the upstream FODC-proxy catalog. The two upstream
Grafana boards map onto
-two SkyWalking scopes — **Nodes → Instance** (per `pod_name`), **Workload →
Endpoint** (per `group`) —
-plus a small **Service** summary for cluster KPIs. Source metric names below
are verified against
-BanyanDB `origin/main` (the base of the upstream observability PR).
+two SkyWalking scopes — **Nodes → Instance** (per `pod_name` +
`container_name`), **Workload →
+Endpoint** (per `group`) — plus a small **Service** summary for cluster KPIs.
Source metric names
+below are verified against the **live demo cluster** — which runs upstream
`main` builds (the
+validation pull used the showcase-pinned `main` image of 2026-06-09) — and
against BanyanDB
+`origin/main` source. The upstream observability PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) (open; docs
and Grafana dashboards
+only, no metric code) documents the same catalog and defines the two boards
this design mirrors.
+
+> **Metric-name prefix (build-critical).** The sketches below drop a common
prefix for readability.
+> On the wire **every BanyanDB-native family carries the `banyandb_` prefix**
(`banyandb_measure_total_written`,
+> `banyandb_liaison_grpc_total_started`, `banyandb_system_disk`, …) — the MAL
rules must use the full
+> prefixed name. The **only** exceptions are the standard Go-runtime and
process exporter families
+> `go_*` / `process_*`, which are **bare** (no prefix) and are referenced
as-is. Every error counter
+> this catalog references is lazily registered and emits nothing until the
first error fires
+> (`banyandb_liaison_grpc_total_err`,
`banyandb_liaison_grpc_total_stream_msg_received_err`,
+> `banyandb_queue_pub_total_err`, the `*_total_sync_loop_err` family), and the
lifecycle last-run
+> gauges (`banyandb_lifecycle_last_run_*`, BanyanDB #1167) post-date the build
the demo pull validated;
+> every other cited family was present in that pull.
+>
+> **Sketch notation (PromQL-flavored).** Source expressions are written
PromQL-style for readability;
+> the MAL forms differ mechanically. **(1)** No `or vector(0)` guard exists in
MAL — nor is one
+> needed: an unfired family resolves to the empty sample family, MAL's `+`
treats an empty operand as
+> identity, and a rule is skipped only when *all* referenced families are
absent — so an error sum
+> emits as soon as any one term fires, and a fully healthy cluster shows no
series at all (dashboards
+> should render absent as 0). **(2)** MAL arithmetic joins samples on exact
label equality, so each
+> term must be aggregated to the same label set (e.g. `.sum(['cluster'])`)
before `+`. **(3)**
+> `count(...) by (...)` maps to MAL's multi-label `count([...])`;
`histogram_quantile(0.99, …_bucket)`
+> maps to `.histogram().histogram_percentile([99])` on the `le`-labeled base
family (no `_bucket`
+> suffix remains after OTLP conversion); and `time() - <metric>` is computed
at **ingest** in the MAL
+> rule — MAL ships `time()` (the shipped `envoy-ca.yaml` cert-staleness metric
is the precedent),
+> while MQE has no current-time function, so it cannot be computed at query
time.
#### 3.1 Service scope — cluster summary (`banyandb-service.yaml`)
| Metric (`meter_banyandb_*`) | Meaning | Source expression
(sketch) |
| --------------------------- | ------------------------ |
------------------------------------------------------------------------------------------
|
| `cluster_write_rate` | cluster writes/s |
`rate(measure_total_written) + rate(stream_tst_total_written) +
rate(trace_tst_total_written)` |
| `cluster_query_rate` | cluster queries/s |
`rate(liaison_grpc_total_started{method='query'})`
|
-| `cluster_error_rate` | cluster errors/min |
`liaison_grpc_total_err + _stream_msg_received_err +
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err`
(×60, each `or vector(0)`) |
-| `reporting_nodes` | live node count by role |
`count(system_up_time) by (container_name)`
|
+| `cluster_error_rate` | cluster errors/min |
`liaison_grpc_total_err + liaison_grpc_total_stream_msg_received_err +
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err`
(×60; all lazily registered — see sketch notation above) |
+| `reporting_nodes` | live container count by role |
`count(system_up_time) by (container_name)`
|
Review Comment:
The metric name `reporting_nodes` now represents a count of reporting
**containers** (instances) by role. Renaming it in the doc to
`reporting_instances` (or similar) would better match the entity model and
avoid implying a 1:1 node↔instance mapping.
##########
docs/en/swip/SWIP-15.md:
##########
@@ -262,98 +344,168 @@ nodes per group):
### 4. Dynamic metrics by role and tier
-Different roles expose different metrics, so the **node (Instance) dashboard
must adapt to the selected
-node**. Two mechanisms, layered:
+Different roles expose different metrics, so the **instance dashboard must
adapt to the selected
+container**. Horizon UI's widget `visibleWhen` is a structured,
**server-evaluated** gate (the BFF
+resolves it against data presence or the selected instance's attributes and
returns gated-out widgets
+as hidden; legacy free-text predicate strings are no longer parsed and degrade
to ungated). Two gate
+kinds, layered:
-**(a) Data-presence gating — available today, no UI code.** Horizon UI already
supports
-`visibleWhen: "<metric> has value"` on a widget; a panel whose metric returns
all-null self-hides. Each
-MAL rule only produces samples for nodes that emit its source metric, so
liaison-only metrics are simply
-absent on data instances and vice-versa. This gives correct adaptive behavior
out of the box:
+**(a) Data-presence gating — available today, no UI code.** The `mqe`-kind
gate hides a widget whose
+expression returns no data. Each MAL rule only produces samples for containers
that emit its source
+metric, so liaison-only metrics are simply absent on data instances and
vice-versa. This gives correct
+adaptive behavior out of the box:
```jsonc
{ "id": "wqueue", "title": "Write Queue (wqueue)", "type": "line",
"expressions": ["meter_banyandb_instance_wqueue_pending"],
- "visibleWhen": "meter_banyandb_instance_wqueue_pending has value" }
+ "visibleWhen": { "kind": "mqe", "expression":
"meter_banyandb_instance_wqueue_pending", "op": "exists" } }
```
-**(b) Attribute predicate — proposed enhancement (see
[§6](#6-horizon-ui-enhancement-entity-attribute-predicate)).**
+**(b) Attribute gating — equality ships today; membership is the proposed
extension (see
+[entity-gate membership
operators](#6-horizon-ui-enhancement-entity-gate-membership-operators)).**
Data-presence can't distinguish "wrong role" from "idle but right role", and
it still issues the query.
-An attribute predicate keys panel visibility directly on the node's
`node_role` / `node_type`
-attributes:
+The `entity`-kind gate keys panel visibility directly on the selected
instance's `container_name` /
+`node_type` attributes (meaningful on the Instance scope only):
```jsonc
-{ "id": "wqueue", "visibleWhen": "#entity.node_role == 'liaison'" }
-{ "id": "cold_tier_note", "visibleWhen": "#entity.node_type == 'cold'" }
+{ "id": "wqueue", "visibleWhen": { "kind": "entity", "attribute":
"container_name", "op": "eq", "value": "liaison" } }
+{ "id": "cold_tier_note", "visibleWhen": { "kind": "entity", "attribute":
"node_type", "op": "eq", "value": "cold" } }
```
This is the precise, declarative form, and it is the natural way to express
tier-specific panels (a
-`hot` data node merges constantly; a `cold` node is mostly static).
+`hot` data container merges constantly; a `cold` container is mostly static).
The landed gate supports
+`exists` and case-insensitive `eq`; tier *sets* need the proposed `in`
operator — until it lands they
+are expressible as duplicated `eq`-gated widget variants.
Role/tier scoping of the catalog:
-| Bucket | Panels
| Predicate |
-| --------------- |
--------------------------------------------------------------------- |
--------------------------------- |
-| **All roles** | system resources, disk-by-path, network, Go runtime, node
uptime | (always shown) |
-| **Liaison** | gRPC query & errors, non-query ops, write rate, publish
throughput & latency, wqueue depth | `#entity.node_role == 'liaison'` |
-| **Data** | storage totals, merge/compaction, inverted index,
subscribe queue | `#entity.node_role == 'data'` |
-| **Data + tier** | tier-specific merge/retention hints
| `#entity.node_type in (hot,warm)` |
+| Bucket | Panels
| Entity gate |
+| --------------- |
--------------------------------------------------------------------- |
---------------------------------- |
+| **All roles** | system resources, disk-by-path, network, Go runtime, node
uptime | (always shown) |
+| **Liaison** | gRPC query & errors, non-query ops, write rate, publish
throughput & latency, wqueue depth | `container_name eq liaison` |
+| **Data** | storage totals, merge/compaction, inverted index,
subscribe queue, retention | `container_name eq data` |
+| **Data + tier** | tier-specific merge/retention hints
| `node_type in (hot, warm)` † |
+| **Lifecycle** | migration cycles, last-run time + status
| `container_name eq lifecycle` |
+
+† `in` is the proposed extension of [section
6](#6-horizon-ui-enhancement-entity-gate-membership-operators);
+until it lands, two `eq`-gated widget variants.
### 5. Dashboards (Horizon UI BANYANDB layer template)
A net-new layer template `apps/bff/src/bundled_templates/layers/banyandb.json`
(config-driven JSON, one
-file per layer, per-scope widget arrays, MQE expression strings). The design
mirrors the upstream two
-boards across the SkyWalking hierarchy:
+file per layer keyed by its `key` field — `BANYANDB`, filename lowercased —
with per-scope widget
+arrays and MQE expression strings). One menu touchpoint exists: Horizon UI
currently hard-codes the
+`BANYANDB` layer out of the sidebar (`HIDDEN_LAYERS`);
+[horizon-ui #47](https://github.com/apache/skywalking-horizon-ui/pull/47)
replaces that with a
+config-driven `layers.excluded` list that un-hides BanyanDB — this SWIP rides
on that change (or an
+equivalent one-line un-hide). The design mirrors the upstream two boards
across the SkyWalking
+hierarchy:
```
BANYANDB layer
-├─ Root → cluster list (ServiceList), showGroup=false
+├─ Root → cluster list (the layer landing's service-list picker:
header columns + sort)
├─ Service (cluster)
│ └─ Overview KPIs + "Cluster Workload Summary" + "Fleet Overview" capacity
│ (cluster_write_rate, cluster_query_rate, cluster_error_rate,
│ reporting_nodes by role, total_cpu/memory/disk)
-├─ Instance (node) ← the "Nodes" board, made dynamic
+├─ Instance (container) ← the "Nodes" board, made dynamic; instance =
pod_name@container_name
Review Comment:
This dashboard tree still refers to `reporting_nodes`, but the SWIP now
counts reporting containers/instances by role. Renaming this reference to match
the metric name/semantics will keep the doc consistent.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]