Copilot commented on code in PR #13901:
URL: https://github.com/apache/skywalking/pull/13901#discussion_r3386769297


##########
docs/en/swip/SWIP-15.md:
##########
@@ -72,18 +77,18 @@ tier** — including the one Horizon UI enhancement that 
makes attribute-driven
  ────────────────                         ─────────────                        
 ──────────
  ┌─ liaison node ─┐  FODC agent ┐                                       
BANYANDB layer
  │  :2121 /metrics │ (sidecar)  │                                       
┌───────────────────────┐
- └────────────────┘            │                                       │ Root: 
cluster list    │
- ┌─ data hot ─────┐  FODC agent ├─► FODC proxy ──► OTel Collector ──►  │ 
Service: cluster KPIs │
- │  :2121 /metrics │ (sidecar)  │   :17913          (prometheus recv,   │ 
Instance: node, panels│
- └────────────────┘            │   /metrics         adds `cluster`      │   
adapt to role/tier  │
+ └────────────────┘            │                                        │ 
Root: cluster list    │
+ ┌─ data hot ─────┐  FODC agent ├─► FODC proxy ──► OTel Collector ──►   │ 
Service: cluster KPIs │
+ │  :2121 /metrics │ (sidecar)  │   :17913          (prometheus recv,   │ 
Instance: container,  │
+ └────────────────┘            │   /metrics         adds `cluster`      │   
adapts to role/tier │
  ┌─ data warm ────┐  FODC agent │   single target,  label) ──OTLP──►    │ 
Endpoint: group       │
  │  :2121 /metrics │ (sidecar)  │   per-node labels      │              
└───────────────────────┘

Review Comment:
   The architecture diagram still says the proxy stamps "per-node labels", but 
this SWIP now defines the metrics-emitter identity at the container level 
(pod_name + container_name). The diagram should match that terminology to avoid 
confusion.



##########
docs/en/swip/SWIP-15.md:
##########
@@ -100,41 +105,62 @@ identity.
 
 ### 1. Entity model
 
-| SkyWalking entity            | BanyanDB concept                              
| Identity source (label)                 |
-| ---------------------------- | --------------------------------------------- 
| --------------------------------------- |
-| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**                      
| `cluster` (injected by the collector)   |
-| `ServiceInstance`            | one **node**                                  
| `pod_name` (e.g. `banyandb-data-hot-0`) |
-|   ↳ attribute `node_role` | node **role**                          
| `container_name` (`liaison` / `data`)   |
-|   ↳ attribute `node_type` | data-node **tier**                     
| `node_type` (`hot` / `warm` / `cold`)   |
-| `Endpoint`                   | one **group** (storage partition)             
| `group` (`measure-default`, …)          |
-
-A standalone BanyanDB is the degenerate case: one cluster, one node whose role 
is `standalone` (all
-roles co-resident) and no tier.
-
-**Why role/tier are instance attributes, not separate services or endpoints.** 
A node's identity is
-its `pod_name`; its role and tier are *properties of that node*, which is 
exactly what
-`InstanceTraffic.properties` (the UI "Attributes" panel) is for. Keeping the 
cluster as the single
-service means the node list, the group list, and cluster-wide KPIs all live 
under one entity the
-operator can reason about — and it makes the node dashboard able to adapt to 
the selected node's
-attributes.
+| SkyWalking entity            | BanyanDB concept                          | 
Identity source (label)                                |
+| ---------------------------- | ----------------------------------------- | 
------------------------------------------------------ |
+| `Service` (Layer `BANYANDB`) | one BanyanDB **cluster**                  | 
`cluster` (injected by the collector)                  |
+| `ServiceInstance`            | one **container** on a node               | 
`pod_name` + `container_name` (composite)              |
+|   ↳ attribute `container_name` | container **role** 
(discriminator) | `liaison` / `data` / `lifecycle`                  |
+|   ↳ attribute `node_type` | data-node **tier**                 | 
`hot` / `warm` / `cold` (absent off data)              |

Review Comment:
   In the entity model table, the note for `node_type` says "absent off data", 
which reads as the opposite of what is described later (node_type is present on 
data containers only, and the illustrative MAL closure defaults it). This 
should be clarified to avoid misinterpreting when the attribute exists/what 
values it can take.



##########
docs/en/swip/SWIP-15.md:
##########
@@ -218,26 +282,44 @@ BanyanDB `origin/main` (the base of the upstream 
observability PR).
 | `gc_pause_avg`                      | `rate(go_gc_duration_seconds_sum) / 
rate(go_gc_duration_seconds_count)` |
 | `heap_inuse` / `heap_next_gc` / `alloc_rate` | 
`go_memstats_heap_inuse_bytes` / `go_memstats_next_gc_bytes` / 
`rate(go_memstats_alloc_bytes_total)` |
 
-**Liaison-only** (front door; hidden on data nodes — see 
[§4](#4-dynamic-metrics-by-role-and-tier)):
+**Liaison-only** (front door; hidden on data containers — see [dynamic metrics 
by role and tier](#4-dynamic-metrics-by-role-and-tier)):
 
 | Metric (`meter_banyandb_instance_*`)  | Source                               
                                   |
 | ------------------------------------- | 
----------------------------------------------------------------------- |
 | `query_rate_by_service`               | 
`rate(liaison_grpc_total_started{method='query'}) by (service)`         |
-| `grpc_error_rate`                     | `rate(liaison_grpc_total_err) by 
(service, method)` (+ `_stream_msg_received_err`; both lazily registered) |
+| `grpc_error_rate`                     | `rate(liaison_grpc_total_err) by 
(service, method)` (+ `liaison_grpc_total_stream_msg_received_err`; both lazily 
registered) |
 | `non_query_op_rate`                   | 
`rate(liaison_grpc_total_started{method!='query'}) by (method)` |
 | `write_rate`                          | 
`rate({measure,stream_tst,trace_tst}_total_written)`                    |
 | `publish_throughput` / `publish_latency_p99` | 
`rate(queue_pub_total_finished) by (operation)` / `histogram_quantile(0.99, 
…queue_pub_total_latency_bucket)` |
 | `wqueue_file_parts` / `wqueue_mem_part` / `wqueue_pending` | 
`{measure,stream_tst,trace_tst}_total_file_parts` / `_total_mem_part` / 
`_pending_data_count` |
 
-**Data-only** (backend; hidden on liaison nodes):
+**Data-only** (backend; hidden on liaison containers):
 
 | Metric (`meter_banyandb_instance_*`)            | Source                     
                                         |
 | ----------------------------------------------- | 
------------------------------------------------------------------ |
 | `total_data`                                    | 
`{measure,stream_tst,trace_tst}_total_file_elements`               |
 | `merge_file_rate` / `merge_file_latency` / `merge_file_partitions` | 
`rate(*_total_merge_loop_started)` / `…_merge_latency{type='file'}` / 
`…_merged_parts{type='file'}` |
-| `series_write_rate` / `series_term_search_rate` / `total_series` | 
`measure_inverted_index_total_updates` / `_term_searchers_started` / 
`_doc_count`; `stream_storage_inverted_index_*` |
+| `series_write_rate` / `series_term_search_rate` / `total_series` | 
`measure_inverted_index_total_updates` / `_total_term_searchers_started` / 
`_total_doc_count`; `stream_storage_inverted_index_*` |
 | `stream_tst_write_rate` / `stream_tst_term_search_rate` / 
`stream_tst_total_docs` | `stream_tst_inverted_index_*` |
 | `queue_sub_throughput` / `queue_sub_latency_p99` (per `operation`) | 
`rate(queue_sub_total_started/finished) by (operation)` / 
`histogram_quantile(0.99, …queue_sub_total_latency_bucket) by (operation)` |
+| `retention_disk_usage_percent` / `retention_cooldown` | 
`storage_retention_{measure,stream,trace}_disk_usage_percent` / 
`_forced_retention_cooldown_seconds` |
+
+**Lifecycle-only** (the tier-migration sidecar co-located on `hot`/`warm` data 
pods; `container_name == 'lifecycle'`):
+
+| Metric (`meter_banyandb_instance_*`) | Source                                
                              |
+| ------------------------------------ | 
------------------------------------------------------------------ |
+| `lifecycle_cycles`                   | `banyandb_lifecycle_cycles_total` 
(cumulative migration cycles)    |
+| `lifecycle_last_run`                 | 
`banyandb_lifecycle_last_run_timestamp_seconds` — epoch of the last cycle's 
start; "time since last sync" = `time() - <metric>`, computed at ingest in the 
MAL rule (MQE has no `time()`) |
+| `lifecycle_last_run_success`         | `banyandb_lifecycle_last_run_success` 
(`1` = last cycle OK, `0` = failed) |

Review Comment:
   This section says the catalog tables drop the common `banyandb_` prefix in 
sketch expressions, but the lifecycle-only rows use the fully-prefixed family 
names. Align these with the rest of the sketch notation (the build-critical 
note already covers that MAL must use the fully-prefixed names).



##########
docs/en/swip/SWIP-15.md:
##########
@@ -168,42 +194,80 @@ filter: "{ tags -> tags.job_name == 'banyandb-monitoring' 
}"
 # banyandb-service.yaml  → cluster
 expSuffix: service(['cluster'], Layer.BANYANDB)
 
-# banyandb-instance.yaml → node, with role + tier as attributes
+# banyandb-instance.yaml → container (a node may run >1 container), role + 
tier as attributes
 expSuffix: |-
   service(['cluster'], Layer.BANYANDB)
-  .instance(['cluster'], '::', ['pod_name'], '', Layer.BANYANDB,
-            { tags -> ['node_role': tags.container_name, 'node_type': 
tags.node_type ?: 'n/a'] })
+  .instance(['cluster'], '::', ['pod_name', 'container_name'], '@', 
Layer.BANYANDB,
+            { tags -> ['node_role':      tags.node_role,
+                       'node_type':      tags.node_type ?: 'n/a',
+                       'pod_name':       tags.pod_name,
+                       'container_name': tags.container_name] })
 
 # banyandb-endpoint.yaml → group
 expSuffix: endpoint(['cluster'], ['group'], Layer.BANYANDB)
 ```
 
-The 6-argument `.instance(...)` overload's properties closure is the standard, 
precedented mechanism for
+The instance key is the pair `['pod_name', 'container_name']` joined by `'@'` 
(signature
+`instance(serviceKeys, serviceDelimiter, instanceKeys, instanceDelimiter, 
layer, propertiesExtractor)`),
+so the four `data` hot/warm pods surface as distinct `…@data` and 
`…@lifecycle` instances rather than
+colliding. The 6-argument overload's properties closure is the standard, 
precedented mechanism for
 attaching labels as instance attributes (the same shape used by 
`k8s-instance.yaml`). The attributes
-ride entirely on the scraped labels — no separate update API.
+ride entirely on the scraped labels — no separate update API. (Two 
implementation notes: the MAL v2
+grammar supports the Elvis operator inside a map-literal value, but no shipped 
rule combines the two
+yet — the implementation PR should pin this exact closure shape with a compile 
test. And `language` is
+the one reserved property key — the instance query maps it to the language 
field instead of an
+attribute; none of these four labels collides with it.)
 
 ### 3. Metric catalog → MAL rules
 
 The redesigned rules mirror the upstream FODC-proxy catalog. The two upstream 
Grafana boards map onto
-two SkyWalking scopes — **Nodes → Instance** (per `pod_name`), **Workload → 
Endpoint** (per `group`) —
-plus a small **Service** summary for cluster KPIs. Source metric names below 
are verified against
-BanyanDB `origin/main` (the base of the upstream observability PR).
+two SkyWalking scopes — **Nodes → Instance** (per `pod_name` + 
`container_name`), **Workload →
+Endpoint** (per `group`) — plus a small **Service** summary for cluster KPIs. 
Source metric names
+below are verified against the **live demo cluster** — which runs upstream 
`main` builds (the
+validation pull used the showcase-pinned `main` image of 2026-06-09) — and 
against BanyanDB
+`origin/main` source. The upstream observability PR
+[#1159](https://github.com/apache/skywalking-banyandb/pull/1159) (open; docs 
and Grafana dashboards
+only, no metric code) documents the same catalog and defines the two boards 
this design mirrors.
+
+> **Metric-name prefix (build-critical).** The sketches below drop a common 
prefix for readability.
+> On the wire **every BanyanDB-native family carries the `banyandb_` prefix** 
(`banyandb_measure_total_written`,
+> `banyandb_liaison_grpc_total_started`, `banyandb_system_disk`, …) — the MAL 
rules must use the full
+> prefixed name. The **only** exceptions are the standard Go-runtime and 
process exporter families
+> `go_*` / `process_*`, which are **bare** (no prefix) and are referenced 
as-is. Every error counter
+> this catalog references is lazily registered and emits nothing until the 
first error fires
+> (`banyandb_liaison_grpc_total_err`, 
`banyandb_liaison_grpc_total_stream_msg_received_err`,
+> `banyandb_queue_pub_total_err`, the `*_total_sync_loop_err` family), and the 
lifecycle last-run
+> gauges (`banyandb_lifecycle_last_run_*`, BanyanDB #1167) post-date the build 
the demo pull validated;
+> every other cited family was present in that pull.
+>
+> **Sketch notation (PromQL-flavored).** Source expressions are written 
PromQL-style for readability;
+> the MAL forms differ mechanically. **(1)** No `or vector(0)` guard exists in 
MAL — nor is one
+> needed: an unfired family resolves to the empty sample family, MAL's `+` 
treats an empty operand as
+> identity, and a rule is skipped only when *all* referenced families are 
absent — so an error sum
+> emits as soon as any one term fires, and a fully healthy cluster shows no 
series at all (dashboards
+> should render absent as 0). **(2)** MAL arithmetic joins samples on exact 
label equality, so each
+> term must be aggregated to the same label set (e.g. `.sum(['cluster'])`) 
before `+`. **(3)**
+> `count(...) by (...)` maps to MAL's multi-label `count([...])`; 
`histogram_quantile(0.99, …_bucket)`
+> maps to `.histogram().histogram_percentile([99])` on the `le`-labeled base 
family (no `_bucket`
+> suffix remains after OTLP conversion); and `time() - <metric>` is computed 
at **ingest** in the MAL
+> rule — MAL ships `time()` (the shipped `envoy-ca.yaml` cert-staleness metric 
is the precedent),
+> while MQE has no current-time function, so it cannot be computed at query 
time.
 
 #### 3.1 Service scope — cluster summary (`banyandb-service.yaml`)
 
 | Metric (`meter_banyandb_*`) | Meaning                  | Source expression 
(sketch)                                                                 |
 | --------------------------- | ------------------------ | 
------------------------------------------------------------------------------------------
 |
 | `cluster_write_rate`        | cluster writes/s         | 
`rate(measure_total_written) + rate(stream_tst_total_written) + 
rate(trace_tst_total_written)` |
 | `cluster_query_rate`        | cluster queries/s        | 
`rate(liaison_grpc_total_started{method='query'})`                              
            |
-| `cluster_error_rate`        | cluster errors/min       | 
`liaison_grpc_total_err + _stream_msg_received_err + 
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err` 
(×60, each `or vector(0)`) |
-| `reporting_nodes`           | live node count by role  | 
`count(system_up_time) by (container_name)`                                     
            |
+| `cluster_error_rate`        | cluster errors/min       | 
`liaison_grpc_total_err + liaison_grpc_total_stream_msg_received_err + 
schema_server_grpc_total_err + queue_pub_total_err + Σ *_total_sync_loop_err` 
(×60; all lazily registered — see sketch notation above) |
+| `reporting_nodes`           | live container count by role | 
`count(system_up_time) by (container_name)`                                     
         |

Review Comment:
   The metric name `reporting_nodes` now represents a count of reporting 
**containers** (instances) by role. Renaming it in the doc to 
`reporting_instances` (or similar) would better match the entity model and 
avoid implying a 1:1 node↔instance mapping.



##########
docs/en/swip/SWIP-15.md:
##########
@@ -262,98 +344,168 @@ nodes per group):
 
 ### 4. Dynamic metrics by role and tier
 
-Different roles expose different metrics, so the **node (Instance) dashboard 
must adapt to the selected
-node**. Two mechanisms, layered:
+Different roles expose different metrics, so the **instance dashboard must 
adapt to the selected
+container**. Horizon UI's widget `visibleWhen` is a structured, 
**server-evaluated** gate (the BFF
+resolves it against data presence or the selected instance's attributes and 
returns gated-out widgets
+as hidden; legacy free-text predicate strings are no longer parsed and degrade 
to ungated). Two gate
+kinds, layered:
 
-**(a) Data-presence gating — available today, no UI code.** Horizon UI already 
supports
-`visibleWhen: "<metric> has value"` on a widget; a panel whose metric returns 
all-null self-hides. Each
-MAL rule only produces samples for nodes that emit its source metric, so 
liaison-only metrics are simply
-absent on data instances and vice-versa. This gives correct adaptive behavior 
out of the box:
+**(a) Data-presence gating — available today, no UI code.** The `mqe`-kind 
gate hides a widget whose
+expression returns no data. Each MAL rule only produces samples for containers 
that emit its source
+metric, so liaison-only metrics are simply absent on data instances and 
vice-versa. This gives correct
+adaptive behavior out of the box:
 
 ```jsonc
 { "id": "wqueue", "title": "Write Queue (wqueue)", "type": "line",
   "expressions": ["meter_banyandb_instance_wqueue_pending"],
-  "visibleWhen": "meter_banyandb_instance_wqueue_pending has value" }
+  "visibleWhen": { "kind": "mqe", "expression": 
"meter_banyandb_instance_wqueue_pending", "op": "exists" } }
 ```
 
-**(b) Attribute predicate — proposed enhancement (see 
[§6](#6-horizon-ui-enhancement-entity-attribute-predicate)).**
+**(b) Attribute gating — equality ships today; membership is the proposed 
extension (see
+[entity-gate membership 
operators](#6-horizon-ui-enhancement-entity-gate-membership-operators)).**
 Data-presence can't distinguish "wrong role" from "idle but right role", and 
it still issues the query.
-An attribute predicate keys panel visibility directly on the node's 
`node_role` / `node_type`
-attributes:
+The `entity`-kind gate keys panel visibility directly on the selected 
instance's `container_name` /
+`node_type` attributes (meaningful on the Instance scope only):
 
 ```jsonc
-{ "id": "wqueue", "visibleWhen": "#entity.node_role == 'liaison'" }
-{ "id": "cold_tier_note", "visibleWhen": "#entity.node_type == 'cold'" }
+{ "id": "wqueue", "visibleWhen": { "kind": "entity", "attribute": 
"container_name", "op": "eq", "value": "liaison" } }
+{ "id": "cold_tier_note", "visibleWhen": { "kind": "entity", "attribute": 
"node_type", "op": "eq", "value": "cold" } }
 ```
 
 This is the precise, declarative form, and it is the natural way to express 
tier-specific panels (a
-`hot` data node merges constantly; a `cold` node is mostly static).
+`hot` data container merges constantly; a `cold` container is mostly static). 
The landed gate supports
+`exists` and case-insensitive `eq`; tier *sets* need the proposed `in` 
operator — until it lands they
+are expressible as duplicated `eq`-gated widget variants.
 
 Role/tier scoping of the catalog:
 
-| Bucket          | Panels                                                     
            | Predicate                         |
-| --------------- | 
--------------------------------------------------------------------- | 
--------------------------------- |
-| **All roles**   | system resources, disk-by-path, network, Go runtime, node 
uptime      | (always shown)                    |
-| **Liaison**     | gRPC query & errors, non-query ops, write rate, publish 
throughput & latency, wqueue depth | `#entity.node_role == 'liaison'` |
-| **Data**        | storage totals, merge/compaction, inverted index, 
subscribe queue     | `#entity.node_role == 'data'`     |
-| **Data + tier** | tier-specific merge/retention hints                        
           | `#entity.node_type in (hot,warm)` |
+| Bucket          | Panels                                                     
            | Entity gate                        |
+| --------------- | 
--------------------------------------------------------------------- | 
---------------------------------- |
+| **All roles**   | system resources, disk-by-path, network, Go runtime, node 
uptime      | (always shown)                     |
+| **Liaison**     | gRPC query & errors, non-query ops, write rate, publish 
throughput & latency, wqueue depth | `container_name eq liaison` |
+| **Data**        | storage totals, merge/compaction, inverted index, 
subscribe queue, retention | `container_name eq data`     |
+| **Data + tier** | tier-specific merge/retention hints                        
           | `node_type in (hot, warm)` †       |
+| **Lifecycle**   | migration cycles, last-run time + status                   
           | `container_name eq lifecycle`      |
+
+† `in` is the proposed extension of [section 
6](#6-horizon-ui-enhancement-entity-gate-membership-operators);
+until it lands, two `eq`-gated widget variants.
 
 ### 5. Dashboards (Horizon UI BANYANDB layer template)
 
 A net-new layer template `apps/bff/src/bundled_templates/layers/banyandb.json` 
(config-driven JSON, one
-file per layer, per-scope widget arrays, MQE expression strings). The design 
mirrors the upstream two
-boards across the SkyWalking hierarchy:
+file per layer keyed by its `key` field — `BANYANDB`, filename lowercased — 
with per-scope widget
+arrays and MQE expression strings). One menu touchpoint exists: Horizon UI 
currently hard-codes the
+`BANYANDB` layer out of the sidebar (`HIDDEN_LAYERS`);
+[horizon-ui #47](https://github.com/apache/skywalking-horizon-ui/pull/47) 
replaces that with a
+config-driven `layers.excluded` list that un-hides BanyanDB — this SWIP rides 
on that change (or an
+equivalent one-line un-hide). The design mirrors the upstream two boards 
across the SkyWalking
+hierarchy:
 
 ```
 BANYANDB layer
-├─ Root            → cluster list (ServiceList), showGroup=false
+├─ Root            → cluster list (the layer landing's service-list picker: 
header columns + sort)
 ├─ Service (cluster)
 │   └─ Overview KPIs + "Cluster Workload Summary" + "Fleet Overview" capacity
 │       (cluster_write_rate, cluster_query_rate, cluster_error_rate,
 │        reporting_nodes by role, total_cpu/memory/disk)
-├─ Instance (node)   ← the "Nodes" board, made dynamic
+├─ Instance (container)   ← the "Nodes" board, made dynamic; instance = 
pod_name@container_name

Review Comment:
   This dashboard tree still refers to `reporting_nodes`, but the SWIP now 
counts reporting containers/instances by role. Renaming this reference to match 
the metric name/semantics will keep the doc consistent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to