wu-sheng opened a new pull request, #13911: URL: https://github.com/apache/skywalking/pull/13911
### Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics - [x] Add a unit test to verify that the fix works. - [x] Explain briefly why the bug exists and how to fix it. Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo. **1. BanyanDB schema-cache self-heal.** Peer nodes flooded `<metric> is not registered` when a node held a live persist worker but never populated its local `MetadataRegistry` schema cache for that model (a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate; the registry never evicts and the 30s reconcile only covers runtime-rule rows). The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation (`MetadataRegistry.repopulateLocally`) before failing, and the no-init defer poll loop retries a transient backend probe error (`isRetryableNoInitProbeFailure` — default `false`, BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod. **2. v2 MAL `CounterWindow` key collision.** `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after `.sum(...)` therefore shared one window slot and computed rates against each other's values — fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). Fixed by keying on the counter's own metric name. `BanyanDBErrorRateReproTest` reproduces it with the real frozen values (966 → 0 after the fix). **3. v2 MAL Elvis `?:` falsy semantics.** Compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only on `null`, so an empty-string primary kept `""` (a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`). Now single-evaluated through `MalRuntimeHelper.elvis` / `isTruthy`, matching Groovy falsy (null, false, numeric zero, empty string/collection/map/array). `MALElvisFalsyTest` covers empty/null/non-empty/side-effecting primaries. **4. banyandb otel-rules.** `PT15S` → `PT1M` rate window to match the collector scrape / OAP minute-bucket cadence (MAL `rate()` is a two-point `CounterWindow` delta, not PromQL). - [ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #<issue number>. - [x] Update the [`CHANGES` log](https://github.com/apache/skywalking/blob/master/docs/en/changes/changes.md). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
