wu-sheng opened a new pull request, #13909:
URL: https://github.com/apache/skywalking/pull/13909

   ### Fix runtime-rule (MAL/LAL hot-update) schema changes in `no-init` mode, and the runtime-rule cluster node-identity collision on Kubernetes
   - [x] Add a unit test to verify that the fix works.
   - [x] Explain briefly why the bug exists and how to fix it.
   
   Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed 
end-to-end on a local kind cluster:
   
   **1. Runtime-rule schema changes were inoperative in `no-init` mode** — the 
mode every production OAP cluster runs (a one-shot `-Dmode=init` Job creates 
the static schema; the OAP Deployment runs `-Dmode=no-init`). A runtime 
`addOrUpdate` introducing a new metric blocked forever in the storage 
installer's init-node poll loop (`ModelInstaller.whenCreating`), because the 
loop was gated on `RunningMode` rather than the operation's intent. 
`/delete?mode=revertToBundled` recreate and BanyanDB in-place shape updates 
were dead the same way. **Fix:** a new 
`StorageManipulationOpt.Flags.deferDDLToInitNode` bit, set only on the static 
boot-time `schemaCreateIfAbsent()` opt (DRYed into 
`ModelInstaller.deferDDLToInitNode(opt)`, reused by the BanyanDB shape-check / 
group-DDL gates). The runtime-rule opts (`withSchemaChange` / 
`verifySchemaOnly` / `withoutSchemaChange`) are now driven by their flags and 
by cluster main-ness — `no-init` and `default` no longer differ for DSL DDL; 
`init` stays the dedicated initializer. `DSLManager.tickStorageOpt` is collapsed 
accordingly.
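
For reviewers, the gating change in bug 1 can be pictured with the minimal sketch below. It is not the code in this patch: `RunningMode`, `Flag`, and both `waitsForInitNode*` methods are hypothetical stand-ins, and the assumption that the new gate still checks for `no-init` mode in addition to the flag is mine, not confirmed by the description above.

```java
import java.util.EnumSet;
import java.util.Set;

public class DeferDdlSketch {
    // Hypothetical stand-ins for the real OAP types.
    enum RunningMode { DEFAULT, INIT, NO_INIT }
    enum Flag { DEFER_DDL_TO_INIT_NODE }

    // Old behaviour (as described): the wait loop was gated on the running
    // mode alone, so every operation in no-init mode waited for an init node.
    static boolean waitsForInitNodeOld(RunningMode mode) {
        return mode == RunningMode.NO_INIT;
    }

    // New behaviour (sketch): the wait is gated on the operation's intent.
    // Only an opt that carries the defer flag waits for the init node.
    static boolean waitsForInitNodeNew(RunningMode mode, Set<Flag> flags) {
        return mode == RunningMode.NO_INIT
            && flags.contains(Flag.DEFER_DDL_TO_INIT_NODE);
    }

    public static void main(String[] args) {
        Set<Flag> staticBootOpt = EnumSet.of(Flag.DEFER_DDL_TO_INIT_NODE);
        Set<Flag> runtimeRuleOpt = EnumSet.noneOf(Flag.class);

        // Static boot-time schema creation still defers to the init Job.
        System.out.println(waitsForInitNodeNew(RunningMode.NO_INIT, staticBootOpt));
        // A runtime-rule schema change no longer blocks in no-init mode.
        System.out.println(waitsForInitNodeNew(RunningMode.NO_INIT, runtimeRuleOpt));
    }
}
```

Under these assumptions, a runtime `addOrUpdate` carries no defer flag and so proceeds to DDL, while the boot-time opt keeps waiting for the init Job.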
   
   **2. Runtime-rule cross-node writes failed with `HTTP 400 forward_self_loop` 
on a multi-replica Kubernetes cluster.** Every OAP replica shared the cluster 
`selfNodeId` `0.0.0.0_11800` (derived from the `0.0.0.0` agent gRPC bind host 
via `TelemetryRelatedContext`), so the main's self-loop guard rejected a 
legitimate peer-to-peer Forward as if it had looped back. **Fix:** resolve the 
runtime-rule node identity from the unique per-pod `SKYWALKING_COLLECTOR_UID` 
(the pod UID injected by the helm chart / swck operator from `metadata.uid`), 
in `start()` before any apply, falling back to the telemetry id off-Kubernetes. 
`MainRouter` already routes correctly off the cluster peer addresses (pod IPs); 
only the self-loop identity needed to be unique.
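
The identity resolution and the self-loop guard in bug 2 can be illustrated with the sketch below. It is an illustration only, not the patch: `resolveNodeId` and `isSelfLoop` are hypothetical names, and the environment is passed in as a `Map` so the example runs without a real pod. Only the `SKYWALKING_COLLECTOR_UID` variable name and the `0.0.0.0_11800` fallback value come from the description above.

```java
import java.util.Map;

public class NodeIdentitySketch {
    // Hypothetical helper: prefer the per-pod UID, otherwise fall back to
    // the telemetry-derived id (for example when running off Kubernetes).
    static String resolveNodeId(Map<String, String> env, String telemetryId) {
        String uid = env.get("SKYWALKING_COLLECTOR_UID");
        if (uid != null && !uid.trim().isEmpty()) {
            return uid.trim();
        }
        return telemetryId;
    }

    // Hypothetical guard: a Forward is a self-loop only when the sender's
    // id equals the receiver's own id.
    static boolean isSelfLoop(String selfNodeId, String senderNodeId) {
        return selfNodeId.equals(senderNodeId);
    }

    public static void main(String[] args) {
        String telemetryId = "0.0.0.0_11800"; // shared by every replica

        // Before the fix: both replicas resolve to the same id, so a
        // legitimate peer Forward looks like a loop.
        System.out.println(isSelfLoop(telemetryId, telemetryId));

        // After the fix: each pod has its own UID (values are made up here).
        String main = resolveNodeId(Map.of("SKYWALKING_COLLECTOR_UID", "uid-a"), telemetryId);
        String peer = resolveNodeId(Map.of("SKYWALKING_COLLECTOR_UID", "uid-b"), telemetryId);
        System.out.println(isSelfLoop(main, peer));
    }
}
```

With distinct UIDs the guard only fires on a genuine loop; without the variable, both nodes fall back to the shared telemetry id, which reproduces the collision.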
   
   **Tests:** new `ModelInstallerNoInitTest` (UT) for the no-init create 
chokepoint; the runtime-rule cluster e2e is converted from docker-compose 
(default mode — which never exercised either bug) to a kind + skywalking-helm 
`no-init` cluster (`oap.replicas=2`) driving the apply / STRUCTURAL / 
inactivate / delete lifecycle, cross-node convergence, and the cross-node 
Forward path.
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Closes #<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking/blob/master/docs/en/changes/changes.md).

