This is an automated email from the ASF dual-hosted git repository.

wu-sheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking.git


The following commit(s) were added to refs/heads/master by this push:
     new 53baf8e5da Fix runtime-rule (MAL/LAL) hot-update in no-init mode + k8s cluster node identity (#13909)
53baf8e5da is described below

commit 53baf8e5da6c45670da6509eb7fcceed7b082ceb
Author: 吴晟 Wu Sheng <[email protected]>
AuthorDate: Sun Jun 14 08:04:36 2026 +0800

    Fix runtime-rule (MAL/LAL) hot-update in no-init mode + k8s cluster node identity (#13909)
    
    Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:
    
    **1. Runtime-rule schema changes were inoperative in `no-init` mode** — the mode every production OAP cluster runs (a one-shot `-Dmode=init` Job creates the static schema; the OAP Deployment runs `-Dmode=no-init`). A runtime `addOrUpdate` introducing a new metric blocked forever in the storage installer's init-node poll loop (`ModelInstaller.whenCreating`), because the loop was gated on `RunningMode` rather than the operation's intent. `/delete?mode=revertToBundled` recreate and Banya [...]
    
    **2. Runtime-rule cross-node writes failed with `HTTP 400 forward_self_loop` on a multi-replica Kubernetes cluster.** Every OAP replica shared the cluster `selfNodeId` `0.0.0.0_11800` (derived from the `0.0.0.0` agent gRPC bind host via `TelemetryRelatedContext`), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. **Fix:** resolve the runtime-rule node identity from the unique per-pod `SKYWALKING_COLLECTOR_UID` (the pod UID injected by t [...]
    
    **Tests:** new `ModelInstallerNoInitTest` (UT) for the no-init create chokepoint; the runtime-rule cluster e2e is converted from docker-compose (default mode — which never exercised either bug) to a kind + skywalking-helm `no-init` cluster (`oap.replicas=2`) driving the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.
---
 .github/workflows/skywalking.yaml                  |   3 +-
 docs/en/changes/changes.md                         |   2 +
 .../module/RuntimeRuleModuleProvider.java          |  67 ++++++--
 .../receiver/runtimerule/reconcile/DSLManager.java |  83 +++++-----
 .../server/core/storage/model/ModelInstaller.java  |  30 +++-
 .../core/storage/model/StorageManipulationOpt.java |  42 ++++-
 .../storage/model/ModelInstallerNoInitTest.java    | 140 ++++++++++++++++
 .../plugin/banyandb/BanyanDBIndexInstaller.java    |  39 +++--
 .../cases/runtime-rule/cluster/cluster-flow.sh     | 183 ++++++++++++++-------
 .../cases/runtime-rule/cluster/docker-compose.yml  |  93 -----------
 test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml    |  84 +++++++---
 .../cases/runtime-rule/cluster/expected/ok.txt     |   1 -
 .../cluster/expected/ready-replicas.txt            |   1 +
 test/e2e-v2/cases/runtime-rule/cluster/kind.yaml   |  23 +++
 14 files changed, 529 insertions(+), 262 deletions(-)
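
Bug 1 in the commit message comes down to which condition gates the init-node poll loop. The sketch below contrasts the old and new predicates for an opt that reaches the loop. It is a minimal illustration, not the project's code: `DeferGateSketch`, its nested `Opt` enum, and the `oldGate` / `newGate` helpers are hypothetical stand-ins for `ModelInstaller` and `StorageManipulationOpt`, and only the defer flag is modelled.

```java
// Hypothetical stand-in types; only the deferDDLToInitNode flag is modelled.
final class DeferGateSketch {

    enum Opt {
        SCHEMA_CREATE_IF_ABSENT(true),   // static boot-time registration: defers on no-init
        WITH_SCHEMA_CHANGE(false),       // runtime-rule apply on the main: never defers
        VERIFY_SCHEMA_ONLY(false),
        WITHOUT_SCHEMA_CHANGE(false);

        final boolean deferDDLToInitNode;

        Opt(final boolean deferDDLToInitNode) {
            this.deferDDLToInitNode = deferDDLToInitNode;
        }
    }

    // Old gate: the running mode alone decided, whatever the operation's intent.
    static boolean oldGate(final boolean noInitMode) {
        return noInitMode;
    }

    // New gate: wait for the init OAP only when the opt itself asks to defer.
    static boolean newGate(final boolean noInitMode, final Opt opt) {
        return noInitMode && opt.deferDDLToInitNode;
    }

    public static void main(final String[] args) {
        for (final Opt opt : Opt.values()) {
            System.out.println(opt + " on no-init: old gate waits=" + oldGate(true)
                + ", new gate waits=" + newGate(true, opt));
        }
    }
}
```

Under the old gate a `WITH_SCHEMA_CHANGE` apply on a no-init node enters the wait loop; under the new gate only `SCHEMA_CREATE_IF_ABSENT` does.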

diff --git a/.github/workflows/skywalking.yaml b/.github/workflows/skywalking.yaml
index 504dc19a7f..05adfffa9e 100644
--- a/.github/workflows/skywalking.yaml
+++ b/.github/workflows/skywalking.yaml
@@ -399,8 +399,9 @@ jobs:
             env: ES_VERSION=8.18.8
           - name: Runtime Rule LAL Hot-Update
             config: test/e2e-v2/cases/runtime-rule/lal/e2e.yaml
-          - name: Runtime Rule Cluster Convergence
+          - name: Runtime Rule Cluster (kind)
             config: test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
+            runs-on: ubuntu-24.04
           - name: DSL Debug API — MAL
             config: test/e2e-v2/cases/dsl-debugging/mal/e2e.yaml
           - name: DSL Debug API — OAL
diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md
index bffa0e8405..51aa0dfa79 100644
--- a/docs/en/changes/changes.md
+++ b/docs/en/changes/changes.md
@@ -248,6 +248,8 @@
   refcount-tracked and unregistered when the last declaring rule is removed. See
   [runtime-rule-hot-update.md#dynamic-layers](../concepts-and-designs/runtime-rule-hot-update.md)
   for the conflict rules and limitations.
+* Fix: runtime-rule (MAL/LAL hot-update) schema changes now work in `no-init` mode — the deployment mode every production cluster runs. Previously a runtime `addOrUpdate` that introduced a new metric blocked forever in the storage installer's init-node poll loop (`ModelInstaller.whenCreating`) on a `no-init` OAP, because the gate keyed off `RunningMode` rather than the operation's intent; the `/delete?mode=revertToBundled` recreate and BanyanDB in-place shape updates were dead the same w [...]
+* Fix: runtime-rule cross-node writes no longer fail with `HTTP 400 forward_self_loop` on a multi-replica Kubernetes cluster. Every OAP replica shared the cluster `selfNodeId` `0.0.0.0_11800` (derived from the `0.0.0.0` agent gRPC bind host via `TelemetryRelatedContext`), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. The runtime-rule node identity now prefers the unique per-pod `SKYWALKING_COLLECTOR_UID` (the pod UID injected by the he [...]
 * Fix: remove the redundant tags from the `envoy-ai-gateway.yaml` LAL configuration.
 * Add Zipkin Virtual GenAI e2e test. Use `zipkin_json` exporter to avoid protobuf dependency conflict
   between `opentelemetry-exporter-zipkin-proto-http` (protobuf~=3.12) and `opentelemetry-proto` (protobuf>=5.0).
diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java
index f75d770512..ec7197d5de 100644
--- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java
+++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java
@@ -219,6 +219,15 @@ public class RuntimeRuleModuleProvider extends ModuleProvider {
      */
     private static final long SCHEDULER_INITIAL_DELAY_SECONDS = 2L;
 
+    /**
+     * Env var carrying this OAP's unique per-node identity — the Kubernetes pod UID, injected
+     * by the skywalking-helm chart / swck operator from {@code metadata.uid}. Used as the
+     * runtime-rule cluster {@code selfNodeId} when present, because the telemetry-id fallback
+     * (gRPC {@code host_port}) collides across replicas under k8s where the bind host is
+     * {@code 0.0.0.0} (every pod reports {@code 0.0.0.0_11800}).
+     */
+    private static final String COLLECTOR_UID_ENV = "SKYWALKING_COLLECTOR_UID";
+
     private RuntimeRuleModuleConfig moduleConfig;
     private ScheduledExecutorService reconcilerExecutor;
     private DSLManager dslManager;
@@ -272,7 +281,12 @@ public class RuntimeRuleModuleProvider extends ModuleProvider {
         // cluster gRPC bus (default 11800). Privileged admin RPCs stay on the
         // admin-only port (default 17129) so a compromised node on the agent
         // network cannot reach Suspend/Resume/Forward.
-        final String selfNodeId = TelemetryRelatedContext.INSTANCE.getId();
+        // Resolve this node's stable, unique cluster identity HERE in start() — before
+        // notifyAfterCompleted() applies any rule — so the node knows who it is before it
+        // forwards a write to the main or broadcasts Suspend/Resume. Must be unique per
+        // replica: it is the Forward/Suspend/Resume sender id and the key the receiver's
+        // self-loop guard compares against. See resolveSelfNodeId().
+        final String selfNodeId = resolveSelfNodeId();
         final AdminClusterChannelManager adminPeerChannels =
             getManager().find(AdminServerModule.NAME).provider()
                         .getService(AdminClusterChannelManager.class);
@@ -343,24 +357,25 @@ public class RuntimeRuleModuleProvider extends ModuleProvider {
         // applies under {@code withSchemaChange} if this node resolves as main. Backend DDL is
         // idempotent so the re-apply costs nothing.
         try {
-            // atBoot=true so a no-init OAP picks verifySchemaOnly and refuses to
-            // start with a missing or shape-mismatched backend (k8s pod backloop)
+            // atBoot=true so a cluster peer picks verifySchemaOnly and refuses to
+            // start against a missing or shape-mismatched backend (k8s pod backloop)
             // instead of silently registering local workers against schema that
-            // doesn't exist. Init / default-mode OAPs are unaffected — their boot
-            // opt mirrors the standard tick choice for those modes.
+            // doesn't exist; the main picks withSchemaChange and re-creates missing
+            // runtime schema. The choice is by cluster main-ness, not running mode
+            // (see DSLManager.tickStorageOpt); init mode is the lone exception.
             dslManager.tick(true);
             log.info("Runtime rule dslManager: synchronous first tick completed "
                 + "(runtime-only DB rows are now applied locally).");
         } catch (final RuntimeException re) {
-            // Boot pass under verifySchemaOnly re-throws missing/mismatch as a
-            // RuntimeException so module bootstrap aborts. Translate to
-            // ModuleStartException so the OAP exit message points the operator at
-            // the right place.
+            // The boot pass re-throws as a RuntimeException so module bootstrap aborts —
+            // a peer's verifySchemaOnly hitting a missing/mismatched backend, or a main's
+            // withSchemaChange failing to create it. Translate to ModuleStartException so
+            // the OAP exit message points the operator at the right place.
             throw new ModuleStartException(
-                "Runtime rule dslManager boot pass failed under verifySchemaOnly; "
-                    + "the backend schema is missing or diverges from the declared rule. "
-                    + "Bring up the init OAP first or align rule files with the backend, "
-                    + "then restart this node.",
+                "Runtime rule dslManager boot pass failed: backend schema is missing, "
+                    + "diverges from the declared rule, or could not be created. On a peer, "
+                    + "bring up the cluster main (or init OAP) first; on the main, align the "
+                    + "rule files with the backend, then restart this node.",
                 re);
         } catch (final Throwable t) {
             log.warn("Runtime rule dslManager: synchronous first tick failed — "
@@ -393,6 +408,32 @@ public class RuntimeRuleModuleProvider extends ModuleProvider {
             SCHEDULER_INITIAL_DELAY_SECONDS, intervalSeconds);
     }
 
+    /**
+     * Resolve this node's unique, stable runtime-rule cluster identity. Prefers the Kubernetes
+     * pod UID ({@value #COLLECTOR_UID_ENV}, injected by the helm chart / swck operator from
+     * {@code metadata.uid}) because it is unique per replica; falls back to the telemetry id
+     * ({@code host_port}) for non-k8s deployments where each node already has a distinct host.
+     *
+     * <p>Why not the telemetry id directly: under Kubernetes the agent gRPC bind host is
+     * {@code 0.0.0.0}, so every replica's telemetry id is {@code 0.0.0.0_11800} — identical.
+     * That collision makes the receiver's self-loop guard (sender id == own id) reject a
+     * legitimate peer-to-peer Forward as if it had looped back, breaking cross-node writes on
+     * any multi-replica k8s cluster. {@code MainRouter} already routes correctly off the
+     * cluster peer addresses (pod IPs); only the self-identity used for loop suppression needs
+     * to be unique, which the pod UID guarantees.
+     */
+    private String resolveSelfNodeId() {
+        final String collectorUid = System.getenv(COLLECTOR_UID_ENV);
+        if (collectorUid != null && !collectorUid.trim().isEmpty()) {
+            log.info("Runtime rule: selfNodeId from {} (pod UID) = {}", COLLECTOR_UID_ENV, collectorUid);
+            return collectorUid;
+        }
+        final String telemetryId = TelemetryRelatedContext.INSTANCE.getId();
+        log.info("Runtime rule: {} not set; selfNodeId falls back to telemetry id = {} "
+            + "(ensure it is unique per node in a multi-node cluster).", COLLECTOR_UID_ENV, telemetryId);
+        return telemetryId;
+    }
+
     @Override
     public String[] requiredModules() {
         return new String[] {
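
The identity collision fixed in the file above can be reproduced in isolation. The following is a minimal sketch under stated assumptions: `SelfNodeIdSketch`, `resolve`, and `looksLikeSelfLoop` are hypothetical names, the env-var lookup is replaced by a plain parameter so the example runs anywhere, and the guard is reduced to the "sender id == own id" comparison that the Javadoc describes.

```java
// Hypothetical, self-contained model of the node-identity resolution and self-loop guard.
final class SelfNodeIdSketch {

    // Prefer the per-pod UID when present and non-blank; otherwise fall back to the telemetry id.
    static String resolve(final String collectorUid, final String telemetryId) {
        if (collectorUid != null && !collectorUid.trim().isEmpty()) {
            return collectorUid;
        }
        return telemetryId;
    }

    // Simplified guard: a Forward whose sender id equals the receiver's own id is treated as a loop.
    static boolean looksLikeSelfLoop(final String senderId, final String selfId) {
        return senderId.equals(selfId);
    }

    public static void main(final String[] args) {
        final String telemetryId = "0.0.0.0_11800"; // identical on every replica under k8s

        // Without the UID both replicas resolve the same id, so a peer Forward looks like a loop.
        final String oldA = resolve(null, telemetryId);
        final String oldB = resolve(null, telemetryId);
        System.out.println("without UID, rejected as loop: " + looksLikeSelfLoop(oldA, oldB));

        // With distinct (made-up) pod UIDs the guard lets the peer Forward through.
        final String newA = resolve("pod-uid-a", telemetryId);
        final String newB = resolve("pod-uid-b", telemetryId);
        System.out.println("with UID, rejected as loop: " + looksLikeSelfLoop(newA, newB));
    }
}
```

The fallback keeps non-Kubernetes deployments working, provided each node's telemetry id is already distinct.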
diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
index bd46ae69ed..102a05ebbe 100644
--- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
+++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
@@ -229,15 +229,17 @@ public final class DSLManager {
 
     /**
      * Variant invoked once at boot from {@code RuntimeRuleModuleProvider.notifyAfterCompleted}
-     * with {@code atBoot=true}. The boot pass on a no-init OAP picks
+     * with {@code atBoot=true}. The boot pass on a cluster <em>peer</em> picks
      * {@link StorageManipulationOpt#verifySchemaOnly()} so missing or shape-mismatched
      * backend schema fails the bootstrap (k8s pod backloop) instead of silently
-     * proceeding. The scheduled executor calls the no-arg overload so subsequent ticks
-     * stay on the lenient {@code withoutSchemaChange} retry path.
+     * proceeding; the <em>main</em> picks {@link StorageManipulationOpt#withSchemaChange()}
+     * so it re-creates any missing runtime schema. The scheduled executor calls the no-arg
+     * overload so subsequent peer ticks stay on the lenient {@code withoutSchemaChange}
+     * retry path.
      *
-     * <p>Boot semantics are scoped to no-init mode only — init-mode OAPs continue to
-     * pick {@link StorageManipulationOpt#schemaCreateIfAbsent()} (boot creates), and
-     * default-mode OAPs continue to pick by cluster main-ness.
+     * <p>The choice is by cluster main-ness, not running mode — no-init and default behave
+     * identically (see {@link #tickStorageOpt}). Init mode is the one exception: the
+     * dedicated initialiser picks {@link StorageManipulationOpt#schemaCreateIfAbsent()}.
      */
     public void tick(final boolean atBoot) {
         try {
@@ -708,44 +710,38 @@ public final class DSLManager {
     /**
      * Pick the {@link StorageManipulationOpt} for a tick-driven apply.
      *
-     * <p>Two axes:
+     * <p>For runtime-rule (DSL) DDL the only axis that matters is <b>cluster main-ness</b> —
+     * <em>not</em> the init / no-init / default running mode. The running-mode axis governs
+     * <em>static</em> schema (the init OAP creates it, no-init OAPs wait); a runtime rule is
+     * created at runtime and the init OAP never knows about it, so gating DSL DDL on running
+     * mode would leave every production (no-init) cluster unable to apply rules. no-init and
+     * default therefore behave identically here.
      *
-     * <p><b>RunningMode (boot/init context).</b>
+     * <p><b>init mode</b> — the one exception. The dedicated initialiser picks
+     * {@link StorageManipulationOpt#schemaCreateIfAbsent()}, matching the static-rule install
+     * path (create-if-absent, idempotent against a backend that already holds the resource).
+     *
+     * <p><b>Everything else (no-init or default)</b> — branch on main-ness:
      * <ul>
-     *   <li>{@code init} mode — OAP is the dedicated initialiser; install schema if
-     *       absent. {@link StorageManipulationOpt#schemaCreateIfAbsent()} matches what the
-     *       rest of the static-rule install path does in init mode (idempotent against
-     *       backends that already hold the table).
-     *   <li>{@code no-init} mode — this OAP must NOT touch the backend; the init OAP
-     *       owns schema. The opt depends on whether this is the synchronous boot pass
-     *       or a scheduled tick:
+     *   <li>Self is main → {@link StorageManipulationOpt#withSchemaChange()}. The authority
+     *       creates / updates / drops backend schema. The boot pass uses this too, so a main
+     *       re-creates any missing runtime schema at startup.
+     *   <li>Peer (someone else is main):
      *     <ul>
      *       <li><b>Boot pass</b> ({@code atBoot=true}) →
-     *           {@link StorageManipulationOpt#verifySchemaOnly()}. Strict: backend
-     *           resources must already exist with the declared shape. A missing or
-     *           mismatched schema fails the bootstrap (k8s pod backloop) — operator must
-     *           bring up the init OAP first, or align rule files with the backend.
+     *           {@link StorageManipulationOpt#verifySchemaOnly()}. Strict: refuse to start
+     *           against a backend the main hasn't prepared (k8s pod backloop until the main
+     *           converges).
      *       <li><b>Scheduled tick</b> ({@code atBoot=false}) →
      *           {@link StorageManipulationOpt#withoutSchemaChange()}. Lenient: the timer
-     *           retries forever without raising errors so transient absence (init OAP
-     *           still catching up between ticks) self-heals.
+     *           retries without raising so transient absence (main still catching up between
+     *           ticks) self-heals.
      *     </ul>
-     *   <li>default mode (regular running OAP) — branch on cluster main-ness, see below.
-     * </ul>
-     *
-     * <p><b>Cluster main-ness (default mode only).</b>
-     * <ul>
-     *   <li>Self is main → {@link StorageManipulationOpt#withSchemaChange()}. The REST path
-     *       has the same shape; tick rarely runs on main because REST usually
-     *       converges the main's state first.
-     *   <li>Peer (someone else is main) → {@link StorageManipulationOpt#withoutSchemaChange()}.
-     *       Local MeterSystem + MetadataRegistry populate so the peer dispatches samples
-     *       correctly, but no server-side DDL fires.
      * </ul>
      *
-     * <p>When the cluster module isn't wired (embedded test topology), {@link
-     * MainRouter#isSelfMain} returns {@code true} and the default-mode branch falls
-     * through to {@code withSchemaChange} — single-process deployments are always main.
+     * <p>When the cluster module isn't wired (embedded / single-process topology),
+     * {@link MainRouter#isSelfMain} returns {@code true} so we fall through to
+     * {@code withSchemaChange} — a single process is always its own main.
      *
      * @param atBoot true for the synchronous one-shot pass invoked from
      *               {@code RuntimeRuleModuleProvider.notifyAfterCompleted}; false for
@@ -755,20 +751,21 @@ public final class DSLManager {
         if (RunningMode.isInitMode()) {
             return StorageManipulationOpt.schemaCreateIfAbsent();
         }
-        if (RunningMode.isNoInitMode()) {
-            return atBoot
-                ? StorageManipulationOpt.verifySchemaOnly()
-                : StorageManipulationOpt.withoutSchemaChange();
-        }
+        final boolean selfMain;
         try {
             final AdminClusterChannelManager apm =
                 moduleManager.find(AdminServerModule.NAME).provider()
                              .getService(AdminClusterChannelManager.class);
-            return MainRouter.isSelfMain(apm)
-                ? StorageManipulationOpt.withSchemaChange()
-                : StorageManipulationOpt.withoutSchemaChange();
+            selfMain = MainRouter.isSelfMain(apm);
         } catch (final Throwable t) {
+            // Cluster module not wired (embedded / single-process) — always main.
+            return StorageManipulationOpt.withSchemaChange();
+        }
+        if (selfMain) {
             return StorageManipulationOpt.withSchemaChange();
         }
+        return atBoot
+            ? StorageManipulationOpt.verifySchemaOnly()
+            : StorageManipulationOpt.withoutSchemaChange();
     }
 }
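
The post-change selection in `tickStorageOpt` reduces to a small decision table. The sketch below restates it with plain booleans in place of `RunningMode` and `MainRouter`; `TickOptSketch`, its `Opt` enum, and `pick` are hypothetical names introduced for illustration, and the "cluster module not wired" case is assumed to be folded into `selfMain = true`.

```java
// Hypothetical restatement of the new tick-opt decision; not the project's class.
final class TickOptSketch {

    enum Opt { SCHEMA_CREATE_IF_ABSENT, WITH_SCHEMA_CHANGE, VERIFY_SCHEMA_ONLY, WITHOUT_SCHEMA_CHANGE }

    static Opt pick(final boolean initMode, final boolean selfMain, final boolean atBoot) {
        if (initMode) {
            // The dedicated initialiser is the one running-mode exception.
            return Opt.SCHEMA_CREATE_IF_ABSENT;
        }
        if (selfMain) {
            // The main owns runtime-rule DDL, at boot and on scheduled ticks alike.
            return Opt.WITH_SCHEMA_CHANGE;
        }
        // A peer is strict at boot and lenient on later ticks.
        return atBoot ? Opt.VERIFY_SCHEMA_ONLY : Opt.WITHOUT_SCHEMA_CHANGE;
    }

    public static void main(final String[] args) {
        System.out.println("init:            " + pick(true, false, true));
        System.out.println("main, boot:      " + pick(false, true, true));
        System.out.println("peer, boot:      " + pick(false, false, true));
        System.out.println("peer, scheduled: " + pick(false, false, false));
    }
}
```

Note that no-init versus default mode does not appear as an input: after this commit the two behave identically on this path.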
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java
index 735e7b379d..d65d7fb75b 100644
--- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java
+++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java
@@ -89,10 +89,15 @@ public abstract class ModelInstaller implements ModelRegistry.CreatingListener,
             return;
         }
 
-        // Legacy poll loop for non-init OAPs that did not opt into the strict verify
-        // mode. Static models (boot-time) still take this path; runtime-rule reconciler
-        // explicitly chooses verify so this loop is bypassed.
-        if (RunningMode.isNoInitMode()) {
+        // Poll loop for the STATIC boot-time path on a non-init OAP: the init OAP owns
+        // schema creation, so this node waits until the resource appears rather than
+        // creating it. Gated on deferDDLToInitNode (set only on SCHEMA_CREATE_IF_ABSENT),
+        // NOT on RunningMode alone — a runtime-rule DSL apply (withSchemaChange) is the
+        // operator/main-driven authority and must fall through to createTable below
+        // regardless of no-init, because no init OAP knows about a metric created at
+        // runtime. Without this, a no-init OAP would block here forever waiting for a
+        // resource that only this very apply would ever create.
+        if (deferDDLToInitNode(opt)) {
             while (true) {
                 InstallInfo info = isExists(model, opt);
                 if (!info.isAllExist()) {
@@ -148,6 +153,23 @@ public abstract class ModelInstaller implements ModelRegistry.CreatingListener,
             StorageManipulationOpt.Outcome.DROPPED, null);
     }
 
+    /**
+     * True when this manipulation must defer all backend DDL to the dedicated init OAP and
+     * wait for it, rather than create / update / reshape the resource on this node. This is
+     * the single source of truth for the "no-init OAP doesn't own schema" rule across the
+     * base installer and every backend subclass — call it instead of re-checking
+     * {@link RunningMode#isNoInitMode()} inline, so the rule stays one decision.
+     *
+     * <p>True only for the static boot-time {@link StorageManipulationOpt#schemaCreateIfAbsent()}
+     * opt on a {@code no-init} OAP. The runtime-rule (DSL) opts leave
+     * {@link StorageManipulationOpt.Flags#isDeferDDLToInitNode() deferDDLToInitNode} unset, so
+     * an operator-driven apply is governed by the opt's own create / update / drop flags and
+     * by cluster main-ness — never by the init / no-init / default running mode.
+     */
+    protected static boolean deferDDLToInitNode(final StorageManipulationOpt opt) {
+        return RunningMode.isNoInitMode() && opt.getFlags().isDeferDDLToInitNode();
+    }
+
     public void start() {
     }
 
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java
index 3fe2e66eab..9b6d8cb04a 100644
--- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java
+++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java
@@ -73,10 +73,11 @@ import lombok.Getter;
  * <h3>{@link #verifySchemaOnly()} — {@link Mode#VERIFY_SCHEMA_ONLY} (predicate: {@link #isVerifySchemaOnly()})</h3>
  * <p>Callers:
  * <ul>
- *   <li>Boot-time reconciler pass on a non-init OAP — the operator declared
- *       {@code init=false}, so this OAP must not perform DDL but must refuse to start if
- *       the backend isn't already in the shape the persisted runtime-rule catalog
- *       declares.</li>
+ *   <li>Boot-time runtime-rule reconciler pass on a cluster <em>peer</em> (a node that is
+ *       not the hash-selected main for the file) — the main owns DDL, so this node must not
+ *       perform it but must refuse to start if the backend isn't already in the shape the
+ *       persisted runtime-rule catalog declares. Chosen by main-ness, not running mode, so a
+ *       peer behaves the same in no-init and default mode.</li>
  * </ul>
  * <p>Backend behaviour: read-only inspection. The installer issues the same metadata
  * read RPCs as {@link Mode#SCHEMA_CREATE_IF_ABSENT} but never invokes create / update / drop. On
@@ -137,15 +138,22 @@ public final class StorageManipulationOpt {
             .escalateToCaller(true)
             .build()),
         /**
-         * Static boot path on an init-mode OAP. Installer creates absent resources, but
-         * if a resource already exists with a shape that diverges from the declared
-         * model it records {@link Outcome#SKIPPED_SHAPE_MISMATCH} and does <strong>not</strong>
-         * call update / reshape. Operator must reconcile via the runtime-rule REST
-         * endpoint — boot is not allowed to silently mutate backend shape.
+         * Static boot-time model registration, run by every OAP. On an init / standalone
+         * OAP the installer creates absent resources, but if a resource already exists with
+         * a shape that diverges from the declared model it records
+         * {@link Outcome#SKIPPED_SHAPE_MISMATCH} and does <strong>not</strong> call
+         * update / reshape. Operator must reconcile via the runtime-rule REST endpoint —
+         * boot is not allowed to silently mutate backend shape.
+         *
+         * <p>This is the only mode that sets {@code deferDDLToInitNode}: on a {@code no-init}
+         * OAP the installer defers to the init OAP (waits in the
+         * {@link ModelInstaller#whenCreating} poll loop) rather than creating the resource
+         * itself. The runtime-rule (DSL) modes never defer.
          */
         SCHEMA_CREATE_IF_ABSENT(Flags.builder()
             .inspectBackend(true)
             .createMissing(true)
+            .deferDDLToInitNode(true)
             .build()),
         /**
          * Boot path on a non-init OAP. Installer issues the same read-only inspection
@@ -247,6 +255,22 @@ public final class StorageManipulationOpt {
          * the node.
          */
         private final boolean escalateToCaller;
+        /**
+         * On a {@code no-init} OAP, defer all backend DDL to the dedicated init OAP and wait
+         * (poll loop in {@link ModelInstaller#whenCreating}) rather than create / update the
+         * resource here. Set ONLY on {@link Mode#SCHEMA_CREATE_IF_ABSENT} — the static
+         * boot-time model registration that every OAP runs. The init / no-init / default
+         * running-mode axis governs <strong>static</strong> schema only.
+         *
+         * <p>The runtime-rule (DSL) opts — {@link Mode#WITH_SCHEMA_CHANGE},
+         * {@link Mode#VERIFY_SCHEMA_ONLY}, {@link Mode#WITHOUT_SCHEMA_CHANGE} — leave this
+         * {@code false}, so an operator-driven runtime apply is driven by the other flags and
+         * by cluster main-ness, never by {@code RunningMode}. Without this distinction a
+         * no-init OAP (every production cluster node) would route a runtime {@code withSchemaChange}
+         * create into the init-node poll loop and block forever, because no init OAP knows
+         * about a metric that was created at runtime.
+         */
+        private final boolean deferDDLToInitNode;
     }
 
     @Getter
diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java
new file mode 100644
index 0000000000..d9cda58cd7
--- /dev/null
+++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+
+package org.apache.skywalking.oap.server.core.storage.model;
+
+import java.time.Duration;
+import org.apache.skywalking.oap.server.core.RunningMode;
+import org.apache.skywalking.oap.server.core.storage.StorageException;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTimeoutPreemptively;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Regression guard for the runtime-rule (DSL) schema-change path on a {@code no-init} OAP —
+ * every production cluster node runs no-init. The base {@link ModelInstaller#whenCreating}
+ * poll loop must defer to the init OAP only for the static boot-time opt
+ * ({@link StorageManipulationOpt#schemaCreateIfAbsent()}); a runtime-rule
+ * {@link StorageManipulationOpt#withSchemaChange()} apply must fall through to
+ * {@code createTable} and create the resource itself, because no init OAP knows about a
+ * metric created at runtime. Before the {@code deferDDLToInitNode} flag, a no-init OAP
+ * routed the runtime create into the poll loop and blocked forever.
+ */
+class ModelInstallerNoInitTest {
+
+    @AfterEach
+    void resetRunningMode() {
+        // RunningMode is a process-wide static; setMode("") is a no-op, so reset to a
+        // neutral non-init/non-no-init value to avoid leaking no-init into other tests.
+        RunningMode.setMode("default");
+    }
+
+    @Test
+    void deferFlagSetOnlyOnStaticBootOpt() {
+        assertTrue(StorageManipulationOpt.schemaCreateIfAbsent().getFlags().isDeferDDLToInitNode(),
+            "static boot opt must defer DDL to the init node");
+        assertFalse(StorageManipulationOpt.withSchemaChange().getFlags().isDeferDDLToInitNode(),
+            "runtime-rule withSchemaChange must NOT defer — it is the DDL authority");
+        assertFalse(StorageManipulationOpt.verifySchemaOnly().getFlags().isDeferDDLToInitNode());
+        assertFalse(StorageManipulationOpt.withoutSchemaChange().getFlags().isDeferDDLToInitNode());
+    }
+
+    @Test
+    void noInitMainCreatesNewMetricUnderWithSchemaChange() {
+        RunningMode.setMode("no-init");
+        final RecordingInstaller installer = new RecordingInstaller(false /* resource absent */);
+        final Model model = mock(Model.class);
+        when(model.getName()).thenReturn("runtime_metric");
+
+        // Must return (not spin in the no-init poll loop) and must create the resource. The
+        // preemptive timeout turns a regression — the historical infinite wait — into a fast
+        // failure instead of a hung build.
+        assertTimeoutPreemptively(Duration.ofSeconds(10), () ->
+            installer.whenCreating(model, StorageManipulationOpt.withSchemaChange()));
+        assertEquals(1, installer.createTableCalls,
+            "runtime withSchemaChange on a no-init OAP must create the new resource");
+    }
+
+    @Test
+    void noInitStaticBootDefersToInitNode() throws StorageException {
+        RunningMode.setMode("no-init");
+        // Resource already present so the defer poll loop breaks on its first probe instead
+        // of waiting forever — lets the test assert the defer path without hanging.
+        final RecordingInstaller installer = new RecordingInstaller(true /* resource present */);
+        final Model model = mock(Model.class);
+        when(model.getName()).thenReturn("static_metric");
+
+        installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent());
+        assertEquals(0, installer.createTableCalls,
+            "static boot on a no-init OAP must defer to the init node, never create");
+    }
+
+    @Test
+    void withSchemaChangeSkipsCreateWhenResourceAlreadyExists() throws StorageException {
+        RunningMode.setMode("no-init");
+        final RecordingInstaller installer = new RecordingInstaller(true /* resource present */);
+        final Model model = mock(Model.class);
+        when(model.getName()).thenReturn("existing_metric");
+
+        installer.whenCreating(model, StorageManipulationOpt.withSchemaChange());
+        assertEquals(0, installer.createTableCalls,
+            "withSchemaChange must not re-create a resource that already exists");
+    }
+
+    /** Minimal concrete {@link ModelInstaller} that records createTable calls and reports a
+     *  fixed existence result, so the base whenCreating branching can be exercised without a
+     *  real storage backend. */
+    private static final class RecordingInstaller extends ModelInstaller {
+        private final boolean resourcePresent;
+        private int createTableCalls;
+
+        private RecordingInstaller(final boolean resourcePresent) {
+            super(null, null);
+            this.resourcePresent = resourcePresent;
+        }
+
+        @Override
+        public InstallInfo isExists(final Model model, final StorageManipulationOpt opt) {
+            final TestInstallInfo info = new TestInstallInfo(model);
+            info.setAllExist(resourcePresent);
+            return info;
+        }
+
+        @Override
+        public void createTable(final Model model) {
+            createTableCalls++;
+        }
+    }
+
+    private static final class TestInstallInfo extends ModelInstaller.InstallInfo {
+        private TestInstallInfo(final Model model) {
+            super(model);
+        }
+
+        @Override
+        public String buildInstallInfoMsg() {
+            return "test";
+        }
+    }
+}
diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
index 47ceac8df3..cabfb75276 100644
--- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
+++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
@@ -144,11 +144,13 @@ public class BanyanDBIndexInstaller extends ModelInstaller {
                 installInfo.setAllExist(false);
                 return installInfo;
             } else {
-                // Run shape-compat checks unless we're in the legacy no-init poll loop
-                // path. failOnAbsence implies the caller wants strict verification even
-                // in non-init mode (VERIFY_SCHEMA_ONLY), so honour that instead of just
-                // gating on RunningMode.
-                final boolean runShapeChecks = !RunningMode.isNoInitMode() || opt.getFlags().isFailOnAbsence();
+                // Run shape-compat checks — and the updates they drive for withSchemaChange —
+                // unless this is the static boot-time path deferring to the init OAP. The
+                // runtime-rule DSL opts (withSchemaChange / verifySchemaOnly) are never
+                // deferred, so an operator-driven shape UPDATE reconciles on a no-init OAP
+                // exactly as on a default / standalone one. (verifySchemaOnly still runs the
+                // checks but records SKIPPED_SHAPE_MISMATCH instead of writing.)
+                final boolean runShapeChecks = !deferDDLToInitNode(opt);
                 if (model.isTimeSeries()) {
                     // register models only locally(Schema cache) but not remotely
                     if (model.isRecord()) {
@@ -637,10 +639,15 @@ public class BanyanDBIndexInstaller extends ModelInstaller {
             optsBuilder.addAllDefaultStages(metadata.getResource().getDefaultQueryStages());
         }
         gBuilder.setResourceOpts(optsBuilder.build());
-        if (!RunningMode.isNoInitMode()) {
-            if (!groupAligned.contains(metadata.getGroup())) {
+        // Group DDL follows the opt, not RunningMode: a runtime-rule withSchemaChange
+        // creates / updates the group on whatever node reaches here (peers short-circuit
+        // earlier via inspectBackend=false), while the static boot path defers to the init
+        // OAP on no-init. Create is gated on createMissing and update on !failOnShapeMismatch
+        // so verifySchemaOnly stays read-only even though it is not deferred.
+        if (!deferDDLToInitNode(opt) && !groupAligned.contains(metadata.getGroup())) {
+            if (!resourceExist.isHasGroup()) {
                 // create the group if not exist
-                if (!resourceExist.isHasGroup()) {
+                if (opt.getFlags().isCreateMissing()) {
                     try {
                         Group g = client.define(gBuilder.build());
                         if (g != null) {
@@ -653,16 +660,16 @@ public class BanyanDBIndexInstaller extends ModelInstaller {
                             throw ex;
                         }
                     }
-                } else {
-                    // update the group if necessary
-                    if (this.checkGroup(metadata, client)) {
-                        opt.recordModRevision(client.update(gBuilder.build()));
-                        log.info("group {} updated", metadata.getGroup());
-                    }
                 }
-                // mark the group as aligned
-                groupAligned.add(metadata.getGroup());
+            } else {
+                // update the group if necessary
+                if (!opt.getFlags().isFailOnShapeMismatch() && this.checkGroup(metadata, client)) {
+                    opt.recordModRevision(client.update(gBuilder.build()));
+                    log.info("group {} updated", metadata.getGroup());
+                }
             }
+            // mark the group as aligned
+            groupAligned.add(metadata.getGroup());
         }
         return resourceExist;
     }
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh
index ac371559cb..0740a9947c 100755
--- a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh
+++ b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh
@@ -15,17 +15,26 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# Drives a runtime-rule apply on OAP-1 and asserts OAP-2 converges on the same
-# (catalog, name, contentHash) within the reconciler tick window. Run from the
-# repo root.
+# Runtime-rule lifecycle + cross-node convergence on a Kubernetes (kind) cluster
+# deployed in NO-INIT mode — the topology every production SkyWalking cluster runs
+# (a one-shot `-Dmode=init` Job creates static schema, the OAP Deployment runs
+# `-Dmode=no-init`). This is the deployment that exercises the runtime-rule
+# schema-change path on a no-init node: applying a NEW MAL rule must drive the
+# backend DDL (create the BanyanDB measure) on the cluster main even though it is a
+# no-init OAP — the init Job never knew about a metric created at runtime, so the
+# main is the only node that can create it.
 #
-# Coverage:
-#   1. Apply seed-rule on OAP-1 → ACTIVE
-#   2. Wait for OAP-2 to see the rule via /list (one tick = ~30 s default)
-#   3. STRUCTURAL update on OAP-1 → re-converge on OAP-2 (different content hash)
+# Coverage (drive on OAP-1, observe convergence on OAP-2 within a reconciler tick):
+#   1. Apply seed-rule on OAP-1 → ACTIVE (NEW: first-time measure creation on no-init)
+#   2. OAP-2 converges on the same (status, contentHash)
+#   3. STRUCTURAL update on OAP-1 → re-converge on OAP-2 (new metric, new measure)
 #   4. Inactivate on OAP-1 → INACTIVE on OAP-2
 #   5. Delete on OAP-1 → row gone on OAP-2
 #
+# The pre-fix bug: on a no-init OAP the apply blocked forever in the storage
+# installer's init-node poll loop and never created the measure, so step 1 never
+# reached ACTIVE. Reaching ACTIVE here is the end-to-end regression assertion.
+#
 # Failures route to stderr so the e2e harness's stdout capture stays clean.
 
 set -euo pipefail
@@ -33,26 +42,66 @@ set -euo pipefail
 log() { echo "[cluster-flow] $*" >&2; }
 fail() { log "FAIL: $*"; exit 1; }
 
+NS="${SW_NAMESPACE:-skywalking}"
+# Pod-template labels set by the skywalking-helm OAP Deployment (release name = skywalking).
+OAP_SELECTOR="${OAP_SELECTOR:-app=skywalking,component=oap,release=skywalking}"
 OAP1_PORT="${OAP1_PORT:-17128}"
 OAP2_PORT="${OAP2_PORT:-17129}"
 OAP1_BASE="http://127.0.0.1:${OAP1_PORT}"
 OAP2_BASE="http://127.0.0.1:${OAP2_PORT}"
+# Admin REST port inside each OAP container (SW_ADMIN_SERVER=default).
+ADMIN_CONTAINER_PORT="${ADMIN_CONTAINER_PORT:-17128}"
+
 SEED_DIR="${SEED_DIR:-$(pwd)/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules}"
 SEED_NEW="${SEED_DIR}/seed-rule.yaml"
 SEED_STRUCT="${SEED_DIR}/seed-rule-structural.yaml"
 CATALOG="otel-rules"
 NAME="cluster_rr"
 
-# Two ticks worth — default reconciler interval is 30 s; allow a generous 90 s for
-# convergence on a busy CI host.
-CONVERGE_TIMEOUT_S="${CONVERGE_TIMEOUT_S:-90}"
+# Generous on a kind host: two reconciler ticks (default 30 s) + BanyanDB schema
+# propagation + RPC jitter.
+CONVERGE_TIMEOUT_S="${CONVERGE_TIMEOUT_S:-120}"
 
 [ -f "${SEED_NEW}" ] || fail "seed-rule.yaml missing at ${SEED_NEW}"
+[ -f "${SEED_STRUCT}" ] || fail "seed-rule-structural.yaml missing at ${SEED_STRUCT}"
+
+# --- Discover the two OAP pods and port-forward each node's admin REST -------------
+# The OAP Deployment runs >= 2 replicas behind one Service; the Service load-balances,
+# so addressing individual nodes (to assert cross-node convergence) needs per-pod
+# forwards rather than a single Service forward.
+log "waiting for >= 2 ready OAP pods in ns/${NS} (selector: ${OAP_SELECTOR})"
+deadline=$(( $(date +%s) + 300 ))
+PODS=()
+while true; do
+    # Only Ready pods — a no-init OAP keeps port 12800 closed (and stays NotReady)
+    # until the init Job has created the static schema. Read into an array without
+    # mapfile/readarray so the script runs under macOS bash 3.2 as well as CI bash 4+.
+    PODS=()
+    while IFS= read -r _pod; do
+        [ -n "${_pod}" ] && PODS+=("${_pod}")
+    done < <(kubectl -n "${NS}" get pods -l "${OAP_SELECTOR}" \
+        -o jsonpath='{range .items[*]}{range @.status.conditions[?(@.type=="Ready")]}{@.status}{end} {.metadata.name}{"\n"}{end}' \
+        2>/dev/null | awk '$1=="True"{print $2}')
+    if [ "${#PODS[@]}" -ge 2 ]; then
+        break
+    fi
+    if [ "$(date +%s)" -ge "${deadline}" ]; then
+        kubectl -n "${NS}" get pods -l "${OAP_SELECTOR}" >&2 || true
+        fail "fewer than 2 ready OAP pods after 300s (got ${#PODS[@]})"
+    fi
+    sleep 5
+done
+POD1="${PODS[0]}"
+POD2="${PODS[1]}"
+log "OAP pods: OAP-1=${POD1} OAP-2=${POD2}"
 
-# All runtime-rule REST calls go through swctl's `admin` command tree instead of
-# raw curl. This flow drives two OAP nodes, so the admin host (`--admin-url`) is
-# passed per call as the first argument. `--display json` keeps the body shape
-# identical to the old curl output, so the jq assertions are unchanged.
+kubectl -n "${NS}" port-forward "pod/${POD1}" "${OAP1_PORT}:${ADMIN_CONTAINER_PORT}" >/dev/null 2>&1 &
+PF1=$!
+kubectl -n "${NS}" port-forward "pod/${POD2}" "${OAP2_PORT}:${ADMIN_CONTAINER_PORT}" >/dev/null 2>&1 &
+PF2=$!
+trap 'kill "${PF1}" "${PF2}" 2>/dev/null || true' EXIT
+
+# --- swctl admin helpers (per-node --admin-url) ------------------------------------
 admin() { local base="$1"; shift; swctl --display json --admin-url="${base}" admin "$@"; }
 
 list_row() {
@@ -64,26 +113,17 @@ list_row() {
         | head -1
 }
 
-list_status() {
-    local base="$1"
-    list_row "${base}" | jq -r '.status // empty'
-}
-
-list_hash() {
-    local base="$1"
-    list_row "${base}" | jq -r '.contentHash // empty'
-}
+list_status() { list_row "$1" | jq -r '.status // empty'; }
+list_hash() { list_row "$1" | jq -r '.contentHash // empty'; }
+list_apply_error() { list_row "$1" | jq -r '.lastApplyError // empty'; }
 
 await_status() {
     local base="$1" expected="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
     while true; do
-        local got
-        got="$(list_status "${base}")"
-        if [ "${got}" = "${expected}" ]; then
-            return 0
-        fi
+        local got; got="$(list_status "${base}")"
+        [ "${got}" = "${expected}" ] && return 0
         if [ "$(date +%s)" -ge "${deadline}" ]; then
-            fail "${base} did not reach status='${expected}' within ${CONVERGE_TIMEOUT_S}s (last='${got}')"
+            fail "${base} did not reach status='${expected}' within ${CONVERGE_TIMEOUT_S}s (last='${got}', applyError='$(list_apply_error "${base}")')"
         fi
         sleep 2
     done
@@ -92,11 +132,8 @@ await_status() {
 await_hash() {
     local base="$1" expected_hash="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
     while true; do
-        local got
-        got="$(list_hash "${base}")"
-        if [ "${got}" = "${expected_hash}" ]; then
-            return 0
-        fi
+        local got; got="$(list_hash "${base}")"
+        [ "${got}" = "${expected_hash}" ] && return 0
         if [ "$(date +%s)" -ge "${deadline}" ]; then
             fail "${base} did not converge to contentHash='${expected_hash:0:8}…' within ${CONVERGE_TIMEOUT_S}s (last='${got:0:8}…')"
         fi
@@ -107,9 +144,7 @@ await_hash() {
 await_absent() {
     local base="$1" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
     while true; do
-        if [ -z "$(list_row "${base}")" ]; then
-            return 0
-        fi
+        [ -z "$(list_row "${base}")" ] && return 0
         if [ "$(date +%s)" -ge "${deadline}" ]; then
             fail "${base} did not drop row within ${CONVERGE_TIMEOUT_S}s"
         fi
@@ -117,43 +152,46 @@ await_absent() {
     done
 }
 
+assert_no_apply_error() {
+    local base="$1" err; err="$(list_apply_error "${base}")"
+    [ -z "${err}" ] || fail "${base} reports lastApplyError='${err}' (no-init schema change failed)"
+}
+
 apply_on() {
     local base="$1" body="$2" extra="${3:-}"
     local -a flags=(--catalog "${CATALOG}" --name "${NAME}" -f "${body}")
     [[ "${extra}" == *allowStorageChange=true* ]] && flags+=(--allow-storage-change)
-    local resp; resp="$(admin "${base}" runtime-rule add "${flags[@]}")" \
-        || fail "addOrUpdate against ${base} failed"
-    echo "${resp}"
+    admin "${base}" runtime-rule add "${flags[@]}" || fail "addOrUpdate against ${base} failed"
 }
 
-# --- Wait for both OAPs to come up -------------------------------------------------
-log "waiting for OAP-1 (${OAP1_BASE})"
-deadline=$(( $(date +%s) + 120 ))
-until admin "${OAP1_BASE}" runtime-rule list >/dev/null 2>&1; do
-    if [ "$(date +%s)" -ge "${deadline}" ]; then fail "OAP-1 not ready after 120s"; fi
-    sleep 2
-done
-log "waiting for OAP-2 (${OAP2_BASE})"
-deadline=$(( $(date +%s) + 120 ))
-until admin "${OAP2_BASE}" runtime-rule list >/dev/null 2>&1; do
-    if [ "$(date +%s)" -ge "${deadline}" ]; then fail "OAP-2 not ready after 120s"; fi
-    sleep 2
+# --- Wait for both OAPs' admin REST to answer through the forwards -----------------
+for pair in "OAP-1 ${OAP1_BASE}" "OAP-2 ${OAP2_BASE}"; do
+    set -- ${pair}; label="$1"; base="$2"
+    log "waiting for ${label} admin REST (${base})"
+    deadline=$(( $(date +%s) + 120 ))
+    until admin "${base}" runtime-rule list >/dev/null 2>&1; do
+        if [ "$(date +%s)" -ge "${deadline}" ]; then fail "${label} admin not ready after 120s"; fi
+        sleep 2
+    done
 done
-log "both OAPs ready"
+log "both OAP admin endpoints ready"
 
-# --- Phase 1: apply on OAP-1, observe convergence on OAP-2 -------------------------
-log "=== Phase 1: apply (NEW) on OAP-1 ==="
+# --- Phase 1: apply NEW on OAP-1 — first-time measure creation on a no-init node ----
+log "=== Phase 1: apply (NEW) on OAP-1 — exercises no-init schema creation ==="
 apply_on "${OAP1_BASE}" "${SEED_NEW}" >/dev/null
 await_status "${OAP1_BASE}" "ACTIVE"
+assert_no_apply_error "${OAP1_BASE}"
 hash_initial="$(list_hash "${OAP1_BASE}")"
-log "OAP-1 → ACTIVE @ ${hash_initial:0:8}…"
+log "OAP-1 → ACTIVE @ ${hash_initial:0:8}… (measure created on a no-init OAP)"
 await_status "${OAP2_BASE}" "ACTIVE"
 await_hash "${OAP2_BASE}" "${hash_initial}"
 log "OAP-2 converged to ${hash_initial:0:8}…"
 
-# --- Phase 2: STRUCTURAL update on OAP-1, observe new hash on OAP-2 ----------------
+# --- Phase 2: STRUCTURAL update on OAP-1 — second measure created on no-init --------
 log "=== Phase 2: STRUCTURAL on OAP-1 ==="
 apply_on "${OAP1_BASE}" "${SEED_STRUCT}" "allowStorageChange=true" >/dev/null
+await_status "${OAP1_BASE}" "ACTIVE"
+assert_no_apply_error "${OAP1_BASE}"
 hash_struct="$(list_hash "${OAP1_BASE}")"
 [ "${hash_struct}" != "${hash_initial}" ] || fail "OAP-1 contentHash unchanged after STRUCTURAL apply"
 log "OAP-1 → ACTIVE @ ${hash_struct:0:8}… (was ${hash_initial:0:8}…)"
@@ -178,4 +216,33 @@ log "OAP-1 → row gone"
 await_absent "${OAP2_BASE}"
 log "OAP-2 converged: row gone"
 
-log "=== ALL CLUSTER PHASES PASSED ==="
+# --- Phase 5: forward-path coverage — drive a write on OAP-2 -----------------------
+# Phases 1-4 drove OAP-1; whether that exercised the cross-node Forward depends on which
+# node the hash-router picked as main. Driving a write on OAP-2 as well guarantees the
+# Forward path is exercised regardless: whichever of OAP-1 / OAP-2 is NOT the main forwards
+# the write to the main. This is the path that regressed on Kubernetes — every replica
+# shared selfNodeId=0.0.0.0_11800 (the 0.0.0.0 gRPC bind host), so the main's self-loop
+# guard rejected a legitimate forward as HTTP 400 forward_self_loop. With a unique per-pod
+# id the forward completes; a failure here (esp. forward_self_loop) re-opens that bug.
+NAME_B="cluster_rr_fwd"
+log "=== Phase 5: apply on OAP-2 (guarantees cross-node Forward coverage) ==="
+admin "${OAP2_BASE}" runtime-rule add --catalog "${CATALOG}" --name "${NAME_B}" -f "${SEED_NEW}" >/dev/null \
+    || fail "addOrUpdate on OAP-2 failed — cross-node Forward broken (e.g. forward_self_loop)?"
+b_deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
+while true; do
+    b_status="$(admin "${OAP2_BASE}" runtime-rule list 2>/dev/null \
+        | jq -r '.rules[] | select(.catalog=="'"${CATALOG}"'" and .name=="'"${NAME_B}"'") | .status' | head -1)"
+    [ "${b_status}" = "ACTIVE" ] && break
+    [ "$(date +%s)" -ge "${b_deadline}" ] && fail "OAP-2 write did not reach ACTIVE within ${CONVERGE_TIMEOUT_S}s (last='${b_status}')"
+    sleep 2
+done
+log "OAP-2 write → ACTIVE (cross-node Forward path OK)"
+# Cleanup also forwards from OAP-2: inactivate (soft-pause) is required before delete,
+# so this exercises the Forward path for the inactivate + delete operations too.
+admin "${OAP2_BASE}" runtime-rule inactivate --catalog "${CATALOG}" --name "${NAME_B}" >/dev/null \
+    || fail "inactivate of ${NAME_B} on OAP-2 failed"
+admin "${OAP2_BASE}" runtime-rule delete --catalog "${CATALOG}" --name "${NAME_B}" >/dev/null \
+    || fail "cleanup delete of ${NAME_B} on OAP-2 failed"
+log "Phase 5 cleanup done (inactivate + delete forwarded OK)"
+
+log "=== ALL CLUSTER (kind) PHASES PASSED ==="
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml
deleted file mode 100644
index c82047c541..0000000000
--- a/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml
+++ /dev/null
@@ -1,93 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Cluster convergence — 2 OAPs behind a ZooKeeper coordinator + BanyanDB.
-# Verifies that a runtime-rule apply on one node propagates to the other within a
-# reconciler tick (default 30 s) and that the Suspend / Resume RPC bracket dispatch
-# correctly across the cluster.
-services:
-  zookeeper:
-    image: zookeeper:3.8
-    networks:
-      - e2e
-    environment:
-      ZOO_4LW_COMMANDS_WHITELIST: "ruok,stat,srvr"
-    healthcheck:
-      # Use the zookeeper-shell.sh ls wrapper (image's own /bin) — the official
-      # zookeeper:3.8 image does not ship `nc`, so the more obvious `echo ruok | nc ...`
-      # idiom fails. zkServer.sh status returns 0 once the server is in standalone /
-      # leader mode.
-      test: ["CMD-SHELL", "zkServer.sh status 2>/dev/null | grep -E 'Mode: (standalone|leader|follower)'"]
-      interval: 5s
-      timeout: 10s
-      retries: 30
-
-  banyandb:
-    extends:
-      file: ../../../script/docker-compose/base-compose.yml
-      service: banyandb
-
-  oap1:
-    extends:
-      file: ../../../script/docker-compose/base-compose.yml
-      service: oap
-    hostname: oap1
-    environment:
-      SW_ADMIN_SERVER: default
-      SW_RECEIVER_RUNTIME_RULE: default
-      SW_STORAGE: banyandb
-      SW_CLUSTER: zookeeper
-      SW_CLUSTER_ZK_HOST_PORT: zookeeper:2181
-      # First-up node also doubles as the static-rule installer; nothing to coordinate
-      # with peers on storage init.
-    ports:
-      - "11800:11800"
-      - "12800:12800"
-      - "17128:17128"
-    depends_on:
-      zookeeper:
-        condition: service_healthy
-      banyandb:
-        condition: service_healthy
-    networks:
-      - e2e
-
-  oap2:
-    extends:
-      file: ../../../script/docker-compose/base-compose.yml
-      service: oap
-    hostname: oap2
-    environment:
-      SW_ADMIN_SERVER: default
-      SW_RECEIVER_RUNTIME_RULE: default
-      SW_STORAGE: banyandb
-      SW_CLUSTER: zookeeper
-      SW_CLUSTER_ZK_HOST_PORT: zookeeper:2181
-    ports:
-      - "11801:11800"
-      - "12801:12800"
-      - "17129:17128"
-    depends_on:
-      zookeeper:
-        condition: service_healthy
-      banyandb:
-        condition: service_healthy
-      oap1:
-        condition: service_healthy
-    networks:
-      - e2e
-
-networks:
-  e2e:
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml b/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
index d6b82f0c8d..d0dbc3bee9 100644
--- a/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
+++ b/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
@@ -13,17 +13,32 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# 2-OAP cluster + ZK + BanyanDB. Drives apply / inactivate / delete on OAP-1 and
-# verifies OAP-2 converges within a reconciler tick (default 30 s).
+# Runtime-rule lifecycle + cross-node convergence on a Kubernetes (kind) cluster
+# deployed via the skywalking-helm chart — a 2-replica OAP Deployment running in
+# NO-INIT mode (`-Dmode=no-init`) behind a one-shot `-Dmode=init` schema-init Job,
+# the exact topology every production SkyWalking cluster uses. This is the case that
+# exercises the runtime-rule schema-change path on a no-init OAP: an operator apply
+# must drive backend DDL (create the BanyanDB measure) on the cluster main even
+# though it is a no-init node. BanyanDB native cluster coordination (SW_CLUSTER=
+# kubernetes) is wired by the chart; ZooKeeper is not needed.
 
 setup:
-  env: compose
-  file: docker-compose.yml
-  timeout: 25m
+  env: kind
+  file: kind.yaml
+  timeout: 30m
   init-system-environment: ../../../script/env
+  kind:
+    import-images:
+      - skywalking/oap:latest
   steps:
     - name: set PATH
       command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH
+    - name: install yq
+      command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh yq
+    - name: install swctl
+      command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl
+    - name: install kubectl
+      command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh kubectl
     - name: install jq
       command: |
         if ! command -v jq >/dev/null 2>&1; then
@@ -31,11 +46,42 @@ setup:
             https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64
           chmod +x /tmp/skywalking-infra-e2e/bin/jq
         fi
-    - name: install swctl
-      command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl
-    - name: drive cluster convergence flow
+    - name: install helm
+      command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh helm
+    # 2-replica OAP Deployment (no-init) + init Job + admin server + runtime-rule
+    # receiver + BanyanDB. fullnameOverride=skywalking makes the OAP Service / pods
+    # discoverable as skywalking-oap with labels app=skywalking,component=oap.
+    - name: install SkyWalking (no-init cluster + BanyanDB) via helm
+      command: |
+        export PATH=/tmp/skywalking-infra-e2e/bin:$PATH
+        helm -n skywalking install skywalking \
+          oci://ghcr.io/apache/skywalking-helm/skywalking-helm \
+          --version "0.0.0-${SW_KUBERNETES_COMMIT_SHA}" \
+          --create-namespace \
+          --set fullnameOverride=skywalking \
+          --set elasticsearch.enabled=false \
+          --set oap.replicas=2 \
+          --set oap.image.repository=skywalking/oap \
+          --set oap.image.tag=latest \
+          --set oap.imagePullPolicy=IfNotPresent \
+          --set oap.storageType=banyandb \
+          --set oap.env.SW_ADMIN_SERVER=default \
+          --set oap.env.SW_RECEIVER_RUNTIME_RULE=default \
+          --set ui.enabled=false \
+          --set banyandb.enabled=true \
+          --set banyandb.standalone.enabled=true \
+          --set banyandb.cluster.enabled=false \
+          --set banyandb.image.repository=ghcr.io/apache/skywalking-banyandb \
+          --set banyandb.image.tag=${SW_BANYANDB_COMMIT}
+      wait:
+        # The init Job must complete (creates the static schema) before the no-init
+        # OAP Deployment can become Available.
+        - namespace: skywalking
+          resource: deployment/skywalking-oap
+          for: condition=available
+          timeout: 20m
+    - name: drive runtime-rule lifecycle + cross-node convergence (no-init)
       command: |
-        set -euo pipefail
         export PATH=/tmp/skywalking-infra-e2e/bin:$PATH
         bash test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh
 
@@ -44,18 +90,8 @@ verify:
     count: 1
     interval: 1s
   cases:
-    - query: swctl --display json --admin-url=http://127.0.0.1:17128 admin runtime-rule list >/dev/null && echo ok
-      expected: expected/ok.txt
-
-cleanup:
-  on: always
-  collect:
-    on: failure
-    output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/cluster
-    items:
-      - service: oap1
-        paths:
-          - /skywalking/logs/
-      - service: oap2
-        paths:
-          - /skywalking/logs/
+    # The lifecycle assertions live in cluster-flow.sh (it exits non-zero on any
+    # failure, failing setup). This is a thin liveness check that the no-init cluster
+    # came up with both OAP replicas Ready.
+    - query: kubectl -n skywalking get deployment skywalking-oap -o jsonpath='{.status.readyReplicas}'
+      expected: expected/ready-replicas.txt
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt b/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt
deleted file mode 100644
index 9766475a41..0000000000
--- a/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt
+++ /dev/null
@@ -1 +0,0 @@
-ok
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/expected/ready-replicas.txt b/test/e2e-v2/cases/runtime-rule/cluster/expected/ready-replicas.txt
new file mode 100644
index 0000000000..d8263ee986
--- /dev/null
+++ b/test/e2e-v2/cases/runtime-rule/cluster/expected/ready-replicas.txt
@@ -0,0 +1 @@
+2
\ No newline at end of file
diff --git a/test/e2e-v2/cases/runtime-rule/cluster/kind.yaml b/test/e2e-v2/cases/runtime-rule/cluster/kind.yaml
new file mode 100644
index 0000000000..a57ada7120
--- /dev/null
+++ b/test/e2e-v2/cases/runtime-rule/cluster/kind.yaml
@@ -0,0 +1,23 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Single-node kind cluster for the runtime-rule no-init cluster e2e. One node is
+# enough — the OAP Deployment's replicas (the no-init cluster) and the BanyanDB pod
+# all schedule here. Node image pinned to the same k8s 1.28 build the istio e2e uses.
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+nodes:
+  - role: control-plane
+    image: kindest/node:v1.28.15@sha256:a7c05c7ae043a0b8c818f5a06188bc2c4098f6cb59ca7d1856df00375d839251
