hanahmily opened a new issue, #13621: URL: https://github.com/apache/skywalking/issues/13621
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar feature requirement.

### Description

## 1. Context & Motivation

**Current State:** BanyanDB currently relies on `etcd` as a hard dependency for cluster coordination, metadata storage, and node discovery (Meta Nodes). This requires maintaining a separate `etcd` cluster, managing leases for health checks, and handling complex certificate management for secure communication.

**Goal:** Transform BanyanDB into a "Zero-Dependency" architecture by replacing the `etcd`-based registry with a decentralized **DNS-based Node Discovery** mechanism. This simplifies deployment on Kubernetes (StatefulSets) and in static environments (VMs/Edge).

-----

## 2. Technical Design Specification

### 2.1 Core Abstraction: `NodeRegistry`

We will introduce a modular `NodeRegistry` interface to decouple the discovery logic from any specific implementation.

* **Old Flow:** `Liaison` -> Watch `etcd` Key -> Update gRPC Connection.
* **New Flow:** `Liaison` -> Poll `NodeRegistry` -> Update gRPC Connection.

### 2.2 Discovery Mechanism (DNS)

The primary implementation will be the **DNS Registry**, operating in a "Pull-based" model.

* **Query Strategy:**
    1. **Primary:** Query **SRV Records** (RFC 2782) to discover target hostnames and dynamic ports (critical for K8s Headless Services).
    2. **Fallback:** Query **A/AAAA Records** if SRV is unavailable (requires static port configuration).
    3. **Fallback: Static Registry:** To support environments without DNS, or for emergency overrides, load a fixed list of peers from a local file (`topology.yml`). This file should support hot reloading.
* **Polling & Caching:**
    * Implement a **Custom gRPC Resolver** (Go) that polls DNS at a configurable interval (default: 30s). During startup, the interval should be 5 seconds so that topology changes are picked up quickly. Two flags should configure these intervals, one for startup and one for steady state.
    * **Two-Layer Caching:** Respect DNS TTL (Infrastructure layer) and maintain an internal snapshot (Application layer).
* **Resilience (Serve Stale):**
    * If the DNS server returns a failure (e.g., `SERVFAIL`, Timeout), the resolver **MUST NOT** flush the current address list.
    * It must log a warning and return the **stale** (last known good) list of addresses to ensure partition tolerance.

### 2.3 Peer Discovery

* **Liaison Node Discovery:** Liaison nodes will discover the data nodes.
* **Data Node Mesh:** Data nodes will discover peers by resolving the same DNS name they publish themselves.
* **Lifecycle:** Hot nodes discover Warm/Cold nodes.

### 2.4 Two-Phase Discovery

Instead of reading the full Node struct from etcd before connecting, the Liaison/Data node will first connect via DNS and then query the node directly for its details. A new gRPC service will be added to return the Node.

### 2.5 Troubleshooting DNS Discovery

In the absence of `etcdctl`, operators need new tools.

**State gRPC service**: bydbctl/UI -> calls `(Liaison/Data).GetClusterState()` -> returns the internal list derived from DNS. In the future, the service will return more internal state than what DNS provides.

**Metrics**: New metrics are required:

- `discovery_dns_lookup_duration_seconds`
- `discovery_dns_lookup_failures_total`
- `discovery_cluster_size` (Gauge)

-----

## 3. Task List

- [ ] Implement `DNSNodeRegistry` with `net.LookupSRV` and `net.LookupHost`.
- [ ] Implement `StaticNodeRegistry` for fallback/file-based discovery.
- [ ] Update Helm Charts.
- [ ] Create E2E test suite for startup.
- [ ] Update Documentation (concept and operational documents).

### Use case

_No response_

### Related issues

_No response_

### Are you willing to submit a pull request to implement this on your own?

- [ ] Yes I am willing to submit a pull request on my own!

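
### Illustrative sketch for sections 2.1 and 2.2

To make the query order and the serve-stale rule from section 2.2 concrete, here is a minimal Go sketch. It is not the proposed implementation: `lookupFunc`, `srvLookup`, `hostLookup`, `Refresh`, `Nodes`, and `errNoAddresses` are hypothetical names introduced only for illustration, and the sketch assumes that an empty answer is treated like a failed lookup. Only `net.LookupSRV` and `net.LookupHost` come from the task list above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sort"
	"strconv"
	"strings"
	"sync"
)

// lookupFunc resolves a name to "host:port" endpoints (hypothetical abstraction).
type lookupFunc func(name string) ([]string, error)

// errNoAddresses marks a lookup that succeeded but returned nothing (assumption:
// an empty answer is handled like a failure so the snapshot is not flushed).
var errNoAddresses = errors.New("lookup returned no addresses")

// srvLookup is the primary strategy: SRV records carry both host and port.
func srvLookup(name string) ([]string, error) {
	// With empty service and proto, LookupSRV queries name directly.
	_, records, err := net.LookupSRV("", "", name)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(records))
	for _, rec := range records {
		host := strings.TrimSuffix(rec.Target, ".")
		addrs = append(addrs, net.JoinHostPort(host, strconv.Itoa(int(rec.Port))))
	}
	return addrs, nil
}

// hostLookup is the A/AAAA fallback; the port must be configured statically.
func hostLookup(port int) lookupFunc {
	return func(name string) ([]string, error) {
		hosts, err := net.LookupHost(name)
		if err != nil {
			return nil, err
		}
		addrs := make([]string, 0, len(hosts))
		for _, h := range hosts {
			addrs = append(addrs, net.JoinHostPort(h, strconv.Itoa(port)))
		}
		return addrs, nil
	}
}

// DNSNodeRegistry keeps the application-layer snapshot of known nodes.
type DNSNodeRegistry struct {
	name     string
	lookups  []lookupFunc // tried in order: primary first, then fallbacks
	mu       sync.RWMutex
	snapshot []string
}

func NewDNSNodeRegistry(name string, lookups ...lookupFunc) *DNSNodeRegistry {
	return &DNSNodeRegistry{name: name, lookups: lookups}
}

// Nodes returns a copy of the last known good node list.
func (r *DNSNodeRegistry) Nodes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.snapshot...)
}

// Refresh performs one poll. If every strategy fails, the previous snapshot is
// kept and returned together with an error the caller can log as a warning.
func (r *DNSNodeRegistry) Refresh() ([]string, error) {
	lastErr := errNoAddresses
	for _, lookup := range r.lookups {
		addrs, err := lookup(r.name)
		if err != nil {
			lastErr = err
			continue
		}
		if len(addrs) == 0 {
			lastErr = errNoAddresses
			continue
		}
		fresh := append([]string(nil), addrs...)
		sort.Strings(fresh) // stable order makes snapshots easy to compare
		r.mu.Lock()
		r.snapshot = fresh
		r.mu.Unlock()
		return r.Nodes(), nil
	}
	return r.Nodes(), fmt.Errorf("serving stale node list for %q: %w", r.name, lastErr)
}
```

A caller might wire it as `NewDNSNodeRegistry(name, srvLookup, hostLookup(port))` and invoke `Refresh` from a ticker that starts at the 5s startup interval and later switches to the 30s default; the name and port here are placeholders. Taking the lookups as plain functions is a design choice for this sketch so that DNS failures can be simulated in tests without a real resolver.
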
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
