hanahmily opened a new issue, #13621:
URL: https://github.com/apache/skywalking/issues/13621

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no 
similar feature requirement.
   
   
   ### Description
   
   
   ## 1\. Context & Motivation
   
   **Current State:** BanyanDB currently relies on `etcd` as a hard dependency 
for cluster coordination, metadata storage, and node discovery (Meta Nodes). 
This requires maintaining a separate `etcd` cluster, managing leases for health 
checks, and handling complex certificate management for secure communication.
   
   **Goal:** Transform BanyanDB into a "Zero-Dependency" architecture by 
replacing the `etcd`-based registry with a decentralized **DNS-based Node 
Discovery** mechanism. This simplifies deployment on Kubernetes (StatefulSets) 
and static environments (VMs/Edge).
   
   -----
   
   ## 2\. Technical Design Specification
   
   ### 2.1 Core Abstraction: `NodeRegistry`
   
   We will introduce a modular `NodeRegistry` interface to decouple the 
discovery logic from the specific implementation.
   
     * **Old Flow:** `Liaison` -\> Watch `etcd` Key -\> Update gRPC Connection.
     * **New Flow:** `Liaison` -\> Poll `NodeRegistry` -\> Update gRPC 
Connection.
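
As a rough illustration of the abstraction above, the interface could look something like the following. This is only a sketch: the `Node` fields, the `ListNodes` method name, the `fixedRegistry` type, and the `listAddresses` helper are all assumptions made for this example and are not defined in the proposal.

```go
package main

import (
	"context"
	"fmt"
)

// Node is a hypothetical, minimal view of a cluster member.
type Node struct {
	Name    string // logical node name (assumed field)
	Address string // host:port used for the gRPC connection (assumed field)
}

// NodeRegistry hides where the node list comes from (DNS, static file, ...).
type NodeRegistry interface {
	ListNodes(ctx context.Context) ([]Node, error)
}

// fixedRegistry is a trivial in-memory implementation used for illustration.
type fixedRegistry struct{ nodes []Node }

func (r fixedRegistry) ListNodes(ctx context.Context) ([]Node, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	// Return a copy so callers cannot mutate the registry's slice.
	return append([]Node(nil), r.nodes...), nil
}

// listAddresses mirrors the "New Flow": poll the registry once and collect
// the addresses that the gRPC connection should be updated with.
func listAddresses(reg NodeRegistry) ([]string, error) {
	nodes, err := reg.ListNodes(context.Background())
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(nodes))
	for _, n := range nodes {
		addrs = append(addrs, n.Address)
	}
	return addrs, nil
}

func main() {
	// The name and address below are placeholders, not real endpoints.
	reg := fixedRegistry{nodes: []Node{{Name: "data-0", Address: "data-0.example.internal:17912"}}}
	addrs, err := listAddresses(reg)
	fmt.Println(addrs, err)
}
```

With an interface of this shape, the DNS and static implementations could be swapped without touching the Liaison's connection logic.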
   
   ### 2.2 Discovery Mechanism (DNS)
   
   The primary implementation will be the **DNS Registry**, operating in a 
"Pull-based" model.
   
     * **Query Strategy:**
       1.  **Primary:** Query **SRV Records** (RFC 2782) to discover target 
hostnames and dynamic ports (critical for K8s Headless Services).
       2.  **Fallback:** Query **A/AAAA Records** if SRV is unavailable 
(requires static port configuration).
       3.  **Fallback (Static Registry):** To support environments without DNS, 
or to allow emergency overrides, load a fixed list of peers from a local file 
(`topology.yml`). Changes to this file should be hot-reloaded. 
   
     * **Polling & Caching:**
         * Implement a **Custom gRPC Resolver** (Go) that polls DNS at a 
configurable interval (default: 30s). During startup, the interval should be 
shortened to 5 seconds so that topology changes are picked up quickly. Each of 
the two intervals should have its own flag.
         * **Two-Layer Caching:** Respect DNS TTL (Infrastructure layer) and 
maintain an internal snapshot (Application layer).
     * **Resilience (Serve Stale):**
         * If the DNS server returns a failure (e.g., `SERVFAIL`, Timeout), the 
resolver **MUST NOT** flush the current address list.
         * It must log a warning and return the **stale** (last known good) 
list of addresses to ensure partition tolerance.
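
The query order and the serve-stale rule above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `dnsResolver`, `lookups`, `Refresh`, and `fakeLookups` are hypothetical names, and the DNS calls are injected so the example runs without a DNS server. In a real resolver the two function fields would be `net.LookupSRV` and `net.LookupHost`, whose signatures they match. Polling intervals, TTL handling, and the static-file fallback are left out.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"sort"
	"strconv"
	"strings"
	"sync"
)

// lookups bundles the two DNS calls so they can be swapped out.
// In production these would be net.LookupSRV and net.LookupHost.
type lookups struct {
	srv  func(service, proto, name string) (string, []*net.SRV, error)
	host func(host string) ([]string, error)
}

// dnsResolver is a hypothetical resolver that remembers the last good answer.
type dnsResolver struct {
	name       string // DNS name to resolve (assumed to be configured)
	staticPort int    // port paired with A/AAAA answers
	lk         lookups

	mu       sync.Mutex
	lastGood []string
}

// resolveOnce tries SRV first and falls back to A/AAAA with the static port.
func (r *dnsResolver) resolveOnce() ([]string, error) {
	// Empty service and proto make LookupSRV query r.name directly.
	if _, recs, err := r.lk.srv("", "", r.name); err == nil && len(recs) > 0 {
		addrs := make([]string, 0, len(recs))
		for _, rec := range recs {
			host := strings.TrimSuffix(rec.Target, ".")
			addrs = append(addrs, net.JoinHostPort(host, strconv.Itoa(int(rec.Port))))
		}
		return addrs, nil
	}
	hosts, err := r.lk.host(r.name)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(hosts))
	for _, h := range hosts {
		addrs = append(addrs, net.JoinHostPort(h, strconv.Itoa(r.staticPort)))
	}
	return addrs, nil
}

// Refresh returns the current address list. On failure it logs a warning and
// returns the last known good list instead of flushing it.
func (r *dnsResolver) Refresh() []string {
	addrs, err := r.resolveOnce()
	r.mu.Lock()
	defer r.mu.Unlock()
	if err != nil {
		log.Printf("WARN: DNS lookup for %q failed, serving stale list: %v", r.name, err)
		return append([]string(nil), r.lastGood...)
	}
	sort.Strings(addrs) // stable order makes snapshots easy to compare
	r.lastGood = addrs
	return append([]string(nil), addrs...)
}

// fakeLookups returns canned answers so the sketch runs without a DNS server.
// An empty targets or hosts slice simulates a failed lookup of that kind.
func fakeLookups(targets []string, port uint16, hosts []string) lookups {
	return lookups{
		srv: func(_, _, _ string) (string, []*net.SRV, error) {
			if len(targets) == 0 {
				return "", nil, errors.New("SERVFAIL (simulated)")
			}
			recs := make([]*net.SRV, 0, len(targets))
			for _, t := range targets {
				recs = append(recs, &net.SRV{Target: t, Port: port})
			}
			return "", recs, nil
		},
		host: func(string) ([]string, error) {
			if len(hosts) == 0 {
				return nil, errors.New("timeout (simulated)")
			}
			return hosts, nil
		},
	}
}

func main() {
	// Placeholder names and ports, chosen only for the demonstration.
	r := &dnsResolver{name: "data.example.internal", staticPort: 17912,
		lk: fakeLookups([]string{"data-0.example.internal."}, 17912, nil)}
	fmt.Println(r.Refresh())        // answer built from the SRV record
	r.lk = fakeLookups(nil, 0, nil) // every lookup now fails
	fmt.Println(r.Refresh())        // the stale list is served
}
```

Keeping the snapshot behind a mutex is one simple way to let a polling goroutine and readers share it; an atomic pointer swap would work as well.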
   
   ### 2.3 Peer Discovery
   
     * **Liaison Node Discovery:** Liaison nodes will discover the data nodes.
     * **Data Node Mesh:** Data nodes will discover peers by resolving the same 
DNS name they publish themselves.
     * **Lifecycle:** Hot nodes discover Warm/Cold nodes.
   
   ### 2.4 Two-Phase Discovery
   
   Instead of reading the full Node struct from etcd before connecting, the 
Liaison/Data node will first connect via DNS and then query the node directly 
for its details.
   
   To support the second phase, add a new gRPC service that returns the node's 
own Node struct. 
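
A rough sketch of the two phases, under stated assumptions: `NodeInfo`, `nodeInfoFetcher`, `discover`, `fakeFetcher`, and `perNodeTimeout` are hypothetical names, and the gRPC call is replaced by a plain function because the new service is not yet defined. Skipping nodes that fail the second phase is a design choice made for this example, not something the proposal specifies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// NodeInfo is a hypothetical stand-in for the Node struct that the new
// gRPC service would return.
type NodeInfo struct {
	Name  string
	Roles []string
}

// nodeInfoFetcher abstracts phase two: ask one address for its own details.
// In the proposal this would be a gRPC call; here it is a plain function.
type nodeInfoFetcher func(ctx context.Context, addr string) (NodeInfo, error)

// perNodeTimeout bounds each phase-two call (value chosen arbitrarily).
const perNodeTimeout = 2 * time.Second

// discover takes the addresses found in phase one (DNS) and queries each
// node for its details. Nodes that cannot be reached are skipped.
func discover(addrs []string, fetch nodeInfoFetcher) map[string]NodeInfo {
	nodes := make(map[string]NodeInfo, len(addrs))
	for _, addr := range addrs {
		ctx, cancel := context.WithTimeout(context.Background(), perNodeTimeout)
		info, err := fetch(ctx, addr)
		cancel()
		if err != nil {
			continue // assumed policy: retry on the next poll
		}
		nodes[addr] = info
	}
	return nodes
}

// fakeFetcher simulates the remote service; addresses listed in down fail.
func fakeFetcher(down ...string) nodeInfoFetcher {
	return func(ctx context.Context, addr string) (NodeInfo, error) {
		for _, d := range down {
			if d == addr {
				return NodeInfo{}, errors.New("connection refused (simulated)")
			}
		}
		return NodeInfo{Name: "node@" + addr, Roles: []string{"data"}}, nil
	}
}

func main() {
	// Placeholder addresses standing in for the phase-one DNS result.
	addrs := []string{"data-0.example.internal:17912", "data-1.example.internal:17912"}
	nodes := discover(addrs, fakeFetcher("data-1.example.internal:17912"))
	fmt.Println(len(nodes), nodes["data-0.example.internal:17912"].Name)
}
```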
   
   ### 2.5 Troubleshooting DNS Discovery
   
   In the absence of etcdctl, operators need new tools.
   
   **State gRPC service**: bydbctl/UI -> calls (Liaison/Data).GetClusterState() 
-> returns the internal list derived from DNS. In the future, this service will 
return more internal state than the DNS results alone. 
   
   **Metrics**: New metrics are required:
   
   - discovery_dns_lookup_duration_seconds
   - discovery_dns_lookup_failures_total
   - discovery_cluster_size (Gauge)
   
   
   -----
   
   ## 3\. Task List
   
     - [ ] Implement `DNSNodeRegistry` with `net.LookupSRV` and 
`net.LookupHost`.
     - [ ] Implement `StaticNodeRegistry` for fallback/file-based discovery.
     - [ ] Update Helm Charts.
     - [ ] Create E2E test suite for startup.
     - [ ] Update documentation (concept and operational documents).
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a pull request to implement this on your own?
   
   - [ ] Yes I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

