GitHub user shrihari7396 edited a discussion: Design Discussion: Embedding AlertServer into dolphinscheduler-api Module
Hi all,

I’ve been studying the architectural requirements for embedding the AlertServer into the API Server (related to #8975). After reviewing the initialization flows in `dolphinscheduler-alert-server` and `dolphinscheduler-api`, I’d like to discuss a potential design direction and gather feedback.

My goal is to transition the alerting mechanism from a standalone process to an embedded background service while maintaining DolphinScheduler's high-availability and reliability standards.

---

## Proposed Technical Direction

### 1. Logic Decoupling (Modularization)

Instead of duplicating source code, refactor the core alerting logic (e.g., `AlertBootstrapService`, `AlertSender`) into a reusable library module. `dolphinscheduler-api` will consume this module as a dependency, ensuring a single source of truth for alerting logic.

### 2. Lifecycle Integration

Use Spring-managed components and `@PostConstruct` hooks within the API Server to initialize the alerting engine. This ties the alerting threads to the API's primary lifecycle, so they start only after the server has successfully joined the Registry.

### 3. Leader Election & High Availability (HA)

To prevent duplicate alert processing in horizontally scaled API deployments, I propose leveraging the existing `RegistryClient` (ZooKeeper/Etcd) to implement a **Leader-Follower** model. Only the "Leader" API instance will activate the `AlertEventLoop`, with standby nodes ready to take over upon leader failure.

### 4. Fault Tolerance & Data Integrity

* **Atomic Claim Mechanism:** Implement SQL-based optimistic locking (e.g., `UPDATE ... SET status = 'SENDING', handler_instance = 'ID' WHERE status = 'PENDING'`) so that each alert row is acquired by exactly one handler.
* **Self-Healing "Janitor" Thread:** Introduce a background monitoring thread on the leader node to identify alerts orphaned in a `SENDING` state due to unexpected instance crashes and reset them to `PENDING` for re-delivery.

### 5. Performance Isolation

Configure a dedicated `ThreadPoolExecutor` for alerting tasks. This prevents long-running notification I/O (e.g., slow SMTP or Webhook responses) from starving the API's Netty/Tomcat worker threads, keeping the REST interface responsive.

### 6. SPI Management & Decommissioning

Ensure the API Server remains compatible with the Alert SPI for dynamic plugin loading. This plan includes the complete removal of standalone `AlertServer.java` entry points, assembly descriptors, and redundant Docker/K8s service definitions to simplify the deployment footprint.

---

I would appreciate any feedback or concerns regarding this approach, particularly on the distributed coordination strategy, before I proceed further with implementation planning.

Best regards,
**Shrihari Rajendrakumar Kulkarni**

GitHub link: https://github.com/apache/dolphinscheduler/discussions/18005
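
To make the performance-isolation point (section 5) more concrete, here is a minimal sketch of what a dedicated alert pool could look like. It uses only the JDK; the class name `AlertPoolSketch`, the method `newAlertPool`, the `alert-sender-` thread prefix, and the pool and queue sizes are all hypothetical and not taken from the DolphinScheduler code base.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a bounded pool reserved for alert delivery. */
public class AlertPoolSketch {

    /** Builds a fixed-size pool with a bounded queue and recognizable thread names. */
    public static ThreadPoolExecutor newAlertPool(int threads, int queueCapacity) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            // Named threads are easy to tell apart from web worker threads in a thread dump.
            Thread t = new Thread(runnable, "alert-sender-" + counter.incrementAndGet());
            t.setDaemon(true); // alert threads should not keep the JVM alive on shutdown
            return t;
        };
        return new ThreadPoolExecutor(
                threads, threads,                       // fixed size: core == max
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded backlog
                factory,
                // Reject when saturated rather than running the task on the submitting thread.
                new ThreadPoolExecutor.AbortPolicy());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newAlertPool(2, 100);
        // Stand-in for a slow SMTP/Webhook call; it only reports which thread ran it.
        Callable<String> send = () -> Thread.currentThread().getName();
        Future<String> result = pool.submit(send);
        System.out.println("alert sent on thread: " + result.get());
        pool.shutdown();
    }
}
```

The bounded queue plus `AbortPolicy` is one possible choice: when the pool is saturated, the submission fails fast with `RejectedExecutionException` instead of blocking or running on the caller's thread. Under the claim scheme in section 4, a rejected alert could simply stay `PENDING` and be picked up on a later poll, though that interaction is an assumption of this sketch.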
