TJxiaobao opened a new issue, #4007:
URL: https://github.com/apache/hertzbeat/issues/4007

   ### Feature Request
   
   ## Description
   
   ### Background & Motivation
   Currently, HertzBeat provides robust monitoring and alerting capabilities. 
However, as the system scales, O&M teams often face "alert fatigue" and the 
heavy burden of manual daily inspections. To evolve HertzBeat towards 
**AIOps**, we need a more proactive and intelligent way to summarize system 
health and diagnose potential risks.
   
   We propose introducing an **Intelligent Inspection Workflow** within the 
`hertzbeat-ai` module. This feature will leverage LLMs to automate daily 
"system health checks," transforming raw metrics and alerts into actionable 
insights.
   
   ### Proposed Feature: Intelligent Inspection
   The Intelligent Inspection workflow is a composite **Skill** that 
orchestrates multiple atomic **Tools**.
   
   #### Key Workflow Steps:
   1.  **Data Harvesting**: Scan all monitors and the alerts active within a 
configurable window (e.g., the last 24 hours).
   2.  **Deep Evidence Collection**: For abnormal monitors, the AI 
automatically retrieves trend data (CPU, Memory, Latency) using existing tools.
   3.  **AI Reasoning**: The LLM performs correlation analysis (e.g., 
identifying if multiple alerts share a root cause) and risk assessment.
   4.  **Report Generation**: Generate a concise Markdown report summarizing 
health status, critical risks, and optimization suggestions.
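
To make the four steps concrete, here is a minimal Python sketch of the pipeline. It is illustrative only and is not HertzBeat code: the `Monitor` fields, the status values, and the `fetch_trend` and `ask_llm` callables are all hypothetical stand-ins for whatever the existing tools and the LLM client in `hertzbeat-ai` would actually provide.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Monitor:
    # Hypothetical shape; the real monitor entity will differ.
    id: int
    name: str
    status: str  # assumed values: "up" or "down"
    alerts: list[str] = field(default_factory=list)


def harvest(monitors: list[Monitor]) -> list[Monitor]:
    """Step 1: keep only monitors that are not up or have active alerts."""
    return [m for m in monitors if m.status != "up" or m.alerts]


def collect_evidence(monitor: Monitor, fetch_trend: Callable) -> dict:
    """Step 2: pull trend data for one abnormal monitor."""
    metrics = ("cpu", "memory", "latency")
    return {metric: fetch_trend(monitor.id, metric) for metric in metrics}


def run_inspection(monitors: list[Monitor], fetch_trend: Callable,
                   ask_llm: Callable) -> str:
    abnormal = harvest(monitors)
    evidence = {m.name: collect_evidence(m, fetch_trend) for m in abnormal}
    # Step 3: only call the LLM when there is something to analyse.
    analysis = ask_llm(evidence) if abnormal else "No abnormal monitors found."
    # Step 4: render a short Markdown report.
    lines = [
        "# Daily Inspection Report",
        "",
        f"- Monitors scanned: {len(monitors)}",
        f"- Abnormal monitors: {len(abnormal)}",
        "",
        "## Analysis",
        "",
        analysis,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Monitor(1, "db", "down", ["Connection refused"]),
        Monitor(2, "web", "up"),
    ]
    # Stubs stand in for the trend tool and the LLM call.
    print(run_inspection(demo, lambda mid, metric: [1.0, 2.0],
                         lambda ev: "db is unreachable; check the service."))
```

Passing the trend tool and the LLM in as callables keeps steps 1, 2, and 4 deterministic and easy to test, with the model involved only in step 3.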
   
   ### Technical Implementation Ideas
   - **Module**: Implement within `hertzbeat-ai`.
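   - **Orchestration**: Use a hybrid "Deterministic SOP + Agentic Diagnosis" 
approach: fixed, scripted steps for the routine parts of the inspection, with 
the LLM agent handling diagnosis.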
   - **Token Optimization**: 
       - Funnel filtering: Only process abnormal/critical monitors via AI.
       - Data summarization: Send statistical features (max, avg, trend) 
instead of raw time-series data.
   - **Human-in-the-loop**: Ensure all "Action" recommendations (like 
restarting a service) require manual confirmation.
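
The token optimization and human-in-the-loop ideas above could look roughly like the following Python sketch. The function names, the half-versus-half trend heuristic, and the 5% tolerance are assumptions made for illustration; the proposal does not specify how the trend is computed or how confirmation is collected.

```python
from statistics import mean
from typing import Callable


def summarize(samples: list[float]) -> dict:
    """Reduce a raw time series to max, avg, and a coarse trend label."""
    if not samples:
        return {"max": None, "avg": None, "trend": "unknown"}
    half = len(samples) // 2
    if half == 0:
        trend = "flat"  # a single sample carries no trend
    else:
        # Assumed heuristic: compare the mean of the later half
        # with the mean of the earlier half.
        first, second = mean(samples[:half]), mean(samples[half:])
        delta = second - first
        tolerance = 0.05 * max(abs(first), 1e-9)
        if delta > tolerance:
            trend = "rising"
        elif delta < -tolerance:
            trend = "falling"
        else:
            trend = "flat"
    return {"max": max(samples), "avg": round(mean(samples), 2), "trend": trend}


def build_llm_payload(series_by_monitor: dict, abnormal_ids: set) -> dict:
    """Funnel filtering: summarize only the monitors flagged as abnormal."""
    return {
        monitor_id: {name: summarize(values) for name, values in metrics.items()}
        for monitor_id, metrics in series_by_monitor.items()
        if monitor_id in abnormal_ids
    }


def review_action(action: str, confirm: Callable[[str], bool]) -> str:
    """Human-in-the-loop: nothing is approved without explicit confirmation."""
    return "approved" if confirm(action) else "skipped"


if __name__ == "__main__":
    series = {
        1: {"memory": [40.0, 42.0, 55.0, 61.0]},
        2: {"memory": [30.0, 30.0, 30.0, 30.0]},
    }
    print(build_llm_payload(series, abnormal_ids={1}))
    print(review_action("restart service db", confirm=lambda a: False))
```

With this shape, the prompt carries three values per metric for each abnormal monitor rather than every raw sample, and `review_action` only records the operator's decision; executing the action stays outside the AI workflow.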
   
   ### Benefits
   - **Reduce Manual Toil**: Automate the repetitive daily inspection task.
   - **Proactive Risk Detection**: Identify "silent" risks (e.g., slow memory 
leaks) before they trigger critical alerts.
   - **Enhanced User Experience**: Provide users with a high-level, intelligent 
overview of their entire infrastructure.
   
   ### Request for Feedback
   We would love to hear from the community:
   1. What do you think about the "Intelligent Inspection" concept?
   2. Are there specific inspection metrics or report formats you'd like to see?
   3. Any suggestions on the technical architecture or token optimization 
strategies?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

