Duansg commented on issue #3757:
URL: https://github.com/apache/hertzbeat/issues/3757#issuecomment-3334900616

   > [@Duansg](https://github.com/Duansg) From the code, when an alarm 
resolves, a notification will be sent immediately without being affected by the 
repeat interval. Why is this notification sent after a long delay?
   > 
   > [Image: screenshot attached to the original comment]
   
   Hi, sorry for the late reply. Recent work commitments kept me from
reviewing the issue you raised thoroughly. I have now looked into it, and my
troubleshooting process is below for your reference:
   
   First, I examined how the alert dashboard handles time. After reviewing the
`endAt` logic, I can confirm that the header displays the update time of the
aggregation group, while the end time is the `recovery time` of the alerts
within the aggregation group, that is, the `endAt` value.
   
   As discussed earlier, the aggregation group's cache is cleared after a
notification is sent. I emphasize this because, when an alert is handled,
whether a notification is sent immediately depends on two separate checks.
   
   1. The alert already exists in the aggregation group's cache. The cached
entry is overwritten directly and the method returns without sending anything.
The code is as follows:
   ```
   // Check if this is a duplicate alert
   SingleAlert existingAlert = cache.getAlertFingerprints().get(fingerprint);
   if (existingAlert != null) {
       // Keep the original start time, then overwrite the cached alert
       alert.setStartAt(existingAlert.getStartAt());
       cache.getAlertFingerprints().put(fingerprint, alert);
       return;
   }
   ```
   
   2. If condition 1 is not met (first occurrence, or the cache has been
cleared), the alert is added as a new entry. The code then checks whether
`all` alerts in the aggregation group's cache have been restored. If so, it
runs the sending logic shown below:
   ```
   // Add new alert
   cache.getAlertFingerprints().put(fingerprint, alert);
           
   // Note: checks whether all members of the aggregation group have been restored
   if (shouldSendGroupImmediately(cache)) {
       sendGroupAlert(cache);
       cache.setLastSendTime(System.currentTimeMillis());
       cache.getAlertFingerprints().clear();
   }
   ```
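
A minimal sketch of the check described in the note above, assuming it simply
requires every cached alert to be in the resolved state. The class name
`ImmediateSendSketch`, the string constants, and the fingerprint-to-status map
are illustrative stand-ins, not HertzBeat's actual `shouldSendGroupImmediately`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImmediateSendSketch {
    // Stand-ins for CommonConstants.ALERT_STATUS_FIRING / ALERT_STATUS_RESOLVED
    static final String FIRING = "firing";
    static final String RESOLVED = "resolved";

    // Assumed predicate: true only when every alert in the group is resolved
    static boolean allResolved(Map<String, String> statusByFingerprint) {
        return statusByFingerprint.values().stream().allMatch(RESOLVED::equals);
    }

    public static void main(String[] args) {
        Map<String, String> group = new LinkedHashMap<>();
        group.put("fp-a", RESOLVED);
        group.put("fp-b", FIRING);
        // One member is still firing, so no immediate send
        System.out.println(allResolved(group));
        group.put("fp-b", RESOLVED);
        // Every member is resolved, so the group would be sent immediately
        System.out.println(allResolved(group));
    }
}
```

Under this assumption, a single firing member is enough to block the immediate
send for the whole group.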
   
   3. Besides the two paths above, there is also the aggregation group's
scheduled task. If `group_interval` is set to a relatively short time, the task
calls `sendGroupAlert` as usual. Its decision logic is straightforward: the
`RepeatInterval` check applies only while the aggregation group still contains
unrecovered alerts; otherwise the method proceeds straight to sending. The
code is as follows:
   ```
   private void sendGroupAlert(GroupAlertCache cache) {
       // Timestamp used by the repeat-interval check below
       long now = System.currentTimeMillis();
       String status = determineGroupStatus(cache.getAlertFingerprints().values());

       // For firing alerts, check repeat interval
       if (CommonConstants.ALERT_STATUS_FIRING.equals(status)) {
           AlertGroupConverge ruleConfig = groupDefines.get(cache.getGroupDefineName());
           long repeatInterval = ruleConfig.getRepeatInterval() != null
               ? ruleConfig.getRepeatInterval() * MS_PER_SECOND
               : DEFAULT_REPEAT_INTERVAL;

           // Skip if within repeat interval
           if (cache.getLastRepeatTime() > 0
               && now - cache.getLastRepeatTime() < repeatInterval) {
               return;
           }
           cache.setLastRepeatTime(now);
       }
       // ... notice
   }
   
   private String determineGroupStatus(Collection<SingleAlert> alerts) {
       // If any alert is firing, group is firing
       return alerts.stream()
           .anyMatch(alert -> CommonConstants.ALERT_STATUS_FIRING.equals(alert.getStatus()))
           ? CommonConstants.ALERT_STATUS_FIRING
           : CommonConstants.ALERT_STATUS_RESOLVED;
   }
   ```
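
The repeat-interval gate in `sendGroupAlert` can be isolated as a pure function,
which makes its effect easier to see. This is a sketch only: the class name
`RepeatIntervalSketch`, the method `suppressed`, and the string constants are
hypothetical, and the timestamps are passed in rather than read from the clock:

```java
final class RepeatIntervalSketch {
    // Stand-ins for CommonConstants.ALERT_STATUS_FIRING / ALERT_STATUS_RESOLVED
    static final String FIRING = "firing";
    static final String RESOLVED = "resolved";

    /** Returns true when the group notification would be skipped. */
    static boolean suppressed(String groupStatus, long lastRepeatTime,
                              long now, long repeatIntervalMs) {
        if (!FIRING.equals(groupStatus)) {
            // A fully resolved group bypasses the repeat interval
            return false;
        }
        // A firing group is skipped while still inside the repeat interval
        return lastRepeatTime > 0 && now - lastRepeatTime < repeatIntervalMs;
    }

    public static void main(String[] args) {
        System.out.println(suppressed(FIRING, 1_000, 2_000, 5_000));   // inside the interval
        System.out.println(suppressed(FIRING, 1_000, 7_000, 5_000));   // interval elapsed
        System.out.println(suppressed(RESOLVED, 1_000, 2_000, 5_000)); // resolved group
    }
}
```

Because the group status is firing as soon as any one member is firing, a
resolved member in a mixed group is held back by the same gate.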
   
   Based on the above analysis, my preliminary conclusion is:
   1. `processAlertByGroupDefine`: when this alert was restored, the cache
still held unprocessed data, or the alerts within the aggregation group were in
two different states (Alerting/Restored).
   2. `sendGroupAlert`: the aggregation group still had unresolved alerts.
   
   Both points therefore indicate that the aggregation group contains alerts
with different signatures (fingerprints) whose statuses are inconsistent, which
leads to the discrepancy in the group's notification behavior.
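
To illustrate how such a delay could arise, here is a small simulation of the
three paths described above: duplicate overwrite, immediate send for a new
alert, and the scheduled task with its repeat-interval check. It is a simplified
model under stated assumptions, not HertzBeat's code: `GroupDelaySketch`,
`onAlert`, `onGroupIntervalTick`, the string statuses, and the explicit
timestamps are all hypothetical, and the scheduled task is assumed to do nothing
when the cache is empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupDelaySketch {
    static final String FIRING = "firing";
    static final String RESOLVED = "resolved";

    final Map<String, String> cache = new LinkedHashMap<>(); // fingerprint -> status
    final List<String> sent = new ArrayList<>();             // "status@time" per notification
    final long repeatIntervalMs;
    long lastRepeatTime = 0;

    GroupDelaySketch(long repeatIntervalMs) {
        this.repeatIntervalMs = repeatIntervalMs;
    }

    void onAlert(String fingerprint, String status, long now) {
        if (cache.containsKey(fingerprint)) {
            cache.put(fingerprint, status); // path 1: overwrite and return, nothing is sent
            return;
        }
        cache.put(fingerprint, status);     // path 2: new alert
        if (cache.values().stream().allMatch(RESOLVED::equals)) {
            send(now);                      // immediate send only if all members are resolved
        }
    }

    void onGroupIntervalTick(long now) {    // path 3: scheduled task
        if (!cache.isEmpty()) {
            send(now);
        }
    }

    private void send(long now) {
        String status = cache.containsValue(FIRING) ? FIRING : RESOLVED;
        if (FIRING.equals(status)) {
            if (lastRepeatTime > 0 && now - lastRepeatTime < repeatIntervalMs) {
                return;                     // firing group inside the repeat interval: skip
            }
            lastRepeatTime = now;
        }
        sent.add(status + "@" + now);
        cache.clear();                      // cache is cleared after a notification is sent
    }

    static List<String> demo() {
        GroupDelaySketch g = new GroupDelaySketch(60_000);
        g.onAlert("fp-a", FIRING, 0);
        g.onAlert("fp-b", FIRING, 0);
        g.onGroupIntervalTick(1_000);       // first firing notification goes out
        g.onAlert("fp-a", FIRING, 2_000);
        g.onAlert("fp-b", FIRING, 2_000);
        g.onAlert("fp-a", RESOLVED, 3_000); // duplicate path: overwritten, not sent
        g.onGroupIntervalTick(4_000);       // fp-b still firing: blocked by repeat interval
        g.onAlert("fp-b", RESOLVED, 5_000); // duplicate path again: not sent
        g.onGroupIntervalTick(6_000);       // group fully resolved: notification sent
        return g.sent;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [firing@1000, resolved@6000]
    }
}
```

In this model `fp-a` recovers at t=3000, but the resolved notification only goes
out at t=6000, after `fp-b` has also recovered and the scheduled task has run
again.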
   
   Recommendation: check whether multiple alerts hit the same aggregation
group, so that this cause can be confirmed or ruled out.
   
   Finally, I will add debug logs at these logical decision points to help you 
better troubleshoot issues related to alerting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

