Duansg commented on issue #3757: URL: https://github.com/apache/hertzbeat/issues/3757#issuecomment-3334900616
> [@Duansg](https://github.com/Duansg) From the code, when an alarm resolves, a notification will be sent immediately without being affected by the repeat interval. Why is this notification sent after a long delay?
>
> [Image]

Hi, I'm very sorry. Due to recent work commitments, I hadn't been able to review the issue you raised thoroughly. I've now taken the time to look into it, and my troubleshooting process is below for your reference.

First, I examined how times are handled on the alert dashboard. After reviewing the `endAt` logic, I can confirm that the header displays the update time of the aggregation group, while the end time is the `recovery time` of the alerts within the aggregation group, that is, the `endAt` value.

As discussed earlier, the aggregation group's cache is cleared after notifications are sent. I'm emphasizing this because, when an alert is handled, the decision to send a notification immediately involves two separate checks.

1. The alert already exists in the aggregation group's cache. In this case the cached entry is overwritten and the method returns.
   The code is as follows:

   ```
   // Check if this is a duplicate alert
   SingleAlert existingAlert = cache.getAlertFingerprints().get(fingerprint);
   if (existingAlert != null) {
       // Keep the original start time and overwrite the cached entry
       alert.setStartAt(existingAlert.getStartAt());
       cache.getAlertFingerprints().put(fingerprint, alert);
       return;
   }
   ```

2. If condition 1 is not met (first occurrence, or the cache has been cleared), a new alert is added to the cache. The code then checks whether `all` alerts in the aggregation group's cache have resolved. If they have, it proceeds with the sending logic shown below:

   ```
   // Add new alert
   cache.getAlertFingerprints().put(fingerprint, alert);

   // Note: determines whether all members of the aggregation group have resolved.
   if (shouldSendGroupImmediately(cache)) {
       sendGroupAlert(cache);
       cache.setLastSendTime(System.currentTimeMillis());
       cache.getAlertFingerprints().clear();
   }
   ```

3. In addition to the two paths above, there is the scheduled task for the aggregation group. If `group_interval` is set to a relatively short time, the task calls `sendGroupAlert` as usual. Its decision logic is straightforward: the `RepeatInterval` check applies only while the aggregation group still contains an unresolved (firing) alert. Otherwise the group counts as resolved and sending proceeds without that check. The code is as follows:

   ```
   private void sendGroupAlert(GroupAlertCache cache) {
       // Current time in ms (assumed; the declaration of `now` is not shown in the quoted excerpt)
       long now = System.currentTimeMillis();
       String status = determineGroupStatus(cache.getAlertFingerprints().values());

       // For firing alerts, check repeat interval
       if (CommonConstants.ALERT_STATUS_FIRING.equals(status)) {
           AlertGroupConverge ruleConfig = groupDefines.get(cache.getGroupDefineName());
           long repeatInterval = ruleConfig.getRepeatInterval() != null
                   ? ruleConfig.getRepeatInterval() * MS_PER_SECOND
                   : DEFAULT_REPEAT_INTERVAL;

           // Skip if within repeat interval
           if (cache.getLastRepeatTime() > 0
                   && now - cache.getLastRepeatTime() < repeatInterval) {
               return;
           }
           cache.setLastRepeatTime(now);
       }
       // ... notice
   }

   private String determineGroupStatus(Collection<SingleAlert> alerts) {
       // If any alert is firing, group is firing
       return alerts.stream()
               .anyMatch(alert -> CommonConstants.ALERT_STATUS_FIRING.equals(alert.getStatus()))
               ? CommonConstants.ALERT_STATUS_FIRING
               : CommonConstants.ALERT_STATUS_RESOLVED;
   }
   ```

Based on the above analysis, my preliminary conclusion is:

1. `processAlertByGroupDefine`: when this alert resolved, the cache still held unprocessed data, or the alerts within the aggregation group were in two different states (firing and resolved).
2. `sendGroupAlert`: the aggregation group still had unresolved alerts, so the repeat-interval check applied.

Both points indicate that the aggregation group contains alerts with different fingerprints whose statuses are inconsistent, which leads to the discrepancy in the group's notification flow.

Recommendation: first rule out the possibility that multiple alerts are hitting the same aggregation group.

Finally, I will add debug logs at these decision points to help you troubleshoot alerting issues more easily.
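To make the conclusion concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the HertzBeat implementation: the class `GroupStatusSketch` and its methods `put`, `groupStatus`, and `trySend` are hypothetical names, statuses are plain strings, and clearing the cache after a send is left out. It only models the rule that a group is firing while any member is firing, and that the repeat interval is checked only for a firing group.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model only; all names here are hypothetical, not HertzBeat APIs. */
public class GroupStatusSketch {

    static final String FIRING = "firing";
    static final String RESOLVED = "resolved";

    // fingerprint -> latest status of that alert
    private final Map<String, String> alerts = new LinkedHashMap<>();
    private final long repeatIntervalMs;
    private long lastRepeatTime = 0L;

    GroupStatusSketch(long repeatIntervalMs) {
        this.repeatIntervalMs = repeatIntervalMs;
    }

    void put(String fingerprint, String status) {
        alerts.put(fingerprint, status);
    }

    /** The group is firing as long as at least one member is firing. */
    String groupStatus() {
        return alerts.containsValue(FIRING) ? FIRING : RESOLVED;
    }

    /** Returns true if a group notification would go out at time {@code now} (ms). */
    boolean trySend(long now) {
        if (alerts.isEmpty()) {
            return false;
        }
        if (FIRING.equals(groupStatus())) {
            // Only a firing group is subject to the repeat interval.
            if (lastRepeatTime > 0 && now - lastRepeatTime < repeatIntervalMs) {
                return false;
            }
            lastRepeatTime = now;
        }
        return true;
    }

    public static void main(String[] args) {
        GroupStatusSketch group = new GroupStatusSketch(60_000L);
        group.put("A", FIRING);
        group.put("B", FIRING);
        System.out.println("initial send: " + group.trySend(1_000L));

        // A resolves, but B (a different fingerprint) keeps the group firing,
        // so the repeat interval suppresses the send that would carry A's recovery.
        group.put("A", RESOLVED);
        System.out.println("after A resolves: " + group.trySend(2_000L));

        // Once B resolves as well, the group is resolved and the check is skipped.
        group.put("B", RESOLVED);
        System.out.println("after B resolves: " + group.trySend(3_000L));
    }
}
```

Under these assumptions, a resolved alert that shares an aggregation group with a still-firing alert is not reported until the repeat interval elapses or every alert in the group resolves, which matches the delayed recovery notification discussed in this issue.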
