Duansg commented on PR #3673: URL: https://github.com/apache/hertzbeat/pull/3673#issuecomment-3237625188
> > 5.The entire mechanism currently relies on the normal generation and advancement of `Watermark`. The generation method for `newWatermark` is `maxTs - DEFAULT_WATERMARK_DELAY_MS`, while `maxTs` indirectly depends on the time of newly input `LogEntry`. If events sent by systems with unsynchronized clocks have timestamps ahead of schedule, could this cause the watermark to stagnate and result in delayed alerts? > > You are absolutely right, thank you for pointing this out. This is a critical issue that I completely overlooked in my code. > > You've correctly identified a serious flaw in my logic. When an event with a timestamp far in the future is processed, my current implementation would skew the `maxTs`. This, in turn, causes the `newWatermark` (calculated as `maxTs - DEFAULT_WATERMARK_DELAY_MS`) to jump far ahead of the actual event time. The result is that the watermark would stagnate because normally timed events would be incorrectly considered "late" and fail to advance it, leading to significant delays in the alerting system. > > To fix this, I plan to implement a "stagnation detection and forced advancement" mechanism. Here is the solution I will be coding: > > 1. Track the Last Watermark Update Time: I will add a new variable, `lastWatermarkUpdateTime`, to record the timestamp of the last successful watermark update. > 2. Check for Stagnation: Within the `TimeService`'s regular watermark check, if the newly calculated watermark is the same as the existing one, the logic will treat this as a potential stagnation. > 3. Measure Stagnation Duration: If no update occurs, I'll add a step to calculate the stagnation period by subtracting `lastWatermarkUpdateTime` from the current system time. > 4. Force a Watermark Update: If this stagnation period exceeds a predefined threshold (e.g., `MAX_WATERMARK_STAGNATION_TIME`), my code will determine that the watermark has been stuck for too long. At this point, instead of relying on the potentially incorrect `maxTs`, it will forcibly advance the watermark to `now() - DEFAULT_WATERMARK_DELAY_MS`. This will ensure the processing pipeline moves forward, even when compromised by future-dated events. > > Thank you again for raising this. Your insight is invaluable for helping me improve the robustness and reliability of my code. I will prioritize implementing this fix. @bigcyy hhh, you're welcome. I'm glad I could help you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
