bigcyy commented on PR #3673: URL: https://github.com/apache/hertzbeat/pull/3673#issuecomment-3238990609
> > 5.The entire mechanism currently relies on the normal generation and advancement of `Watermark`. The generation method for `newWatermark` is `maxTs - DEFAULT_WATERMARK_DELAY_MS`, while `maxTs` indirectly depends on the time of newly input `LogEntry`. If events sent by systems with unsynchronized clocks have timestamps ahead of schedule, could this cause the watermark to stagnate and result in delayed alerts? > > You are absolutely right, thank you for pointing this out. This is a critical issue that I completely overlooked in my code. > > You've correctly identified a serious flaw in my logic. When an event with a timestamp far in the future is processed, my current implementation would skew the `maxTs`. This, in turn, causes the `newWatermark` (calculated as `maxTs - DEFAULT_WATERMARK_DELAY_MS`) to jump far ahead of the actual event time. The result is that the watermark would stagnate because normally timed events would be incorrectly considered "late" and fail to advance it, leading to significant delays in the alerting system. > > To fix this, I plan to implement a "stagnation detection and forced advancement" mechanism. Here is the solution I will be coding: > > 1. Track the Last Watermark Update Time: I will add a new variable, `lastWatermarkUpdateTime`, to record the timestamp of the last successful watermark update. > 2. Check for Stagnation: Within the `TimeService`'s regular watermark check, if the newly calculated watermark is the same as the existing one, the logic will treat this as a potential stagnation. > 3. Measure Stagnation Duration: If no update occurs, I'll add a step to calculate the stagnation period by subtracting `lastWatermarkUpdateTime` from the current system time. > 4. Force a Watermark Update: If this stagnation period exceeds a predefined threshold (e.g., `MAX_WATERMARK_STAGNATION_TIME`), my code will determine that the watermark has been stuck for too long. At this point, instead of relying on the potentially incorrect `maxTs`, it will forcibly advance the watermark to `now() - DEFAULT_WATERMARK_DELAY_MS`. This will ensure the processing pipeline moves forward, even when compromised by future-dated events. > > Thank you again for raising this. Your insight is invaluable for helping me improve the robustness and reliability of my code. I will prioritize implementing this fix. I've decided not to use the previously mentioned solution because it doesn't fix the root problem. Instead, I will use the local system clock to maintain an acceptable time range for logs. This approach ensures that logs with anomalous timestamps cannot incorrectly advance the watermark. Under this new logic, logs with timestamps below the watermark will be dropped, while logs above the watermark will only be processed if their timestamps are within this reasonable range. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
