bigcyy commented on PR #3673:
URL: https://github.com/apache/hertzbeat/pull/3673#issuecomment-3236705808

   > 5.The entire mechanism currently relies on the normal generation and 
advancement of `Watermark`. The generation method for `newWatermark` is `maxTs 
- DEFAULT_WATERMARK_DELAY_MS`, while `maxTs` indirectly depends on the time of 
newly input `LogEntry`. If events sent by systems with unsynchronized clocks 
have timestamps ahead of schedule, could this cause the watermark to stagnate 
and result in delayed alerts?
   
   You are absolutely right, thank you for pointing this out. This is a 
critical issue that I completely overlooked in my code.
   
   You've correctly identified a serious flaw in my logic. When an event with a 
timestamp far in the future is processed, my current implementation would skew 
the `maxTs`. This, in turn, causes the `newWatermark` (calculated as `maxTs - 
DEFAULT_WATERMARK_DELAY_MS`) to jump far ahead of the actual event time. The 
result is that the watermark would stagnate because normally timed events would 
be incorrectly considered "late" and fail to advance it, leading to significant 
delays in the alerting system.
   
   To fix this, I plan to implement a "stagnation detection and forced 
advancement" mechanism. Here is the solution I will be coding:
   
   1.  Track the Last Watermark Update Time: I will add a new variable, 
`lastWatermarkUpdateTime`, to record the timestamp of the last successful 
watermark update.
   
   2.  Check for Stagnation: Within the `TimeService`'s regular watermark 
check, if the newly calculated watermark is the same as the existing one, the 
logic will treat this as a potential stagnation.
   
   3.  Measure Stagnation Duration: If no update occurs, I'll add a step to 
calculate the stagnation period by subtracting `lastWatermarkUpdateTime` from 
the current system time.
   
   4.  Force a Watermark Update: If this stagnation period exceeds a predefined 
threshold (e.g., `MAX_WATERMARK_STAGNATION_TIME`), my code will determine that 
the watermark has been stuck for too long. At this point, instead of relying on 
the potentially incorrect `maxTs`, it will forcibly advance the watermark to 
`now() - DEFAULT_WATERMARK_DELAY_MS`. This will ensure the processing pipeline 
moves forward, even when compromised by future-dated events.
   
   Thank you again for raising this. Your insight is invaluable for helping me 
improve the robustness and reliability of my code. I will prioritize 
implementing this fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to