Re: [PR] [OSPP] Apache HertzBeat Log Monitoring Capability [hertzbeat]

via GitHub Fri, 29 Aug 2025 23:15:08 -0700


bigcyy commented on PR #3673:
URL: https://github.com/apache/hertzbeat/pull/3673#issuecomment-3238990609


   > > 5.The entire mechanism currently relies on the normal generation and 
advancement of `Watermark`. The generation method for `newWatermark` is `maxTs 
- DEFAULT_WATERMARK_DELAY_MS`, while `maxTs` indirectly depends on the time of 
newly input `LogEntry`. If events sent by systems with unsynchronized clocks 
have timestamps ahead of schedule, could this cause the watermark to stagnate 
and result in delayed alerts?
   > 
   > You are absolutely right, thank you for pointing this out. This is a 
critical issue that I completely overlooked in my code.
   > 
   > You've correctly identified a serious flaw in my logic. When an event with 
a timestamp far in the future is processed, my current implementation would 
skew the `maxTs`. This, in turn, causes the `newWatermark` (calculated as 
`maxTs - DEFAULT_WATERMARK_DELAY_MS`) to jump far ahead of the actual event 
time. The result is that the watermark would stagnate because normally timed 
events would be incorrectly considered "late" and fail to advance it, leading 
to significant delays in the alerting system.
   > 
   > To fix this, I plan to implement a "stagnation detection and forced 
advancement" mechanism. Here is the solution I will be coding:
   > 
   > 1. Track the Last Watermark Update Time: I will add a new variable, 
`lastWatermarkUpdateTime`, to record the timestamp of the last successful 
watermark update.
   > 2. Check for Stagnation: Within the `TimeService`'s regular watermark 
check, if the newly calculated watermark is the same as the existing one, the 
logic will treat this as a potential stagnation.
   > 3. Measure Stagnation Duration: If no update occurs, I'll add a step to 
calculate the stagnation period by subtracting `lastWatermarkUpdateTime` from 
the current system time.
   > 4. Force a Watermark Update: If this stagnation period exceeds a 
predefined threshold (e.g., `MAX_WATERMARK_STAGNATION_TIME`), my code will 
determine that the watermark has been stuck for too long. At this point, 
instead of relying on the potentially incorrect `maxTs`, it will forcibly 
advance the watermark to `now() - DEFAULT_WATERMARK_DELAY_MS`. This will ensure 
the processing pipeline moves forward, even when compromised by future-dated 
events.
   > 
   > Thank you again for raising this. Your insight is invaluable for helping 
me improve the robustness and reliability of my code. I will prioritize 
implementing this fix.
   
   I've decided not to use the previously mentioned solution because it doesn't 
fix the root problem.
   
   Instead, I will use the local system clock to maintain an acceptable time 
range for logs. This approach ensures that logs with anomalous timestamps 
cannot incorrectly advance the watermark.
   
   Under this new logic, logs with timestamps below the watermark will be 
dropped, while logs above the watermark will only be processed if their 
timestamps are within this reasonable range.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [OSPP] Apache HertzBeat Log Monitoring Capability [hertzbeat]

Reply via email to