Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Kudu Jenkins,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#4).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from the system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error about the
clock being unsynchronized. That came along with a huge number of kernel
messages, and other software (such as the SASL library used by SSSD)
detected the strange jump in the local clock as well.

My multiple attempts to reproduce the issue on a real hardware node, in
a Dockerized environment running on a real hardware server in a
datacenter, and on GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings. The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings. A
jump should manifest itself as a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta. If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock. Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings. Set to 900 (15 minutes) by default.
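
For readers who want the delta comparison spelled out, here is a minimal sketch of the idea, not the code from this patch. The names ClockReading and IsWallClockJump are hypothetical, and the sketch assumes the readings are in microseconds and that a jump in either direction counts. It takes the readings as plain arguments, so the choice between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW is left to the caller.

```cpp
#include <cstdint>

// Hypothetical pair of readings captured back to back (not a Kudu type).
struct ClockReading {
  int64_t wall_us;       // wall clock reading, microseconds since the epoch
  int64_t monotonic_us;  // monotonic clock reading, microseconds
};

// Returns true if the wall clock advanced (or went back) by an amount that
// differs from the monotonic clock's advance by more than 'threshold_s'
// seconds. Treating both directions as a jump is an assumption of this sketch.
bool IsWallClockJump(const ClockReading& prev,
                     const ClockReading& cur,
                     int64_t threshold_s) {
  const int64_t wall_delta_us = cur.wall_us - prev.wall_us;
  const int64_t mono_delta_us = cur.monotonic_us - prev.monotonic_us;
  const int64_t diff_us = wall_delta_us - mono_delta_us;
  const int64_t abs_diff_us = diff_us < 0 ? -diff_us : diff_us;
  return abs_diff_us > threshold_s * 1000000LL;
}
```

With a threshold of 900 seconds, two readings taken one second apart on both clocks are not flagged, while a wall clock that moves decades ahead during the same monotonic second is. What to do on detection (here, dumping diagnostics and aborting) is outside the sketch.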

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check on the majority of nodes out there,
since an NTP-synchronized system clock isn't supposed to jump that much
at all. However, if a problem is observed on some inadequate VMs such as
ones in Azure Cloud, it's now possible to enable the guardrail to detect
such an issue. If such a jump goes unnoticed, the timestamp might be
persisted with an operation in the WAL and propagated to other replicas
as an orphaned REPLICATE operation. That leads to crashes during tablet
bootstrapping, and requires manual intervention to remove the orphaned
operations with out-of-whack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 143 insertions(+), 16 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/4
--
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <laiyingc...@apache.org>
Gerrit-Reviewer: Yuqi Du <shenxingwuy...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zcho...@cloudera.com>