Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Kudu 
Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#4).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from the system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report the clock as being
unsynchronized.  That came along with a huge number of kernel messages,
and other software (such as the SASL library used by SSSD) detected the
strange jump in the local clock as well.  My multiple attempts to
reproduce the issue on a real hardware node, in a Dockerized environment
running on a real hardware server in a datacenter, and on GCE & EC2 VMs
were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.
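
As an illustration of the delta comparison described above, here is a
minimal sketch.  It is not the code from this patch: ClockSample,
TakeSample() and IsWallClockJump() are hypothetical names, std::chrono
clocks stand in for the ntp_adjtime()-based wall clock reading and the
monotonic clock, and treating backward jumps the same as forward ones is
an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical pair of readings captured together (microseconds).
struct ClockSample {
  int64_t wall_us;  // wall clock reading
  int64_t mono_us;  // monotonic clock reading
};

// Capture both clocks back to back.  std::chrono is used here only as a
// portable stand-in for the clock sources used by the real code.
inline ClockSample TakeSample() {
  using namespace std::chrono;
  return ClockSample{
      static_cast<int64_t>(duration_cast<microseconds>(
          system_clock::now().time_since_epoch()).count()),
      static_cast<int64_t>(duration_cast<microseconds>(
          steady_clock::now().time_since_epoch()).count())};
}

// Returns true if the wall clock delta between two samples differs from
// the corresponding monotonic delta by more than 'threshold_s' seconds.
inline bool IsWallClockJump(const ClockSample& prev,
                            const ClockSample& cur,
                            int64_t threshold_s) {
  assert(threshold_s >= 0);
  const int64_t wall_delta_us = cur.wall_us - prev.wall_us;
  const int64_t mono_delta_us = cur.mono_us - prev.mono_us;
  const int64_t diff_us = wall_delta_us - mono_delta_us;
  const int64_t abs_diff_us = diff_us < 0 ? -diff_us : diff_us;
  return abs_diff_us > threshold_s * 1000000LL;
}
```

A caller would keep the previous sample around, take a new one on each
clock read, and react (for example, dump diagnostics and abort) when
IsWallClockJump() returns true.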

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock.  Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check on the majority of nodes out there
since an NTP-synchronized system clock isn't supposed to jump that much
at all.  However, if a problem is observed on some inadequate VMs such
as ones in Azure Cloud, it's now possible to enable the guardrail
to detect such an issue.  If such a jump goes unnoticed, the timestamp
might be persisted with an operation in the WAL and propagated to other
replicas as an orphaned REPLICATE operation.  That leads to crashes
during tablet bootstrapping, and requires manual intervention to
remove the orphaned operations with out-of-whack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 143 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/4
--
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <laiyingc...@apache.org>
Gerrit-Reviewer: Yuqi Du <shenxingwuy...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zcho...@cloudera.com>
