[ 
https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782887#comment-17782887
 ] 

ASF subversion and git services commented on KUDU-3521:
-------------------------------------------------------

Commit 79bb84893745d330b710fb461e5f89a5d22b324e in kudu's branch 
refs/heads/branch-1.16.x from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=79bb84893 ]

[clock] KUDU-3521 fix crash when clock is synchronized by PTPd

There was an earlier attempt to address the issue [1], but the fix
hasn't received +2 since there was not enough evidence behind the
root cause analysis.  With the report on #kudu-general Slack
channel [2], from the analysis of the code [3] it's easy to see there
isn't any other way to get such a manifestation of the issue but
a negative value for the 'maxerror' field of the 'timex' structure
returned by the ntp_adjtime()/adjtimex() system call.

The essence of the problem is that in Kudu the maximum error is supposed
to be a non-negative number.  This patch addresses the issue.

[1] https://gerrit.cloudera.org/#/c/12149/
[2] https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269
[3] 
https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L706

Change-Id: Ibbe1a50c4857b9742d2ffde35440d0dee082edc0
Reviewed-on: http://gerrit.cloudera.org:8080/20626
Tested-by: Kudu Jenkins
Reviewed-by: Yingchun Lai <laiyingc...@apache.org>
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>
(cherry picked from commit 4859d290277bf36f0bd84891c4764194c2cf9521)
  Conflicts:
    src/kudu/clock/system_ntp.cc
Reviewed-on: http://gerrit.cloudera.org:8080/20650


> Kudu servers sometimes crash when host clock is synchronized by PTPd
> --------------------------------------------------------------------
>
>                 Key: KUDU-3521
>                 URL: https://issues.apache.org/jira/browse/KUDU-3521
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>             Fix For: 1.18.0, 1.17.1
>
>
> This issue has been reported on the [\#kudu-general Slack 
> channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
> Kudu server of 1.16.0 version (not sure whether it was {{kudu-master}} or 
> {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() 
> unable to get current timestamp with error bound: Service unavailable: clock 
> error estimate (18446744073709551615us) too high (clock considered 
> synchronized by the kernel)
> {noformat}
> From the analysis of the [code in 
> hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
>  the only case it could happen is when {{t.maxerror}} turned to be a negative 
> number (e.g., -1) in [this 
> code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when 
> using ntpd or chronyd for clock synchronization, but it's necessary to update 
> the code to adapt for such situations: apparently, PTP might set the 
> {{maxerror}} field of the {{timex}} structure to a negative value and then 
> call {{adjtimex()}}.  That's obvious from [the PTPd's 
> code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
>   The essence of the issue is using unsigned integers for clock error in the 
> Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets 
> it to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
> documentation for 
> adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] it's 
> stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this 
> issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
> presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to