[ https://issues.apache.org/jira/browse/KUDU-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2988:
--------------------------------
    Labels: clock  (was: )

> built-in NTP client: sometimes minichronyd fails to start with address already in use
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-2988
>                 URL: https://issues.apache.org/jira/browse/KUDU-2988
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: clock, ntp-client
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: clock
>
> From time to time some tests that use the built-in NTP client and MiniChronyd 
> fail. The failure usually looks like this:
> {noformat}
> I1013 22:02:28.429188 23364 hybrid_clock.cc:162] waiting up to --ntp_initial_sync_wait_secs=10 seconds for the clock to synchronize
> I1013 22:02:38.430480 23364 builtin_ntp.cc:552] server 127.16.18.212:42817: addresses=127.16.18.212:42817 current_address=127.16.18.212:42817 i_pkt_total_num=0 i_pkt_valid_num=0 o_pkt_total_num=20 o_pkt_timedout_num=14
> is_synchronized=false
> last_mono=0
> last_wall=0
> last_error=0
> now_mono=264459136130
> F1013 22:02:38.430524 23364 master_main.cc:105] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: timed out waiting for clock synchronisation: wallclock is not synchronized: no valid NTP responses yet
> *** Check failure stack trace: ***
> {noformat}
> I've also seen one failure like this:
> {noformat}
> [ RUN      ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0
> 2019-11-01T00:37:31Z chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -SECHASH -IPV6 +DEBUG)
> 2019-11-01T00:37:32Z Disabled control of system clock
> Could not open connection to daemon
> W1101 00:37:32.553658   235 mini_chronyd.cc:189] Time spent starting chronyd: real 1.110s     user 0.000s     sys 0.011s
> 2019-11-01T00:37:32Z chronyd exiting
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/external_mini_cluster-itest-base.cc:66: Failure
> Failed
> Bad status: Timed out: failed to start NTP server 0: failed to contact chronyd in 1.000s
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/ts_recovery-itest.cc:428: Failure
> Expected: StartClusterOneTs({ "--fault_crash_during_log_replay=0.05" }) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> I1101 00:37:32.564697   235 external_mini_cluster-itest-base.cc:80] Found fatal failure
> I1101 00:37:32.566167   235 test_util.cc:136] -----------------------------------------------
> I1101 00:37:32.566222   235 test_util.cc:137] Had fatal failures, leaving test files at /tmp/dist-test-taskBEeRDj/test-tmp/ts_recovery-itest.0.BlockManagerType_TsRecoveryITest.TestCrashDuringLogReplay_0.1572568627695135-235
> [  FAILED  ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0, where GetParam() = "file" (1125 ms)
> {noformat}
> In the first case it's pretty odd that, even though the chronyd launched by MiniChronyd failed to bind its NTP socket, we still managed to convince ourselves that it was up and running. My working theory is that chronyd doesn't exit when this happens and keeps listening on the UNIX domain socket for our "server stats" query, which then convinces us that chronyd is alive and well.
> Anyway, this is the cause of some test flakiness, so we should look into it.
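> If that theory holds, one way to catch the problem earlier might be to have the health check probe the NTP UDP port directly, in addition to querying the command socket: in the first log above the client sent 20 requests to 127.16.18.212:42817 and got zero valid replies, so a single probe at startup would likely have flagged it. Below is a minimal, hypothetical sketch of such a probe using plain BSD sockets; it is not Kudu's actual MiniChronyd or built-in client code, and the function name, address, and port are illustrative only.
> {noformat}
> // Hypothetical liveness probe: send one NTP client-mode request to the
> // daemon's UDP port and see whether anything NTP-shaped comes back. A chronyd
> // that lost the bind race would not answer here even though its UNIX domain
> // command socket still does.
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <sys/socket.h>
> #include <sys/time.h>
> #include <unistd.h>
>
> #include <cstdint>
> #include <cstring>
> #include <iostream>
>
> bool ProbeNtpPort(const char* addr, uint16_t port, const timeval& timeout) {
>   int fd = socket(AF_INET, SOCK_DGRAM, 0);
>   if (fd < 0) return false;
>   setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
>
>   sockaddr_in sa;
>   memset(&sa, 0, sizeof(sa));
>   sa.sin_family = AF_INET;
>   sa.sin_port = htons(port);
>   if (inet_pton(AF_INET, addr, &sa.sin_addr) != 1) {
>     close(fd);
>     return false;
>   }
>
>   // Minimal 48-byte NTP request: LI=0, VN=4, Mode=3 (client), rest zeroed.
>   uint8_t pkt[48] = {0};
>   pkt[0] = 0x23;
>
>   bool got_reply = false;
>   if (sendto(fd, pkt, sizeof(pkt), 0,
>              reinterpret_cast<const sockaddr*>(&sa), sizeof(sa)) ==
>       static_cast<ssize_t>(sizeof(pkt))) {
>     uint8_t resp[48];
>     // For a liveness check, any 48+ byte datagram back is good enough.
>     got_reply = recv(fd, resp, sizeof(resp), 0) >= 48;
>   }
>   close(fd);
>   return got_reply;
> }
>
> int main() {
>   // Address and port taken from the first log above; purely illustrative.
>   timeval timeout = {1, 0};  // 1 second
>   std::cout << (ProbeNtpPort("127.16.18.212", 42817, timeout)
>                     ? "got an NTP reply"
>                     : "no NTP reply (port dead or not an NTP server)")
>             << std::endl;
>   return 0;
> }
> {noformat}
> A probe like this could of course time out for other reasons too, so it's only a sketch of the idea, not a drop-in fix.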



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
