[ https://issues.apache.org/jira/browse/KUDU-2988 ]
Alexey Serbin updated KUDU-2988:
--------------------------------
    Labels: clock  (was: )

> built-in NTP client: sometimes minichronyd fails to start with address already in use
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-2988
>                 URL: https://issues.apache.org/jira/browse/KUDU-2988
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: clock, ntp-client
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: clock
>
> From time to time, some tests that use the built-in NTP client and MiniChronyd fail. The failure usually looks like this:
> {noformat}
> I1013 22:02:28.429188 23364 hybrid_clock.cc:162] waiting up to --ntp_initial_sync_wait_secs=10 seconds for the clock to synchronize
> I1013 22:02:38.430480 23364 builtin_ntp.cc:552] server 127.16.18.212:42817: addresses=127.16.18.212:42817 current_address=127.16.18.212:42817 i_pkt_total_num=0 i_pkt_valid_num=0 o_pkt_total_num=20 o_pkt_timedout_num=14
> is_synchronized=false
> last_mono=0
> last_wall=0
> last_error=0
> now_mono=264459136130
> F1013 22:02:38.430524 23364 master_main.cc:105] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: timed out waiting for clock synchronisation: wallclock is not synchronized: no valid NTP responses yet
> *** Check failure stack trace: ***
> {noformat}
> I've also seen one failure like this:
> {noformat}
> [ RUN      ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0
> 2019-11-01T00:37:31Z chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -SECHASH -IPV6 +DEBUG)
> 2019-11-01T00:37:32Z Disabled control of system clock
> Could not open connection to daemon
> W1101 00:37:32.553658 235 mini_chronyd.cc:189] Time spent starting chronyd: real 1.110s user 0.000s sys 0.011s
> 2019-11-01T00:37:32Z chronyd exiting
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/external_mini_cluster-itest-base.cc:66: Failure
> Failed
> Bad status: Timed out: failed to start NTP server 0: failed to contact chronyd in 1.000s
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/ts_recovery-itest.cc:428: Failure
> Expected: StartClusterOneTs({ "--fault_crash_during_log_replay=0.05" }) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> I1101 00:37:32.564697 235 external_mini_cluster-itest-base.cc:80] Found fatal failure
> I1101 00:37:32.566167 235 test_util.cc:136] -----------------------------------------------
> I1101 00:37:32.566222 235 test_util.cc:137] Had fatal failures, leaving test files at /tmp/dist-test-taskBEeRDj/test-tmp/ts_recovery-itest.0.BlockManagerType_TsRecoveryITest.TestCrashDuringLogReplay_0.1572568627695135-235
> [  FAILED  ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0, where GetParam() = "file" (1125 ms)
> {noformat}
> In the first case, it's pretty odd that even though MiniChronyd failed to bind its NTP socket, we still managed to convince ourselves that it was up and running. My running theory is that chronyd doesn't exit when this happens: it keeps listening on the UNIX domain socket for our "server stats" query, and a successful reply there convinces us that chronyd is alive and well. A stricter startup check would probe the NTP port itself; see the sketch at the end of this description.
> Anyway, this is a source of test flakiness, so we should look into it.
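> To make that concrete: the following is not actual Kudu code, just a rough standalone sketch of the stricter check, under the assumption that we know the IP and port the mini server was told to serve NTP on. The helper name and signature are made up. Instead of trusting the reply on chronyd's command socket, it sends a real client-mode NTP packet to the UDP port and requires a well-formed server-mode response before declaring chronyd started.
> {code:cpp}
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <sys/socket.h>
> #include <sys/time.h>
> #include <unistd.h>
> #include <cstdint>
> #include <cstring>
> #include <string>
>
> // Sends a minimal SNTP client (mode 3) request to ip:port and waits up to
> // timeout_ms for a reply. Returns true only if a well-formed server
> // (mode 4) response comes back, i.e. something is actually serving NTP
> // on that UDP port.
> bool NtpPortResponds(const std::string& ip, uint16_t port, int timeout_ms) {
>   int fd = socket(AF_INET, SOCK_DGRAM, 0);
>   if (fd < 0) return false;
>
>   // Bound the wait for the reply.
>   timeval tv;
>   tv.tv_sec = timeout_ms / 1000;
>   tv.tv_usec = (timeout_ms % 1000) * 1000;
>   setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
>
>   sockaddr_in addr;
>   memset(&addr, 0, sizeof(addr));
>   addr.sin_family = AF_INET;
>   addr.sin_port = htons(port);
>   if (inet_pton(AF_INET, ip.c_str(), &addr.sin_addr) != 1) {
>     close(fd);
>     return false;
>   }
>
>   // 48-byte NTPv4 packet: LI=0, VN=4, mode=3 (client) in the first byte,
>   // everything else zeroed; that's enough to elicit a response.
>   uint8_t pkt[48] = {0};
>   pkt[0] = (4 << 3) | 3;
>
>   if (sendto(fd, pkt, sizeof(pkt), 0,
>              reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) !=
>       static_cast<ssize_t>(sizeof(pkt))) {
>     close(fd);
>     return false;
>   }
>
>   uint8_t reply[48];
>   ssize_t n = recvfrom(fd, reply, sizeof(reply), 0, nullptr, nullptr);
>   close(fd);
>
>   // A timeout or a short/garbled packet means nothing NTP-like answered.
>   return n >= 48 && (reply[0] & 0x7) == 4;
> }
> {code}
> The reason to probe with a real client packet rather than just checking whether the port is bound: "address already in use" means some other process already holds the port, so a bind-based probe would see the port occupied and report success anyway. This check passes only if whatever holds the port actually answers NTP queries, so the first failure above would surface at MiniChronyd startup instead of ten seconds later in the clock-synchronization wait.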