Hello Will Berkeley, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12647 to look at the new patch set (#2).

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................

[TS heartbeater] avoid reconnecting to master too often

With this patch, the heartbeater thread in tservers no longer resets
its master proxy and reconnects to the master (re-negotiating the
connection) on every heartbeat under certain conditions. In particular,
that happened if the master was accepting connections and responding to
Ping RPC requests, but was not able to process TS heartbeats properly
because it was still bootstrapping.

E.g., when running the RemoteKsckTest.TestClusterWithLocation test
scenario for TSAN builds, I sometimes saw log messages like the
following (the test sets FLAGS_heartbeat_interval_ms = 10):

  I0301 20:29:11.932394 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.944639 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.946904 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.960994 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.964995 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.972220 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.974987 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.988946 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:11.991653 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.003091 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.017015 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.017540 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.031175 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.031175 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.046165 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.059644 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.073026 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.075335 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.077802 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.089138 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.101193 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.102268 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.104634 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.118392 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.132237 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.147235 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.165709 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.171120 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.179481 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
  I0301 20:29:12.191591 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221

It turned out that the counter of consecutive failed heartbeats kept
increasing because the master was responding with ServiceUnavailable to
incoming TS heartbeats.
The prior version of the code reset the master proxy on every failed
heartbeat once FLAGS_heartbeat_max_failures_before_backoff consecutive
errors had occurred, and that was the reason behind the frequent
reconnections to the cluster.

For testing, I verified that the TS heartbeater no longer behaves as
described above under the same scenarios and conditions.

Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 16 insertions(+), 3 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/12647/2
--
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>