Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................

[TS heartbeater] avoid reconnecting to master too often

With this patch, the heartbeater thread in tservers no longer resets
its master proxy and reconnects to the master (re-negotiating a
connection) on every heartbeat under certain conditions.  In particular,
that used to happen when the master was accepting connections and
responding to Ping RPC requests, but was not able to process TS
heartbeats properly because it was still bootstrapping.

E.g., when running the RemoteKsckTest.TestClusterWithLocation test
scenario in TSAN builds, I sometimes saw log messages like the following
(the test sets FLAGS_heartbeat_interval_ms = 10):

I0301 20:29:11.932394  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.944639  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.946904  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.960994  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.964995  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.972220  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.974987  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.988946  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.991653  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.003091  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017015  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017540  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.046165  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.059644  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.073026  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.075335  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.077802  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.089138  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.101193  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.102268  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.104634  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.118392  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.132237  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.147235  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.165709  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.171120  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.179481  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.191591  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221

It turned out the counter of consecutively failed heartbeats kept
increasing because the master was responding with ServiceUnavailable
to incoming TS heartbeats.  The prior version of the code reset the
master proxy on every failed heartbeat once
FLAGS_heartbeat_max_failures_before_backoff consecutive errors had
occurred, and that was the reason behind the frequent re-connections
to the cluster.
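
The effect of the reset condition can be illustrated with a small
stand-alone sketch.  The diff itself is not part of this message, so
everything below is hypothetical: kMaxFailuresBeforeBackoff stands in
for FLAGS_heartbeat_max_failures_before_backoff, and the
"once at threshold" policy is only one assumed way to stop the repeated
resets, not necessarily what the patch does in heartbeater.cc.

```cpp
// Hypothetical stand-in for FLAGS_heartbeat_max_failures_before_backoff;
// the value 3 is an assumption chosen for illustration only.
constexpr int kMaxFailuresBeforeBackoff = 3;

enum class ResetPolicy {
  // Prior behavior as described above: reset the proxy on every failed
  // heartbeat once the threshold has been reached.
  kEveryFailureAfterThreshold,
  // Assumed alternative: reset only at the moment the counter hits
  // the threshold.
  kOnceAtThreshold,
};

// Returns how many times the master proxy would be reset during a run
// of `num_failures` consecutive failed heartbeats (e.g. the master
// answering ServiceUnavailable while it is still bootstrapping).
int CountProxyResets(ResetPolicy policy, int num_failures) {
  int consecutive_failed_heartbeats = 0;
  int resets = 0;
  for (int i = 0; i < num_failures; ++i) {
    ++consecutive_failed_heartbeats;
    const bool reset =
        policy == ResetPolicy::kEveryFailureAfterThreshold
            ? consecutive_failed_heartbeats >= kMaxFailuresBeforeBackoff
            : consecutive_failed_heartbeats == kMaxFailuresBeforeBackoff;
    if (reset) {
      ++resets;  // The real heartbeater would reconnect to the master here.
    }
  }
  return resets;
}
```

Under these assumptions, ten consecutive failures lead to eight resets
with the first policy and a single reset with the second, which matches
the kind of reconnect storm visible in the log excerpt when the
heartbeat interval is only 10 ms.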

For testing, I just verified that the TS heartbeater no longer behaves
as described above under the same scenarios and conditions.

Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Reviewed-on: http://gerrit.cloudera.org:8080/12647
Tested-by: Kudu Jenkins
Reviewed-by: Will Berkeley <wdberke...@gmail.com>
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 16 insertions(+), 3 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Will Berkeley: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
