Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1387. Fix a case where the scanner tight-loops and then sleeps too long
......................................................................


KUDU-1387. Fix a case where the scanner tight-loops and then sleeps too long

This prevents the following issue:

- the leader is down and election has not yet been triggered
- the scanner tries to hit the leader, and gets 'connection refused', and thus
  marks it as down, then goes back to the scanner retry loop
- in the tablet lookup path, RemoteTablet::HasLeader() returns false because
  the leader is known to be down. This causes the client to fetch new locations.
- fetching new locations marks the server as up again. This logic is dubious,
  but will be more complicated to address.
- because the server is now seen as "up" again, we just retry on the same
  server.

The patch fixes the scanner code so that, when a tablet server is down, it is
added to the scan's blacklist in addition to being marked as down client-wide.
This makes the scanner code realize that all eligible servers are blacklisted
and trigger a sleep and backoff before retrying.
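
The retry behavior described above can be sketched roughly as follows. This is
a simplified, standalone model and not the code in scanner-internal.cc: the
names ScanRetryState, PickServer and RunScan are hypothetical, the sleep is
replaced by a counter, and clearing the blacklist after a backoff is an
assumption made so that the loop can retry.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-scan retry state; not Kudu's actual classes.
struct ScanRetryState {
  std::set<std::string> blacklist;  // servers this scan should not retry yet
  int backoffs = 0;                 // how many times the scan backed off
};

// Returns the first replica that is not blacklisted, or "" if none is eligible.
std::string PickServer(const std::vector<std::string>& replicas,
                       const std::set<std::string>& blacklist) {
  for (const auto& r : replicas) {
    if (blacklist.count(r) == 0) return r;
  }
  return "";
}

// Simulates the scanner retry loop. `down` holds servers that refuse
// connections. Returns the number of RPC attempts made.
int RunScan(const std::vector<std::string>& replicas,
            const std::set<std::string>& down,
            int max_backoffs,
            ScanRetryState* state) {
  assert(state != nullptr);
  int rpc_attempts = 0;
  while (state->backoffs < max_backoffs) {
    const std::string server = PickServer(replicas, state->blacklist);
    if (server.empty()) {
      // Every eligible server is blacklisted: back off instead of spinning.
      // Real code would sleep here; this sketch only counts the backoff.
      state->backoffs++;
      state->blacklist.clear();  // assumption: retry the servers afterwards
      continue;
    }
    rpc_attempts++;
    if (down.count(server) == 0) return rpc_attempts;  // scan RPC succeeded
    // Connection refused: blacklist the server for this scan, so a refreshed
    // "up" status elsewhere cannot send us straight back to it.
    state->blacklist.insert(server);
  }
  return rpc_attempts;
}
```

In this model, each RPC to a down server is followed by a backoff, so the
number of RPCs is bounded by the number of backoffs and does not grow
unchecked between sleeps.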

Without this patch, linked_list-test timed out a few percent of the time
in RELEASE builds. With the patch, it passed 200/200 times. I also noticed
that an existing test in client-test was triggering the tight retries, but
didn't have any assertion to detect the problematic number of RPCs.

Change-Id: I3cb3afa81cd6f75756c328b6ffe23a385f4b172d
Reviewed-on: http://gerrit.cloudera.org:8080/2699
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
(cherry picked from commit 563313d15e922db4255736ed1423bb418bbcd6fd)
Reviewed-on: http://gerrit.cloudera.org:8080/2709
Reviewed-by: Jean-Daniel Cryans
---
M src/kudu/client/client-test.cc
M src/kudu/client/scanner-internal.cc
2 files changed, 34 insertions(+), 9 deletions(-)

Approvals:
  Jean-Daniel Cryans: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/2709
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3cb3afa81cd6f75756c328b6ffe23a385f4b172d
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: branch-0.8.x
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
