[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-18 Thread Alexey Serbin (Code Review)
Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6924

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
RaftConsensusITest.TestClientFailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/rpc/retriable_rpc.h
2 files changed, 52 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-18 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 420: // Test that a client fails over when the leader times out.
This is the original location for the test as in the patch attached to 
KUDU-1034.  Probably, the better place for the test is somewhere in 
client_failover-itest.cc ?


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-18 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS1, Line 450:   for (int i = 0; i < 1000; i++) {
 : if (workload.rows_inserted() >= rows_target) break;
 : SleepFor(MonoDelta::FromMilliseconds(10));
 :   }
It seems this test does not provide the necessary coverage: the Workload does 
some cycling internally, and that's not exactly what we need to test the retry 
in case of a timeout.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-19 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 420: // Test that a client fails over when the leader times out.
> This is the original location for the test as in the patch attached to KUDU
After some consideration I think it's also a good place for the test since it 
verifies the consistency overall client-->tserver communications when the 
client retries the operation against other tablet replica.


PS1, Line 450:   for (int i = 0; i < 1000; i++) {
 : if (workload.rows_inserted() >= rows_target) break;
 : SleepFor(MonoDelta::FromMilliseconds(10));
 :   }
> It seems this test does not provide the necessary coverage: the Workload do
I take it back: after some consideration I think that in the scope of KUDU-1034 
this is the exact test what is needed.  The client should fail in case if RPC 
times out, and it's the responsibility of the upper-level code to retry an 
operation in that case.  Otherwise, if client would retry on RPC timeout, that 
could lead to retrying indefinitely.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-20 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 420: // Test that a client fails over when the leader times out.
> After some consideration I think it's also a good place for the test since 
I think client_failover-itest.cc is a better home for this test, since it's a 
good fit with what we're testing (this is testing a client failover scenario) 
and uses basically the same infrastructure as this test (ExternalMiniCluster).


PS1, Line 450:   for (int i = 0; i < 1000; i++) {
 : if (workload.rows_inserted() >= rows_target) break;
 : SleepFor(MonoDelta::FromMilliseconds(10));
 :   }
> I take it back: after some consideration I think that in the scope of KUDU-
The client should respect KuduSession.setTimeout() and I believe that is the 
timeout value that is triggering here (albeit triggering in the resulting RPC 
call), so unless we want to shorten the RPC timeout to be less than the session 
timeout (probably not a good idea without careful consideration) then the 
client has no choice but to respond to the user application with a timeout 
error.


http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 217: server_picker_->MarkServerFailed(server, result.status);
OK now I remember the issue I had when looking at this issue previously. I the 
problem with using MarkServerFailed() here is that ALL replicas on the server 
are considered failed, not just this one. The client needs a concept of only a 
single replica failed and IIRC that's not expressible in the client failover 
codebase today.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-21 Thread Alexey Serbin (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6924

to look at the new patch set (#2).

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
RaftConsensusITest.TestClientFailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
3 files changed, 58 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS1, Line 450:   for (int i = 0; i < 1000; i++) {
 : if (workload.rows_inserted() >= rows_target) break;
 : SleepFor(MonoDelta::FromMilliseconds(10));
 :   }
> The client should respect KuduSession.setTimeout() and I believe that is th
One future thing we can consider is actually triggering a second attempt on a 
different replica before we've timed out on the first replica. Given we now 
have the idempotent request tracker thing in place, it may be safe to do so (if 
in fact both RPCs go through, they'll still execute exactly once).

This is especially true on the metadata read requests from the master which are 
inherently idempotent (being read-only).

That said, we don't have the right machinery in place to do it today, just 
wanted to mention it so it's in the back of your minds :)


http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 217: server_picker_->MarkServerFailed(server, result.status);
> OK now I remember the issue I had when looking at this issue previously. I 
In the case of a timeout, that still might be a good idea, no? It's not 100% 
clear.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-22 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 217: server_picker_->MarkServerFailed(server, result.status);
> In the case of a timeout, that still might be a good idea, no? It's not 100
I think timeout is considered as non-retriable error.  Given current behavior 
of the client (refreshing its meta-cache in case of absence of usable server 
for an RPC), marking a timed-out server would do no harm.


Line 217: server_picker_->MarkServerFailed(server, result.status);
> OK now I remember the issue I had when looking at this issue previously. I 
As I understand, regardless of whether the client aware of particular replica 
or not, marking a server failed will make the client to switch to a different 
tablet server in the scope of this particular RPC.  Also, most of the errors in 
this context (except, may be, REPLICA_NOT_LEADER) have semantics of 'mark the 
whole server out': connection timeout, server too busy, invalid authn token.

Yes, all the tablets will be marked as non-accessible when using 
MarkServerFailed(), but the client will refresh its meta-cache if it ended up 
with no active replica, right?

E.g., take a look at handling REPLICA_NOT_LEADER error code.  It might happen 
that by the time client marks the server as failed it becomes the leader.  So, 
in that case the client will end up in having no servers in its metacache, and 
will refresh it to get a new leader.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-23 Thread Alexey Serbin (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6924

to look at the new patch set (#4).

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
(ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
3 files changed, 80 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

PS4, Line 252: tserver::TabletServerIntegrationTestBase
why not ExternalMiniClusterITestBase?

I'm trying to standardize on that one for ExternalMiniCluster tests


http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 167:   // A note regarding ServerPicker::MarkServerFailed() calls in this 
method:
I know I mentioned adding a comment but I was thinking just a 2- or 3-line 
comment in the header file for the declaration of MarkServerFailed.


PS4, Line 195: MarkServerFailed
Should we rename this method to MarkReplicaFailed for clarity?


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

PS4, Line 252: tserver::TabletServerIntegrationTestBase
> why not ExternalMiniClusterITestBase?
Because it's using GetLeaderReplicaWithRetries() which currently exists only in 
TabletServerIntegrationTestBase

Do you think it would be a good idea to move that GetLeaderReplicaWithRetries 
and all of its dependencies into cluster_itest_util.h or around?


http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 167:   // A note regarding ServerPicker::MarkServerFailed() calls in this 
method:
> I know I mentioned adding a comment but I was thinking just a 2- or 3-line 
The MarkServerFailed signature in ServerPicker assumes it can mark the whole 
server as failed.  It's just a detail of current implementation of that method 
in MetaCacheServerPicker.

Do you mean we want to have this comment not here but in MetaCacheServerPicker?


PS4, Line 195: MarkServerFailed
> Should we rename this method to MarkReplicaFailed for clarity?
I'm not sure about this: from the semantics of the error code 
SERVER_NOT_ACCESSIBLE you would assume it makes sense to mark the whole server 
failed.  As I understand it, teplica is not relevant in this context.

The same for the call of the same method in case of TimedOut status returned 
along with NON_RERTRIABLE_ERROR


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6924

to look at the new patch set (#5).

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
(ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/client/meta_cache.h
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
4 files changed, 77 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

PS4, Line 252: tserver::TabletServerIntegrationTestBase
> Because it's using GetLeaderReplicaWithRetries() which currently exists onl
does FindTabletLeader() do what you need?


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 167:   // A note regarding ServerPicker::MarkServerFailed() calls in this 
method:
> The MarkServerFailed signature in ServerPicker assumes it can mark the whol
Isn't that just an abstract base class with no other implementations? Can we 
just redefine the semantics?


PS4, Line 195: MarkServerFailed
> I'm not sure about this: from the semantics of the error code SERVER_NOT_AC
Maybe the confusion we've had about this is that the abstraction here is too 
general for the problem domain. Maybe it's fine for the server picker to 
conceptually know about replicas.

If you disagree then we can leave it as-is, just trying to think of ways to 
make it more understandable.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

PS4, Line 252: tserver::TabletServerIntegrationTestBase
> does FindTabletLeader() do what you need?
Excellent -- it looks like exactly what I need.  Thanks!


http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 167:   // A note regarding ServerPicker::MarkServerFailed() calls in this 
method:
> Isn't that just an abstract base class with no other implementations? Can w
Probably, yes.

However, I would prefer to do that in a separate changelist and involve more 
people in discussing that.  My concern is that we might be missing something 
looking only at this context.


PS4, Line 195: MarkServerFailed
> Maybe the confusion we've had about this is that the abstraction here is to
I think we can address that separately -- my current priority to have the bug 
fixed.  By my experience, discussing such things usually leads to many 
iterations with lot of bike-shedding, unfortunately.

What do you think if I file a separate JIRA item to track this and we leave 
this as-is in the current patchset?


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

PS4, Line 195: MarkServerFailed
> I think we can address that separately -- my current priority to have the b
+1 thats ok


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

PS4, Line 195: MarkServerFailed
> +1 thats ok
https://issues.apache.org/jira/browse/KUDU-2024


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6924

to look at the new patch set (#6).

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
(ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/client/meta_cache.h
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
4 files changed, 128 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 6: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6924/6/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

Line 33: #include "kudu/integration-tests/ts_itest-base.h"
no longer used? and the one below


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Hello Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6924

to look at the new patch set (#7).

Change subject: KUDU-1034 client does not failover due to timeout
..

KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
(ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
---
M src/kudu/client/meta_cache.h
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
4 files changed, 125 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/6/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

Line 33: #include "kudu/integration-tests/ts_itest-base.h"
> no longer used? and the one below
Right, this one is not used.  But the line below is needed for 
tserver::ListTabletsResponsePB::StatusAndSchemaPB.  I'll update it, thanks!


Line 288:   SleepFor(MonoDelta::FromMilliseconds(10 * (i + 1)));
> warning: either cast from 'int' to 'int64_t' (aka 'long') is ineffective, o
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-24 Thread Alexey Serbin (Code Review)
Alexey Serbin has submitted this change and it was merged.

Change subject: KUDU-1034 client does not failover due to timeout
..


KUDU-1034 client does not failover due to timeout

This patch fixes the issue described by KUDU-1034: the client does not
mark the failed tablet server as 'failed' in case of timeout and
continues to use it over and over again to send further requests,
even if other tablet replicas might be available.

Besides the actual fix, this patch incorporates an integration test
(ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike.

Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Reviewed-on: http://gerrit.cloudera.org:8080/6924
Reviewed-by: Mike Percy 
Tested-by: Kudu Jenkins
---
M src/kudu/client/meta_cache.h
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/rpc/retriable_rpc.h
4 files changed, 125 insertions(+), 3 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon