[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has uploaded a new change for review. http://gerrit.cloudera.org:8080/6924 Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test RaftConsensusITest.TestClientFailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/rpc/retriable_rpc.h 2 files changed, 52 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/1 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 420: // Test that a client fails over when the leader times out. This is the original location for the test as in the patch attached to KUDU-1034. Probably, the better place for the test is somewhere in client_failover-itest.cc ? -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: PS1, Line 450: for (int i = 0; i < 1000; i++) { : if (workload.rows_inserted() >= rows_target) break; : SleepFor(MonoDelta::FromMilliseconds(10)); : } It seems this test does not provide the necessary coverage: the Workload does some cycling internally, and that's not exactly what we need to test the retry in case of a timeout. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 420: // Test that a client fails over when the leader times out. > This is the original location for the test as in the patch attached to KUDU After some consideration I think it's also a good place for the test since it verifies the consistency overall client-->tserver communications when the client retries the operation against other tablet replica. PS1, Line 450: for (int i = 0; i < 1000; i++) { : if (workload.rows_inserted() >= rows_target) break; : SleepFor(MonoDelta::FromMilliseconds(10)); : } > It seems this test does not provide the necessary coverage: the Workload do I take it back: after some consideration I think that in the scope of KUDU-1034 this is the exact test what is needed. The client should fail in case if RPC times out, and it's the responsibility of the upper-level code to retry an operation in that case. Otherwise, if client would retry on RPC timeout, that could lead to retrying indefinitely. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 420: // Test that a client fails over when the leader times out. > After some consideration I think it's also a good place for the test since I think client_failover-itest.cc is a better home for this test, since it's a good fit with what we're testing (this is testing a client failover scenario) and uses basically the same infrastructure as this test (ExternalMiniCluster). PS1, Line 450: for (int i = 0; i < 1000; i++) { : if (workload.rows_inserted() >= rows_target) break; : SleepFor(MonoDelta::FromMilliseconds(10)); : } > I take it back: after some consideration I think that in the scope of KUDU- The client should respect KuduSession.setTimeout() and I believe that is the timeout value that is triggering here (albeit triggering in the resulting RPC call), so unless we want to shorten the RPC timeout to be less than the session timeout (probably not a good idea without careful consideration) then the client has no choice but to respond to the user application with a timeout error. http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 217: server_picker_->MarkServerFailed(server, result.status); OK now I remember the issue I had when looking at this issue previously. I the problem with using MarkServerFailed() here is that ALL replicas on the server are considered failed, not just this one. The client needs a concept of only a single replica failed and IIRC that's not expressible in the client failover codebase today. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6924 to look at the new patch set (#2). Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test RaftConsensusITest.TestClientFailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 3 files changed, 58 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/2 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Todd Lipcon has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: PS1, Line 450: for (int i = 0; i < 1000; i++) { : if (workload.rows_inserted() >= rows_target) break; : SleepFor(MonoDelta::FromMilliseconds(10)); : } > The client should respect KuduSession.setTimeout() and I believe that is th One future thing we can consider is actually triggering a second attempt on a different replica before we've timed out on the first replica. Given we now have the idempotent request tracker thing in place, it may be safe to do so (if in fact both RPCs go through, they'll still execute exactly once). This is especially true on the metadata read requests from the master which are inherently idempotent (being read-only). That said, we don't have the right machinery in place to do it today, just wanted to mention it so it's in the back of your minds :) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 217: server_picker_->MarkServerFailed(server, result.status); > OK now I remember the issue I had when looking at this issue previously. I In the case of a timeout, that still might be a good idea, no? It's not 100% clear. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 217: server_picker_->MarkServerFailed(server, result.status); > In the case of a timeout, that still might be a good idea, no? It's not 100 I think timeout is considered as non-retriable error. Given current behavior of the client (refreshing its meta-cache in case of absence of usable server for an RPC), marking a timed-out server would do no harm. Line 217: server_picker_->MarkServerFailed(server, result.status); > OK now I remember the issue I had when looking at this issue previously. I As I understand, regardless of whether the client aware of particular replica or not, marking a server failed will make the client to switch to a different tablet server in the scope of this particular RPC. Also, most of the errors in this context (except, may be, REPLICA_NOT_LEADER) have semantics of 'mark the whole server out': connection timeout, server too busy, invalid authn token. Yes, all the tablets will be marked as non-accessible when using MarkServerFailed(), but the client will refresh its meta-cache if it ended up with no active replica, right? E.g., take a look at handling REPLICA_NOT_LEADER error code. It might happen that by the time client marks the server as failed it becomes the leader. So, in that case the client will end up in having no servers in its metacache, and will refresh it to get a new leader. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6924 to look at the new patch set (#4). Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test (ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/integration-tests/client_failover-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 3 files changed, 80 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/4 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: PS4, Line 252: tserver::TabletServerIntegrationTestBase why not ExternalMiniClusterITestBase? I'm trying to standardize on that one for ExternalMiniCluster tests http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 167: // A note regarding ServerPicker::MarkServerFailed() calls in this method: I know I mentioned adding a comment but I was thinking just a 2- or 3-line comment in the header file for the declaration of MarkServerFailed. PS4, Line 195: MarkServerFailed Should we rename this method to MarkReplicaFailed for clarity? -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: PS4, Line 252: tserver::TabletServerIntegrationTestBase > why not ExternalMiniClusterITestBase? Because it's using GetLeaderReplicaWithRetries() which currently exists only in TabletServerIntegrationTestBase Do you think it would be a good idea to move that GetLeaderReplicaWithRetries and all of its dependencies into cluster_itest_util.h or around? http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 167: // A note regarding ServerPicker::MarkServerFailed() calls in this method: > I know I mentioned adding a comment but I was thinking just a 2- or 3-line The MarkServerFailed signature in ServerPicker assumes it can mark the whole server as failed. It's just a detail of current implementation of that method in MetaCacheServerPicker. Do you mean we want to have this comment not here but in MetaCacheServerPicker? PS4, Line 195: MarkServerFailed > Should we rename this method to MarkReplicaFailed for clarity? I'm not sure about this: from the semantics of the error code SERVER_NOT_ACCESSIBLE you would assume it makes sense to mark the whole server failed. As I understand it, teplica is not relevant in this context. The same for the call of the same method in case of TimedOut status returned along with NON_RERTRIABLE_ERROR -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6924 to look at the new patch set (#5). Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test (ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/client/meta_cache.h M src/kudu/integration-tests/client_failover-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 4 files changed, 77 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/5 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: PS4, Line 252: tserver::TabletServerIntegrationTestBase > Because it's using GetLeaderReplicaWithRetries() which currently exists onl does FindTabletLeader() do what you need? -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 167: // A note regarding ServerPicker::MarkServerFailed() calls in this method: > The MarkServerFailed signature in ServerPicker assumes it can mark the whol Isn't that just an abstract base class with no other implementations? Can we just redefine the semantics? PS4, Line 195: MarkServerFailed > I'm not sure about this: from the semantics of the error code SERVER_NOT_AC Maybe the confusion we've had about this is that the abstraction here is too general for the problem domain. Maybe it's fine for the server picker to conceptually know about replicas. If you disagree then we can leave it as-is, just trying to think of ways to make it more understandable. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: PS4, Line 252: tserver::TabletServerIntegrationTestBase > does FindTabletLeader() do what you need? Excellent -- it looks like exactly what I need. Thanks! http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 167: // A note regarding ServerPicker::MarkServerFailed() calls in this method: > Isn't that just an abstract base class with no other implementations? Can w Probably, yes. However, I would prefer to do that in a separate changelist and involve more people in discussing that. My concern is that we might be missing something looking only at this context. PS4, Line 195: MarkServerFailed > Maybe the confusion we've had about this is that the abstraction here is to I think we can address that separately -- my current priority to have the bug fixed. By my experience, discussing such things usually leads to many iterations with lot of bike-shedding, unfortunately. What do you think if I file a separate JIRA item to track this and we leave this as-is in the current patchset? -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: PS4, Line 195: MarkServerFailed > I think we can address that separately -- my current priority to have the b +1 thats ok -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/4/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: PS4, Line 195: MarkServerFailed > +1 thats ok https://issues.apache.org/jira/browse/KUDU-2024 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6924 to look at the new patch set (#6). Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test (ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/client/meta_cache.h M src/kudu/integration-tests/client_failover-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 4 files changed, 128 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/6 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 6: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/6924/6/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: Line 33: #include "kudu/integration-tests/ts_itest-base.h" no longer used? and the one below -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Hello Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6924 to look at the new patch set (#7). Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test (ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 --- M src/kudu/client/meta_cache.h M src/kudu/integration-tests/client_failover-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 4 files changed, 125 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/6924/7 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/6/src/kudu/integration-tests/client_failover-itest.cc File src/kudu/integration-tests/client_failover-itest.cc: Line 33: #include "kudu/integration-tests/ts_itest-base.h" > no longer used? and the one below Right, this one is not used. But the line below is needed for tserver::ListTabletsResponsePB::StatusAndSchemaPB. I'll update it, thanks! Line 288: SleepFor(MonoDelta::FromMilliseconds(10 * (i + 1))); > warning: either cast from 'int' to 'int64_t' (aka 'long') is ineffective, o Done -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1034 client does not failover due to timeout
Mike Percy has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has submitted this change and it was merged. Change subject: KUDU-1034 client does not failover due to timeout .. KUDU-1034 client does not failover due to timeout This patch fixes the issue described by KUDU-1034: the client does not mark the failed tablet server as 'failed' in case of timeout and continues to use it over and over again to send further requests, even if other tablet replicas might be available. Besides the actual fix, this patch incorporates an integration test (ClientFailoverTServerTimeoutITest.FailoverOnLeaderTimeout) written by Mike. Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Reviewed-on: http://gerrit.cloudera.org:8080/6924 Reviewed-by: Mike Percy Tested-by: Kudu Jenkins --- M src/kudu/client/meta_cache.h M src/kudu/integration-tests/client_failover-itest.cc M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/rpc/retriable_rpc.h 4 files changed, 125 insertions(+), 3 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon