[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

> Seems reasonable enough. The only potential concern is that I sort
 > of recall picking the 'default RPC timeout' rather than the
 > 'operation timeout' so that, if the master was actually down, the
 > user would get an error quicker than their timeout, rather than a
 > bunch of retries. If you try to contact a cluster which is just
 > down, does it now hang for the full timeout? Or if we get
 > 'connection refused' from all masters, does it bail out relatively
 > quickly?
 
For more background reading check out commit 2c8aa9e.

I expect it'll hang for the full operation timeout. That's 30s by default, vs. 
10s for the RPC timeout.

 > I suppose an argument could be made either way, but 'fast fail'
 > seems to make sense at least for command-line interactive things if
 > the cluster is actually down (not just a transient restart/failure)

The root problem I'm trying to address is: very short RPC timeouts in 
ClientStressTest_MultiMaster_TestLeaderResolutionTimeout means the client 
doesn't wait long enough for a master leader election to finish when building 
the client. Can you think of a way to address that without hacking up the 
client itself?

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

Seems reasonable enough. The only potential concern is that I sort of recall 
picking the 'default RPC timeout' rather than the 'operation timeout' so that, 
if the master was actually down, the user would get an error quicker than their 
timeout, rather than a bunch of retries. If you try to contact a cluster which 
is just down, does it now hang for the full timeout? Or if we get 'connection 
refused' from all masters, does it bail out relatively quickly?

I suppose an argument could be made either way, but 'fast fail' seems to make 
sense at least for command-line interactive things if the cluster is actually 
down (not just a transient restart/failure)

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2: -Verified

Build Started http://104.196.14.100/job/kudu-gerrit/2609/

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3718

to look at the new patch set (#2).

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..

c++ client: use operation timeout as deadline for finding new leader master

We had been using the default RPC timeout, which may be set to a very low
value as in ClientStressTest_MultiMaster_TestLeaderResolutionTimeout.
Now we'll use the operation timeout as the overall deadline while still
preserving the semantics of using the default RPC timeout for the
GetMasterRegistration() RPCs themselves.

As my patch series removes the guarantee that a leader master is elected at
the time that cluster tests run, it's important that the logic for finding
the leader master provide ample time for an election to finish.

Also, I think I've addressed the root cause behind KUDU-573 by fixing a race
in GetLeaderMasterRpc's SendRpcCb() and GetMasterRegistrationRpcCbForNode()
methods. The race manifests when the last two RPC responses are "I am the
leader" and "I am not the leader" respectively. In one interleaving, both
responses enter SendRpcCb(), and the second calls DelayedRetryCb(). If that
were a call to DelayedRetry() instead, the GetLeaderMasterRpc would be
destroyed by the time the reactor thread reran the RPC.

Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
---
M src/kudu/client/client-internal.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/master/master_rpc.cc
M src/kudu/master/master_rpc.h
M src/kudu/tserver/heartbeater.cc
5 files changed, 63 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/3718/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

Build Started http://104.196.14.100/job/kudu-gerrit/2603/

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-20 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/2598/

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/3718

to review the following change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..

c++ client: use operation timeout as deadline for finding new leader master

We had been using the default RPC timeout, which may be set to a very low
value as in ClientStressTest_MultiMaster_TestLeaderResolutionTimeout.
Now we'll use the operation timeout as the overall deadline while still
preserving the semantics of using the default RPC timeout for the
GetMasterRegistration() RPCs themselves.

As my patch series removes the guarantee that a leader master is elected at
the time that cluster tests run, it's important that the logic for finding
the leader master provide ample time for an election to finish.

Also, I think I've addressed the root cause behind KUDU-573 by fixing a race
in GetLeaderMasterRpc's SendRpcCb() and GetMasterRegistrationRpcCbForNode()
methods. The race manifests when the last two RPC responses are "I am the
leader" and "I am not the leader" respectively. In one interleaving, both
responses enter SendRpcCb(), and the second calls DelayedRetryCb(). If that
were a call to DelayedRetry() instead, the GetLeaderMasterRpc would be
destroyed by the time the reactor thread reran the RPC.

Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
---
M src/kudu/client/client-internal.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/master/master_rpc.cc
M src/kudu/master/master_rpc.h
M src/kudu/tserver/heartbeater.cc
5 files changed, 52 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/3718/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Todd Lipcon