[ https://issues.apache.org/jira/browse/HDFS-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379889#comment-15379889 ]

James Clampffer commented on HDFS-10441:
----------------------------------------

bq.  In RpcConnectionImpl<NextLayer>::OnRecvCompleted, if we detect that we've 
connected to the standby, it falls through to StartReading(). Should it bail 
out at that point?
I tried this, but if we bail out here we need something else to get the RPC 
loop running again.  Falling through seemed like a relatively simple way of 
solving that rather than adding a special case.  I could be missing something 
here; I had a sketch of the RPC code modeled as a state machine, but it's 
possible that it is out of date now.
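
Roughly the shape I have in mind, as a heavily simplified sketch: 
OnRecvCompleted and StartReading come from the patch, but the surrounding 
types and helpers here are placeholders rather than the real 
RpcConnectionImpl<NextLayer>.

{code}
// Simplified sketch only: stand-in types, not the real RpcConnectionImpl.
#include <iostream>
#include <system_error>

struct Response { bool from_standby = false; };

class RpcConnectionSketch {
 public:
  // Even when the reply indicates we hit the standby NN we fall through to
  // StartReading() so the read loop keeps running; bailing out here would
  // need some other mechanism to restart the RPC loop.
  void OnRecvCompleted(const std::error_code &ec, const Response &resp) {
    if (ec) {
      HandleCommsError(ec);
      return;
    }
    if (resp.from_standby) {
      // Let the retry/failover policy deal with it, but don't return; the
      // connection still owns the read loop until the engine tears it down.
      NotifyStandby();
    }
    StartReading();
  }

 private:
  void HandleCommsError(const std::error_code &ec) {
    std::cout << "comms error: " << ec.message() << "\n";
  }
  void NotifyStandby() { std::cout << "hit standby, schedule failover\n"; }
  void StartReading() { std::cout << "issue next async read\n"; }
};

int main() {
  RpcConnectionSketch conn;
  conn.OnRecvCompleted(std::error_code(), Response{true});
  return 0;
}
{code}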

bq. In RpcEngine::RpcCommsError, we call pendingRequests[i]->IncrementFailoverCount(); 
should that implicitly reset the retry count to 0? Will we get into cases where 
it retries until it fails, then the retry count is already == max_retry?
Nice catch.  I had the failover case covered:

{code}
head_action = RetryAction::failover(std::max(0,options_.rpc_retry_delay_ms));
{code}
I'll move that into IncrementRetryCount to cover all cases.
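
Concretely, the fix amounts to something like the following sketch: reset the 
retry count whenever the failover count is bumped so a request that already 
burned its retries against the old active NN gets a fresh budget on the new 
one.  IncrementFailoverCount/IncrementRetryCount are the methods discussed 
above; the rest of the class is illustrative, not the patch's actual state.

{code}
// Hypothetical sketch of the retry-reset-on-failover behavior.
#include <cassert>

class RetryStateSketch {
 public:
  explicit RetryStateSketch(int max_retry) : max_retry_(max_retry) {}

  void IncrementFailoverCount() {
    ++failover_count_;
    retry_count_ = 0;  // fresh retry budget against the new namenode
  }

  void IncrementRetryCount() { ++retry_count_; }
  bool RetriesExhausted() const { return retry_count_ >= max_retry_; }

 private:
  int retry_count_ = 0;
  int failover_count_ = 0;
  int max_retry_;
};

int main() {
  RetryStateSketch state(3);
  state.IncrementRetryCount();
  state.IncrementRetryCount();
  state.IncrementRetryCount();
  assert(state.RetriesExhausted());
  state.IncrementFailoverCount();   // failing over resets the retry count
  assert(!state.RetriesExhausted());
  return 0;
}
{code}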

bq. If a namenode is down when we try to resolve, we don't try again when it's 
time to fail over, do we? We should capture that in another bug
We do, at the bottom of HANamenodeTracker::GetFailoverAndUpdate when we call 
ResolveInPlace.  The idea is that the endpoint vector will be empty either 
because it's unset or because it was explicitly cleared when resolution failed, 
so we can just do "if empty, ResolveInPlace".

---- future discussion
bq. In FixedDelayWithFailover::ShouldRetry(), should we fail over on any 
errors other than timeout? Bad route to host? DNS failure?
I have to check the Java code to be 100% sure.  Based on the user 
configuration options it looked like timeout was the main one that needed to 
be accounted for.  Bad route to host should probably fall under that rule as 
well since it doesn't seem like it can be recovered from.  With a DNS failure 
we might be out of luck in general, but it might be worth propagating that 
back to the user.  This is just the quick failover path; everything will end 
up failing over eventually, but will retry first.
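
Rough sketch of that classification, with the caveat that the error-code 
mapping here is my assumption rather than what the patch or the Java client 
actually does today; only ShouldRetry is a real name.

{code}
// Illustrative only: timeouts (and arguably no-route-to-host) take the quick
// failover path, everything else retries first and fails over once retries
// are exhausted.
#include <iostream>
#include <system_error>

enum class RetryDecision { RETRY, FAILOVER_AND_RETRY, FAIL };

RetryDecision ShouldRetrySketch(const std::error_code &ec,
                                int retries, int max_retry) {
  if (ec == std::errc::timed_out || ec == std::errc::host_unreachable) {
    // Likely a dead or partitioned active NN: fail over right away.
    return RetryDecision::FAILOVER_AND_RETRY;
  }
  if (retries < max_retry) {
    return RetryDecision::RETRY;
  }
  // Out of retries against this NN; try the other one.
  return RetryDecision::FAILOVER_AND_RETRY;
}

int main() {
  auto d = ShouldRetrySketch(std::make_error_code(std::errc::timed_out), 0, 3);
  std::cout << (d == RetryDecision::FAILOVER_AND_RETRY ? "failover\n"
                                                       : "retry\n");
  return 0;
}
{code}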

bq. In FixedDelayWithFailover::ShouldRetry(), we're always using a delay if 
retries < 3. This should be configurable. We can cover that in another bug
Oh, I'm not sure how I missed that.  It should be using 
FixedDelayWithFailover::max_retry_.  That said, there's still work to be done 
on the delay logic to get parity with the Java client, notably things like 
exponential backoff.  I left that out until good tests are added because it 
started adding more corner cases that I had to test manually, and because the 
simple workaround is to bump up max failovers.
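
For the record, this is roughly what using max_retry_ instead of the literal 
3 looks like; the fixed delay stands in until the exponential backoff work 
happens, and ShouldDelayAndRetry is an illustrative name rather than the real 
ShouldRetry signature.

{code}
// Sketch: retry bound comes from configuration, not a hard-coded 3.
#include <chrono>
#include <iostream>

class FixedDelayWithFailoverSketch {
 public:
  FixedDelayWithFailoverSketch(int max_retry, std::chrono::milliseconds delay)
      : max_retry_(max_retry), delay_(delay) {}

  // True if the request should wait delay_ and retry rather than fail over.
  bool ShouldDelayAndRetry(int retries_so_far) const {
    return retries_so_far < max_retry_;
  }

  std::chrono::milliseconds delay() const { return delay_; }

 private:
  int max_retry_;                    // configurable, mirrors the rpc retry options
  std::chrono::milliseconds delay_;  // fixed for now; no exponential backoff yet
};

int main() {
  FixedDelayWithFailoverSketch policy(5, std::chrono::milliseconds(100));
  std::cout << std::boolalpha << policy.ShouldDelayAndRetry(4) << "\n";  // true
  std::cout << policy.ShouldDelayAndRetry(5) << "\n";                    // false
  return 0;
}
{code}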





> libhdfs++: HA namenode support
> ------------------------------
>
>                 Key: HDFS-10441
>                 URL: https://issues.apache.org/jira/browse/HDFS-10441
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10441.HDFS-8707.000.patch, 
> HDFS-10441.HDFS-8707.002.patch, HDFS-10441.HDFS-8707.003.patch, 
> HDFS-10441.HDFS-8707.004.patch, HDFS-10441.HDFS-8707.005.patch, 
> HDFS-10441.HDFS-8707.006.patch, HDFS-10441.HDFS-8707.007.patch, 
> HDFS-10441.HDFS-8707.008.patch, HDFS-10441.HDFS-8707.009.patch, 
> HDFS-10441.HDFS-8707.010.patch, HDFS-10441.HDFS-8707.011.patch, 
> HDFS-10441.HDFS-8707.012.patch, HDFS-10441.HDFS-8707.013.patch, 
> HDFS-8707.HDFS-10441.001.patch
>
>
> If a cluster is HA enabled then do proper failover.


