[ 
https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140258#comment-17140258
 ] 

Ayush Saxena edited comment on HDFS-15419 at 6/19/20, 6:48 AM:
---------------------------------------------------------------

The present code has failover because the router maintains the active/standby 
state of the namenodes: if a namenode's role changes from what is stored in the 
Router, the router fails over and updates the state. In that sense the present 
code seems OK and removing it isn't required; if we remove it, then when a 
failover happens the router will keep rejecting calls based on the old states 
in its cache until the heartbeat updates them. The present retry logic just 
ensures that if there is an active namenode, it gets the call. If the router 
can't find one, it doesn't hold the call; the client can then decide whether 
to retry or not. 
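
To illustrate, here is a minimal sketch of that behaviour (all class and method 
names are made up for illustration; this is not the actual RouterRpcClient 
code): the router tries the namenode it has cached as active and, on a 
StandbyException, fails over to the next one and refreshes its cached roles 
instead of waiting for the next heartbeat.

{code:java}
// Minimal sketch of the failover behaviour described above; illustrative
// names only, not the actual RouterRpcClient code.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.ipc.StandbyException;

public class RouterFailoverSketch {

  /** Cached view of a namenode, as maintained from heartbeats. */
  static class CachedNamenode {
    final String address;
    boolean active; // role as last reported by the heartbeat

    CachedNamenode(String address, boolean active) {
      this.address = address;
      this.active = active;
    }
  }

  interface NamenodeCall<T> {
    T invoke(CachedNamenode nn) throws IOException;
  }

  /**
   * Try the namenodes in priority order (the one cached as active is assumed
   * to come first); if one turns out to be standby, fail over to the next and
   * update the cached roles instead of rejecting calls until the heartbeat
   * refreshes them.
   */
  static <T> T invokeWithFailover(List<CachedNamenode> namenodes,
      NamenodeCall<T> call) throws IOException {
    IOException lastFailure = null;
    for (CachedNamenode nn : namenodes) {
      try {
        T result = call.invoke(nn);
        markActive(namenodes, nn); // the call succeeded, so this one is active
        return result;
      } catch (StandbyException e) {
        nn.active = false; // cached state was stale
        lastFailure = e;
      }
    }
    // No active namenode found: do not hold the call, let the client decide.
    throw lastFailure != null ? lastFailure
        : new IOException("No active namenode found");
  }

  private static void markActive(List<CachedNamenode> namenodes,
      CachedNamenode active) {
    for (CachedNamenode nn : namenodes) {
      nn.active = (nn == active);
    }
  }
}
{code}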

I am not sure, but if, as proposed here, the router does a full retry like a 
normal client, then in the worse cases the actual client may time out. From 
the client's point of view it sent just one call, which is stuck at the 
server; it won't be aware that the router is retrying against different 
namenodes.

 

Well, IIRC logic was even added to the router recently for the purpose of 
retry: among all the exceptions received from the several namespaces, if one 
exception is retriable, only that one gets propagated so that the client can 
retry.
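
A rough sketch of that idea (illustrative names, not the actual RBF code): 
when a fan-out call collects exceptions from several namespaces, a retriable 
one is preferred for propagation so the client can retry.

{code:java}
// Rough sketch only: prefer surfacing a retriable exception among the
// exceptions collected from several namespaces.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.ipc.RetriableException;

public class PreferRetriableExceptionSketch {

  /** Pick the exception to propagate back to the client. */
  static IOException chooseToPropagate(List<IOException> fromNamespaces) {
    for (IOException e : fromNamespaces) {
      if (e instanceof RetriableException) {
        return e; // surfacing the retriable one lets the client retry
      }
    }
    // Otherwise fall back to the first failure received.
    return fromNamespaces.isEmpty()
        ? new IOException("No result from any namespace")
        : fromNamespaces.get(0);
  }
}
{code}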



> RBF: Router should retry communicate with NN when cluster is unavailable 
> using configurable time interval
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15419
>                 URL: https://issues.apache.org/jira/browse/HDFS-15419
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: configuration, hdfs-client, rbf
>            Reporter: bhji123
>            Priority: Major
>
> When the cluster is unavailable, router -> namenode communication retries 
> only once, without any time interval, which is not reasonable.
> For example, in my company, which has several HDFS clusters with more than 
> 1000 nodes, we have encountered this problem. In some cases a cluster 
> becomes unavailable briefly, for about 10 or 30 seconds, and at the same 
> time almost all RPC requests to the router fail because the router retries 
> only once without a time interval.
> It would be better to enhance the router retry strategy so that it retries 
> communication with the NN using a configurable time interval and maximum 
> retry count.
>  
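
One possible shape for the proposed change, as a hedged sketch (the 
configuration keys and defaults below are invented for illustration and are 
not existing RBF settings), would be to derive the router -> namenode retry 
policy from a configurable maximum attempt count and sleep interval, for 
example via Hadoop's RetryPolicies:

{code:java}
// Hypothetical sketch of the proposal; the keys and defaults are made up.
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RouterRetryPolicySketch {

  // Hypothetical keys, named only to illustrate the proposal.
  static final String RETRY_MAX_KEY =
      "dfs.federation.router.namenode.retry.max.attempts";
  static final int RETRY_MAX_DEFAULT = 3;
  static final String RETRY_INTERVAL_KEY =
      "dfs.federation.router.namenode.retry.interval.ms";
  static final long RETRY_INTERVAL_DEFAULT = 1000L;

  /** Build a retry policy for router -> namenode calls from configuration. */
  static RetryPolicy buildPolicy(Configuration conf) {
    int maxAttempts = conf.getInt(RETRY_MAX_KEY, RETRY_MAX_DEFAULT);
    long intervalMs = conf.getLong(RETRY_INTERVAL_KEY, RETRY_INTERVAL_DEFAULT);
    // Retry up to maxAttempts times, sleeping intervalMs between attempts.
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        maxAttempts, intervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}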



