[
https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140258#comment-17140258
]
Ayush Saxena edited comment on HDFS-15419 at 6/19/20, 6:48 AM:
---------------------------------------------------------------
The present code has failover because the router maintains the active/standby
state of the namenodes. If the namenode roles change and no longer match what is
stored in the Router, the router fails over and updates the state. In that sense
the present code seems OK and removing it isn't required; if we remove it, then
whenever a failover happens the router will keep rejecting calls based on the
stale states in its cache until the heartbeat updates them.
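To illustrate, here is a minimal sketch only, not the actual RouterRpcClient
code; every class and method name in it is made up. The router keeps a cached
role per namenode, tries the one it believes is ACTIVE first, and on a "not
active" answer flips the cached role and fails over right away instead of
waiting for the next heartbeat:
{code:java}
// Minimal sketch only, not the real RouterRpcClient; all names here are made up.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RouterFailoverSketch {

  enum Role { ACTIVE, STANDBY }

  /** Hypothetical stand-in for a "this namenode is not active" RPC response. */
  static class NotActiveException extends Exception { }

  interface RpcCall {
    String run(String namenodeId) throws NotActiveException;
  }

  // Namenode id -> last known role, normally refreshed by the heartbeats.
  private final Map<String, Role> cachedRoles = new LinkedHashMap<>();

  public RouterFailoverSketch(String nn0, String nn1) {
    cachedRoles.put(nn0, Role.ACTIVE);
    cachedRoles.put(nn1, Role.STANDBY);
  }

  /** Try the believed-ACTIVE namenode first; if it answers as standby, update
   *  the cached role and fail over to the other namenode right away. */
  public String invoke(RpcCall call) throws NotActiveException {
    for (String nn : namenodesActiveFirst()) {
      try {
        return call.run(nn);
      } catch (NotActiveException e) {
        cachedRoles.put(nn, Role.STANDBY);  // the roles changed; remember it now
      }
    }
    throw new NotActiveException();         // no active namenode could be found
  }

  private List<String> namenodesActiveFirst() {
    // Order by cached role so ACTIVE is tried before STANDBY failover targets.
    return cachedRoles.entrySet().stream()
        .sorted(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
{code}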
The present retry logic is just to ensure that if there is an active namenode,
it gets the call. If the router can't find one, it doesn't hold the call; the
client can then decide whether to retry or not.
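When no active namenode can be found at all, the failure goes straight back to
the caller. Again just a sketch, building on the hypothetical classes above:
{code:java}
// Sketch only, building on the hypothetical RouterFailoverSketch above.
public class ClientDecidesSketch {
  public static void main(String[] args) {
    RouterFailoverSketch router = new RouterFailoverSketch("nn0", "nn1");
    try {
      // Simulate both namenodes answering as "not active" (no active anywhere).
      router.invoke(nn -> { throw new RouterFailoverSketch.NotActiveException(); });
    } catch (RouterFailoverSketch.NotActiveException e) {
      // The router hands the failure straight back; the client (or its retry
      // policy) decides whether to retry, back off, or give up.
      System.out.println("No active namenode right now; the client decides.");
    }
  }
}
{code}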
I am not sure, but if, as proposed here, the router does a full retry like a
normal client, then in worse situations the actual client may time out. From the
client's point of view it sent just one call and that call is stuck at the
server; it won't be aware that the router is retrying against different
namenodes.
Well, IIRC we even added logic to the router recently for this retry purpose:
amongst all the exceptions received from the several namespaces, if one
exception is retriable, only that one gets propagated, so that the client can
retry.
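Roughly this idea (rough sketch only, not the actual code; RetriableException
here is just a stand-in for whatever the RPC layer actually treats as
retriable):
{code:java}
// Sketch: when a fan-out call collects one failure per namespace, prefer
// propagating a retriable one so the client sees something it may retry.
import java.util.List;

public class PreferRetriableSketch {

  static class RetriableException extends Exception { }

  static Exception pickExceptionToThrow(List<Exception> perNamespaceFailures) {
    for (Exception e : perNamespaceFailures) {
      if (e instanceof RetriableException) {
        return e;                           // the client can retry this one
      }
    }
    return perNamespaceFailures.get(0);     // nothing retriable, surface the first
  }
}
{code}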
> RBF: Router should retry communicating with NN when the cluster is unavailable,
> using a configurable time interval
> ---------------------------------------------------------------------------------------------------------
>
> Key: HDFS-15419
> URL: https://issues.apache.org/jira/browse/HDFS-15419
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: configuration, hdfs-client, rbf
> Reporter: bhji123
> Priority: Major
>
> When the cluster is unavailable, router -> namenode communication retries only
> once, without any time interval, which is not reasonable.
> For example, in my company, which has several HDFS clusters with more than
> 1000 nodes, we have encountered this problem. In some cases a cluster becomes
> unavailable briefly, for about 10 or 30 seconds, and during that time almost
> all RPC requests to the router fail because the router retries only once,
> without any time interval.
> It would be better to enhance the router retry strategy so that it retries
> communication with the NN using a configurable time interval and a configurable
> maximum number of retries.
>
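For reference, the proposal boils down to a bounded retry loop driven by two
settings, something like the sketch below. The property names and defaults are
made up for illustration; they are not existing Hadoop configuration keys.
{code:java}
// Illustrative sketch of the proposed behaviour only; "router.nn.retry.max"
// and "router.nn.retry.interval.ms" are hypothetical names, not real Hadoop keys.
public class ConfigurableRetrySketch {

  static final int MAX_RETRIES =
      Integer.getInteger("router.nn.retry.max", 3);
  static final long RETRY_INTERVAL_MS =
      Long.getLong("router.nn.retry.interval.ms", 1000L);

  interface NamenodeCall<T> {
    T run() throws Exception;
  }

  /** Retry the call up to MAX_RETRIES extra times, sleeping RETRY_INTERVAL_MS
   *  between attempts, and surface the last failure if every attempt fails. */
  static <T> T invokeWithRetries(NamenodeCall<T> call) throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      try {
        return call.run();
      } catch (Exception e) {
        last = e;
        if (attempt < MAX_RETRIES) {
          Thread.sleep(RETRY_INTERVAL_MS);  // configurable wait before retrying
        }
      }
    }
    throw last;
  }
}
{code}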