[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters
[ https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364058#comment-16364058 ]

Yiqun Lin edited comment on HDFS-13119 at 2/14/18 1:51 PM:
-----------------------------------------------------------

Thanks for the review, [~elgoiri].
{quote}Otherwise, we could just do:
{noformat}
if (isClusterUnAvailable(nsId) && retryCount > 0) {
  throw new IOException("No namenode available under nameservice " + nsId, ioe);
}
{noformat}
Then, the default logic takes care of the first retry.
{quote}
Actually, the default logic won't take care of the first retry. Here we use the retry policy {{FailoverOnNetworkExceptionRetry}}: it first jumps into the {{RetryDecision.FAILOVER_AND_RETRY}} logic and throws a {{StandbyException}}. In the failover retry, the retry count is passed as 0 again.

Attaching the new patch to fix some warnings.

> RBF: Manage unavailable clusters
> --------------------------------
>
>                 Key: HDFS-13119
>                 URL: https://issues.apache.org/jira/browse/HDFS-13119
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Íñigo Goiri
>            Assignee: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-13119.001.patch, HDFS-13119.002.patch, HDFS-13119.003.patch
>
>
> When a federated cluster has one of the subclusters down, operations that run in every subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC connections.
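A minimal, self-contained sketch of the retry-count behavior described in the comment above. The class name and the {{isClusterUnAvailable}} stub are illustrative stand-ins, not the actual {{RouterRpcClient}} code; the point is only that a plain {{retryCount > 0}} guard never fires when {{FailoverOnNetworkExceptionRetry}} re-enters the invocation with the count reset to 0.
{noformat}
import java.io.IOException;

public class RetryCountSketch {

  // Hypothetical stand-in for the resolver-based availability check.
  static boolean isClusterUnAvailable(String nsId) {
    return true; // assume the subcluster is known to be down
  }

  static void invokeMethod(String nsId, int retryCount) throws IOException {
    // The proposed guard: it only fails fast once at least one retry happened.
    if (isClusterUnAvailable(nsId) && retryCount > 0) {
      throw new IOException("No namenode available under nameservice " + nsId);
    }
    // ... the RPC attempt would go here; on a network error the policy answers
    // FAILOVER_AND_RETRY, a StandbyException is thrown, and the failover path
    // re-enters this method with retryCount reset to 0.
  }

  public static void main(String[] args) throws IOException {
    invokeMethod("ns0", 0); // first attempt: guard does not fire
    invokeMethod("ns0", 0); // failover re-entry: retryCount is 0 again, guard still silent
  }
}
{noformat}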
[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters
[ https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357389#comment-16357389 ]

Íñigo Goiri edited comment on HDFS-13119 at 2/8/18 6:48 PM:
------------------------------------------------------------

Thanks [~linyiqun] for taking this.
The example I was giving is {{RouterRpcServer#renewLease()}}. This function calls {{rpcClient.invokeConcurrent(nss, method, false, false);}} with {{nss}} being all the subclusters. {{RouterRpcClient#invokeConcurrent()}} goes and spawns a thread in the {{executorService}} for each subcluster, so for an unavailable subcluster we have a thread stuck here (for 200 seconds in our case). We also see a lot of threads from this thread pool, named {{RPC Router Client-XXX}}. We actually have an option to set a timeout, which we use for some UI operations; I'm not sure that is OK for {{renewLease()}}, for example. Does it make sense?

The current problem is that the thread factory in this {{executorService}} has no limit and we should have one (preferably configurable). However, this doesn't fix the real problem, which is checking forever for something we know is down. I think your proposal for avoiding the retries could be the other part of this fix.
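A sketch of the bounded, configurable pool suggested above, using only the JDK. The config key name and default are assumptions for illustration (the real key and its wiring into {{RouterRpcClient}} would come with the patch); only the {{RPC Router Client-XXX}} thread-name pattern is taken from the comment.
{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedRouterClientPool {

  // Hypothetical key and default; the real name would live in the RBF config keys.
  static final String CLIENT_THREADS_KEY = "dfs.federation.router.client.thread-size";
  static final int CLIENT_THREADS_DEFAULT = 32;

  static ThreadPoolExecutor createExecutor(int maxThreads) {
    AtomicInteger threadNum = new AtomicInteger(0);
    ThreadFactory factory = r -> {
      // Keeps the existing "RPC Router Client-XXX" naming scheme.
      Thread t = new Thread(r, "RPC Router Client-" + threadNum.getAndIncrement());
      t.setDaemon(true);
      return t;
    };
    // A bounded number of workers: extra per-subcluster tasks queue up instead
    // of each spawning a new thread that can hang on an unavailable subcluster.
    BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        maxThreads, maxThreads, 60L, TimeUnit.SECONDS, queue, factory);
    executor.allowCoreThreadTimeOut(true); // let idle workers exit
    return executor;
  }

  public static void main(String[] args) {
    ThreadPoolExecutor executor = createExecutor(CLIENT_THREADS_DEFAULT);
    executor.submit(() -> System.out.println("invokeConcurrent task would run here"));
    executor.shutdown();
  }
}
{noformat}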
[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters
[ https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356614#comment-16356614 ]

Yiqun Lin edited comment on HDFS-13119 at 2/8/18 8:05 AM:
----------------------------------------------------------

Just looked into this.
{quote}When a federated cluster has one of the subclusters down, operations that run in every subcluster (RouterRpcClient#invokeAll()) may take all the RPC connections.
{quote}
Looking at the related code, I didn't see the logic that triggers RPC requests to every subcluster once one subcluster is down. I only looked at the method {{RouterRpcClient#invoke}} invoked from {{RouterRpcClient#invokeMethod}}. Correct me if I'm wrong.
{quote}Better control of the number of RPC clients
{quote}
Not so clear on this: do you mean we may want a maximum RPC queue size on the Router RPC server side?

I have a proposal for "No need to try so many times if we 'know' the subcluster is down": when the failure happens, query {{ActiveNamenodeResolver}} to see whether the cluster is down; if yes, don't retry. In addition, the current default retry count (10 times) can be decreased a lot.
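A sketch of that proposal under stated assumptions: the nested interface is a hypothetical minimal view of {{ActiveNamenodeResolver}} (the real interface returns namenode contexts with their registered state, not plain strings), and {{handleFailure}} stands in for wherever the retry decision is made.
{noformat}
import java.io.IOException;
import java.util.List;

public class UnavailableClusterCheck {

  // Hypothetical minimal view of ActiveNamenodeResolver.
  interface NamenodeStateView {
    /** Namenodes registered for the nameservice that are still reachable. */
    List<String> getReachableNamenodes(String nsId);
  }

  static boolean isClusterUnAvailable(NamenodeStateView resolver, String nsId) {
    // If no namenode in the subcluster is reachable, treat the subcluster as down.
    return resolver.getReachableNamenodes(nsId).isEmpty();
  }

  static void handleFailure(NamenodeStateView resolver, String nsId, IOException ioe)
      throws IOException {
    if (isClusterUnAvailable(resolver, nsId)) {
      // Fail fast instead of burning the default 10 retries on a dead cluster.
      throw new IOException("No namenode available under nameservice " + nsId, ioe);
    }
    // ... otherwise fall through to the normal retry policy.
  }
}
{noformat}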