[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters

2018-02-14 Thread Yiqun Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364058#comment-16364058
 ] 

Yiqun Lin edited comment on HDFS-13119 at 2/14/18 1:51 PM:
---

Thanks for the review, [~elgoiri].
{quote}Otherwise, we could just do:
{noformat}
if (isClusterUnAvailable(nsId) && retryCount > 0) {
  throw new IOException("No namenode available under nameservice " + nsId, ioe);
}
{noformat}
Then, the default logic takes care of the first retry.
{quote}
Actually the default logic won't take care of the first retry. Here we use the 
retry policy {{FailoverOnNetworkExceptionRetry}}; it will first fall into the 
{{RetryDecision.FAILOVER_AND_RETRY}} branch and throw a {{StandbyException}}. 
In the failover retry, the retry count is passed as 0 again.
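
To illustrate what I mean (simplified flow, only a sketch of the code path):
{noformat}
// Illustration only: why a check guarded by "retryCount > 0" never fires here.
RetryPolicy.RetryAction action =
    retryPolicy.shouldRetry(ioe, retryCount, failovers, false);
if (action.action == RetryPolicy.RetryAction.RetryDecision.FAILOVER_AND_RETRY) {
  // The router throws StandbyException to trigger the failover path...
  throw new StandbyException(ioe.getMessage());
}
// ...and the failover attempt re-enters invokeMethod() with retryCount
// passed as 0 again, so the "retryCount > 0" condition is skipped.
{noformat}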

 

Attaching the new patch to fix some warnings.



> RBF: Manage unavailable clusters
> 
>
> Key: HDFS-13119
> URL: https://issues.apache.org/jira/browse/HDFS-13119
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Yiqun Lin
>Priority: Major
> Attachments: HDFS-13119.001.patch, HDFS-13119.002.patch, 
> HDFS-13119.003.patch
>
>
> When a federated cluster has one of its subclusters down, operations that run 
> in every subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC 
> connections.





[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters

2018-02-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357389#comment-16357389
 ] 

Íñigo Goiri edited comment on HDFS-13119 at 2/8/18 6:48 PM:


Thanks [~linyiqun] for taking this.

The example I was giving is {{RouterRpcServer#renewLease()}}.
This function calls {{rpcClient.invokeConcurrent(nss, method, false, false);}} 
with {{nss}} being all the subclusters.
{{RouterRpcClient#invokeConcurrent()}} goes and spawns a thread in the 
{{executorService}} for each subcluster, so for an unavailable subcluster we have 
a thread stuck here for 200 seconds in our case.
We also end up with a lot of threads from this thread pool, named {{RPC Router 
Client-XXX}}.
We actually have an option to set a timeout, which we use for some UI operations; 
I'm not sure that is OK for {{renewLease()}}, for example.
Does it make sense?

The current problem is that the thread factory in this {{executorService}} has 
no limit and we should have one (preferably configurable).
However, this doesn't fix the real problem, which is checking forever for 
something we know is down.
I think your proposal for avoiding the retries could be the other part of this 
fix.
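
For the executor limit, something along these lines is what I have in mind (just 
a sketch; the config key and default value are placeholders, not an existing 
setting):
{noformat}
// Sketch: bound the pool with a configurable size instead of an unbounded one.
// "dfs.federation.router.client.thread-size" is a hypothetical key here.
int numThreads = conf.getInt("dfs.federation.router.client.thread-size", 32);
ThreadFactory threadFactory = new ThreadFactoryBuilder()
    .setNameFormat("RPC Router Client-%d")
    .build();
this.executorService = Executors.newFixedThreadPool(numThreads, threadFactory);
{noformat}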








[jira] [Comment Edited] (HDFS-13119) RBF: Manage unavailable clusters

2018-02-08 Thread Yiqun Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356614#comment-16356614
 ] 

Yiqun Lin edited comment on HDFS-13119 at 2/8/18 8:05 AM:
--

Just looked into this.
{quote}When a federated cluster has one of its subclusters down, operations that 
run in every subcluster (RouterRpcClient#invokeAll()) may take all the RPC 
connections.
{quote}
Looking at the related code, I didn't see the logic that triggers RPC requests to 
every subcluster once one subcluster is down. I only looked at the method 
{{RouterRpcClient#invoke}} invoked from {{RouterRpcClient#invokeMethod}}. 
Correct me if I am wrong.

{quote}
Better control of the number of RPC clients
{quote}
I'm not so clear on this; do you mean we may have a maximum RPC queue size on 
the Router RPC server side?

I have a proposal for "No need to try so many times if we "know" the subcluster 
is down": when the failure happens, query {{ActiveNamenodeResolver}} to check 
whether the cluster is down; if it is, don't retry. In addition, the current 
default retry count (10 times) can be decreased a lot.
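
A rough sketch of the check I'm thinking of (the helper name and the exact 
resolver calls are only assumptions for illustration):
{noformat}
// Sketch: treat the nameservice as unavailable only when none of its
// registered namenodes is in a usable state.
private boolean isClusterUnAvailable(String nsId) throws IOException {
  List<? extends FederationNamenodeContext> nnState =
      this.namenodeResolver.getNamenodesForNameserviceId(nsId);
  if (nnState != null) {
    for (FederationNamenodeContext nnContext : nnState) {
      // Any reachable namenode means the nameservice is not down.
      if (nnContext.getState() != FederationNamenodeServiceState.UNAVAILABLE) {
        return false;
      }
    }
  }
  return true;
}
{noformat}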


was (Author: linyiqun):
Just looked into this,
{quote}When a federated cluster has one of the subcluster down, operations that 
run in every subcluster (RouterRpcClient#invokeAll()) may take all the RPC 
connections.
{quote}
Looked into the related code, I didn't see the logic for triggering RPC 
requests for every subclustet once one subcluster was down. I just looked the 
method {{RouterRpcClient#invoke}} invoked in {{RouterRpcClient#invokeMethod}}. 
Correct me If I am wrong.

Not so clear for this, would you describe more?
{quote}
Better control of the number of RPC clients
{quote}

I have a proposal for "No need to try so many times if we "know" the subcluster 
is down": When the failed happened, then query from {{ActiveNamenodeResolver}} 
if the cluster is down, if yes, don't do retry. In addition, current default 
retry times (10 times) can be decreased a lot.



