[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891583#comment-16891583
 ] 

He Xiaoqiao edited comment on HDFS-14090 at 7/24/19 4:00 AM:
-------------------------------------------------------------

Thanks [~crh] for your contribution, [^HDFS-14090.006.patch] almost looks good 
to me.
minor comment about permit acquires and releases,
 {{RouterRpcClient#acquirePermit}} & {{RouterRpcClient#releasePermit}} should 
invoke pairs, and it certainly did the same thing. However some logic could not 
process exception correctly, it may lead {{Permit}} could not release as 
expected,
 1. {{RouterRpcClient#invokeSequential}}, when #getNamenodesForNameservice 
throw exception, we could not release permit as expected.
{code:java}
  public <T> T invokeSequential(
      final List<? extends RemoteLocationContext> locations,
      final RemoteMethod remoteMethod, Class<T> expectedResultClass,
      Object expectedResultValue) throws IOException {
    ......
    for (final RemoteLocationContext loc : locations) {
      String ns = loc.getNameserviceId();
      acquirePermit(ns, ugi, m);
      List<? extends FederationNamenodeContext> namenodes =
          getNamenodesForNameservice(ns); // if throw exception, it could never 
release permit anymore.
      try {
      } catch{} finally {
        releasePermit(ns, ugi, m);
      }
    }
    ......
  }
{code}
2. in {{RouterRpcClient#invokeConcurrent}}, the issue could also exist after 
the second time to invoke {{acquirePermit}}.
 One minor suggestion, the whole code segment between {{acquirePermit}} and 
{{releasePermit}} should be encase by {{try{} catch{} finally}} statement to 
ensure that we could release the acquired permit in any case.
Thanks [~crh] again, please let me know if there is something I missed.


was (Author: hexiaoqiao):
Thanks [~crh] for your contribution, [^HDFS-14090.006.patch] almost looks good 
to me.
minor comment about permit acquires and releases,
 {{RouterRpcClient#acquirePermit}} & {{RouterRpcClient#releasePermit}} should 
invoke pairs, and it certainly did the same thing. However some logic could not 
process exception correctly, it may lead {{Permit}} could not release as 
expected,
 1. {{RouterRpcClient#invokeSequential}}, when #getNamenodesForNameservice 
throw exception, we could not release permit as expected.
{code:java}
  public <T> T invokeSequential(
      final List<? extends RemoteLocationContext> locations,
      final RemoteMethod remoteMethod, Class<T> expectedResultClass,
      Object expectedResultValue) throws IOException {
    ......
    for (final RemoteLocationContext loc : locations) {
      String ns = loc.getNameserviceId();
      acquirePermit(ns, ugi, m);
      List<? extends FederationNamenodeContext> namenodes =
          getNamenodesForNameservice(ns);
      try {
      } catch{ finally{
        releasePermit(ns, ugi, m);
      }
    }
    ......
  }
{code}
2. in {{RouterRpcClient#invokeConcurrent}}, the issue could also exist after 
the second time to invoke {{acquirePermit}}.
 One minor suggestion, the whole code segment between {{acquirePermit}} and 
{{releasePermit}} should be encase by {{try{} catch{} finally}} statement to 
ensure that we could release the acquired permit in any case.
Thanks [~crh] again, please let me know if there is something I missed.

> RBF: Improved isolation for downstream name nodes.
> --------------------------------------------------
>
>                 Key: HDFS-14090
>                 URL: https://issues.apache.org/jira/browse/HDFS-14090
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: CR Hota
>            Assignee: CR Hota
>            Priority: Major
>         Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures, should 
> help minimize impact of clients connecting to healthy clusters vs unhealthy 
> clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same with start reflecting on the router. As a result of this, clients 
> connecting to healthy/faster name nodes will also slow down as same rpc queue 
> is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by router to connect to all name nodes.
> Currently router uses one single rpc queue for all calls. Lets discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to