[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908758#comment-16908758 ]
CR Hota edited comment on HDFS-14090 at 8/16/19 5:58 AM:
---------------------------------------------------------
[~xkrogen] [~elgoiri] Many thanks for the detailed reviews. Very helpful :) I have incorporated almost all the points you folks mentioned in 010.patch. At a high level, the changes are:
# "permit" is still the word being used.
# One configuration controls the feature; {{NoFairnessPolicyController}} is a dummy implementation, whereas {{StaticFairnessPolicyController}} is the fairness implementation.
# The whole start-up will fail if fairness class loading has issues. Test cases are changed accordingly to reflect that.
# {{NoPermitAvailableException}} is renamed to {{PermitLimitExceededException}}.

To [~xkrogen]'s observations:
{quote}I was considering the scenario where there are two routers R1 and R2, and two NameNodes N1 and N2. Assume most clients need to access both N1 and N2. What happens in the situation when all of R1's N1-handlers are full (but N2-handlers mostly empty), and all of R2's N2-handlers are full (but N1-handlers mostly empty)? I'm not sure if this is a situation that is likely to arise, or if the system will easily self-heal based on the backoff behavior. Maybe worth thinking about a little--not a blocking concern for me, more of a thought experiment.{quote}
Ideally, it should not happen that all handlers of one router are busy while another router's handlers are completely free, since clients are expected to connect to routers in random order. However, from the beginning the design has focused on letting the system self-heal as much as possible, so that traffic eventually evens out across all routers in a cluster.
{quote}The configuration for this seems like it will be really tricky to get right, particularly knowing how many fan-out handlers to allocate. I imagine as an administrator, my thought process would be like: I want 35% allocated to NN1 and 65% allocated to NN2, since NN2 is about 2x as loaded as NN1. This part is fairly intuitive.
Then I encounter the fan-out configuration... What am I supposed to do with it? Are there perhaps any heuristics we can provide for reasonable values?{quote}
Yes, the configuration values are something users have to pay attention to, especially for concurrent calls. In the documentation sub-Jira HDFS-14558, I plan to write more about concurrent calls and some points for users to focus on. The configurations may also need to be changed by users based on new use cases, load on downstream clusters, etc.

[~aajisaka] [~brahmareddy] [~linyiqun] [~hexiaoqiao] FYI.

> RBF: Improved isolation for downstream name nodes.
> --------------------------------------------------
>
>                 Key: HDFS-14090
>                 URL: https://issues.apache.org/jira/browse/HDFS-14090
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: CR Hota
>            Assignee: CR Hota
>            Priority: Major
>         Attachments: HDFS-14090-HDFS-13891.001.patch, HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
> Router is a gateway to underlying name nodes. A gateway architecture should help minimize the impact of unhealthy clusters on clients connecting to healthy ones.
> For example: if there are two name nodes downstream, and one of them is heavily loaded with calls spiking RPC queue times, back pressure means the same will start reflecting on the router. As a result, clients connecting to healthy/faster name nodes will also slow down, since the same RPC queue is maintained for all calls at the router layer. Essentially, the same IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single RPC queue for all calls. Let's discuss how we can change the architecture and add some throttling logic for unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify the downstream name node, and maintain a separate queue for each underlying name node. Another, simpler way is to maintain some sort of rate limiter configured for each name node and let routers drop/reject/return errors for requests past a certain threshold.
> This won't be a simple change, as the router's 'Server' layer would need redesign and implementation. Currently this layer is the same as the name node's.
> Opening this ticket to discuss, design and implement this feature.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
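As a rough illustration of the permit idea discussed in the comment above, a static fairness controller can be modeled as one semaphore per downstream nameservice, with a non-blocking acquire that fails fast so the client backs off and retries on another router. The class below is a minimal sketch under that assumption only; the class name, method names, and signatures are illustrative and do not reflect the actual code in the RBF patches.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Illustrative permit-based fairness controller; not the actual RBF implementation. */
public class StaticFairnessSketch {

  /** Thrown when a nameservice has no free permits, so the client can back off. */
  static class PermitLimitExceededException extends RuntimeException {
    PermitLimitExceededException(String ns) {
      super("No permit available for nameservice " + ns);
    }
  }

  // One semaphore per downstream nameservice, sized by the configured handler share.
  private final Map<String, Semaphore> permitsPerNs = new ConcurrentHashMap<>();

  /** Dedicate a fixed number of router handlers to each downstream nameservice. */
  public StaticFairnessSketch(Map<String, Integer> handlerAllocation) {
    handlerAllocation.forEach((ns, n) -> permitsPerNs.put(ns, new Semaphore(n)));
  }

  /** Acquire a permit before proxying a call; fail fast instead of queueing. */
  public void acquirePermit(String ns) {
    Semaphore s = permitsPerNs.get(ns);
    if (s == null || !s.tryAcquire()) {
      throw new PermitLimitExceededException(ns);
    }
  }

  /** Release the permit once the downstream call completes (success or failure). */
  public void releasePermit(String ns) {
    Semaphore s = permitsPerNs.get(ns);
    if (s != null) {
      s.release();
    }
  }
}
```

With the 35%/65% split mentioned above (say, 35 permits for NN1 and 65 for NN2 on a 100-handler router), slow calls to NN2 can tie up at most NN2's 65 permits; the handlers serving NN1 remain available, which is the isolation property the ticket is after.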