[ https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188442#comment-17188442 ]

Wang, Xinglong commented on HDFS-15553:
---------------------------------------

Thank you for the comment, [~hexiaoqiao]

I think that even in the current code base, we can't guarantee that client B 
will always see the result of client A.

Here are some counterexamples to the scenario you described.

1. FairCallQueue case: if client A is downgraded to a low-priority queue, its 
rpc may be executed after client B's rpc.

2. Unfair lock mode: a later rpc may be executed before an earlier one because 
of the non-fair acquisition mechanism.

3. Network latency: if T1 and T2 are client-side timestamps, the order in which 
client A's rpc and client B's rpc arrive at the server is not guaranteed.

4. Suppose client A and client B are in the same JVM. Usually client A will 
wait until it gets a response from the NN, and only then perform other actions 
such as notifying others. If there is a dependency, a synchronization step 
between client A and client B is needed. If the user just spawns client A first 
and client B later, with no synchronization between them, the outcome can be 
any of many interleavings; we can't say for sure it will have a guaranteed 
result.

5. If client A and client B are in different JVMs, I think it's even harder to 
guarantee a result based only on rpc send time.

 

For the dequeue strategy, I currently implemented a weight-based dynamic 
dequeue strategy: it dynamically decides how many rpc should be dequeued from 
the read call queue and the write call queue, based on the current read queue 
length and write queue length, so that both queues are consumed and neither 
one starves.
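As a rough sketch of the idea (the class name, method name, and weighting 
formula below are all hypothetical, not the actual patch code): the batch size 
taken from the read queue before switching to the write queue can be weighted 
by the read queue's share of the total backlog, while always granting each 
non-empty queue at least one slot so neither side starves.

```java
/**
 * Hypothetical sketch of a weight-based dynamic dequeue policy between
 * a read call queue and a write call queue. Illustration only; not the
 * HDFS-15553 implementation.
 */
public class WeightedDequeueSketch {

  /**
   * Decide how many read rpc to dequeue before switching to the write
   * queue, proportional to the read queue's share of the total backlog.
   * Takes at least one read (if any are waiting) so the read queue can
   * never be starved by a long write backlog, and vice versa.
   */
  static int readBatchSize(int readQueueLen, int writeQueueLen, int maxBatch) {
    if (readQueueLen == 0) {
      return 0;                                   // nothing to dequeue
    }
    if (writeQueueLen == 0) {
      return Math.min(readQueueLen, maxBatch);    // no writes waiting
    }
    // Weight by relative queue length, rounded up.
    int batch = (int) Math.ceil(
        (double) maxBatch * readQueueLen / (readQueueLen + writeQueueLen));
    return Math.max(1, Math.min(batch, readQueueLen));
  }

  public static void main(String[] args) {
    // Read-heavy backlog: most of the batch goes to reads.
    System.out.println(readBatchSize(100, 10, 20));  // prints 19
    // Write-heavy backlog: reads still get at least one slot.
    System.out.println(readBatchSize(2, 200, 20));   // prints 1
  }
}
```

The same formula, applied symmetrically to the write queue, gives the write 
batch size for the next round.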

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> -----------------------------------------------------------
>
>                 Key: HDFS-15553
>                 URL: https://issues.apache.org/jira/browse/HDFS-15553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Wang, Xinglong
>            Priority: Major
>
> *Current*
>  In our production cluster, a typical traffic pattern has a read-to-write 
> ratio of 10:1, and sometimes the ratio goes up to 30:1.
>  NameNode uses ReentrantReadWriteLock under the hood of FSNamesystemLock. 
> The read lock is a shared lock, while the write lock is exclusive.
> Read rpc and write rpc arrive at the namenode in random order. This mixes 
> reads and writes together, so only a small fraction of reads can actually 
> share their read lock.
> Currently we have the default callqueue and faircallqueue, and we can 
> refreshCallQueue on the fly. This leaves room to design a new call queue.
> *Idea*
>  If we reorder the rpc calls in the callqueue to group read rpc together and 
> write rpc together, we gain some control to let a batch of read rpc reach the 
> handlers together and possibly share the same read lock. Thus we can reduce 
> the fragmentation of read locks.
>  This only improves the chance that the batch of read rpc shares the read 
> lock, because some namenode-internal write locks are taken outside the call 
> queue.
> Under ReentrantReadWriteLock, there is a queue that manages the threads 
> asking for the lock. An example:
>  R: stands for a read rpc
>  W: stands for a write rpc
>  e.g.
>  RRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRW
>  In this case, we need 16 lock timeslices.
>  Optimized:
>  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslices.
> *Correctness*
>  Since the execution order of any two concurrent or queued rpc in the 
> namenode is not guaranteed, we can reorder the rpc in the callqueue into a 
> read group and a write group, and then dequeue from these two queues with a 
> designed strategy: say, dequeue 100 read rpc, then 5 write rpc, then reads 
> again, then writes again.
>  Since FairCallQueue also reorders rpc calls in the callqueue, I think the 
> two share the same logic for guaranteeing rpc result correctness.
> *Performance*
>  In a test environment, we see a 15% - 20% NameNode RPC throughput 
> improvement compared with the default callqueue. 
>  Test traffic is 30 read : 3 write : 1 list, using NNLoadGeneratorMR.
> This performance is not a surprise: because some write rpc are not managed in 
> the callqueue, we can't reorder them by reordering calls in the callqueue. 
>  But we could still do a full read/write reorder if we redesigned 
> ReentrantReadWriteLock to achieve this. That will be a further step after 
> this one.
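The timeslice arithmetic in the quoted example can be checked with a tiny 
helper (purely illustrative, not patch code), under the example's assumption 
that each maximal run of consecutive reads shares one read-lock timeslice and 
each write takes one exclusive timeslice:

```java
/**
 * Counts lock "timeslices" for a schedule of read (R) and write (W)
 * rpc, assuming each maximal run of consecutive reads shares one read
 * lock and each write takes one exclusive lock. Illustration only.
 */
public class LockTimesliceCount {

  static int timeslices(String schedule) {
    int slices = 0;
    boolean inReadRun = false;
    for (char c : schedule.toCharArray()) {
      if (c == 'W') {
        slices++;          // every write is its own exclusive slice
        inReadRun = false;
      } else if (!inReadRun) {
        slices++;          // first read of a new shared read slice
        inReadRun = true;
      }
    }
    return slices;
  }

  public static void main(String[] args) {
    // Mixed order from the example: 8 groups of RRRRW.
    System.out.println(
        timeslices("RRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRWRRRRW"));  // prints 16
    // Reordered: 32 reads sharing one slice, then 8 writes.
    System.out.println(
        timeslices("RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRWWWWWWWW"));  // prints 9
  }
}
```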



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
