[ https://issues.apache.org/jira/browse/HDFS-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853000#comment-17853000 ]
Jian Zhang commented on HDFS-17531:
-----------------------------------

Thank you again for your attention. I will split this large PR into smaller PRs, which you can review in the subtasks, and then close the original PR.

> RBF: Asynchronous router RPC
> ----------------------------
>
>                 Key: HDFS-17531
>                 URL: https://issues.apache.org/jira/browse/HDFS-17531
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Async router single ns performance test.pdf, Aynchronous
> router.pdf, Comparison of Async router & sync router performance.pdf,
> HDFS-17531.001.patch, image-2024-05-19-18-07-51-282.png
>
>
> *Description*
> Currently, the main function of the Router service is to accept client
> requests, forward them to the corresponding downstream ns, and then
> return the results from the downstream ns to the client. The link is as follows:
> *!image-2024-05-19-18-07-51-282.png|width=900,height=300!*
> The main threads involved in the rpc link are:
> {*}Reader{*}: Get the client request and put it into the call queue *(1)*
> {*}Handler{*}:
> Extract a call from the call queue *(2)*, process it, generate a new
> call, place it in the calls of the connection thread, and wait for the call
> processing to complete *(3)*
> After being awakened by the connection thread, process the response and put
> it into the response queue *(5)*
> *Connection:*
> Hold the link with the downstream ns, send the calls from connection.calls to
> the downstream ns (via {*}rpcRequestThread{*}), and obtain the response from the ns.
> Match the response to its call and notify that call that processing is
> complete *(4)*
> *Responder:*
> Retrieve the response from the response queue *(6)* and return it to the client
>
> *Shortcoming*
> Even if the *connection* thread can send more requests to downstream
> nameservices, *(3)* and *(4)* are synchronous: after the *handler*
> thread adds the call to connection.calls, it must wait until the
> *connection* notifies it that the call is complete. Only after the response
> is put into the response queue can the handler take a new call from the call
> queue and process it. Therefore, the concurrency of the router is
> limited by the number of handlers. A simple example: if the
> number of handlers is 1 and the maximum number of calls in the connection
> thread is 10, then even though the connection thread can send 10 requests to
> the downstream ns, the single handler means the router can only process
> one request at a time.
>
> Since the performance of router rpc is mainly limited by the number of
> handlers, the most effective way to improve rpc performance currently is to
> increase the number of handlers. However, creating a large number of
> handler threads also increases the number of thread switches and cannot
> make full use of the machine's performance.
>
> There are usually multiple ns downstream of the router. If a handler
> forwards a request to an ns with poor performance, the
> handler waits for a long time. With fewer handlers available,
> the router's ability to handle requests for ns with normal performance is
> reduced. From the perspective of the client, the performance of the
> router's downstream ns appears to have deteriorated. We often find that
> the call queue of the downstream ns is not high, but the call queue of the
> router is very high.
>
> Therefore, although the main function of the router is to federate and handle
> requests for multiple NSs, the current synchronous RPC performance cannot
> satisfy the scenario where there are many NSs downstream of the router. Even
> if the concurrency of the router can be improved by increasing the
> number of handlers, it is still relatively slow. More threads increase
> the CPU context switching time, and in fact many of the handler threads are
> in a blocked state, which is a waste of thread resources. When a
> request enters the router, there is no guarantee that a handler will be
> available to run it at that time.
>
> Therefore, I propose asynchronous router rpc. Please see the *pdf* attachments
> for the complete solution.
>
> Everyone is welcome to discuss!
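
To make the handler bottleneck described in the issue concrete, here is a minimal Java sketch of the "1 handler, 10 connection calls" example. It is not Hadoop code and not the design from the attached PDFs: the class HandlerBottleneckSketch, the Downstream stand-in, the 50 ms latency, and the use of CompletableFuture are all assumptions made for illustration. It only measures how many downstream calls overlap when the single handler blocks on each call versus when it hands the call off and registers a callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class HandlerBottleneckSketch {

    // Hypothetical stand-in for a downstream ns; records how many calls overlap.
    static final class Downstream {
        final AtomicInteger inFlight = new AtomicInteger();
        final AtomicInteger maxInFlight = new AtomicInteger();

        String call(int id) {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(50); // assumed downstream latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                inFlight.decrementAndGet();
            }
            return "response-" + id;
        }
    }

    // Runs `requests` calls through ONE handler thread and returns the peak
    // number of downstream calls that were in flight at the same time.
    public static int run(boolean async, int requests, int maxCalls) {
        Downstream ns = new Downstream();
        ExecutorService connection = Executors.newFixedThreadPool(maxCalls);
        ExecutorService handler = Executors.newSingleThreadExecutor();
        List<CompletableFuture<String>> results = new ArrayList<>();

        for (int i = 0; i < requests; i++) {
            final int id = i;
            final CompletableFuture<String> result = new CompletableFuture<>();
            results.add(result);
            handler.execute(() -> {
                CompletableFuture<String> f =
                        CompletableFuture.supplyAsync(() -> ns.call(id), connection);
                if (async) {
                    // Handler returns immediately; the response is completed by a callback.
                    f.whenComplete((r, e) -> {
                        if (e != null) {
                            result.completeExceptionally(e);
                        } else {
                            result.complete(r);
                        }
                    });
                } else {
                    // Handler blocks until the downstream call finishes, as in steps (3) and (4).
                    result.complete(f.join());
                }
            });
        }

        for (CompletableFuture<String> r : results) {
            r.join(); // wait for every response
        }
        handler.shutdown();
        connection.shutdown();
        return ns.maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("sync  peak in-flight calls: " + run(false, 10, 10));
        System.out.println("async peak in-flight calls: " + run(true, 10, 10));
    }
}
```

In the blocking mode the peak is bounded by the number of handlers (one here), no matter how many calls the connection pool could carry. In the callback mode the same single handler can keep several downstream calls in flight, which is the effect the issue is aiming for.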