Hi @zhangjian <1361320...@qq.com> , the dev branch HDFS-17531 is ready now.
FYI.

Best Regards,
- He Xiaoqiao

On Tue, Jun 4, 2024 at 10:08 PM zhangjian <1361320...@qq.com> wrote:

> Hi, Xiaoqiao He:
>  Can you help create a dev branch?  I don't have the permission to create
> it.
>
> Thank you very much.
> - zhangjian
>
> > On May 30, 2024, at 11:57, Xiaoqiao He <hexiaoq...@apache.org> wrote:
> >
> > Great! It looks like there are no other blockers.
> >
> > @zhangjian <1361320...@qq.com> If there are no further comments, we
> > should go to the next step:
> > a. Create a dev branch for this proposal.
> > b. Split this huge PR into several smaller JIRAs and PRs.
> > c. Involve some folks to review the PRs.
> >
> > Please ping here if you need any help. Thanks again.
> > Good Luck!
> >
> > Best Regards,
> > - He Xiaoqiao
> >
> > On Wed, May 29, 2024 at 11:46 AM Ayush Saxena <ayush...@gmail.com>
> wrote:
> >
> >> Thanx for the details, sounds cool, good luck with the feature!!!
> >>
> >> -Ayush
> >>
> >>> On 29 May 2024, at 8:56 AM, zhangjian <1361320...@qq.com> wrote:
> >>>
> >>> Thank you for taking the time to review this proposal.
> >>> Your comments point out the key issues in designing an asynchronous
> >>> router, and my proposal addresses them:
> >>> 1. My design does not change how the existing synchronous router
> >>> throws standby or retry exceptions, or its other behaviour; the async
> >>> router inherits these implementations.
> >>> 2. Currently, both the asynchronous router and the sync router apply
> >>> backpressure to client requests when they exceed a certain limit (
> >>>  asynchronous router: the handler cannot obtain a semaphore,
> >>>  sync router: handlers block synchronously, so no handler is
> >>>  available
> >>>  )
> >>> and return a standby exception so that the client can retry on other
> >>> routers (RouterRpcFairnessPolicyController mechanism).
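
The semaphore-based backpressure mentioned in point 2 can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the RouterRpcFairnessPolicyController code: the class AdmissionGateSketch, its submit method, and OverloadedException (a stand-in for whatever exception tells the client to retry on another router) are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative only: hypothetical names, not Hadoop Router code. */
final class AdmissionGateSketch {

  /** Hypothetical stand-in for the exception that makes a client retry elsewhere. */
  static final class OverloadedException extends RuntimeException {
    OverloadedException(String message) {
      super(message);
    }
  }

  private final Semaphore permits;

  AdmissionGateSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  /**
   * Admits the call only if a permit is free; otherwise fails fast so the
   * client can retry against another router. The permit is returned when the
   * asynchronous call completes, whether it succeeded or failed.
   */
  <R> CompletableFuture<R> submit(Supplier<CompletableFuture<R>> asyncCall) {
    if (!permits.tryAcquire()) {
      CompletableFuture<R> rejected = new CompletableFuture<>();
      rejected.completeExceptionally(
          new OverloadedException("router is at its in-flight limit"));
      return rejected;
    }
    try {
      return asyncCall.get().whenComplete((result, error) -> permits.release());
    } catch (RuntimeException e) {
      permits.release(); // the call never started, so give the permit back
      throw e;
    }
  }

  int availablePermits() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    AdmissionGateSketch gate = new AdmissionGateSketch(1);
    CompletableFuture<String> slow = new CompletableFuture<>();
    CompletableFuture<String> first = gate.submit(() -> slow);
    CompletableFuture<String> second =
        gate.submit(() -> CompletableFuture.completedFuture("x"));
    System.out.println("second rejected: " + second.isCompletedExceptionally());
    slow.complete("done");
    System.out.println("first result: " + first.join());
  }
}
```

The point of the design is that the handler never blocks while waiting for a permit: a request is either admitted immediately or rejected immediately, which is what lets the client move on to another router.
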
> >>>
> >>> Thank you again!
> >>> zhangjian
> >>>
> >>>> On May 29, 2024, at 07:05, Ayush Saxena <ayush...@gmail.com> wrote:
> >>>> Thanx folks, I had a very quick pass over the PDF and it looks good.
> >>>> I have some doubts about the part that says the Router will retry if
> >>>> a Namenode returns a StandbyException or something along similar
> >>>> lines. I think we have some logic in RouterRpcClient that checks for
> >>>> this case: if it is a StandbyException, it tries the other Namenode,
> >>>> but all other retriable exceptions are returned to the client, which
> >>>> then acts according to its retry policy. I think we should preserve
> >>>> that behaviour, if the intention was to change it.
> >>>> Regarding controlling the concurrency to prevent OOM at the router,
> >>>> maybe we should consider rejecting client requests beyond a certain
> >>>> limit/backlog and returning a relevant retriable exception to the
> >>>> client, so that it can retry on another Router rather than
> >>>> overloading one Router when others are available. Most deployments,
> >>>> I believe, run a considerable number of Routers.
> >>>> Beyond that, I scratched my head for possible scenarios where things
> >>>> can go south, but I think most of the scenarios that came to mind
> >>>> are covered.
> >>>> Nothing blocking from my side, Good Luck!!!
> >>>> -Ayush
> >>>>> On Tue, 28 May 2024 at 21:52, Sangjin Lee <sj...@apache.org> wrote:
> >>>>> Sounds good. Thanks for sharing your findings.
> >>>>> On Sat, May 25, 2024 at 2:24 AM zhangjian <1361320...@qq.com> wrote:
> >>>>>> Hello everyone, I ran a performance comparison between the sync and
> >>>>>> asynchronous routers. The results show that, in both single-ns and
> >>>>>> multi-ns scenarios, the asynchronous router is better than the sync
> >>>>>> router in throughput, CPU and thread utilization, and the average
> >>>>>> processing time of client requests. When the downstream ns have
> >>>>>> performance bottlenecks, the async router performs far better than
> >>>>>> the sync router. In terms of isolation, the asynchronous router is
> >>>>>> also better than the sync router.
> >>>>>> Detailed testing PDF:
> >>>>>> https://issues.apache.org/jira/browse/HDFS-17531  Comparison of Async
> >>>>>> router & sync router performance.pdf
> >>>>>> On May 24, 2024, at 14:13, Yuanbo Liu <liuyuanb...@gmail.com> wrote:
> >>>>>> good job!
> >>>>>> On Fri, May 24, 2024 at 1:57 AM zhangjian <1361320...@qq.com>
> wrote:
> >>>>>>> Hello everyone, I have now tested the performance of the async
> >>>>>>> and sync routers against a downstream ns:
> >>>>>>> 1. The throughput, CPU, and thread performance of the async router
> >>>>>>> are better than those of the sync router, and its memory usage is
> >>>>>>> within an acceptable range compared to the synchronous router.
> >>>>>>> 2. The asynchronous router can push more load downstream to make
> >>>>>>> better use of the downstream ns, and can almost fill the call
> >>>>>>> queue of the downstream ns.
> >>>>>>> The test result PDF is too large to send by email;
> >>>>>>> please see: https://issues.apache.org/jira/browse/HDFS-17531
> >>>>>>>> On May 23, 2024, at 17:03, Xiaoqiao He <hexiaoq...@apache.org> wrote:
> >>>>>>>> Great. Thanks for the additional information.
> >>>>>>>> cc @Ayush Saxena <ayush...@gmail.com> @inigo...@apache.org
> >>>>>>>> <inigo...@apache.org> Any more feedback on this proposal?
> >>>>>>>> IMO the asynchronous router RPC feature is a helpful improvement.
> >>>>>>>> In my internal practice, it will significantly improve the
> >>>>>>>> throughput of request forwarding, and it is very valuable to push
> >>>>>>>> it forward.
> >>>>>>>> Thanks again and good luck!
> >>>>>>>> Best Regards,
> >>>>>>>> - He Xiaoqiao
> >>>>>>>> On Wed, May 22, 2024 at 9:59 AM zhangjian <1361320...@qq.com>
> >> wrote:
> >>>>>>>>> Hi, Sangjin Lee, thank you for your attention. I will use my free
> >>>>>>>>> time to do a performance comparison soon.
> >>>>>>>>>> On May 22, 2024, at 03:42, Sangjin Lee <sj...@apache.org> wrote:
> >>>>>>>>>> Thanks for the great proposal, Zhangjian. On point #3, I suspect
> >>>>>>>>>> it should be fairly straightforward to create a small isolated
> >>>>>>>>>> synthetic test to prove (or disprove) the benefits of this
> >>>>>>>>>> approach. By driving a controlled amount of requests per second,
> >>>>>>>>>> you could see latency, memory, CPU, etc. Ideally, it should show
> >>>>>>>>>> meaningful improvements without much degradation in other
> >>>>>>>>>> metrics. Would you be able to spend some time doing that?
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Sangjin
> >>>>>>>>>> On Tue, May 21, 2024 at 5:13 AM zhangjian
> >> <1361320...@qq.com.invalid>
> >>>>>>>>> wrote:
> >>>>>>>>>>> Hi, Xiaoqiao He, thank you for your reply.
> >>>>>>>>>>> 1. Currently, the server and client protocols within the router
> >>>>>>>>>>> can be implemented by extending the existing protocols and
> >>>>>>>>>>> adding asynchronous functionality, so they will not affect the
> >>>>>>>>>>> existing synchronous protocols:
> >>>>>>>>>>> RouterClientNamenodeProtocolServerSideTranslatorPB
> >>>>>>>>>>> RouterClientProtocolTranslatorPB
> >>>>>>>>>>> RouterGetUserMappingsProtocolServerSideTranslatorPB
> >>>>>>>>>>> RouterGetUserMappingsProtocolTranslatorPB
> >>>>>>>>>>> RouterNamenodeProtocolServerSideTranslatorPB
> >>>>>>>>>>> RouterNamenodeProtocolTranslatorPB
> >>>>>>>>>>> RouterRefreshUserMappingsProtocolServerSideTranslatorPB
> >>>>>>>>>>> RouterRefreshUserMappingsProtocolTranslatorPB
> >>>>>>>>>>> The following issues implemented asynchronous callbacks for
> >>>>>>>>>>> Rpc.server, but I have not found any other modules that use the
> >>>>>>>>>>> related functions:
> >>>>>>>>>>> Server HADOOP-11552 HADOOP-17046
> >>>>>>>>>>> The asynchronous Rpc.client implementation directly uses the
> >>>>>>>>>>> following issue:
> >>>>>>>>>>> Client HADOOP-13226
> >>>>>>>>>>> Therefore, I believe that the asynchronous router's changes to
> >>>>>>>>>>> the RPC protocol, RPC server, and client are safe.
> >>>>>>>>>>> 2. When forwarding a request to multiple downstream ns, the
> >>>>>>>>>>> synchronous router handler adds the requests for the downstream
> >>>>>>>>>>> ns to the thread pool (RouterRpcClient.executorService), and
> >>>>>>>>>>> then waits for responses from all downstream ns before
> >>>>>>>>>>> returning. Since threads in the thread pool also process rpc
> >>>>>>>>>>> requests synchronously, similar to a handler, the number of
> >>>>>>>>>>> threads in the thread pool directly affects the performance of
> >>>>>>>>>>> invokeConcurrent, which in turn affects the performance of the
> >>>>>>>>>>> handler.
> >>>>>>>>>>> In the asynchronous router implementation, the handler calls
> >>>>>>>>>>> invokeConcurrent simply to convert a request into multiple
> >>>>>>>>>>> requests and add them to the async handler thread pool; the
> >>>>>>>>>>> handler can then process the next request in the call queue.
> >>>>>>>>>>> When a connection thread of a downstream ns receives a
> >>>>>>>>>>> response, it hands it over to an async response thread for
> >>>>>>>>>>> processing. The async response thread determines whether all
> >>>>>>>>>>> responses from the downstream ns have been received. If so, it
> >>>>>>>>>>> continues to process the response; otherwise, it moves on to
> >>>>>>>>>>> the next response. The asynchronous router uses
> >>>>>>>>>>> CompletableFuture.allOf() to implement asynchronous
> >>>>>>>>>>> invokeConcurrent, so the handler, async handler, async
> >>>>>>>>>>> response, and connection threads never need to wait
> >>>>>>>>>>> synchronously.
> >>>>>>>>>>> In addition, synchronous routers have drawbacks not only in
> >>>>>>>>>>> multi-ns environments but also with a single downstream ns: it
> >>>>>>>>>>> is often difficult to decide how many handlers to configure
> >>>>>>>>>>> for the router. Too many wastes thread resources, and too few
> >>>>>>>>>>> cannot put enough pressure on the downstream ns. Asynchronous
> >>>>>>>>>>> routers can push requests to downstream ns without having to
> >>>>>>>>>>> tune the number of handlers. Asynchronous routers can also
> >>>>>>>>>>> connect better to more downstream storage services that
> >>>>>>>>>>> support the HDFS protocol, with better scalability.
> >>>>>>>>>>> 3. Since I have not yet deployed asynchronous routers to our
> >>>>>>>>>>> own cluster, there is no performance comparison. Theoretically,
> >>>>>>>>>>> I believe that asynchronous routers will occupy more memory
> >>>>>>>>>>> than synchronous routers, but I do not believe it will be much
> >>>>>>>>>>> more, especially since we can control the maximum number of
> >>>>>>>>>>> requests entering the router, and CompletableFuture is stable
> >>>>>>>>>>> and widely used. In other respects, it should be far superior
> >>>>>>>>>>> to synchronous routers, especially in scenarios with more
> >>>>>>>>>>> downstream ns. If anyone is interested, you are welcome to
> >>>>>>>>>>> help with a performance comparison.
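
The CompletableFuture.allOf() fan-out described in point 2 can be sketched as below. This is a minimal sketch under stated assumptions, not the code from the PR: the class AsyncFanOutSketch, the method fanOut, the asyncCall parameter, and the use of plain strings as namespace ids are all hypothetical, and the downstream call is simulated with supplyAsync.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical names, not the RouterRpcClient code. */
final class AsyncFanOutSketch {

  /**
   * Sends one logical request to every namespace and returns a future that
   * completes when all of them have answered. The calling (handler) thread
   * never blocks; it only wires up the futures. Namespace ids are assumed
   * to be distinct.
   */
  static <R> CompletableFuture<Map<String, R>> fanOut(
      List<String> namespaces,
      Function<String, CompletableFuture<R>> asyncCall) {
    // Start one asynchronous call per downstream namespace.
    Map<String, CompletableFuture<R>> pending = namespaces.stream()
        .collect(Collectors.toMap(ns -> ns, asyncCall));
    // allOf completes once every per-namespace future has completed.
    return CompletableFuture
        .allOf(pending.values().toArray(new CompletableFuture<?>[0]))
        .thenApply(ignored -> pending.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                // Every future is already complete here, so join() does not wait.
                e -> e.getValue().join())));
  }

  public static void main(String[] args) {
    ExecutorService downstream = Executors.newFixedThreadPool(4);
    try {
      Map<String, String> result = fanOut(
          List.of("ns0", "ns1", "ns2"),
          ns -> CompletableFuture.supplyAsync(() -> "ok from " + ns, downstream))
          .join(); // only the demo blocks; a handler would attach a callback
      System.out.println(result);
    } finally {
      downstream.shutdown();
    }
  }
}
```

If any per-namespace future fails, the future returned by allOf completes exceptionally and the merge step is skipped; a real router would need its own policy for partial failures.
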
> >>>>>>>>>>>> On May 21, 2024, at 11:39, Xiaoqiao He <hexiaoq...@apache.org> wrote:
> >>>>>>>>>>>> Thanks for this great proposal!
> >>>>>>>>>>>> Some questions after reviewing the design doc (sorry, I didn't
> >>>>>>>>>>>> review the PR carefully because it is too large):
> >>>>>>>>>>>> 1. This solution involves updating the RPC framework. Will it
> >>>>>>>>>>>> affect other modules, and how do we keep other modules
> >>>>>>>>>>>> isolated from these changes?
> >>>>>>>>>>>> 2. Some RPC requests must be forwarded concurrently to all
> >>>>>>>>>>>> downstream NS. Does this solution cover that case?
> >>>>>>>>>>>> 3. Given that there is an initial implementation, did you
> >>>>>>>>>>>> collect any benchmarks against the current synchronous model
> >>>>>>>>>>>> of DFSRouter?
> >>>>>>>>>>>> Thanks again.
> >>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>> - He Xiaoqiao
> >>>>>>>>>>>> On Tue, May 21, 2024 at 11:21 AM zhangjian
> >> <1361320...@qq.com.invalid>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> Thank you for your positive attitude towards this feature.
> >>>>>>>>>>>>> You can debug the UTs provided in the PR to better understand
> >>>>>>>>>>>>> the current asynchronous calling function.
> >>>>>>>>>>>>>> On May 21, 2024, at 02:04, Simbarashe Dzinamarira <simbadz...@apache.org> wrote:
> >>>>>>>>>>>>>> Excited to see this feature as well. I'll spend more time
> >>>>>>>>> understanding
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> proposal and implementation.
> >>>>>>>>>>>>>> On Mon, May 20, 2024 at 7:55 AM zhangjian
> >> <1361320...@qq.com.invalid
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> Hi, Yuanbo Liu, thank you for your interest in this feature.
> >>>>>>>>>>>>>>> I think the difficulty of an asynchronous router lies not
> >>>>>>>>>>>>>>> only in implementing the asynchronous functions, but also
> >>>>>>>>>>>>>>> in keeping the code readable and reusable, so as to make
> >>>>>>>>>>>>>>> community development easier. At the beginning I also
> >>>>>>>>>>>>>>> planned to use the virtual threads you mentioned. Virtual
> >>>>>>>>>>>>>>> threads can achieve asynchrony elegantly at the code level,
> >>>>>>>>>>>>>>> but the biggest problem is that it is not easy to upgrade
> >>>>>>>>>>>>>>> the JDK version, either in the community or in actual
> >>>>>>>>>>>>>>> production environments. Therefore, I later used
> >>>>>>>>>>>>>>> CompletableFuture, which is supported by JDK 8, to achieve
> >>>>>>>>>>>>>>> asynchrony. The router is stateless, and the router rpc
> >>>>>>>>>>>>>>> process is very clear. Therefore, even if CompletableFuture
> >>>>>>>>>>>>>>> itself is not as readable as virtual threads, a good design
> >>>>>>>>>>>>>>> can make the asynchronous process look very clear.
> >>>>>>>>>>>>>>>> On May 20, 2024, at 10:56, Yuanbo Liu <liuyuanb...@gmail.com> wrote:
> >>>>>>>>>>>>>>>> Nice to see this feature brought up. I tried to implement
> >>>>>>>>>>>>>>>> this feature in our internal clusters, and know that it's
> >>>>>>>>>>>>>>>> a very complicated feature. CC hdfs-dev to bring in more
> >>>>>>>>>>>>>>>> discussion.
> >>>>>>>>>>>>>>>> By the way, I'm not sure whether virtual threads in newer
> >>>>>>>>>>>>>>>> JDK versions would help in this case.
> >>>>>>>>>>>>>>>> On Mon, May 20, 2024 at 10:10 AM zhangjian
> >>>>>>>>> <1361320...@qq.com.invalid
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> Hello everyone, the RPC of the HDFS router currently has
> >>>>>>>>>>>>>>>>> some shortcomings:
> >>>>>>>>>>>>>>>>> The router's handler thread is currently synchronous.
> >>>>>>>>>>>>>>>>> When the *handler* thread adds a call to
> >>>>>>>>>>>>>>>>> connection.calls, it has to wait until the *connection*
> >>>>>>>>>>>>>>>>> notifies it that the call is complete, and only after
> >>>>>>>>>>>>>>>>> the response is put into the response queue can it take
> >>>>>>>>>>>>>>>>> a new call from the call queue and process it.
> >>>>>>>>>>>>>>>>> Therefore, the concurrency of the router is limited by
> >>>>>>>>>>>>>>>>> the number of handlers. A simple example: if the number
> >>>>>>>>>>>>>>>>> of handlers is 1 and the maximum number of calls in the
> >>>>>>>>>>>>>>>>> connection thread is 10, then even though the connection
> >>>>>>>>>>>>>>>>> thread can send 10 requests to the downstream ns, the
> >>>>>>>>>>>>>>>>> router can only process one request after another
> >>>>>>>>>>>>>>>>> because there is only 1 handler.
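
The one-handler example above can be reproduced with a toy model. This is only a sketch under stated assumptions, not Hadoop's RPC server: the class HandlerBottleneckSketch and the method peakInFlight are hypothetical, the handler is a single-thread executor, and the downstream ns is simulated by a scheduler that answers each call after 50 ms. It reports how many requests were outstanding downstream at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: a toy model with hypothetical names, not Router code. */
final class HandlerBottleneckSketch {

  /**
   * Pushes {@code calls} requests through ONE handler thread and returns the
   * peak number of requests that were outstanding downstream at once.
   */
  static int peakInFlight(int calls, boolean async) {
    ExecutorService handler = Executors.newSingleThreadExecutor();
    ScheduledExecutorService downstream =
        Executors.newSingleThreadScheduledExecutor();
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<CompletableFuture<Void>> responses = new ArrayList<>();
    try {
      for (int i = 0; i < calls; i++) {
        CompletableFuture<Void> response = new CompletableFuture<>();
        responses.add(response);
        handler.execute(() -> {
          // Forward the call; the simulated ns answers after 50 ms.
          peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
          downstream.schedule(() -> {
            inFlight.decrementAndGet();
            response.complete(null);
          }, 50, TimeUnit.MILLISECONDS);
          if (!async) {
            response.join(); // synchronous handler: parked until the reply arrives
          }
          // An asynchronous handler returns at once and takes the next call.
        });
      }
      // Wait for every simulated reply before reading the peak.
      CompletableFuture.allOf(responses.toArray(new CompletableFuture<?>[0])).join();
      return peak.get();
    } finally {
      handler.shutdown();
      downstream.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("sync  peak in-flight: " + peakInFlight(10, false));
    System.out.println("async peak in-flight: " + peakInFlight(10, true));
  }
}
```

With the blocking handler the peak is always 1, because the handler cannot forward the next call until the previous reply has arrived. With the non-blocking handler the peak is normally close to the number of calls, although the exact value depends on thread scheduling.
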
> >>>>>>>>>>>>>>>>> Since the performance of router rpc is mainly limited by
> >>>>>>>>>>>>>>>>> the number of handlers, the most effective way to
> >>>>>>>>>>>>>>>>> improve rpc performance currently is to increase the
> >>>>>>>>>>>>>>>>> number of handlers. However, letting the router create a
> >>>>>>>>>>>>>>>>> large number of handler threads also increases the
> >>>>>>>>>>>>>>>>> number of thread switches and cannot make full use of
> >>>>>>>>>>>>>>>>> the machine.
> >>>>>>>>>>>>>>>>> There are usually multiple ns downstream of the router.
> >>>>>>>>>>>>>>>>> If a handler forwards a request to an ns with poor
> >>>>>>>>>>>>>>>>> performance, the handler will wait for a long time. With
> >>>>>>>>>>>>>>>>> fewer handlers available, the router's ability to handle
> >>>>>>>>>>>>>>>>> requests for ns with normal performance is reduced. From
> >>>>>>>>>>>>>>>>> the client's perspective, the performance of the
> >>>>>>>>>>>>>>>>> router's downstream ns appears to have deteriorated at
> >>>>>>>>>>>>>>>>> this time. We often find that the call queue of the
> >>>>>>>>>>>>>>>>> downstream ns is not high, but the call queue of the
> >>>>>>>>>>>>>>>>> router is very high.
> >>>>>>>>>>>>>>>>> Therefore, although the main function of the router is
> >>>>>>>>>>>>>>>>> to federate and handle requests from multiple NSs, the
> >>>>>>>>>>>>>>>>> current synchronous RPC performance cannot satisfy the
> >>>>>>>>>>>>>>>>> scenario where there are many NSs downstream of the
> >>>>>>>>>>>>>>>>> router. Even if the concurrency of the router can be
> >>>>>>>>>>>>>>>>> improved by increasing the number of handlers, it is
> >>>>>>>>>>>>>>>>> still relatively slow. More threads increase CPU
> >>>>>>>>>>>>>>>>> context-switching time, and in fact many of the handler
> >>>>>>>>>>>>>>>>> threads are in a blocked state, which is undoubtedly a
> >>>>>>>>>>>>>>>>> waste of thread resources. When a request enters the
> >>>>>>>>>>>>>>>>> router, there is no guarantee that a handler will be
> >>>>>>>>>>>>>>>>> available to run it at that time.
> >>>>>>>>>>>>>>>>> Therefore, I am proposing asynchronous router rpc.
> >>>>>>>>>>>>>>>>> Please see the issue
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HDFS-17531 for the
> >>>>>>>>>>>>>>>>> complete solution.
> >>>>>>>>>>>>>>>>> You can also view this PR:
> >>>>>>>>>>>>>>>>> https://github.com/apache/hadoop/pull/6838,
> >>>>>>>>>>>>>>>>> which is just a demo, but it completes the core
> >>>>>>>>>>>>>>>>> asynchronous RPC function.
> >>>>>>>>>>>>>>>>> If you think asynchronous routing is feasible, we can
> >>>>>>>>>>>>>>>>> consider splitting this PR for easier review in the
> >>>>>>>>>>>>>>>>> future.
> >>>>>>>>>>>>>>>>> The PDF is attached and can also be viewed through the
> >>>>>>>>>>>>>>>>> issue.
> >>>>>>>>>>>>>>>>> Everyone is welcome to exchange ideas and discuss!
> >>>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>> To unsubscribe, e-mail:
> >> common-dev-unsubscr...@hadoop.apache.org
> >>>>>>>>>>>>>>> For additional commands, e-mail:
> >> common-dev-h...@hadoop.apache.org
> >>>>>>>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>>>>> To unsubscribe, e-mail:
> hdfs-dev-unsubscr...@hadoop.apache.org
> >>>>>>>>>>>>> For additional commands, e-mail:
> >> hdfs-dev-h...@hadoop.apache.org
> >>
> >>
> >
>
>
