@zhangjian Great work, and thanks for driving this. I have one question
here.
Should the AsyncHandler thread pool support isolation? I'm worried that
the async handler threads will be exhausted when some nameservices
perform badly.
张浩博
hfutzhan...@163.com


---- Replied Message ----
| From | zhangjian<1361320...@qq.com.INVALID> |
| Date | 06/6/2024 11:49 |
| To | Xiaoqiao He<hexiaoq...@apache.org> |
| Cc | Hadoop Common<common-dev@hadoop.apache.org> ,
Hdfs-dev<hdfs-...@hadoop.apache.org> ,
<priv...@hadoop.apache.org> |
| Subject | Re: [Discuss] RBF: Aynchronous router RPC. |
Hi, xiaoqiao He, thank you so much!

Best Regards,
- zhangjian
On June 6, 2024, at 11:33, Xiaoqiao He <hexiaoq...@apache.org> wrote:

Hi @zhangjian <1361320...@qq.com> , the dev branch HDFS-17531 is ready now.
FYI.

Best Regards,
- He Xiaoqiao

On Tue, Jun 4, 2024 at 10:08 PM zhangjian <1361320...@qq.com> wrote:

Hi, Xiaoqiao He:
Can you help create a dev branch?  I don't have the permission to create
it.

Thank you very much.
- zhangjian

On May 30, 2024, at 11:57, Xiaoqiao He <hexiaoq...@apache.org> wrote:

Great! It looks like there are no other blockers.

@zhangjian <1361320...@qq.com> If there are no further comments, we should
go to the next step:
a. Create a dev branch for this proposal.
b. Split this huge PR into smaller JIRAs and PRs.
c. Involve some folks to review the PRs.

Please ping here if you need any help. Thanks again.
Good Luck!

Best Regards,
- He Xiaoqiao

On Wed, May 29, 2024 at 11:46 AM Ayush Saxena <ayush...@gmail.com>
wrote:

Thanx for the details, sounds cool, good luck with the feature!!!

-Ayush

On 29 May 2024, at 8:56 AM, zhangjian <1361320...@qq.com> wrote:

Thank you for taking the time to review this proposal.
Your comments point out the key issues in designing an asynchronous
router, and my proposal addresses them:
1. My design does not change how the existing synchronous router throws
standby or retriable exceptions, or any related behaviour; the async
router inherits these implementations.
2. Currently, both the asynchronous router and the sync router apply
backpressure to client requests when they exceed a certain limit
(asynchronous router: the handler cannot obtain a semaphore;
sync router: handlers block synchronously, so no handler is available)
and return a standby exception so that the client can retry on other routers
(the RouterRpcFairnessPolicyController mechanism).
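
As a rough illustration of the semaphore-based backpressure in point 2, here is a minimal sketch. It is not the code from the PR: the class AsyncPermitSketch, the method submit, and the exception OverloadedException are hypothetical names, and the real router would return a standby exception through RouterRpcFairnessPolicyController instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical admission gate: bounds the number of in-flight async calls. */
class AsyncPermitSketch {
  /** Stand-in for the retriable exception returned to the client (assumption). */
  static final class OverloadedException extends RuntimeException {
    OverloadedException(String msg) { super(msg); }
  }

  private final Semaphore permits;

  AsyncPermitSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  /** Starts the call if a permit is free; otherwise fails fast without blocking. */
  <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> call) {
    if (!permits.tryAcquire()) {
      CompletableFuture<T> rejected = new CompletableFuture<>();
      rejected.completeExceptionally(
          new OverloadedException("router busy, retry on another router"));
      return rejected;
    }
    try {
      // The permit is released when the downstream response (or error) arrives.
      return call.get().whenComplete((result, error) -> permits.release());
    } catch (RuntimeException e) {
      permits.release(); // the call could not even be started
      throw e;
    }
  }

  int available() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    AsyncPermitSketch gate = new AsyncPermitSketch(1);
    CompletableFuture<String> slow = new CompletableFuture<>();
    gate.submit(() -> slow); // takes the only permit
    CompletableFuture<String> second =
        gate.submit(() -> CompletableFuture.completedFuture("x"));
    System.out.println("second rejected: " + second.isCompletedExceptionally());
    slow.complete("done"); // releases the permit
    System.out.println("permits free: " + gate.available());
  }
}
```

The point of the sketch is that the handler never blocks on a full router: it either gets a permit and forwards the call, or rejects immediately so the client can move to another router.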

Thank you again!
zhangjian

On May 29, 2024, at 07:05, Ayush Saxena <ayush...@gmail.com> wrote:
Thanx folks, I had a very quick pass on the PDF and it looks good.
One doubt is about the part saying that if a Namenode returns a
StandbyException or something along similar lines, the Router will retry.
I think we have some logic in RouterRpcClient checking for such cases:
if it is a StandbyException, it does try the other Namenode, but for all
other retriable exceptions we return them to the client and let the client
operate according to its retry policy. I think we should preserve that
behaviour, if the intention was to change it.
Regarding controlling the concurrency to prevent OOM at the router,
maybe we should consider rejecting client requests beyond a certain
limit/backlog and returning a relevant retriable exception to the client,
so that it can retry on another Router rather than overloading one Router
when others are available. Most deployments, I believe, would be
running a considerable number of Routers.
For the rest, I scratched my head for possible scenarios where things can
go south, but I think most of the scenarios that came to my mind are
covered.
Nothing blocking from my side, Good Luck!!!
-Ayush
On Tue, 28 May 2024 at 21:52, Sangjin Lee <sj...@apache.org> wrote:
Sounds good. Thanks for sharing your findings.
On Sat, May 25, 2024 at 2:24 AM zhangjian <1361320...@qq.com> wrote:
Hello everyone, I ran a performance comparison between the sync and
asynchronous routers. The results show that, in both single-ns and
multi-ns scenarios, the asynchronous router is better than the sync router
in throughput, CPU and thread utilization, and average processing time of
client requests. The gap is largest when a downstream ns has a performance
bottleneck, where the async router far outperforms the sync router. The
asynchronous router also provides better isolation than the sync router.
Detailed test results (PDF attached to the JIRA):
https://issues.apache.org/jira/browse/HDFS-17531  Comparison of Async
router & sync router performance.pdf
On May 24, 2024, at 14:13, Yuanbo Liu <liuyuanb...@gmail.com> wrote:
good job!
On Fri, May 24, 2024 at 1:57 AM zhangjian <1361320...@qq.com>
wrote:
Hello everyone, I have now tested the performance of the async
and sync routers against a single downstream ns:
1. The throughput, CPU, and thread performance of the async router
are better than those of the sync router, and its memory usage is
within an acceptable range compared to the synchronous router.
2. The asynchronous router can push more load downstream to better
utilize the capacity of the downstream ns, and can almost fill the call
queue of the downstream ns.
The test result PDF is too large to send by email;
please see: https://issues.apache.org/jira/browse/HDFS-17531
On May 23, 2024, at 17:03, Xiaoqiao He <hexiaoq...@apache.org> wrote:
Great. Thanks for the additional information.
cc @Ayush Saxena <ayush...@gmail.com> @inigo...@apache.org
<inigo...@apache.org> Any more feedback on this proposal?
IMO the asynchronous router RPC feature is a helpful improvement. From my
internal practice, it will significantly improve the throughput of
request forwarding, and it is very valuable to push it forward.
Thanks again and good luck!
Best Regards,
- He Xiaoqiao
On Wed, May 22, 2024 at 9:59 AM zhangjian <1361320...@qq.com>
wrote:
Hi, Sangjin Lee, thank you for your attention. I will use my free time to
do a performance comparison soon.
On May 22, 2024, at 03:42, Sangjin Lee <sj...@apache.org> wrote:
Thanks for the great proposal, Zhangjian. On point #3, I suspect it should
be fairly straightforward to create a small isolated synthetic test to
prove (or disprove) the benefits of this approach. By driving a controlled
amount of requests per second, you could see latency, memory, CPU, etc.
Ideally, it should show meaningful improvements without much degradation in
other metrics. Would you be able to spend some time doing that?
Thanks,
Sangjin
On Tue, May 21, 2024 at 5:13 AM zhangjian <1361320...@qq.com.invalid> wrote:
Hi, Xiaoqiao He, thank you for your reply.
1. Currently, the server and client protocols within the router can be
implemented by extending the existing protocols and adding asynchronous
functionality, so the existing synchronous protocols are not affected:
RouterClientNamenodeProtocolServerSideTranslatorPB
RouterClientProtocolTranslatorPB
RouterGetUserMappingsProtocolServerSideTranslatorPB
RouterGetUserMappingsProtocolTranslatorPB
RouterNamenodeProtocolServerSideTranslatorPB
RouterNamenodeProtocolTranslatorPB
RouterRefreshUserMappingsProtocolServerSideTranslatorPB
RouterRefreshUserMappingsProtocolTranslatorPB
The following issues implemented asynchronous callbacks for the RPC
server, but I have not found any other modules that use them:
Server: HADOOP-11552, HADOOP-17046
The asynchronous RPC client implementation directly uses this issue:
Client: HADOOP-13226
Therefore, I believe that the asynchronous router's changes to the
RPC protocol, RPC server, and client are safe.
2. When forwarding a request to multiple downstream ns, the synchronous
router handler adds the requests for the downstream ns to a thread pool
(RouterRpcClient.executorService), and then waits for responses from all
downstream ns before returning. Since threads in the thread pool also
process RPC requests synchronously, just like a handler, the number of
threads in the thread pool directly affects the performance of
invokeConcurrent, which in turn affects the performance of the handler.
In the asynchronous router implementation, the handler calls
invokeConcurrent simply to convert one request into multiple requests and
add them to the async handler thread pool; the handler can then process the
next request in the call queue. When a connection thread of a downstream ns
receives a response, it hands it over to the async response thread for
processing. The async response thread determines whether all responses
from the downstream ns have been received. If so, it continues to process
the response; otherwise, it moves on to the next response. The
asynchronous router uses CompletableFuture.allOf() to implement
asynchronous invokeConcurrent, so the handler, async handler, async
response, and connection threads never need to wait synchronously.
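
To make the fan-out described above concrete, here is a minimal sketch of combining per-ns futures with CompletableFuture.allOf(). It only illustrates the pattern and is not the code in the PR: ConcurrentInvokeSketch, invokeConcurrentAsync, and the invoke function are hypothetical names, and error handling is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical sketch of a non-blocking fan-out to several nameservices. */
class ConcurrentInvokeSketch {
  /**
   * Starts one call per nameservice and returns a future that completes
   * once every per-ns future has completed. No thread waits in here.
   */
  static <T> CompletableFuture<Map<String, T>> invokeConcurrentAsync(
      List<String> nameservices,
      Function<String, CompletableFuture<T>> invoke) {
    Map<String, CompletableFuture<T>> calls = new LinkedHashMap<>();
    for (String ns : nameservices) {
      calls.put(ns, invoke.apply(ns)); // returns immediately
    }
    return CompletableFuture
        .allOf(calls.values().toArray(new CompletableFuture[0]))
        .thenApply(ignored -> {
          // Every future is already complete here, so join() does not block.
          Map<String, T> results = new LinkedHashMap<>();
          calls.forEach((ns, future) -> results.put(ns, future.join()));
          return results;
        });
  }

  public static void main(String[] args) {
    Map<String, Integer> sizes = ConcurrentInvokeSketch.<Integer>invokeConcurrentAsync(
        Arrays.asList("ns0", "ns1"),
        ns -> CompletableFuture.supplyAsync(() -> ns.length()))
        .join(); // only the demo blocks; a handler would attach a callback instead
    System.out.println(sizes);
  }
}
```

In this shape the caller gets back a future right away, which is what lets the handler return to the call queue while the responses are still outstanding.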
In addition, synchronous routers have drawbacks not only in multi-ns
environments but also with a single downstream ns: it is often
difficult to decide how many handlers to configure for the router. Setting
too many wastes thread resources, and setting too few cannot put enough
pressure on the downstream ns. Asynchronous routers can push requests
to the downstream ns without having to tune the handler count. Asynchronous
routers can also connect more easily to additional downstream storage
services that support the HDFS protocol, with better scalability.
3. Since I have not yet deployed asynchronous routers to our own cluster,
there is no performance comparison yet. Theoretically, I believe that
asynchronous routers will use more memory than synchronous routers.
However, I do not expect the increase to be large, especially since we
can control the maximum number of requests entering the router, and
CompletableFuture is stable and widely used. In other respects, it should
be far superior to synchronous routers, especially in scenarios with many
downstream ns. If anyone is interested, you are welcome to help with a
performance comparison.
On May 21, 2024, at 11:39, Xiaoqiao He <hexiaoq...@apache.org> wrote:
Thanks for this great proposal!
Some questions after reviewing the design doc (sorry, I didn't review the
PR carefully, as it is too large):
1. This solution involves updating the RPC framework. Will it affect other
modules, and how do we keep other modules isolated from these changes?
2. Some RPC requests should be forwarded concurrently to all downstream NS.
Does this solution cover that case?
3. Considering there is an initial implementation, did you collect
any benchmarks against the current synchronous model of DFSRouter?
Thanks again.
Best Regards,
- He Xiaoqiao
On Tue, May 21, 2024 at 11:21 AM zhangjian <1361320...@qq.com.invalid> wrote:
Thank you for your positive attitude towards this feature. You can debug
the UTs provided in the PR to better understand how the current
asynchronous calls work.
On May 21, 2024, at 02:04, Simbarashe Dzinamarira <simbadz...@apache.org> wrote:
Excited to see this feature as well. I'll spend more time understanding the
proposal and implementation.
On Mon, May 20, 2024 at 7:55 AM zhangjian <1361320...@qq.com.invalid> wrote:
Hi, Yuanbo Liu, thank you for your interest in this feature. I think the
difficulty of an asynchronous router lies not only in implementing the
asynchronous functions, but also in keeping the code readable and
reusable, so as to make community development easier. At the beginning I
also planned to use the virtual threads you mentioned. Virtual threads
can make the code asynchronous elegantly, but the biggest problem is that
upgrading the JDK version is not easy, either in the community or in real
production environments. Therefore, I later used CompletableFuture, which
is already supported by JDK 8, to achieve asynchrony. The router is
stateless, and the router RPC process is very clear. Therefore, even if
CompletableFuture itself is not as readable as virtual threads, if we
design it well, we can make the asynchronous process look very clear.
On May 20, 2024, at 10:56, Yuanbo Liu <liuyuanb...@gmail.com> wrote:
Nice to see this feature brought up. I tried to implement this feature in
our internal clusters, and know that it's a very complicated feature. CC
hdfs-dev to bring more discussion.
By the way, I'm not sure whether virtual threads in newer JDKs will help in
this case.
On Mon, May 20, 2024 at 10:10 AM zhangjian <1361320...@qq.com.invalid> wrote:
Hello everyone, the RPC of the HDFS router currently has some shortcomings:
Currently the router's handler thread is synchronous: when the *handler*
thread adds a call to connection.calls, it must wait until the
*connection* thread notifies it that the call is complete, and only after
the response is put into the response queue can it take a new call from the
call queue and process it. Therefore, the concurrency of the router is
limited by the number of handlers. A simple example: if the number of
handlers is 1 and the maximum number of calls in the connection thread is
10, then even though the connection thread can send 10 requests to the
downstream ns, the router can only process one request after another
because there is only 1 handler.
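
The one-handler example above can be reproduced with a toy model. This is only a sketch under stated assumptions: HandlerModel and Downstream are hypothetical names, the downstream ns is simulated with a timer, and nothing here uses the real Hadoop RPC classes. It counts how many requests are in flight at the downstream at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a single router handler in front of a slow downstream ns. */
class HandlerModel {
  /** Simulated downstream ns: every call completes after delayMs. */
  static final class Downstream {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();
    private final long delayMs;

    Downstream(long delayMs) { this.delayMs = delayMs; }

    CompletableFuture<String> call(int id) {
      // Record the highest number of concurrently outstanding calls.
      maxInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
      CompletableFuture<String> response = new CompletableFuture<>();
      timer.schedule(() -> {
        inFlight.decrementAndGet();
        response.complete("response-" + id);
      }, delayMs, TimeUnit.MILLISECONDS);
      return response;
    }

    int maxInFlight() { return maxInFlight.get(); }
    void shutdown() { timer.shutdown(); }
  }

  /** Synchronous handler: waits for each response before taking the next call. */
  static int runSync(int requests, long delayMs) {
    Downstream ns = new Downstream(delayMs);
    for (int i = 0; i < requests; i++) {
      ns.call(i).join(); // the single handler is blocked here
    }
    ns.shutdown();
    return ns.maxInFlight();
  }

  /** Asynchronous handler: forwards every call without waiting in between. */
  static int runAsync(int requests, long delayMs) {
    Downstream ns = new Downstream(delayMs);
    List<CompletableFuture<String>> pending = new ArrayList<>();
    for (int i = 0; i < requests; i++) {
      pending.add(ns.call(i)); // returns immediately
    }
    // Wait only so the demo can read the counter after all calls finish.
    CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    ns.shutdown();
    return ns.maxInFlight();
  }

  public static void main(String[] args) {
    System.out.println("sync  max in-flight: " + runSync(10, 50));
    System.out.println("async max in-flight: " + runAsync(10, 50));
  }
}
```

With one synchronous handler the downstream never sees more than one outstanding request, while the asynchronous variant can keep several outstanding from the same single thread.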
Since the performance of router RPC is mainly limited by the number of
handlers, the most effective way to improve RPC performance currently is to
increase the number of handlers. However, letting the router create a large
number of handler threads also increases thread switching and
cannot make full use of the machine.
There are usually multiple ns downstream of the router. If a handler
forwards a request to an ns with poor performance, the handler has to wait
for a long time. With fewer handlers available, the router's ability to
handle requests for the ns with normal performance is reduced. From the
client's perspective, the downstream ns of the router appear to have
degraded at this point. We often find that the call queue of the downstream
ns is not long, but the call queue of the router is very long.
Therefore, although the main function of the router is to federate and
handle requests for multiple NSs, the current synchronous RPC performance
cannot satisfy scenarios where there are many NSs downstream of the
router. Even if the concurrency of the router can be improved by
increasing the number of handlers, it is still relatively slow. More
threads increase CPU context switching time, and in fact many of
the handler threads are blocked, which is undoubtedly a waste of
thread resources. When a request enters the router, there is no guarantee
that a handler will be available to run it at that time.
Therefore, I propose asynchronous router RPC. Please see the issue
https://issues.apache.org/jira/browse/HDFS-17531 for the complete solution.
You can also view this PR: https://github.com/apache/hadoop/pull/6838,
which is just a demo, but it completes the core asynchronous RPC function.
If you think asynchronous routing is feasible, we can consider splitting
this PR for easier review in the future.
The PDF is attached and can also be viewed through the issue.
Everyone is welcome to discuss!

---------------------------------------------------------------------
To unsubscribe, e-mail:
common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail:
common-dev-h...@hadoop.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail:
hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail:
hdfs-dev-h...@hadoop.apache.org
