[
https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18027952#comment-18027952
]
ASF GitHub Bot commented on HDFS-17281:
---------------------------------------
github-actions[bot] commented on PR #6337:
URL: https://github.com/apache/hadoop/pull/6337#issuecomment-3374717596
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> Added support of reporting RPC round-trip time at NN.
> -----------------------------------------------------
>
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>
> We have come across a few cases where HDFS clients report very high
> latencies while we see no similar trend on the NN side, where the latency
> metrics look normal. I attached a screenshot that we took during an internal
> investigation at LinkedIn. In that case, a token management service was
> reporting an average latency of 1 sec when fetching delegation tokens from
> our NN, but we did not see anything abnormal on the NN side. The
> OverallRpcProcessingTime metric we recently added in HDFS-17042 did not seem
> sufficient to identify or signal such cases.
> We propose to extend the IPC header in Hadoop to carry the client-side call
> creation time to IPC servers, so that the server can compute the round-trip
> time of each RPC call.
>
> *Why is OverallRpcProcessingTime not sufficient?*
> OverallRpcProcessingTime captures the time from when the reader thread reads
> the call from the socket to when the response is sent back to the client. As
> a result, it does not capture the time it takes to transmit the call from
> the client to the server. Besides, we only have a couple of reader threads
> to monitor a large number of open connections. Many connections may become
> ready to read at the same time; the reader thread then has to read each call
> sequentially, which adds wait time for many RPC calls. We have also hit the
> case where the callQueue becomes full (with a total of 25600 requests), so
> reader threads block when adding new calls to the callQueue. This leads to
> longer latency for all connections/calls that are ready and waiting to be
> read by reader threads.
> Ideally, we want to measure the time between when a socket/call becomes
> ready to read and when the reader thread actually reads it. This would give
> us the time a call waits to be read. However, after some searching, we did
> not find a way to get this.
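
The proposal in the description (carrying the client-side call creation time in the IPC header so that the server can derive a per-call round-trip time) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in PR #6337: CallHeader, clientCreateTimeMillis, and roundTripMillis are hypothetical names, and the sketch assumes the client and server wall clocks are reasonably synchronized, since any skew between the two hosts shifts the result directly.

```java
// Minimal sketch, NOT the Hadoop IPC implementation. All names are hypothetical.
class RpcRoundTripSketch {

    /** Hypothetical subset of an IPC header that carries the client-side creation time. */
    public static final class CallHeader {
        public final int callId;
        /** Wall-clock time (ms since epoch) at which the client created the call. */
        public final long clientCreateTimeMillis;

        public CallHeader(int callId, long clientCreateTimeMillis) {
            this.callId = callId;
            this.clientCreateTimeMillis = clientCreateTimeMillis;
        }
    }

    /**
     * Server-side computation: time from call creation on the client to the
     * moment the server sends the response. Assumes synchronized clocks;
     * a negative difference (server clock behind the client) is clamped to 0.
     */
    public static long roundTripMillis(CallHeader header, long responseSentMillis) {
        long elapsed = responseSentMillis - header.clientCreateTimeMillis;
        return Math.max(0L, elapsed);
    }

    public static void main(String[] args) {
        // Client side: stamp the call when it is created.
        CallHeader header = new CallHeader(1, 1_000L);
        // Server side: the response is sent 250 ms later by the server's clock.
        long rtt = roundTripMillis(header, 1_250L);
        System.out.println("callId=" + header.callId + " roundTripMillis=" + rtt);
    }
}
```

Unlike a metric that starts when the reader thread picks up the call, this value also includes transmission time and any time the call spent waiting to be read. Because the timestamp is taken on one host and compared on another, the sketch clamps negative differences to zero; a real implementation might prefer to count such samples separately.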