[ https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795617#comment-17795617 ]
ASF GitHub Bot commented on HDFS-17281:
---------------------------------------

xinglin commented on PR #6337:
URL: https://github.com/apache/hadoop/pull/6337#issuecomment-1851408051

The following two assertions failed during one of the builds. Investigating.

```
assertCounter("RoundTripLockAndSleepNumOps", 1L, rttRB);
assertTrue(rttLockAndSleepAvgTime > serverSideLockAndSleepAvgTime);
```

```
[ERROR] testRpcRTTMetric(org.apache.hadoop.ipc.TestRPC)  Time elapsed: 0.04 s  <<< FAILURE!
java.lang.AssertionError: Bad value for metric RoundTripLockAndSleepNumOps expected:<1> but was:<0>
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at org.apache.hadoop.test.MetricsAsserts.assertCounter(MetricsAsserts.java:230)
    at org.apache.hadoop.ipc.TestRPC.testRpcRTTMetric(TestRPC.java:1518)
```

> Added support of reporting RPC round-trip time at NN.
> -----------------------------------------------------
>
>                 Key: HDFS-17281
>                 URL: https://issues.apache.org/jira/browse/HDFS-17281
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>
> We have come across a few cases where the hdfs clients are reporting very bad
> latencies, while we don't see similar trends at NN-side. Instead, from
> NN-side, the latency metrics seem normal as usual. I attached a screenshot
> which we took during an internal investigation at LinkedIn. What was
> happening is a token management service was reporting an average latency of 1
> sec in fetching delegation tokens from our NN but at the NN-side, we did not
> see anything abnormal. The recent OverallRpcProcessingTime metric we added in
> HDFS-17042 did not seem to be sufficient to identify/signal such cases.
> We propose to extend the IPC header in Hadoop to communicate the client-side
> call creation time to IPC servers, so that for each RPC call, the server can
> compute its round-trip time.
>
> *Why is OverallRpcProcessingTime not sufficient?*
> OverallRpcProcessingTime captures the time from when the reader thread reads
> the call from the socket to when the response is sent back to the client. As
> a result, it does not capture the time it takes to transmit the call from the
> client to the server. Besides, we only have a couple of reader threads to
> monitor a large number of open connections. It is possible that many
> connections become ready to read at the same time. The reader thread then
> has to read each call sequentially, which adds wait time for many RPC calls.
> We have also hit the case where the callQueue becomes full (with a total of
> 25600 requests), so reader threads are blocked from adding new calls to the
> callQueue. This leads to longer latency for all connections/calls that are
> ready and waiting to be read by reader threads.
> Ideally, we want to measure the time between when a socket/call is ready to
> read and when it is actually read by the reader thread. This would give us
> the time a call spends waiting to be read. However, after some searching, we
> could not find a way to obtain it.
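
To illustrate the difference between the two measurements described above, here is a minimal sketch. It is not the code from the PR: the class `RpcRttSketch`, the nested `Call` type, and all field names are hypothetical. It assumes the client's call creation time arrives in the IPC header and that client and server clocks are close enough for the subtraction to be meaningful.

```java
public class RpcRttSketch {

    /** Hypothetical stand-in for an RPC call; not Hadoop's actual Call class. */
    public static final class Call {
        // Assumed to be set by the client and carried in the IPC header.
        public final long clientCreateTimeMs;
        // When a server reader thread read the call from the socket.
        public final long serverReceiveTimeMs;
        // When the server sent the response back to the client.
        public long responseSentTimeMs;

        public Call(long clientCreateTimeMs, long serverReceiveTimeMs) {
            this.clientCreateTimeMs = clientCreateTimeMs;
            this.serverReceiveTimeMs = serverReceiveTimeMs;
        }
    }

    /** Server-side view: starts only once the reader thread has the call. */
    public static long overallProcessingMs(Call c) {
        return c.responseSentTimeMs - c.serverReceiveTimeMs;
    }

    /** Round-trip view: also covers transmission and waiting to be read. */
    public static long roundTripMs(Call c) {
        return c.responseSentTimeMs - c.clientCreateTimeMs;
    }

    public static void main(String[] args) {
        // Call created at t=1000, read by a reader thread at t=1400,
        // response sent at t=1450 (all values are made up for illustration).
        Call call = new Call(1_000L, 1_400L);
        call.responseSentTimeMs = 1_450L;

        System.out.println("processing=" + overallProcessingMs(call) + " ms");
        System.out.println("roundTrip=" + roundTripMs(call) + " ms");
    }
}
```

With these made-up timestamps, the server-side figure is 50 ms while the round-trip figure is 450 ms. The 400 ms gap is the time spent in transit and waiting for a reader thread, which is the part the server-side metric cannot see.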