[ 
https://issues.apache.org/jira/browse/HDFS-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609190#comment-15609190
 ] 

Chris Nauroth commented on HDFS-11063:
--------------------------------------

Here is an example thread from {{jstack}}:

{code}
"IPC Server handler 6 on 19000" #49 daemon prio=5 os_prio=31 
tid=0x00007ff6b3c84800 nid=0x9e03 waiting on condition [0x0000700003762000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at org.apache.hadoop.hdfs.server.namenode.INode.computeAndConvertContentSummary(INode.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getContentSummaryInt(FSDirStatAndListingOp.java:515)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getContentSummary(FSDirStatAndListingOp.java:134)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:2941)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1311)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:920)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1795)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2535)
{code}

I'm thinking we could set the thread name so that it looks more like this:

{code}
"IPC Server handler 6 on 19000 getContentSummary(user=chris, startTime=12345, 
path=/)" #49 daemon prio=5 os_prio=31 tid=0x00007ff6b3c84800 nid=0x9e03 waiting 
on condition [0x0000700003762000]
...
{code}

This would clearly show that user "chris" was very naughty and called an 
expensive {{getContentSummary}} on the root.  We could also determine how long 
the operation has been running based on the start time.
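
As a rough illustration of the start-time idea, the sketch below pulls {{startTime}} out of a thread name in the proposed format and subtracts it from a supplied "now" value. The class and method names ({{HandlerThreadAge}}, {{elapsedMillis}}) are hypothetical, and it assumes {{startTime}} is a millisecond timestamp on the same clock as the caller's "now", which the proposal does not specify.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading a jstack thread name; not part of Hadoop. */
final class HandlerThreadAge {
  // Assumption: the thread name carries "startTime=<milliseconds>".
  // The digit count is capped so Long.parseLong cannot overflow.
  private static final Pattern START_TIME =
      Pattern.compile("startTime=(\\d{1,18})");

  private HandlerThreadAge() {}

  /**
   * Returns how long the call has been running, or empty if the thread name
   * carries no startTime (for example, an idle handler).
   */
  static OptionalLong elapsedMillis(String threadName, long nowMillis) {
    Matcher m = START_TIME.matcher(threadName);
    if (!m.find()) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(nowMillis - Long.parseLong(m.group(1)));
  }
}
```

For the example thread name above, a "now" of 20000 would give an elapsed time of 7655 ms.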

This additional contextual information would have to be cleared out of the 
thread name after completion of the RPC method, so that when the thread is 
returned to the pool for handling later calls, it doesn't hold on to the 
information about the old call.
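
A minimal sketch of the set-then-clear pattern, using only {{Thread.setName}} and a {{finally}} block. The helper ({{HandlerThreadNaming.callWithContext}}) is hypothetical and is not an existing Hadoop API. Where such a wrapper would hook into the NameNode RPC path is left open here, and taking {{startTime}} from {{System.currentTimeMillis()}} is an assumption.

```java
import java.util.function.Supplier;

/** Hypothetical helper illustrating the idea; not part of Hadoop. */
final class HandlerThreadNaming {
  private HandlerThreadNaming() {}

  /**
   * Runs the call with a descriptive suffix appended to the current thread's
   * name, then restores the original name even if the call throws.
   */
  static <T> T callWithContext(String method, String user, String path,
      Supplier<T> call) {
    Thread current = Thread.currentThread();
    String originalName = current.getName();
    // Assumption: start time is recorded in milliseconds since the epoch.
    long startTime = System.currentTimeMillis();
    current.setName(originalName + " " + method + "(user=" + user
        + ", startTime=" + startTime + ", path=" + path + ")");
    try {
      return call.get();
    } finally {
      // Drop the per-call context before the handler takes its next call.
      current.setName(originalName);
    }
  }
}
```

Restoring the saved name in {{finally}} means a failed call cannot leave stale context on a pooled handler thread.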

Bonus points if we can do this generically in Hadoop Common, so that all RPC 
servers get meaningful thread names without code changes in the individual 
servers.  I have a feeling that the desire to include protocol-specific 
information (like the path argument) makes that impossible though, so I have 
filed this as an HDFS JIRA.

> Set NameNode RPC server handler thread name with more descriptive information 
> about the RPC call.
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11063
>                 URL: https://issues.apache.org/jira/browse/HDFS-11063
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Chris Nauroth
>
> We often run {{jstack}} on a NameNode process as a troubleshooting step if it 
> is suffering high load or appears to be hanging.  By reading the stack trace, 
> we can identify if a caller is blocked inside an expensive operation.  This 
> would be even more helpful if we updated the RPC server handler thread name 
> with more descriptive information about the RPC call.  This could include the 
> calling user, the called RPC method, and the most significant argument to 
> that method (most likely the path).



