[ 
https://issues.apache.org/jira/browse/HDFS-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479304#comment-13479304
 ] 

Kihwal Lee commented on HDFS-4075:
----------------------------------

We had a group of 40 nodes that were decommissioned and then recommissioned. 
When they were recommissioned by refreshing nodes with dfsadmin, there were 
over 5M over-replicated blocks. While holding the write lock, the NN (RPC 
handler) went through each of them and generated two log messages per block. 
That took about 5 minutes, and over 2GB of logs were written. Because of the 
locking, the namenode was unresponsive the whole time.

I tested the performance of the commons-logging + log4j FileAppender family 
combination, and it was clear that the above case was hitting the logging 
bottleneck. The time to finish logging 1,000,000 messages did not seem much 
different whether each message was a single character or 400 bytes. It was not 
IO bound but CPU bound, as the CPU stayed at 100% the whole time. Changing 
FileAppender properties affected the timing a bit, but not a lot. It seems this 
is the inherent limit of this logging mechanism.

For single-character logging, each message took 19-23us. In other words, it 
could do about 42K logs/sec with the CPU at 100% and almost no IO wait time. We 
can see that the namenode in the case given above was spending almost all of 
its time logging. The IO overhead was not significant.
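
As a rough cross-check of the figures above (a back-of-the-envelope sketch, not 
a measurement): 5M blocks at two messages each is 10M messages, and at 19-23us 
per message that is roughly 190-230 seconds of pure logging, which is the same 
order as the roughly 5 minutes observed. The class and method names below are 
hypothetical, and applying the single-character per-message cost to the real 
log messages is an assumption.

```java
// Hypothetical helper: estimates time spent logging, given the figures
// reported in this comment. Not part of Hadoop.
public class LoggingCostEstimate {

    // Total logging time in seconds for the given number of blocks,
    // messages per block, and per-message cost in microseconds.
    static double estimateSeconds(long blocks, int messagesPerBlock,
                                  double microsPerMessage) {
        double totalMessages = (double) blocks * messagesPerBlock;
        return totalMessages * microsPerMessage / 1_000_000.0;
    }

    public static void main(String[] args) {
        long blocks = 5_000_000L;   // "over 5M over-replicated blocks"
        int messagesPerBlock = 2;   // "two log messages per block"

        // 19-23us is the per-message range measured for single-character
        // messages; using it for the real messages is an assumption.
        double low = estimateSeconds(blocks, messagesPerBlock, 19.0);
        double high = estimateSeconds(blocks, messagesPerBlock, 23.0);

        System.out.printf("Estimated logging time: %.0f-%.0f seconds%n",
                low, high);
    }
}
```

The estimate accounts for most of the observed stall, which is consistent with 
the namenode spending almost all of its time logging.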
                
> Reduce recommissioning overhead
> -------------------------------
>
>                 Key: HDFS-4075
>                 URL: https://issues.apache.org/jira/browse/HDFS-4075
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.4, 2.0.2-alpha
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When datanodes are recommissioned, 
> {BlockManager#processOverReplicatedBlocksOnReCommission()} is called for each 
> rejoined node and excess blocks are added to the invalidate list. The problem 
> is this is done while the namesystem write lock is held.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
