[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

Daryn Sharp (JIRA) Mon, 15 Oct 2012 14:03:06 -0700

    [ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476446#comment-13476446 ]


Daryn Sharp commented on HDFS-3990:
-----------------------------------

As best I can tell, the {{DatanodeID}}'s hostname is whatever the DN claims to 
be in its registration.  The existing include/exclude list checks use the DN's 
IP and "real" hostname, not the name the node claimed in the registration.  
I'm trying to preserve the existing behavior by caching the socket's peer name 
at registration, so that the resolved socket address can be reused when 
checking the include/exclude lists.
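
To illustrate the idea only, here is a minimal sketch of caching the peer 
address once at registration and reusing it for list checks.  This is not the 
code in the patch: {{CachedPeerNode}}, its methods and the plain {{Set}} of 
hosts are hypothetical names made up for this example, and the real list 
checking has more cases than this.

```java
import java.net.InetSocketAddress;
import java.util.Set;

// Hypothetical stand-in for a registered DN; this is NOT the real DatanodeID.
final class CachedPeerNode {
    // Name the DN reported in its registration (not used for list checks).
    private final String claimedHostName;
    // Peer address of the registration socket, captured once and reused.
    private final InetSocketAddress peer;

    CachedPeerNode(String claimedHostName, InetSocketAddress peer) {
        if (peer.isUnresolved()) {
            throw new IllegalArgumentException("peer address must be resolved");
        }
        this.claimedHostName = claimedHostName;
        this.peer = peer;
    }

    String getClaimedHostName() { return claimedHostName; }

    // Literal IP of the cached peer address; no lookup is performed.
    String getPeerIp() { return peer.getAddress().getHostAddress(); }

    // Host name if the address already carries one, else the IP literal.
    // getHostString() does not attempt a reverse lookup.
    String getPeerHostName() { return peer.getHostString(); }

    // True if the cached IP or host name appears in the given host list.
    boolean isListedIn(Set<String> hosts) {
        return hosts.contains(getPeerIp()) || hosts.contains(getPeerHostName());
    }
}
```

The point of the sketch is that a check like {{isListedIn}} only reads fields 
captured at registration, so calling it for every DN while holding a lock does 
not issue any DNS queries.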

bq. In registerDatanode why is OK to no longer update the registration info 
with the reported IP?

The IP is actually updated when {{setNodeAddr}} is called with the socket's 
peer.

My bad on the comments.  I'm not sure how I lost that change.

I know the approach isn't perfect, and many of the fields could likely be 
folded together into the socket address, but I'm trying to make the minimal 
change that avoids a slew of DNS queries, which are having an adverse 
performance impact on multi-thousand node clusters.

                
> NN's health report has severe performance problems
> --------------------------------------------------
>
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch
>
>
> The dfshealth page will place a read lock on the namespace while it does a 
> DNS lookup for every DN.  On a multi-thousand node cluster, this often 
> results in 10s+ load time for the health page.  10 concurrent requests were 
> found to cause 7m+ load times, during which write operations were blocked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
