[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

Daryn Sharp (JIRA) Tue, 16 Oct 2012 10:25:06 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477181#comment-13477181
 ]


Daryn Sharp commented on HDFS-3990:
-----------------------------------

bq. I'm not sure re-registering with a new IP and the same storage ID actually 
works today.

Jason Lowe recently finished a jira to make that work.

bq.  How about we reject the DN registration in case of a DNS hiccup (rather 
than use the DN value which the patch currently does in this case)?

I think I'm fine with that, as long as we accept that it more strictly rules 
out running a cluster in a DNS-less or DNS-error-tolerant environment.  I was 
considering a second jira that would first scan the include/exclude lists for 
the IP and, if it is not found and the IP is unresolved, would return 
include=false or exclude=true instead of flat-out rejecting the node.
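
To make the lookup order for that second jira concrete, here is a rough sketch. 
It is not code from the patch: the class and method names are hypothetical, the 
lists are modelled as plain sets of strings, and an unresolved IP is represented 
by a null hostname. It also ignores any special handling of empty lists.

```java
import java.util.Set;

/** Hypothetical sketch of the proposed include/exclude check; not HDFS code. */
public class HostListCheck {

    /**
     * Check the include list by IP first. If the IP is not listed and could
     * not be resolved (resolvedHost == null), answer include=false rather
     * than rejecting the node outright.
     */
    public static boolean isIncluded(String ip, String resolvedHost, Set<String> includes) {
        if (includes.contains(ip)) {
            return true;
        }
        if (resolvedHost == null) {
            return false; // unresolved and not listed by IP
        }
        return includes.contains(resolvedHost);
    }

    /**
     * Check the exclude list by IP first. If the IP is not listed and could
     * not be resolved, answer exclude=true (the conservative choice).
     */
    public static boolean isExcluded(String ip, String resolvedHost, Set<String> excludes) {
        if (excludes.contains(ip)) {
            return true;
        }
        if (resolvedHost == null) {
            return true; // unresolved and not listed by IP
        }
        return excludes.contains(resolvedHost);
    }
}
```

With this shape, a node listed by IP still works when DNS is down, while an 
unresolved node that is only listed by name is treated as not included (or as 
excluded).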

Ignoring the name the DN declares is a trivial enough change; do you think we 
can just do it in this patch?  I was trying to avoid any functional change 
with this patch (because who knows what will break!), but I'll post a revised 
patch that rejects unresolved DNs and ignores the DN's declared name, if 
that's OK with you.
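
For clarity, the behavior I mean for the revised patch is roughly the 
following. This is only an illustration under assumptions, not the patch 
itself: the class name, method name, and the injected reverse-lookup function 
are all hypothetical, and the lookup is passed in so the sketch does not depend 
on real DNS.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of "reject unresolved, ignore declared name"; not HDFS code. */
public class RegistrationCheck {

    /**
     * Returns the hostname to record for a registering DN.
     *
     * @param remoteIp      IP address the registration came from
     * @param declaredName  name the DN reports for itself (deliberately ignored)
     * @param reverseLookup hypothetical resolver: IP to hostname, empty if unresolved
     * @throws IllegalStateException if the IP cannot be resolved
     */
    public static String nameForRegistration(
            String remoteIp,
            String declaredName,
            Function<String, Optional<String>> reverseLookup) {
        // The declared name is never consulted; only the resolved name is used.
        return reverseLookup.apply(remoteIp)
                .orElseThrow(() -> new IllegalStateException(
                        "Rejecting registration: cannot resolve " + remoteIp));
    }
}
```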


                
> NN's health report has severe performance problems
> --------------------------------------------------
>
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt
>
>
> The dfshealth page will place a read lock on the namespace while it does a 
> DNS lookup for every DN.  On a multi-thousand node cluster, this often 
> results in 10s+ load time for the health page.  10 concurrent requests were 
> found to cause 7m+ load times, during which write operations blocked.
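
As a generic illustration of why the lock placement matters (and independent 
of what the attached patches actually do), one common way to avoid this kind 
of stall is to copy the data under the read lock and do the slow lookups only 
after releasing it. Everything in this sketch is hypothetical: the class, the 
list of IP strings, and the injected lookup function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical sketch: keep slow lookups outside the read lock. Not HDFS code. */
public class HealthReportSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> datanodeIps = new ArrayList<>();

    /** Writers take the write lock; they wait while any reader holds the read lock. */
    public void addDatanode(String ip) {
        lock.writeLock().lock();
        try {
            datanodeIps.add(ip);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** True if the calling thread currently holds the read lock. */
    public boolean isReadLockedByCurrentThread() {
        return lock.getReadHoldCount() > 0;
    }

    /** Builds the report: snapshot under the lock, then resolve without it. */
    public List<String> report(Function<String, String> slowLookup) {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(datanodeIps); // cheap copy while locked
        } finally {
            lock.readLock().unlock();
        }
        List<String> names = new ArrayList<>();
        for (String ip : snapshot) {
            names.add(slowLookup.apply(ip)); // slow work, no lock held
        }
        return names;
    }
}
```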

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
