[ https://issues.apache.org/jira/browse/HADOOP-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326242#comment-14326242 ]
Chris Nauroth commented on HADOOP-11604:
----------------------------------------

I may have also seen this thread exit prematurely, even after the recent fixes. I don't have complete information though, so I can't say for sure.

One potential problem is that the thread does not explicitly handle logging for unchecked exceptions, thrown either from the try body or the finally body. Without an uncaught exception handler, the default behavior would be to log the stack trace to the console, which is likely getting redirected to a .out file instead of the normal DataNode daemon log. An operator might not think to check the .out log.

Liang, do you still have a .out file from this incident? If not, then a good first step might be to patch the logging so that we either set an uncaught exception handler to do the logging, or just catch and log {{Throwable}} explicitly.

I agree that we need to find root cause before attempting a fix.

> Reach xceiver limit once the watcherThread die
> ----------------------------------------------
>
>                 Key: HADOOP-11604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11604
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>            Priority: Critical
>         Attachments: HADOOP-11604-001.txt, HADOOP-11604-002.txt
>
>
> Our product cluster hit the Xceiver limit even w/ HADOOP-10404 & HADOOP-11333, i found it was caused by DomainSocketWatcher.watcherThread gone. Attached is a possible fix, please review, thanks

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
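
For illustration only, the two logging options mentioned in the comment above (setting an uncaught exception handler, or catching and logging Throwable explicitly) could be sketched roughly as follows. This is not the DomainSocketWatcher code or the attached patch. The class name WatcherLoggingSketch, the in-memory LOG list, and the helper methods are all hypothetical stand-ins; a real patch would write to the DataNode daemon log instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from Hadoop.
class WatcherLoggingSketch {
    // Stand-in for the daemon log (assumption: real code would use the DataNode logger).
    static final List<String> LOG = Collections.synchronizedList(new ArrayList<>());

    // Option 1: install an uncaught exception handler so that an unchecked
    // exception from the try body or the finally body is logged, not just
    // printed to the console (and so to the .out file).
    static Thread withHandler(Runnable body) {
        Thread t = new Thread(body, "watcher-sketch-1");
        t.setUncaughtExceptionHandler((thread, e) ->
            LOG.add(thread.getName() + " terminating on unexpected exception: " + e));
        return t;
    }

    // Option 2: wrap the whole run body and catch and log Throwable explicitly.
    static Thread withCatch(Runnable body) {
        return new Thread(() -> {
            try {
                body.run();
            } catch (Throwable e) {
                LOG.add("watcher-sketch-2 terminating on unexpected exception: " + e);
            }
        }, "watcher-sketch-2");
    }

    // Starts the thread and waits for it to exit.
    static void runToCompletion(Thread t) {
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // A body whose finally block throws, mimicking the case described above.
        Runnable failing = () -> {
            try {
                // normal watcher work would go here
            } finally {
                throw new IllegalStateException("failure in finally body");
            }
        };
        runToCompletion(withHandler(failing));
        runToCompletion(withCatch(failing));
        LOG.forEach(System.out::println);
    }
}
```

Either variant leaves a record in the chosen log when the thread dies, which is the information missing from this incident. Neither one addresses the root cause.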