Kihwal Lee created HDFS-5500:
--------------------------------

             Summary: Critical datanode threads may terminate silently on uncaught exceptions
                 Key: HDFS-5500
                 URL: https://issues.apache.org/jira/browse/HDFS-5500
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Kihwal Lee
            Priority: Critical

We've seen the refreshUsed (DU) thread disappear on uncaught exceptions. This can go unnoticed for a long time.

If an OOM occurs, more things can go wrong. On one occasion, the Timer thread, multiple refreshUsed threads and the DataXceiverServer thread had all terminated.

DataXceiverServer catches OutOfMemoryError and sleeps for 30 seconds, but I am not sure that is really helpful. In one case, the thread did this multiple times and then terminated. I suspect another OOM was thrown while it was in the catch block. As a result, the server socket was not closed and clients hung on connect. If it had at least closed the socket, the client side would have been impacted less.
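
The two failure modes described above (a thread dying unnoticed, and a listening socket left open after the acceptor dies) can be illustrated with a small standalone sketch. This is not Hadoop code and not the fix for this issue: the class `CriticalThreadSketch` and the helpers `startCritical`, `acceptLoop` and `runDemo` are hypothetical names, and the OutOfMemoryError is simulated by throwing it by hand. It only uses the standard `Thread.setUncaughtExceptionHandler` and `ServerSocket` APIs.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class CriticalThreadSketch {

    // Hypothetical helper: start a daemon thread whose uncaught failure is
    // recorded and logged instead of disappearing silently.
    static Thread startCritical(String name, Runnable body,
                                AtomicReference<Throwable> failure,
                                CountDownLatch died) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, err) -> {
            failure.set(err);
            System.err.println("Critical thread " + thread.getName() + " died: " + err);
            died.countDown();
        });
        t.start();
        return t;
    }

    // Hypothetical accept loop: whatever ends the loop, including an Error,
    // the finally block closes the listening socket so clients are refused
    // rather than left hanging on connect.
    static void acceptLoop(ServerSocket server, Runnable perConnection) {
        try {
            while (!server.isClosed()) {
                try (Socket s = server.accept()) {
                    perConnection.run();
                }
            }
        } catch (IOException e) {
            // accept() failed or the socket was closed elsewhere; fall through.
        } finally {
            try {
                server.close();
            } catch (IOException ignored) {
                // Nothing more can be done here.
            }
        }
    }

    // Runs the scenario once: the first connection triggers a simulated OOM.
    static String runDemo() {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch died = new CountDownLatch(1);
        try {
            ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
            startCritical("acceptor",
                () -> acceptLoop(server, () -> { throw new OutOfMemoryError("simulated"); }),
                failure, died);
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                if (!died.await(5, TimeUnit.SECONDS)) {
                    return "timeout";
                }
            }
            return failure.get().getClass().getSimpleName() + " closed=" + server.isClosed();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints "OutOfMemoryError closed=true"
    }
}
```

In this sketch the socket is closed in `finally` before the error reaches the handler, so by the time the failure is reported new clients already get a refused connection. A real datanode would presumably want the handler to do more than log (for example, shut the process down), but that choice is outside what this report specifies.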