Kihwal Lee created HDFS-5500:
--------------------------------
Summary: Critical datanode threads may terminate silently on
uncaught exceptions
Key: HDFS-5500
URL: https://issues.apache.org/jira/browse/HDFS-5500
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Kihwal Lee
Priority: Critical
We've seen the refreshUsed (DU) thread disappear on uncaught exceptions. This
can go unnoticed for a long time. If an OOM occurs, more things can go wrong. On
one occasion, the Timer thread, multiple refreshUsed threads and the
DataXceiverServer thread had all terminated.
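One way to keep such a thread death from going unnoticed is to attach an uncaught exception handler to each critical thread. The following is only a minimal sketch in plain Java, not the datanode's actual code: the class CriticalThreads, the method newCriticalThread and the onDeath callback are hypothetical names introduced for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of Hadoop: reports the death of a critical thread.
class CriticalThreads {
    // Remembers the most recent uncaught failure so a monitor can inspect it.
    public static final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

    /** Creates a daemon thread whose uncaught exceptions are logged and reported. */
    public static Thread newCriticalThread(String name, Runnable body, Runnable onDeath) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, e) -> {
            lastFailure.set(e);
            System.err.println("Critical thread " + thread.getName() + " died: " + e);
            // Assumption: the caller decides what to do next (restart, shut down, alert).
            onDeath.run();
        });
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch died = new CountDownLatch(1);
        Thread t = newCriticalThread("refreshUsed-demo",
            () -> { throw new IllegalStateException("simulated failure"); },
            died::countDown);
        t.start();
        // Wait for the handler to report the death instead of losing it silently.
        boolean noticed = died.await(5, TimeUnit.SECONDS);
        System.out.println("noticed=" + noticed);
    }
}
```

A handler like this may itself fail under a real OutOfMemoryError, so it should do as little work as possible; whether to restart the thread or bring the process down is a separate policy choice.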
DataXceiverServer catches OutOfMemoryError and sleeps for 30 seconds, but I am
not sure this really helps. In one case, the thread did this multiple times and
then terminated. I suspect another OOM was thrown while it was in the catch
block. As a result, the server socket was not closed and clients hung on
connect. Had it at least closed the socket, the client side would have been
impacted less.
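The point about closing the socket can be illustrated with an accept loop that closes its server socket in a finally block, so the socket is released even when an Error escapes the loop. This is a sketch under stated assumptions, not the DataXceiverServer implementation: ClosingAcceptLoop and its Handler interface are hypothetical, and the OutOfMemoryError is thrown by hand to simulate the failure.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical accept loop, not Hadoop code: always closes the server socket on exit.
class ClosingAcceptLoop implements Runnable {
    interface Handler { void handle(Socket peer) throws IOException; }

    private final ServerSocket server;
    private final Handler handler;

    ClosingAcceptLoop(ServerSocket server, Handler handler) {
        this.server = server;
        this.handler = handler;
    }

    @Override
    public void run() {
        try {
            while (!server.isClosed()) {
                handler.handle(server.accept());
            }
        } catch (IOException e) {
            System.err.println("accept loop stopping: " + e);
        } finally {
            // Runs for Errors as well, so new clients fail fast instead of hanging.
            try { server.close(); } catch (IOException e) { /* nothing more to do */ }
        }
    }

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread t = new Thread(new ClosingAcceptLoop(server, peer -> {
            peer.close();
            throw new OutOfMemoryError("simulated"); // stand-in for a real OOM
        }), "accept-demo");
        t.setUncaughtExceptionHandler((th, e) -> System.err.println(th.getName() + " died: " + e));
        t.start();
        // One client connection triggers the simulated failure in the handler.
        new Socket(server.getInetAddress(), server.getLocalPort()).close();
        t.join(5000);
        System.out.println("closed=" + server.isClosed());
    }
}
```

Under a genuine OOM the close call in the finally block could fail too, so this reduces the chance of a lingering listening socket rather than eliminating it.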
--
This message was sent by Atlassian JIRA
(v6.1#6144)