GlenGeng opened a new pull request #1533:
URL: https://github.com/apache/ozone/pull/1533
## What changes were proposed in this pull request?
In Tencent's internal production environment, we found several dead DNs that
could never come back without a restart.
We found that the thread "Datanode State Machine Thread - 0" was missing from
the jstack output. Without it, no HeartbeatEndpointTask is created, so the DN
soon becomes dead and cannot recover until it is restarted.
After checking the .out log, we saw that an OOM occurred in the thread "Datanode
State Machine Thread", which is most likely responsible for this issue:
```
114370.799: Total time for which application threads were stopped: 1.0883622
seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000,
0x00000007c0000000, 0x00000007c0000000)
eden space 2846720K, 100% used
[0x00000006eab00000,0x0000000798700000,0x0000000798700000)
from space 323584K, 0% used
[0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
to space 324096K, 0% used
[0x0000000798700000,0x0000000798700000,0x00000007ac380000)
ParOldGen total 6990848K, used 6990627K [0x0000000540000000,
0x00000006eab00000, 0x00000006eab00000)
object space 6990848K, 99% used
[0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
```
```
300010.579: Total time for which application threads were stopped: 3.0848769
seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0"
java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600
seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000,
0x0000000800000000, 0x0000000800000000)
eden space 3388416K, 100% used
[0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
from space 53248K, 0% used
[0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
to space 53248K, 0% used
[0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
ParOldGen total 6990848K, used 6990848K [0x0000000580000000,
0x000000072ab00000, 0x000000072ab00000)
object space 6990848K, 100% used
[0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
```
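The failure mode in these logs can be reproduced in isolation: once an `Error` such as `OutOfMemoryError` escapes a thread's run loop, the thread terminates and no longer appears in a thread dump. The following is a minimal sketch of that behavior, not the Datanode code itself. The class `StateMachineThreadDeath`, its methods, and the simulated error message are hypothetical; only the thread name is taken from the jstack output described above. The error is thrown directly rather than provoked by real allocation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it does not use any Ozone APIs.
class StateMachineThreadDeath {

  // Thread name borrowed from the jstack output discussed above.
  static final String THREAD_NAME = "Datanode State Machine Thread - 0";

  /** Returns true if a live thread with this name exists, roughly what jstack would list. */
  static boolean isThreadAlive(String name) {
    return Thread.getAllStackTraces().keySet().stream()
        .anyMatch(t -> t.getName().equals(name) && t.isAlive());
  }

  /**
   * Starts a worker whose loop throws a simulated OutOfMemoryError on its
   * third iteration, waits for the worker to terminate, and returns the
   * Throwable observed by the uncaught exception handler.
   */
  static Throwable runUntilSimulatedOom() throws InterruptedException {
    AtomicReference<Throwable> cause = new AtomicReference<>();
    CountDownLatch died = new CountDownLatch(1);

    Thread worker = new Thread(() -> {
      for (int i = 0; ; i++) {
        if (i == 2) {
          // Assumption: thrown by hand; a real OOM would come from the JVM.
          throw new OutOfMemoryError("simulated: GC overhead limit exceeded");
        }
      }
    }, THREAD_NAME);
    worker.setDaemon(true);

    // The handler runs on the dying thread; nothing restarts the loop afterwards.
    worker.setUncaughtExceptionHandler((t, e) -> {
      cause.set(e);
      died.countDown();
    });

    worker.start();
    if (!died.await(5, TimeUnit.SECONDS)) {
      throw new IllegalStateException("worker did not fail within 5 seconds");
    }
    worker.join(5000); // wait until the thread has fully terminated
    return cause.get();
  }

  public static void main(String[] args) throws InterruptedException {
    Throwable t = runUntilSimulatedOom();
    System.out.println("worker died with: " + t);
    System.out.println("still alive: " + isThreadAlive(THREAD_NAME));
  }
}
```

After the error, `isThreadAlive` reports that the named thread is gone, which matches the symptom of the thread being absent from jstack. An uncaught exception handler like the one above is one place where a process could log the failure, restart the loop, or exit. Which of these fits the Datanode is a design decision this sketch does not settle.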
BTW, after running a DN for more than a week, we see a lot of
"java.lang.OutOfMemoryError: GC overhead limit exceeded" in the DN's log. Since
we had configured a dead Recon, we guess this could be evidence for HDDS-4404.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4408
## How was this patch tested?
CI