GlenGeng opened a new pull request #1533:
URL: https://github.com/apache/ozone/pull/1533


   ## What changes were proposed in this pull request?
   
   In Tencent's internal production environment, we found several dead DNs that
could not come back without a restart.
    
   The thread "Datanode State Machine Thread - 0" was missing from the jstack
output. Without it, no HeartbeatEndpointTask is created, so the DN is soon
marked dead and cannot recover until it is restarted.
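
The jstack check described above can also be done from inside the JVM. The following is a minimal sketch, not code from this patch: the class `ThreadPresenceSketch` and the method `isThreadAlive` are hypothetical names, and the sketch only assumes that the thread of interest can be identified by its name.

```java
import java.util.Map;

// Hypothetical helper (not part of Ozone): reports whether a live thread
// with the given name exists in this JVM, similar to searching jstack output.
class ThreadPresenceSketch {
    static boolean isThreadAlive(String name) {
        // Thread.getAllStackTraces() returns a map keyed by all live threads.
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Thread t : traces.keySet()) {
            if (t.getName().equals(name) && t.isAlive()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `ThreadPresenceSketch.isThreadAlive("Datanode State Machine Thread - 0")` would return `false` on a DN in the state described here.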
    
   After checking the .out log, we saw that an OOM occurred in the thread
"Datanode State Machine Thread", which appears to be the cause of this issue:
   
   ```
   114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
   Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
   114370.810: Application time: 0.0115941 seconds
   {Heap before GC invocations=2946 (full 2680):
    PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
    eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
    from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
    to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
    ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
    object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
    Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
    class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
   ```
   
   ```
   300010.579: Total time for which application threads were stopped: 3.0848769 seconds, Stopping threads took: 0.0000943 seconds
   Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: Java heap space
   300010.579: Application time: 0.0001554 seconds
   300010.580: Total time for which application threads were stopped: 0.0015600 seconds, Stopping threads took: 0.0002747 seconds
   300010.581: Application time: 0.0004684 seconds
   {Heap before GC invocations=13766 (full 11664):
    PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000, 0x0000000800000000, 0x0000000800000000)
    eden space 3388416K, 100% used [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
    from space 53248K, 0% used [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
    to space 53248K, 0% used [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
    ParOldGen total 6990848K, used 6990848K [0x0000000580000000, 0x000000072ab00000, 0x000000072ab00000)
    object space 6990848K, 100% used [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
    Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
    class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K
   ```
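
The failure mode in the logs above, where an uncaught `OutOfMemoryError` ends the thread without any further signal, can be illustrated with a small sketch. This is not the code of this patch: the class `StateMachineThreadSketch` and the method `runAndCapture` are hypothetical, and the `OutOfMemoryError` in the usage below is thrown by hand to simulate the condition. It only shows that a `Thread.UncaughtExceptionHandler` gets a chance to observe the error before the thread disappears.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch (not part of Ozone): runs a task on a thread named like
// the DN state machine thread and captures whatever Throwable kills it.
class StateMachineThreadSketch {
    static Throwable runAndCapture(Runnable body) {
        AtomicReference<Throwable> captured = new AtomicReference<>();
        Thread t = new Thread(body, "Datanode State Machine Thread - 0");
        t.setDaemon(true);
        // The handler runs on the dying thread, before it terminates.
        // A real service might log the error and terminate the process here
        // (an assumption for illustration, not something this text confirms).
        t.setUncaughtExceptionHandler((thread, error) -> captured.set(error));
        t.start();
        try {
            // join() returns only after the handler has finished.
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return captured.get();
    }
}
```

Calling `StateMachineThreadSketch.runAndCapture(() -> { throw new OutOfMemoryError("simulated"); })` returns the simulated error, while a task that completes normally yields `null`. Without such a handler, the only trace of the failure is the "Exception in thread ..." line that the JVM prints to stderr, which is what ended up in the .out log here.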
   
   BTW, after running a DN for more than a week, we see many
"java.lang.OutOfMemoryError: GC overhead limit exceeded" errors in the DN's log.
Since the DN was configured with a dead Recon, we suspect this could be evidence
for HDDS-4404.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-4408
   
   ## How was this patch tested?
   
   CI




