[jira] [Updated] (HDDS-4408) Datanode State Machine Thread needs handle OutOfMemoryError

Glen Geng (Jira) Thu, 29 Oct 2020 02:19:21 -0700


     [ https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Glen Geng updated HDDS-4408:
----------------------------
    Description: 
In Tencent's internal production environment, several DNs went dead and could 
not come back without a restart.

 

We found that the thread "Datanode State Machine Thread - 0" was missing from 
the jstack output. Without it, no HeartbeatEndpointTask is created, so the DN 
is soon marked dead and cannot recover until it is restarted.
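
The jstack check above can also be done from inside the JVM. The following is 
a minimal sketch, not code from Ozone: the class name ThreadPresenceCheck and 
the method isThreadAlive are hypothetical, and only the thread name is taken 
from this report. It relies on Thread.getAllStackTraces(), which returns live 
threads only.

```java
import java.util.Set;

// Hypothetical helper (not part of Ozone): reports whether a named thread
// is still alive in the current JVM, similar to searching jstack output.
final class ThreadPresenceCheck {

  /** Returns true if a live thread with exactly this name exists. */
  static boolean isThreadAlive(String threadName) {
    // getAllStackTraces() only contains threads that are currently alive.
    Set<Thread> liveThreads = Thread.getAllStackTraces().keySet();
    for (Thread t : liveThreads) {
      if (t.isAlive() && t.getName().equals(threadName)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    String name = "Datanode State Machine Thread - 0";
    // Simulate the failure: the thread body throws an OutOfMemoryError.
    Thread worker = new Thread(() -> {
      throw new OutOfMemoryError("simulated");
    }, name);
    // Keep the demo output quiet; a real handler should at least log.
    worker.setUncaughtExceptionHandler((t, e) -> { });
    worker.start();
    worker.join();
    System.out.println("alive after OOM: " + isThreadAlive(name));
  }
}
```

A periodic check like this could only detect the missing thread; it would not 
bring the DN back on its own.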

 

After checking the .out log, we saw that an OOM had occurred in the thread 
"Datanode State Machine Thread", which killed the thread.
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
  from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
  to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
  class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
BTW, we see a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded" 
in the DN's log after running the DN for more than a week. Since we had 
configured a dead Recon, we guess this could be evidence for HDDS-4404.
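
One way to keep such a failure from going unnoticed is an uncaught-exception 
handler on the thread. The sketch below is an illustration only, not the fix 
adopted for this issue: the class StateMachineThreadGuard, the method 
newGuardedThread, and the onFatal callback are hypothetical names. It shows 
that an OutOfMemoryError thrown in the thread body reaches the handler before 
the thread dies.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch (not Ozone code): a thread whose death from any
// uncaught Throwable, including OutOfMemoryError, is reported to a callback.
final class StateMachineThreadGuard {

  /** Creates a daemon thread that forwards any uncaught Throwable to onFatal. */
  static Thread newGuardedThread(String name, Runnable body,
                                 Consumer<Throwable> onFatal) {
    Thread t = new Thread(body, name);
    t.setDaemon(true);
    // The handler runs in the dying thread, just before it terminates.
    t.setUncaughtExceptionHandler((thread, error) -> onFatal.accept(error));
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Thread t = newGuardedThread(
        "Datanode State Machine Thread - 0",
        // Simulated failure; a real OOM would come from the JVM itself.
        () -> { throw new OutOfMemoryError("GC overhead limit exceeded (simulated)"); },
        seen::set);
    t.start();
    t.join();
    System.out.println("reported to handler: " + seen.get());
  }
}
```

In a real service the callback would more likely log the error and terminate 
the process so that a supervisor can restart it, since the heap is probably 
unusable after an OOM. HotSpot also offers the -XX:+ExitOnOutOfMemoryError 
flag for the same purpose; whether either approach suits the DN is left open 
here.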

 

 


> Datanode State Machine Thread needs handle OutOfMemoryError
> -----------------------------------------------------------
>
>                 Key: HDDS-4408
>                 URL: https://issues.apache.org/jira/browse/HDDS-4408
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

