[
https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glen Geng updated HDDS-4408:
----------------------------
Description:
In Tencent's internal production environment, we found several dead DNs that
never come back without a restart.
We found that the thread "Datanode State Machine Thread - 0" does not exist in
the jstack output, so no HeartbeatEndpointTask is created; the DNs soon become
dead and cannot recover unless they are restarted.
After checking the .out log, we saw that an OOM occurred in thread "Datanode
State Machine Thread", which killed the thread.
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
  from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
  to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
  class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
BTW, we see a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded"
in the DN's log after running the DN for more than a week. Since we configured
a dead Recon, we guess this could be evidence for HDDS-4404.
was:
In Tencent internal production environment, we got several dead DNs which can
never come back without a restart.
We found that thread "Datanode State Machine Thread - 0" does not exist in the
jstack, thus no HeartbeatEndpointTask will be created, so DNs will soon become
dead and can not recover unless being restarted.
After checked the .out log, we saw that OOM occurred in thread "Datanode State
Machine Thread", which will kill the thread.
{code:java}
114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
114370.810: Application time: 0.0115941 seconds
{Heap before GC invocations=2946 (full 2680):
 PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
  from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
  to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
 ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
 Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
  class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
{code}
BTW, we see a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded"
in DN's log, after running DN for more than a week. Since we configured a dead
Recon, we guess this could an evidence for HDDS-4404.
> Datanode State Machine Thread needs handle OutOfMemoryError
> -----------------------------------------------------------
>
> Key: HDDS-4408
> URL: https://issues.apache.org/jira/browse/HDDS-4408
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.1.0
> Reporter: Glen Geng
> Priority: Major
>
> In Tencent's internal production environment, we found several dead DNs that
> never come back without a restart.
>
> We found that the thread "Datanode State Machine Thread - 0" does not exist
> in the jstack output, so no HeartbeatEndpointTask is created; the DNs soon
> become dead and cannot recover unless they are restarted.
>
> After checking the .out log, we saw that an OOM occurred in thread "Datanode
> State Machine Thread", which killed the thread.
> {code:java}
> 114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
> Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded
> 114370.810: Application time: 0.0115941 seconds
> {Heap before GC invocations=2946 (full 2680):
>  PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
>   eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000)
>   from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000)
>   to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000)
>  ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
>   object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000)
>  Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K
>   class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
> {code}
> BTW, we see a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded"
> in the DN's log after running the DN for more than a week. Since we
> configured a dead Recon, we guess this could be evidence for HDDS-4404.
>
>
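For illustration only, here is a minimal sketch of the failure mode described
above and one possible way to notice it. It is not the Ozone code or the fix
for this issue: the class name FatalErrorHandlerSketch, the method
runAndCapture and the thread name are hypothetical, and the OutOfMemoryError
is simulated. It shows that an Error escaping a plain thread's run() method
terminates that thread, and that a Thread.UncaughtExceptionHandler gets a
chance to observe the Throwable before the thread exits.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Ozone code: observe a Throwable that kills a worker thread.
final class FatalErrorHandlerSketch {

  /**
   * Runs the task on a dedicated daemon thread, waits for it to finish, and
   * returns the Throwable that terminated the thread (null if it ended normally).
   */
  static Throwable runAndCapture(Runnable task) {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Thread worker = new Thread(task, "Hypothetical State Machine Thread - 0");
    worker.setDaemon(true);
    // The handler is invoked on the dying thread just before it terminates.
    // A real handler might log the error and stop the process so that a
    // supervisor can restart it; that policy is an assumption, not part of the report.
    worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for callers
    }
    return seen.get();
  }

  public static void main(String[] args) {
    // Simulate the error from the log instead of actually exhausting the heap.
    Throwable failure = runAndCapture(() -> {
      throw new OutOfMemoryError("simulated: GC overhead limit exceeded");
    });
    System.out.println("worker thread died with: " + failure);
  }
}
```

Without such a handler (or a catch block around the thread's main loop), the
only trace of the dead thread is the "Exception in thread ..." line printed by
the JVM's default handling, which matches what we saw in the .out log.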
--
This message was sent by Atlassian Jira
(v8.3.4#803005)