[
https://issues.apache.org/jira/browse/FLINK-17645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107902#comment-17107902
]
Zhu Zhu commented on FLINK-17645:
---------------------------------
[~sewen] I think this specific fix is valid because the root cause is that
{{SafetyNetCloseableRegistry}} is left in an inconsistent state if any exception
is thrown in its constructor. The static REAPER_THREAD field is not reset when
starting it fails unexpectedly, which is an uncommon case.
Once this happens, a SafetyNetCloseableRegistry can no longer be created in
that JVM, even if native thread creation would succeed later. Any other
exception thrown from {{REAPER_THREAD.start()}} (not necessarily a native
thread creation error) would lead to the same problem.
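To illustrate the point about resetting the static field, here is a minimal sketch. It is not the actual Flink code: the class {{ReaperRegistrySketch}}, its fields, the injected thread factory and {{failingThread()}} are all hypothetical names chosen to model a registry that lazily starts one static thread.

```java
import java.util.function.Supplier;

/** Simplified, hypothetical model of a registry that lazily starts one static reaper thread. */
class ReaperRegistrySketch {
    private static final Object LOCK = new Object();
    private static Thread reaperThread; // stands in for REAPER_THREAD
    private static int registryCount;   // number of successfully created registries

    /** Models the constructor logic; the factory is injected so a start failure can be simulated. */
    static void register(Supplier<Thread> threadFactory) {
        synchronized (LOCK) {
            if (registryCount == 0) {
                if (reaperThread != null) {
                    throw new IllegalStateException("stale reaper thread");
                }
                reaperThread = threadFactory.get();
                try {
                    reaperThread.start();
                } catch (RuntimeException | Error e) {
                    // Reset the static field so that a later attempt can retry.
                    reaperThread = null;
                    throw e;
                }
            }
            ++registryCount;
        }
    }

    static int registryCount() {
        synchronized (LOCK) {
            return registryCount;
        }
    }

    /** A thread whose start() always fails, simulating native thread exhaustion. */
    static Thread failingThread() {
        return new Thread() {
            @Override
            public void start() {
                throw new OutOfMemoryError("unable to create new native thread");
            }
        };
    }

    public static void main(String[] args) {
        try {
            register(ReaperRegistrySketch::failingThread);
        } catch (OutOfMemoryError e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }
        // Without the reset in the catch block, this call would hit the IllegalStateException.
        register(() -> new Thread(() -> { }));
        System.out.println("registries: " + registryCount());
    }
}
```

In this sketch, removing the assignment in the catch block would make the second {{register}} call fail the state check, which matches the repeated failure described in the issue.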
Regarding "exit JVM when thread creation fails", I'm not sure whether a native
thread creation failure is really unrecoverable.
If it is, I think we can force the JVM to shut down in such cases. But I do not
yet have an idea of how to apply this to every place that creates threads
without changing them all.
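For a single call site, escalating a thread start failure could look roughly like the sketch below. This is only an assumption about one possible shape: {{FatalThreadStartSketch}}, {{startOrEscalate}} and the injected handler are hypothetical names, and the sketch does not address the open question of covering every thread creation site without touching each one.

```java
import java.util.function.Consumer;

/** Hypothetical helper that escalates a failed Thread.start() to a fatal-error handler. */
class FatalThreadStartSketch {

    /** Starts the thread and hands a native thread creation failure to fatalHandler. */
    static void startOrEscalate(Thread thread, Consumer<Throwable> fatalHandler) {
        try {
            thread.start();
        } catch (OutOfMemoryError e) {
            fatalHandler.accept(e);
        }
    }

    public static void main(String[] args) {
        // Simulates native thread exhaustion by failing in start().
        Thread broken = new Thread() {
            @Override
            public void start() {
                throw new OutOfMemoryError("unable to create new native thread");
            }
        };
        // A real deployment might pass e -> Runtime.getRuntime().halt(-1); this demo only logs.
        startOrEscalate(broken, e -> System.out.println("fatal: " + e.getMessage()));
    }
}
```

Injecting the handler keeps the decision (halt, log, or rethrow) in one place, but every call site would still have to go through the helper.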
> REAPER_THREAD.start() in SafetyNetCloseableRegistry failed, causing the
> repeated failover.
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-17645
> URL: https://issues.apache.org/jira/browse/FLINK-17645
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.1, 1.11.0
> Reporter: Zakelly Lan
> Assignee: Lijie Wang
> Priority: Major
> Fix For: 1.11.0
>
>
> I'm running a modified version of Flink, and encountered the exception below
> when a task starts:
> {code:java}
> 2020-05-12 00:46:19,037 ERROR [***] org.apache.flink.runtime.taskmanager.Task - Encountered an unexpected exception
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:802)
> at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
> at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
> at java.lang.Thread.run(Thread.java:834)
> 2020-05-12 00:46:19,038 INFO [***] org.apache.flink.runtime.taskmanager.Task
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:802)
> at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
> at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> REAPER_THREAD.start() fails because of the OOM, but REAPER_THREAD is left
> non-null and is never reset. From then on, every SafetyNetCloseableRegistry
> initialization in this JVM throws an IllegalStateException:
> {code:java}
> java.lang.IllegalStateException
> at org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
> at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:71)
> at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
> at java.lang.Thread.run(Thread.java:834)
> {code}
> This may happen in very old versions of Flink as well.