[ https://issues.apache.org/jira/browse/FLINK-17645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107902#comment-17107902 ]
Zhu Zhu commented on FLINK-17645:
---------------------------------

[~sewen] I think this specific fix is valid because the root cause is that {{SafetyNetCloseableRegistry}} is left in an inconsistent state if any exception is thrown in its constructor. This happens because {{REAPER_THREAD}} is static and is not properly reset on unexpected failures, which are not a common case. Once that happens, a {{SafetyNetCloseableRegistry}} can never be created successfully again, even if native thread creation would succeed later. Any other exception thrown from {{REAPER_THREAD.start()}} (not necessarily a native thread creation error) would lead to the same problem.

Regarding "exit JVM when thread creation fails", I'm not sure whether a native thread creation failure is really unrecoverable. If it is indeed unrecoverable, I think we can force the JVM to shut down in such cases. But I do not yet have an idea of how we could apply that to all thread creation sites without changing each of them.

> REAPER_THREAD.start() in SafetyNetCloseableRegistry failed, causing the repeated failover.
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-17645
> URL: https://issues.apache.org/jira/browse/FLINK-17645
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.1, 1.11.0
> Reporter: Zakelly Lan
> Assignee: Lijie Wang
> Priority: Major
> Fix For: 1.11.0
>
>
> I'm running a modified version of Flink, and encountered the exception below when a task starts:
> {code:java}
> 2020-05-12 00:46:19,037 ERROR [***] org.apache.flink.runtime.taskmanager.Task - Encountered an unexpected exception
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:802)
>     at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
>     at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>     at java.lang.Thread.run(Thread.java:834)
> 2020-05-12 00:46:19,038 INFO [***] org.apache.flink.runtime.taskmanager.Task
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:802)
>     at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
>     at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>     at java.lang.Thread.run(Thread.java:834)
> {code}
> The REAPER_THREAD.start() call fails because of the OOM, and REAPER_THREAD is never reset to null.
> From then on, every SafetyNetCloseableRegistry initialization in this JVM throws an IllegalStateException:
> {code:java}
> java.lang.IllegalStateException
>     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
>     at org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:71)
>     at org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>     at java.lang.Thread.run(Thread.java:834)
> {code}
> This may happen in very old versions of Flink as well.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
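
For illustration, the failure mode discussed in the comment can be reproduced in isolation. The following is only a minimal sketch, not Flink's actual code: the class {{ReaperRegistrySketch}}, its methods, and the simulated failing thread are hypothetical names introduced here. {{openFixed}} shows just one possible way to keep the static field consistent, and is not necessarily the fix that was applied for this issue.

```java
import java.util.function.Supplier;

/**
 * Hypothetical, simplified stand-in for a registry that lazily starts one
 * shared reaper thread. This is NOT Flink's SafetyNetCloseableRegistry.
 */
final class ReaperRegistrySketch {
    private static final Object LOCK = new Object();
    private static Thread reaperThread; // shared by all registries in the JVM
    private static int registryCount;

    /** Buggy variant: the static field is assigned before start() and not reset on failure. */
    static void openBuggy(Supplier<Thread> threadFactory) {
        synchronized (LOCK) {
            if (registryCount == 0) {
                if (reaperThread != null) {
                    throw new IllegalStateException("reaper thread is already set");
                }
                reaperThread = threadFactory.get();
                reaperThread.start(); // if this throws, reaperThread stays non-null
            }
            registryCount++;
        }
    }

    /** One possible repair: publish the thread only after start() has succeeded. */
    static void openFixed(Supplier<Thread> threadFactory) {
        synchronized (LOCK) {
            if (registryCount == 0) {
                if (reaperThread != null) {
                    throw new IllegalStateException("reaper thread is already set");
                }
                Thread candidate = threadFactory.get();
                candidate.start(); // if this throws, the static state is untouched
                reaperThread = candidate;
            }
            registryCount++;
        }
    }

    /** Simulates a thread whose start() fails, as in the reported OutOfMemoryError. */
    static Thread failingThread() {
        return new Thread() {
            @Override
            public void start() {
                throw new OutOfMemoryError("unable to create new native thread (simulated)");
            }
        };
    }

    /** A daemon thread that starts normally and exits immediately. */
    static Thread workingThread() {
        Thread t = new Thread(() -> { });
        t.setDaemon(true);
        return t;
    }

    /** Clears the shared state so both variants can be tried in one JVM. */
    static void reset() {
        synchronized (LOCK) {
            reaperThread = null;
            registryCount = 0;
        }
    }
}
```

Calling {{openBuggy}} first with {{failingThread}} and then with {{workingThread}} throws an {{OutOfMemoryError}} on the first call and an {{IllegalStateException}} on the second, which mirrors the two stack traces in the issue. The same sequence against {{openFixed}} lets the second call succeed, because the static field is only written once {{start()}} has returned.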