[ 
https://issues.apache.org/jira/browse/FLINK-17645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107902#comment-17107902
 ] 

Zhu Zhu commented on FLINK-17645:
---------------------------------

[~sewen] I think this specific fix is valid because the root cause is that 
{{SafetyNetCloseableRegistry}} is left in an inconsistent state if any 
exception is thrown in its constructor. This happens because REAPER_THREAD is 
static and is not reset on unexpected failures, which are not a common case.
Once that happens, a SafetyNetCloseableRegistry can no longer be created, 
even if native thread creation would succeed later. Any exception thrown 
from {{REAPER_THREAD.start()}}, not only a native thread creation error, 
would lead to the same problem.
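
To illustrate the failure mode and the kind of reset described above, here is a minimal sketch. It is not Flink's actual code: the class {{RegistrySketch}}, its fields, the {{Supplier<Thread>}} parameter and the {{failingThread()}} helper are all hypothetical stand-ins, and the OutOfMemoryError is simulated by overriding {{start()}}.

```java
import java.util.function.Supplier;

/** Simplified, hypothetical stand-in for a registry that lazily starts one shared reaper thread. */
final class RegistrySketch {
    private static final Object LOCK = new Object();
    static Thread reaperThread;   // shared by all registry instances
    static int registryCount;     // number of successfully created registries

    RegistrySketch(Supplier<Thread> reaperFactory) {
        synchronized (LOCK) {
            if (registryCount == 0) {
                // Mirrors the state check that fails on every later attempt
                // if the static field is left set after a failed start().
                if (reaperThread != null) {
                    throw new IllegalStateException("reaper thread already set");
                }
                try {
                    reaperThread = reaperFactory.get();
                    reaperThread.start();
                } catch (Throwable t) {
                    // Reset the static field so that a later construction can retry.
                    reaperThread = null;
                    throw t;
                }
            }
            registryCount++;
        }
    }

    /** Hypothetical helper: a thread whose start() always fails, to simulate the OOM. */
    static Thread failingThread() {
        return new Thread() {
            @Override
            public void start() {
                throw new OutOfMemoryError("simulated: unable to create new native thread");
            }
        };
    }

    public static void main(String[] args) {
        try {
            new RegistrySketch(RegistrySketch::failingThread);
        } catch (OutOfMemoryError e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }
        // The second attempt is not blocked by the state check because the field was reset.
        new RegistrySketch(() -> {
            Thread t = new Thread(() -> { });
            t.setDaemon(true);
            return t;
        });
        System.out.println("second attempt succeeded, registries = " + registryCount);
    }
}
```

Without the assignment in the catch block, the second construction in this sketch would hit the state check and throw an IllegalStateException, which matches the repeated failure reported in the issue.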

Regarding "exit JVM when thread creation fails", I'm not sure whether a native 
thread creation failure is really unrecoverable.
If it is, I think we can force the JVM to shut down in such cases. But I do not 
yet have an idea of how we can apply that to all thread creation sites 
without changing each of them.

> REAPER_THREAD.start() in SafetyNetCloseableRegistry failed, causing the 
> repeated failover.
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17645
>                 URL: https://issues.apache.org/jira/browse/FLINK-17645
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.10.1, 1.11.0
>            Reporter: Zakelly Lan
>            Assignee: Lijie Wang
>            Priority: Major
>             Fix For: 1.11.0
>
>
> I'm running a modified version of Flink, and encountered the exception below 
> when a task starts:
> {code:java}
> 2020-05-12 00:46:19,037 ERROR [***] org.apache.flink.runtime.taskmanager.Task 
>   - Encountered an unexpected exception
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:802)
>         at 
> org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
>         at 
> org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>         at java.lang.Thread.run(Thread.java:834)
> 2020-05-12 00:46:19,038 INFO  [***] org.apache.flink.runtime.taskmanager.Task 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:802)
>         at 
> org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:73)
>         at 
> org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>         at java.lang.Thread.run(Thread.java:834)
> {code}
> REAPER_THREAD.start() fails because of the OOM, but REAPER_THREAD is never 
> reset to null. From then on, every SafetyNetCloseableRegistry initialization 
> in this JVM throws an IllegalStateException:
> {code:java}
> java.lang.IllegalStateException
>       at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
>       at 
> org.apache.flink.core.fs.SafetyNetCloseableRegistry.<init>(SafetyNetCloseableRegistry.java:71)
>       at 
> org.apache.flink.core.fs.FileSystemSafetyNet.initializeSafetyNetForThread(FileSystemSafetyNet.java:89)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:586)
>       at java.lang.Thread.run(Thread.java:834){code}
> This may happen in very old versions of Flink as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
