Thanks Yangze. Indeed, I see the following in the log about 10 seconds before
the final crash (I masked some sensitive data using `MASKED`):

2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN org.apache.flink.runtime.taskmanager.Task  - Task 'MASKED' did not react to cancelling signal for 30 seconds, but is stuck in method:
java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base@11.0.11/java.lang.Thread.run(Unknown Source)

2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
  at java.base/java.lang.Thread.run(Unknown Source)
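
For context, the two messages above look like a cancellation watchdog at work: the task is interrupted, it stays parked in `CompletableFuture.join()` (which does not respond to interrupts), and after a deadline the whole process is shut down. The 30-second and 180-second figures appear to match what I believe are the defaults of `task.cancellation.interval` (30000 ms) and `task.cancellation.timeout` (180000 ms). Below is a minimal sketch of that pattern in plain Java. It is not Flink's implementation; the class name `CancellationWatchdogSketch`, the method `cancelAndWatch`, and the handler parameter are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical illustration only; this is NOT Flink's Task or TaskCancelerWatchDog code.
public class CancellationWatchdogSketch {

    /**
     * Interrupts the task thread, waits up to timeoutMillis for it to exit,
     * and reports a fatal error if it is still alive afterwards.
     *
     * @return true if the thread exited within the timeout, false otherwise
     */
    static boolean cancelAndWatch(
            Thread taskThread, long timeoutMillis, Consumer<Throwable> fatalErrorHandler) {
        if (timeoutMillis <= 0) {
            // Thread.join(0) would wait forever, so reject non-positive timeouts.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        taskThread.interrupt();
        try {
            taskThread.join(timeoutMillis);
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and fall through to the liveness check.
            Thread.currentThread().interrupt();
        }
        if (taskThread.isAlive()) {
            fatalErrorHandler.accept(new RuntimeException(
                    "Task did not exit gracefully within " + timeoutMillis + " ms."));
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A future that nobody completes: join() on it parks the thread, and
        // CompletableFuture.join() keeps waiting even when the thread is interrupted.
        CompletableFuture<Void> neverCompleted = new CompletableFuture<>();
        Thread stuck = new Thread(neverCompleted::join, "stuck-task");
        stuck.setDaemon(true);
        stuck.start();

        boolean exited = cancelAndWatch(
                stuck, 500, error -> System.err.println("Fatal: " + error.getMessage()));
        System.out.println("Task exited in time: " + exited);

        // Release the parked thread so the demo ends cleanly.
        neverCompleted.complete(null);
    }
}
```

In this sketch the handler only logs. In a setup like the one in the log above, the equivalent step would terminate the process, which would explain a crash loop whose period is roughly the cancellation timeout plus restart time.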



On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <karma...@gmail.com> wrote:

> Hi, Abhishek,
>
> Do you see something like "Fatal error occurred while executing the
> TaskManager" in your log or would you like to provide the whole task
> manager log?
>
> Best,
> Yangze Guo
>
> On Tue, Aug 17, 2021 at 5:17 AM Abhishek Rai <abhis...@netspring.io>
> wrote:
> >
> > Hello,
> >
> > In our production environment, we run Flink 1.13 (Scala 2.11). It had been
> > working without issues for a while with a dozen or so jobs, but the Flink
> > taskmanager has now started crash looping, with a period of ~4 minutes per
> > crash. The stack trace is not very informative, so I am reaching out for
> > help; see below.
> >
> > The only other unusual thing is that, due to what might be a product issue
> > (custom job code running on Flink), some or all of our tasks are also in a
> > crash loop. Still, I wasn't expecting the taskmanager itself to die. Does
> > the taskmanager have a built-in feature that makes it crash if all or most
> > tasks are crashing?
> >
> > 2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Terminating TaskManagerRunner with exit code 1.
> > org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
> >   at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
> >   at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
> >   at java.base/java.security.AccessController.doPrivileged(Native Method)
> >   at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> >   at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> >   at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
> >   at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
> >   at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
> > Caused by: java.util.concurrent.TimeoutException: null
> >   at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
> >   at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
> >   at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
> >   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >   at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> >   at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >   at java.base/java.lang.Thread.run(Unknown Source)
> > 2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
> >
> >
> > Thanks very much!
> >
> > Abhishek
>
