[ https://issues.apache.org/jira/browse/FLINK-14544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961638#comment-16961638 ]
Congxian Qiu(klion26) edited comment on FLINK-14544 at 10/29/19 3:36 AM: ------------------------------------------------------------------------- from the log, found that the {{MailboxStateException}} throws out after {{Artificial failure in the second job.}} {code:java} 2019-10-25 20:46:09,345 INFO org.apache.flink.runtime.taskmanager.Task - FailureMapper (1/1) (466747dfea13738afd021da649dc53f4) switched from RUNNING to FAILED. java.lang.Exception: Artificial failure. at org.apache.flink.streaming.tests.FailureMapper.map(FailureMapper.java:59) at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:173) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:151) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:128) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:280) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:152) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:423) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:696) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:521) at java.lang.Thread.run(Thread.java:748) 2019-10-25 20:46:09,361 ERROR org.apache.flink.runtime.taskmanager.Task - Error while canceling task FailureMapper (1/1). java.util.concurrent.RejectedExecutionException: org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: Mailbox is in state CLOSED, but is required to be in state OPEN for put operations. at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:75) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.sendPriorityLetter(MailboxProcessor.java:176) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.allActionsCompleted(MailboxProcessor.java:172) at org.apache.flink.streaming.runtime.tasks.StreamTask.cancel(StreamTask.java:540) at org.apache.flink.runtime.taskmanager.Task.cancelInvokable(Task.java:1191) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:521) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: Mailbox is in state CLOSED, but is required to be in state OPEN for put operations. at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:199) at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putHeadInternal(TaskMailboxImpl.java:141) at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putFirst(TaskMailboxImpl.java:131) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:73) ... 7 more {code} so from currently analysis, the test `Resuming Externalized Checkpoint after terminal failure (file, async) ` complete checkpoint in job 1, restore from checkpoint completed in job 1, and complete more checkpoint in job2 , but the log contains {{MailboxStateException}}, so we see the test failed. * the first job complete checkpoints ** {{2019-10-25 20:46:08,614 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job 2d7c274d4561078c592df0bbb1dfad52 (156791 bytes in 367 ms).}} * trigger artifical exception * retore from the checkpoint completed by the previous job ** 2019-10-25 20:46:13,358 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 824b849f432dcffdeb0d18ab6b1f7d6c from savepoint [file:///home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-50753031208/externalized-chckpt-e2e-backend-dir/2d7c274d4561078c592df0bbb1dfad52/chk-2] () 2019-10-25 20:46:13,378 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Reset the checkpoint ID of job 824b849f432dcffdeb0d18ab6b1f7d6c to 3. 2019-10-25 20:46:13,378 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring job 824b849f432dcffdeb0d18ab6b1f7d6c from latest valid checkpoint: Checkpoint 2 @ 0 for 824b849f432dcffdeb0d18ab6b1f7d6c. * complete more new checkpoints in new job ** 2019-10-25 20:46:15,608 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3 for job 824b849f432dcffdeb0d18ab6b1f7d6c (160262 bytes in 1154 ms). 2019-10-25 20:46:15,655 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 4 @ 1572036375608 for job 824b849f432dcffdeb0d18ab6b1f7d6c. 2019-10-25 20:46:15,831 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 4 for job 824b849f432dcffdeb0d18ab6b1f7d6c (226751 bytes in 221 ms). * trigger {{Artificial failure}} and fail the job(the log above) As the analysis, I think we can change the test as * do not trigger artifical exception in the second job * ignore the {{MailboxStateException}} when finding exceptions in log, I think if there indeed an {{MailboxStateException}} need we attention, the test will be failed in previous stages before finding exception/error in logs. was (Author: klion26): from the log, found that the {{MailboxStateException}} throws out after {{Artificial failure in the second job.}} {code:java} 2019-10-25 20:46:09,345 INFO org.apache.flink.runtime.taskmanager.Task - FailureMapper (1/1) (466747dfea13738afd021da649dc53f4) switched from RUNNING to FAILED. java.lang.Exception: Artificial failure. at org.apache.flink.streaming.tests.FailureMapper.map(FailureMapper.java:59) at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:173) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:151) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:128) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:280) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:152) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:423) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:696) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:521) at java.lang.Thread.run(Thread.java:748) 2019-10-25 20:46:09,361 ERROR org.apache.flink.runtime.taskmanager.Task - Error while canceling task FailureMapper (1/1). java.util.concurrent.RejectedExecutionException: org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: Mailbox is in state CLOSED, but is required to be in state OPEN for put operations. at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:75) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.sendPriorityLetter(MailboxProcessor.java:176) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.allActionsCompleted(MailboxProcessor.java:172) at org.apache.flink.streaming.runtime.tasks.StreamTask.cancel(StreamTask.java:540) at org.apache.flink.runtime.taskmanager.Task.cancelInvokable(Task.java:1191) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:521) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: Mailbox is in state CLOSED, but is required to be in state OPEN for put operations. at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:199) at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putHeadInternal(TaskMailboxImpl.java:141) at org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putFirst(TaskMailboxImpl.java:131) at org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:73) ... 7 more {code} so currently, the conclusion is that, the test `Resuming Externalized Checkpoint after terminal failure (file, async) ` complete checkpoint in job 1, restore from checkpoint , and complete more checkpoint in job2 , but the log contains {{MailboxStateException}}, so we see the test failed. * the first job complete checkpoints {{2019-10-25 20:46:08,614 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job 2d7c274d4561078c592df0bbb1dfad52 (156791 bytes in 367 ms).}} * trigger artifical exception * retore from the checkpoint completed by the previous job ** 2019-10-25 20:46:13,358 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 824b849f432dcffdeb0d18ab6b1f7d6c from savepoint file:///home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-50753031208/externalized-chckpt-e2e-backend-dir/2d7c274d4561078c592df0bbb1dfad52/chk-2 () 2019-10-25 20:46:13,378 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Reset the checkpoint ID of job 824b849f432dcffdeb0d18ab6b1f7d6c to 3. 2019-10-25 20:46:13,378 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring job 824b849f432dcffdeb0d18ab6b1f7d6c from latest valid checkpoint: Checkpoint 2 @ 0 for 824b849f432dcffdeb0d18ab6b1f7d6c. * complete more new checkpoints in new job ** 2019-10-25 20:46:15,608 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3 for job 824b849f432dcffdeb0d18ab6b1f7d6c (160262 bytes in 1154 ms). 2019-10-25 20:46:15,655 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 4 @ 1572036375608 for job 824b849f432dcffdeb0d18ab6b1f7d6c. 2019-10-25 20:46:15,831 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 4 for job 824b849f432dcffdeb0d18ab6b1f7d6c (226751 bytes in 221 ms). * trigger {{Artificial failure}} and fail the job(the log above) As the analysis, I think we can change the test as * do not trigger artifical exception in the second job * ignore the {{MailboxStateException}} when finding exception in log, I think if there indeed an {{MailboxStateException}} need we attention, the test will be failed in previous stages before finding exception/error in logs? > Resuming Externalized Checkpoint after terminal failure (file, async) > end-to-end test fails on travis > ----------------------------------------------------------------------------------------------------- > > Key: FLINK-14544 > URL: https://issues.apache.org/jira/browse/FLINK-14544 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Tests > Affects Versions: 1.10.0 > Reporter: Yu Li > Priority: Blocker > Labels: test-stability > > From the log we could see below error message and then the job was terminated > due to job exceeded the maximum log length. > {code} > 2019-10-25 20:46:09,361 ERROR org.apache.flink.runtime.taskmanager.Task > - Error while canceling task FailureMapper (1/1). > java.util.concurrent.RejectedExecutionException: > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: > Mailbox is in state CLOSED, but is required to be in state OPEN for put > operations. > at > org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:75) > at > org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.sendPriorityLetter(MailboxProcessor.java:176) > at > org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.allActionsCompleted(MailboxProcessor.java:172) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.cancel(StreamTask.java:540) > at > org.apache.flink.runtime.taskmanager.Task.cancelInvokable(Task.java:1191) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:521) > at java.lang.Thread.run(Thread.java:748) > Caused by: > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxStateException: > Mailbox is in state CLOSED, but is required to be in state OPEN for put > operations. > at > org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:199) > at > org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putHeadInternal(TaskMailboxImpl.java:141) > at > org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.putFirst(TaskMailboxImpl.java:131) > at > org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxExecutorImpl.executeFirst(MailboxExecutorImpl.java:73) > ... 7 more > {code} > https://api.travis-ci.org/v3/job/602788586/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)