subject:"\[jira\] \[Comment Edited\] \(FLINK\-15105\) Resuming Externalized Checkpoint after terminal failure \(rocks, incremental\) end\-to\-end test stalls on travis"

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-12 Thread Chesnay Schepler (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994618#comment-16994618
 ] 

Chesnay Schepler edited comment on FLINK-15105 at 12/12/19 12:47 PM:
-

Since this tests results in error behavior, wouldn't the simplest solution be 
to disable the exception check for runs where failures are being simulated?
This would only affect the {{after terminal failure}} variant of the {{Resuming 
Externalized Checkpoint}} tests.

The correctness of the execution is verified by the test itself; checking that 
only very specific exceptions are being thrown in case of an error seems a bit 
strict and a contract we realistically can't enforce.


was (Author: zentol):
Since this tests results in error behavior, wouldn't the simplest solution be 
to disable the exception check for runs where failures are being simulated?
This would only affect the {{after terminal failure}} variant of the tests.

The correctness of the execution is verified by the test itself; also checking 
that only very specific exceptions are being thrown in case of an error seems a 
bit strict and a contract we realistically can't enforce.

> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test stalls on travis
> -
>
> Key: FLINK-15105
> URL: https://issues.apache.org/jira/browse/FLINK-15105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test fails on release-1.9 nightly build stalls with "The job 
> exceeded the maximum log length, and has been terminated".
> https://api.travis-ci.org/v3/job/621090394/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-11 Thread Congxian Qiu(klion26) (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993554#comment-16993554
 ] 

Congxian Qiu(klion26) edited comment on FLINK-15105 at 12/11/19 1:52 PM:
-

First, answer the last question: we can't just remove "error" message in 
{{RuntimeException}},  we'll fail in 
{{common.sh#}}{{check_logs_for_exceptions()}} because of the 
{{RuntimeException}}.

Then I'll try to describe more about the things about {{FailureMapper}}.
 # {{FailureMapper is only used in {{DataStreamAllroundTestProgram.
 # we'll add a {{FailureMapper}} in {{DataStreamAllroundTestProgram only if we 
[enabled 
TEST_SIMULATE_FAILURE|https://github.com/apache/flink/blob/eddad99123525211c900102206384dacaf8385fc/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/DataStreamAllroundTestProgram.java#L173]}}
 in {{DataStreamAllroundTestProgram}}
 # {{We'll throw Exception in {{FailureMapper#map and 
{{FailureMapper#notifyCheckpointComplete}}
 # {{we'll enable {{TEST_SIMULATE_FAILURE}} }} in {{test_ha_datastream.sh}}, 
{{test_ha_per_job_cluster_datastream.sh}} and 
{{test_resume_externalized_checkpoints.sh}}

IIUC, all the above tests are wanna test whether the job can restore 
from(restore with checkpoint) the last failed job successfully(but we do not 
care where the exception come from, then Exception thrown from 
FailureMapper#mapper or FailureMapper#notifyCheckpointComplete have the same 
effect). If we want to verify that `failure of notifyCheckpointComplete can 
fail task`, maybe we can add a ut for it.

 

 


was (Author: klion26):
First, answer the last question: we can't just remove "error" message in 
{{RuntimeException}},  we'll fail in 
{{common.sh#}}{{check_logs_for_exceptions()}} because of the 
{{RuntimeException}}.

Then I'll try to describe more about the things about {{FailureMapper}}.
 # {{FailureMapper is only used in {{DataStreamAllroundTestProgram.
 # we'll add a {{FailureMapper}} in {{DataStreamAllroundTestProgram only if we 
[enabled 
TEST_SIMULATE_FAILURE|https://github.com/apache/flink/blob/eddad99123525211c900102206384dacaf8385fc/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/DataStreamAllroundTestProgram.java#L173]}}
 in {{DataStreamAllroundTestProgram}}
 # {{We'll throw Exception in {{FailureMapper#map and 
{{FailureMapper#notifyCheckpointComplete}}
 # {{we'll enable {{TEST_SIMULATE_FAILURE}} }} in {{test_ha_datastream.sh}}, 
{{test_ha_per_job_cluster_datastream.sh}} and 
{{test_resume_externalized_checkpoints.sh}}

IIUC, all the above tests are wanna test whether the job can restore 
from(restore with checkpoint) the last failed job successfully(but we do not 
care where the exception come from, then Exception thrown from 
FailureMapper#mapper or FailureMapper#notifyCheckpointComplete have the same 
effect, please correct me if I miss anything here). If we want to verify that 
`failure of notifyCheckpointComplete can fail task`, maybe we can add a ut for 
it.

 

 

> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test stalls on travis
> -
>
> Key: FLINK-15105
> URL: https://issues.apache.org/jira/browse/FLINK-15105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test fails on release-1.9 nightly build stalls with "The job 
> exceeded the maximum log length, and has been terminated".
> https://api.travis-ci.org/v3/job/621090394/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-11 Thread Congxian Qiu(klion26) (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993554#comment-16993554
 ] 

Congxian Qiu(klion26) edited comment on FLINK-15105 at 12/11/19 1:46 PM:
-

First, answer the last question: we can't just remove "error" message in 
{{RuntimeException}},  we'll fail in 
{{common.sh#}}{{check_logs_for_exceptions()}} because of the 
{{RuntimeException}}.

Then I'll try to describe more about the things about {{FailureMapper}}.
 # {{FailureMapper is only used in {{DataStreamAllroundTestProgram.
 # we'll add a {{FailureMapper}} in {{DataStreamAllroundTestProgram only if we 
[enabled 
TEST_SIMULATE_FAILURE|https://github.com/apache/flink/blob/eddad99123525211c900102206384dacaf8385fc/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/DataStreamAllroundTestProgram.java#L173]}}
 in {{DataStreamAllroundTestProgram}}
 # {{We'll throw Exception in {{FailureMapper#map and 
{{FailureMapper#notifyCheckpointComplete}}
 # {{we'll enable {{TEST_SIMULATE_FAILURE}} }} in {{test_ha_datastream.sh}}, 
{{test_ha_per_job_cluster_datastream.sh}} and 
{{test_resume_externalized_checkpoints.sh}}

IIUC, all the above tests are wanna test whether the job can restore 
from(restore with checkpoint) the last failed job successfully(but we do not 
care where the exception come from, then Exception thrown from 
FailureMapper#mapper or FailureMapper#notifyCheckpointComplete have the same 
effect, please correct me if I miss anything here). If we want to verify that 
`failure of notifyCheckpointComplete can fail task`, maybe we can add a ut for 
it.

 

 


was (Author: klion26):
First, answer the last question: we can't just remove "error" message in 
{{RuntimeException}},  we'll fail in 
{{common.sh#}}{{check_logs_for_exceptions()}} because of the 
{{RuntimeException}}.

Then I'll try to describe more about the things about {{FailureMapper}}.
 # {{FailureMapper is only used in {{DataStreamAllroundTestProgram.
 # we'll add a {{FailureMapper}} in {{DataStreamAllroundTestProgram only }}if 
we [enabled 
TEST_SIMULATE_FAILURE|https://github.com/apache/flink/blob/eddad99123525211c900102206384dacaf8385fc/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/DataStreamAllroundTestProgram.java#L173]{{}}
 in {{DataStreamAllroundTestProgram}}
 # {{We'll throw Exception in }}{{FailureMapper#map}} and 
{{FailureMapper#notifyCheckpointComplete}}
 # {{we'll enable }}{{TEST_SIMULATE_FAILURE}} in {{test_ha_datastream.sh}}, 
{{test_ha_per_job_cluster_datastream.sh}} and 
{{test_resume_externalized_checkpoints.sh}}

IIUC, all the above tests are wanna test whether the job can restore 
from(restore with checkpoint) the last failed job successfully(but we do not 
care where the exception come from, then Exception thrown from 
FailureMapper#mapper or FailureMapper#notifyCheckpointComplete have the same 
effect, please correct me if I miss anything here). If we want to verify that 
`failure of notifyCheckpointComplete can fail task`, maybe we can add a ut for 
it.


 

 

> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test stalls on travis
> -
>
> Key: FLINK-15105
> URL: https://issues.apache.org/jira/browse/FLINK-15105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test fails on release-1.9 nightly build stalls with "The job 
> exceeded the maximum log length, and has been terminated".
> https://api.travis-ci.org/v3/job/621090394/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-10 Thread Congxian Qiu(klion26) (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992673#comment-16992673
 ] 

Congxian Qiu(klion26) edited comment on FLINK-15105 at 12/10/19 4:09 PM:
-

[~trohrmann]  In the previous comment, I didn't want to remove the whole 
{{FailureMapper}}, but just want to remove the {{Artificial failure}} throwing 
statement in {{FailureMapper}}#{{notifyCheckpointComplete just as the comment 
in the below code block.}}
{code:java}
public T map(T value) throws Exception {
   numProcessedRecords++;

   if (isReachedFailureThreshold()) {
  throw new Exception("Artificial failure.");
   }

   return value;
}

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   numCompleteCheckpoints++;

   if (isReachedFailureThreshold()) { // === just want to remove 
this ===
  throw new Exception("Artificial failure.");
   }
}
{code}
I think the problem here is that we throw Artifical failure when completing 
checkpoint

After throwing {{Artificial failure}} in 
{{FailureMapper#notifyCheckpointComplete}}

---> we got the following exception(attached below)

---> test failed when {{check_logs_for_errors}} using the commands in 
{{common.sh}}.
{code:java}
java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:822)
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1200)
... 5 more

{code}
Remove the {{Artificial failure throwing}} in 
{{FailureMapper#notifyCheckpointComplete, we can still throw }}{{Aritifical 
failure}} in {{FailureMapper#notifyCheckpointComplete}}, IMHO, the {{Artificial 
failure throwing}} is just needed when the source is finite, but in the test 
job, we use 
[SequenceGeneratorSource|https://github.com/apache/flink/blob/171020749f7fccfa7781563569e2c88ea5e8b6a1/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/SequenceGeneratorSource.java#L102],
 it is an infinite source.


was (Author: klion26):
[~trohrmann]  In the previous comment, I'm not want to remove the whole 
{{FailureMapper}}, but just want to remove the {{Artificial failure}} throwing 
statement in {{FailureMapper}}#{{notifyCheckpointComplete just as the comment 
in the below code block.}}
{code:java}
public T map(T value) throws Exception {
   numProcessedRecords++;

   if (isReachedFailureThreshold()) {
  throw new Exception("Artificial failure.");
   }

   return value;
}

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   numCompleteCheckpoints++;

   if (isReachedFailureThreshold()) { // === just want to remove 
this ===
  throw new Exception("Artificial failure.");
   }
}
{code}
I think the problem here is that we throw Artifical failure when completing 
checkpoint

After throwing {{Artificial failure}} in 
{{FailureMapper#notifyCheckpointComplete}}

---> we got the following exception(attached below)

---> test failed when {{check_logs_for_errors}} using the commands in 
{{common.sh}}.
{code:java}
java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:822)
at org.

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-10 Thread Congxian Qiu(klion26) (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992673#comment-16992673
 ] 

Congxian Qiu(klion26) edited comment on FLINK-15105 at 12/10/19 3:49 PM:
-

[~trohrmann]  In the previous comment, I'm not want to remove the whole 
{{FailureMapper}}, but just want to remove the {{Artificial failure}} throwing 
statement in {{FailureMapper}}#{{notifyCheckpointComplete just as the comment 
in the below code block.}}
{code:java}
public T map(T value) throws Exception {
   numProcessedRecords++;

   if (isReachedFailureThreshold()) {
  throw new Exception("Artificial failure.");
   }

   return value;
}

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   numCompleteCheckpoints++;

   if (isReachedFailureThreshold()) { // === just want to remove 
this ===
  throw new Exception("Artificial failure.");
   }
}
{code}
I think the problem here is that we throw Artifical failure when completing 
checkpoint

After throwing {{Artificial failure}} in 
{{FailureMapper#notifyCheckpointComplete}}

---> we got the following exception(attached below)

---> test failed when {{check_logs_for_errors}} using the commands in 
{{common.sh}}.
{code:java}
java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:822)
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1200)
... 5 more

{code}
Remove the {{Artificial failure throwing}} in 
{{FailureMapper#notifyCheckpointComplete, we can still throw }}{{Aritifical 
failure}} in {{FailureMapper#notifyCheckpointComplete}}, IMHO, the {{Artificial 
failure throwing}} is just needed when the source is finite, but in the test 
job, we use 
[SequenceGeneratorSource|https://github.com/apache/flink/blob/171020749f7fccfa7781563569e2c88ea5e8b6a1/flink-end-to-end-tests/flink-datastream-allround-test/src/main/java/org/apache/flink/streaming/tests/SequenceGeneratorSource.java#L102],
 it is an infinite source.


was (Author: klion26):
[~trohrmann]  In the previous comment, I'm not want to remove the whole 
{{FailureMapper}}, but just want to remove the {{Artificial failure}} throwing 
statement in {{FailureMapper}}#{{notifyCheckpointComplete just as the comment 
in the below code block.}}

I think the problem here is that we throw Artifical failure when completing 
checkpoint(we'll throw Artifical failure in two places  in {{FailureMapper}})
{code:java}
public T map(T value) throws Exception {
   numProcessedRecords++;

   if (isReachedFailureThreshold()) {
  throw new Exception("Artificial failure.");
   }

   return value;
}

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   numCompleteCheckpoints++;

   if (isReachedFailureThreshold()) { // === just want to remove 
this ===
  throw new Exception("Artificial failure.");
   }
}
{code}
After throwing {{Artificial failure}} in 
{{FailureMapper#notifyCheckpointComplete}}

---> we got the following exception(attached below)

---> test failed when {{check_logs_for_errors}} using the commands in 
{{common.sh}}.
{code:java}
java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at 
org.apache.flink.streaming.runtime.tasks.Stream

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-08 Thread Congxian Qiu(klion26) (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991179#comment-16991179
 ] 

Congxian Qiu(klion26) edited comment on FLINK-15105 at 12/9/19 6:38 AM:


The test complete checkpoint successfully in the first job, and resumed from 
the checkpoint successfully in the second job, and can complete checkpoint in 
the seconde job successfully, 
{code:java}
// log for first job
2019-12-05 20:12:17,410 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
SlidingWindowOperator (1/2) (5a5c73dd041a0145bc02dc017e46bf1f) switched from DE
PLOYING to RUNNING.
2019-12-05 20:12:17,970 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 1 @ 1575576737956 for job 39a292088648857cac5f7e110547c18
0.
2019-12-05 20:12:21,095 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed 
checkpoint 1 for job 39a292088648857cac5f7e110547c180 (261564 bytes in 3114 ms).
2019-12-05 20:12:21,113 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering 
checkpoint 2 @ 1575576741094 for job 39a292088648857cac5f7e110547c180.
2019-12-05 20:12:22,002 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - FailureMapper 
(1/1) (4d273da136346ef3ff6e1a54d197f00b) switched from RUNNING to
 FAILED.
java.lang.RuntimeException: Error while confirming checkpoint
        at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
        at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
        at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:822)
        at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1200)
        ... 5 more
2019-12-05 20:12:22,014 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding 
checkpoint 2 of job 39a292088648857cac5f7e110547c180.
java.lang.RuntimeException: Error while confirming checkpoint
        at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Artificial failure.
        at 
org.apache.flink.streaming.tests.FailureMapper.notifyCheckpointComplete(FailureMapper.java:70)
        at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:822)
        at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1200)
        ... 5 more

// log for second job
2019-12-05 20:12:27,190 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting job 
7c862506012fb04c0d565bfda7cc9595 from savepoint 
file:///home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-02075534631/externalized-chckpt-e2e-backend-dir/39a292088648857cac5f7e110547c180/chk-1
 ()
2019-12-05 20:12:27,213 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Reset the 
checkpoint ID of job 7c862506012fb04c0d565bfda7cc9595 to 2.
2019-12-05 20:12:27,220 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring job 
7c862506012fb04c0d565bfda7cc9595 from latest valid checkpoint: Checkpoint 1 @ 0 
for 7c862506012fb04c0d565bfda7cc9595.
2019-12-05 20:12:27,231 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state 
to restore
2019-12-05 20:12:27,232 INFO  
org.apache.flink.runtime.jobmaster.JobManagerRunner           - JobManager 
runner for job General purpose test job (7c862506012fb04c0d565bfda7cc9595) was 
granted leadership with session id ---- at 
akka.tcp://flink@localhost:6123/user/jobmanager_1.
2019-12-05 20:12:27,233 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
            - Starting execution of job

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

[jira] [Comment Edited] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

6 matches

Site Navigation

Mail list logo

Footer information