Matthias Pohl created FLINK-26391:
-------------------------------------

             Summary: Release Testing: Application Mode recovery does not 
re-trigger a job which failed during cleanup (FLINK-11813)
                 Key: FLINK-26391
                 URL: https://issues.apache.org/jira/browse/FLINK-26391
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.15.0
            Reporter: Matthias Pohl
             Fix For: 1.15.0


FLINK-11813 is about Flink not being able to determine whether a job had 
already terminated globally before a failover happened. This behavior can be 
tested by running a job in HA mode, which enables the file-based 
{{JobResultStore}} (JRS).

You can set 
[job-result-store.storage-path|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-storage-path]
 to a directory which you can access. Setting 
[job-result-store.delete-on-commit|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-delete-on-commit]
 to {{false}} keeps the JRS artifacts from being deleted after a job has finished.
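
A minimal sketch of the relevant configuration, assuming a ZooKeeper-based HA setup. The quorum address and both paths are hypothetical placeholders and need to be adapted to your environment:

```yaml
# HA must be enabled for the file-based JobResultStore to be used.
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181    # placeholder address
high-availability.storageDir: file:///tmp/flink-ha    # placeholder path

# Directory the JRS artifacts are written to (placeholder path).
job-result-store.storage-path: file:///tmp/flink-jrs
# Keep the artifacts after the job has been cleaned up.
job-result-store.delete-on-commit: false
```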

Run a job to completion to generate the JRS artifact for this job in the 
specified directory. Renaming the generated file from {{<job-id>.json}} to 
{{<job-id>_DIRTY.json}} simulates the job not having been cleaned up properly. 
Starting the job in application mode once more (specifying the same Job ID) 
should not start the job again (you might want to enable {{debug}} logging to 
verify this in the logs), i.e.:
* Cleanup should be performed.
* No JobMaster-related logs should appear in the Flink logs.
* Cleanup-related logs should appear in the Flink logs.
* At the end, the {{_DIRTY}} suffix should have been removed from the JRS 
artifact again, i.e. the file is named {{<job-id>.json}} once more.
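
The rename step and the final check can be scripted. The following is only a sketch under the assumption that the JRS artifacts sit directly in a locally accessible directory. The helper names {{mark_dirty}} and {{entry_state}} are made up for illustration and are not part of Flink:

```python
from pathlib import Path

CLEAN_SUFFIX = ".json"
DIRTY_SUFFIX = "_DIRTY.json"


def mark_dirty(jrs_dir, job_id):
    """Rename <job-id>.json to <job-id>_DIRTY.json and return the new path.

    This simulates a job whose cleanup did not complete.
    """
    jrs_dir = Path(jrs_dir)
    clean = jrs_dir / f"{job_id}{CLEAN_SUFFIX}"
    dirty = jrs_dir / f"{job_id}{DIRTY_SUFFIX}"
    if not clean.is_file():
        raise FileNotFoundError(f"no clean JRS artifact at {clean}")
    clean.rename(dirty)
    return dirty


def entry_state(jrs_dir, job_id):
    """Return 'dirty', 'clean' or 'missing' for the given job's JRS artifact."""
    jrs_dir = Path(jrs_dir)
    # Check the dirty name first so a leftover dirty file is never overlooked.
    if (jrs_dir / f"{job_id}{DIRTY_SUFFIX}").is_file():
        return "dirty"
    if (jrs_dir / f"{job_id}{CLEAN_SUFFIX}").is_file():
        return "clean"
    return "missing"
```

Call {{mark_dirty(<jrs-dir>, <job-id>)}} after the first run has finished. Once the job has been resubmitted in application mode and cleanup is done, {{entry_state(<jrs-dir>, <job-id>)}} is expected to return {{clean}} again.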



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
