Martijn Visser created FLINK-39917:
--------------------------------------

             Summary: JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails: "Disconnect job manager" log assertion races the async JM->RM disconnect
                 Key: FLINK-39917
                 URL: https://issues.apache.org/jira/browse/FLINK-39917
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination, Tests
            Reporter: Martijn Visser
            Assignee: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
 (leg: test_cron_azure tests)

{code}
  06:17:51.991 [ERROR] org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails(ClusterClient) -- Time elapsed: 0.394 s <<< FAILURE!
  java.lang.AssertionError:
  [not all expected events logged by org.apache.flink.runtime.resourcemanager.StandaloneResourceManager, logged:
  [... Message=Registering job manager ..., ... Message=Registered job manager ...]]
  Expecting empty but was: [Disconnect job manager .*]
        at org.apache.flink.util.JobIDLoggingUtil.assertKeyPresent(JobIDLoggingUtil.java:98)
        at org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.verifyJobIdIsLogged(JobMasterTriggerSavepointITCase.java:280)
  {code}

Root cause: {{waitForDisconnect}} cancels the job and waits for the 
client-visible {{CANCELED}} status, then {{verifyJobIdIsLogged}} asserts that 
{{StandaloneResourceManager}} logged "Disconnect job manager ...". However, the 
JobMaster disconnects from the ResourceManager asynchronously during shutdown, 
*after* the job reports {{CANCELED}}, so the assertion can run before the 
disconnect has been logged. The run logs confirm the window: the job reached 
CANCELED at 06:17:51,115, the JobMaster began stopping at 06:17:51,136, and the 
assertion ran in between, capturing only the "Registering/Registered job 
manager" events.

This is not the same failure as FLINK-37821 (closed), which addressed a 
different signal in this test.

Proposed fix: in {{waitForDisconnect}}, after the CANCELED wait, additionally 
wait until the RM has actually logged the disconnect event before returning. No 
assertion change.
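
A minimal sketch of the kind of wait the proposed fix describes, not the actual 
patch. {{LogEventWaiter}} and {{waitForLoggedEvent}} are hypothetical names, and 
the sketch assumes the test can see the captured RM log messages as a list that 
is safe to read while the logging thread appends to it; how the test really 
captures those events is not shown in this ticket.

```java
import java.time.Duration;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of Flink's test utilities.
final class LogEventWaiter {

    private LogEventWaiter() {}

    /**
     * Polls capturedMessages until one contains a match for expected, or the timeout elapses.
     * Assumption: capturedMessages is safe to iterate while another thread appends to it
     * (for example a CopyOnWriteArrayList).
     *
     * @return true if a matching message was seen, false on timeout or interruption
     */
    static boolean waitForLoggedEvent(
            List<String> capturedMessages, Pattern expected, Duration timeout) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            for (String message : capturedMessages) {
                if (expected.matcher(message).find()) {
                    return true;
                }
            }
            // Overflow-safe deadline check on the monotonic clock.
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // short back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Under these assumptions, {{waitForDisconnect}} would call something like 
{{waitForLoggedEvent(messages, Pattern.compile("Disconnect job manager .*"), timeout)}} 
after the CANCELED wait, so {{verifyJobIdIsLogged}} only runs once the 
disconnect event is present (or the wait has timed out).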



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
