shunping opened a new pull request, #36235: URL: https://github.com/apache/beam/pull/36235
The Precommit Go test has been flaky (https://github.com/apache/beam/actions/runs/17922206057/job/50959740198). The failure comes from `TestServer_RunThenCancel`, with the following error message:

```
server_test.go:142: server.GetState() = RUNNING, want CANCELLED
```

After some investigation, I found that it is caused by a race condition between the main thread and the server goroutine. The cancel call is made after the server starts to run.

https://github.com/apache/beam/blob/08b0572d54c47654e1378fb5c00e884714202a33/sdks/go/pkg/beam/runners/prism/internal/jobservices/server_test.go#L114-L124

The race condition occurs in the following order:

- The server goroutine starts running in the call to `undertest.Run()` and gets as far as setting the state to RUNNING.
- In the main thread, we trigger a cancel request by calling `undertest.Cancel`.
- However, before `job.CancelFn(ErrCancel)` is called, the server goroutine finishes the check on Line 89 and returns.

  https://github.com/apache/beam/blob/08b0572d54c47654e1378fb5c00e884714202a33/sdks/go/pkg/beam/runners/prism/internal/jobservices/server_test.go#L86-L94
- When we then read the server state at the end of the test, it is still RUNNING.

In my fix, I add logic to wait until the context is done before checking the cause.

---

Furthermore, even after fixing the previous race condition, there is another one: when `undertest.Cancel` is called, the job is not guaranteed to have reached the RUNNING state. So additional logic is needed to wait for the RUNNING state before triggering the cancel.
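The two synchronization points described above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR or from `server_test.go`: the `job` type, the `started` channel, `errCancel`, and `runThenCancel` are hypothetical names introduced only for illustration, and the real test may wait for RUNNING in a different way (for example by polling the state). The sketch assumes Go 1.20 or later for `context.WithCancelCause`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical job states, loosely mirroring the ones named in the PR text.
const (
	stateStopped int32 = iota
	stateRunning
	stateCancelled
	stateDone
)

// errCancel stands in for the ErrCancel cause mentioned in the PR (assumed name).
var errCancel = errors.New("pipeline cancelled")

// job is a stand-in for the real job object; all fields are assumptions.
type job struct {
	state    atomic.Int32
	cancelFn context.CancelCauseFunc
	started  chan struct{} // closed once the job has reached RUNNING
}

// run plays the role of the server goroutine. It blocks on ctx.Done()
// before inspecting the cause, so it cannot return early while a cancel
// request is still in flight (the first race described above).
func (j *job) run(ctx context.Context, finished chan<- struct{}) {
	defer close(finished)
	j.state.Store(stateRunning)
	close(j.started)

	<-ctx.Done() // wait for the context to finish instead of peeking at it
	if errors.Is(context.Cause(ctx), errCancel) {
		j.state.Store(stateCancelled)
		return
	}
	j.state.Store(stateDone)
}

// runThenCancel plays the role of the test body. It waits for RUNNING
// before cancelling (the second race described above), then waits for the
// goroutine to exit before reading the final state.
func runThenCancel() string {
	ctx, cancel := context.WithCancelCause(context.Background())
	j := &job{cancelFn: cancel, started: make(chan struct{})}
	finished := make(chan struct{})

	go j.run(ctx, finished)

	<-j.started           // do not cancel until the job is RUNNING
	j.cancelFn(errCancel) // analogous to job.CancelFn(ErrCancel)
	<-finished

	switch j.state.Load() {
	case stateRunning:
		return "RUNNING"
	case stateCancelled:
		return "CANCELLED"
	case stateDone:
		return "DONE"
	}
	return "STOPPED"
}

func main() {
	fmt.Println(runThenCancel()) // prints "CANCELLED"
}
```

With both waits in place, the final state no longer depends on how the scheduler interleaves the two goroutines: the cancel always happens after RUNNING is set, and the cause is always read after the context is done.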
