Xiao-zhen-Liu commented on issue #3880: URL: https://github.com/apache/texera/issues/3880#issuecomment-3420404322
There are two kinds of flaky test behavior, both related to e2e tests.

- One is for `DataProcessingSpec`. This test suite contains 10 test cases, and each sets a 1-minute timeout. In the CI environment there is a chance that one of the 10 test cases times out.
- The other is for `PauseSpec`. This appeared in [PR 3913](https://github.com/apache/texera/actions/runs/18610057938/job/53066773439?pr=3913). The test cases in `PauseSpec` do not have internal timeouts, which seems to cause the test job to hang until it reaches GitHub CI’s 6-hour timeout.

**Cause**: I think both behaviors have the same cause. When we run these e2e tests, we execute a workflow and wait for the controller to send the workflow’s `COMPLETED` status. Somehow, in the CI environment, there is a chance that this message is not sent (or not received), which causes the wait to either hang (for `PauseSpec`) or trigger our internal timeout (for `DataProcessingSpec`).

**Reproducibility**: This seems to happen only in the CI environment. @aglinxinyuan mentioned that he ran the tests locally 10,000 times and could not reproduce the timeout.

**Origin**: I think this issue is much older than the recent failures suggest, and it may have existed since these test cases were created. It happens infrequently (maybe 1 in 50 CI runs), and rerunning the CI usually solves the problem. I found a [6-hour timeout](https://github.com/apache/texera/actions/runs/13732392069) in a CI run on the master branch in March this year, and the other flaky CI failure may date back as early as [Sep 2024](https://github.com/apache/texera/actions/runs/10806203712) (earlier runs have been garbage-collected).

**Short-term Solution**: As this problem may be very hard to reproduce or debug, I suggest that for all of these test cases we keep enforcing the 1-minute internal timeout and add a retry mechanism, as rerunning the test case always solves the issue. This will stabilize the CI.
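
To make the short-term suggestion concrete, here is a minimal sketch of "bounded wait for `COMPLETED`, then retry". It is written in Python only for illustration and is not the project's test code. Everything in it is an assumption: `run_workflow` is a hypothetical callable that starts a fresh execution and returns a queue of status strings, and `WorkflowTimeout`, `await_completed`, and `run_with_retry` are made-up names.

```python
import queue
import time


class WorkflowTimeout(Exception):
    """Raised when COMPLETED is not observed within the timeout (hypothetical)."""


def await_completed(status_queue, timeout_s):
    """Block until "COMPLETED" arrives on status_queue, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise WorkflowTimeout("COMPLETED not received in time")
        try:
            # Never wait longer than the time left for this attempt.
            status = status_queue.get(timeout=remaining)
        except queue.Empty:
            raise WorkflowTimeout("COMPLETED not received in time")
        if status == "COMPLETED":
            return
        # Any other status (e.g. "RUNNING") is ignored and we keep waiting.


def run_with_retry(run_workflow, timeout_s=60.0, max_attempts=3):
    """Run the workflow up to max_attempts times; return the attempt that succeeded.

    run_workflow is a hypothetical callable that starts a new execution and
    returns a queue.Queue of status messages for that execution.
    """
    for attempt in range(1, max_attempts + 1):
        status_queue = run_workflow()
        try:
            await_completed(status_queue, timeout_s)
            return attempt
        except WorkflowTimeout:
            if attempt == max_attempts:
                raise  # every attempt timed out: report a real failure
```

With this shape, a single lost `COMPLETED` message costs at most one extra `timeout_s`, and a test that never completes fails after `max_attempts * timeout_s` instead of hanging the job. The sketch does not clean up the abandoned execution after a timeout; a real test suite would need to do that before retrying.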

**Permanent Solution**: We should try to reproduce this issue, possibly in an Ubuntu environment, and locate the root cause of the `COMPLETED` status not being sent for these test cases.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
