Baunsgaard opened a new pull request, #2492: URL: https://github.com/apache/systemds/pull/2492
The **.component.c**.** Java test job still intermittently runs until the 30-minute GitHub Actions cap with no further output. A surefire fork stalls in a way that surefire's own timeouts never catch (a fork wedged around the booter handshake, or a starved Maven parent), so neither forkedProcessTimeoutInSeconds nor forkedProcessExitTimeoutInSeconds fires and the job is cancelled with nothing to diagnose. The stall does not reproduce locally, so the only place to capture evidence is CI.

Add an outer guard in the docker test entrypoint that enforces two limits: a stall window on the test log (no new line for a period kept just above the 600s per-fork surefire timeout) and an absolute runtime ceiling below the job cap. On either trigger, the guard force-dumps thread stacks from every JVM in the test process tree via SIGQUIT (relayed into the job log), writes a jstack file as a backup, then force-kills the tree so the job fails fast with stacks instead of being cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME.

Also set surefire runOrder to alphabetical so the hang reproduces at a stable class boundary across runs, making the responsible class identifiable from the captured dumps.
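For readers who want a concrete picture of the guard described above, the following is a minimal sketch and not the code in this pull request. Only SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME come from the description. The function names, the default values (660s and 1500s), the dump directory, and the polling interval are assumptions made for illustration. It also assumes a Linux container with /proc, GNU stat, and pgrep available.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog sketch. Defaults below are assumed, not taken from the PR.
STALL_LIMIT="${SYSDS_TEST_STALL_LIMIT:-660}"   # assumed: just above the 600s per-fork surefire timeout
MAX_RUNTIME="${SYSDS_TEST_MAX_RUNTIME:-1500}"  # assumed: below the 30-minute (1800s) job cap

# Print "ceiling", "stall", or "ok" for a given log idle time and elapsed time (seconds).
watchdog_verdict() {
  local idle="$1" elapsed="$2"
  if [ "$elapsed" -ge "$MAX_RUNTIME" ]; then echo ceiling
  elif [ "$idle" -ge "$STALL_LIMIT" ]; then echo stall
  else echo ok
  fi
}

# Print the given PID followed by all of its descendants, one per line.
process_tree() {
  local pid="$1" child
  echo "$pid"
  for child in $(pgrep -P "$pid" 2>/dev/null); do
    process_tree "$child"
  done
}

# Ask every JVM in the tree for a thread dump, then force-kill the whole tree.
dump_and_kill() {
  local root="$1" outdir="${2:-/tmp/stall-dumps}" pid pids
  mkdir -p "$outdir"
  pids="$(process_tree "$root")"
  for pid in $pids; do
    if [ "$(cat "/proc/$pid/comm" 2>/dev/null)" = "java" ]; then
      kill -QUIT "$pid" 2>/dev/null   # the JVM writes its thread dump to its own stdout
      if command -v jstack >/dev/null 2>&1; then
        jstack "$pid" > "$outdir/jstack-$pid.txt" 2>&1   # file backup of the same stacks
      fi
    fi
  done
  sleep 5   # assumed grace period so the dumps can be flushed
  for pid in $pids; do kill -KILL "$pid" 2>/dev/null; done
}

# Poll until the test process exits or one of the two limits trips.
watch_tests() {
  local root="$1" log="$2" start now idle verdict
  start="$(date +%s)"
  while kill -0 "$root" 2>/dev/null; do
    sleep "${WATCH_INTERVAL:-10}"
    now="$(date +%s)"
    idle=$(( now - $(stat -c %Y "$log" 2>/dev/null || echo "$now") ))
    verdict="$(watchdog_verdict "$idle" "$(( now - start ))")"
    if [ "$verdict" != ok ]; then
      echo "watchdog: $verdict limit hit, dumping stacks and killing tree of $root"
      dump_and_kill "$root"
      return 1
    fi
  done
}
```

An entrypoint could start the test command in the background with its output redirected to a log file, then call `watch_tests "$!" "$log"` and fail the job on a non-zero return. The sketch checks the ceiling before the stall window so that a run hitting both limits is reported as a ceiling hit. How the SIGQUIT output reaches the job log depends on how the fork's stdout is wired up, which the sketch does not address.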
