On Wed, 17 Mar 2021 17:30:25 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:
>> Arbitrary time out has been a reliable source of intermittent failures.
>>
>> Since we have spent a lot of time analyzing this failure, I think it's
>> worthwhile to fix it properly, which doesn't seem that complicated. That's
>> better than the same bug happening again a year later and a different set of
>> people would spend hours to analyze it again.
>
> I don't think this is CPU starvation but memory exhaustion. _beginthreadex
> fails with EACCES if it has no resources to start the thread, which in this
> case probably means memory (the other possibility would be
> out-of-HANDLE-space but seeing that the child just started I don't see how
> this could be).
>
> Should we harden tests against resource starvation like this, or rather
> require the test machine to be beefy enough for tests? Also, I don't
> understand, if the child has not enough resources to bring the VM fully up
> how waiting on either stream would help.

@tstuefe It's unlikely that _beginthreadex failed due to lack of memory. We are running on a machine with more than 50 GB of RAM and a concurrency of only 6. Extracts from the logs:

    $ jtreg -vmoption:-Xmx512m -concurrency:6 -vmoption:-XX:MaxRAMPercentage=4 ....

    open/test/jdk:jdk_lang
    start = Wed Mar 10 22:54:53 GMT 2021
    end = Wed Mar 10 22:56:36 GMT 2021
    elapsed= 102932 0:01:42.932

    ----------------------------------------
    [2021-03-10 22:57:07] [C:\cygwin\bin\free.exe] timeout=20000
    ----------------------------------------
                  total        used        free      shared  buff/cache   available
    Mem:       50068964    19974888    30094076           0           0    30094076
    Swap:       8388608      105048     8283560
    ----------------------------------------
    [2021-03-10 22:57:07] exit code: 0 time: 33 ms
    ----------------------------------------

My theory is that `TerminateProcess` has made it impossible for the child process to spawn new threads, but somehow existing threads are still able to run a little bit and produce the log message. Of course this is just a theory, and I cannot find any supporting docs from MS.
However, if we implement the workaround and make sure we don't kill the child until it has finished bootstrapping, and the bug no longer happens, then we will have learned something more.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3049
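
For illustration, here is a rough sketch of what "don't kill the child until it has finished bootstrapping" could look like. This is not the code from the PR: the class name `ChildReady`, the helper names, and the `child-ready` marker string are all hypothetical, and it assumes the child prints that marker to stdout once its VM is fully up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ChildReady {
    // Hypothetical marker; assumes the child prints it once bootstrapping is done.
    static final String READY_MARKER = "child-ready";

    // Reads lines until one contains the marker.
    // Returns false if the stream ends before the marker is seen.
    static boolean waitForMarker(BufferedReader reader, String marker) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Starts the child, blocks until it reports readiness (or its output ends),
    // and only then kills it forcibly.
    static boolean startWaitAndKill(ProcessBuilder pb) throws IOException, InterruptedException {
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader sees everything
        Process child = pb.start();
        BufferedReader out = new BufferedReader(
                new InputStreamReader(child.getInputStream(), StandardCharsets.UTF_8));
        try {
            return waitForMarker(out, READY_MARKER);
        } finally {
            child.destroyForcibly(); // kill only after the readiness check has returned
            child.waitFor();
            out.close();
        }
    }
}
```

Note that the sketch deliberately has no timeout of its own: it waits on the child's output rather than on a fixed delay, and relies on the test harness timeout if the child never reports readiness.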