On Wed, 17 Mar 2021 17:30:25 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:

>> Arbitrary time out has been a reliable source of intermittent failures.
>> 
>> Since we have spent a lot of time analyzing this failure, I think it's 
>> worthwhile to fix it properly, which doesn't seem that complicated. That's 
>> better than the same bug happening again a year later and a different set of 
>> people would spend hours to analyze it again.
>
> I don't think this is CPU starvation but memory exhaustion. _beginthreadex 
> fails with EACCES if it has no resources to start the thread, which in this 
> case probably means memory (the other possibility would be 
> out-of-HANDLE-space but seeing that the child just started I don't see how 
> this could be).
> 
> Should we harden tests against resource starvation like this, or rather 
> require the test machine to be beefy enough for tests? Also, I don't 
> understand, if the child has not enough resources to bring the VM fully up 
> how waiting on either stream would help.

@tstuefe It's unlikely that _beginthreadex failed due to lack of memory. We are 
running on a machine with more than 50GB of RAM and a concurrency of only 6, 
and about 30GB was still free right after the run. Extracts from the logs:

$ jtreg -vmoption:-Xmx512m -concurrency:6 -vmoption:-XX:MaxRAMPercentage=4 .... 
open/test/jdk:jdk_lang

start = Wed Mar 10 22:54:53 GMT 2021
end = Wed Mar 10 22:56:36 GMT 2021
elapsed= 102932 0:01:42.932

----------------------------------------
[2021-03-10 22:57:07] [C:\cygwin\bin\free.exe] timeout=20000
----------------------------------------
              total        used        free      shared  buff/cache   available
Mem:       50068964    19974888    30094076           0           0    30094076
Swap:       8388608      105048     8283560
----------------------------------------
[2021-03-10 22:57:07] exit code: 0 time: 33 ms
----------------------------------------

My theory is that `TerminateProcess` makes it impossible for the child 
process to spawn new threads, while the threads that already exist can still 
run briefly and produce the log message.

Of course this is just a theory, and I cannot find any supporting docs from MS. 
However, if we implement the workaround and make sure we don't kill the child 
until it has finished bootstrapping, and the bug no longer happens, then we 
will have learned something more.
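
To make the workaround concrete, here is a minimal sketch of the idea: the 
parent reads the child's output and only kills it after a readiness line has 
been seen. This is not the code in the PR. The class name, the method names and 
the `CHILD_READY` marker are all hypothetical, and it assumes the child prints 
that marker once the VM is fully up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of the test under discussion.
final class ChildBootstrapGate {
    // Assumed marker: the child must print this line once it has finished starting up.
    static final String READY_MARKER = "CHILD_READY";

    // Blocks until a line containing the marker is read.
    // Returns false if the stream ends before the marker appears.
    static boolean awaitReady(Reader childOutput) {
        BufferedReader reader = new BufferedReader(childOutput);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(READY_MARKER)) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Kills the child only after it has reported readiness (or closed its stdout).
    // Assumes stderr was merged into stdout (ProcessBuilder.redirectErrorStream(true))
    // so that the child cannot block on a full stderr pipe while we wait.
    static boolean killAfterBootstrap(Process child) throws InterruptedException {
        Reader out = new InputStreamReader(child.getInputStream(), StandardCharsets.UTF_8);
        boolean ready = awaitReady(out);
        child.destroy();
        child.waitFor();
        return ready;
    }
}
```

The wait is driven by the child's own output rather than by a fixed sleep, so 
it does not bring back an arbitrary timeout. A real test would probably still 
want an overall upper bound, in case the child never prints the marker while 
keeping its stdout open.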

-------------

PR: https://git.openjdk.java.net/jdk/pull/3049
