On Mon, 22 May 2023 10:22:16 GMT, Volker Simonis <simo...@openjdk.org> wrote:

>> Since JDK13, executing commands in a sub-process defaults to the so called 
>> `POSIX_SPAWN` launching mechanism (i.e. 
>> `-Djdk.lang.Process.launchMechanism=POSIX_SPAWN`) on Linux. This works by 
>> using `posix_spawn(3)` to firstly start a tiny helper program called 
>> `jspawnhelper` in a subprocess. In a second step, `jspawnhelper` reads the 
>> command data from the parent Java process over a Unix pipe and finally 
>> executes (i.e. `execvp(3)`) the requested command.
>> 
>> In cases where the parent process terminates abnormally before 
>> `jspawnhelper` has read all the expected data from the pipe, `jspawnhelper` 
>> will block indefinitely on the reading end of the pipe. This is especially 
>> harmful if the parent process had open sockets, because in that case, the 
>> forked `jspawnhelper` process will inherit them and keep all the 
>> corresponding ports open effectively preventing other processes to bind to 
>> them. Notice that this is not an abstract scenario. We've observed this 
>> regularly in production with services which couldn't be restarted after a 
>> crash after migrating to JDK 17.
>> 
>> The fix of the issue is rather trivial. `jspawnhelper` has to close its 
>> writing end of the pipe which connects it with the parent Java process 
>> *before* starting to read from that pipe such that reads from the pipe can 
>> immediately return with EOF if the parent process terminates abnormally.
>> 
>> Also did some cleanup:
>>  - in `jspawnhelper.c` correctly chek the result of `readFully()`
>>  - in `ProcessImpl_md.c` use `restartableWrite()` instead of `write()` to 
>> write the command data from the parent process to `jspawnhelper`
>>  - in `childproc.{c,h}` remove remaining traces of `MODE_CLONE` because 
>> that's long gone.
>
> Volker Simonis has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'master' into JDK-8307990
>  - Address comments from tstuefe and RogerRiggs
>  - REALLY adding the test :)
>  - Added test case
>  - 8307990: jspawnhelper must close its writing side of a pipe before reading 
> from it

> Hey, chiming on this conversation, because I believe it will fix an issue we 
> encountered since, upgrading from JDK 11 to 17. In our application we spawn a 
> lot of processes over time and noticed that more and more JVM threads get 
> stuck in the native `ProcessImpl.forkAndExec`.
> 
> Thread Dump Excerpt
> After some investigation we saw that we have `jspawnhelper` processes 
> corresponding to the number of the stuck threads. `/proc/*/task/{id}/syscall` 
> shows that all of these processes are currently reading from the pipe.
> 
> Our theory is that the write calls from the parent thread were interrupted 
> resulting in an incomplete write.
> 
> https://github.com/openjdk/jdk/blob/69f508a2ac344eb61cef7be985348873b8265171/src/java.base/unix/native/libjava/ProcessImpl_md.c#L556-L559
> 
> 
> This would cause the `readFully` on the `jspawnhelper` side to wait for the 
> remaining data.
> As the number of bytes written is not checked on the writer side it would 
> continue and wait for the alive ping from the `jspawnhelper` which results in 
> a deadlock.
> I would be interested in a reproducer e.g. using `gdb`, but to be honest I 
> wouldn't know where to start.
> But regardless of whether this fixes our issue I think incomplete writes 
> should be handled according to specification.
> I think the `restartableWrites` in this PR do that?
> We also mitigated by falling back to `VFORK` as was also mentioned in [this 
> comment](https://github.com/openjdk/jdk/pull/13956#issuecomment-1547777042).
> 
> Maybe it would be worth it to mention our bug situation in the release notes 
> as well? Thank you and please let me know if I got something wrong or there 
> is anything I could provide to help!

@mlichtblau thanks a lot for your report. I saw some similar reports from our 
internal Maven builds (have to check though if they are exactly the same). Your 
analysis of the problem seems sound and I hope the restartable writes will 
indeed fix the problem.

I'll definitely add your example to the release notes and I'll try if I can 
come up with a test that reproduces your issue.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13956#issuecomment-1558794718

Reply via email to