On 23/03/2026 01:01, Samuel Thibault wrote:
Hello,

Michael Kelly, on Tue. 17 March 2026 21:13:53 +0000, wrote:
I don't have enough knowledge of this to make conclusions without further
input. My guess would be that the assignment of EINTR on the client side in
this instance is wrong.
Indeed. As I would understand it, the interrupt call should make the
server carefully stop its operation, and have the opportunity to return
either EINTR or a short read/write. Then the client should be able to
receive that and return it.

Claude's suggestion of not calling abort_all_rpcs() in suspend() is just
papering over the real issue, which would definitely happen with signal
handling, anyway, so better really fix the issue than avoid it.
(and no, this issue cannot explain the corrupted haskell symbol tables,
since it's not about a repeated piece of data, the binary would be
completely bogus otherwise)

Thanks, Samuel, for the confirmation and thanks to Claude and Brent for validating my findings.

I think that once the RPC has made it to the server, the overall result of the RPC should be determined by the server and not by the client, as is currently the case when a signal is about to be handled.

The strategy on the server side already seems right to me. The server operation must be terminated swiftly (either by completing or by aborting early) to minimise the delay before the client can handle the signal. The most likely cause of delay is the server waiting for a response to an RPC or system call of its own. Part of the signal handling preparation is to send an interrupt_operation RPC to the server, whose default implementation is to call hurd_thread_cancel(), which aborts all server RPCs in progress. Provided that the server code handles RPC errors appropriately, it has the opportunity to correct system state (if necessary) before returning an appropriate RPC reply to the client. There doesn't seem to be a method for interrupting normal user code within the server, but provided that the operation is relatively fast it can simply complete and return its reply to the client before signal handling proceeds.

It is therefore necessary for the signal handling code not only to wait for the server reply but also to make that reply available to the client once signal handling has completed. Although the code currently does wait for the server reply, it does not preserve that reply for the client.
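
To make the server-side behaviour described above concrete, here is a minimal, portable sketch. It is not Hurd code: demo_store, demo_server_write() and the cancel_after field are hypothetical names invented for illustration, and the cancellation is simulated with a byte threshold rather than a real interrupt_operation RPC. It only shows the idea that a cancelled operation stops at a safe point and reports either a short write or, if nothing was done, EINTR.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a server-side store; not taken from the Hurd.  */
struct demo_store
{
  char data[64];
  size_t used;
  int cancelled;        /* Set once an interrupt request has been seen.  */
  size_t cancel_after;  /* Demo only: cancel once this many bytes of the
                           current call have been stored.  */
};

/* Store LEN bytes in CHUNK-sized pieces, checking for cancellation
   between pieces.  Returns 0 with *WRITTEN set on a full or short
   write, and EINTR only if cancellation arrived before any byte was
   stored.  */
int
demo_server_write (struct demo_store *s, const char *buf, size_t len,
                   size_t chunk, size_t *written)
{
  size_t done = 0;

  while (done < len)
    {
      /* Simulated arrival of the interrupt request.  */
      if (done >= s->cancel_after)
        s->cancelled = 1;
      if (s->cancelled)
        break;

      size_t room = sizeof s->data - s->used;
      size_t n = len - done < chunk ? len - done : chunk;
      if (n > room)
        n = room;
      if (n == 0)
        break;

      memcpy (s->data + s->used, buf + done, n);
      s->used += n;
      done += n;
    }

  *written = done;
  return (done == 0 && s->cancelled) ? EINTR : 0;
}
```

In this model the reply (error code plus byte count) is decided entirely by the server function, which is the property the client side then has to respect.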

I have prototyped an alteration to abort_all_rpcs() in glibc's hurd/hurdsig.c. After the 'interrupt operation' has been sent to the server, the code awaits a reply to the RPC that is being interrupted. Currently the code receives the reply with an undersized message header, presumably just to confirm that the operation is complete. The actual reply is then discarded. I instead supplied the mach_msg_header_t that was passed to the original RPC call in _hurd_intr_rpc_mach_msg(), with its associated rcv_size. These can be obtained from the thread state in registers rdi and r10. The actual return code from the server can be stored in SYSRETURN. In effect, changing:

       mach_msg_header_t head;
       err = __mach_msg (&head, MACH_RCV_MSG|MACH_RCV_TIMEOUT, 0, sizeof head,
                          reply_ports[nthreads],
                          _hurd_interrupted_rpc_timeout, MACH_PORT_NULL);
to:

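        /* Recover the reply buffer and receive size that the interrupted
           thread passed to its original RPC call.  */
        mach_msg_header_t* head = (mach_msg_header_t*)state->basic.rdi;
        mach_msg_size_t rcv_size = (mach_msg_size_t)state->basic.r10;

        err = __mach_msg (head, MACH_RCV_MSG|MACH_RCV_TIMEOUT, 0, rcv_size,
                          reply_ports[nthreads],
                          _hurd_interrupted_rpc_timeout, MACH_PORT_NULL);

        /* Hand the result of the receive back to the suspended thread.  */
        state->basic.SYSRETURN = err;
        state_changed = 1;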
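
As a rough illustration of why preserving the reply matters, here is a toy model in portable C. Nothing in it is glibc or Mach code: toy_file, toy_reply, the two policies and the retrying caller are all hypothetical. It assumes that the server has already completed the write when the signal arrives and that the caller retries after EINTR, which is the kind of scenario that would produce a repeated piece of data.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the server's file and its RPC reply.  */
struct toy_file { char data[64]; size_t used; };
struct toy_reply { int retcode; size_t amount; };

enum toy_policy { TOY_DISCARD_REPLY, TOY_PRESERVE_REPLY };

/* The server completes the write and builds its reply.  */
struct toy_reply
toy_server_write (struct toy_file *f, const char *buf, size_t len)
{
  struct toy_reply r = { 0, 0 };
  size_t room = sizeof f->data - f->used;
  if (len > room)
    len = room;
  memcpy (f->data + f->used, buf, len);
  f->used += len;
  r.amount = len;
  return r;
}

/* One write that is interrupted after the server has finished.
   With TOY_DISCARD_REPLY the client sees EINTR regardless of what the
   server did; with TOY_PRESERVE_REPLY it sees the server's reply.  */
int
toy_interrupted_write (struct toy_file *f, const char *buf, size_t len,
                       enum toy_policy policy, size_t *amount)
{
  struct toy_reply r = toy_server_write (f, buf, len);
  if (policy == TOY_DISCARD_REPLY)
    {
      *amount = 0;
      return EINTR;
    }
  *amount = r.amount;
  return r.retcode;
}

/* Assumed caller behaviour: retry once after EINTR.  The retry itself
   is not interrupted.  Returns the resulting file size.  */
size_t
toy_write_with_retry (struct toy_file *f, const char *buf, size_t len,
                      enum toy_policy policy)
{
  size_t amount;
  int err = toy_interrupted_write (f, buf, len, policy, &amount);
  if (err == EINTR)
    toy_interrupted_write (f, buf, len, TOY_PRESERVE_REPLY, &amount);
  return f->used;
}
```

Writing 5 bytes under the discard policy leaves 10 bytes in the toy file, because the first write took effect even though the client was told EINTR. Under the preserve policy the file holds the expected 5 bytes.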

I was able to run the test case (calls to write() with simultaneous SIGSTOP/SIGCONT) successfully with this change and some minor rearrangement of the code. This is only a partial solution, as there are several places where EINTR is potentially returned to the client inappropriately. Returning the actual server reply to the suspended thread was the part I was least sure would work, so I am now more confident about providing a complete implementation if this is considered the right way to go. Please advise. I'll probably need some guidance on the appropriate behaviour under other failure conditions, for example if the interrupt operation cannot be delivered. Those can be considered later if the overall approach is valid.
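
For readers who want to try something similar, a test of this kind could look roughly like the sketch below. This is not the actual test case, which is not shown in this message; run_stop_cont_write_test() and its record format are assumptions made for illustration. A child process writes numbered records to a pipe while the parent repeatedly sends SIGSTOP and SIGCONT, then checks that every record arrives exactly once and in order, and that no write() failed.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if all RECORDS records arrive intact and in order,
   1 on a lost, repeated or failed write, -1 on a setup error.  */
int
run_stop_cont_write_test (uint32_t records)
{
  int fds[2];
  if (pipe (fds) != 0)
    return -1;

  pid_t pid = fork ();
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      /* Child: any failed or short write (including EINTR) is an error.  */
      close (fds[0]);
      for (uint32_t i = 0; i < records; i++)
        if (write (fds[1], &i, sizeof i) != (ssize_t) sizeof i)
          _exit (1);
      _exit (0);
    }

  close (fds[1]);
  int result = 0;
  for (uint32_t expect = 0; expect < records && result == 0; expect++)
    {
      /* Periodically stop and resume the writer.  */
      if (expect % 16 == 0)
        {
          kill (pid, SIGSTOP);
          kill (pid, SIGCONT);
        }

      uint32_t got = 0;
      size_t have = 0;
      while (have < sizeof got)
        {
          ssize_t n = read (fds[0], (char *) &got + have, sizeof got - have);
          if (n <= 0)
            break;
          have += (size_t) n;
        }
      if (have != sizeof got || got != expect)
        result = 1;
    }

  close (fds[0]);
  kill (pid, SIGCONT);  /* Make sure the child is not left stopped.  */

  int status;
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  if (!WIFEXITED (status) || WEXITSTATUS (status) != 0)
    result = 1;
  return result;
}
```

A call such as run_stop_cont_write_test (2000) is expected to return 0 on a system where stop/continue signals do not disturb write(). The sketch makes no attempt to guarantee that a signal lands while a write is actually in progress, so a real test would likely need many more iterations or larger writes.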

Cheers,

Mike.

