On 23/03/2026 01:01, Samuel Thibault wrote:
> Hello,
>
> Michael Kelly, on Tue. 17 March 2026 21:13:53 +0000, wrote:
>> I don't have enough knowledge of this to make conclusions without further
>> input. My guess would be that the assignment of EINTR on the client side in
>> this instance is wrong.
>
> Indeed. As I would understand it, the interrupt call should make the
> server carefully stop its operation, and have the opportunity to return
> either EINTR or a short read/write. Then the client should be able to
> receive that and return it.
>
> Claude's suggestion of not calling abort_all_rpcs() in suspend() is just
> papering over the real issue, which would definitely happen with signal
> handling, anyway, so better really fix the issue than avoid it.
>
> (and no, this issue cannot explain the corrupted haskell symbol tables,
> since it's not about a repeated piece of data, the binary would be
> completely bogus otherwise)

Thanks, Samuel, for the confirmation and thanks to Claude and Brent for
validating my findings.

I think that, once the RPC has made it to the server, the overall result
of the RPC should be determined by the server and not by the client, as
is currently the case when a signal is about to be handled.

The strategy on the server side already seems right to me. The server
operation must be terminated swiftly (either by completing or by
aborting early) to minimise the delay before the client can handle the
signal. The most likely cause of delay is the server waiting for a
response to an RPC or system call of its own. Part of the signal
handling preparation is to send an interrupt_operation RPC to the
server, whose default implementation calls hurd_thread_cancel(), which
aborts all server RPCs in progress. Provided that the server code
handles RPC errors appropriately, it has the opportunity to correct
system state (if necessary) before returning an appropriate RPC reply to
the client. There doesn't seem to be a method for interrupting normal
user code within the server but, provided that the operation is
relatively fast, it can simply complete and return its reply to the
client before signal handling proceeds.

The signal handling code therefore needs not only to wait for the server
reply but also to make that reply available to the client once signal
handling has completed. The code does currently wait for the server
reply, but it does not preserve that reply for the client.

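As a rough illustration of the principle above (the server decides the
outcome, the client passes it through), here is a toy model in plain C.
It is not Hurd or glibc code: toy_server_write(), toy_client_result()
and struct toy_reply are hypothetical names, and the "interrupt" is
simulated by a chunk counter rather than by a real interrupt_operation
RPC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy model, not Hurd code: the reply a "server" sends
   back for a write operation.  */
struct toy_reply
{
  int err;        /* 0 on success, EINTR if nothing was transferred */
  size_t amount;  /* bytes actually written */
};

/* Write LEN bytes in CHUNK-sized steps.  The simulated interrupt
   arrives once CANCEL_AFTER chunks have completed; a negative
   CANCEL_AFTER means no interrupt arrives at all.  */
struct toy_reply
toy_server_write (size_t len, size_t chunk, long cancel_after)
{
  struct toy_reply reply = { 0, 0 };
  long chunks_done = 0;

  assert (chunk > 0);
  while (reply.amount < len)
    {
      if (cancel_after >= 0 && chunks_done >= cancel_after)
        {
          /* Interrupted: report EINTR only if no data was
             transferred, otherwise report the short write.  */
          if (reply.amount == 0)
            reply.err = EINTR;
          return reply;
        }
      size_t n = len - reply.amount < chunk ? len - reply.amount : chunk;
      reply.amount += n;
      chunks_done++;
    }
  return reply;
}

/* Client side: what the interrupted thread should see once signal
   handling has completed.  The server's reply is passed through
   unchanged rather than being replaced by a client-side EINTR.  */
long
toy_client_result (struct toy_reply reply)
{
  if (reply.err)
    {
      errno = reply.err;
      return -1;
    }
  return (long) reply.amount;
}
```

In this model an interrupt that arrives after three 10-byte chunks
yields a short write of 30 bytes, and only an interrupt that arrives
before any data has moved yields EINTR.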
I have prototyped an alteration to abort_all_rpcs() in glibc's
hurd/hurdsig.c. After the 'interrupt operation' has been sent to the
server, the code awaits a reply to the RPC that is being interrupted.
Currently the code receives the reply with an undersized message header,
presumably just to confirm that the operation is complete. The actual
reply is then discarded. I instead supplied the mach_msg_header_t that
was passed to the original RPC call in _hurd_intr_rpc_mach_msg(),
together with its associated rcv_size. On x86_64 these can be obtained
from the thread state in registers rdi and r10 (the first and fourth
arguments of the interrupted mach_msg call). The return value of this
receive can then be stored in SYSRETURN for the interrupted thread. In
effect, changing:

    mach_msg_header_t head;
    err = __mach_msg (&head, MACH_RCV_MSG|MACH_RCV_TIMEOUT, 0,
                      sizeof head,
                      reply_ports[nthreads],
                      _hurd_interrupted_rpc_timeout, MACH_PORT_NULL);

to:

    mach_msg_header_t *head = (mach_msg_header_t *) state->basic.rdi;
    mach_msg_size_t rcv_size = (mach_msg_size_t) state->basic.r10;
    err = __mach_msg (head, MACH_RCV_MSG|MACH_RCV_TIMEOUT, 0, rcv_size,
                      reply_ports[nthreads],
                      _hurd_interrupted_rpc_timeout,
                      MACH_PORT_NULL);
    state->basic.SYSRETURN = err;
    state_changed = 1;

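To make the difference between the two receives concrete, here is a
small stand-alone model. It does not use the Mach API: struct toy_port,
toy_receive() and the TOY_RCV_* codes are hypothetical stand-ins. It
assumes, as the description above says of the current code, that a
receive into a too-small buffer still consumes the queued reply but
loses its contents.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these are Mach
   names or Mach return codes.  */
enum { TOY_RCV_OK = 0, TOY_RCV_TOO_LARGE = 1, TOY_RCV_TIMED_OUT = 2 };

struct toy_port
{
  const unsigned char *msg;  /* queued reply, or NULL if none */
  size_t msg_size;
};

/* Dequeue the queued reply into BUF, which holds RCV_SIZE bytes.
   Assumption of this model: if BUF is too small the reply is still
   dequeued, but its contents are lost to the caller.  */
int
toy_receive (struct toy_port *port, void *buf, size_t rcv_size)
{
  int result = TOY_RCV_TOO_LARGE;

  if (port->msg == NULL)
    return TOY_RCV_TIMED_OUT;

  if (port->msg_size <= rcv_size)
    {
      memcpy (buf, port->msg, port->msg_size);
      result = TOY_RCV_OK;
    }

  port->msg = NULL;  /* the reply is consumed either way */
  return result;
}
```

Receiving with a header-sized buffer corresponds to the first snippet:
the caller learns that a reply arrived, but a second receive finds
nothing. Receiving with the original buffer and rcv_size corresponds to
the second snippet: the reply lands where the interrupted caller
expects it.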
I was able to run the test case (calls to write() with simultaneous
SIGSTOP/SIGCONT) successfully with this change and some minor
rearrangement of the code. This is only a partial solution, as there are
several places where EINTR is potentially returned to the client
inappropriately. Returning the actual server reply to the suspended
thread was the part I was least sure would work, so I am now more
confident about providing a complete implementation if this is
considered the right way to go. Please advise.

I'll probably need some guidance on the appropriate behaviour under
other failure conditions, for example if the interrupt operation cannot
be delivered. Those can be considered later if the overall approach is
valid.

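The test program itself is not included in this message. For readers
who want to try something similar, the following is a hypothetical
sketch of a test of that general shape, using only POSIX calls: a child
issues write() calls in a loop while the parent repeatedly sends it
SIGSTOP and SIGCONT. The function name, the buffer size, the pause
length and the exit-code convention are all assumptions made for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical reproduction sketch.  Returns 0 if no write() failed,
   1 if a write() failed with EINTR, 2 for any other write() error,
   and -1 if the test itself could not be run.  Short writes are
   tolerated, since they are a legitimate outcome of an interrupt.  */
int
stop_cont_write_test (const char *path, long writes)
{
  pid_t pid = fork ();
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      /* Child: issue WRITES calls and report the first error.  */
      char buf[4096] = { 0 };
      int fd = open (path, O_WRONLY);
      if (fd < 0)
        _exit (3);
      for (long i = 0; i < writes; i++)
        if (write (fd, buf, sizeof buf) < 0)
          _exit (errno == EINTR ? 1 : 2);
      _exit (0);
    }

  /* Parent: stop and continue the child until it has exited.  */
  const struct timespec delay = { 0, 200000 };  /* 200 microseconds */
  int status;
  for (;;)
    {
      pid_t r = waitpid (pid, &status, WNOHANG);
      if (r == pid)
        break;
      if (r < 0)
        return -1;
      kill (pid, SIGSTOP);
      nanosleep (&delay, NULL);
      kill (pid, SIGCONT);
      nanosleep (&delay, NULL);
    }

  if (!WIFEXITED (status))
    return -1;
  int code = WEXITSTATUS (status);
  return code == 3 ? -1 : code;
}
```

A call such as stop_cont_write_test ("/dev/null", 20000) returning 1
would indicate that a write() was failed with EINTR while the process
was being stopped and continued.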
Cheers,
Mike.