srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: debug: IO error on node 1
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1
[mrobbert@node001 mpi]$ srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Complete job step 422.0 received
srun: debug: task 0 done
srun: Received task exit notification for 1 task (status=0x0009).
srun: error: node001: task 0: Killed
srun: debug: IO thread exiting
srun: debug: Leaving _msg_thr_internal

This does not happen if the job is on one node or if we don't use --pty. I have run with some debugging on and we are receiving task exit from the tasks on the secondary node right after startup. Let me know what other debugging output might be useful here.
Thanks, Mike Robbert
