We just upgraded Slurm from 2.6.6 to 14.03.2-1 on a Linux cluster and now we are having problems with interactive jobs using srun and --pty. If the job gets more than one node, it exits after 2 keystrokes with these errors:

srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: debug:  IO error on node 1
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1

[mrobbert@node001 mpi]$ srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Complete job step 422.0 received
srun: debug:  task 0 done
srun: Received task exit notification for 1 task (status=0x0009).
srun: error: node001: task 0: Killed
srun: debug:  IO thread exiting
srun: debug:  Leaving _msg_thr_internal

This does not happen if the job is on one node or if we don't use --pty. I have run with some debugging enabled, and we are receiving task exit notifications from the tasks on the secondary node right after startup. Let me know what other debugging output might be useful here.

Thanks,
Mike Robbert
