Has there been any news on this issue? I am seeing the same thing with 14.03.3.

Thanks

Martins

On 5/6/14 2:43 PM, Michael Robbert wrote:
We just upgraded Slurm from 2.6.6 to 14.03.2-1 on a Linux cluster and now we are having problems with interactive jobs using srun and --pty. If the job gets more than one node, it exits after two keystrokes with these errors:

srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: debug:  IO error on node 1
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1

[mrobbert@node001 mpi]$ srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Complete job step 422.0 received
srun: debug:  task 0 done
srun: Received task exit notification for 1 task (status=0x0009).
srun: error: node001: task 0: Killed
srun: debug:  IO thread exiting
srun: debug:  Leaving _msg_thr_internal

This does not happen if the job is on one node or if we don't use --pty. I have run with some debugging enabled, and srun is receiving a task exit notification from the tasks on the secondary node right after startup. Let me know what other debugging output might be useful here.

Thanks,
Mike Robbert
