Dear SLURM,

I am running into this error (srun: debug3: got eof-stdout msg on _server_read
header), which I believe is the root cause of an "interactive" job being
terminated immediately after it is allocated resources. As far as SLURM and the
job are concerned, the job status shows completed. I have run out of options: I
debugged the PAM stack and everything else I could think of, but could not get
the remote bash shell to launch. The next obvious step is to add debug output
to the slurmstepd source files and see if I can interpret what is going on, but
I wanted to see if I can get help here first. I should also mention that I have
another set of nodes with exactly the same setup, on which interactive jobs
work perfectly fine. The only difference between the two node types is that one
is RHEL 6.5 and the other is RHEL 6.7.

Any help is greatly appreciated.
Thank you,
Amit

Here is what I tried and what I see:

#srun  -d9 -p gpu --exclusive --pty $SHELL
or
#srun  -d9 -p gpu -n1 --x11=first  --pty $SHELL

srun: job 12908567 queued and waiting for resources
srun: job 12908567 has been allocated resources
srun: error: x11: job has no allocated nodes defined
                                                    login01:/users/ahkumar
$

With additional debug flags, I see that something is causing the task's stdout
to return an EOF message. Below is a snippet from the SlurmdLogFile. I do have
the slurm-spank-x11 plugin installed correctly.


[2016-10-02T21:43:39.825] [12894576.0] Handling REQUEST_SIGNAL_CONTAINER
[2016-10-02T21:43:39.825] [12894576.0] _handle_signal_container for step=12894576.0 uid=0 signal=995
[2016-10-02T21:43:39.825] [12894576.0] Leaving  _handle_request: SLURM_SUCCESS
[2016-10-02T21:43:39.825] [12894576.0] Entering _handle_request
[2016-10-02T21:43:39.825] [12894576.0] Leaving  _handle_accept
[2016-10-02T21:43:39.841] [12894576.0] Entering _task_read for obj 23a7240
[2016-10-02T21:43:39.841] [12894576.0]   error in _task_read: Input/output error
[2016-10-02T21:43:39.841] [12894576.0]   got eof on task
[2016-10-02T21:43:39.841] [12894576.0] ************************ -1 bytes read from task STDOUT
[2016-10-02T21:43:39.841] [12894576.0] Entering _send_eof_msg
[2016-10-02T21:43:39.841] [12894576.0] Myname in build_hashtbl: (slurmstepd)
[2016-10-02T21:43:39.841] [12894576.0] ======================== Enqueued eof message
[2016-10-02T21:43:39.841] [12894576.0] Leaving  _send_eof_msg



