Sajesh,

For other users who may have run into this: I found a reason why srun 
cannot run interactive jobs, and it is not necessarily related to 
RHEL/CentOS 7.

If one straces slurmd, one may see the following (the third argument to chown is the gid):

chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)

In my case I had something similar:

chown("/dev/pts/1", 1326, 0) = -1 EPERM (Operation not permitted)

For our site, this bug report was also helpful:
https://bugs.schedmd.com/show_bug.cgi?id=8729

In Sajesh’s case the tty group was mapped to group 7; it (tty) should always be 
mapped to group 5. The gid in that chown comes from resolving the tty group when 
the pty is set up for the interactive step, and since /dev/pts is normally mounted 
with gid=5, a chown to any other gid fails with EPERM once slurmstepd has dropped 
root, which is presumably where the "IO setup failed" error comes from. At our 
site, the problem was that /etc/group was large and the tty group was not being 
read in properly.

The fix for us was to re-sort the group file by gid, so that the tty line would 
fall on line 5.
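
To check whether a node is affected, something like this should do it (gid 5 is the stock value on RHEL/CentOS; adjust if your site differs):

# the tty group should resolve to gid 5
getent group tty            # expect: tty:x:5:
grep -n '^tty:' /etc/group

# /dev/pts should be mounted with gid=5,mode=620
mount | grep devpts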

Hope this helps,
Kevin

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sajesh 
Singh <ssi...@amnh.org>
Date: Wednesday, March 25, 2020 at 2:23 AM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] Cannot run interactive jobs

CentOS 7.7.1908
Slurm 18.08.8

When trying to run an interactive job I am getting the following error:

srun: error: task 0 launch failed: Slurmd could not connect IO

Checking the log file on the compute node I see the following error:

[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 
HOST:192.168.229.254 PORT:14980
[2020-03-25T01:42:08.262] lllp_distribution jobid [13] implicit auto binding: 
cores,one_thread, dist 8192
[2020-03-25T01:42:08.262] _task_layout_lllp_cyclic
[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid [13]: 
mask_cpu,one_thread, 0x0000000000000001
[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 
seconds
[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 
4021
[2020-03-25T01:42:08.315] [13.0] done with job

When doing the same on a CentOS 7.3 cluster running Slurm 18.08.4, the interactive 
job runs as expected.

Any advice on how to remedy this would be appreciated.

-Sajesh-



