Sajesh, for any other users who may have run into this: I found a reason why srun cannot run interactive jobs, and it may not necessarily be related to RHEL/CentOS 7.
If one straces slurmd, one may see (note arg 3, the GID):

    chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)

In my case I had something similar:

    chown("/dev/pts/1", 1326, 0) = -1 EPERM (Operation not permitted)

For our site, this report was also helpful: https://bugs.schedmd.com/show_bug.cgi?id=8729

The tty group was mapped to GID 7 in Sajesh's case. It (tty) should always be mapped to GID 5. At our site, the problem was that /etc/group was large and the tty group was not being read in properly. The fix for us was to re-sort the group file by GID, so that the tty entry fell on line 5.

Hope this helps,
Kevin

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sajesh Singh <ssi...@amnh.org>
Date: Wednesday, March 25, 2020 at 2:23 AM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] Cannot run interactive jobs

CentOS 7.7.1908
Slurm 18.08.8

When trying to run an interactive job I am getting the following error:

    srun: error: task 0 launch failed: Slurmd could not connect IO

Checking the log file on the compute node I see the following errors:

    [2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980
    [2020-03-25T01:42:08.262] lllp_distribution jobid [13] implicit auto binding: cores,one_thread, dist 8192
    [2020-03-25T01:42:08.262] _task_layout_lllp_cyclic
    [2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x0000000000000001
    [2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
    [2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds
    [2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
    [2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
    [2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
    [2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021
    [2020-03-25T01:42:08.315] [13.0] done with job

When doing the same on a CentOS 7.3 and Slurm 18.08.4 cluster, the interactive job runs as expected. Any advice on how to remedy this would be appreciated.

-Sajesh-
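For anyone wanting to check their own nodes, a minimal sketch of Kevin's diagnosis and fix, assuming a standard Linux box where the tty group is conventionally GID 5 (paths and filenames here are illustrative; never rewrite /etc/group in place without the proper tools):

    # 1) Confirm what GID the tty group actually resolves to (should be 5):
    getent group tty

    # 2) Reproduce the re-sort described above on a COPY of the group file,
    #    ordering entries numerically by the GID field (field 3, ':'-separated)
    #    so low-numbered system groups like tty land near the top:
    sort -t: -k3,3n /etc/group > /tmp/group.sorted

    # 3) Inspect the first few entries of the sorted copy before deciding
    #    whether to apply anything to the real file:
    head -n 6 /tmp/group.sorted

To watch slurmd fail in real time, one can also attach strace to the running daemon and filter for the failing call, e.g. `strace -f -e trace=chown -p <slurmd_pid>`, which is how the EPERM above was spotted.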