[slurm-users] Re: Debian RPM build for arm64?

2024-06-14 Thread Christopher Harrop - NOAA Affiliate via slurm-users
I have confirmed that the issue is Ubuntu 20.04. I used the tmate GitHub action to get access to the Ubuntu 20.04 GitHub arm runner and tried the steps manually one by one. It did indeed fail, almost immediately in the "debuild -b -uc -us" step. Given that the same experiment done on an Ubuntu

[slurm-users] Debian RPM build for arm64?

2024-06-13 Thread Christopher Harrop - NOAA Affiliate via slurm-users
Hello, Are the instructions for building Debian RPMs found at https://slurm.schedmd.com/quickstart_admin.html#debuild expected to work on ARM machines? I am having trouble with the "debuild -b -uc -us" step. #10 29.01 configure: exit 1 #10 29.01 dh_auto_configure: error: cd obj-aarch64-linux-

[slurm-users] slurmstepd: error: task_g_set_affinity: Operation not permitted

2024-06-13 Thread Christopher Harrop - NOAA Affiliate via slurm-users
Hi, I am building a containerized Slurm cluster with Ubuntu 20.04 and have it almost working. The daemons start, and an “sinfo” command shows compute nodes up and available: admin@slurmfrontend:~$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST slurmpar* up infinite 3 idle sl

[slurm-users] Re: slurmstepd: error: task_g_set_affinity: Operation not permitted

2024-06-13 Thread Christopher Harrop - NOAA Affiliate via slurm-users
There is a permission problem somewhere, but I don’t know where. If I run as root, it works: admin@slurmfrontend:~$ srun hostname srun: error: task 0 launch failed: Slurmd could not execve job slurmstepd: error: task_g_set_affinity: Operation not permitted slurmstepd: error: _exec_wait_child_wait

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Christopher Harrop - NOAA Affiliate
> Hi Chris > > You are right in pointing that the job actually runs, despite of the error in > the sbatch. The customer mention that: > === start === > Problem had usual scenario - job script was submitted and executed, but > sbatch command returned non-zero exit status to ecflow, which thus as

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher Harrop - NOAA Affiliate
Hi, My group is struggling with this also. The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition or
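
One way a workflow tool might cope with the behaviour described above (sbatch returning an error even though the job was accepted) is to tag each submission with a unique job name and, when sbatch reports failure, ask squeue whether a job with that name exists before resubmitting. The following is only a minimal sketch under stated assumptions: the function name `submit_job`, the `wf-` name prefix, and the injectable `run` parameter are hypothetical and not taken from the thread. A job that has already finished and been purged from the queue would not be found this way.

```python
import subprocess
import uuid


def submit_job(script_path, run=subprocess.run):
    """Submit script_path with sbatch; return the job ID, or None if no job is found."""
    # Hypothetical unique name so the job can be located again if sbatch errors out.
    job_name = f"wf-{uuid.uuid4().hex[:12]}"
    submit = run(
        ["sbatch", "--parsable", f"--job-name={job_name}", script_path],
        capture_output=True,
        text=True,
    )
    if submit.returncode == 0:
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        return submit.stdout.strip().split(";")[0]

    # sbatch reported failure, but the controller may have accepted the job anyway.
    query = run(
        ["squeue", "--noheader", "--format=%i", f"--name={job_name}"],
        capture_output=True,
        text=True,
    )
    job_ids = query.stdout.split()
    if query.returncode == 0 and job_ids:
        return job_ids[0]
    return None
```

A caller would resubmit only when `submit_job("job.sh")` returns `None`, which avoids running the same script twice when the error was only in the reply to sbatch.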