Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-15 Thread Thomas Arildsen
Hi Ole

Thanks for the suggestion. I am afraid it is not the same problem in my case; at 
least, restarting `slurmdbd` and `slurmctld` on the head node has made no 
difference.
It puzzles me why Slurm appears to treat this one user differently than all 
others. Even other users under the same account are doing fine.
I think the possible relation to another (array) job I was speculating about in 
my original message was just coincidental.
I have now tried the following three steps in the hope of somehow fixing the 
problem, none of which have changed the situation:

- Deleted the user from Slurm using `sacctmgr remove user` and re-created the 
user afterwards.
- Removed the user's home directory and let the login procedure populate a new 
home directory from scratch for the user.
- Restarted `slurmdbd` and `slurmctld` as mentioned above.
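
For completeness, this is roughly how I have been comparing the affected user's
Slurm settings against those of a working user (a sketch; the user names are
placeholders and the sacctmgr format fields may need adjusting for your version):

 $ sacctmgr show assoc where user=<affected_user> \
       format=User,Account,Partition,QOS,MaxJobs,MaxWall,MaxTRES
 $ sacctmgr show assoc where user=<working_user> \
       format=User,Account,Partition,QOS,MaxJobs,MaxWall,MaxTRES
 $ sacctmgr show qos format=Name,MaxWall,MaxTRESPerUser,MaxJobsPerUser,Flags

So far everything I can see there looks identical for the two users.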

Thomas


Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-15 Thread Cristóbal Navarro
Hi Michael,
Thanks. Indeed, I don't have it, so Slurm must not have detected NVML at build time.
I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu).
Here is some output, including the relevant paths for NVML.
Is it possible to tell the Slurm build to check these paths for NVML?
best

*NVML PKG CHECK*
➜  ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
  NVML native dev links, headers

cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
  NVML native dev links, headers


libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
  NVIDIA Management Library (NVML) development files
python3-pynvml/focal 7.352.0-3 amd64
  Python3 bindings to the NVIDIA Management Library



*NVML Shared library location*
➜  ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so



*NVML Header*
➜  ~ find /usr | grep nvml
/usr/include/nvml.h




*SLURM LIBS*
➜  ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*  core_spec_none.so*  job_submit_pbs.so*  proctrack_pgid.so*
accounting_storage_none.so*  cred_munge.so*  job_submit_require_timelimit.so*  route_default.so*
accounting_storage_slurmdbd.so*  cred_none.so*  job_submit_throttle.so*  route_topology.so*
acct_gather_energy_ibmaem.so*  ext_sensors_none.so*  launch_slurm.so*  sched_backfill.so*
acct_gather_energy_ipmi.so*  gpu_generic.so*  mcs_account.so*  sched_builtin.so*
acct_gather_energy_none.so*  gres_gpu.so*  mcs_group.so*  sched_hold.so*
acct_gather_energy_pm_counters.so*  gres_mic.so*  mcs_none.so*  select_cons_res.so*
acct_gather_energy_rapl.so*  gres_mps.so*  mcs_user.so*  select_cons_tres.so*
acct_gather_energy_xcc.so*  gres_nic.so*  mpi_none.so*  select_linear.so*
acct_gather_filesystem_lustre.so*  jobacct_gather_cgroup.so*  mpi_pmi2.so*  site_factor_none.so*
acct_gather_filesystem_none.so*  jobacct_gather_linux.so*  mpi_pmix.so@  slurmctld_nonstop.so*
acct_gather_interconnect_none.so*  jobacct_gather_none.so*  mpi_pmix_v2.so*  src/
acct_gather_interconnect_ofed.so*  jobcomp_elasticsearch.so*  node_features_knl_generic.so*  switch_none.so*
acct_gather_profile_hdf5.so*  jobcomp_filetxt.so*  power_none.so*  task_affinity.so*
acct_gather_profile_influxdb.so*  jobcomp_lua.so*  preempt_none.so*  task_cgroup.so*
acct_gather_profile_none.so*  jobcomp_mysql.so*  preempt_partition_prio.so*  task_none.so*
auth_munge.so*  jobcomp_none.so*  preempt_qos.so*  topology_3d_torus.so*
burst_buffer_generic.so*  jobcomp_script.so*  prep_script.so*  topology_hypercube.so*
cli_filter_lua.so*  job_container_cncu.so*  priority_basic.so*  topology_none.so*
cli_filter_none.so*  job_container_none.so*  priority_multifactor.so*  topology_tree.so*
cli_filter_syslog.so*  job_submit_all_partitions.so*  proctrack_cgroup.so*
cli_filter_user_defaults.so*  job_submit_lua.so*  proctrack_linuxproc.so*
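
As far as I can tell there is no gpu_nvml.so in that listing, so I suppose I will
have to rebuild Slurm. This is a rough sketch of what I have in mind (assuming
this Slurm version's configure accepts a --with-nvml path option; the source
directory is a placeholder and the flags/paths may need adjusting):

 $ cd /path/to/slurm-source
 $ ./configure --prefix=/usr --sysconfdir=/etc/slurm --with-nvml=/usr
 $ grep -i nvml config.log      # confirm configure actually found NVML
 $ make -j && sudo make install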

On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico 
wrote:

> the error message sounds like when you built the slurm source it
> wasn't able to find the nvml devel packages.  if you look in where you
> installed slurm, in lib/slurm you should have a gpu_nvml.so.  do you?
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
>  wrote:
> >
> > typing error, should be --> **located at /usr/include/nvml.h**
> >
> > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <
> cristobal.navarr...@gmail.com> wrote:
> >>
> >> Hi community,
> >> I have set up the configuration files as mentioned in the
> documentation, but the slurmd of the GPU-compute node fails with the
> following error shown in the log.
> >> After reading the slurm documentation, it is not entirely clear to me
> how to properly set up GPU autodetection for the gres.conf file as it does
> not mention if the nvml detection should be automatic or not.
> >> I have also read the top google searches including
> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> but that was a problem of a cuda installation overwritten (not my case).
> >> This a DGX A100 node that comes with the Nvidia driver installed and
> nvml is located at /etc/include/nvml.h, not sure if there is a libnvml.so
> or similar as well.
> >> How to tell SLURM to look at those paths? any ideas of experience
> sharing is welcome.
> >> best
> >>
> >>
> >> slurmd.log (GPU node)
> >> [2021-04-14T17:31:42.302] got shutdown request
> >> [2021-04-1

[slurm-users] NHC and slurm

2021-04-15 Thread Heitor
Hello,

I'm trying to set up NHC[0] for our Slurm cluster, but I'm not getting
it to work properly.

I'm using the dev branch from [0] and compiled it this way:

$ ./autogen.sh --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib
$ make test
$ sudo make install

When I run nhc, I get an error that sshd is not running:

$ sudo nhc
ERROR:  nhc:  Health check failed:  check_ps_service:  Service sshd (process 
sshd) owned by root not running

I know sshd is running because I logged in to this machine with ssh, and
`systemctl status sshd` shows it is active.

Here's a sample of my nhc.conf:

   * || check_ps_service munged
   * || check_ps_service -u root sshd
   * || check_ps_service -u root ssh
   * || check_ps_service ssh
   * || check_ps_service sshd

If I run `sudo nhc -a` to run all the tests, it gives 4 errors about
ssh.

NHC can find munge running, so what's the problem with ssh? What am I
missing?

I'm using Ubuntu 20.04.
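
In case it helps, this is how I have been checking what the process table
actually reports for sshd (a sketch; I am assuming NHC matches on the ps
command name, and that this NHC version supports the -d debug flag):

 $ ps -eo user,comm,args | grep -i '[s]shd'
 $ sudo nhc -d -a     # re-run all checks with debug output to see what it matched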

Cheers,
Heitor


[0] https://github.com/mej/nhc/




Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-15 Thread Michael Di Domenico
The error message sounds like the build wasn't able to find the NVML devel
packages when you built Slurm from source. If you look at where you
installed Slurm, under lib/slurm you should have a gpu_nvml.so. Do you?
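
For example, something along these lines (a rough sketch; the paths assume a
default install location and a placeholder build directory, so adjust them to
wherever you installed and built Slurm):

 $ ls /usr/lib64/slurm/ | grep gpu_nvml            # is the NVML GPU plugin there?
 $ grep -i nvml /path/to/slurm-build/config.log    # did configure find NVML?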

On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
 wrote:
>
> typing error, should be --> **located at /usr/include/nvml.h**
>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro 
>  wrote:
>>
>> Hi community,
>> I have set up the configuration files as mentioned in the documentation, but 
>> the slurmd of the GPU-compute node fails with the following error shown in 
>> the log.
>> After reading the slurm documentation, it is not entirely clear to me how to 
>> properly set up GPU autodetection for the gres.conf file as it does not 
>> mention if the nvml detection should be automatic or not.
>> I have also read the top google searches including 
>> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html 
>> but that was a problem of a cuda installation overwritten (not my case).
>> This is a DGX A100 node that comes with the Nvidia driver installed, and nvml is 
>> located at /etc/include/nvml.h; I am not sure whether there is a libnvml.so or 
>> similar as well.
>> How can I tell Slurm to look at those paths? Any ideas or shared experience are 
>> welcome.
>> best
>>
>>
>> slurmd.log (GPU node)
>> [2021-04-14T17:31:42.302] got shutdown request
>> [2021-04-14T17:31:42.302] all threads complete
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open 
>> '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of 
>> '(null)'
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open 
>> '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of 
>> '(null)'
>> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
>> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU 
>> Generic plugin
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres 
>> shutting down ...
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature 
>> plugin unloaded
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>> [2021-04-14T17:31:42.321] debug:  Log file re-opened
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8 
>> CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file 
>> /etc/slurm/cgroup.conf
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file 
>> (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8 
>> CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml 
>> functionality, but we weren't able to find that lib when Slurm was 
>> configured.
>>
>>
>>
>> gres.conf (just AutoDetect=nvml)
>> ➜  ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> # not working, slurm cannot find nvml
>> AutoDetect=nvml
>> #Name=gpu File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>>
>> slurm.conf
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>>
>> ## We don't want a node to go back in pool without sys admin acknowledgement
>> ReturnToService=0
>>
>> ## Basic scheduling
>> #SelectType=select/cons_res
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>> SchedulerType=sched/backfill
>>
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 
>> RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>>
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE 
>> State=UP Nodes=nodeGPU01  Default=YES
>> PartitionName=cpu OverSubscribe=FORCE Ma

Re: [slurm-users] GRES Restrictions

2021-04-15 Thread Stefan Staeglich
Hello,

Is there a best practice for activating this feature (setting 
ConstrainDevices=yes)? Do I have to restart the slurmds? Does this affect 
running jobs?

We are using Slurm 19.05.
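
For context, the change I have in mind is just adding ConstrainDevices to
cgroup.conf on the compute nodes, roughly like this (a sketch; the other lines
are only typical examples and may differ from an existing cgroup.conf):

 # /etc/slurm/cgroup.conf
 CgroupAutomount=yes
 ConstrainCores=yes
 ConstrainRAMSpace=yes
 ConstrainDevices=yes    # the new setting in question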

Best,
Stefan

On Tuesday, 25 August 2020 at 17:24:41 CEST, Christoph Brüning wrote:
> Hello,
> 
> we're using cgroups to restrict access to the GPUs.
> 
> What I found particularly helpful are the slides by Marshall Garey from
> last year's Slurm User Group Meeting:
> https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf
> (NVML didn't work for us for some reason I cannot recall, but listing
> the GPU device files explicitly was not a big deal)
> 
> Best,
> Christoph
> 
> On 25/08/2020 16.12, Willy Markuske wrote:
> > Hello,
> > 
> > I'm trying to restrict access to gpu resources on a cluster I maintain
> > for a research group. There are two nodes put into a partition with gres
> > gpu resources defined. User can access these resources by submitting
> > their job under the gpu partition and defining a gres=gpu.
> > 
> > When a user includes the flag --gres=gpu:# they are allocated the number
> > of gpus and slurm properly allocates them. If a user requests only 1 gpu
> > they only see CUDA_VISIBLE_DEVICES=1. However, if a user does not
> > include the --gres=gpu:# flag they can still submit a job to the
> > partition and are then able to see all the GPUs. This has led to some
> > bad actors running jobs on all GPUs that other users have allocated and
> > causing OOM errors on the gpus.
> > 
> > Is it possible, and where would I find the documentation on doing so, to
> > require users to define a --gres=gpu:# to be able to submit to a
> > partition? So far reading the gres documentation doesn't seem to have
> > yielded any word on this issue specifically.
> > 
> > Regards,


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-8222






Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-15 Thread Ole Holm Nielsen

Hi Thomas,

I wonder if your problem is related to that reported in this list thread?
https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html

You could try to restart the slurmctld service, and also make sure your 
configuration (slurm.conf etc.) has been pushed correctly to the slurmd nodes.
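
For example, something along these lines (a rough sketch; adjust the host names
and use whatever parallel shell tool you have, e.g. clush or pdsh):

 $ sudo systemctl restart slurmctld                      # on the head node
 $ md5sum /etc/slurm/slurm.conf                          # head node checksum
 $ clush -w node[001-099] md5sum /etc/slurm/slurm.conf   # compare on the slurmd nodes
 $ sudo scontrol reconfigure                             # have the daemons re-read slurm.conf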


/Ole

On 4/14/21 9:53 AM, Thomas Arildsen wrote:

Oh and I forgot to mention that we are using Slurm version 20.11.3.
Best,

Thomas

ons, 14 04 2021 kl. 09:23 +0200, skrev Thomas Arildsen:

I administer a Slurm cluster with many users, and the operation of the
cluster currently appears completely normal for all users except one.
This one user gets every attempt to run a command through Slurm killed
after 20-25 seconds (I think the cause is another job rather than the
elapsed time itself; see further down).
The following minimal example reproduces the error:

 $ sudo -u  srun --pty sleep 25
 srun: job 110962 queued and waiting for resources
 srun: job 110962 has been allocated resources
 srun: Force Terminated job 110962
 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
 slurmstepd: error: *** STEP 110962.0 ON  CANCELLED AT 2021-04-09T16:33:35 ***
 srun: error: : task 0: Terminated

When this happens, I find this line in the slurmctld log:

 _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid


It only happens for '' and not for any other user that I know
of. This very similar but shorter-running example works fine:

 $ sudo -u  srun --pty sleep 20
 srun: job 110963 queued and waiting for resources
 srun: job 110963 has been allocated resources

Note that when I run srun --pty sleep 20 as myself, srun does not
output the two srun: job... lines. This seems to me to be an additional
indication that srun is subject to some different settings for
''.
All settings that I have been able to inspect appear identical for
'' as for other users. I have checked, and 'MaxWall' is not
set for this user and not for any other user, either. Other users
belonging to the same Slurm account do not experience this problem.

When this unfortunate user's jobs get allocated, I see messages like
this in
'/var/log/slurm/slurmctld.log':

 sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=

and shortly after, I see this message:

 select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0

Job 110722_* is a pending array job by another user that is pending due
to 'QOSMaxGRESPerUser'. One pending part of this array job (110722_57)
eventually ends up taking over job 111855's CPU cores when 111855 gets
killed. This leads me to believe that 110722_57 causes 111855 to be
killed. However, 110722_57 remains pending afterwards.
Some of the things I fail to understand here are:
   - Why does a pending job kill another job, yet remain pending afterwards?
   - Why does the pending job even have privileges to kill another job in the first place?
   - Why does this only affect ''s jobs but not those of other users?

None of this is intended to happen. I am guessing it must be caused by
some settings specific to '', but I cannot figure out what they
are; if we admins somehow caused such settings, it was unintended.

NB: some details have been anonymized as  above.

I hope someone has a clue what is going on here. Thanks in advance,

Thomas Arildsen