[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
Totally. That's why we are upgrading. SchedMD recommended that we update the
OS first, then Slurm, so while testing RHEL 9 we are hitting this roadblock.

On Fri, Sep 27, 2024 at 12:12 PM Davide DelVento 
wrote:

> Slurm 18? Isn't that a bit outdated?
>
> On Fri, Sep 27, 2024 at 9:41 AM Robert Kudyba via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> We're in the process of upgrading, but first we're moving to RHEL 9. My
>> attempt to compile with rpmbuild -v -ta --define "_lto_cflags %{nil}"
>> slurm-18.08.9.tar.bz2 (H/T to Brian for this flag
>> <https://groups.google.com/g/slurm-users/c/W8YfGIn1rDI/m/4bsSAoqZAAAJ>)
>> fails with the scancel error below. I've stumped Google and the Slurm
>> mailing list, so I'm hoping someone here knows of a workaround.
>>
>> /bin/ld:
>> opt.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78:
>> multiple definition of `opt';
>> scancel.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78:
>> first defined here
>> collect2: error: ld returned 1 exit status
>> make[3]: *** [Makefile:577: scancel] Error 1
>> make[3]: Leaving directory
>> '/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel'
>> make[2]: *** [Makefile:563: all-recursive] Error 1
>> make[2]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9/src'
>> make[1]: *** [Makefile:690: all-recursive] Error 1
>> make[1]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9'
>> make: *** [Makefile:589: all] Error 2
>> error: Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build)
>>
>>
>> RPM build errors:
>> Macro expanded in comment on line 22: %_prefix path install path for
>> commands, libraries, etc.
>>
>> line 70: It's not recommended to have unversioned Obsoletes:
>> Obsoletes: slurm-lua slurm-munge slurm-plugins
>> Macro expanded in comment on line 158: %define
>> _unpackaged_files_terminate_build  0
>>
>> line 224: It's not recommended to have unversioned Obsoletes:
>> Obsoletes: slurm-sql
>> line 256: It's not recommended to have unversioned Obsoletes:
>> Obsoletes: slurm-sjobexit slurm-sjstat slurm-seff
>> line 275: It's not recommended to have unversioned Obsoletes:
>> Obsoletes: pam_slurm
>> Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build)
>>
>> #!/bin/sh
>>
>>   RPM_SOURCE_DIR="/root"
>>   RPM_BUILD_DIR="/root/rpmbuild/BUILD"
>>   RPM_OPT_FLAGS="-O2  -fexceptions -g -grecord-gcc-switches -pipe -Wall
>> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
>> "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64
>> -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables
>> -fstack-clash-protection -fcf-protection"
>>   RPM_LD_FLAGS="-Wl,-z,relro -Wl,--as-needed  "-Wl,-z,lazy"
>> -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 "
>>   RPM_ARCH="x86_64"
>>   RPM_OS="linux"
>>   RPM_BUILD_NCPUS="48"
>>   export RPM_SOURCE_DIR RPM_BUILD_DIR RPM_OPT_FLAGS RPM_LD_FLAGS RPM_ARCH
>> RPM_OS RPM_BUILD_NCPUS RPM_LD_FLAGS
>>   RPM_DOC_DIR="/usr/share/doc"
>>   export RPM_DOC_DIR
>>   RPM_PACKAGE_NAME="slurm"
>>   RPM_PACKAGE_VERSION="18.08.9"
>>   RPM_PACKAGE_RELEASE="1.el9"
>>   export RPM_PACKAGE_NAME RPM_PACKAGE_VERSION RPM_PACKAGE_RELEASE
>>   LANG=C
>>   export LANG
>>   unset CDPATH DISPLAY ||:
>>   RPM_BUILD_ROOT="/root/rpmbuild/BUILDROOT/slurm-18.08.9-1.el9.x86_64"
>>   export RPM_BUILD_ROOT
>>
>>
>> PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:/usr/lib64/pkgconfig:/usr/share/pkgconfig"
>>   export PKG_CONFIG_PATH
>>   CONFIG_SITE=${CONFIG_SITE:-NONE}
>>   export CONFIG_SITE
>>
>>   set -x
>>   umask 022
>>   cd "/root/rpmbuild/BUILD"
>> cd 'slurm-18.08.9'
>>
>>
>>   CFLAGS="${CFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe
>> -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
>> -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy"
>> -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64-v2
>> -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection
>> -fcf-protection}" ; export CFLAGS ;
>>   CXXFLAGS="${CXXFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe
>> -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
>> -Wp

[slurm-users] errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
We're in the process of upgrading, but first we're moving to RHEL 9. My
attempt to compile with rpmbuild -v -ta --define "_lto_cflags %{nil}"
slurm-18.08.9.tar.bz2 (H/T to Brian for this flag) fails with the scancel
error below. I've stumped Google and the Slurm mailing list, so I'm hoping
someone here knows of a workaround.

/bin/ld:
opt.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78:
multiple definition of `opt';
scancel.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78:
first defined here
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:577: scancel] Error 1
make[3]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel'
make[2]: *** [Makefile:563: all-recursive] Error 1
make[2]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9/src'
make[1]: *** [Makefile:690: all-recursive] Error 1
make[1]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9'
make: *** [Makefile:589: all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build)
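
One workaround reported for this class of linker error, assuming it stems
from the -fno-common default of the GCC 11 toolchain on RHEL 9 (a sketch,
not verified against this exact build), is to put -fcommon back into the
build flags so the duplicate tentative definitions of `opt' can merge:

# hedged sketch: append -fcommon to the distro optflags for this build only
rpmbuild -v -ta \
    --define "_lto_cflags %{nil}" \
    --define "optflags $(rpm -E '%optflags') -fcommon" \
    slurm-18.08.9.tar.bz2

Newer Slurm releases build cleanly here because the duplicate-definition
issue was fixed upstream, so this is only a stopgap for 18.08.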


RPM build errors:
Macro expanded in comment on line 22: %_prefix path install path for
commands, libraries, etc.

line 70: It's not recommended to have unversioned Obsoletes: Obsoletes:
slurm-lua slurm-munge slurm-plugins
Macro expanded in comment on line 158: %define
_unpackaged_files_terminate_build  0

line 224: It's not recommended to have unversioned Obsoletes:
Obsoletes: slurm-sql
line 256: It's not recommended to have unversioned Obsoletes:
Obsoletes: slurm-sjobexit slurm-sjstat slurm-seff
line 275: It's not recommended to have unversioned Obsoletes:
Obsoletes: pam_slurm
Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build)

#!/bin/sh

  RPM_SOURCE_DIR="/root"
  RPM_BUILD_DIR="/root/rpmbuild/BUILD"
  RPM_OPT_FLAGS="-O2  -fexceptions -g -grecord-gcc-switches -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
"-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64
-march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables
-fstack-clash-protection -fcf-protection"
  RPM_LD_FLAGS="-Wl,-z,relro -Wl,--as-needed  "-Wl,-z,lazy"
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 "
  RPM_ARCH="x86_64"
  RPM_OS="linux"
  RPM_BUILD_NCPUS="48"
  export RPM_SOURCE_DIR RPM_BUILD_DIR RPM_OPT_FLAGS RPM_LD_FLAGS RPM_ARCH
RPM_OS RPM_BUILD_NCPUS RPM_LD_FLAGS
  RPM_DOC_DIR="/usr/share/doc"
  export RPM_DOC_DIR
  RPM_PACKAGE_NAME="slurm"
  RPM_PACKAGE_VERSION="18.08.9"
  RPM_PACKAGE_RELEASE="1.el9"
  export RPM_PACKAGE_NAME RPM_PACKAGE_VERSION RPM_PACKAGE_RELEASE
  LANG=C
  export LANG
  unset CDPATH DISPLAY ||:
  RPM_BUILD_ROOT="/root/rpmbuild/BUILDROOT/slurm-18.08.9-1.el9.x86_64"
  export RPM_BUILD_ROOT


PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:/usr/lib64/pkgconfig:/usr/share/pkgconfig"
  export PKG_CONFIG_PATH
  CONFIG_SITE=${CONFIG_SITE:-NONE}
  export CONFIG_SITE

  set -x
  umask 022
  cd "/root/rpmbuild/BUILD"
cd 'slurm-18.08.9'


  CFLAGS="${CFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
"-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64
-march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables
-fstack-clash-protection -fcf-protection}" ; export CFLAGS ;
  CXXFLAGS="${CXXFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe
-Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy"
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64-v2
-mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection
-fcf-protection}" ; export CXXFLAGS ;
  FFLAGS="${FFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
"-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64
-march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables
-fstack-clash-protection -fcf-protection -I/usr/lib64/gfortran/modules}" ;
export FFLAGS ;
  FCFLAGS="${FCFLAGS:--O2  -fexceptions -g -grecord-gcc-switches -pipe
-Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy"
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64-v2
-mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection
-fcf-protection -I/usr/lib64/gfortran/modules}" ; export FCFLAGS ;
  LDFLAGS="${LDFLAGS:--Wl,-z,relro -Wl,--as-needed  "-Wl,-z,lazy"
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 }" ; export LDFLAGS ;
  LT_SYS_LIBRARY_PATH="${LT_SYS_LIBRARY_PATH:-/usr/lib64:}" ; export
LT_SYS_LIBRARY_PATH ;
  CC="${CC:-gcc}" ; export CC ;
  CXX="${CXX:-g++}" ; export CXX;
  [ ""x != x ] &&
  for file in $(find . -type f -name configure -print); do
/usr/bin/sed -r --in-place=.backup 's/^char \(\*f\) \(\) =
/__attribute__ ((used)) char (*f) () = /g' $file;
diff -u $file.backup $file && mv $file.backup $file
/

[slurm-users] sreport syntax for TRES/GPU usage

2024-08-16 Thread Robert Kudyba via slurm-users
In a 25-node heterogeneous cluster with 4 different types of GPUs, to get
granular enough to see which GPUs were used most over a time period, we have
to set AccountingStorageTRES to something like:
AccountingStorageTRES=gres/gpu,gres/gpu:rtx8000,gres/gpu:v100s,gres/gpu:a40,gres/gpu:a100

Unfortunately it's currently at:
AccountingStorageTRES=gres/gpu

At least the GPUs within each node are all the same type. What are some good
options for sreport to get details on usage over a year, e.g., percentage of
CPU vs. GPU usage, or which partitions/accounts used the most GPUs?

From this example:
sreport -tminper -t Percent cluster utilization --tres="cpu,gres/gpu"
start=2023-07-01

Cluster Utilization 2023-07-01T00:00:00 - 2024-08-15T23:59:59
Usage reported in Percentage of Total

  Cluster  TRES Name   Allocated   Down  PLND Down    Idle  Reserved  Reported
  -------  ---------   ---------  -----  ---------  ------  --------  --------
  cluster  cpu            43.81%  2.87%      0.00%  48.35%     4.97%    99.86%
  cluster  gres/gpu       50.36%  3.59%      0.00%  46.05%     0.00%   100.38%

Is that showing that 50% of all jobs were run with GPUs? How do we read the
Idle column? Why does Reported show > 100% for gres?
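
For the per-account and per-user breakdown, commands along these lines
should work once the typed gres entries are being recorded (a sketch; dates
and TopCount are illustrative):

# GPU-hours and CPU-hours by account/user over the year
sreport -t Hours -T cpu,gres/gpu cluster AccountUtilizationByUser \
    start=2023-07-01 end=2024-07-01

# top GPU consumers
sreport -t Hours -T gres/gpu user TopUsage TopCount=20 \
    start=2023-07-01 end=2024-07-01

Per-partition usage isn't something sreport breaks out directly; sacct with
--partition plus a TRES format field is the usual fallback for that.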

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
Thanks Ben, but there's no mention of SINGULARITYENV_SLURM_CONF on that
page. Slurm is not in the container either, so we're trying to get mpirun
from the host to run inside the container.
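
That said, going by the SINGULARITYENV_ prefix pass-through described on the
Apptainer page, the usage would presumably be along these lines (a sketch
with our path; the container name is only illustrative):

export SINGULARITYENV_SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
singularity exec --bind /cm/shared mycontainer.sif printenv SLURM_CONF

i.e. whatever follows the prefix should show up as SLURM_CONF inside the
container.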

On Wed, Jul 3, 2024, 11:30 AM Benjamin Smith  wrote:

> On 03/07/2024 16:03, Robert Kudyba via slurm-users wrote:
>
> In https://support.schedmd.com/show_bug.cgi?id=9282#c6
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D9282-23c6&d=DwMFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=5hJxdWMig3ZAd7ryzPLIeuycWSWxc7C12VDuDRXdUfEGni-pnpKj_3eGOBZad2p8&s=8GLAIOdGzghsSwpL3y1O1hyHNyi9YVppnF3mcaYbFCs&e=>
> Tim mentioned this env variable SINGULARITYENV_SLURM_CONF, what is the
> usage/syntax for it? I can't find any reference to this. I'm running into
> the same issue mentioned there.
>
> That's an Apptainer (formerly singularity) feature. See
> https://apptainer.org/docs/user/1.3/singularity_compatibility.html#singularity-environment-variable-compatibility
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__apptainer.org_docs_user_1.3_singularity-5Fcompatibility.html-23singularity-2Denvironment-2Dvariable-2Dcompatibility&d=DwMFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=5hJxdWMig3ZAd7ryzPLIeuycWSWxc7C12VDuDRXdUfEGni-pnpKj_3eGOBZad2p8&s=CW6hvxscvCDcznEdIfImhlTxyigqJD3oaLOBgG4m9Qk&e=>
> . So it should be setting SLURM_CONF inside the container.
>
>
>
> Thanks in advance!
>
>
>
>  --
> Benjamin Smith  
> Computing Officer, AT-7.12a
> Research and Teaching Unit
> School of Informatics, University of Edinburgh
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336. Is e buidheann carthannais a th’ ann an
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
In https://support.schedmd.com/show_bug.cgi?id=9282#c6 Tim mentioned this
env variable SINGULARITYENV_SLURM_CONF, what is the usage/syntax for it? I
can't find any reference to this. I'm running into the same issue mentioned
there.

Thanks in advance!

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-05 Thread Robert Kudyba via slurm-users
>
>
> Your bf_window may be too small.  From 'man slurm.conf':
>
>   bf_window=#
>
>  The number of minutes into the future to look when considering
>  jobs to schedule.  Higher values result in more overhead and
>  less responsiveness.  A value at least as long as the highest
>  allowed time limit is generally advisable to prevent job
>  starvation.  In order to limit the amount of data managed by
>  the backfill scheduler, if the value of bf_window is increased,
>  then it is generally advisable to also increase bf_resolution.
>  This option applies only to SchedulerType=sched/backfill.
>  Default: 1440 (1 day), Min: 1, Max: 43200 (30 days).
>

So since our longest allowed time limit is 5 days, should bf_window=7200?
And what should bf_resolution be set to then?

But how does this affect/improve wait times?
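
For reference, a slurm.conf sketch of what that could look like for a 5-day
maximum time limit (values illustrative, not a tested recommendation):

# slurm.conf (illustrative)
SchedulerType=sched/backfill
SchedulerParameters=bf_window=7200,bf_resolution=600,bf_continue

bf_window is in minutes (7200 = 5 days) and bf_resolution in seconds, so the
idea is to widen the planning window to cover the longest job while
coarsening the resolution to keep each backfill pass cheap.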



>
> >  On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski 
> wrote:
> >
> >  This is relatively true of my system as well, and I believe it’s that
> the backfill scheduler is slower than the main scheduler.
> >
> >  --
> >  #BlackLivesMatter
> >  
> >  || \\UTGERS,
>  |---*O*---
> >  ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> >  || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
> Campus
> >  ||  \\of NJ  | Office of Advanced Research Computing - MSB A555B,
> Newark
> >   `'
> >
> >  On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
> >
> >  At the moment we have 2 nodes that are having long wait times.
> Generally this is when the nodes are fully allocated. What would be the
> other
> >  reasons if there is still enough available memory and CPU available,
> that a job would take so long? Slurm version is  23.02.4 via Bright
> >  Computing. Note the compute nodes have hyperthreading enabled but that
> should be irrelevant. Is there a way to determine what else could
> >  be holding jobs up?
> >
> >  srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p
> short /bin/bash
> >  srun: job 672204 queued and waiting for resources
> >
> >   scontrol show node node001
> >  NodeName=m001 Arch=x86_64 CoresPerSocket=48
> > CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
> > AvailableFeatures=location=local
> > ActiveFeatures=location=local
> > Gres=gpu:A6000:8
> > NodeAddr=node001 NodeHostName=node001 Version=23.02.4
> > OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14
> 12:42:38 EDT 2022
> > RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
> > State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
> > Partitions=ours,short
> > BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
> > LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
> > CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
> > AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
> > CapWatts=n/a
> > CurrentWatts=0 AveWatts=0
> > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> >  grep 672204 /var/log/slurmctld
> >  [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources
> JobId=672204 NodeList=(null) usec=852
> >
> >  --
> >  slurm-users mailing list -- slurm-users@lists.schedmd.com
> >  To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
> --
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT (ex-ZEDAT), Freie Universität Berlin
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
Thanks for the quick response Ryan!

Are there any recommendations for bf_ options from
https://slurm.schedmd.com/sched_config.html that could help with this?
bf_continue? Decreasing bf_interval= to a value lower than 30?

On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski 
wrote:

> This is relatively true of my system as well, and I believe it’s that the
> backfill scheduler is slower than the main scheduler.
>
> --
> #BlackLivesMatter
> 
> || \\UTGERS, |---*O*---
> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ  | Office of Advanced Research Computing - MSB
> A555B, Newark
>      `'
>
> On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
> At the moment we have 2 nodes that are having long wait times. Generally
> this is when the nodes are fully allocated. What would be the other reasons
> if there is still enough available memory and CPU available, that a
> job would take so long? Slurm version is  23.02.4 via Bright Computing.
> Note the compute nodes have hyperthreading enabled but that should be
> irrelevant. Is there a way to determine what else could be holding jobs up?
>
> srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p
> short /bin/bash
> srun: job 672204 queued and waiting for resources
>
>  scontrol show node node001
> NodeName=m001 Arch=x86_64 CoresPerSocket=48
>CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
>AvailableFeatures=location=local
>ActiveFeatures=location=local
>Gres=gpu:A6000:8
>NodeAddr=node001 NodeHostName=node001 Version=23.02.4
>OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38
> EDT 2022
>RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
>State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>Partitions=ours,short
>BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
>LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
>CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
>AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
>CapWatts=n/a
>CurrentWatts=0 AveWatts=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> grep 672204 /var/log/slurmctld
> [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources
> JobId=672204 NodeList=(null) usec=852
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>
>
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
At the moment we have 2 nodes that are having long wait times. Generally
this is when the nodes are fully allocated. If there is still enough memory
and CPU available, what other reasons could make a job wait so long? The
Slurm version is 23.02.4 via Bright Computing. Note the compute nodes have
hyperthreading enabled, but that should be irrelevant. Is there a way to
determine what else could be holding jobs up?

srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short
/bin/bash
srun: job 672204 queued and waiting for resources

 scontrol show node node001
NodeName=m001 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:A6000:8
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38
EDT 2022
   RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ours,short
   BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
   LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
   CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
   AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204
NodeList=(null) usec=852
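
So far the only generic checks I know of for a specific pending job are
along these lines (job ID from the srun above):

squeue -j 672204 --Format=JobID,Partition,State,Reason,PriorityLong
scontrol show job 672204 | grep -E 'JobState|Reason|Priority|TRES'
sprio -j 672204 -l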

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-05-13 Thread Robert Kudyba via slurm-users
Thanks for the reply, Luke. I also found that with Bright they have a file
called /etc/security/pam_bright.d/pam_whitelist.conf that can be used to
allow access.
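
For the archive, the plain pam_access route Luke pointed to boils down to
entries along these lines in /etc/security/access.conf (the group name is
just a placeholder), added while the node drains and reverted after the
reboot:

# /etc/security/access.conf (illustrative; 'hpcadmins' is a placeholder group)
+ : root : ALL
+ : (hpcadmins) : ALL
- : ALL : ALL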

On Thu, May 9, 2024 at 5:10 AM Luke Sudbery  wrote:

> Draining a node will not stop someone logging on via pam_slurm_adopt.
>
>
>
> If they have a running job, and can log on when the node is not draining,
> then they can log on when it is draining.
>
>
>
> If they don’t have a running job, they can’t log on whether it is draining
> or not.
>
>
>
> If you want people to be able to log on when they don’t have a job
> running, you could put them in a group which is given access in access.conf
> and PAM, as explained here:
> https://slurm.schedmd.com/pam_slurm_adopt.html#admin_access
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_pam-5Fslurm-5Fadopt.html-23admin-5Faccess&d=DwMGaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=930NtoLMP-HvoNP-dfQ9jhRtE5LJnxRDm9D7MJkOJnZQJRNbHHXjsP41nIQyfBxL&s=4p4zui4pf8xYjAj48y_0dCLnMEudAClm-bNhCYct-ZM&e=>
>
>
>
> Cheers,
>
>
>
> Luke
>
>
>
> --
>
> Luke Sudbery
>
> Principal Engineer (HPC and Storage).
>
> Architecture, Infrastructure and Systems
>
> Advanced Research Computing, IT Services
>
> Room 132, Computer Centre G5, Elms Road
>
>
>
> *Please note I don’t work on Monday.*
>
>
>
> *From:* Robert Kudyba via slurm-users 
> *Sent:* Friday, April 19, 2024 9:17 PM
> *To:* Slurm User Community List 
> *Subject:* [slurm-users] any way to allow interactive jobs or ssh in
> Slurm 23.02 when node is draining?
>
>
>
>
> We use Bright Cluster Manager with Slurm 23.02 on RHEL9. I know about
> pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_pam-5Fslurm-5Fadopt.html&d=DwMGaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=930NtoLMP-HvoNP-dfQ9jhRtE5LJnxRDm9D7MJkOJnZQJRNbHHXjsP41nIQyfBxL&s=Kch4xC6o-kw7TW21LcDVPMjH1a0Zl7TL1l8FiTdvLyI&e=>
> which does not appear to come by default with the Bright 'cm' package of
> Slurm.
>
>
>
> Currently ssh to a node gets:
>
> Login not allowed: no running jobs and no WLM allocations
>
>
>
> We have 8 GPUs on a node so when we drain a node, which can have up to a 5
> day job, no new jobs can run. And since we have 20+ TB (yes TB) local
> drives, researchers have their work and files on them to retrieve.
>
>
>
> Is there a way to use /etc/security/access.conf to work around this at
> least temporarily until the reboot and then we can revert?
>
>
>
> Thanks!
>
>
>
> Rob
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-04-19 Thread Robert Kudyba via slurm-users
We use Bright Cluster Manager with Slurm 23.02 on RHEL9. I know about
pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does
not appear to come by default with the Bright 'cm' package of Slurm.

Currently ssh to a node gets:
Login not allowed: no running jobs and no WLM allocations

We have 8 GPUs per node, so when we drain a node (jobs there can run for up
to 5 days) no new jobs can start on it. And since the nodes have 20+ TB
(yes, TB) local drives, researchers have work and files on them to retrieve.

Is there a way to use /etc/security/access.conf to work around this at
least temporarily until the reboot and then we can revert?

Thanks!

Rob

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
On Bright it's set in a few places:
grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/logrotate.d/slurmdbd.rpmsave:
 SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/logrotate.d/slurm.rpmsave:
 SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/pull.pl:$ENV{'SLURM_CONF'} =
'/cm/shared/apps/slurm/var/etc/slurm/slurm.conf';

It'd still be good to check on a compute node what echo $SLURM_CONF returns
for you.

On Fri, Apr 19, 2024 at 1:50 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> I would double-check where you are setting SLURM_CONF then. It is acting
> as if it is not set (typo maybe?)
>
> It should be in /etc/defaults/slurmd (but could be /etc/sysconfig/slurmd).
>
> Also check what the final, actual command being run to start it is. If
> anyone has changed the .service file or added an override file, that will
> affect things.
>
> Brian Andrus
>
>
> On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
>
> I like it, however, it was working before without a slurm.conf in
> /etc/slurm.
>
> Plus the environment variable SLURM_CONF is pointing to the correct
> slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?
>
> Thanks!
>
> Jeff
>
>
> On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> This is because you have no slurm.conf in /etc/slurm, so it it is trying
>> 'configless' which queries DNS to find out where to get the config. It is
>> failing because you do not have DNS configured to tell nodes where to ask
>> about the config.
>>
>> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>>
>> Brian Andrus
>> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>>
>> Good afternoon,
>>
>> I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10 (Base
>> Command Manager which is based on Bright Cluster Manager). I ran into an
>> error and only just learned that Slurm and Weka don't get along (presumably
>> because Weka pins their client threads to cores). I read through their
>> documentation here:
>> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
>> 
>>
>> I thought I set everything correctly, but when I try to restart the slurm
>> server I get the following:
>>
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
>> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
>> fetch_config: DNS SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
>> _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd
>> initialization failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
>> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS
>> SRV lookup failed
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
>> _establish_configuration: failed to load configs
>> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd
>> initialization failed
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process
>> exited, code=exited, status=1/FAILURE
>> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with
>> result 'exit-code'.
>>
>> Has anyone encountered this?
>>
>> I read this is usually associated with configless Slurm, but I don't know
>> how Slurm is built in BCM. slurm.conf is located in
>> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
>> environment variables for Slurm are set correctly so it points to this
>> slurm.conf file.
>>
>> One thing that I did not do was tell Slurm which cores Weka was using. I
>> can't seem to figure out the syntax for this. Can someone share the changes
>> they made to slurm.conf?
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
>
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
>
For Bright, slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm, including
on all nodes. Make sure that on the compute nodes $SLURM_CONF resolves to
the correct path.
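
A quick way to confirm (node name illustrative):

ssh node001 'systemctl show slurmd -p Environment'
ssh node001 'grep -rs SLURM_CONF /etc/systemd/system/slurmd.service.d /etc/sysconfig/slurmd /etc/default/slurmd'

If neither shows the /cm/shared path, slurmd falls back to /etc/slurm or to
configless mode, which would match the DNS SRV errors quoted below.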



> On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
>
> Good afternoon,
>
> I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10 (Base
> Command Manager which is based on Bright Cluster Manager). I ran into an
> error and only just learned that Slurm and Weka don't get along (presumably
> because Weka pins their client threads to cores). I read through their
> documentation here:
> https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
> 
>
> I thought I set everything correctly, but when I try to restart the slurm
> server I get the following:
>
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> fetch_config: DNS SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error:
> _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd
> initialization failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
> resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS
> SRV lookup failed
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error:
> _establish_configuration: failed to load configs
> Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd
> initialization failed
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process
> exited, code=exited, status=1/FAILURE
> Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with
> result 'exit-code'.
>
> Has anyone encountered this?
>
> I read this is usually associated with configless Slurm, but I don't know
> how Slurm is built in BCM. slurm.conf is located in
> /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
> environment variables for Slurm are set correctly so it points to this
> slurm.conf file.
>
> One thing that I did not do was tell Slurm which cores Weka was using. I
> can't seem to figure out the syntax for this. Can someone share the changes
> they made to slurm.conf?
>
> Thanks!
>
> Jeff
>
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Now what would be causing this? The srun just hangs and these are the only
logs from slurmctld:
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node
node007
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node
node006
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node
node005
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node
node009
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node
node008

[2024-02-24T23:43:21.183] _slurm_rpc_complete_job_allocation: JobId=563
error Job/step already completing or completed

[465.extern] error: common_file_write_content: unable to open
'/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze'
for writing: Permission denied

On Sat, Feb 24, 2024 at 12:09 PM Robert Kudyba  wrote:

>
>
> Ah yes thanks for pointing that out. Hope this helps someone down the
> line...perhaps the error detection could be more explicit in slurmctld?
>
> On Sat, Feb 24, 2024, 12:07 PM Chris Samuel via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>>
>> > For now I just set it to chmod 777 on /tmp and that fixed the errors.
>> Is
>> > there a better option?
>>
>> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
>> sticky bit, originally invented to indicate that the OS should attempt
>> to keep a frequently used binary in memory but then adopted to indicate
>> special handling of a world writeable directory so users can only unlink
>> objects they own and not others).
>>
>> Hope that helps!
>>
>> All the best,
>> Chris
>> --
>> Chris Samuel  :
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.csamuel.org_&d=DwICAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=1dr8K8YEcCyc4UDmIvmXWNuOled6fEZ424zSwluePPfhXD2Q5JVklrCrDUQU-mSW&s=ZbSiWLCu-81ZY1xhscjqczszYgOmqxUbVa6f2qUEd-o&e=
>>  :  Berkeley, CA, USA
>>
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users

> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>
> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> > there a better option?
>
> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
> sticky bit, originally invented to indicate that the OS should attempt
> to keep a frequently used binary in memory but then adopted to indicate
> special handling of a world writeable directory so users can only unlink
> objects they own and not others).
>
> Hope that helps!
>
> All the best,
> Chris
> --
> Chris Samuel  :
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.csamuel.org_&d=DwICAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=1dr8K8YEcCyc4UDmIvmXWNuOled6fEZ424zSwluePPfhXD2Q5JVklrCrDUQU-mSW&s=ZbSiWLCu-81ZY1xhscjqczszYgOmqxUbVa6f2qUEd-o&e=
>  :  Berkeley, CA, USA
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users

> Hi Robert,
>
> On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
>
> > We switched over from using systemctl for tmp.mount and change to zram,
> > e.g.,
> > modprobe zram
> > echo 20GB > /sys/block/zram0/disksize
> > mkfs.xfs /dev/zram0
> > mount -o discard /dev/zram0 /tmp
> [...]
>  > [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward:
> failed to create temporary XAUTHORITY file: Permission denied
>
> Where do you set the permissions on /tmp ?  What do you set them to?
>
> All the best,
> Chris
> --
> Chris Samuel  :
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.csamuel.org_&d=DwICAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=dmeaMvnkyzcOflY8XQKXwHbYw7wooGy71JGyj1fwEKHls6zdAR5Q2C5DxN-CFzsa&s=REC8OGrY-7z6qJAyYetQhVU6LQdDBV6ajjKgtqH0_jU&e=
>  :  Berkeley, CA, USA
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of JobId=

2024-02-23 Thread Robert Kudyba via slurm-users
We switched over from using systemctl for tmp.mount and changed to zram,
e.g.,
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp

srun with --x11 was working before changing this. We're on RHEL 9.

slurmctld logs show this whenever --x11 is used with srun:
[2024-02-23T20:22:43.442] [529.extern] error: setup_x11_forward: failed to
create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:22:43.442] [529.extern] error: x11 port forwarding setup
failed
[2024-02-23T20:22:43.442] error: _forkexec_slurmstepd: slurmstepd failed to
send return code got 0: Resource temporarily unavailable
[2024-02-23T20:22:43.443] Could not launch job 529 and not able to requeue
it, cancelling job
[2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to
create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:26:15.881] [530.extern] error: x11 port forwarding setup
failed
[2024-02-23T20:26:15.882] error: _forkexec_slurmstepd: slurmstepd failed to
send return code got 0: Resource temporarily unavailable
[2024-02-23T20:26:15.883] Could not launch job 530 and not able to requeue
it, cancelling job

slurmd log entries from a node:
[2024-02-23T20:26:15.859] sched: _slurm_rpc_allocate_resources JobId=530
NodeList=2402-node005 usec=1800
[2024-02-23T20:26:15.882] _slurm_rpc_requeue: Requeue of JobId=530 returned
an error: Only batch jobs are accepted or processed
[2024-02-23T20:26:15.883] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=530
uid 0
[2024-02-23T20:26:15.962] _slurm_rpc_complete_job_allocation: JobId=530
error Job/step already completing or completed

srun -v --pty  -t 0-4:00 --x11 --mem=10g
srun: defined options
srun:  
srun: account : me
srun: mem : 10G
srun: nodelist: our-node
srun: pty :
srun: time: 04:00:00
srun: verbose : 1
srun: x11 : all
srun:  
srun: end of defined options
srun: Waiting for resource configuration
srun: error: Nodes our-node are still not ready
srun: error: Something is wrong with the boot of the nodes.

slurm.conf has PrologFlags=x11 set. /usr/bin/xauth is installed on each
compute node.

Is this a known issue with zram or is that just a red herring and there's
something else wrong?
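
For reference, the zram setup with the conventional /tmp mode restored would
look like this (a sketch; see the follow-up replies above about 1777):

modprobe zram
echo 20G > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
# a freshly made xfs root directory is 0755, so users can't create their
# XAUTHORITY files until the sticky world-writable mode is restored
chmod 1777 /tmp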

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-29 Thread Robert Kudyba
According to these links:
https://rpmfind.net/linux/rpm2html/search.php?query=slurm
https://src.fedoraproject.org/rpms/slurm

Why doesn't RHEL 8 get a newer version? Can someone ping the repo maintainer
Philip Kovacs <pk...@fedoraproject.org>? There was a ticket at
https://bugzilla.redhat.com/show_bug.cgi?id=1912491 but no movement on RHEL
8.


[slurm-users] JobState of RaisedSignal:53 Real-time_signal_19; slurm 23.02.4

2023-11-10 Thread Robert Kudyba
The user is launching a Singularity container for RStudio, and the path
passed to the final option, --rsession-path, does not exist.

scontrol show job 420719
JobId=420719 JobName=r2.sbatch
  UserId=ouruser(552199) GroupId=user(500) MCS_label=N/A
  Priority=1428 Nice=0 Account=ouracct QOS=xxx
  JobState=FAILED Reason=RaisedSignal:53(Real-time_signal_19)
Dependency=(null)

From slurmctld.log:
[2023-11-10T11:40:20.569] _slurm_rpc_submit_batch_job: JobId=420719
InitPrio=1428 usec=272
[2023-11-10T11:40:20.973] sched: Allocate JobId=420719 NodeList=node001
#CPUs=2 Partition=xxx
[2023-11-10T11:40:21.143] _job_complete: JobId=420719 WTERMSIG 53
[2023-11-10T11:40:21.144] _job_complete: JobId=420719 done

I think I may know the reason, but wanted to see if this error meant
something else. Here is the snippet in the sbatch file:

singularity exec --cleanenv rstudio_4.2.sif \
  /usr/lib/rstudio-server/bin/rserver --www-port ${PORT} \
  --auth-none=0 \
  --auth-pam-helper-path=pam-helper \
  --auth-stay-signed-in-days=30 \
  --auth-timeout-minutes=0 \
  --rsession-path=/path/to/4.2/rsession.sh
Thanks.
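
One quick sanity check, using the same container and path as in the snippet
(binds may need adjusting so the host path is visible inside the container):

singularity exec --cleanenv rstudio_4.2.sif ls -l /path/to/4.2/rsession.sh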


[slurm-users] Slurm 20.11.3, Suspended new connections while processing backlog filled /

2021-03-10 Thread Robert Kudyba
I see there is this exact issue
https://githubmemory.com/repo/dun/munge/issues/94. We are on Slurm 20.11.3
on Bright Cluster 8.1 on Centos 7.9

I found hundreds of these logs in slurmctld:
error: slurm_accept_msg_conn: Too many open files in system

Then in munged.log:
Suspended new connections while processing backlog

Also in slurmctld.log:
Mar 7 15:40:21 node003 nslcd[7941]: [18ed80]  failed
to bind to LDAP server ldaps://ldapserver/: Can't contact LDAP server:
Connection timed out
Mar 7 15:40:21 node003 nslcd[7941]: [18ed80]  no
available LDAP server found: Can't contact LDAP server: Connection timed out
Mar 7 15:40:30 node001 nslcd[8838]: [53fb78] 
connected to LDAP server ldaps://ldapserver/
Mar 7 15:40:30 node003 nslcd[7941]: [b82726]  no
available LDAP server found: Server is unavailable: Broken pipe
Mar 7 15:40:30 node003 nslcd[7941]: [b82726]  no
available LDAP server found: Server is unavailable: Broken pipe

So / was 100% full. Yes, we should've put /var on a separate partition.

As for file descriptor setting we have:
cat /proc/sys/fs/file-max
131072

Is there a way to avoid this in the future?
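
The knobs I'm aware of are the system-wide file table and the per-service
limits (values illustrative), plus keeping / and /var/log from filling up:

# /etc/sysctl.d/90-nofile.conf
fs.file-max = 1048576

# systemd drop-in, e.g. via 'systemctl edit slurmctld' (similarly for munge):
[Service]
LimitNOFILE=65536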


[slurm-users] Slurm upgrade to 20.11.3, slurmdbd still trying to start old version 20.02.3

2021-03-03 Thread Robert Kudyba
Slurmdbd has an issue; from the logs it is still trying to start the old
version:
[2021-01-22T14:17:18.430] MySQL server version is: 5.5.68-MariaDB
[2021-01-22T14:17:18.433] error: Database settings not recommended values:
innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
[2021-01-22T14:17:18.528] Accounting storage MYSQL plugin loaded
[2021-01-22T14:17:18.529] error: chdir(/var/log): Permission denied
[2021-01-22T14:17:18.529] chdir to /var/tmp

*[2021-01-22T14:17:18.531] slurmdbd version 20.02.3
started[2021-01-22T14:56:40.334] error: g_slurm_auth_unpack: remote
plugin_id 144 not found*
[2021-01-22T14:56:40.334] error: slurm_unpack_received_msg: Invalid
Protocol Version 9216 from uid=-1 from problem connection: Socket operation
on non-socket
[2021-01-22T14:56:40.334]* error: slurm_unpack_received_msg: Incompatible
versions of client and server code*
[2021-01-22T14:56:40.345] error: CONN:7 Failed to unpack SLURM_PERSIST_INIT
message
[2021-03-03T09:49:57.607] Terminate signal (SIGINT or SIGTERM) received
[2021-03-03T09:49:57.610] Unable to remove pidfile '/var/run/slurmdbd.pid':
Permission denied

But I know it's updated:
rpm -qa|grep slurmdbd
slurm20-slurmdbd-20.11.3-mybuild.x86_64

And the pid file is not there:
ls -l /var/run/slurmdbd.pid
ls: cannot access /var/run/slurmdbd.pid: No such file or directory

And on the service file:
cat /usr/lib/systemd/system/slurmdbd.service
[Unit]
RequiresMountsFor=/cm/shared
Description=Slurm DBD accounting daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurmdbd.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
*ExecStart=/cm/shared/apps/slurm/20.11.3/sbin/slurmdbd -D $SLURMDBD_OPTIONS*
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536

I reinstalled the slurmdbd RPM from the local file:
Dependencies Resolved

================================================================================
 Package             Arch     Version           Repository                                  Size
================================================================================
Reinstalling:
 slurm20-slurmdbd    x86_64   20.11.3-mybuild   /slurm20-slurmdbd-20.11.3-mybuild.x86_64   2.3 M

Transaction Summary
================================================================================
Reinstall  1 Package

Total size: 2.3 M
Installed size: 2.3 M
Is this ok [y/d/N]: y
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : slurm20-slurmdbd-20.11.3-mybuild.x86_64                    1/1
  Verifying  : slurm20-slurmdbd-20.11.3-mybuild.x86_64                    1/1

Installed:
  slurm20-slurmdbd.x86_64 0:20.11.3-mybuild

What did I miss? On the upgrade page I see this:
The libslurm.so version is increased every major release. So things like
MPI libraries with Slurm integration should be recompiled. Sometimes it
works to just symlink the old .so name(s) to the new one, but this has no
guarantee of working.

So I have this:
locate libslurm.so
/cm/shared/apps/slurm/20.11.3/lib64/libslurm.so
/cm/shared/apps/slurm/20.11.3/lib64/libslurm.so.36
/cm/shared/apps/slurm/20.11.3/lib64/libslurm.so.36.0.0

Is there some other place the old version is being referenced?
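
Checks I plan to run to pinpoint which binary systemd actually launches
(paths as in the unit file above):

systemctl cat slurmdbd | grep -E 'ExecStart|Environment'
/cm/shared/apps/slurm/20.11.3/sbin/slurmdbd -V
systemctl daemon-reload && systemctl restart slurmdbd && journalctl -u slurmdbd -n 20

One thing I notice is that the "slurmdbd version 20.02.3 started" line above
is timestamped 2021-01-22, well before the reinstall, so it may simply be an
old log entry rather than proof the old binary is still running.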


Re: [slurm-users] exempting a node from Gres Autodetect

2021-02-19 Thread Robert Kudyba
have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed
in 20.06.1

On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk  wrote:

> Hi all:
>
> (I hope plague and weather are being visibly less than maximally cruel
> to you all.)
>
> In short, I was trying to exempt a node from NVML Autodetect, and
> apparently introduced a syntax error in gres.conf.  This is not an
> urgent matter for us now, but I'm curious what went wrong.  Thanks for
> lending any eyes to this!
>
> More info:
>
> Slurm 20.02.6, CentOS 7.
>
> We've historically had only this in our gres.conf:
> AutoDetect=nvml
>
> Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
> NodeName entry (GPU models vary across them).
>
> I wanted to exempt one GPU node from the autodetect (was curious about
> the presence or absence of the GPU model subtype designation,
> e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
> after 'gres.conf' man page):
>
> AutoDetect=nvml
> NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
>
> I restarted slurmctld, then "scontrol reconfigure".  Each node got a
> fatal error parsing gres.conf, causing RPC failure between slurmctld
> and nodes, causing slurmctld to consider the nodes failed.
>
> Here's how it looked to slurmctld:
>
> [2021-02-04T13:36:30.482] backfill: Started JobId=1469772_3(1473148) in
> batch on ra3-6
> [2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6 RPC:REQUEST_PING
> : Communication connection failure
> [2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure of node
> ra3-6
>
> And to the slurmd's :
>
> [2021-02-04T15:14:50.730] Message aggregation disabled
> [2021-02-04T15:14:50.742] error: Parsing error at unrecognized key:
> AutoDetect
> [2021-02-04T15:14:50.742] error: Parse error in file
> /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off Name=gpu
> File=/dev/nvidia0"
> [2021-02-04T15:14:50.742] fatal: error opening/reading
> /var/lib/slurmd/conf-cache/gres.conf
>
> Reverting to the original, one-line gres.conf reverted the cluster to
> production state.
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center
> Enterprise IT Svcs, the University of Georgia
>
>
>


Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Robert Kudyba
You all might be interested in a patch to the SPEC file, to not make the
slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at
configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3

On Tue, Jan 26, 2021 at 3:17 PM Paul Raines 
wrote:

>
> You should check your jobs that allocated GPUs and make sure
> CUDA_VISIBLE_DEVICES is being set in the environment.  This is a sign
> you GPU support is not really there but SLURM is just doing "generic"
> resource assignment.
>
> I have both GPU and non-GPU nodes.  I build SLURM rpms twice. Once on a
> non-GPU node and use those RPMs to install on the non-GPU nodes. Then
> build
> again on the GPU node where CUDA is installed via the NVIDIA CUDA YUM repo
> rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
> nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the
> default
> RPM SPEC is needed.  I just run
>
>rpmbuild --tb slurm-20.11.3.tar.bz2
>
> You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see
> that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the
> GPU node.
>
> -- Paul Raines (
> https://urldefense.proofpoint.com/v2/url?u=http-3A__help.nmr.mgh.harvard.edu&d=DwIBAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=GNGEhyc3F2bEZxbHK93tumXk56f37DOl99aYsOeUVOE&s=ZuCDM15RrOpv2t-j8DywWrwpn86qa79eBuSPEs96SFo&e=
> )
>
>
>
> On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:
>
> > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
> >>  Personally, I think it's good that Slurm RPMs are now available through
> >>  EPEL, although I won't be able to use them, and I'm sure many people on
> >>  the list won't be able to either, since licensing issues prevent them
> from
> >>  providing support for NVIDIA drivers, so those of us with GPUs on our
> >>  clusters will still have to compile Slurm from source to include NVIDIA
> >>  GPU support.
> >
> > We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
> > The Slurm GPU documentation seems to be
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_gres.html&d=DwIBAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=GNGEhyc3F2bEZxbHK93tumXk56f37DOl99aYsOeUVOE&s=GxF9VoynMmgS3BBrWWsmPM1Itt0hshTIkGh3x4Xy3hA&e=
> > We don't seem to have any problems scheduling jobs on GPUs, even though
> our
> > Slurm RPM build host doesn't have any NVIDIA software installed, as
> shown by
> > the command:
> > $ ldconfig -p | grep libnvidia-ml
> >
> > I'm curious about Prentice's statement about needing NVIDIA libraries to
> be
> > installed when building Slurm RPMs, and I read the discussion in bug
> 9525,
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D9525&d=DwIBAg&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=GNGEhyc3F2bEZxbHK93tumXk56f37DOl99aYsOeUVOE&s=6GDTIFa-spnv8ZMtKsdwJaLreyZMX4T5EW3MnAX54iI&e=
> > from which it seems that the problem was fixed in 20.02.6 and 20.11.
> >
> > Question: Is there anything special that needs to be done when building
> Slurm
> > RPMs with NVIDIA GPU support?
> >
> > Thanks,
> > Ole
> >
> >
> >
>
>


Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-26 Thread Robert Kudyba
On Mon, Jan 25, 2021 at 6:36 PM Brian Andrus  wrote:

> Also, a plug for support contracts. I have been doing slurm for a very
> long while, but always encourage my clients to get a support contract.
> That is how SchedMD stays alive and we are able to have such a good
> piece of software. I see the cloud providers starting to build tools
> that will eventually obsolesce slurm for the cloud. I worry that there
> won't be enough paying customers for Tim to keep things running as well
> as he has. I'm pretty sure most folks that use slurm for any period of
> time have received more value than a small support contract would cost.
>

We considered this but we have a very small cluster. And when I reached out
for a quote, I was told "SchedMD has a MIN node count of 256 for $10K/yr".

Since we're using Bright Computing we've always had to ignore Slurm updates
from yum and compile our own version.
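
For anyone else doing the same, the ignore itself is just a repo-file line
(the section name may differ depending on which EPEL repo file is in use):

# /etc/yum.repos.d/epel.repo, under the [epel] section
exclude=slurm*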

Curious, which cloud provider scheduling tools do you see gaining traction?


Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-02 Thread Robert Kudyba
>
> been having the same issue with BCM, CentOS 8.2, BCM 9.0, Slurm 20.02.3. It
> seems to have started to occur when I enabled proctrack/cgroup and changed
> select/linear to select/cons_tres.
>
Our slurm.conf has the same setting:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

We enabled MPS too. Not sure if that's relevant.


> Are you using cgroup process tracking and have you manipulated the
> cgroup.conf file?
>
Here's what we have in ours:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MinKmemSpace=30
MaxKmemPercent=100
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

  Do jobs complete correctly when not cancelled?


Yes they do and canceling doesn't always result in a node draining.

So would this be a Slurm issue or Bright? I'm telling users to add 'sleep
60' as the last line in their sbatch files.


Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this
case it's an NFS share to the local RAID10 storage. Aren't there any other
settings that deal with this so the node doesn't drain?

On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon  wrote:

> That can help.  Usually this happens due to laggy storage the job is
> using taking time flushing the job's data.  So making sure that your
> storage is up, responsive, and stable will also cut these down.
>
> -Paul Edmon-
>
> On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> > I've seen where this was a bug that was fixed
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=
>
> > <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=
> > but this happens
> > occasionally still. A user cancels his/her job and a node gets
> > drained. UnkillableStepTimeout=120 is set in slurm.conf
> >
> > Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2
> >
> > Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
> > ExitCode 0
> > Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
> > update_node: node node001 reason set to: Kill task failed
> > update_node: node node001 state set to DRAINING
> > error: slurmd error running JobId=6908 on node(s)=node001: Kill task
> > failed
> >
> > update_node: node node001 reason set to: hung
> > update_node: node node001 state set to DOWN
> > update_node: node node001 state set to IDLE
> > error: Nodes node001 not responding
> >
> > scontrol show config | grep kill
> > UnkillableStepProgram   = (null)
> > UnkillableStepTimeout   = 120 sec
> >
> > Do we just increase the timeout value?
>
>


[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
I've seen where this was a bug that was fixed
https://bugs.schedmd.com/show_bug.cgi?id=3941 but this happens occasionally
still. A user cancels his/her job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf

Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
ExitCode 0
Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed

update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?
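
If increasing it is the way to go, I assume the change would look something
like this (a sketch only; 180 is an arbitrary example value, not a
recommendation):

# slurm.conf
UnkillableStepTimeout=180

# then push the change out (or restart slurmd on the nodes if needed)
scontrol reconfigure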


[slurm-users] MPS Count option clarification and TensorFlow 2/PyTorch greediness causing out of memory OOMs

2020-08-25 Thread Robert Kudyba
Comparing with the Slurm MPS configuration examples
(https://slurm.schedmd.com/gres.html), our gres.conf has this:
NodeName=node[001-003] Name=mps Count=400

What does "Count" really mean and how do you use this number?

From that web page you have:
"MPS configuration includes only the Name and Count parameters: The count
of gres/mps elements will be evenly distributed across all GPUs configured
on the node. This is similar to case 1, but places duplicate configuration
in the gres.conf file."

Also on that page there is this:
# Example 1 of gres.conf
# Configure support for four GPUs (with MPS)
AutoDetect=nvml
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
# Set gres/mps Count value to 100 on each of the 4 available GPUs
Name=mps Count=400

And then this (side note: the typo of "differernt" in the example):

# Example 2 of gres.conf
# Configure support for four *differernt *GPU types (with MPS)
AutoDetect=nvml
Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3
Name=mps Count=1300   File=/dev/nvidia0
Name=mps Count=1200   File=/dev/nvidia1
Name=mps Count=1100   File=/dev/nvidia2
Name=mps Count=1000   File=/dev/nvidia3

And lower on the page, I'm not sure what "to a job of step" means:
The percentage will be calculated based upon the portion of the configured
Count on the Gres is allocated to a job of step. For example, a job
requesting "--gres=gpu:200" and using configuration example 2 above would
be allocated
15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or
16% of the gtx1070 (File=/dev/nvidia0, 200 x 100 / 1200 = 16), or
18% of the gtx1060 (File=/dev/nvidia0, 200 x 100 / 1100 = 18), or
20% of the gtx1050 (File=/dev/nvidia0, 200 x 100 / 1000 = 20).

How were the count values of 1300, 1200, 1100 and 1000 determined?

Now segueing to TensorFlow 2 and PyTorch memory greediness.

Using the same "Deep Convolutional Generative Adversarial Networks" sample
script, in my sbatch file I added:
#SBATCH --gres=mps:35
echo here is value of TF_FORCE_GPU_ALLOW_GROWTH $TF_FORCE_GPU_ALLOW_GROWTH
echo here is the CUDA-MPS-ActiveThread-Percentage $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE

So the job log file showed this:
here is value of TF_FORCE_GPU_ALLOW_GROWTH true
here is the CUDA-MPS-ActiveThread-Percentage 17

So that 17 is half of the 35 I see with the MPS option. The description
from the SchedMD page reads:
"The percentage will be calculated based upon the portion of the configured
Count on the Gres is allocated to a job of step."

So how does Count=400 from the gres.conf file factor in? Does it mean the
job is using 17% of the available threads of the GPU? From nvidia-smi on
this Slurm job:
+-+
| Processes:                                                      GPU Memory |
|  GPU   PID   Type   Process name                                Usage      |
|=|
|    0   59793    C   python3.6                                    1135MiB   |

The GPU has 32 GB:

|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |                    0 |
| N/A   49C    P0   128W / 250W |   3417MiB / 32510MiB |      96%     Default |

So MPS and the Count option do not help with GPU memory. So I'm trying to
find ways to tell our users how to avoid the OOMs. The most common advice
is to use smaller batches, but the complaint we get is that doing so really
slows down their jobs.


So I just found the "2 Physical GPUs, 2 Logical GPUs" section of the
TensorFlow 2 docs, which works by setting a hard limit, in this case 2048
MB, by adding the below code after "import tensorflow as tf":


gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 2048 MB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)


I know this is outside the scope of Slurm, but I was hoping someone had a
more graceful way than a hard memory limit to achieve this. The first
option mentioned in the TF docs states: The first option is t

[slurm-users] configure Slurm when disk quota exceeded

2020-08-04 Thread Robert Kudyba
Is there a way for Slurm to detect when a user quota has been exceeded? We
use XFS, and when users are over quota they get a "Disk quota exceeded"
message, e.g., when trying to scp or create a new file. However, if they
are not aware of this and submit an sbatch job, they don't receive any
notification, and the Slurm logs (which they don't have access to) simply
say this:

[2020-08-04T09:12:35.001] _slurm_rpc_submit_batch_job: JobId=5495
InitPrio=4294900561 usec=1964
[2020-08-04T09:12:35.567] email msg to u...@myschool.edu: Slurm Job_id=5495
Name=M2_alltrans Began, Queued time 00:00:01
[2020-08-04T09:12:35.567] backfill: Started JobId=5495 in defq on node001
[2020-08-04T09:12:35.824] prolog_running_decr: Configuration for JobId=5495
is complete
[2020-08-04T09:12:35.916] _job_complete: JobId=5495 WEXITSTATUS 1
[2020-08-04T09:12:35.916] email msg to u...@myschool.edu : Slurm
Job_id=5495 Name=M2_alltrans Failed, Run time 00:00:00, FAILED, ExitCode 1
[2020-08-04T09:12:35.916] _job_complete: JobId=5495 done
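
In case it helps frame an answer: what I'm imagining is something like a
TaskProlog that at least surfaces a warning in the user's own job output
(purely an untested sketch; the script path is made up):

#!/bin/bash
# hypothetical /etc/slurm/taskprolog.sh, wired up with TaskProlog= in slurm.conf;
# it runs as the job's user, and any stdout line starting with "print " is
# prepended to the job's output file, which the user can actually read
probe="$HOME/.slurm_quota_probe.$SLURM_JOB_ID"
if ! touch "$probe" 2>/dev/null; then
  echo "print WARNING: could not write to $HOME -- disk quota exceeded?"
else
  rm -f "$probe"
fi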


[slurm-users] TensorRT script runs with srun but not from a sbatch file

2020-04-29 Thread Robert Kudyba
I'm using this TensorRT tutorial with MPS on Slurm 20.02 on Bright Cluster 8.2.

Here are the contents of my mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm  openmpi/cuda/64 cm-ml-python3deps/3.2.3
 cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3
tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb
keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2

When run in Slurm I get the errors below, so perhaps there is a pathing
issue that does not come up when I run it with srun alone:
Could not find movielens_ratings.txt in data directories:
data/samples/movielens/
data/movielens/
 FAILED

I’m trying to use srun to test this but it always fails as it appears to be
trying all nodes. We only have 3 compute nodes. As I’m writing this node002
 and node003 are in use by other users so I just want to use node001.

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2 |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |0 |
| N/A   67CP0   241W / 250W |  32167MiB / 32510MiB |100%   E. Process |
+---+--+--+

+-+
| Processes:   GPU Memory |
|  GPU   PID   Type   Process name Usage  |
|=|
|0428996  C   python3.6  32151MiB |
+-+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10]  FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:54
/cm/local/apps/cuda-

When node002 is available the program runs correctly, albeit with an error
about the log file failing to write:

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
 --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2
  |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr.
ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute
M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |
 0 |
| N/A   28CP025W / 250W | 41MiB / 32510MiB |  0%   E.
Process |
+---+--+--+

+-+
| Processes:   GPU
Memory |
|  GPU   PID   Type   Process name Usage
   |
|===

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
On Thu, Apr 23, 2020 at 1:43 PM Michael Robbert  wrote:

> It looks like you have hyper-threading turned on, but haven’t defined the
> ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS
> or changed the definition of ThreadsPerCore in slurm.conf.
>

Nice find. node003 has hyperthreading enabled but node001 and node002 do
not:
[root@node001 ~]# dmidecode -t processor | grep -E '(Core Count|Thread
Count)'
Core Count: 12
Thread Count: 12
Core Count: 12
Thread Count: 12

[root@node003 ~]# dmidecode -t processor | grep -E '(Core Count|Thread
Count)'
Core Count: 12
Thread Count: 24
Core Count: 12
I found a great mini script to disable hyperthreading without a reboot. I
did get the following warning, but I don't think it's a big issue:
 WARNING, didn't collect load info for all cpus, balancing is broken

Do I have to restart slurmctld on the head node and/or slurmd on node003?

Side question, are there ways with Slurm to test if hyperthreading improves
performance and job speed?
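
One crude comparison I can think of (a sketch; it assumes task affinity is
configured so --hint is honored, and same_job.sbatch is whatever benchmark
the user already runs):

# run the same workload with and without hardware threads, then compare times
sbatch --hint=multithread   same_job.sbatch
sbatch --hint=nomultithread same_job.sbatch
# substitute the two job IDs reported by sbatch
sacct -j <jobid1>,<jobid2> --format=JobID,Elapsed,TotalCPU,MaxRSS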



[slurm-users] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
Running Slurm 20.02 on Centos 7.7 on Bright Cluster 8.2. slurm.conf is on
the head node. I don't see these errors on the other 2 nodes. After
restarting slurmd on node003 I see this:

slurmd[400766]: error: Node configuration differs from hardware:
CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)
CoresPerSocket=12:12(hw) ThreadsPerCore=1:2(hw)
Apr 23 10:05:49 node003 slurmd[400766]: Message aggregation disabled
Apr 23 10:05:49 node003 slurmd[400766]: CPU frequency setting not
configured for this node
Apr 23 10:05:49 node003 slurmd[400770]: CPUs=24 Boards=1 Sockets=2 Cores=12
Threads=1 Memory=191880 TmpDisk=2038 Uptime=2488268 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)

From slurm.conf:
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=191800 Sockets=2
Gres=gpu:v100:1
# Partitions
$O Hidden=NO OverSubscribe=FORCE:12 GraceTime=0 PreemptMode=OFF ReqResv=NO
AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=N$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidde$
# Generic resources types
GresTypes=gpu,mic
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):48
On-line CPU(s) list:   0-47
Thread(s) per core:2
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 85
Model name:Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
Stepping:  4
CPU MHz:   2600.000
BogoMIPS:  5200.00
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  1024K
L3 cache:  19712K
NUMA node0 CPU(s):
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

cat /etc/slurm/cgroup.conf| grep -v '#'
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MinKmemSpace=30
MaxKmemPercent=100
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

What else can I check?
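
For reference, my guess is the node definition needs to spell out
ThreadsPerCore to match the hardware, e.g. something like this if node003
keeps hyperthreading on (a sketch using our current values):

NodeName=node[001-002] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191800 Gres=gpu:v100:1
NodeName=node003       Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191800 Gres=gpu:v100:1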


[slurm-users] srun always uses node002 even using --nodelist=node001

2020-04-16 Thread Robert Kudyba
I'm using this TensorRT tutorial

with MPS on Slurm 20.02 on Bright Cluster 8.2

I’m trying to use srun to test this but it always fails as it appears to be
trying all nodes. We only have 3 compute nodes. As I’m writing this node002
 and node003 are in use by other users so I just want to use node001.

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2 |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |0 |
| N/A   67CP0   241W / 250W |  32167MiB / 32510MiB |100%   E. Process |
+---+--+--+

+-+
| Processes:   GPU Memory |
|  GPU   PID   Type   Process name Usage  |
|=|
|0428996  C   python3.6  32151MiB |
+-+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10]  FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:54
/cm/local/apps/cuda-


When node002 is available the program runs correctly, albeit with an error
on the log file:

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
 --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2
  |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr.
ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute
M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |
 0 |
| N/A   28CP025W / 250W | 41MiB / 32510MiB |  0%   E.
Process |
+---+--+--+

+-+
| Processes:   GPU
Memory |
|  GPU   PID   Type   Process name Usage
   |
|=|
|0420596  C   nvidia-cuda-mps-server
 29MiB |
+-+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
>
> > use yum install slurm20, here they show Slurm 19 but it's the same for 20
>
> In that case you'll need to open a bug with Bright to get them to
> rebuild Slurm with nvml support.


They told me they don't officially support MPS nor Slurm and to come here
to get support (or pay SchedMD).

The vicious cycle continues.

Since all I want it MPS enabled from
https://slurm.schedmd.com/gres.html#MPS_config_example_2
"CUDA Multi-Process Service (MPS) provides a mechanism where GPUs can be
shared by multiple jobs, where each job is allocated some percentage of the
GPU's resources. The total count of MPS resources available on a node
should be configured in the slurm.conf file (e.g. "NodeName=tux[1-16]
Gres=gpu:2,mps:200"). Several options are available for configuring MPS in
the gres.conf file as listed below with examples following that:

No MPS configuration: The count of gres/mps elements defined in the
slurm.conf will be evenly distributed across all GPUs configured on the
node. For the example, "NodeName=tux[1-16] Gres=gpu:2,mps:200" will
configure a count of 100 gres/mps resources on each of the two GPUs."

Do I even need to edit gres.conf? Can I just leave out AutoDetect=nvml?
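
In other words, would a fully spelled-out gres.conf like this be enough on
its own (a guess based on the doc's example 1 and our single V100 per
node)?

# gres.conf on each node, with no AutoDetect line at all
Name=gpu Type=v100 File=/dev/nvidia0
Name=mps Count=400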


Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
>
> > and the NVIDIA Management Library (NVML) is installed on the node and
>> > was found during Slurm configuration
>>
>> That's the key phrase - when whoever compiled Slurm ran ./configure
>> *before* compilation it was on a system without the nvidia libraries and
>> headers present, so Slurm could not compile that support in.
>>
>> You'll need to redo the build on a system with the nvidia libraries and
>> headers in order for this to work.
>
>
As I wrote, we use Bright Cluster on CentOS 7.7, so we just follow their
instructions to use yum install slurm20; here they show Slurm 19 but it's
the same for 20:
Example
[root@bright82 ~]# rpm -qa | grep slurm | xargs -p rpm -e
[root@bright82 ~]# rpm -qa -r /cm/images/default-image |grep slurm |xargs
-p rpm -r /cm/images/default-image -e
[root@bright82 ~]# yum install slurm19-client slurm19-slurmdbd
slurm19-perlapi slurm19-contribs slurm19
[root@bright82 ~]# yum install --installroot=/cm/images/default-image
slurm19-client
If either slurm or slurm19 is installed, then the administrator can run
wlm-setup using the workload manager name slurm—that is without the 19
suffix–to set up Slurm. The roles at node level, or
category level—slurmserver and slurmclient—work with either Slurm version.
Configuring Slurm
After package setup is done with wlm-setup (section 7.3), Slurm software
components are installed in /cm/shared/apps/slurm/current.
Slurm clients and servers can be configured to some extent via role
assignment (sections 7.4.1 and 7.4.2). Using cmsh, advanced option
parameters can be set under the slurmclient role:
For example, the number of cores per socket can be set:
Example
[bright82->category[default]->roles[slurmclient]]% set corespersocket 2
[bright82->category*[default*]->roles*[slurmclient*]]% commit
In order to configure generic resources, the genericresources mode can be
used to set a list of objects. Each object then represents one generic
resource available on nodes. Each value of name in genericresources must
already be defined in the list of GresTypes. The list of GresTypes is
defined in the slurmserver role. Several generic resources entries can have
the same value for name (for example gpu), but must have a unique alias.
The alias is a string that is used to manage the resource entry in cmsh or
in Bright View. The string is enclosed in square brackets in cmsh, and is
used instead of the name for the object. The alias does not affect Slurm
configuration.

For example, to add two GPUs for all the nodes in the default category
which are of type k20xm, and to assign them to different CPU cores, the
following cmsh commands can be run:
Example
[bright82]% category use default
[bright82->category[default]]% roles
[bright82->category[default]->roles]% use slurmclient
[...[slurmclient]]% genericresources
[...[slurmclient]->genericresources]% add gpu0
[...[slurmclient*]->genericresources*[gpu0*]]% set name gpu
[...[slurmclient*]->genericresources*[gpu0*]]% set file /dev/nvidia0
[...[slurmclient*]->genericresources*[gpu0*]]% set cores 0-7
[...[slurmclient*]->genericresources*[gpu0*]]% set type k20xm
[...[slurmclient*]->genericresources*[gpu0*]]% add gpu1
[...[slurmclient*]->genericresources*[gpu1*]]% set name gpu
[...[slurmclient*]->genericresources*[gpu1*]]% set file /dev/nvidia1


Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
On Wed, Apr 8, 2020 at 9:34 AM  wrote:

> I believe in order to compile for nvml you'll have to compile on a system
> with an Nvidia gpu installed otherwise the Nvidia driver and libraries
> won't install on that system.
>

Yes our 3 compute nodes have 1 V100 each. So I can run:
ssh node001
Last login: Tue Apr  7 17:30:16 2020
# module load shared
# module load nccl2-cuda10.1-gcc/2.5.6
Loading nccl2-cuda10.1-gcc/2.5.6
  Loading requirement: gcc5/5.5.0 cuda10.1/toolkit/10.1.243
nvidia-smi
Wed Apr  8 10:00:49 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2
  |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr.
ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute
M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |
 0 |
| N/A   28CP025W / 250W |  0MiB / 32510MiB |  0%   E.
Process |
+---+--+--+

+-+
| Processes:   GPU
Memory |
|  GPU   PID   Type   Process name Usage
   |
|=|
|  No running processes found
  |
+-+


> From: slurm-users  On Behalf Of
> Christopher Samuel
> > How can I get this to work by loading the correct Bright module?
>
> You can't - you will need to recompile Slurm.
>
> The error says:
>
> Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to
> autodetect nvml functionality, but we weren't able to find that lib when
> Slurm was configured.
>
> So when Slurm was built the libraries you are telling it to use now were
> not detected and so the configure script disabled that functionality as it
> would not otherwise have been able to compile.
>

But it's clearly there as noted in my previous reply. From
https://slurm.schedmd.com/gres.html#MPS_Management

"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library
(NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any
system-detected NVIDIA GPU. This removes the need to explicitly configure
GPUs in gres.conf, though the Gres= line in slurm.conf is still required in
order to tell slurmctld how many GRES to expect."

So there isn't a way to have the configuration details "automatically
filled in for any system-detected NVIDIA GPU"?

Also the page says this:
"By default, all system-detected devices are added to the node. However, if
Type and File in gres.conf match a GPU on the system, any other properties
explicitly specified (e.g. Cores or Links) can be double-checked against
it. If the system-detected GPU differs from its matching GPU configuration,
then the GPU is omitted from the node with an error. This allows gres.conf
to serve as an optional sanity check and notifies administrators of any
unexpected changes in GPU properties."

How does "system-detected devices" work here? How can I get "Type and File
in gres.conf" to match a GPU on the system?


Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
On Wed, Apr 8, 2020 at 10:23 AM Eric Berquist  wrote:

> I just ran into this issue. Specifically, SLURM looks for the NVML header
> file, which comes with CUDA or DCGM, in addition to the library that comes
> with the drivers. The check is at
> https://github.com/SchedMD/slurm/blob/a763a008b7700321b51aad2e619deab00638a379/auxdir/x_ac_nvml.m4#L32
> .
> Once you’ve built SLURM, it’s enough to just have the GPU drivers on the
> nodes where SLURM will be installed.
>

So how do I get around the  "fatal: We were configured to autodetect nvml
functionality" error so we can use "AutoDetect=nvml"?

Chris Samuel (via lists.schedmd.com) wrote:
> Once you’ve built SLURM, it’s enough to just have the GPU drivers on the
> nodes where SLURM will be installed.

> Yeah I checked that at the Slurm User Group - slurmd will try and
> dlopen() the required libraries and should gracefully deal with them
> not being present.

How do I get it to "gracefully deal with them not being present"?
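
(For the rebuild route, this is what I plan to check on the build host
first; the search paths are guesses for our Bright layout:)

# the configure check needs both the NVML header and the driver library
find /cm/shared/apps /usr/local/cuda* -name nvml.h 2>/dev/null
ldconfig -p | grep libnvidia-ml
# if both show up, rebuild the Slurm packages on this host so that
# ./configure can compile NVML support in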


Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
> Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to
> autodetect nvml functionality, but we weren't able to find that lib when
> Slurm was configured.
>
>
>
> Apparently the Slurm build you are using has not be compiled against NVML
> and as such it cannot use the autodetect functionality.
>

Since we're using Bright Cluster we just have to load the CUDA toolkit for
NVML. I can run nvidia-smi:
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2
  |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr.
ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute
M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |
 0 |
| N/A   29CP037W / 250W |  0MiB / 32510MiB |  0%   E.
Process |
+---+--+--+
 We do have GresTypes=gpu,mic,mps and Gres=gpu:v100:1 set in slurm.conf.

At https://slurm.schedmd.com/gres.html I see:
"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library
(NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any
system-detected NVIDIA GPU. This removes the need to explicitly configure
GPUs in gres.conf, though the Gres= line in slurm.conf is still required in
order to tell slurmctld how many GRES to expect."

How can I get this to work by loading the correct Bright module?


Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
OK when restarting slurmd on the nodes I get these errors:

Apr 07 16:52:33 node001 systemd[1]: Starting Slurm node daemon...
Apr 07 16:52:33 node001 slurmd[299181]: Message aggregation disabled
Apr 07 16:52:33 node001 slurmd[299181]: WARNING: A line in gres.conf for
GRES mps has 400 more configured than expected in slurm.conf. Ignoring
extra GRES.
Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to
autodetect nvml functionality, but we weren't able to find that lib when
Slurm was configured.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service: control process exited,
code=exited status=1
Apr 07 16:52:33 node001 systemd[1]: Failed to start Slurm node daemon.
Apr 07 16:52:33 node001 systemd[1]: Unit slurmd.service entered failed
state.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service failed.

Apr 07 16:43:27 node002 slurmd[273406]: error: GresPlugins changed from
gpu,mic to gpu,mic,mps ignored
Apr 07 16:43:27 node002 slurmd[273406]: error: Restart the slurmctld daemon
to change GresPlugins
Apr 07 16:43:27 node002 slurmd[273406]: error: Ignoring gres.conf record,
invalid name: mps
Apr 07 16:44:06 node002 slurmd[273406]: error:
select_g_select_jobinfo_unpack: select plugin cons_tres not found
Apr 07 16:44:06 node002 slurmd[273406]: error:
select_g_select_jobinfo_unpack: unpack error
Apr 07 16:44:06 node002 slurmd[273406]: error: Malformed RPC of type
REQUEST_TERMINATE_JOB(6011) received
Apr 07 16:44:06 node002 slurmd[273406]: error:
slurm_receive_msg_and_forward: Header lengths are longer than data received
Apr 07 16:44:06 node002 slurmd[273406]: error: service_connection:
slurm_receive_msg: Header lengths are longer than dat...ceived

so that " WARNING: A line in gres.conf for GRES mps has 400" must come from
this entry in gres.conf:
NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
Name=mps Count=400
AutoDetect=nvml

Perhaps I'm misunderstanding the Count option?
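
If I'm reading that warning right, the Count=400 in gres.conf has no
matching mps count on the slurm.conf Gres= line, so I assume that line
would need to become something like the sketch below, based on our current
node definition (and, per the log above, the GresPlugins change also needs
a slurmctld restart):

NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2 Gres=gpu:v100:1,mps:400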

On Tue, Apr 7, 2020 at 4:34 PM Davide Vanzo 
wrote:

> Robert,
>
>
>
> That error is typically due to slurmd/slurmctld version mismatch or
> different configuration. I would not be surprised if you need to restart
> slurmd too after changing the SelectType configuration.
>
> Also, do not forget this warning from the documentation when it comes to
> modifying SelectType:
>
>
>
> *Changing this value can only be done by restarting the slurmctld daemon
> and will result in the loss of all job information (running and pending)
> since the job state save format used by each plugin is different.*
>
>
>
> --
>
> *Davide Vanzo, PhD*
>
> *Computer Scientist*
>
> BioHPC – Lyda Hill Dept. of Bioinformatics
>
> UT Southwestern Medical Center
>
>
>
> *From:* slurm-users  *On Behalf Of
> *Robert Kudyba
> *Sent:* Tuesday, April 7, 2020 3:26 PM
> *To:* Slurm User Community List 
> *Subject:* [slurm-users] Header lengths are longer than data received
> after changing SelectType & GresTypes to use MPS
>
> Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the
> following options to enable MPS:
> SelectType=select/cons_tres
> GresTypes=gpu,mic,mps
>
> I restarted slurmctld and ran scontrol reconfigure, however all jobs get
> the below error:
> [2020-04-07T15:29:00.741] debug:  backfill: no jobs to backfill
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3056
> Nodelist=node[001-002]
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3061
> Nodelist=node003
> [2020-04-07T15:29:03.051] debug:  sched: Running job scheduler
> [2020-04-07T15:29:03.063] agent/is_node_resp: node:node003
> RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node002
> RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node001
> RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
>
> Do any other options need changing? What causes these header length
> errors?
>


[slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the
following options to enable MPS:
SelectType=select/cons_tres
GresTypes=gpu,mic,mps

I restarted slurmctld and ran scontrol reconfigure, however all jobs get
the below error:
[2020-04-07T15:29:00.741] debug:  backfill: no jobs to backfill
[2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3056
Nodelist=node[001-002]
[2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3061
Nodelist=node003
[2020-04-07T15:29:03.051] debug:  sched: Running job scheduler
[2020-04-07T15:29:03.063] agent/is_node_resp: node:node003
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
[2020-04-07T15:29:03.071] agent/is_node_resp: node:node002
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
[2020-04-07T15:29:03.071] agent/is_node_resp: node:node001
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received

Do any other options need changing? What causes these header length errors?


[slurm-users] PyTorch with Slurm and MPS work-around --gres=gpu:1?

2020-04-03 Thread Robert Kudyba
Running Slurm 20.02 on Centos 7.7 with Bright Cluster 8.2. I'm wondering
how the below sbatch file is sharing a GPU.

MPS is running on the head node:
ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:27
/cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on SO here.

Here is the sbatch file contents:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee
alex_100_imwoof_seq_longtrain_cv_$1.txt

From nvidia-smi on the compute node:
Processes
Process ID  : 320467
Type: C
Name: python3.6
Used GPU Memory : 2369 MiB
Process ID  : 320574
Type: C
Name: python3.6
Used GPU Memory : 2369 MiB

[node003 ~]# nvidia-smi -q -d compute

==NVSMI LOG==

Timestamp   : Fri Apr  3 15:27:49 2020
Driver Version  : 440.33.01
CUDA Version: 10.2

Attached GPUs   : 1
GPU :3B:00.0
Compute Mode: Default


[~]# nvidia-smi
Fri Apr  3 15:28:49 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2
  |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr.
ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute
M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |
 0 |
| N/A   42CP046W / 250W |   4750MiB / 32510MiB | 32%
 Default |
+---+--+--+

+-+
| Processes:   GPU
Memory |
|  GPU   PID   Type   Process name Usage
   |
|=|
|0320467  C   python3.6
2369MiB |
|0320574  C   python3.6
2369MiB |
+-+

From htop:
320574 ouruser 20   0 12.2G 1538M  412M R 502.  0.8 14h45:59 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20   0 12.2G 1555M  412M D 390.  0.8 14h45:13 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:53 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20   0 12.2G 1538M  412M R 55.8  0.8  3h00:53 python3.6
SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and NOT locking a GPU since the
user omitted --gres=gpu:1? How can I tell if MPS is really working?
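
The only checks I know of for whether MPS is actually in the path are these
(a sketch; the commands come from the NVIDIA MPS docs, nothing
Slurm-specific):

# ask the MPS control daemon what it knows about
echo get_server_list | nvidia-cuda-mps-control
# with MPS active, nvidia-smi should list an nvidia-cuda-mps-server process,
# and client processes typically show up with Type M+C instead of plain C
nvidia-smi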


[slurm-users] Fwd: gres/gpu: count changed for node node002 from 0 to 1

2020-03-14 Thread Robert Kudyba
ce)

# zero the parameter gradients
optimizer.zero_grad()

# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

running_loss += loss.item()

if epoch % 10 == 9:
print('[%d, %5d] loss: %.3f' %
(epoch + 1, i + 1, running_loss / 100))

allAccs = []
for blurType in blurTypes: # multiple types of blur
print(blurType)
print('-' * 10)
# for block in range(5):
block = int(block_call)
print("\nFOLD " + str(block+1) + ":")
for i in range(5):
if i == 0:
blurLevels = [23, 11, 5, 3, 1]
elif i == 1:
blurLevels = [11, 5, 3, 1]
elif i == 2:
blurLevels = [5, 3, 1]
elif i == 3:
blurLevels = [3, 1]
elif i == 4:
blurLevels = [1]

if modelType == 'vgg16':
net = torchvision.models.vgg16(pretrained=False)
num_ftrs = net.classifier[6].in_features
net.classifier[6] = nn.Linear(num_ftrs,
len(classes))
elif modelType == 'alexnet':
net = torchvision.models.alexnet(pretrained=False)
num_ftrs = net.classifier[6].in_features
net.classifier[6] = nn.Linear(num_ftrs,
len(classes))
else:
net =
torchvision.models.squeezenet1_1(pretrained=False)
net.classifier[1] = nn.Conv2d(512, len(classes),
kernel_size=(1, 1), stride=(1, 1))
net.num_classes = len(classes)
optimizer = optim.SGD(net.parameters(), lr=0.001,
momentum=0.9)
net = net.to(device)
for i in range(len(blurLevels)): #5 levels of blur: 1, 3,
5, 11, 23
mult = blurLevels[i]

trainloader, validloader =
get_train_valid_loader(data_dir=data_dir + blurType + '/' + image_set +
'-320_' + str(mult) + '/train',

block=block,shuffle=False,num_workers=0,batch_size=128)
print('Start training on blur window of ' +
str(mult))
train()
print('Finished Training on ' + blurType + ' with
blur window of ' + str(mult))

accs = []
permBlurLevels = [23, 11, 5, 3, 1]
for j in range(len(permBlurLevels)):
tempMult = permBlurLevels[j]
correct = 0
total = 0
# newTestSet =
torchvision.datasets.ImageFolder(root=data_dir + blurType + '/' + image_set
+ '-320_' +
#   str(tempMult) + '/val',
#   transform=transform)
# newTestLoader =
torch.utils.data.DataLoader(newTestSet, batch_size=128,
#   shuffle=True, num_workers=0)
t2, validloader2 =
get_train_valid_loader(data_dir=data_dir + blurType + '/' + image_set +
'-320_' + str(mult) + '/train',

block=block,shuffle=False,num_workers=0,batch_size=128)

with torch.no_grad():
for data in validloader2:
images, labels = data
images = images.to(device)
labels = labels.to(device)
outputs = net(images)
_, predicted =
torch.max(outputs.data, 1)
total += labels.size(0)
        correct += (predicted ==
labels).sum().item()
acc = 100 * correct / total
print('Accuracy: %f %%' % (acc))
accs.append(acc)
allAccs.append(accs)


-- Forwarded message -
From: Robert Kudyba 
Date: Fri, Mar 13, 2020 at 11:36 AM
Subject: gres/gpu: count changed for node node002 from 0 to 1
To: Slurm User Community List 


We're running slurm-17.11.12 on Bright Cluster 8.1 and our node002 keeps
going into a draining state:
 sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*up   infinite  1   drng node002

info -N -o "%.20

[slurm-users] gres/gpu: count changed for node node002 from 0 to 1

2020-03-13 Thread Robert Kudyba
We're running slurm-17.11.12 on Bright Cluster 8.1 and our node002 keeps
going into a draining state:
 sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*up   infinite  1   drng node002

sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
NODELIST   CPUS(A/I/O/T)  STATE MEMORY   PARTITION
   GRES  REASON
 node001   9/15/0/24mix 191800   defq*
  gpu:1none
 node002   1/0/23/24   drng 191800   defq*
  gpu:1 gres/gpu count changed and jobs are
 node003   1/23/0/24mix 191800   defq*
  gpu:1none

None of the nodes have a separate slurm.conf file; it's all shared from the
head node. What else could be causing this?

[2020-03-13T07:14:28.590] gres/gpu: count changed for node node002 from 0 to
1
[2020-03-13T07:14:28.590] error: _slurm_rpc_node_registration
node=node002: Invalid
argument
[2020-03-13T07:14:28.590] error: Node node001 appears to have a
different slurm.conf
than the slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the same. If
this is expected ignore, and set  DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-03-13T07:14:28.590] error: Node node003 appears to have a
different slurm.conf
than the slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the same. If
this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node001 appears to have a
different slurm.conf
than the slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the same. If
this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node003 appears to have a
different slurm.conf
than the slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the same. If
this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.788] gres/gpu: count changed for node node002 from 0 to
1
[2020-03-13T07:47:48.788] error: _slurm_rpc_node_registration node=node002:
Invalid argument [2020-03-13T08:21:08.057] error: Node node001 appears to
have a different slurm.conf than the slurmctld. This could cause issues
with communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] error: Node node003 appears to have a
different slurm.conf
than the slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the same. If
this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] gres/gpu: count changed for node node002 from 0 to
1
[2020-03-13T08:21:08.058] error: _slurm_rpc_node_registration
node=node002: Invalid
argument


Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
>
> If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU
> nodes are over-provisioned in terms of both RAM and CPU, we end up using
> the excess resources for non-GPU jobs.
>

No it's GPU RAM


> If that 32 GB is GPU RAM, then I have no experience with that, but I
> suspect MPS would be required.


OK so does SLURM support MPS and if so what version? Would we need to
enable cons_tres and use, e.g., --mem-per-gpu?


On Thu, Feb 27, 2020 at 12:46 PM Renfro, Michael  wrote:

> If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU
> nodes are over-provisioned in terms of both RAM and CPU, we end up using
> the excess resources for non-GPU jobs.
>
> If that 32 GB is GPU RAM, then I have no experience with that, but I
> suspect MPS would be required.
>
> > On Feb 27, 2020, at 11:14 AM, Robert Kudyba  wrote:
> >
> > So looking at the new cons_tres option at
> > https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf, would
> > we be able to use, e.g., --mem-per-gpu= Memory per allocated GPU, and if
> > a user allocated --mem-per-gpu=8, and the V100 we have is 32 GB, will
> > subsequent jobs be able to use the remaining 24 GB?
>
>
>


Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
We figured out the issue.

All of our jobs are requesting 1 GPU. Each node only has 1 GPU. Thus, the
jobs that are pending are pending based on:, resources - meaning "no
resources are available for these jobs", meaning "I want a GPU, but there
are no GPUs that I can use until a job on a node finishes".

So looking at the new cons_tres option at
https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf, would we
be able to use, e.g., --mem-per-gpu= Memory per allocated GPU, and it a
user allocated --mem-per-gpu=8, and the V100 we have is 32 GB, will
subsequent jobs be able to use the remaining 24 GB?

Would Slurm be able to use multi-process service (MPS):
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
if we had it enabled? I'm also trying to see if MPS would work with
TensorFlow and finding mixed results.

Thanks for your reply, Ahmet.

We'd consider SchedMD paid support but their minimum is $10K and 250
nodes... a bit higher than our 4 nodes.



On Thu, Feb 27, 2020 at 3:53 AM mercan  wrote:

> Hi;
>
> At your partition definition, there is "Shared=NO". This is means "do
> not share nodes between jobs". This parameter conflict with
> "OverSubscribe=FORCE:12 " parameter. Acording to the slurm
> documentation, the Shared parameter has been replaced by the
> OverSubscribe parameter. But, I suppose it still works.
>
> Regards,
>
> Ahmet M.
>
>
> On 26.02.2020 22:56, Robert Kudyba wrote:
> > We run Bright 8.1 and Slurm 17.11. We are trying to allow for multiple
> > concurrent jobs to run on our small 4 node cluster.
> >
> > Based on
> > https://community.brightcomputing.com/question/5d6614ba08e8e81e885f1991?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%2526%252334%253Bgang+scheduling%2526%252334%253B
> > and
> > https://slurm.schedmd.com/cons_res_share.html
> >
> > Here are some settings in /etc/slurm/slurm.conf:
> >
> > SchedulerType=sched/backfill
> > # Nodes
> > NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2
> > Gres=gpu:1
> > # Partitions
> > PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> > PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> > Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO
> > AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO
> > OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP Nodes=node[001-003]
> > PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> > PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> > Hidden=NO Shared=NO GraceTime= 0 PreemptMode=OFF ReqResv=NO
> > AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO
> > OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
> > # Generic resources types
> > GresTypes=gpu,mic
> > # Epilog/Prolog parameters
> > PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
> > Prolog=/cm/local/apps/cmd/scripts/prolog
> > Epilog=/cm/local/apps/cmd/scripts/epilog
> > # Fast Schedule option
> > FastSchedule=1
> > # Power Saving
> > SuspendTime=-1 # this disables power saving
> > SuspendTimeout=30
> > ResumeTimeout=60
> > SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
> > ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
> > # END AUTOGENERATED SECTION -- DO NOT REMOVE
> > #
> >
> > http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%26%2334%3Bgang+scheduling%26%2334%3B
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_CPU
> > SchedulerTimeSlice=60
> > EnforcePartLimits=YES
> >
> > But it appears each job takes 1 of the 3 nodes and all other jobs are
> > back scheduled. Do we have an incorrect option set?
> >
> > squeue -a
> > JOBID PARTITION

[slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-26 Thread Robert Kudyba
We run Bright 8.1 and Slurm 17.11. We are trying to allow for multiple
concurrent jobs to run on our small 4 node cluster.

Based on
https://community.brightcomputing.com/question/5d6614ba08e8e81e885f1991?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%2526%252334%253Bgang+scheduling%2526%252334%253B
and
https://slurm.schedmd.com/cons_res_share.html

Here are some settings in /etc/slurm/slurm.conf:

SchedulerType=sched/backfill
# Nodes
NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL
AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0
State=UP Nodes=node[001-003]
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime= 0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL
AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0
State=UP
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=1
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE
#
http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%26%2334%3Bgang+scheduling%26%2334%3B
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

But it appears each job takes 1 of the 3 nodes and all other jobs are left
pending. Do we have an incorrect option set?

squeue -a
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1937 defq PaNet5 user1 PD 0:00 1 (Resources)
1938 defq PoNet5 user1 PD 0:00 1 (Priority)
1964 defq SENet5 user1 PD 0:00 1 (Priority)
1979 defq IcNet5 user1 PD 0:00 1 (Priority)
1980 defq runtrain user2 PD 0:00 1 (Priority)
1981 defq InRes5  user1   PD 0:00 1 (Priority)
1983 defq run_LSTM user3 PD 0:00 1 (Priority)
1984 defq run_hui. user4 PD 0:00 1 (Priority)
1936 defq SeRes5  user1   R 10:02:39 1 node003
1950 defq sequenti  user5  R 1-02:03:00 1 node001
1978 defq run_hui. user16 R 13:48:21 1 node002

Am I misunderstanding some of the settings?


Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Robert Kudyba
I suppose I can ask Bright Computing but does anyone know what version of
Bright is needed? I would guess 8.2 or 9.0. Definitely want to dive into
this.


Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
This is still happening. Nodes are being drained after a kill task failed.
Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?

[2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill
task failed
[2020-02-11T12:21:26.006] update_node: node node001 state set to DRAINING
[2020-02-11T12:21:26.006] got (nil)
[2020-02-11T12:21:26.015] error: slurmd error running JobId=1514 on
node(s)=node001: Kill task failed
[2020-02-11T12:21:26.015] _job_complete: JobID=1514 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2020-02-11T12:21:26.015] email msg to sli...@fordham.edu: SLURM
Job_id=1514 Name=run.sh Failed, Run time 00:02:21, NODE_FAIL, ExitCode 0
[2020-02-11T12:21:26.016] _job_complete: requeue JobID=1514 State=0x8000
NodeCnt=1 per user/system request
[2020-02-11T12:21:26.016] _job_complete: JobID=1514 State=0x8000 NodeCnt=1
done
[2020-02-11T12:21:26.057] Requeuing JobID=1514 State=0x0 NodeCnt=0
[2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x8003 NodeCnt=1
done
[2020-02-11T12:21:52.111] _job_complete: JobID=1512 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:52.112] _job_complete: JobID=1512 State=0x8003 NodeCnt=1
done
[2020-02-11T12:21:52.214] sched: Allocate JobID=1516 NodeList=node002
#CPUs=1 Partition=defq
[2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x1 NodeCnt=1
WEXITSTATUS 0
[2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x8003 NodeCnt=1
done

On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba  wrote:

> Usually means you updated the slurm.conf but have not done "scontrol
>> reconfigure" yet.
>>
> Well it turns out it was something else related to a Bright Computing
> setting. In case anyone finds this thread in the future:
>
[ourcluster->category[gpucategory]->roles]% use slurmclient
> [ourcluster->category[gpucategory]->roles[slurmclient]]% show
> ...
> RealMemory 196489092
> ...
> [ ciscluster->category[gpucategory]->roles[slurmclient]]%
>
> Values are specified in MB and this line is saying that our node has 196TB
> of RAM.
>
> I set this using cmsh:
>
> # cmsh
> % category
> % use gpucategory
> % roles
> % use slurmclient
> % set realmemory 191846
> % commit
>
> The value in /etc/slurm/slurm.conf was conflicting with this especially
> when restarting slurmctld.
>
> On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>>
>> We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11.12.
>>
>> We're getting the below errors when I restart the slurmctld service. The
>> file appears to be the same on the head node and compute nodes:
>> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
>> /etc/slurm/slurm.conf
>>
>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>
>> So what else could be causing this?
>> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
>> [2020-02-10T10:31:12.009] error: Node node001 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make  sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
>> (191846 < 196489092)
>> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
>> node=node001: Invalid argument
>> [2020-02-10T10:31:12.011] error: Node node002 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
>> (191840 < 196489092)
>> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
>> node=node002: Invalid argument
>> [2020-02-10T10:31:12.047] error: Node node003 appears to have a different
>> slurm.conf than the slurmctld.  This could cause issues with communication
>> and functionality.  Please review both files and make sure they are the
>> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
>> slurm.conf.
>> [2020-02-10T10:31:1

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
>
> Usually means you updated the slurm.conf but have not done "scontrol
> reconfigure" yet.
>
Well it turns out it was something else related to a Bright Computing
setting. In case anyone finds this thread in the future:

[ourcluster->category[gpucategory]->roles]% use slurmclient
[ourcluster->category[gpucategory]->roles[slurmclient]]% show
...
RealMemory 196489092
...
[ciscluster->category[gpucategory]->roles[slurmclient]]%

Values are specified in MB, so this line says our node has ~196 TB of RAM.

I set this using cmsh:

# cmsh
% category
% use gpucategory
% roles
% use slurmclient
% set realmemory 191846
% commit

The value in /etc/slurm/slurm.conf was conflicting with this especially
when restarting slurmctld.
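
For anyone finding this later, a sanity check I should have run first
(assuming the slurm client bits are in the PATH on the compute node, e.g.
after "module load slurm") is to ask slurmd what it actually detects:

slurmd -C
# prints something along these lines (illustrative, not our exact output):
# NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191846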

On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>
> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>
> We're getting the below errors when I restart the slurmctld service. The
> file appears to be the same on the head node and compute nodes:
> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
> /etc/slurm/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> So what else could be causing this?
> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:31:12.009] error: Node node001 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make  sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-02-10T10:31:12.011] error: Node node002 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-02-10T10:31:12.047] error: Node node003 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
> [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
> [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
> [2020-02-10T10:32:08.026]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
> [2020-02-10T10:56:08.992] layouts: no layout to initialize
> [2020-02-10T10:56:08.992] restoring original state of nodes
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not
> specified
> [2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
> [2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed
> usec=4369
> [2020-02-10T10:56:11.253]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
> [2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
> [2020-02-10T10:56:18.645] got (nil)
> [2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
> [2020-02-10T10:56:18.679] got (nil)
> [2020-02-10T10:56:18.693

[slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Robert Kudyba
We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.

We're getting the below errors when I restart the slurmctld service. The
file appears to be the same on the head node and compute nodes:
[root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf

-rw-r--r-- 1 root root 3477 Feb 10 11:05
/cm/shared/apps/slurm/var/etc/slurm.conf

[root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
/etc/slurm/slurm.conf

-rw-r--r-- 1 root root 3477 Feb 10 11:05
/cm/shared/apps/slurm/var/etc/slurm.conf

lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf
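
A quick way to double-check the nodes really see the same file (just a sketch
using pdsh, which we already have on this cluster):

md5sum /cm/shared/apps/slurm/var/etc/slurm.conf            # on the head node
pdsh -w node00[1-3] md5sum /cm/shared/apps/slurm/var/etc/slurm.conf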

So what else could be causing this?
[2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:31:12.009] error: Node node001 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make  sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
(191846 < 196489092)
[2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration node=node001:
Invalid argument
[2020-02-10T10:31:12.011] error: Node node002 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
(191840 < 196489092)
[2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration node=node002:
Invalid argument
[2020-02-10T10:31:12.047] error: Node node003 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
[2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
[2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration node=node003:
Invalid argument
[2020-02-10T10:32:08.026]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2020-02-10T10:56:08.992] layouts: no layout to initialize
[2020-02-10T10:56:08.992] restoring original state of nodes
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not specified
[2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
[2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed
usec=4369
[2020-02-10T10:56:11.253]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
[2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
[2020-02-10T10:56:18.645] got (nil)
[2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
[2020-02-10T10:56:18.679] got (nil)
[2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
[2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
[2020-02-10T10:56:18.693] got (nil)
[2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
[2020-02-10T10:56:18.711] got (nil)

And I'm not sure if this is related, but we're getting this "Kill task failed"
and a node gets drained.

[2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on
node(s)=node001: Kill task failed
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu: SLURM
Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
[2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 State=0x8000
NodeCnt=1 per user/system request
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 NodeCnt=1
done
[2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
[2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
[2020-02-09T14:43:17.054] prolog_running_decr: Configuration for JobID=1466
is complete
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu:: SLURM
Job_id=1461 Name=run.sh Bega

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-21 Thread Robert Kudyba
>
>
> are you sure, your 24 core nodes have 187 TERABYTES memory?
>
> As you yourself cited:
>
> Size of real memory on the node in megabytes
>
> The settings in your slurm.conf:
>
> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
> Gres=gpu:1
>
> so, your machines should have 196489092 megabytes memory, that are ~191884
> gigabytes or ~187 terabytes
>

192 GB.

What was also throwing me off was this error:
error: _slurm_rpc_node_registration node=node003: Invalid argument

"Invalid" in this case appears to mean "too high".

It sees only 191840 megabytes, which is still less than the 191884. Since
> the available memory changes slightly from OS version to OS version, I
> would suggest to set RealMemory to less than 191840, e.g. 191800.
> But Brian already told you to reduce the RealMemory:
>
> I would suggest RealMemory=191879 , where I suspect you have
> RealMemory=196489092
>
>
Thanks Marcus and Brian, that was indeed the culprit.
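
So the node definition ends up looking something like this (191800 being the
rounded-down value Marcus suggested, not necessarily what we'll settle on):

NodeName=node[001-003] CoresPerSocket=12 Sockets=2 RealMemory=191800 Gres=gpu:1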


Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
We are on a Bright Cluster and their support says the head node controls
this. Here you can see the symlinks:

[root@node001 ~]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

[root@ourcluster myuser]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

 ls -l  /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster myuser]# ssh node001
Last login: Mon Jan 20 14:02:00 2020
[root@node001 ~]# ls -l  /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf

On Mon, Jan 20, 2020 at 1:52 PM Brian Andrus  wrote:

> Try using "nodename=node003" in the slurm.conf on your nodes.
>
> Also, make sure the slurm.conf on the nodes is the same as on the head.
>
> Somewhere in there, you have "node=node003" (as well as the other nodes
> names).
>
> That may even do it, as they may be trying to register generically, so
> their configs are not getting matched to the specific info in your main
> config
>
> Brian Andrus
>
>
> On 1/20/2020 10:37 AM, Robert Kudyba wrote:
>
> I've posted about this previously here
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_mMECjerUmFE_V1wK19fFAQAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=V4tz7Qab3oK28vrC090A6R6aFEaDXz7Czqr5y2eDUk0&e=>,
> and here
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_vVAyqm0wg3Y_2YoBq744AAAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=eEetgW964TvhYChxX27f_Bjz3tn5UlwUpVEVAZIdIKo&e=>
>  so
> I'm trying to get to the bottom of this once and for all and even got this
> comment
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_d_msg_slurm-2Dusers_vVAyqm0wg3Y_x9-2D-5FiQQaBwAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=5UB2Ohj42gVpQ0GXneP02dO3kpRATj5OvQ4nmNTWZd4&e=>
> previously:
>
> our problem here is that the configuration for the nodes in question have
>> an incorrect amount of memory set for them. Looks like you have it set in
>> bytes instead of megabytes
>> In your slurm.conf you should look at the RealMemory setting:
>> RealMemory
>> Size of real memory on the node in megabytes (e.g. "2048"). The default
>> value is 1.
>> I would suggest RealMemory=191879 , where I suspect you have
>> RealMemory=196489092
>
>
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
> Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 PreemptM$
>
> sinfo -N
> NODELIST   NODES PARTITION STATE
node001        1 defq*     drain
node002        1

[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
I've posted about this previously here and here, so I'm trying to get to the
bottom of this once and for all, and even got this comment previously:

our problem here is that the configuration for the nodes in question have
> an incorrect amount of memory set for them. Looks like you have it set in
> bytes instead of megabytes
> In your slurm.conf you should look at the RealMemory setting:
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default
> value is 1.
> I would suggest RealMemory=191879 , where I suspect you have
> RealMemory=196489092


Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
(191846 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003:
Invalid argument

Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1 defq*     drain
node002        1 defq*     drain
node003        1 defq*     drain


[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003:
Invalid argument

/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:1
node001: Core(s) per socket:12
node001: Socket(s): 2
node002: Thread(s) per core:1
node002: Core(s) per socket:12
node002: Socket(s): 2
node003: Thread(s) per core:2
node003: Core(s) per socket:12
node003: Socket(s): 2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type  Name Nodes
----- ---- ----------------
Slurm defq node001..node003
Slurm gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
   Partitions=defq
   BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2020-01-20T13:22:48]

sinfo -R
REASON   USER  TIMESTAMP   NODELIST
Low RealMemory   sl

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Robert Kudyba
I had set RealMemory to a really high number as I misinterpreted the
recommendation:
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1

But now I set it to:
RealMemory=191000

I restarted slurmctld. And according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a symlink
directly to the slurm.conf on the head node. This means that any changes
made to the file on the head node will automatically be available to the
compute nodes. All they would need in that case is to have slurmd restarted"
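
So presumably what's still needed is to restart slurmd on the nodes as well,
roughly like this (a sketch; it assumes the stock systemd unit names, and
Bright may well manage the daemons its own way):

systemctl restart slurmctld                          # on the head node
pdsh -w node00[1-3] systemctl restart slurmd         # on the compute nodes
scontrol update NodeName=node[001-003] State=RESUME  # clear leftover DRAIN state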

But now I see these errors:

mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449
InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449 NodeList=node[001-003]
#CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for JobID=449
is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3
WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 NodeCnt=3
done

Is this another option that needs to be set?

On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko  wrote:

> Sounds like maybe you didn't correctly roll out / update your slurm.conf
> everywhere as your RealMemory value is back to your large wrong number.
> You need to update your slurm.conf everywhere and restart all the slurm
> daemons.
>
> I recommend the "safe procedure" from here:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__wiki.fysik.dtu.dk_niflheim_SLURM-23add-2Dand-2Dremove-2Dnodes&d=DwMFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yUZtCS8lFs9N4Dm1nidebq1bpGa9QMJUap7ZWVR8NVg&s=Fq72zWoETitTA7ayJCyYkbp8E1fInntp4YeBv75o7vU&e=>
> Your Bright manual may have a similar process for updating SLURM config
> "the Bright way".
>
> On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba 
> wrote:
>
>> I thought I had taken care of this a while back but it appears the issue
>> has returned. A very simple sbatch script, slurmhello.sh:
>>  cat slurmhello.sh
>> #!/bin/sh
>> #SBATCH -o my.stdout
>> #SBATCH -N 3
>> #SBATCH --ntasks=16
>> module add shared openmpi/gcc/64/1.10.7 slurm
>> mpirun hello
>>
>> sbatch slurmhello.sh
>> Submitted batch job 419
>>
>> squeue
>>  JOBID PARTITION NAME USER ST   TIME  NODES
>> NODELIST(REASON)
>>419  defq slurmhel root PD   0:00  3
>> (Resources)
>>
>> In /etc/slurm/slurm.conf:
>> # Nodes
>> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
>> Gres=gpu:1
>>
>> Logs show:
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>> node=node001: Invalid argument
>> [2019-08-29T14:24:40.025] error: Node node002 has low real_memory size
>> (191840 < 196489092)
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>> node=node002: Invalid argument
>> [2019-08-29T14:24:40.026] error: Node node003 has low real_memory size
>> (191840 < 196489092)
>> [2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
>> node=node003: Invalid argument
>>
>> scontrol show jobid -dd 419
>>

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-29 Thread Robert Kudyba
I thought I had taken care of this a while back but it appears the issue has 
returned. A very simple sbatch script, slurmhello.sh:
 cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH -N 3
#SBATCH --ntasks=16
module add shared openmpi/gcc/64/1.10.7 slurm
mpirun hello

sbatch slurmhello.sh
Submitted batch job 419

squeue
 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
   419  defq slurmhel root PD   0:00  3 (Resources)

In /etc/slurm/slurm.conf:
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2 
Gres=gpu:1

Logs show:
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node001: 
Invalid argument
[2019-08-29T14:24:40.025] error: Node node002 has low real_memory size (191840 
< 196489092)
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node002: 
Invalid argument
[2019-08-29T14:24:40.026] error: Node node003 has low real_memory size (191840 
< 196489092)
[2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration node=node003: 
Invalid argument

scontrol show jobid -dd 419
JobId=419 JobName=slurmhello.sh
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=root QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-28T09:57:22
   Partition=defq AllocNode:Sid=ourcluster:194152
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/root/slurmhello.sh
   WorkDir=/root
   StdErr=/root/my.stdout
   StdIn=/dev/null
   StdOut=/root/my.stdout
   Power=

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-07-18T12:08:41 SlurmdStartTime=2019-07-18T12:09:44
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]

[root@ciscluster ~]# scontrol show nodes| grep -i mem
   RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
   RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
   RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]

sinfo -R
REASON   USER  TIMESTAMP   NODELIST
Low RealMemory   slurm 2019-07-18T10:17:24 node[001-003]

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1 defq*     drain
node002        1 defq*     drain
node003        1 defq*     drain

pdsh -w node00[1-3]  "lscpu | grep -iE 'socket|core'"
node002: Thread(s) per core:1
node002: Core(s) per socket:12
node002: Socket(s): 2
node001: Thread(s) per core:1
node001: Core(s) per socket:12
node001: Socket(s): 2
node003: Thread(s) per core:2
node003: Core(s) per socket:12
node003: Socket(s): 2

scontrol show nodes| grep -i mem
   RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
   RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
   RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory

Does anything look off?


[slurm-users] JobState=FAILED Reason=NonZeroExitCode Dependency=(null) ExitCode=1:0

2019-07-09 Thread Robert Kudyba
From this tutorial
https://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters
I
am trying to run the below and it always fails. I've made sure to run
'module load slurm'. What could be wrong? The slurmctld logs look OK:
[2019-07-09T10:19:44.183] prolog_running_decr: Configuration for JobID=402
is complete
[2019-07-09T10:19:44.266] _job_complete: JobID=402 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2019-07-09T10:19:44.266] _job_complete: JobID=402 State=0x8005 NodeCnt=1
done
[2019-07-09T10:21:31.934] _slurm_rpc_submit_batch_job: JobId=403
InitPrio=4294901690 usec=321

cat slurm-job.sh
#!/usr/bin/bash

#SBATCH -o slurm.sh.out
#SBATCH -p defq

echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this is a file" > analysis.output
sleep 60

scontrol show job 402
JobId=402 JobName=slurm-job.sh
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901691 Nice=0 Account=root QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:01 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2019-07-09T10:19:43 EligibleTime=2019-07-09T10:19:43
   StartTime=2019-07-09T10:19:43 EndTime=2019-07-09T10:19:44 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-07-09T10:19:43
   Partition=defq AllocNode:Sid=ciscluster:349904
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node001
   BatchHost=node001
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/root/testing/slurm-job.sh
   WorkDir=/root/testing
   StdErr=/root/testing/slurm.sh.out
   StdIn=/dev/null
   StdOut=/root/testing/slurm.sh.out
   Power=
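
Next things I plan to check (a sketch; sacct assumes accounting is actually
enabled on this cluster, which I haven't verified):

cat /root/testing/slurm.sh.out      # stdout/stderr of the failed job
sacct -j 402 --format=JobID,JobName,State,ExitCode,NodeList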


Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
Thanks Brian, indeed we did have it set in bytes. I set it to the MB value.
Hoping this takes care of the situation.
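
For the record, the MB value came from the node itself, roughly like this
(a sketch; slurmd -C would presumably give the same number):

free -m | awk '/^Mem:/ {print $2}'    # total memory in MB, e.g. 191879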

> On Jul 8, 2019, at 4:02 PM, Brian Andrus  wrote:
> 
> Your problem here is that the configuration for the nodes in question have an 
> incorrect amount of memory set for them. Looks like you have it set in bytes 
> instead of megabytes
> 
> In your slurm.conf you should look at the RealMemory setting:
> 
> 
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default value 
> is 1. 
> 
> I would suggest RealMemory=191879 , where I suspect you have 
> RealMemory=196489092
> 
> Brian Andrus
> On 7/8/2019 11:59 AM, Robert Kudyba wrote:
>> I’m new to Slurm and we have a 3 node + head node cluster running Centos 7 
>> and Bright Cluster 8.1. Their support sent me here as they say Slurm is 
>> configured optimally to allow multiple tasks to run. However at times a job 
>> will hold up new jobs. Are there any other logs I can look at and/or 
>> settings to change to prevent this or alert me when this is happening? Here 
>> are some tests and commands that I hope will illuminate where I may be going 
>> wrong. The slurm.conf file has these options set:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SchedulerTimeSlice=60
>> 
>> I also see /var/log/slurmctld is loaded with errors like these:
>> [2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: 
>> Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node002 has low real_memory size 
>> (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: 
>> Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node001 has low real_memory size 
>> (191883 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: 
>> Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node003 has low real_memory size 
>> (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: 
>> Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node002 has low real_memory size 
>> (191879 < 196489092)
>> [2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: 
>> Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node003 has low real_memory size 
>> (191879 < 196489092)
>> 
>> squeue
>> JOBID PARTITION NAME  USER  ST TIME NODES NODELIST(REASON)
>> 352   defq   TensorFl myuser PD 0:00 3 (Resources)
>> 
>>  scontrol show jobid -dd 352
>> JobId=352 JobName=TensorFlowGPUTest
>> UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
>> Priority=4294901741 Nice=0 Account=(null) QOS=normal
>> JobState=PENDING Reason=Resources Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> DerivedExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> LastSchedEval=2019-07-02T16:57:59
>> Partition=defq AllocNode:Sid=ourcluster:386851
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=3,node=3
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> Gres=gpu:1 Reservation=(null)
>> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>> Command=/home/myuser/cnn_gpu.sh
>> WorkDir=/home/myuser
>> StdErr=/home/myuser/slurm-352.out
>> StdIn=/dev/null
>> StdOut=/home/myuser/slurm-352.out
>> Power=
>> 
>> Another test showed the below:
>> sinfo -N
>> NODELIST   NODES PARTITION STATE
>> node001        1 defq*     drain
>> node002        1 defq*     drain
>> node003        1 defq*     drain
>> 
>> sinfo -R
>> REASON   USER  TIMESTAMP   NODELIST
>> Low RealMemory   slurm 2019-05-17T10:05:26 node[001-003]
>> 
>> 
>> [ciscluster]% jobqueue
>> [ciscluster->jobqueue(slurm)]% ls
>> Type  Name Nodes
>> ----- ---- ----------------
>> Slurm defq node001..node003
>> Slurm gpuq
>> [ourcluster->jobqueue(slurm)]% use defq
>> [ourcluster->jobqueue(slurm)->defq]% get options
>> QoS=N/A Ex

[slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
I’m new to Slurm and we have a 3 node + head node cluster running Centos 7 and 
Bright Cluster 8.1. Their support sent me here as they say Slurm is configured 
optimally to allow multiple tasks to run. However at times a job will hold up 
new jobs. Are there any other logs I can look at and/or settings to change to 
prevent this or alert me when this is happening? Here are some tests and 
commands that I hope will illuminate where I may be going wrong. The slurm.conf
file has these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
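
For reference, my understanding is that SchedulerTimeSlice only has an effect
when gang scheduling is enabled, which would mean companion settings roughly
like these (based on the Slurm gang scheduling docs, not necessarily what
Bright configured for us):

PreemptType=preempt/none
PreemptMode=GANG
SchedulerTimeSlice=60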

I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: 
Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 
< 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: 
Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 
< 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: 
Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 
< 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: 
Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 
< 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: 
Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 
< 196489092)

squeue
JOBID PARTITION NAME USER  ST TIME NODES NODELIST(REASON)
352   defq  TensorFl myuser PD 0:00 3 (Resources)

 scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=

Another test showed the below:
sinfo -N
NODELIST   NODES PARTITION STATE
node001        1 defq*     drain
node002        1 defq*     drain
node003        1 defq*     drain

sinfo -R
REASON   USER  TIMESTAMP   NODELIST
Low RealMemory   slurm 2019-05-17T10:05:26 node[001-003]


[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type  Name Nodes
----- ---- ----------------
Slurm defq node001..node003
Slurm gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'" 
node003: Thread(s) per core: 1 
node003: Core(s) per socket: 12 
node003: Socket(s): 2 
node001: Thread(s) per core: 1 
node001: Core(s) per socket: 12 
node001: Socket(s): 2 
node002: Thread(s) per core: 1 
node002: Core(s) per socket: 12 
node002: Socket(s): 2 

scontrol show nodes node001 
NodeName=node001 Arch=x86_64 CoresPerSocket=12 
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 
AvailableFeatures=(null) 
ActiveFeatures=(null) 
Gres=gpu:1 
NodeAddr=node001 NodeHostName=node001 Version=17.11 
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018 
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1 
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A 
Partitions=defq 
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17 
CfgTRES=cpu=24,mem=196489092M,billing=24 
AllocTRES= 
CapWatts=n/a 
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 


sinfo 
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 
defq* up infinite 3 drain node[001-003] 
gpuq up infinite 0 n/a 


scontrol show nodes| grep -i mem 
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1 
CfgTRES=cpu=24,mem=196489092M,billing=24 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1 
CfgTRES=cpu=24,mem=196489092M,billing=24 
Re

[slurm-users] Where to adjust the memory limit from sinfo vs free command?

2019-05-16 Thread Robert Kudyba
The MEMORY limit here shows 1, which I believe is 1 MB? But the results of the 
free command clearly show we have more than that. Where is this configured?

sinfo -lNe
Thu May 16 16:41:23 2019
NODELIST NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node001      1 defq*     idle    24 2:12:1      1        0      1 (null)   none
node002      1 defq*     idle    24 2:12:1      1        0      1 (null)   none
node003      1 defq*     idle    24 2:12:1      1        0      1 (null)   none
[rkudyba@ciscluster ~]$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)

free -h
  totalusedfree  shared  buff/cache   available
Mem:   187G7.8G128G992M 50G176G
Swap:   15G3.3G 12G
[rkudyba@ciscluster ~]$ srun -N 3 free -h
  totalusedfree  shared  buff/cache   available
Mem:   187G4.5G147G1.8G 35G179G
Swap:   11G382M 11G
  totalusedfree  shared  buff/cache   available
Mem:   187G4.5G145G1.6G 36G179G
Swap:   11G658M 11G
  totalusedfree  shared  buff/cache   available
Mem:   187G 95G 78G   
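
If I'm reading the docs right, the MEMORY column in sinfo is just each node's
configured RealMemory from slurm.conf (it defaults to 1 MB when unset), so it
would be set by a node line along these lines (hypothetical values):

NodeName=node[001-003] CoresPerSocket=12 Sockets=2 RealMemory=191879 Gres=gpu:1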

[slurm-users] Myload script from Slurm Gang Scheduling tutorial

2019-05-16 Thread Robert Kudyba
Hello,

Can anyone share the myload script referenced in 
https://slurm.schedmd.com/gang_scheduling.html 


We would like to test this on our Bright Cluster, which now runs Slurm as the
workload manager, to allow multiple jobs to run concurrently.
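
In the meantime, a crude stand-in I could use (my own sketch, not the
tutorial's actual myload script) would be something like:

#!/bin/bash
# Stand-in for "myload": submit COUNT single-CPU jobs that each hold their
# allocation for SECS seconds, so time-slicing between them can be observed.
COUNT=${1:-10}
SECS=${2:-300}
for i in $(seq 1 "$COUNT"); do
    sbatch -n1 -o /dev/null --wrap="sleep ${SECS}"
done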


Thanks,

Rob