Re: [slurm-users] ulimits

2023-11-16 Thread Ozeryan, Vladimir
Thank you Franky, I will give it a shot.

From: slurm-users  On Behalf Of Franky 
Backeljauw
Sent: Thursday, November 16, 2023 3:14 PM
To: Slurm User Community List ; 
slurm-us...@schedmd.com
Subject: [EXT] Re: [slurm-users] ulimits

Hi Vlad

If you are using the systemd services, then you can change the value in the 
file /usr/lib/systemd/system/slurmd.service, e.g.:

LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
LimitMSGQUEUE=12345678

See https://www.baeldung.com/linux/ulimit-limits-systemd-units for the list of 
possibilities...
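
If you would rather not edit the packaged unit file under /usr/lib directly, a systemd drop-in override is another option. A minimal sketch (assuming the unit is named slurmd and your distribution supports drop-ins):

    # "systemctl edit slurmd" opens an override.conf for the unit
    sudo systemctl edit slurmd

    # contents of the override (lands under /etc/systemd/system/slurmd.service.d/):
    [Service]
    LimitMSGQUEUE=infinity
    LimitMEMLOCK=infinity

    # pick up the change
    sudo systemctl daemon-reload
    sudo systemctl restart slurmd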

-- Kind regards

Franky



From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Ozeryan, Vladimir <vladimir.ozer...@jhuapl.edu>
Date: Thursday, November 16, 2023 at 21:05
To: slurm-us...@schedmd.com
Subject: [slurm-users] ulimits
Hello everyone,

I am having the following issue: on the compute nodes, "POSIX message queues" is 
set to unlimited for both the soft and hard limits.
However, when I run "srun -w node01 --pty bash -I" and then, once on the node, 
run "cat /proc/SLURMPID/limits", it shows that "Max msgqueue size" is set to 
819200 for both soft and hard limits.
Where is this being set, and how do I change it?

Thank you,

Vlad Ozeryan
AMDS - AB1 Linux-Support
vladimir.ozer...@jhuapl.edu
Ext. 23966



[slurm-users] ulimits

2023-11-16 Thread Ozeryan, Vladimir
Hello everyone,

I am having the following issue: on the compute nodes, "POSIX message queues" is 
set to unlimited for both the soft and hard limits.
However, when I run "srun -w node01 --pty bash -I" and then, once on the node, 
run "cat /proc/SLURMPID/limits", it shows that "Max msgqueue size" is set to 
819200 for both soft and hard limits.
Where is this being set, and how do I change it?

Thank you,

Vlad Ozeryan
AMDS - AB1 Linux-Support
vladimir.ozer...@jhuapl.edu
Ext. 23966



Re: [slurm-users] [EXT] Submitting hybrid OpenMPI and OpenMP Jobs

2023-09-22 Thread Ozeryan, Vladimir
Hello,

I would set "--ntasks" to the number of CPUs you want to use for your job and 
remove "--cpus-per-task", which defaults to 1.

From: slurm-users  On Behalf Of Selch, 
Brigitte (FIDD)
Sent: Friday, September 22, 2023 7:58 AM
To: slurm-us...@schedmd.com
Subject: [EXT] [slurm-users] Submitting hybrid OpenMPI and OpenMP Jobs

Hello,

one of our applications needs a hybrid OpenMPI and OpenMP job submission.
Only one task is allowed per node, but that task should use all cores of the 
node.
So, for example, I wrote:

#!/bin/bash

#SBATCH --nodes=5
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=44
#SBATCH --export=ALL

export OMP_NUM_THREADS=44
mpiexec PreonNode test.prscene


But the job does not use more than one thread:

...
Thread binding will be disabled because the full machine is not available for 
the process.
Detected 44 CPU threads, 2 l3 caches and 2 packages on the machine.
Number of CPU processors reported by OpenMP: 1
Maximum number of CPU threads reported by OpenMP: 44

Warning: OMP_NUM_THREADS was set to 44, which is higher than the number of 
available processors of 1. Will use 1 threads now.
...

What did I do wrong?
Does anyone have any idea why OpenMP thinks it can only use one thread per node?

Thanks !

Best regards,
Brigitte Selch

MAN Truck & Bus SE
IT Produktentwicklung Simulation (FIDD)
Vogelweiher Str. 33
90441 Nürnberg



MAN Truck & Bus SE
Sitz der Gesellschaft: München
Registergericht: Amtsgericht München, HRB 247520
Vorsitzender des Aufsichtsrats: Christian Levin, Vorstand: Alexander Vlaskamp 
(Vorsitzender), Murat Aksel, Friedrich-W. Baumann, Michael Kobriger, Inka 
Koljonen, Arne Puls, Dr. Frederik Zohm

You can find information about how we process your personal data and your 
rights in our data protection notice: 
www.man.eu/data-protection-notice

This e-mail (including any attachments) is confidential and may be privileged.
If you have received it by mistake, please notify the sender by e-mail and 
delete this message from your system.
Any unauthorised use or dissemination of this e-mail in whole or in part is 
strictly prohibited.
Please note that e-mails are susceptible to change.
MAN Truck & Bus SE (including its group companies) shall not be liable for the 
improper or incomplete transmission of the information contained in this 
communication nor for any delay in its receipt.
MAN Truck & Bus SE (or its group companies) does not guarantee that the 
integrity of this communication has been maintained nor that this communication 
is free of viruses, interceptions or interference.


[slurm-users] MCNP6.2 test

2023-07-19 Thread Ozeryan, Vladimir
Hello everyone,

Has anyone here ever run an MCNP6.2 parallel job via the Slurm scheduler?
I am looking for a simple test job to verify my software compilation.

Thank you,

Vlad Ozeryan


[slurm-users] Slurm Rest API error

2023-06-28 Thread Ozeryan, Vladimir
Hello everyone,

I am trying to get access to the Slurm REST API working.

JWT is configured and a token has been generated. All daemons (slurmdbd, 
slurmctld, and slurmrestd) are configured and running. I can successfully reach 
the Slurm API as the "slurm" user, but that's it:
bash-4.2$ echo -e "GET /slurm/v0.0.39/jobs HTTP/1.1\r\nAccept: */*\r\n" | 
slurmrestd - That works.

But as my user I get the following error:

[user@sched01 slurm-23.02.3]$ curl localhost:6820/slurm/v0.0.39/diag --header 
"X-SLURM-USER-NAME: $USER" --header "X-SLURM-USER-TOKEN: $SLURM_JWT"
HTTP/1.1 500 INTERNAL ERROR
Connection: Close
Content-Length: 833
Content-Type: application/json

{
  "meta": {
    "plugin": {
      "type": "openapi\/v0.0.39",
      "name": "Slurm OpenAPI v0.0.39",
      "data_parser": "v0.0.39"
    },
    "client": {
      "source": "[localhost]:55960"
    },
    "Slurm": {
      "version": {
        "major": 23,
        "micro": 3,
        "minor": 2
      },
      "release": "23.02.3"
    }
  },
  "errors": [
    {
      "description": "openapi_get_db_conn() failed to open slurmdb connection",
      "error_number": 7000,
      "error": "Unable to connect to database",
      "source": "init_connection"
    },
    {
      "description": "slurm_get_statistics() failed to get slurmctld statistics",
      "error_number": -1,
      "error": "Unspecified error",
      "source": "_op_handler_diag"
    }
  ],
  "warnings": [
  ],
  "statistics": null
}

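For what it's worth, a sketch of how a non-slurm user typically obtains and uses 
a token (assuming auth/jwt is enabled in both slurm.conf and slurmdbd.conf; the 
slurmdb error above suggests slurmdbd may not be accepting or validating the 
token):

    # generate a fresh token for the current user (scontrol prints SLURM_JWT=...)
    unset SLURM_JWT
    export $(scontrol token lifespan=3600)

    # query the diag endpoint with the new token
    curl -s localhost:6820/slurm/v0.0.39/diag \
         --header "X-SLURM-USER-NAME: $USER" \
         --header "X-SLURM-USER-TOKEN: $SLURM_JWT"
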
Thank you,

Vlad Ozeryan
AMDS - AB1 Linux-Support
vladimir.ozer...@jhuapl.edu
Ext. 23966



Re: [slurm-users] [EXT] --mem is not limiting the job's memory

2023-06-22 Thread Ozeryan, Vladimir
No worries.
No, we don't have any OS-level settings, only "allowed_devices.conf", which just 
has /dev/random, /dev/tty, and the like.

But I think this could be the culprit; check the man page for cgroup.conf:
AllowedRAMSpace=100

I would just leave these four:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

Vlad.
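
One way to double-check that the limit is actually applied to a job is to look 
at its memory cgroup on the compute node. A sketch assuming cgroup v1 and the 
task/cgroup plugin (the path differs under cgroup v2; <uid> and <jobid> are 
placeholders):

    cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes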

From: slurm-users  On Behalf Of Boris 
Yazlovitsky
Sent: Thursday, June 22, 2023 5:40 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] [EXT] --mem is not limiting the job's memory

Thank you, Vlad - it looks like we have the same "yes" settings.
Do you remember whether you had to change any settings at the OS level or in the 
kernel to make it work?

-b

On Thu, Jun 22, 2023 at 5:31 PM Ozeryan, Vladimir 
<vladimir.ozer...@jhuapl.edu> wrote:
Hello,

We have the following configured and it seems to be working ok.

CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
Vlad.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Boris Yazlovitsky
Sent: Thursday, June 22, 2023 4:50 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] --mem is not limiting the job's memory

Hello Vladimir, thank you for your response.

This is the cgroup.conf file:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxRAMPercent=90
AllowedSwapSpace=0
AllowedRAMSpace=100
MemorySwappiness=0
MaxSwapPercent=0

/etc/default/grub:
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 cgroup_enable=memory 
swapaccount=1"

what other cgroup settings need to be set?

&& thank you!
-b

On Thu, Jun 22, 2023 at 4:02 PM Ozeryan, Vladimir 
<vladimir.ozer...@jhuapl.edu> wrote:
--mem=5G should allocate 5G of memory per node.
Are your cgroups configured?

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Boris Yazlovitsky
Sent: Thursday, June 22, 2023 3:28 PM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] --mem is not limiting the job's memory

Running Slurm 22.03.02 on an Ubuntu 22.04 server.
Jobs submitted with --mem=5g are able to allocate an unlimited amount of memory.

How can I limit, at the job-submission level, how much memory a job can grab?

thanks, and best regards!
Boris



Re: [slurm-users] [EXT] --mem is not limiting the job's memory

2023-06-22 Thread Ozeryan, Vladimir
Hello,

We have the following configured and it seems to be working ok.

CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

Vlad.

From: slurm-users  On Behalf Of Boris 
Yazlovitsky
Sent: Thursday, June 22, 2023 4:50 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] [EXT] --mem is not limiting the job's memory

Hello Vladimir, thank you for your response.

This is the cgroup.conf file:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxRAMPercent=90
AllowedSwapSpace=0
AllowedRAMSpace=100
MemorySwappiness=0
MaxSwapPercent=0

/etc/default/grub:
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 cgroup_enable=memory 
swapaccount=1"

what other cgroup settings need to be set?

&& thank you!
-b

On Thu, Jun 22, 2023 at 4:02 PM Ozeryan, Vladimir 
<vladimir.ozer...@jhuapl.edu> wrote:
--mem=5G should allocate 5G of memory per node.
Are your cgroups configured?

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Boris Yazlovitsky
Sent: Thursday, June 22, 2023 3:28 PM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] --mem is not limiting the job's memory

Running Slurm 22.03.02 on an Ubuntu 22.04 server.
Jobs submitted with --mem=5g are able to allocate an unlimited amount of memory.

How can I limit, at the job-submission level, how much memory a job can grab?

thanks, and best regards!
Boris



Re: [slurm-users] [EXT] --mem is not limiting the job's memory

2023-06-22 Thread Ozeryan, Vladimir
--mem=5G should allocate 5G of memory per node.
Are your cgroups configured?

From: slurm-users  On Behalf Of Boris 
Yazlovitsky
Sent: Thursday, June 22, 2023 3:28 PM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] --mem is not limiting the job's memory

Running Slurm 22.03.02 on an Ubuntu 22.04 server.
Jobs submitted with --mem=5g are able to allocate an unlimited amount of memory.

How can I limit, at the job-submission level, how much memory a job can grab?

thanks, and best regards!
Boris



Re: [slurm-users] [EXT] Submit sbatch to multiple partitions

2023-04-17 Thread Ozeryan, Vladimir
You should be able to specify both partitions in your sbatch submission script, 
unless there is some other configuration preventing this.
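
For example, a minimal sketch (the partition names partA and partB are 
hypothetical):

    #SBATCH --partition=partA,partB

or equivalently on the command line:

    sbatch --partition=partA,partB job.sh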

-Original Message-
From: slurm-users  On Behalf Of Xaver 
Stiensmeier
Sent: Monday, April 17, 2023 5:37 AM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Submit sbatch to multiple partitions

Dear slurm-users list,

let's say I want to submit a large batch job that should run on 8 nodes.
I have two partitions, each holding 4 nodes. Slurm will now tell me that 
"Requested node configuration is not available". However, what I would like is 
for Slurm to make use of both partitions and allocate all 8 nodes.

Best regards,
Xaver Stiensmeier




Re: [slurm-users] [EXT] Software and Config for Job submission host only

2022-05-12 Thread Ozeryan, Vladimir
Hello,

All you need to set up is the path to the Slurm binaries (srun, sbatch, sinfo, 
sacct, etc.), whether they are available via a shared file system or installed 
locally on the submit nodes, and possibly the man pages.
You probably want to do this somewhere in /etc/profile.d or equivalent.
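
A minimal sketch of what that file might look like (the /opt/slurm prefix is 
just an example; point it at wherever the binaries and man pages actually live):

    # /etc/profile.d/slurm.sh
    export PATH=/opt/slurm/bin:$PATH
    export MANPATH=/opt/slurm/share/man:$MANPATH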


-Original Message-
From: slurm-users  On Behalf Of Richard 
Chang
Sent: Thursday, May 12, 2022 5:06 AM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Software and Config for Job submission host only

Hi,

I am new to SLURM and I am still trying to understand stuff. There is ample 
documentation available that teaches you how to set it up quickly.

Pardon me if this was asked before; I was not able to find anything pointing 
to this.

I am trying to figure out whether there is something like PBS's execution-only 
setup for Slurm, such that I can install it on the login nodes and those nodes 
will only be responsible for job submission and not job execution.

Is there a particular package to install, and is there a different config that 
needs to be put on the job-submission-only nodes?

Basically, I want the job submission nodes to have all the commands and 
everything that will let admins and users get reports, logs, and so on - just 
not execution of the jobs.

Thanks in advance for your help.

RC.





Re: [slurm-users] [EXT] Distribute the node resources in multiple partitions and regarding job submission script

2022-04-12 Thread Ozeryan, Vladimir
1.   I don't see where you are specifying a default partition (Default=YES on 
the PartitionName line).

2.   In your "NodeName=..." line you have Gres=gpu:2, so all nodes on that line 
have 2 GPUs. Create another "NodeName" line below it and list your non-GPU nodes 
there without the Gres flag; see the sketch after this list.
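
A generic sketch of what item 2 describes (node names, counts, and partition 
layout are hypothetical, not taken from your config):

    # GPU nodes
    NodeName=gpu[01-04] Sockets=1 CPUs=64 CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:2
    # non-GPU nodes: same line, minus the Gres flag
    NodeName=cpu[01-04] Sockets=1 CPUs=64 CoresPerSocket=64 ThreadsPerCore=1
    # and mark the default partition, per item 1
    PartitionName=par1 Default=YES State=UP Nodes=gpu[01-04],cpu[01-04]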

From: slurm-users  On Behalf Of Purvesh 
Parmar
Sent: Tuesday, April 12, 2022 5:49 AM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Distribute the node resources in multiple 
partitions and regarding job submission script

Hello,

I am using slurm 21.08. I am stuck with the following.

Q1: I have 8 nodes, each with 2 GPUs, 128 cores, and 512 GB RAM. I want to 
split each node's resources across 2 partitions, so that partition "par1" has 
the node's 2 GPUs, 64 cores, and 256 GB RAM, and the other partition "par2" has 
the remaining 64 cores and 256 GB RAM of the same node, with no GPUs.

par1 should be the default partition.

I have used MaxCPUsPerNode and also listed each node in both par1 and par2. 
However, at job submission, if I give par2 as the partition name and request 
gres:gpu, the job still gets submitted and runs (in spite of par2 not having 
GPUs).

slurm.conf (something like this)


NodeName=comp1,comp2..comp8 Sockets=1 CPUs=64 CoresPerSocket=64 
ThreadsPerCore=1 Gres=gpu:2
PartitionName=par1 State=UP Nodes=comp1,comp2..comp8 MaxCPUsPerNode=64
PartitionName=par2 State=UP Nodes=comp1,comp2..comp8 MaxCPUsPerNode=64


Where are things going wrong?

Q2: How do I save the job scripts permanently? I have set
SlurmdSpoolDir=/usr/local/slurm/var/spool/slurmd
AccountingStorageEnforce=safe
AccountingStoreFlags=job_script,job_env

Regards,
Purvesh


Re: [slurm-users] step creation temporarily disabled, retrying (Requested nodes are busy)

2022-03-04 Thread Ozeryan, Vladimir
Try it with an sbatch script and use the "mpirun" executable directly, without "--mpi=pmi2".
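
A sketch of what I mean (node and task counts are just an example, matching the 
3-node run below):

    #!/bin/bash
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=1

    # let mpirun detect the Slurm allocation instead of launching with srun --mpi=pmi2
    /usr/lib64/mpich/bin/mpirun -n 3 /scratch/mpi-helloworld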

From: slurm-users  On Behalf Of masber 
masber
Sent: Tuesday, March 1, 2022 12:54 PM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] step creation temporarily disabled, retrying 
(Requested nodes are busy)

Dear slurm user community,

I have a Slurm cluster on CentOS 7 installed through yum, and I also have MPICH 
installed.

I can ssh into one of the nodes and run an MPI job:

# /usr/lib64/mpich/bin/mpirun --hosts 
nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469
 /scratch/mpi-helloworld
Warning: Permanently added 
'nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469,10.233.88.25' (ECDSA) to 
the list of known hosts.
Hello world from processor nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 
rank 2 out of 3 processors
Hello world from processor nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 
rank 0 out of 3 processors
Hello world from processor nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 
rank 1 out of 3 processors

However, I can't make it work through Slurm; these are the logs from running the 
job:

# srun --mpi=pmi2 -N3 -vvv /usr/lib64/mpich/bin/mpirun /scratch/mpi-helloworld
srun: defined options
srun:  
srun: mpi : pmi2
srun: nodes   : 3
srun: verbose : 3
srun:  
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=18446744073709551615
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=18446744073709551615
srun: debug:  propagating RLIMIT_NOFILE=1048576
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=33065
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 44387
srun: debug:  Entering _msg_thr_internal
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: jobid 8: 
nodes(3):`nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469',
 cpu counts: 1(x3)
srun: debug2: creating job with 3 tasks
srun: debug:  requesting job 8, user 0, nodes 3 including ((null))
srun: debug:  cpus 3, tasks 3, name mpirun, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch
srun: debug:  mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: 
(vector,(0,3,1))
srun: debug:  mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 37029
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug:  mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 41275
srun: debug:  Started IO server thread (140538792195840)
srun: debug:  Entering _launch_tasks
srun: launching StepId=8.0 on host 
nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: launching StepId=8.0 on host 
nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 1
srun: launching StepId=8.0 on host 
nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 2
srun: route/default: init: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 3
srun: debug2: Tree head got back 1
srun: debug2: Tree head got back 2
srun: debug2: Tree head got back 3
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 17
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.26:33470, node rank 0, 
sd=18
srun: debug2: eio_message_socket_accept: got message connection from 
10.233.88.26:53410 19
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node 
nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: 

Re: [slurm-users] [EXT] Building Slurm with UCX support

2022-01-12 Thread Ozeryan, Vladimir
I am not sure about the rest of the Slurm world, but since I will most likely 
update OpenMPI more often than Slurm, I've configured and built OpenMPI with UCX 
and Slurm support, and I think both are enabled by default unless you pass the 
corresponding "--without" option. Works great so far!

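For reference, a sketch of the kind of OpenMPI configure line I mean (the prefix 
and the PMIx/UCX paths are examples; point them at your own installations):

    ./configure --prefix=/opt/openmpi \
        --with-slurm \
        --with-pmix=/opt/deepops/pmix \
        --with-ucx=/usr
    make -j && make install
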
-Original Message-
From: slurm-users  On Behalf Of Matthias 
Leopold
Sent: Wednesday, January 12, 2022 11:54 AM
To: Slurm User Community List 
Subject: [EXT] [slurm-users] Building Slurm with UCX support

Hi,

I'm compiling Slurm with Ansible playbooks from the NVIDIA deepops framework 
(https://github.com/NVIDIA/deepops). I'm trying to add UCX support. How can I 
tell if UCX is actually included in the resulting binaries (without actually 
using Slurm)? I was looking at the executables and *.so files with ldd, but 
found no reference to the UCX installation in /usr/lib/ucx.

Background:
- I'm struggling with the build system using a non-existent path 
(PMIXP_UCX_LIBPATH=\"/usr/lib64\"). The last ugly hack was to create a symlink 
from /usr/lib/ucx to /usr/lib64/ucx.
- I can't easily test actual operation of MPI with UCX because I'm on a limited 
test/dev system and (frankly) because I'm not yet an MPI expert.

The configure string used is:
./configure --prefix=/usr/local --disable-dependency-tracking --disable-debug 
--disable-x11 --enable-really-no-cray --enable-salloc-kill-cmd --with-hdf5=no 
--sysconfdir=/etc/slurm --enable-pam 
--with-pam_dir=/lib/x86_64-linux-gnu/security
--with-shared-libslurm --without-rpath --with-pmix=/opt/deepops/pmix 
--with-hwloc=/opt/deepops/hwloc --with-ucx=/usr

thx
Matthias



Re: [slurm-users] TimeLimit parameter

2021-12-02 Thread Ozeryan, Vladimir
Hello,

In your case the 15-minute partition "TimeLimit" is a default value and should 
only apply if the user has not specified a time limit for their job in their 
sbatch script or srun command, has specified a lower value than the partition 
default, or has done so incorrectly.
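
For reference, a partition definition can carry both a default and a hard limit; 
a minimal sketch (names and values are illustrative):

    PartitionName=short Nodes=node[01-04] DefaultTime=00:10:00 MaxTime=00:15:00 State=UP

DefaultTime is what a job gets when no --time is given, while MaxTime is the 
ceiling a requested time limit is checked against.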

From: slurm-users  On Behalf Of Gestió 
Servidors
Sent: Thursday, December 2, 2021 8:18 AM
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] TimeLimit parameter

Hello,

I'm going to describe a problem I have detected in my Slurm cluster. If I 
configure a partition with a "TimeLimit" of, for example, 15 minutes and, later, 
a user submits a job with a bigger "TimeLimit" (for example, 20 minutes), the 
job remains in PENDING state because the TimeLimit requested by the user is 
bigger than the one configured for the queue. My question is: is there any way 
to force the job down to the partition's TimeLimit if the user requests a bigger 
value?

Thanks.




[slurm-users] max_script_size

2021-09-13 Thread Ozeryan, Vladimir
max_script_size=#
Specify the maximum size of a batch script, in bytes. The default value is 4 
megabytes. Larger values may adversely impact system performance.
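
For reference, a minimal sketch of where this lives (it is a SchedulerParameters 
option in slurm.conf; the 8 MB value is only an example):

    # slurm.conf
    SchedulerParameters=max_script_size=8388608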

I have users who have requested an increase to this setting. What system 
performance issues might arise from changing that value to a higher number?

Thank you,

Vlad Ozeryan
AMDS - ADX/AB1
vladimir.ozer...@jhuapl.edu
Ext. 23966