Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

Hi Sajesh,

On 10/8/20 4:18 pm, Sajesh Singh wrote:


  Thank you for the tip. That works as expected.


No worries, glad it's useful. Do be aware that the core bindings for the 
GPUs would likely need to be adjusted for your hardware!
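
If it helps, one way to see which cores sit closest to each GPU (assuming a
reasonably recent driver) is the topology matrix:

   nvidia-smi topo -m

The "CPU Affinity" column there is roughly what the Cores= entries in
gres.conf should follow.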


Best of luck,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Christopher,

 Thank you for the tip. That works as expected. 


-SS-

-Original Message-
From: slurm-users  On Behalf Of 
Christopher Samuel
Sent: Thursday, October 8, 2020 6:52 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


On 10/8/20 3:48 pm, Sajesh Singh wrote:

>Thank you. Looks like the fix is indeed the missing file 
> /etc/slurm/cgroup_allowed_devices_file.conf

No, you don't want that, that will allow all access to GPUs whether people have 
requested them or not.

What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

On 10/8/20 3:48 pm, Sajesh Singh wrote:


   Thank you. Looks like the fix is indeed the missing file 
/etc/slurm/cgroup_allowed_devices_file.conf


No, you don't want that, that will allow all access to GPUs whether 
people have requested them or not.


What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38
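
Once that's in place on each node and slurmd has been restarted, a quick
sanity check (adjust the partition and type names to yours) is:

   srun -p yourGPUpartition --gres=gpu:v100:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'

which should print a device index (e.g. 0) rather than an empty line.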


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Relu, 
  Thank you. Looks like the fix is indeed the missing file 
/etc/slurm/cgroup_allowed_devices_file.conf



-SS-

-Original Message-
From: slurm-users  On Behalf Of 
Christopher Samuel
Sent: Thursday, October 8, 2020 6:10 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


Hi Sajesh,

On 10/8/20 11:57 am, Sajesh Singh wrote:

> debug:  common_gres_set_env: unable to set env vars, no device files 
> configured

I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

Hi Sajesh,

On 10/8/20 11:57 am, Sajesh Singh wrote:

debug:  common_gres_set_env: unable to set env vars, no device files 
configured


I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?
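
On the node itself something like:

   ls -l /dev/nvidia[0-9]*

should show one device file per GPU - those are the File= paths that
gres.conf needs.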

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu

Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*

Relu

On 2020-10-08 16:32, Sajesh Singh wrote:


It seems as though the modules are loaded, as when I run lsmod I get
the following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi

Thu Oct  8 16:31:57 2020

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users  On Behalf Of Relu Patrascu

Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER

That usually means you don't have the nvidia kernel module loaded, 
probably because there's no driver installed.


Relu

On 2020-10-08 14:57, Sajesh Singh wrote:

Slurm 18.08

CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is
never set and I see the following message in the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device
files configured

Has anyone encountered this before?

Thank you,

SS



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Yes. It is located in the /etc/slurm directory

--

-SS-

From: slurm-users  On Behalf Of Brian 
Andrus
Sent: Thursday, October 8, 2020 5:02 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


do you have your gres.conf on the nodes also?

Brian Andrus
On 10/8/2020 11:57 AM, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs, the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus

do you have your gres.conf on the nodes also?
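
A quick way to check (NODENAME being the GPU node) is to compare checksums,
e.g.:

   md5sum /etc/slurm/gres.conf
   ssh NODENAME md5sum /etc/slurm/gres.conf

If they differ, the node is reading a different copy than the controller.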

Brian Andrus

On 10/8/2020 11:57 AM, Sajesh Singh wrote:


Slurm 18.08

CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is never
set and I see the following message in the slurmd.log file:


debug:  common_gres_set_env: unable to set env vars, no device files 
configured


Has anyone encountered this before?

Thank you,

SS



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
I only get a line returned for “Gres=”, but this is the same behavior on 
another cluster that has GPUs and the variable gets set on that cluster.

-Sajesh-

--
_
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
200 Central Park West
New York, NY 10024

(O) (212) 313-7263
(C) (917) 763-9038
(E) ssi...@amnh.org

From: slurm-users  On Behalf Of Renfro, 
Michael
Sent: Thursday, October 8, 2020 4:53 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER

From any node you can run scontrol from, what does ‘scontrol show node 
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and 
“CfgTRES=”.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sajesh Singh <ssi...@amnh.org>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] CUDA environment variable not being set


External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


It seems as though the modules are loaded, as when I run lsmod I get the
following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Relu Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


That usually means you don't have the nvidia kernel module loaded, probably 
because there's no driver installed.

Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs, the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node 
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and 
“CfgTRES=”.
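
If gres is configured and being picked up, those lines look something like
this (illustrative values only, not copied from a real node):

   Gres=gpu:2
   CfgTRES=cpu=...,mem=...,gres/gpu=2

Whether gres/gpu shows up in CfgTRES= also depends on AccountingStorageTRES,
if I remember correctly.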

From: slurm-users  on behalf of Sajesh 
Singh 
Reply-To: Slurm User Community List 
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] CUDA environment variable not being set


External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


It seems as though the modules are loaded, as when I run lsmod I get the
following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users  On Behalf Of Relu 
Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


That usually means you don't have the nvidia kernel module loaded, probably 
because there's no driver installed.

Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs, the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
It seems as though the modules are loaded, as when I run lsmod I get the
following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users  On Behalf Of Relu 
Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

EXTERNAL SENDER


That usually means you don't have the nvidia kernel module loaded, probably 
because there's no driver installed.

Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs, the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
That usually means you don't have the nvidia kernel module loaded, 
probably because there's no driver installed.


Relu

On 2020-10-08 14:57, Sajesh Singh wrote:


Slurm 18.08

CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is never
set and I see the following message in the slurmd.log file:


debug:  common_gres_set_env: unable to set env vars, no device files 
configured


Has anyone encountered this before?

Thank you,

SS



[slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs, the environment
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 
3-partition model as well once I'd sent my email. So, your comments helped to 
confirm that I was thinking on the right lines.

Best regards,
David


From: slurm-users  on behalf of Thomas 
M. Payerle 
Sent: 06 October 2020 18:50
To: Slurm User Community List 
Subject: Re: [slurm-users] Controlling access to idle nodes

We use a scavenger partition, and although we do not have the policy you 
describe, it could be used in your case.

Assume you have 6 nodes (node-[0-5]) and two groups A and B.
Create partitions
partA = node-[0-2]
partB = node-[3-5]
all = node-[0-5]

Create QoSes normal and scavenger.
Allow the normal QoS to preempt jobs with the scavenger QoS.

In sacctmgr, give members of group A access to use partA with the normal QoS and
members of group B access to use partB with the normal QoS.
Allow both A and B to use the partition "all" with the scavenger QoS.

So members of A can launch jobs on partA with normal QoS (you probably want to
make that their default), and similarly members of B can launch jobs on partB
with normal QoS.
But members of A can also launch jobs on partB with scavenger QoS, and vice
versa. If the partB nodes used by A are needed by B, the scavenger jobs will get preempted.

This is not automatic (users need to explicitly say they want to run jobs on
the other half of the cluster), but that is probably reasonable: there are some
jobs one does not wish to have preempted, even if that means they have to wait a
while in the queue.
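
To make that concrete, a rough sketch of the pieces involved (partition, group,
account and QoS names are only illustrative, not our actual config):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=partA Nodes=node-[0-2] AllowGroups=grpA AllowQos=normal
PartitionName=partB Nodes=node-[3-5] AllowGroups=grpB AllowQos=normal
PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger

# accounting setup
sacctmgr add qos scavenger
sacctmgr modify qos scavenger set PreemptMode=cancel
sacctmgr modify qos normal set Preempt=scavenger
sacctmgr modify account name=acctA set qos+=normal,scavenger
sacctmgr modify account name=acctB set qos+=normal,scavenger

A preemptable job then looks something like:

sbatch -p all --qos=scavenger myjob.sh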

On Tue, Oct 6, 2020 at 11:12 AM David Baker <d.j.ba...@soton.ac.uk> wrote:
Hello,

I would appreciate your advice on how to deal with this situation in Slurm,
please. I have a set of nodes used by 2 groups, and normally each group
would have access to half the nodes. So, I could limit each group to 3 nodes
each, for example. I am trying to devise a scheme that allows each group to
make the best use of the nodes at all times. In other words, each group
could potentially use all the nodes (assuming they are all free and the other
group isn't using the nodes at all).

I cannot set hard and soft limits in Slurm, and so I'm not sure how to make the
situation flexible. Ideally it would be good for each group to be able to use
their allocation and then take advantage of any idle nodes via a scavenging
mechanism. The other group could then pre-empt the scavenger jobs and reclaim
their nodes. I'm struggling with this since it seems like a two-way scavenger
situation.

Could anyone please help? I have, by the way, set up partition-based 
pre-emption in the cluster. This allows the general public to scavenge nodes 
owned by research groups.

Best regards,
David




--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroadspaye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread William Brown
R is single threaded.

On Thu, 8 Oct 2020, 07:44 Diego Zuccato,  wrote:

> On 08/10/20 08:19, David Bellot wrote:
>
> > good spot. At least, scontrol show job is now saying that each job only
> > requires one "CPU", so it seems all the cores are treated the same way
> now.
> > Though I still have the problem of not using more than half the cores.
> > So I suppose it might be due to the way I submit (batchtools in this
> > case) the jobs.
> Maybe R is generating single-threaded code? In that case, only a single
> process can run on a given core at a time (processes does not share
> memory map, threads do, and on Intel CPUs there's a single MMU per core,
> not one per thread as in some AMDs).
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>


Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread Diego Zuccato
On 08/10/20 08:19, David Bellot wrote:

> good spot. At least, scontrol show job is now saying that each job only
> requires one "CPU", so it seems all the cores are treated the same way now.
> Though I still have the problem of not using more than half the cores.
> So I suppose it might be due to the way I submit (batchtools in this
> case) the jobs.
Maybe R is generating single-threaded code? In that case, only a single
process can run on a given core at a time (processes does not share
memory map, threads do, and on Intel CPUs there's a single MMU per core,
not one per thread as in some AMDs).

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-08 Thread Diego Zuccato
On 06/10/20 13:45, Riebs, Andy wrote:

Well, the cluster is quite heterogeneous, and node bl0-02 only has 24
threads available:
str957-bl0-02:~$ lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   46 bits physical, 48 bits virtual
CPU(s):  24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   45
Model name:  Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:7
CPU MHz: 1943.442
CPU max MHz: 2500,
CPU min MHz: 1200,
BogoMIPS:4000.26
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:15360K
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti
tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts


str957-bl0-03:~$ lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   46 bits physical, 48 bits virtual
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   63
Model name:  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:2
CPU MHz: 2400.142
CPU max MHz: 2300,
CPU min MHz: 1200,
BogoMIPS:4800.28
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1
sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm
cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi
flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts

Another couple of nodes do have 32 threads, but with AMD CPU...

The same problem happened in the past, and seemed to "move" between
nodes even with no changes in the config. While trying to fix it I added
mtl = psm2
to /etc/openmpi/openmpi-mca-params.conf but only installing gdb and its
dependencies apparently "worked". But, as I feared, it wos just a mask,
not a solution.

>> The problem is with a single, specific, node: str957-bl0-03 . The same
>> job script works if being allocated to another node, even with more
>> ranks (tested up to 224/4 on mtx-* nodes).
> 
> Ahhh... here's where the details help. So it appears that the problem is on a 
> single node, and probably not a general configuration or system problem. I 
> suggest starting with  something like this to help figure out why node bl0-03 
> is different
> 
> $ sudo ssh str957- bl0-02 lscpu
> $ sudo ssh str957- bl0-03 lscpu
> 
> Andy
> 
> -Original Message-
> From: Diego Zuccato [mailto:diego.zucc...@unibo.it] 
> Sent: Tuesday, October 6, 2020 3:13 AM
> To: Riebs, Andy ; Slurm User Community List 
> 
> Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
> 
> On 05/10/20 14:18, Riebs, Andy wrote:
> 
> Tks for considering my query.
> 
>> You need to provide some hints! What we know so far:
>> 1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x 
>> backtrace.
> Correct.
> 
>> 2. Your decision to address this to the Slurm mailing list suggests that you 
>> think that Slurm might be involved.
> At least I couldn't replicate launching manually (it always says "no
> slots available" unless I use mpirun -np 16 ...). I'm no MPI expert
> (actually less than a noob!) so I can't rule out it's unrelated to
> Slurm. I mostly hope that on this list I can find someone with enough
> experience with both Slurm and MPI.
> 
>> 3. You have something (a job? a program?) that segfaults when you go from 30 
>> to 32 processes.
> Multiple programs, actually.
> 
>> a. What operating system?
> Debian 10.5 . Only extension is PBIS-open to authenticate users from AD.
> 
>> b. Are you seeing this while running Slurm? What version?
> 

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread David Bellot
Hi Rodrigo,

good spot. At least, scontrol show job is now saying that each job only
requires one "CPU", so it seems all the cores are treated the same way now.
Though I still have the problem of not using more than half the cores. So I
suppose it might be due to the way I submit (batchtools in this case) the
jobs.
I'm still investigating, even though NumCPUs=1 now, as it should be. Thanks.
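
For anyone who hits the same thing later, the change that got NumCPUs back to 1
was in slurm.conf (a sketch of the relevant lines only, not my full config):

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU

My understanding (happy to be corrected) is that with CR_Core and
ThreadsPerCore=2, a one-CPU request gets rounded up to a whole core, i.e. two
hardware threads, which is why scontrol was reporting NumCPUs=2 before.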

David

On Thu, Oct 8, 2020 at 4:40 PM Rodrigo Santibáñez <rsantibanez.uch...@gmail.com> wrote:

> Hi David,
>
> I had the same problem time ago when configuring my first server.
>
> Could you try SelectTypeParameters=CR_CPU instead of
> SelectTypeParameters=CR_Core?
>
> Best regards,
> Rodrigo.
>
> On Thu, Oct 8, 2020, 02:16 David Bellot 
> wrote:
>
>> Hi,
>>
>> my Slurm cluster has a dozen machines configured as follows:
>>
>> NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN
>>
>> and scheduling is:
>>
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>>
>> My problem is that only half of the logical cores are used when I run a
>> computation.
>>
>> Let me explain: I use R and the package 'batchtools' to create jobs. All
>> the jobs are created under the hood with sbatch. If I log in to all the
>> machines in my cluster and do a 'htop', I can see that only half of the
>> logical cores are used. Other methods to measure the load of each machine
>> confirmed this "visual" clue.
>> My jobs ask Slurm for only one cpu per task. I tried to enforce that with
>> the -c 1 but it didn't make any difference.
>>
>> Then I realized there was something strange:
>> when I do scontrol show job , I can spot the following output:
>>
>>NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>TRES=cpu=2,node=1,billing=2
>>Socks/Node=* NtasksPerN:B:S:C=0:0:*:2 CoreSpec=*
>>
>> that is each job uses NumCPUs=2 instead of 1. Also, I'm not sure why
>> TRES=cpu=2
>>
>> Any idea on how to solve this problem and have 100% of the logical cores
>> allocated?
>>
>> Best regards,
>> David
>>
>

-- 

David Bellot
Head of Quantitative Research

A. Suite B, Level 3A, 43-45 East Esplanade, Manly, NSW 2095
E. david.bel...@lifetrading.com.au
P. (+61) 0405 263012