Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Renfro, Michael
I could have missed a detail in my description, but we definitely don’t enable 
OverSubscribe, Shared, or ExclusiveUser. All three of those are set to “no” on 
all active queues.

Current subset of slurm.conf and squeue output:

=

# egrep '^PartitionName=(gpu|any-interactive) ' /etc/slurm/slurm.conf
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 
MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 
DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF 
ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO 
MaxCPUsPerNode=16 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP 
TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MinNodes=1 MaxNodes=4 
DefaultTime=02:00:00 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 
PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 
PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL 
LLN=NO MaxCPUsPerNode=12 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 
State=UP TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 
Nodes=node[001-040],gpunode[001-004]
# squeue -o "%6i %.15P %.10j %.5u %4C %5D %16R %6b" | grep gpunode002
778462 gpu CNN_GRU.sh miibr 11 gpunode002   gpu:1
778632 any-interactive   bash rnour 11 gpunode002   N/A

=
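
A quick way to sanity-check that the two jobs above really hold disjoint
resources (node name taken from the squeue output; the sacct line assumes
accounting is enabled and a Slurm version recent enough to report TRES):

# scontrol show node gpunode002 | egrep 'CPUAlloc|CfgTRES|AllocTRES'
# sacct -j 778462,778632 --format=JobID,Partition,AllocCPUS,AllocTRES%40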

From: slurm-users  on behalf of Relu 
Patrascu 
Reply-To: Slurm User Community List 
Date: Wednesday, September 30, 2020 at 4:02 PM
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] Running gpu and cpu jobs on the same node

If you don't use OverSubscribe, then resources are not shared: whatever 
resources a job is allocated are not available to other jobs, regardless of 
partition.

Relu
On 2020-09-30 16:12, Ahmad Khalifa wrote:
I have a machine with 4 RTX 2080 Ti cards and a Core i9. I submit jobs to it 
through MPI with PMI2 (from RELION).

If I use 5 MPI processes with 4 threads each, then basically I'm using all 4 
GPUs and 20 threads of my CPU.

My question: my current configuration allows submitting jobs to the same node 
through a different partition, but I'm not sure whether, if I use #SBATCH 
--partition=cpu, the submitted jobs will only use the remaining 2 cores (4 
threads), or whether they will share resources with my GPU job.

Thanks.




Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Relu Patrascu
If you don't use OverSubscribe, then resources are not shared: whatever 
resources a job is allocated are not available to other jobs, 
regardless of partition.


Relu

On 2020-09-30 16:12, Ahmad Khalifa wrote:
I have a machine with 4 RTX 2080 Ti cards and a Core i9. I submit jobs to it 
through MPI with PMI2 (from RELION).


If I use 5 MPI processes with 4 threads each, then basically I'm using all 4 
GPUs and 20 threads of my CPU.


My question: my current configuration allows submitting jobs to the same node 
through a different partition, but I'm not sure whether, if I use #SBATCH 
--partition=cpu, the submitted jobs will only use the remaining 2 cores (4 
threads), or whether they will share resources with my GPU job.


Thanks.




[slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Ahmad Khalifa
I have a machine with 4 RTX 2080 Ti cards and a Core i9. I submit jobs to it
through MPI with PMI2 (from RELION).

If I use 5 MPI processes with 4 threads each, then basically I'm using all 4
GPUs and 20 threads of my CPU.

My question: my current configuration allows submitting jobs to the same node
through a different partition, but I'm not sure whether, if I use #SBATCH
--partition=cpu, the submitted jobs will only use the remaining 2 cores (4
threads), or whether they will share resources with my GPU job.

Thanks.
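
For illustration only, a CPU-only submission of the kind being asked about
might look like the sketch below; the partition name, core count, and memory
request are hypothetical, and, as the replies above note, with OverSubscribe
disabled the cores granted here cannot also be handed to the already-running
GPU job.

#!/bin/bash
#SBATCH --partition=cpu        # hypothetical CPU-only partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # e.g. the remaining 2 cores / 4 hardware threads
#SBATCH --mem=8G               # hypothetical memory request

srun ./my_cpu_program          # placeholder for the actual CPU-only workload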


Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Chris Samuel

On 9/30/20 8:29 am, Relu Patrascu wrote:

We have actually modified the code on both v 19 and 20 to do what we 
would like, preemption within the same QOS, but we think that the 
community would benefit from this feature, hence our request to have it 
in the release version.


There's a special severity level for code contributions in the SchedMD 
Bugzilla: "C - Contributions".


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] slurmdbd errors after MariaDB upgrade

2020-09-30 Thread Peter Van Buren

Hello,

I am running an old version of slurm (slurm 14.11.4). It's been working 
great up until there was a recent unintended upgrade of the MariaDB 
database.
The upgrade took MariaDB from 5.5.60 to 10.3.20. After that point, I am 
unable to use the sacct command. It reports an error:


sacct: error: Resource temporarily unavailable

The slurmdbd log shows:

[2020-09-30T14:35:44.878] debug2: Opened connection 8 from 127.0.0.1
[2020-09-30T14:35:44.881] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:1000 IP:127.0.0.1 CONN:8
[2020-09-30T14:35:44.881] debug2: acct_storage_p_get_connection: request 
new connection 1

[2020-09-30T14:35:44.924] debug2: DBD_GET_JOBS_COND: called
[2020-09-30T14:35:44.945] error: Processing last message from connection 
8(127.0.0.1) uid(1000)

[2020-09-30T14:35:44.985] debug2: Closed connection 8 uid(1000)

Some research shows this might be an incompatibility between newer versions 
of MariaDB and my version of Slurm due to a change in SQL code:


https://lists.schedmd.com/pipermail/slurm-users/2018-December/002475.html

Unfortunately I don't have a backup of this database from before the 
upgrade. I dumped the slurm_acct_db database, downgraded MariaDB and 
restored the database,
but the error still occurs. Is there any way for me to get slurmdbd 
working again without starting from scratch? I'd prefer not to lose all 
of the data in the database.


Thanks,
Peter.



Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Luecht, Jeff A
So just to confirm, there is no inherent issue using srun within an SBATCH 
file?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Wednesday, September 30, 2020 10:01 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] EXTERNAL: Re: Memory per CPU


Primary one I’m aware of is that resource use is better reported (or at all in 
some cases) via srun, and srun can take care of MPI for an MPI job.  I’m sure 
there are others as well (I guess avoiding another place where you have to 
describe the resources to be used and making sure they match, in the case of 
mpirun, etc.).
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 09:38, Luecht, Jeff A 
mailto:jeff.lue...@pnc.com>> wrote:
First off, I want to thank everyone for their input and suggestions.  They 
were very helpful and ultimately pointed me in the right direction.  I spent 
several hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH.  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>; Michael 
Di Domenico mailto:mdidomeni...@gmail.com>>
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:


> what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786








Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Luecht, Jeff A
First off, I want to thank everyone for their input and suggestions.  They were 
very helpful and ultimately pointed me in the right direction.  I spent several 
hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.  
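
As a rough sketch of what changes 1 and 2 above can look like in slurm.conf
(the node names, topology, and memory figures below are made up for
illustration, not the poster's actual values):

NodeName=node[01-04] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN

# cluster-wide memory defaults/limits (can also be set per partition)
DefMemPerCPU=4096
MaxMemPerCPU=12288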

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH.  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List ; Michael Di 
Domenico 
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:

> what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786








Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Renfro, Michael
Untested, but a combination of a QOS with MaxTRESPerJob=cpu=X and a partition 
that allows or denies that QOS may work. A job_submit.lua should be able to 
adjust the QOS of a submitted job, too.
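
A sketch of that QOS idea, equally untested; the QOS name, partition name,
node list, and the 4-CPU cap below are all hypothetical:

# create a QOS that caps per-job CPU usage
sacctmgr add qos smalljobs
sacctmgr modify qos smalljobs set MaxTRESPerJob=cpu=4

# slurm.conf: attach it as the partition QOS (or use AllowQos=/DenyQos=
# if jobs should have to request it explicitly)
PartitionName=small Nodes=node[001-004] QOS=smalljobs State=UP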

On 9/30/20, 10:50 AM, "slurm-users on behalf of Paul Edmon" 
 
wrote:


Probably the best way to accomplish this is via a job_submit.lua
script.  That way you can reject at submission time.  There isn't a
feature in the partition configuration that I am aware of that can
accomplish this, but a custom job_submit script certainly can.

-Paul Edmon-

On 9/30/2020 11:44 AM, Jim Kilborn wrote:
> Does anyone know if there is a way to limit a partition (or a host in
> a partition) to only allow jobs requesting fewer than x cores? It would
> be preferable not to have to move the host to a separate partition,
> but we could if necessary. I just want to have a place where only small
> jobs can run. I can't find a parameter in slurm.conf that allows this,
> or maybe I am overlooking something.
>
> Thanks in advance!
>




Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Paul Edmon
Probably the best way to accomplish this is via a job_submit.lua 
script.  That way you can reject at submission time.  There isn't a 
feature in the partition configuration that I am aware of that can 
accomplish this, but a custom job_submit script certainly can.
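
A minimal job_submit.lua sketch along those lines; the partition name "small"
and the 4-core cut-off are hypothetical, and the NO_VAL handling is an
assumption about how unset fields appear, so test before deploying:

-- reject jobs asking for 4 or more cores on the hypothetical "small" partition
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition == "small" then
      local cpus = job_desc.min_cpus
      -- unset numeric fields can show up as nil or huge NO_VAL sentinels
      if cpus == nil or cpus >= 0xFFFFFFFE then
         cpus = 1
      end
      if cpus >= 4 then
         slurm.log_user("Partition 'small' only accepts jobs using fewer than 4 cores")
         return slurm.ERROR
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end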


-Paul Edmon-

On 9/30/2020 11:44 AM, Jim Kilborn wrote:

Does anyone know if there is a way to limit a partition (or a host in
a partition) to only allow jobs requesting fewer than x cores? It would
be preferable not to have to move the host to a separate partition,
but we could if necessary. I just want to have a place where only small
jobs can run. I can't find a parameter in slurm.conf that allows this,
or maybe I am overlooking something.

Thanks in advance!





Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Riebs, Andy
Relu,

There are a number of ways to run an open source project. In the case of Slurm, 
the code is managed by SchedMD. As a rule, one presumes that they have plenty 
on their plate, and little time to respond to the mailing list. Hence the 
suggestion that one get a support contract to get their attention. I’m not 
complaining, it’s just the way it works.

This mailing list is handled 99% by users like you and me. If you’ve got a 
great idea, particularly if you have an implementation, one of the best ways to 
handle it is to describe your innovation here, asking for feedback if you 
choose, and then offer the patch here on the mailing list or, as Ryan suggests, 
post it in the Bugzilla.

Andy


From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Wednesday, September 30, 2020 11:35 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] How to contact slurm developers

I’ve previously seen code contributed back in that way. See bug 1611 as an 
example (happened to have looked at that just yesterday).
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 11:29, Relu Patrascu 
mailto:r...@cs.toronto.edu>> wrote:

Thanks Ryan, I'll try the bugs site. And indeed, one person in our organization 
has already said "let's pay for support, maybe they'll listen." :) It's a 
little bit funny to me that we don't actually need support, but get it hoping 
that they might consider adding a feature which we think would benefit everyone.

We have actually modified the code on both v 19 and 20 to do what we would 
like, preemption within the same QOS, but we think that the community would 
benefit from this feature, hence our request to have it in the release version.
Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contract carry more weight.
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 10:57, Relu Patrascu 
 wrote:
Hi all,

I recently posted a feature request on this mailing list and got no reply from 
the developers. Is there a better way to contact the Slurm developers, or should 
we just accept that they are not interested in community feedback?

Regards,

Relu



Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Paul Edmon
The bug site is the best way.  The devs prioritize sponsored features 
over general community-requested features.


-Paul Edmon-

On 9/30/2020 11:34 AM, Ryan Novosielski wrote:
I’ve previously seen code contributed back in that way. See bug 1611 
as an example (happened to have looked at that just yesterday).


--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 11:29, Relu Patrascu  wrote:


Thanks Ryan, I'll try the bugs site. And indeed, one person in our 
organization has already said "let's pay for support, maybe they'll 
listen." :) It's a little bit funny to me that we don't actually need 
support, but get it hoping that they might consider adding a feature 
which we think would benefit everyone.


We have actually modified the code on both v 19 and 20 to do what we 
would like, preemption within the same QOS, but we think that the 
community would benefit from this feature, hence our request to have 
it in the release version.

Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to 
request enhancements, etc. Of course, requests coming from an entity 
with a support contract carry more weight.


--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 10:57, Relu Patrascu  wrote:

Hi all,

I recently posted a feature request on this mailing list and got no 
reply from the developers. Is there a better way to contact the 
Slurm developers, or should we just accept that they are not 
interested in community feedback?


Regards,

Relu




Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Ryan Novosielski
I’ve previously seen code contributed back in that way. See bug 1611 as an 
example (happened to have looked at that just yesterday).

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 11:29, Relu Patrascu  wrote:


Thanks Ryan, I'll try the bugs site. And indeed, one person in our organization 
has already said "let's pay for support, maybe they'll listen." :) It's a 
little bit funny to me that we don't actually need support, but get it hoping 
that they might consider adding a feature which we think would benefit everyone.

We have actually modified the code on both v 19 and 20 to do what we would 
like, preemption within the same QOS, but we think that the community would 
benefit from this feature, hence our request to have it in the release version.
Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contract carry more weight.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:57, Relu Patrascu 
 wrote:

Hi all,

I recently posted a feature request on this mailing list and got no reply from 
the developers. Is there a better way to contact the Slurm developers, or should 
we just accept that they are not interested in community feedback?

Regards,

Relu




Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Relu Patrascu
Thanks Ryan, I'll try the bugs site. And indeed, one person in our 
organization has already said "let's pay for support, maybe they'll 
listen." :) It's a little bit funny to me that we don't actually need 
support, but get it hoping that they might consider adding a feature 
which we think would benefit everyone.


We have actually modified the code on both v 19 and 20 to do what we 
would like, preemption within the same QOS, but we think that the 
community would benefit from this feature, hence our request to have it 
in the release version.

Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to 
request enhancements, etc. Of course, requests coming from an entity 
with a support contract carry more weight.


--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 10:57, Relu Patrascu  wrote:

Hi all,

I recently posted a feature request on this mailing list and got no 
reply from the developers. Is there a better way to contact the Slurm 
developers, or should we just accept that they are not interested in 
community feedback?


Regards,

Relu




Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Ryan Novosielski
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contract carry more weight.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:57, Relu Patrascu  wrote:

Hi all,

I recently posted a feature request on this mailing list and got no reply from 
the developers. Is there a better way to contact the Slurm developers, or should 
we just accept that they are not interested in community feedback?

Regards,

Relu




Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Ryan Novosielski
Absolutely not. It’s recommended.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:46, Luecht, Jeff A  wrote:


So just to confirm, there is no inherent issue using srun within an SBATCH 
file?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Wednesday, September 30, 2020 10:01 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] EXTERNAL: Re: Memory per CPU


Primary one I’m aware of is that resource use is better reported (or at all in 
some cases) via srun, and srun can take care of MPI for an MPI job.  I’m sure 
there are others as well (I guess avoiding another place where you have to 
describe the resources to be used and making sure they match, in the case of 
mpirun, etc.).
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 09:38, Luecht, Jeff A 
mailto:jeff.lue...@pnc.com>> wrote:
First off, I want to thank everyone for their input and suggestions.  They 
were very helpful and ultimately pointed me in the right direction.  I spent 
several hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH.  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>; Michael 
Di Domenico mailto:mdidomeni...@gmail.com>>
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:


> what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786







Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Ryan Novosielski
Primary one I’m aware of is that resource use is better reported (or at all in 
some cases) via srun, and srun can take care of MPI for an MPI job.  I’m sure 
there are others as well (I guess avoiding another place where you have to 
describe the resources to be used and making sure they match, in the case of 
mpirun, etc.).
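
For what it's worth, a bare-bones example of that pattern; the binary name and
resource numbers are placeholders, and it assumes Slurm was built with the
PMI/PMIx support the MPI library expects:

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00

# srun inherits the geometry requested above, launches the MPI ranks itself
# (no separate mpirun needed), and records the step's usage in accounting.
srun ./my_mpi_app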

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 09:38, Luecht, Jeff A  wrote:

First off, I want to thank everyone for their input and suggestions.  They 
were very helpful and ultimately pointed me in the right direction.  I spent 
several hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH.  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List ; Michael Di 
Domenico 
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:

> what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786








[slurm-users] Getting --gpus -request in job_submit.lua

2020-09-30 Thread Niels Carl Hansen

I am trying to retrieve the number of requested GPUs in job_submit.lua.

If the job is submitted with a --gres flag, as in "sbatch --gres=gpu:2...", I 
can get the information in job_submit.lua via the variable 
'job_desc.tres_per_node'.

But if the job is submitted with the --gpus flag, as in "sbatch --gpus=2", 
then 'job_desc.tres_per_node' is nil.

How can I dig out the number of requested GPUs in job_submit.lua in the 
latter case?

I am running Slurm 20.02.5.

Thanks in advance.

Niels Carl Hansen
Aarhus University, Denmark
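
One way to probe this, sketched below. My understanding (worth verifying, for
example by logging the fields on a test submission) is that --gpus populates
job_desc.tres_per_job rather than tres_per_node, so scanning the whole
tres_per_* family and parsing out the gpu count covers both submission styles.
The field names and string formats here are assumptions, and the helper is
meant to be folded into an existing job_submit.lua:

-- returns the GPU count requested via --gres=gpu:N or --gpus=N (assumed fields)
local function gpus_requested(job_desc)
   local fields = { "tres_per_job", "tres_per_node",
                    "tres_per_socket", "tres_per_task" }
   for _, f in ipairs(fields) do
      local spec = job_desc[f]
      if spec ~= nil then
         -- matches "gpu:2", "gres:gpu:2", "gpu:tesla:2", ...
         local count = string.match(spec, "gpu[:%w]-:(%d+)")
         if count ~= nil then
            return tonumber(count)
         end
         if string.match(spec, "gpu") then
            return 1   -- a bare "gpu" with no count means one GPU
         end
      end
   end
   return 0
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   slurm.log_info("job_submit: " .. tostring(gpus_requested(job_desc)) .. " GPU(s) requested")
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end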



Re: [slurm-users] error: user not found

2020-09-30 Thread Diego Zuccato
On 30/09/20 12:33, Marcus Wagner wrote:

> the submission process runs on the slurmctld, so the user must be known
> there.
It is. The frontend is the node users use to submit jobs and it's where
slurmctld runs.
The user is known (he's logged in via ssh). His home is available (NFS
share visible by all nodes), id and "getent passwd" correctly identify
the user, but slurmctld does not. :(
What's slurmctld doing differently?
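
One hedged suggestion for narrowing that down (the username is a placeholder,
and it assumes the daemon runs as the usual "slurm" account): run the same
lookups as the account slurmctld itself runs under, since an interactive root
or user shell can sometimes resolve names that the daemon's context cannot:

# on the host running slurmctld
sudo -u slurm getent passwd <username>
sudo -u slurm id <username>
# then compare against the UID slurmctld reports in the "user not found" error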

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] error: user not found

2020-09-30 Thread Marcus Wagner

Hi Diego,

the submission process runs on the slurmctld, so the user must be known there.


Best
Marcus

On 30.09.2020 at 08:37, Diego Zuccato wrote:

On 30/09/20 03:49, Brian Andrus wrote:

Thanks for the answer.


That means the system has no idea who that user is.

But which system? Since the message is generated by slurmctld, I thought it
must be the frontend node. But, as I wrote, that system correctly
identifies the user (he's logged in, 'id' and 'getent passwd' can
resolve both the name and the UID).


If you are part of a domain or other shared directory (ldap, etc), your
master is likely not configured right.

The frontend is an AD member, using PBIS-open. It's been working as-is
for at least the last 6 years :) and other users from the same domain
are able to submit jobs.


If you are using SSSD, it is also possible your sssd has too long of a
cache time. Run "sss_cache -E" to clear everything.

To partially work around an issue with conflicting UIDs/GIDs (the
PBIS-assigned range is too short for our forest), I already clear the
PBIS cache every 5 minutes and re-populate it by running 'id' on every
entry of /home/{PERSONALE,STUDENTI}/*.* (this forces PBIS to pull
the right name from AD when first resolving the UID, so the GUID is
already cached and associated to the UID when the reverse mapping is
required).


If you have a forest, it could be the information has not propagated to
all the servers, so you have to wait.
I've been places where that can take 24 hours.

It's been more than a week since the first failure :( And our forest
usually propagates changes in just a few minutes (more often in seconds).



--
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social Media Kanäle des IT Centers:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ



smime.p7s
Description: S/MIME Cryptographic Signature


Re: [slurm-users] error: user not found

2020-09-30 Thread Diego Zuccato
On 30/09/20 03:49, Brian Andrus wrote:

Thanks for the answer.

> That means the system has no idea who that user is.
But which system? Since the message is generated by slurmctld, I thought it
must be the frontend node. But, as I wrote, that system correctly
identifies the user (he's logged in, 'id' and 'getent passwd' can
resolve both the name and the UID).

> If you are part of a domain or other shared directory (ldap, etc), your
> master is likely not configured right.
The frontend is an AD member, using PBIS-open. It's been working as-is
for at least the last 6 years :) and other users from the same domain
are able to submit jobs.

> If you are using SSSD, it is also possible your sssd has too long of a
> cache time. Run "sss_cache -E" to clear everything.
To partially work around an issue with conflicting UIDs/GIDs (the
PBIS-assigned range is too short for our forest), I already clear the
PBIS cache every 5 minutes and re-populate it by running 'id' on every
entry of /home/{PERSONALE,STUDENTI}/*.* (this forces PBIS to pull
the right name from AD when first resolving the UID, so the GUID is
already cached and associated to the UID when the reverse mapping is
required).

> If you have a forest, it could be the information has not propagated to
> all the servers, so you have to wait.
> I've been places where that can take 24 hours.
It's been more than a week since the first failure :( And our forest
usually propagates changes in just a few minutes (more often in seconds).

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Memory per CPU

2020-09-30 Thread Diego Zuccato
On 29/09/20 16:19, Michael Di Domenico wrote:

> what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786