Re: [slurm-users] cpu limit issue

2018-07-11 Thread Mahmood Naderan
>Check the Gaussian log file for mention of its using just 8 CPUs-- just
>because there are 12 CPUs available doesn't mean the program uses all of
>them.  It will scale-back if 12 isn't a good match to the problem as I
>recall.



Well, in the log file, it says

 **
 %nprocshared=12
 Will use up to   12 processors via shared memory.
 %mem=18GB
 %chk=trimer.chk

Maybe it scales down to a good match, but I haven't seen that before. That
is why I asked the question.





One more question: does it matter whether or not the user specifies
--account in the sbatch script?

[root@rocks7 ~]# sacctmgr list association format=partition,account,user,grptres,maxwall
 Partition    Account       User       GrpTRES     MaxWall
---------- ---------- ---------- ------------- -----------
   emerald         z3       noor cpu=12,mem=1+ 30-00:00:00
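
(If --account is omitted, Slurm normally falls back to the user's default
account, so specifying it should only matter when the user belongs to more
than one account. A quick way to check that default, as a sketch not
verified on this cluster:

# show the default account recorded for this user in the accounting database
sacctmgr show user noor format=User,DefaultAccount
)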



[noor@rocks7 ~]$ grep nprocshared trimer.gjf
%nprocshared=12
[noor@rocks7 ~]$ cat trimer.sh
#!/bin/bash
#SBATCH --output=trimer.out
#SBATCH --job-name=trimer
#SBATCH --ntasks=12
#SBATCH --mem=18GB
#SBATCH --partition=EMERALD
g09 trimer.gjf
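
(For reference, a minimal variant of the same script that names the account
explicitly and forces the whole allocation onto a single node might look
like the sketch below; the account name z3 is taken from the sacctmgr
output above, and the script is untested:

#!/bin/bash
#SBATCH --output=trimer.out
#SBATCH --job-name=trimer
#SBATCH --account=z3
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --mem=18GB
#SBATCH --partition=EMERALD
g09 trimer.gjf
)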







Regards,
Mahmood


Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-11 Thread Taras Shapovalov
Thank you, guys,

Let's wait for 17.11.8. Any estimate of the release date?

Best regards,

Taras

On Wed, Jul 11, 2018 at 12:11 AM Kilian Cavalotti <
kilian.cavalotti.w...@gmail.com> wrote:

> On Tue, Jul 10, 2018 at 10:34 AM, Taras Shapovalov
>  wrote:
> > I noticed the commit that can be related to this:
> >
> >
> https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e
>
> Yes. See also this bug: https://bugs.schedmd.com/show_bug.cgi?id=5240
> This commit will be reverted in 17.11.8 and replaced with a lighter
> fix for the original issue with jobs submitted to multiple partitions.
>
> Cheers,
> --
> Kilian
>
>


Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-11 Thread Douglas Jacobsen
Applying patches d52d8f4f0 and f07f53fc13 to a slurm 17.11.7 source tree
fixes this issue in my experience.  Only requires restarting slurmctld.
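
For anyone who wants to try the same thing, one possible way to apply those
two commits to a 17.11.7 source tree is with git cherry-pick (a sketch only;
the tag name is an assumption and the usual rebuild/reinstall steps are
omitted):

git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-17-11-7-1        # assumed tag for the 17.11.7 release
git cherry-pick d52d8f4f0 f07f53fc13
# rebuild and reinstall, then restart slurmctld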



Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__



On Wed, Jul 11, 2018 at 12:59 AM Taras Shapovalov <
taras.shapova...@brightcomputing.com> wrote:



Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Mahmood,
  I am sure you have checked this.  Try running   ps -eaf --forest   while
a job is running.
I often find the --forest option helps to understand how batch jobs are
being run.
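
A narrower variant restricted to a single user's processes might be (a
sketch; substitute the job owner's username):

# full-format listing of one user's processes, shown as a tree
ps -f -u noor --forest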

On 11 July 2018 at 09:12, Mahmood Naderan  wrote:



Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Another thought - are we getting mixed up between hyperthreaded and
physical cores here?
I don't see how 12 hyperthreaded cores translate to 8 though - it would be
6!





On 11 July 2018 at 10:30, John Hearns  wrote:



Re: [slurm-users] cpu limit issue

2018-07-11 Thread Mahmood Naderan
>Try running   ps -eaf --forest   while a job is running.

noor 30907 30903  0 Jul10 ?  00:00:00  \_ /bin/bash
/var/spool/slurmd/job00749/slurm_script
noor 30908 30907  0 Jul10 ?  00:00:00  \_ g09 trimmer.gjf
noor 30909 30908 99 Jul10 ?  4-13:00:21  \_
/usr/local/chem/g09-64-D01/l703.exe 2415919104 trimmer.chk 1 /st


Nonetheless, the output doesn't fit in the terminal window, so the last
line above is truncated.
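
(One way around the truncation is to let ps print at unlimited width, as in
the sketch below:

# -ww lifts the terminal-width limit so long command lines are shown in full
ps -eaf --forest -ww
)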



>are we getting mixed up between hyperthreaded and physical cores here?

CPUs are Opteron 61XX. So, hyperthreading is not applicable here.


Regards,
Mahmood


Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Mahmood, please please forgive me for saying this.  A quick Google shows
that Opteron 61xx have eight or twelve cores.
Have you checked that all the servers have 12 cores?
I realise I am appearing stupid here.





On 11 July 2018 at 10:39, Mahmood Naderan  wrote:



Re: [slurm-users] cpu limit issue

2018-07-11 Thread Mahmood Naderan
My fault. I had one of the other nodes in mind!

The node which is running g09 is


[root@compute-0-3 ~]# ps  aux |  grep l502
root     11198  0.0  0.0 112664   968 pts/0  S+  13:31   0:00 grep
--color=auto l502
nooriza+ 30909  803  1.4 21095004 947968 ? Rl   Jul10 6752:47
/usr/local/chem/g09-64-D01/l502.exe 2415919104
[root@compute-0-3 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(tm) Processor 6282 SE
Stepping:              2
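
(For comparison with what Slurm itself detects on this node, slurmd -C
prints the hardware layout as Slurm would record it; a sketch, run on the
compute node, output omitted:

# print this node's CPU/socket/thread topology as Slurm detects it
slurmd -C
)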


Regards,
Mahmood



On Wed, Jul 11, 2018 at 1:25 PM, John Hearns  wrote:



Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
I bet all on here would just LOVE the AMD Fangio ;-)
http://www.cpu-world.com/news_2012/2012111801_Obscure_CPUs_AMD_Opteron_6275.html

Hint - quite a few of these were sold!

On 11 July 2018 at 11:04, Mahmood Naderan  wrote:



[slurm-users] SLURM_NTASKS not defined after salloc

2018-07-11 Thread Alexander Grund

Hi all,

is it expected/intended that the env variable SLURM_NTASKS is not 
defined after salloc? It only gets defined after an srun command.
The number of tasks appears in `scontrol -d show job <jobid>` though. So 
is it a bug in our installation or expected?


Thanks, Alex



Re: [slurm-users] SLURM_NTASKS not defined after salloc

2018-07-11 Thread Peter Kjellström
On Wed, 11 Jul 2018 14:10:51 +0200
Alexander Grund  wrote:

> Hi all,
> 
> is it expected/intended that the env variable SLURM_NTASKS is not 
> defined after salloc? It only gets defined after the an srun command.
> The number of tasks appear in `scontrol -d show job ` though.
> So is it a bug in our installation or expected?

my salloc sets SLURM_NTASKS (if you pass -n, otherwise not). This is
similar to how sbatch works.
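
A quick way to see the difference, assuming salloc is allowed to run a
command directly (a sketch):

# SLURM_NTASKS is exported because -n was given
salloc -n 4 bash -c 'echo ${SLURM_NTASKS:-unset}'

# no task count requested, so SLURM_NTASKS stays unset
salloc --nodes=1 bash -c 'echo ${SLURM_NTASKS:-unset}'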

Tested on 17.11.7

/Peter

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [slurm-users] SLURM_NTASKS not defined after salloc

2018-07-11 Thread Alexander Grund

Hi Peter,

thanks for the information, you are right: SLURM_NTASKS is not set if 
"-n" is not passed to salloc.


I am relying on what happens after I call "srun ./binary", in particular 
how many instances will be started. scontrol shows this information, so I 
could parse it, but is there a better way?
The script in question is a generic starter (wrapper) and should work no 
matter what the user passed to salloc. It just needs to know how many 
processes srun would start. How can I do that?


Alex


On 11.07.2018 at 15:12, Peter Kjellström wrote:



--
~~
Alexander Grund
Interdisziplinäre Anwendungsunterstützung und Koordination (IAK)

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Chemnitzer Str. 46b, Raum 010 01062 Dresden
Tel.: +49 (351) 463-35982
E-Mail: alexander.gr...@tu-dresden.de
~~




Re: [slurm-users] SLURM_NTASKS not defined after salloc

2018-07-11 Thread Jeffrey Frey
SLURM_NTASKS is only unset when no task count flags are handed to salloc (no 
--ntasks, --ntasks-per-node, etc.).  Can't you then assume if it's not present 
in the environment you've got a single task allocated to you?  So in your 
generic starter script instead of using SLURM_NTASKS itself, use an expansion 
with a default value of 1:


${SLURM_NTASKS:-1}
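
Applied to the generic starter, that might look roughly like the sketch
below (./binary stands in for whatever the wrapper actually launches):

#!/bin/bash
# fall back to a single task when salloc was given no task count
ntasks=${SLURM_NTASKS:-1}
srun -n "${ntasks}" ./binary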





> On Jul 11, 2018, at 9:39 AM, Alexander Grund  
> wrote:

::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::







Re: [slurm-users] cpu limit issue

2018-07-11 Thread Renfro, Michael
Looking at your script, there’s a chance that by only specifying ntasks instead 
of ntasks-per-node or a similar parameter, you might have allocated 8 CPUs on 
one node, and the remaining 4 on another.

Regardless, I’ve dug into my Gaussian documentation, and here’s my test case 
for you to see what happens:

1. Make a copy of tests/com/test1044.com from the Gaussian main directory.
2. Reserve some number of cores on a single node. I’m using --cpus-per-task=N 
instead of --ntasks-per-node, but it might not matter. Regardless, I try to 
stick with the cpus-per-task format for OpenMP-type programs. My job script is:

=

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=28

module load gaussian
export GAUSS_PDEF=${SLURM_CPUS_PER_TASK}
export GAUSS_SCRDIR=$(mktemp -d -p /local/tmp)
g09 test1044.com
rm -r ${GAUSS_SCRDIR}

=

Obviously, you’ll need to modify module load commands, and you might not need 
the GAUSS_SCRDIR variable. The GAUSS_PDEF variable does the same thing as the 
NProc commands, but doesn’t require modifying the input file.

With that job script and that test file, I max out all 28 cores in my node for 
certain parts of the calculations, as seen in 'top'. The job takes about 21 minutes 
of CPU time. Your times will obviously vary.

Depending on what you find out from the test case, that’ll give some insight on 
where you should go next.
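
(To see how Slurm actually laid a running job out across nodes, something
like the following could be checked while it runs; a sketch, with <jobid>
as a placeholder:

scontrol show job <jobid> | grep -E 'NumNodes|NumCPUs|NodeList'
)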

> On Jul 11, 2018, at 4:04 AM, Mahmood Naderan  wrote:



Re: [slurm-users] SLURM_NTASKS not defined after salloc

2018-07-11 Thread Alexander Grund

Unfortunately, this will not work. Example: salloc --nodes=3 --exclusive

I'm wondering why there is a discrepancy between the environment 
variables and scontrol. The latter clearly shows "NumNodes=3 NumCPUs=72 
NumTasks=3 CPUs/Task=1" (yes, I realize that those values are inconsistent 
too, but at least it matches srun's behavior, ignoring NumCPUs).
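
One possible workaround for the wrapper is to fall back to what scontrol
reports when the variable is missing; a sketch that assumes SLURM_JOB_ID is
set inside the allocation and that NumTasks appears in the scontrol output,
as it does above:

# use SLURM_NTASKS when salloc exported it, otherwise ask scontrol
ntasks=${SLURM_NTASKS:-$(scontrol -d show job "${SLURM_JOB_ID}" \
    | tr ' ' '\n' | awk -F= '$1 == "NumTasks" {print $2; exit}')}
srun -n "${ntasks:-1}" ./binary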


Alex


On 11.07.2018 at 16:10, Jeffrey Frey wrote:
