[slurm-dev] Re: Wrong behaviour of "--tasks-per-node" flag

2016-12-02 Thread Miguel Couceiro


Hi.

I think I am also experiencing the same problem with SLURM 16.05.2.



slurm.conf:

SchedulerPort=7321
SchedulerRootFilter=1
SchedulerType=sched/backfill

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_Pack_Nodes

NodeName=CompNode[001-020] NodeAddr=172.16.32.[1-20] Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=193233 TmpDisk=102350 State=UNKNOWN
PartitionName=unlimited Nodes=CompNode[001-020] Default=NO MaxNodes=20 DefaultTime=UNLIMITED MaxTime=UNLIMITED Priority=100 Hidden=YES State=UP AllowGroups=slurmspecial




mm.batch:

#!/bin/bash

#SBATCH --time=24:00:00
#SBATCH --nodes=5
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --licenses=comsol@headnode
#SBATCH --partition=unlimited
#SBATCH --exclusive
#SBATCH --output=~/Documents/COMSOL/mm.out
#SBATCH --error=~/Documents/COMSOL/mm.err

module load Programs/comsol-5.2a

comsol -nn $SLURM_NTASKS -nnhost $SLURM_NTASKS_PER_NODE -np $SLURM_CPUS_PER_TASK \
       -numasets 2 -mpmode owner -mpibootstrap slurm -mpifabrics ofa:ofa \
       batch -job b1 -inputfile ~/Documents/COMSOL/mm.mph \
       -outputfile ~/Documents/COMSOL/mm-results.mph \
       -batchlog ~/Documents/COMSOL/mm.log




mm.err:

srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't 
match the requested tasks 5 with the number of requested nodes 5. 
Ignoring --ntasks-per-node.




mm.out:

[0] MPI startup(): Multi-threaded optimized library
[9] MPID_nem_ofacm_init(): Init
[1] MPID_nem_ofacm_init(): Init
[6] MPID_nem_ofacm_init(): Init
[4] MPID_nem_ofacm_init(): Init
[0] MPID_nem_ofacm_init(): Init
[8] MPID_nem_ofacm_init(): Init
[5] MPID_nem_ofacm_init(): Init
[7] MPID_nem_ofacm_init(): Init
[3] MPID_nem_ofacm_init(): Init
[2] MPID_nem_ofacm_init(): Init
[9] MPI startup(): ofa data transfer mode
[1] MPI startup(): ofa data transfer mode
[6] MPI startup(): ofa data transfer mode
[4] MPI startup(): ofa data transfer mode
[0] MPI startup(): ofa data transfer mode
[8] MPI startup(): ofa data transfer mode
[7] MPI startup(): ofa data transfer mode
[5] MPI startup(): ofa data transfer mode
[3] MPI startup(): ofa data transfer mode
[2] MPI startup(): ofa data transfer mode
[0] MPI startup(): Rank    Pid      Node name    Pin cpu
[0] MPI startup(): 0       15011    CompNode001  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 1       15012    CompNode001  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 2       45695    CompNode002  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 4       17504    CompNode003  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 5       17505    CompNode003  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 6       8158     CompNode004  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 7       8159     CompNode004  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 8       45973    CompNode005  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 9       45974    CompNode005  {0,2,4,6,8,10,12,14,16,18,20,22}

Node 0 is running on host: CompNode001
Node 0 has address: CompNode001.laced.ib
Node 1 is running on host: CompNode001
Node 1 has address: CompNode001.laced.ib
Node 2 is running on host: CompNode002
Node 2 has address: CompNode002.laced.ib
Node 3 is running on host: CompNode002
Node 3 has address: CompNode002.laced.ib
Node 4 is running on host: CompNode003
Node 4 has address: CompNode003.laced.ib
Node 5 is running on host: CompNode003
Node 5 has address: CompNode003.laced.ib
Node 6 is running on host: CompNode004
Node 6 has address: CompNode004.laced.ib
Node 7 is running on host: CompNode004
Node 7 has address: CompNode004.laced.ib
Node 8 is running on host: CompNode005
Node 8 has address: CompNode005.laced.ib
Node 9 is running on host: CompNode005
Node 9 has address: CompNode005.laced.ib


Regards,
Miguel



On 10/21/2016 02:54 PM, Manuel Rodríguez Pascual wrote:

Wrong behaviour of "--tasks-per-node" flag
Hi all,

I am having the weirdest error ever.  I am pretty sure this is a bug.
I have reproduced the error in the latest slurm commit (slurm
17.02.0-0pre2, commit 406d3fe429ef6b694f30e19f69acf989e65d7509) and in
the slurm 16.05.5 branch. It does NOT happen in slurm 15.08.12.


My cluster is composed of 8 nodes, each with 2 sockets of 8 cores.
The slurm.conf content is:


SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear  #DEDICATED NODES
NodeName=acme[11-14,21-24] CPUs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN


I am running a simple Hello World parallel code. It is submitted as
"sbatch --ntasks=X --tasks-per-node=Y myScript.sh". The problem is
that, depending on the values of X and Y, Slurm performs a wrong
operation and returns an error.


"
sbatch --ntasks=8 --tasks-per-node=2 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't 
match the requested tasks 4 with the number of requested nodes 4. 
Ignoring --ntasks-per-node.

"
Note that I did not request 4 but 8 tasks, and I did not request any
number of nodes. […]

[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-20 Thread Thomas Orgis
On Sun, 20 Nov 2016 15:12:47 -0800, Christopher Samuel wrote:

> If your MPI stack properly supports Slurm shouldn't that be:
> 
> sbatch --ntasks=16  --tasks-per-node=2  --wrap 'srun ./helloWorldMPI'
> ?
> Otherwise you're at the mercy of what your mpiexec chooses to do.

If the MPI stack properly supports Slurm, it is going to do the proper
thing. If the resulting srun call to start the MPI ranks is wrong, the
stack's interaction with Slurm should be fixed.

In either case, invoking srun directly or going through mpirun, isn't
proper integration needed to make things work?

Recalling my tests with MPI startup on our main cluster with CentOS 7,
16 real cores per node:

- Intel MPI (compiler 15.0.3 and associated MPI, around that)
-- start via srun: just hangs
-- mpirun --bootstrap slurm: works
-- mpirun --bootstrap ssh: works
- Open MPI (1.8 or so) built --with-slurm but without PMI
-- srun: starts one MPI process on each node
-- mpirun: works
- Open MPI without anything (SSH method)
-- srun: starts one MPI process on each node
-- mpirun: 16 processes on first node only
- Open MPI --with-slurm and with PMI
-- srun: works
-- mpirun: works

So, for the Open MPI build with Slurm and PMI it did not matter whether
srun or mpirun was used. The other builds did not work properly with
srun, but all builds --with-slurm (PMI or not) worked just fine using
mpirun. Intel MPI needs some extra environment setup to make srun
work; I did not have that at hand back then (in my notes there is
something about it not even working with libpmi.so, not sure anymore
how it went wrong).

My point being: If things are set up to work at all with the MPI stack,
I see no value in insisting on using srun. Calling mpirun seems to be
the more robust method.

Since mpirun can be made to work with both Intel MPI (via --bootstrap
slurm, actually the default) and Open MPI (simply built --with-slurm),
we settled for that method and avoided linking in any code specific to
the batch system version (like using libpmi.so from an older Slurm
version, possibly causing issues in the future).

Is that way of running MPI jobs in Slurm not supported?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basisinfrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270




[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-20 Thread Christopher Samuel

On 19/11/16 03:38, Manuel Rodríguez Pascual wrote:

> sbatch --ntasks=16  --tasks-per-node=2  --wrap 'mpiexec ./helloWorldMPI'

If your MPI stack properly supports Slurm shouldn't that be:

sbatch --ntasks=16  --tasks-per-node=2  --wrap 'srun ./helloWorldMPI'

?

Otherwise you're at the mercy of what your mpiexec chooses to do.

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-18 Thread Manuel Rodríguez Pascual
sbatch --ntasks=16  --tasks-per-node=2  --wrap 'mpiexec ./helloWorldMPI'

(code at the end of the mail)

However, I just repeated the tests and I have discovered that with my
current Slurm (slurm 17.02.0-0pre2) and MPI (mvapich2-2.2), although the
warning is still displayed, Slurm is in fact launching the jobs correctly. I
don't know whether this is a happy coincidence or a weird behaviour.

And with the test of "--ntasks=16 --tasks-per-node=16 --nodes=2", Slurm
did not show any warning, although it sent 8 tasks per node.


If you need anything else or want me to perform any particular experiment,
please do not hesitate to contact me.


Thanks for your help and interest in the matter,

Manuel


---
---


-bash-4.2$  more helloWorldMPI.c
/* C Example */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main (argc, argv)
 int argc;
 char *argv[];
{
  int rank, size, namelen;

  MPI_Init (&argc, &argv);  /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);/* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);/* get number of processes */

  char   processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Get_processor_name(processor_name,&namelen);

  printf("Process %d of %d is on %s\n", rank, size, processor_name);


  printf( "Hello world from process %d of %d\n", rank, size );

  int i = 0;
  for (i = 0; i < 10; i++) {
    sleep(1);
    printf("%d on process  %d of %d\n", i, rank, size);
  }

  printf( "Goodbye world from process %d of %d\n", rank, size );

  MPI_Finalize();
  return 0;
}






2016-11-17 23:56 GMT+01:00 Christopher Samuel :

>
> On 28/10/16 18:20, Manuel Rodríguez Pascual wrote:
>
> > -bash-4.2$ sbatch --ntasks=16  --tasks-per-node=2  test.sh
>
> Could you post the content of your batch script too please?
>
> We're not seeing this on 16.05.5, but I can't be sure I'm correctly
> replicating what you are seeing.
>
> cheers,
> Chris
> --
>  Christopher Samuel        Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-17 Thread Christopher Samuel

On 28/10/16 18:20, Manuel Rodríguez Pascual wrote:

> -bash-4.2$ sbatch --ntasks=16  --tasks-per-node=2  test.sh 

Could you post the content of your batch script too please?

We're not seeing this on 16.05.5, but I can't be sure I'm correctly
replicating what you are seeing.

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-17 Thread Dr. Thomas Orgis
On Fri, 28 Oct 2016 00:21:12 -0700, Manuel Rodríguez Pascual wrote:

> Altogether, I think the condition should be rewritten to something like:
> 
> if ((opt.ntasks_per_node != NO_VAL) &&
>     (opt.ntasks < opt.ntasks_per_node * opt.min_nodes))
> 
> but with opt.ntasks being the value introduced by the user, not the one
> internally considered by Slurm at this point.  I don't know how to
> correct this, but I hope this helps to point towards the problem.

Is there any comment on this by Slurm developers? We tried upgrading to
Slurm 16.05.6 here and hit this bug, mpiexec/srun complaining about
requested tasks per node that do not match what the user requested. We
downgraded to 15.08.12 and that seems to work fine.

I am astonished that there is no reaction at all to this … also not to
the bug tracker report at

https://bugs.schedmd.com/show_bug.cgi?id=3032

It looks to me like Slurm 16.05 clearly broke something. Well … it is
broken for at least three people. Are we facing a strange fringe use
case?

Is Intel MPI just buggy? I am not sure about Open MPI; I only figured I
had to downgrade Slurm when Intel MPI stopped working properly. Even
if this is ultimately Intel MPI's fault, this would be a strong reason
for us to keep Slurm at the older version for the whole lifetime of our
cluster in order to support the existing binaries.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basis-Infrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270




[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-10-28 Thread Manuel Rodríguez Pascual
Hi,

After some searching in the code, I may have a clue about what is going on.

I have seen that the commit that introduced the error is this one:

72ed146cd2a6facb76e854919fb887faf3fc0c25 (dated May 11th)

I have modified the newest version of Slurm, src/srun/libsrun/opt.c (line
2279), to print the values of opt

info("opt.ntasks=%u\n"
     "opt.ntasks_per_node=%u\n"
     "opt.min_nodes=%u\n"
     "opt.ntasks_set=%d\n",
     opt.ntasks, opt.ntasks_per_node, opt.min_nodes, opt.ntasks_set);


before the error condition, which is:

if ((opt.ntasks_per_node != NO_VAL) &&
    (opt.ntasks_per_node != (opt.ntasks / opt.min_nodes))) {
        if (opt.ntasks > opt.ntasks_per_node)
                info("Warning: can't honor --ntasks-per-node "
                     "set to %u which doesn't match the "
                     "requested tasks %u with the number of "
                     "requested nodes %u. Ignoring "
                     "--ntasks-per-node.",
                     opt.ntasks_per_node,
                     opt.ntasks, opt.min_nodes);
        opt.ntasks_per_node = NO_VAL;
}


Result shows the problem:

-bash-4.2$ sbatch --ntasks=16  --tasks-per-node=2  test.sh
-bash-4.2$ more slurm-475.out
srun: opt.ntasks=8
opt.ntasks_per_node=2
opt.min_nodes=8
opt.ntasks_set=1
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match
the requested tasks 8 with the number of requested nodes 8. Ignoring
--ntasks-per-node.
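
Reading this output against the condition above: the check compares
opt.ntasks_per_node = 2 with opt.ntasks / opt.min_nodes = 8 / 8 = 1, and since
opt.ntasks (8) is greater than 2, the warning is printed, even though the
original request of 16 tasks on 8 nodes is exactly 2 per node.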

-bash-4.2$ sbatch --ntasks=16  --tasks-per-node=4  test.sh
-bash-4.2$ more slurm-476.out
srun: opt.ntasks=4
opt.ntasks_per_node=4
opt.min_nodes=4
opt.ntasks_set=1
OK

-bash-4.2$ sbatch --ntasks=16  --tasks-per-node=8  test.sh
-bash-4.2$ more slurm-477.out
srun: opt.ntasks=2
opt.ntasks_per_node=8
opt.min_nodes=2
opt.ntasks_set=1
OK

-bash-4.2$ sbatch --ntasks=16  --tasks-per-node=16  test.sh
-bash-4.2$ more slurm-478.out
srun: opt.ntasks=1
opt.ntasks_per_node=16
opt.min_nodes=1
opt.ntasks_set=1
OK

The calculation of min_nodes is always correct after the specification of
tasks-per-node, so the error should not be arising.

Instead, if I do:

-bash-4.2$ sbatch --ntasks=16  --tasks-per-node=16 --nodes=2  test.sh
-bash-4.2$ more slurm-479.out
srun: opt.ntasks=2
opt.ntasks_per_node=16
opt.min_nodes=2
opt.ntasks_set=1
OK

In this case an error should arise, but it does not.

Altogether, I think the condition should be rewritten to something like:

if ((opt.ntasks_per_node != NO_VAL) &&
    (opt.ntasks < opt.ntasks_per_node * opt.min_nodes))

but with opt.ntasks being the value introduced by the user, not the one
internally considered by Slurm at this point.  I don't know how to
correct this, but I hope this helps to point towards the problem.
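
For concreteness, here is a minimal sketch of how that rewritten check could
look in src/srun/libsrun/opt.c, reusing the names printed above (opt.ntasks,
opt.ntasks_per_node, opt.min_nodes). It assumes opt.ntasks still holds the
user-supplied task count at this point, which is exactly the part I do not
know how to get at:

/* Sketch only: warn when fewer tasks were requested than
 * --ntasks-per-node times the number of nodes would require.
 * opt.ntasks must be the value the user submitted, not the one
 * already rescaled internally by Slurm at this point. */
if ((opt.ntasks_per_node != NO_VAL) &&
    (opt.ntasks < opt.ntasks_per_node * opt.min_nodes)) {
        info("Warning: can't honor --ntasks-per-node set to %u "
             "with %u requested tasks on %u nodes. "
             "Ignoring --ntasks-per-node.",
             opt.ntasks_per_node, opt.ntasks, opt.min_nodes);
        opt.ntasks_per_node = NO_VAL;
}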

Best regards,


Manuel


[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-10-22 Thread Ade Fewings
Hi Manuel

Yes, I have seen the same version-dependent behaviour, although perhaps I have
not diagnosed it as thoroughly as you.  We principally use Intel MPI and
thought it mainly affected that.  I did report a bug (3032) but unfortunately
(to my chagrin) we don't have a support contract at the moment.

Sorry I can’t offer anything other than that.

~~
Ade



From: Manuel Rodríguez Pascual [mailto:manuel.rodriguez.pasc...@gmail.com]
Sent: 21 October 2016 14:54
To: slurm-dev 
Subject: [slurm-dev] Wrong behaviour of "--tasks-per-node" flag

Hi all,

I am having the weirdest error ever.  I am pretty sure this is a bug. I have
reproduced the error in the latest slurm commit (slurm 17.02.0-0pre2, commit
406d3fe429ef6b694f30e19f69acf989e65d7509) and in the slurm 16.05.5 branch. It
does NOT happen in slurm 15.08.12.

My cluster is composed of 8 nodes, each with 2 sockets of 8 cores. The
slurm.conf content is:

SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear  #DEDICATED NODES
NodeName=acme[11-14,21-24] CPUs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN

I am running a simple Hello World parallel code. It is submitted as "sbatch
--ntasks=X --tasks-per-node=Y myScript.sh". The problem is that, depending on
the values of X and Y, Slurm performs a wrong operation and returns an error.

"
sbatch --ntasks=8 --tasks-per-node=2 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the 
requested tasks 4 with the number of requested nodes 4. Ignoring 
--ntasks-per-node.
"
Note that I did not request 4 but 8 tasks, and I did not request any number of
nodes. The same happens with
"
sbatch --ntasks=16 --tasks-per-node=2 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the 
requested tasks 8 with the number of requested nodes 8. Ignoring 
--ntasks-per-node.
"
and
"
sbatch --ntasks=32 --tasks-per-node=4 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 4 which doesn't match the 
requested tasks 8 with the number of requested nodes 8. Ignoring 
--ntasks-per-node.
"
All the other combinations work correctly and do not return any error. In
particular, I have tried the following combinations with no problem:
(ntasks, tasks-per-node)
(1,1)
(2,1), (2,2)
(4,1), (4,2), (4,4)
(8,1), (4,4), (8,8)
(16,4), (16,8), (16,16)
(32,8), (32,16)
(64,8), (64, 16)
(128, 16)

As said, this does not happen when executing the very same commands and scripts 
with slurm 15.08.12. So, have you had any similar experiences? Is this a bug, a 
desired behaviour, or am I doing something wrong?

Thanks for your help.

Best regards,



Manuel

