[OMPI users] Problem mpi

2014-06-24 Thread Diego Saúl Carrió Carrió
Dear all,

I have been having problems related to mpirun for a long time. When I execute
mpirun (with my program), I get the following error after a while:

. 
. 
. 
. 
. 

 mlx4: local QP operation err (QPN c00054, WQE index a, vendor syndrome
6f, opcode = 5e)
[[64826,1],0][btl_openib_component.c:3497:handle_wc] from foner109 to:
foner111 error polling LP CQ with status LOCAL QP OPERATION ERROR status
number 2 for wr_id af58a8 opcode 128  vendor error 111 qp_idx 3

mpirun has exited due to process rank 0 with PID 51754 on
node foner109 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).



I am using a cluster (42 nodes, each with 20 processors and 64 GB of RAM). I
want to use, for example, only 20 nodes, so I run:

salloc -N20 --tasks-per-node=1 --cpus-per-task=20 -p thin    ("thin" is the partition name)

mpirun -pernode [my_program]
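
For readability, here is the same allocation and launch with the intent of
each option spelled out (a sketch; it assumes "thin" is a Slurm partition and
my_program stands in for the real binary):

    # 20 nodes, 1 task per node, 20 cores per task, on the "thin" partition
    salloc -N 20 --tasks-per-node=1 --cpus-per-task=20 -p thin

    # launch exactly one MPI process on each allocated node
    mpirun -pernode ./my_program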


Could you help me to solve this problem?

Best Regards,
Diego


Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
Let's say that the downside is an unknown at this time. The only real
impact of setting that param is that each daemon now reports its topology
at startup. Without the param, only the daemon on the first node does so.
The concern expressed when we first added that report was that the volume
of data being sent on a very large system might impact launch time.
However, the amount of data from each node isn't very much, so we don't
know if there really would be a downside, or how significant it might be.

Sadly, we haven't had access to machines of any real size to test this so
that we would have real numbers for the decision. Absent that data, we took
the conservative approach of setting the default so as to preserve the
pre-existing behavior.
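
(For reference, a sketch of how a site could set that default while still
letting users override it; Open MPI's precedence is command line over
environment over param files, and the config path below depends on your
install prefix:)

    # Site-wide default in <prefix>/etc/openmpi-mca-params.conf:
    orte_hetero_nodes = 1

    # A user can override it for one job via the environment:
    export OMPI_MCA_orte_hetero_nodes=0

    # ...or directly on the mpirun command line (./my_app = your application):
    mpirun --mca orte_hetero_nodes 0 ./my_app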

So everyone out there: please consider this an appeal for data. If you are
interested and willing, just send me (or the list - your option) any data
you are willing to share regarding launch time with and without the
--hetero-nodes option. A simple "time mpirun --map-by ppr:1:node /bin/true"
(or equivalent) run at various numbers of nodes would suffice.
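
A sketch of what that comparison could look like (run it from allocations of
various sizes and compare the two timings):

    # Default behavior: only the first daemon reports its topology
    time mpirun --map-by ppr:1:node /bin/true

    # Same launch, but with every daemon reporting its topology
    time mpirun --map-by ppr:1:node --hetero-nodes /bin/true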


On Mon, Jun 23, 2014 at 3:17 PM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:

> Hi,
> I've been following this thread because it may be relevant to our setup.
>
> Is there a drawback of having orte_hetero_nodes=1 as the default MCA
> parameter? Is there a reason why the most generic case is not assumed?
>
> Maxime Boissonneault
>
> On 2014-06-20 13:48, Ralph Castain wrote:
>
>> Put "orte_hetero_nodes=1" in your default MCA param file - users can
>> override by setting that param to 0
>>
>>
>> On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:
>>
>>  Perfection!  That appears to do it for our standard case.
>>>
>>> Now I know how to set MCA options by env var or config file.  How can I
>>> make this the default so that a user can then override it?
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>>
>>>
>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:
>>>
>>>  I think I begin to grok at least part of the problem. If you are
 assigning different cpus on each node, then you'll need to tell us that by
 setting --hetero-nodes otherwise we won't have any way to report that back
 to mpirun for its binding calculation.

 Otherwise, we expect that the cpuset of the first node we launch a
 daemon onto (or where mpirun is executing, if we are only launching local
 to mpirun) accurately represents the cpuset on every node in the 
 allocation.

 We still might well have a bug in our binding computation - but the
 above will definitely impact what you said the user did.

 On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:

  Extra data point if I do:
>
> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core
> hostname
> 
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>   Bind to: CORE
>   Node:nyx5513
>   #processes:  2
>   #cpus:  1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> 
> --
>
> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90,
> 12.38
> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90,
> 12.38
> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind
> --get
> 0x0010
> 0x1000
> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
> nyx5513
> nyx5513
>
> Interesting, if I force bind to core, MPI barfs saying there is only 1
> cpu available, PBS says it gave it two, and if I force (this is all inside
> an interactive job) just on that node hwloc-bind --get I get what I 
> expect,
>
> Is there a way to get a map of what MPI thinks it has on each host?
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
>
>  I was able to produce it in my test.
>>
>> orted affinity set by cpuset:
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
>> 0xc002
>>
>> This mask (1, 14,15) which is across sockets, matches the cpu set
>> setup by the batch system.
>> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.
>> nyx.engin.umich.edu/cpus
>> 1,14-15
>>
>> The ranks though were then all set to the same core:
>>
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
>> 0x8000
>> [root

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
That's odd - it shouldn't truncate the output. I'll take a look later today
- we're all gathered for a developer's conference this week, so I'll be
able to poke at this with Nathan.



On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen  wrote:

> Perfection, flexible, extensible, so nice.
>
> BTW this doesn't happen with older versions.
>
> [brockp@flux-login2 34241]$ ompi_info --param all all
> Error getting SCIF driver version
>  MCA btl: parameter "btl_tcp_if_include" (current value:
> "",
>   data source: default, level: 1 user/basic, type:
>   string)
>   Comma-delimited list of devices and/or CIDR
>   notation of networks to use for MPI communication
>   (e.g., "eth0,192.168.0.0/16").  Mutually
> exclusive
>   with btl_tcp_if_exclude.
>  MCA btl: parameter "btl_tcp_if_exclude" (current value:
>   "127.0.0.1/8,sppp", data source: default,
> level: 1
>   user/basic, type: string)
>   Comma-delimited list of devices and/or CIDR
>   notation of networks to NOT use for MPI
>   communication -- all devices not matching these
>   specifications will be used (e.g.,
>   "eth0,192.168.0.0/16").  If set to a non-default
>   value, it is mutually exclusive with
>   btl_tcp_if_include.
>
>
> This is normally much longer.  And yes we don't have the PHI stuff
> installed on all nodes, strange that 'all all' is now very short,
>  ompi_info -a  still works though.
>
>
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Jun 20, 2014, at 1:48 PM, Ralph Castain  wrote:
>
> > Put "orte_hetero_nodes=1" in your default MCA param file - users can
> override by setting that param to 0
> >
> >
> > On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:
> >
> >> Perfection!  That appears to do it for our standard case.
> >>
> >> Now I know how to set MCA options by env var or config file.  How can I
> make this the default so that a user can then override it?
> >>
> >> Brock Palen
> >> www.umich.edu/~brockp
> >> CAEN Advanced Computing
> >> XSEDE Campus Champion
> >> bro...@umich.edu
> >> (734)936-1985
> >>
> >>
> >>
> >> On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:
> >>
> >>> I think I begin to grok at least part of the problem. If you are
> assigning different cpus on each node, then you'll need to tell us that by
> setting --hetero-nodes otherwise we won't have any way to report that back
> to mpirun for its binding calculation.
> >>>
> >>> Otherwise, we expect that the cpuset of the first node we launch a
> daemon onto (or where mpirun is executing, if we are only launching local
> to mpirun) accurately represents the cpuset on every node in the allocation.
> >>>
> >>> We still might well have a bug in our binding computation - but the
> above will definitely impact what you said the user did.
> >>>
> >>> On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:
> >>>
>  Extra data point if I do:
> 
>  [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core
> hostname
> 
> --
>  A request was made to bind to that would result in binding more
>  processes than cpus on a resource:
> 
>  Bind to: CORE
>  Node:nyx5513
>  #processes:  2
>  #cpus:  1
> 
>  You can override this protection by adding the "overload-allowed"
>  option to your binding directive.
> 
> --
> 
>  [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>  13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90,
> 12.38
>  13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90,
> 12.38
>  [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind
> --get
>  0x0010
>  0x1000
>  [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>  nyx5513
>  nyx5513
> 
>  Interesting, if I force bind to core, MPI barfs saying there is only
> 1 cpu available, PBS says it gave it two, and if I force (this is all
> inside an interactive job) just on that node hwloc-bind --get I get what I
> expect,
> 
>  Is there a way to get a map of what MPI thinks it has on each host?
> 
>  Brock Palen
>  www.umich.edu/~brockp
>  CAEN Advanced Computing
>  XSEDE Campus Champion
>  bro...@umich.edu
>  (734)936-1985
> 
> 
> 
>  On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
> 
> > I was able to produce it in my test.
> >
> > orted affinity set by cpuset:
> >

[OMPI users] mpi prog fails (big data)

2014-06-24 Thread Dr.Peer-Joachim Koch

Hi,

one of our cluster users reported a problem with Open MPI.
He created a short sample (just a few lines) which starts and
crashes after a short time.
We only see "Fatal error in PMPI_Gather: Other MPI error" - no further
details.
He is using the Intel Fortran compiler with a self-compiled Open MPI
(just tested 1.8.1).


I know nearly nothing about MPI (Open MPI), so I'm asking on this forum.
Does anybody have an idea?

Thanks, Peer



---makefile--
OPTIONS=-assume byterecl -fpp -allow nofpp_comments -free
DEBUG=-g -d-lines -check -debug -debug-parameters -fpe0 -traceback

all:
	rm -f JeDi globe_mod.mod JeDi.out jedi_restart
	$(SOURCE) ; mpif90 $(OPTIONS) $(DEBUG) -o JeDi globe.f90

--

globe.f90-
  program globe
  use mpi
  implicit none

  integer :: mpinfo  = 0
  integer :: myworld = 0
  integer :: mypid   = 0
  integer :: npro= 1

! * The comments give some conditions required to reproduce the problem.

! * If the program runs on two hosts, the error message is shown two times


  integer, parameter :: vv_g_d1 = 2432
  integer, parameter :: vv_p_d1 = vv_g_d1 / 16  ! requires 16 CPUs

  integer, parameter :: out_d1  = 2418  ! requires >=2416 (vv_g_d1 - 16)


  ! requires >=4282 @ ii=30 / >=6682 @ ii=20 (depends on number of loops,
  !   but this limit can change for unknown reason)
  integer, parameter :: d2 = 5001


  integer :: ii, jj

  real:: vv_p(vv_p_d1,d2)
  real,allocatable :: vv_g(:,:)
! * requires the variable used for the write to be declared below vv_g(:,:)

  real:: out(out_d1,d2)

  vv_p(:,:) = 0.0
  out(:,:) = 0.0

  call mpi_init(mpinfo)
  myworld = MPI_COMM_WORLD
  call mpi_comm_size(myworld, npro, mpinfo)
! * The problem requires 16 CPUs
  if (npro .ne. 16) then; write(*,*) "Works only with 16 CPUs"; stop; endif

  call mpi_comm_rank(myworld, mypid, mpinfo)

  if (mypid == 0) then
open(11, FILE='jedi_restart', STATUS='replace', FORM='unformatted')
  endif

  write(6,*) "test1",mypid ; flush(6)

  do ii = 1, 25  ! number of loops depends on field size
allocate(vv_g(vv_g_d1,d2))

do jj = 1, d2
      call mpi_gather(vv_p(1,jj), vv_p_d1, MPI_REAL, vv_g(1,jj), &
                      vv_p_d1, MPI_REAL, 0, myworld, mpinfo)

enddo

if (mypid == 0) then; write(11) out; flush(11); endif

deallocate(vv_g)
  enddo

  write(6,*) "test2",mypid ; flush(6)

  if (mypid == 0) close(11)

  call mpi_barrier(myworld, mpinfo)
  call mpi_finalize(mpinfo)

  end
-end globe.f90--


--
Kind regards,
Peer-Joachim Koch
_
Max-Planck-Institut für Biogeochemie
Dr. Peer-Joachim Koch
Hans-Knöll-Str. 10   Phone: +49 3641 57-6705
D-07745 Jena         Fax:   +49 3641 57-7705



Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Jeff Squyres (jsquyres)
Brock --

Can you run with "ompi_info --all"?

With "--param all all", ompi_info in v1.8.x is defaulting to only showing level 
1 MCA params.  It's showing you all possible components and variables, but only 
level 1.

Or you could also use "--level 9" to show all 9 levels.  Here's the relevant 
section from the README:

-
The following options may be helpful:

--all       Show a *lot* of information about your Open MPI
            installation.
--parsable  Display all the information in an easily
            grep/cut/awk/sed-able format.
--param <framework> <component>
            A <framework> of "all" and a <component> of "all" will
            show all parameters to all components.  Otherwise, the
            parameters of all the components in a specific framework,
            or just the parameters of a specific component can be
            displayed by using an appropriate <framework> and/or
            <component> name.
--level <level>
            By default, ompi_info only shows "Level 1" MCA parameters
            -- parameters that can affect whether MPI processes can
            run successfully or not (e.g., determining which network
            interfaces to use).  The --level option will display all
            MCA parameters from level 1 to <level> (the max <level>
            value is 9).  Use "ompi_info --param <framework>
            <component> --level 9" to see *all* MCA parameters for a
            given component.  See "The Modular Component Architecture
            (MCA)" section, below, for a fuller explanation.
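
For example (a sketch; any framework/component pair works the same way):

    ompi_info --all                        # everything, including all 9 levels
    ompi_info --param btl tcp --level 9    # all parameters of the tcp BTL only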





On Jun 24, 2014, at 5:19 AM, Ralph Castain  wrote:

> That's odd - it shouldn't truncate the output. I'll take a look later today - 
> we're all gathered for a developer's conference this week, so I'll be able to 
> poke at this with Nathan.
> 
> 
> 
> On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen  wrote:
> Perfection, flexible, extensible, so nice.
> 
> BTW this doesn't happen with older versions.
> 
> [brockp@flux-login2 34241]$ ompi_info --param all all
> Error getting SCIF driver version
>  MCA btl: parameter "btl_tcp_if_include" (current value: "",
>   data source: default, level: 1 user/basic, type:
>   string)
>   Comma-delimited list of devices and/or CIDR
>   notation of networks to use for MPI communication
>   (e.g., "eth0,192.168.0.0/16").  Mutually exclusive
>   with btl_tcp_if_exclude.
>  MCA btl: parameter "btl_tcp_if_exclude" (current value:
>   "127.0.0.1/8,sppp", data source: default, level: 1
>   user/basic, type: string)
>   Comma-delimited list of devices and/or CIDR
>   notation of networks to NOT use for MPI
>   communication -- all devices not matching these
>   specifications will be used (e.g.,
>   "eth0,192.168.0.0/16").  If set to a non-default
>   value, it is mutually exclusive with
>   btl_tcp_if_include.
> 
> 
> This is normally much longer.  And yes we don't have the PHI stuff installed 
> on all nodes, strange that 'all all' is now very short,  ompi_info -a  still 
> works though.
> 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jun 20, 2014, at 1:48 PM, Ralph Castain  wrote:
> 
> > Put "orte_hetero_nodes=1" in your default MCA param file - users can
> > override by setting that param to 0
> >
> >
> > On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:
> >
> >> Perfection!  That appears to do it for our standard case.
> >>
> >> Now I know how to set MCA options by env var or config file.  How can I
> >> make this the default so that a user can then override it?
> >>
> >> Brock Palen
> >> www.umich.edu/~brockp
> >> CAEN Advanced Computing
> >> XSEDE Campus Champion
> >> bro...@umich.edu
> >> (734)936-1985
> >>
> >>
> >>
> >> On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:
> >>
> >>> I think I begin to grok at least part of the problem. If you are 
> >>> assigning different cpus on each node, then you'll need to tell us that 
> >>> by setting --hetero-nodes otherwise we won't have any way to report that 
> >>> back to mpirun for its binding calculation.
> >>>
> >>> Otherwise, we expect that the cpuset of the first node we launch a daemon 
> >>> onto (or where mpirun is executing, if we are only launching local to 
> >>> mpirun) accurately represents the cpuset on every node in the allocation.
> >>>
> >>> We still might well have a bug in our binding computation - but the above 
> >>> will definitely impact what you said the user did.
> >>>
> >>> On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:
> >>>
>  Extra data point if I do:
> 
>  [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>  

[OMPI users] poor performance using the openib btl

2014-06-24 Thread Fischer, Greg A.
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting 
openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). 
The issues were related to Torque imposing restrictive limits on locked memory, 
and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing 
abysmal performance over the openib layer. Applications run with the tcp btl 
execute about 10x faster than with the openib btl. Clearly something still 
isn't quite right.
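
A minimal sketch of how the two transports can be pinned explicitly when
making such comparisons (./my_app is a placeholder for the application or
benchmark being timed):

    # force the TCP BTL (plus self for loopback):
    mpirun --mca btl self,tcp ./my_app

    # force the openib BTL:
    mpirun --mca btl self,openib ./my_app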

I tried running with "-mca btl_openib_verbose 1", but didn't see anything 
resembling a smoking gun. How should I go about determining the source of the 
problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 
setup discussed previously.)

Thanks,
Greg


Re: [OMPI users] poor performance using the openib btl

2014-06-24 Thread Maxime Boissonneault

What threading options was your Open MPI built with?

I have seen the OpenIB BTL completely lock up before when some level of
threading is enabled.
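
(A quick way to check, as a sketch: the build's thread support is listed in
the ompi_info summary.)

    ompi_info | grep -i thread
    # prints something like: Thread support: posix (MPI_THREAD_MULTIPLE: no, ...)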


Maxime Boissonneault


On 2014-06-24 18:18, Fischer, Greg A. wrote:


Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having 
getting openib to work with Torque (see "openib segfaults with 
Torque", June 6, 2014). The issues were related to Torque imposing 
restrictive limits on locked memory, and have since been resolved.


However, now that I've had some time to test the applications, I'm 
seeing abysmal performance over the openib layer. Applications run 
with the tcp btl execute about 10x faster than with the openib btl. 
Clearly something still isn't quite right.


I tried running with "-mca btl_openib_verbose 1", but didn't see 
anything resembling a smoking gun. How should I go about determining 
the source of the problem? (This uses the same OpenMPI Version 1.8.1 / 
SLES11 SP3 / GCC 4.8.3 setup discussed previously.)


Thanks,

Greg



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24697.php



--
-
Maxime Boissonneault
Computational analyst - Calcul Québec, Université Laval
Ph.D. in physics