[OMPI users] Problem mpi
Dear all,

I have had problems related to mpirun for a long time. When I execute mpirun (with my program), I get the following error after a while:

  . . . . .
  mlx4: local QP operation err (QPN c00054, WQE index a, vendor syndrome 6f, opcode = 5e)
  [[64826,1],0][btl_openib_component.c:3497:handle_wc] from foner109 to: foner111 error
  polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id af58a8
  opcode 128 vendor error 111 qp_idx 3

  mpirun has exited due to process rank 0 with PID 51754 on node foner109 exiting
  improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in the job did.
     This can cause a job to hang indefinitely while it waits for all processes to
     call "init". By rule, if one process calls "init", then ALL processes must call
     "init" prior to termination.

  2. this process called "init", but exited without calling "finalize". By rule, all
     processes that call "init" MUST call "finalize" prior to exiting or it will be
     considered an "abnormal termination"

  This may have caused other processes in the application to be terminated by
  signals sent by mpirun (as reported here).

I am using a cluster (42 nodes, each with 20 processors and 64 GB of RAM). I want to use only 20 nodes, for example, so I run:

  salloc -N20 --tasks-per-node=1 --cpus-per-task=20 -p thin    (thin is the partition name)
  mpirun -pernode [my_program]

Could you help me to solve this problem?

Best Regards,
Diego
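[A standard first step for isolating an openib-layer failure like this one (a sketch, not part of the original report) is to rerun with the InfiniBand BTL taken out of the picture; if the job then completes over TCP, the fault is narrowed to the openib transport or the fabric itself:

  # force TCP (and shared memory) only, bypassing InfiniBand - valid for Open MPI 1.8.x
  mpirun --mca btl self,sm,tcp -pernode [my_program]

  # or keep openib but ask it for verbose diagnostics
  mpirun --mca btl_openib_verbose 1 -pernode [my_program]
]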
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Let's say that the downside is an unknown at this time. The only real impact of setting that param is that each daemon now reports its topology at startup. Without the param, only the daemon on the first node does so. The concern expressed when we first added that report was that the volume of data being sent on a very large system might impact launch time. However, the amount of data from each node isn't very much, so we don't know if there really would be a downside, or how significant it might be. Sadly, we haven't had access to machines of any real size to test this and get real numbers for the decision. Absent that data, we took the conservative approach of setting the default so as to preserve the pre-existing behavior.

So everyone out there: please consider this an appeal for data. If you are interested and willing, just send me (or the list - your option) any data you are willing to share regarding launch time with and without the --hetero-nodes option. A simple "time mpirun --map-by ppr:1:node /bin/true" (or equivalent) run at various numbers of nodes would suffice (see the sketch after the quoted thread below).

On Mon, Jun 23, 2014 at 3:17 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Hi,
> I've been following this thread because it may be relevant to our setup.
>
> Is there a drawback of having orte_hetero_nodes=1 as the default MCA
> parameter? Is there a reason why the most generic case is not assumed?
>
> Maxime Boissonneault
>
> Le 2014-06-20 13:48, Ralph Castain a écrit :
>
>> Put "orte_hetero_nodes=1" in your default MCA param file - users can
>> override by setting that param to 0
>>
>> On Jun 20, 2014, at 10:30 AM, Brock Palen wrote:
>>
>>> Perfection! That appears to do it for our standard case.
>>>
>>> Now I know how to set MCA options by env var or config file. How can I
>>> make this the default, so that a user can then override it?
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote:
>>>
>>>> I think I begin to grok at least part of the problem. If you are
>>>> assigning different cpus on each node, then you'll need to tell us that
>>>> by setting --hetero-nodes; otherwise we won't have any way to report
>>>> that back to mpirun for its binding calculation.
>>>>
>>>> Otherwise, we expect that the cpuset of the first node we launch a
>>>> daemon onto (or where mpirun is executing, if we are only launching
>>>> local to mpirun) accurately represents the cpuset on every node in the
>>>> allocation.
>>>>
>>>> We still might well have a bug in our binding computation - but the
>>>> above will definitely impact what you said the user did.
>>>>
>>>> On Jun 20, 2014, at 10:06 AM, Brock Palen wrote:
>>>>
>>>>> Extra data point, if I do:
>>>>>
>>>>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        nyx5513
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>>>>> 0x0010
>>>>> 0x1000
>>>>> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>>>>> nyx5513
>>>>> nyx5513
>>>>>
>>>>> Interesting: if I force bind-to core, MPI barfs saying there is only 1
>>>>> cpu available, while PBS says it gave it two; and if I run hwloc-bind
>>>>> --get on just that node (this is all inside an interactive job), I get
>>>>> what I expect.
>>>>>
>>>>> Is there a way to get a map of what MPI thinks it has on each host?
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>>
>>>>> On Jun 20, 2014, at 12:38 PM, Brock Palen wrote:
>>>>>
>>>>>> I was able to produce it in my test.
>>>>>>
>>>>>> orted affinity set by cpuset:
>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
>>>>>> 0xc002
>>>>>>
>>>>>> This mask (1, 14-15), which is across sockets, matches the cpuset
>>>>>> set up by the batch system.
>>>>>> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
>>>>>> 1,14-15
>>>>>>
>>>>>> The ranks, though, were then all set to the same core:
>>>>>>
>>>>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
>>>>>> 0x8000
>>>>>> [root
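[A minimal collection sketch for the timing data requested above - not from the original thread - assuming an allocation of the appropriate size has already been obtained from the scheduler:

  # repeat inside allocations of, e.g., 16, 32, 64, ... nodes
  time mpirun --map-by ppr:1:node /bin/true
  time mpirun --hetero-nodes --map-by ppr:1:node /bin/true

Comparing the two wall-clock times at each node count shows whether the extra per-daemon topology report measurably slows the launch.]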
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
That's odd - it shouldn't truncate the output. I'll take a look later today - we're all gathered for a developer's conference this week, so I'll be able to poke at this with Nathan.

On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen wrote:

> Perfection: flexible, extensible, so nice.
>
> BTW, this doesn't happen in older versions:
>
> [brockp@flux-login2 34241]$ ompi_info --param all all
> Error getting SCIF driver version
>          MCA btl: parameter "btl_tcp_if_include" (current value: "",
>                   data source: default, level: 1 user/basic, type: string)
>                   Comma-delimited list of devices and/or CIDR
>                   notation of networks to use for MPI communication
>                   (e.g., "eth0,192.168.0.0/16"). Mutually exclusive
>                   with btl_tcp_if_exclude.
>          MCA btl: parameter "btl_tcp_if_exclude" (current value:
>                   "127.0.0.1/8,sppp", data source: default, level: 1
>                   user/basic, type: string)
>                   Comma-delimited list of devices and/or CIDR
>                   notation of networks to NOT use for MPI
>                   communication -- all devices not matching these
>                   specifications will be used (e.g.,
>                   "eth0,192.168.0.0/16"). If set to a non-default
>                   value, it is mutually exclusive with
>                   btl_tcp_if_include.
>
> This output is normally much longer. And yes, we don't have the PHI stuff
> installed on all nodes - strange that 'all all' is now very short.
> ompi_info -a still works, though.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
> On Jun 20, 2014, at 1:48 PM, Ralph Castain wrote:
>
>> Put "orte_hetero_nodes=1" in your default MCA param file - users can
>> override by setting that param to 0
>>
>> On Jun 20, 2014, at 10:30 AM, Brock Palen wrote:
>>
>>> Perfection! That appears to do it for our standard case.
>>>
>>> Now I know how to set MCA options by env var or config file. How can I
>>> make this the default, so that a user can then override it?
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote:
>>>
>>>> I think I begin to grok at least part of the problem. If you are
>>>> assigning different cpus on each node, then you'll need to tell us that
>>>> by setting --hetero-nodes; otherwise we won't have any way to report
>>>> that back to mpirun for its binding calculation.
>>>>
>>>> Otherwise, we expect that the cpuset of the first node we launch a
>>>> daemon onto (or where mpirun is executing, if we are only launching
>>>> local to mpirun) accurately represents the cpuset on every node in the
>>>> allocation.
>>>>
>>>> We still might well have a bug in our binding computation - but the
>>>> above will definitely impact what you said the user did.
>>>>
>>>> On Jun 20, 2014, at 10:06 AM, Brock Palen wrote:
>>>>
>>>>> Extra data point, if I do:
>>>>>
>>>>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        nyx5513
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>>>>> 0x0010
>>>>> 0x1000
>>>>> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>>>>> nyx5513
>>>>> nyx5513
>>>>>
>>>>> Interesting: if I force bind-to core, MPI barfs saying there is only 1
>>>>> cpu available, while PBS says it gave it two; and if I run hwloc-bind
>>>>> --get on just that node (this is all inside an interactive job), I get
>>>>> what I expect.
>>>>>
>>>>> Is there a way to get a map of what MPI thinks it has on each host?
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>>
>>>>> On Jun 20, 2014, at 12:38 PM, Brock Palen wrote:
>>>>>
>>>>>> I was able to produce it in my test.
>>>>>>
>>>>>> orted affinity set by cpuset:
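[For reference, the "default MCA param file" in the quoted advice is a plain-text file of key = value lines; a sketch (the system-wide path depends on the installation prefix):

  # system-wide: $prefix/etc/openmpi-mca-params.conf
  # per-user:    $HOME/.openmpi/mca-params.conf
  orte_hetero_nodes = 1

A user can still override the file at runtime, e.g. with "mpirun --mca orte_hetero_nodes 0 ...".]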
[OMPI users] MPI program fails (big data)
Hi,

one of our cluster users reported a problem with Open MPI. He created a short sample (just a few lines) which starts and then crashes after a short time. We only see "Fatal error in PMPI_Gather: Other MPI error" - no further details. He is using the Intel Fortran compiler with a self-compiled Open MPI (just tested 1.8.1). I know nearly nothing about MPI (Open MPI), so I'm asking this forum. Does anybody have an idea?

Thanks,
Peer

---makefile---
OPTIONS=-assume byterecl -fpp -allow nofpp_comments -free
DEBUG=-g -d-lines -check -debug -debug-parameters -fpe0 -traceback

all:
	rm -f JeDi globe_mod.mod JeDi.out jedi_restart $(SOURCE) ; \
	mpif90 $(OPTIONS) $(DEBUG) -o JeDi globe.f90
--------------

--globe.f90--
program globe
  use mpi
  implicit none

  integer :: mpinfo  = 0
  integer :: myworld = 0
  integer :: mypid   = 0
  integer :: npro    = 1

  ! * The comments give some conditions required to reproduce the problem.
  ! * If the program runs on two hosts, the error message is shown two times.

  integer, parameter :: vv_g_d1 = 2432
  integer, parameter :: vv_p_d1 = vv_g_d1 / 16   ! requires 16 CPUs
  integer, parameter :: out_d1  = 2418           ! requires >=2416 (vv_g_d1 - 16)
  integer, parameter :: d2      = 5001           ! requires >=4282 @ ii=30 / >=6682 @ ii=20
                                                 ! (depends on number of loops, but this
                                                 ! limit can change for unknown reasons)
  integer :: ii, jj

  real              :: vv_p(vv_p_d1,d2)
  real, allocatable :: vv_g(:,:)
  ! * requires that the variable used for the write be defined below vv_g(:,:)
  real              :: out(out_d1,d2)

  vv_p(:,:) = 0.0
  out(:,:)  = 0.0

  call mpi_init(mpinfo)
  myworld = MPI_COMM_WORLD
  call mpi_comm_size(myworld, npro, mpinfo)

  ! * The problem requires 16 CPUs
  if (npro .ne. 16) then; write(*,*) "Works only with 16 CPUs"; stop; endif

  call mpi_comm_rank(myworld, mypid, mpinfo)

  if (mypid == 0) then
    open(11, FILE='jedi_restart', STATUS='replace', FORM='unformatted')
  endif

  write(6,*) "test1", mypid ; flush(6)

  do ii = 1, 25   ! number of loops depends on field size
    allocate(vv_g(vv_g_d1,d2))
    do jj = 1, d2
      call mpi_gather(vv_p(1,jj), vv_p_d1, MPI_REAL, vv_g(1,jj), vv_p_d1, &
                      MPI_REAL, 0, myworld, mpinfo)
    enddo
    if (mypid == 0) then; write(11) out; flush(11); endif
    deallocate(vv_g)
  enddo

  write(6,*) "test2", mypid ; flush(6)

  if (mypid == 0) close(11)
  call mpi_barrier(myworld, mpinfo)
  call mpi_finalize(mpinfo)

end
--end globe.f90--

--
With kind regards,
Peer-Joachim Koch
_
Max-Planck-Institut für Biogeochemie
Dr. Peer-Joachim Koch
Hans-Knöll Str. 10        Phone: ++49 3641 57-6705
D-07745 Jena              Fax:   ++49 3641 57-7705
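[For anyone trying to reproduce this: a build-and-run sequence matching the sample's constraints (the program insists on exactly 16 ranks; running across two hosts is only needed to see the message twice, and the host names here are hypothetical):

  make
  mpirun -np 16 ./JeDi                     # single host with >=16 slots
  mpirun -np 16 -H host1,host2 ./JeDi      # or spread across two hosts
]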
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Brock --

Can you run with "ompi_info --all"?

With "--param all all", ompi_info in v1.8.x defaults to showing only level 1 MCA params. It's showing you all possible components and variables, but only level 1. Or you could also use "--level 9" to show all 9 levels. Here's the relevant section from the README:

  The following options may be helpful:

  --all       Show a *lot* of information about your Open MPI
              installation.
  --parsable  Display all the information in an easily
              grep/cut/awk/sed-able format.
  --param <framework> <component>
              A <framework> of "all" and a <component> of "all" will
              show all parameters to all components. Otherwise, the
              parameters of all the components in a specific framework,
              or just the parameters of a specific component can be
              displayed by using an appropriate <framework> and/or
              <component> name.
  --level <level>
              By default, ompi_info only shows "Level 1" MCA parameters
              -- parameters that can affect whether MPI processes can
              run successfully or not (e.g., determining which network
              interfaces to use). The --level option will display all
              MCA parameters from level 1 to <level> (the max <level>
              value is 9). Use "ompi_info --param <framework>
              <component> --level 9" to see *all* MCA parameters for a
              given component. See "The Modular Component Architecture
              (MCA)" section, below, for a fuller explanation.

On Jun 24, 2014, at 5:19 AM, Ralph Castain wrote:

> That's odd - it shouldn't truncate the output. I'll take a look later
> today - we're all gathered for a developer's conference this week, so
> I'll be able to poke at this with Nathan.
>
> On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen wrote:
>
>> Perfection: flexible, extensible, so nice.
>>
>> BTW, this doesn't happen in older versions:
>>
>> [brockp@flux-login2 34241]$ ompi_info --param all all
>> Error getting SCIF driver version
>>          MCA btl: parameter "btl_tcp_if_include" (current value: "",
>>                   data source: default, level: 1 user/basic, type: string)
>>                   Comma-delimited list of devices and/or CIDR
>>                   notation of networks to use for MPI communication
>>                   (e.g., "eth0,192.168.0.0/16"). Mutually exclusive
>>                   with btl_tcp_if_exclude.
>>          MCA btl: parameter "btl_tcp_if_exclude" (current value:
>>                   "127.0.0.1/8,sppp", data source: default, level: 1
>>                   user/basic, type: string)
>>                   Comma-delimited list of devices and/or CIDR
>>                   notation of networks to NOT use for MPI
>>                   communication -- all devices not matching these
>>                   specifications will be used (e.g.,
>>                   "eth0,192.168.0.0/16"). If set to a non-default
>>                   value, it is mutually exclusive with
>>                   btl_tcp_if_include.
>>
>> This output is normally much longer. And yes, we don't have the PHI stuff
>> installed on all nodes - strange that 'all all' is now very short.
>> ompi_info -a still works, though.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>>
>> On Jun 20, 2014, at 1:48 PM, Ralph Castain wrote:
>>
>>> Put "orte_hetero_nodes=1" in your default MCA param file - users can
>>> override by setting that param to 0
>>>
>>> On Jun 20, 2014, at 10:30 AM, Brock Palen wrote:
>>>
>>>> Perfection! That appears to do it for our standard case.
>>>>
>>>> Now I know how to set MCA options by env var or config file. How can I
>>>> make this the default, so that a user can then override it?
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> XSEDE Campus Champion
>>>> bro...@umich.edu
>>>> (734)936-1985
>>>>
>>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote:
>>>>
>>>>> I think I begin to grok at least part of the problem. If you are
>>>>> assigning different cpus on each node, then you'll need to tell us
>>>>> that by setting --hetero-nodes; otherwise we won't have any way to
>>>>> report that back to mpirun for its binding calculation.
>>>>>
>>>>> Otherwise, we expect that the cpuset of the first node we launch a
>>>>> daemon onto (or where mpirun is executing, if we are only launching
>>>>> local to mpirun) accurately represents the cpuset on every node in
>>>>> the allocation.
>>>>>
>>>>> We still might well have a bug in our binding computation - but the
>>>>> above will definitely impact what you said the user did.
>>>>>
>>>>> On Jun 20, 2014, at 10:06 AM, Brock Palen wrote:
>>>>>
>>>>>> Extra data point, if I do:
>>>>>>
>>>>>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
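[Following the README excerpt above, a concrete invocation - for example, to see every parameter of the tcp component of the btl framework:

  ompi_info --param btl tcp --level 9
  # or simply show everything:
  ompi_info --all
]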
[OMPI users] poor performance using the openib btl
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right.

I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem?

(This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.)

Thanks,
Greg
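[One way to put a number on the gap (a sketch, not from the original message; it assumes the OSU micro-benchmarks are built against this Open MPI, and the host names are hypothetical) is to run the same point-to-point bandwidth test once per BTL and compare:

  # bandwidth over the openib BTL
  mpirun -np 2 -H host1,host2 --mca btl self,openib ./osu_bw

  # same test forced over TCP for comparison
  mpirun -np 2 -H host1,host2 --mca btl self,tcp ./osu_bw

If openib comes in far below the fabric's nominal rate while tcp looks normal for the Ethernet in use, the problem is in the InfiniBand path rather than the application.]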
Re: [OMPI users] poor performance using the openib btl
What are your threading options for OpenMPI (when it was built)?

I have seen the OpenIB BTL completely lock up before when some level of threading support is enabled.

Maxime Boissonneault

Le 2014-06-24 18:18, Fischer, Greg A. a écrit :

> Hello openmpi-users,
>
> A few weeks ago, I posted to the list about difficulties I was having
> getting openib to work with Torque (see "openib segfaults with Torque",
> June 6, 2014). The issues were related to Torque imposing restrictive
> limits on locked memory, and have since been resolved.
>
> However, now that I've had some time to test the applications, I'm
> seeing abysmal performance over the openib layer. Applications run with
> the tcp btl execute about 10x faster than with the openib btl. Clearly
> something still isn't quite right.
>
> I tried running with "-mca btl_openib_verbose 1", but didn't see
> anything resembling a smoking gun. How should I go about determining
> the source of the problem?
>
> (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3
> setup discussed previously.)
>
> Thanks,
> Greg
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24697.php

--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
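[To check the build's threading options concretely (a quick sketch; the exact wording of the output varies across Open MPI versions):

  ompi_info | grep -i "thread support"
  # example output (wording varies by version):
  #   Thread support: posix (MPI_THREAD_MULTIPLE: no, ...)

If MPI_THREAD_MULTIPLE shows as enabled, that is one configuration in which the openib BTL has been reported to misbehave, per Maxime's observation above.]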