[OMPI users] MPI_COMM_split hanging
I am attempting to split my application into multiple master+workers groups using MPI_Comm_split. My MPI version is reported as:

mpirun --tag-output ompi_info -v ompi full --parsable
[1,0]:package:Open MPI root@build-x86-64 Distribution
[1,0]:ompi:version:full:1.4.3
[1,0]:ompi:version:svn:r23834
[1,0]:ompi:version:release_date:Oct 05, 2010
[1,0]:orte:version:full:1.4.3
[1,0]:orte:version:svn:r23834
[1,0]:orte:version:release_date:Oct 05, 2010
[1,0]:opal:version:full:1.4.3
[1,0]:opal:version:svn:r23834
[1,0]:opal:version:release_date:Oct 05, 2010
[1,0]:ident:1.4.3

The basic problem I am having is that none of the processor instances ever returns from the MPI_Comm_split call. I am pretty new to MPI, and it is likely I am not doing things quite correctly; I'd appreciate some guidance.

I am working with an application that has functioned nicely for a while now. It uses only the single MPI_COMM_WORLD communicator. It is standard stuff: a master hands out tasks to many workers, receives their output, and keeps track of which workers are ready to receive another task. The tasks are quite compute-intensive. When running a variation of the process that uses Monte Carlo iterations, jobs can exceed the 30 hours they are limited to. The MC iterations are independent of each other (each adds random noise to an input), so I would like to run multiple iterations simultaneously, so that 4 times the cores runs in a fourth of the time. This would entail a supervisor interacting with multiple master+workers groups.

I had thought that I would just have to declare a communicator for each group so that broadcasts and syncs would work within a single group:

MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
...
cores_per_group = total_proc_count / groups_count;
my_group = my_rank / cores_per_group;              // e.g., 0, 1, ...
group_rank = my_rank - my_group * cores_per_group; // rank within a group
if ( my_rank == 0 ) continue;                      // Do not create group for supervisor
MPI_Comm oldcomm = MPI_COMM_WORLD;
MPI_Comm my_communicator;                          // Actually declared as a class variable
int sstat = MPI_Comm_split( oldcomm, my_group, group_rank, &my_communicator );

There is never a return from the above _split() call. Do I need to do something else to set this up? I would have expected perhaps a non-zero status return, but not that I would get no return at all. I would appreciate any comments or guidance.

- Gary
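For reference, MPI_Comm_split is collective over every rank of the parent communicator, so all ranks of MPI_COMM_WORLD must make the call; a rank that opts out of every group can pass MPI_UNDEFINED as its color and receives MPI_COMM_NULL. The following is a minimal sketch of that pattern, not code from the thread (group layout and variable names are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int total_proc_count, my_rank;
    int groups_count = 2;                 /* illustrative value */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &total_proc_count);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Rank 0 is the supervisor; the remaining ranks form the groups. */
    int cores_per_group = (total_proc_count - 1) / groups_count;
    int color, key;
    if (my_rank == 0) {
        color = MPI_UNDEFINED;            /* supervisor joins no group */
        key = 0;
    } else {
        color = (my_rank - 1) / cores_per_group;
        key = (my_rank - 1) % cores_per_group;
    }

    /* Collective: every rank of MPI_COMM_WORLD must call this,
     * otherwise the ranks that do call it block forever. */
    MPI_Comm my_communicator;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &my_communicator);

    if (my_communicator != MPI_COMM_NULL) {
        /* group-local broadcasts/reductions go here */
        MPI_Comm_free(&my_communicator);
    }
    MPI_Finalize();
    return 0;
}
```

Run under mpirun with enough ranks (e.g., `mpirun -np 9` for a supervisor plus two groups of four).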
Re: [OMPI users] MPI_Allgather problem
I guess your output is from different ranks. You can add the rank inside the print to tell them apart, like this:

(void) printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

From my side, I did not see anything wrong with your code in Open MPI 1.4.3. After I added the rank, the output is:

rank 5: gathered[0].node = 0
rank 5: gathered[1].node = 1
rank 5: gathered[2].node = 2
rank 5: gathered[3].node = 3
rank 5: gathered[4].node = 4
rank 5: gathered[5].node = 5
rank 3: gathered[0].node = 0
rank 3: gathered[1].node = 1
rank 3: gathered[2].node = 2
rank 3: gathered[3].node = 3
rank 3: gathered[4].node = 4
rank 3: gathered[5].node = 5
rank 1: gathered[0].node = 0
rank 1: gathered[1].node = 1
rank 1: gathered[2].node = 2
rank 1: gathered[3].node = 3
rank 1: gathered[4].node = 4
rank 1: gathered[5].node = 5
rank 0: gathered[0].node = 0
rank 0: gathered[1].node = 1
rank 0: gathered[2].node = 2
rank 0: gathered[3].node = 3
rank 0: gathered[4].node = 4
rank 0: gathered[5].node = 5
rank 4: gathered[0].node = 0
rank 4: gathered[1].node = 1
rank 4: gathered[2].node = 2
rank 4: gathered[3].node = 3
rank 4: gathered[4].node = 4
rank 4: gathered[5].node = 5
rank 2: gathered[0].node = 0
rank 2: gathered[1].node = 1
rank 2: gathered[2].node = 2
rank 2: gathered[3].node = 3
rank 2: gathered[4].node = 4
rank 2: gathered[5].node = 5

Is that what you expected?

On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:
> Dear all,
>
> I have not used OpenMPI much before, but am maintaining a large legacy
> application. We noticed a bug to do with a call to MPI_Allgather as
> summarised in this post to Stackoverflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed that the
> following function results in strange behaviour.
>
> void test_all_gather() {
>
>     struct _TEST_ALL_GATHER {
>         int node;
>     };
>
>     int ierr, size, rank;
>     ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>     ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     struct _TEST_ALL_GATHER local;
>     struct _TEST_ALL_GATHER *gathered;
>
>     gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>
>     local.node = rank;
>
>     MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   MPI_COMM_WORLD);
>
>     int i;
>     for (i = 0; i < numnodes; ++i) {
>         (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>     }
>
>     FREE(gathered);
> }
>
> At one point, this function printed the following:
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> Can anyone suggest a place to start looking into why this might be
> happening? There is a section of the code that calls MPI_Comm_split, but I
> am not sure if that is related...
>
> Running on Ubuntu 11.10 and a summary of ompi_info:
> Package: Open MPI buildd@allspice Distribution
> Open MPI: 1.4.3
> Open MPI SVN revision: r23834
> Open MPI release date: Oct 05, 2010
> Open RTE: 1.4.3
> Open RTE SVN revision: r23834
> Open RTE release date: Oct 05, 2010
> OPAL: 1.4.3
> OPAL SVN revision: r23834
> OPAL release date: Oct 05, 2010
> Ident string: 1.4.3
> Prefix: /usr
> Configured architecture: x86_64-pc-linux-gnu
> Configure host: allspice
> Configured by: buildd
>
> Thanks!
> Brett
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
| Teng Ma              Univ. of Tennessee |
| t...@cs.utk.edu      Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/           |
[OMPI users] MPI_Allgather problem
Dear all,

I have not used OpenMPI much before, but am maintaining a large legacy application. We noticed a bug to do with a call to MPI_Allgather, as summarised in this post to Stackoverflow:
http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results

In the process of looking further into the problem, I noticed that the following function results in strange behaviour:

void test_all_gather() {

    struct _TEST_ALL_GATHER {
        int node;
    };

    int ierr, size, rank;
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct _TEST_ALL_GATHER local;
    struct _TEST_ALL_GATHER *gathered;

    gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));

    local.node = rank;

    MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  MPI_COMM_WORLD);

    int i;
    for (i = 0; i < numnodes; ++i) {
        (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
    }

    FREE(gathered);
}

At one point, this function printed the following:

gathered[0].node = 2
gathered[1].node = 3
gathered[2].node = 2
gathered[3].node = 3
gathered[4].node = 4
gathered[5].node = 5

Can anyone suggest a place to start looking into why this might be happening? There is a section of the code that calls MPI_Comm_split, but I am not sure if that is related...

Running on Ubuntu 11.10; a summary of ompi_info:

Package: Open MPI buildd@allspice Distribution
Open MPI: 1.4.3
Open MPI SVN revision: r23834
Open MPI release date: Oct 05, 2010
Open RTE: 1.4.3
Open RTE SVN revision: r23834
Open RTE release date: Oct 05, 2010
OPAL: 1.4.3
OPAL SVN revision: r23834
OPAL release date: Oct 05, 2010
Ident string: 1.4.3
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: allspice
Configured by: buildd

Thanks!
Brett
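As a side note on the idiom above (not a diagnosis of the reported bug): sending sizeof(struct) counts of MPI_BYTE works for a plain single-int struct, but a committed derived datatype is more robust once the struct grows fields, since it lets MPI account for layout and padding. A hedged, self-contained sketch of the same test using a datatype (for a multi-field struct one would use MPI_Type_create_struct instead of MPI_Type_contiguous):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

struct test_all_gather { int node; };

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Describe the struct to MPI once, instead of shipping raw bytes. */
    MPI_Datatype gather_type;
    MPI_Type_contiguous(1, MPI_INT, &gather_type);
    MPI_Type_commit(&gather_type);

    struct test_all_gather local = { rank };
    struct test_all_gather *gathered = malloc(size * sizeof(*gathered));

    MPI_Allgather(&local, 1, gather_type,
                  gathered, 1, gather_type, MPI_COMM_WORLD);

    for (int i = 0; i < size; ++i)
        printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

    free(gathered);
    MPI_Type_free(&gather_type);
    MPI_Finalize();
    return 0;
}
```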
[OMPI users] mca_btl_sm_component_progress read an unknown type of header
Hi all,

This question was buried in an earlier question and got no replies, so I'll try reposting it with a more enticing subject.

I have a multithreaded Open MPI code where each task has N+1 threads: the N threads send nonblocking messages that are received by the 1 thread on the other tasks. When I run this code with 2 tasks, 5+1 threads, on a single node with 12 cores, after about a million messages have been exchanged, the tasks segfault after printing the following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the default case of a switch on fragment type. There's a comment there that says:

* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.

It seems like whoever wrote that code didn't know quite what was going on, and I guess the assumption was wrong, because dereferencing that result seems to be what's causing the segfault.

Does anyone here know what could cause this error? If I run the code with the tcp btl instead of sm, it runs fine, albeit with somewhat lower performance.

This is with Open MPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, using Intel 12.3.174. I've attached the ompi_info output.

Thanks,
/Patrik J.
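One generic check worth making in any MPI_THREAD_MULTIPLE code (offered as a general precaution, not a diagnosis of the sm bug above): verify the thread level the library actually granted at startup, since a build without full thread support can fail in exactly this kind of intermittent way. A minimal sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got level %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... sender and receiver threads would be spawned here ... */
    MPI_Finalize();
    return 0;
}
```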
Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications
Hi Yiannis,

On Fri, Dec 9, 2011 at 10:21 AM, Yiannis Papadopoulos wrote:
> Patrik Jonsson wrote:
>>
>> Hi all,
>>
>> I'm seeing performance issues I don't understand in my multithreaded
>> MPI code, and I was hoping someone could shed some light on this.
>>
>> The code structure is as follows: a computational domain is decomposed
>> into MPI tasks. Each MPI task has a "master thread" that receives
>> messages from the other tasks and puts them into a local, concurrent
>> queue. The tasks then have a few "worker threads" that process the
>> incoming messages and, when necessary, send them on to other tasks. So for
>> each task, there is one thread doing receives and N (typically the number
>> of cores minus 1) threads doing sends. All messages are nonblocking, so the
>> workers just post the sends and continue with computation, and the
>> master repeatedly does a number of test calls to check for incoming
>> messages (there are different flavors of these messages, so it does
>> several tests).
>
> When do you do the MPI_Test on the Isends? I have had performance issues on
> a number of systems when I used a single queue of MPI_Requests that kept
> Isends to different ranks, testing them one by one. It appears that
> some messages are sent out more efficiently if you test them.

There are 3 classes of messages that may arrive. The requests for each are in a vector, and I use boost::mpi::test_some (which I assume just calls MPI_Testsome) to test them in round-robin fashion.

> I found that either using MPI_Testsome, or having a map (key=rank,
> value=queue of MPI_Requests) and testing the first MPI_Request for each
> key, resolved this issue.

In my case, I know that the overwhelming traffic volume is one kind of message. What I ended up doing was simply to repeat the test for that message immediately if the preceding test succeeded, up to 1000 times, before again checking the other requests. This appears to enable the task to keep up with the incoming traffic.
I guess another possibility would be to have several slots for the incoming messages. Right now I post only one irecv per source task. By posting a couple, more messages would be able to come in and still find a matching recv, and one test could match more of them. Since that makes the logic more complicated, I haven't tried it.
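The multiple-slots idea sketched above could look roughly like the following. This is an illustrative sketch, not code from the thread: the slot depth, tag, message size, and the assumption that sources are ranks 1..nsources are all made up for the example.

```c
#include <mpi.h>
#include <stdlib.h>

#define SLOTS_PER_SOURCE 4   /* illustrative receive depth per source */
#define MSG_LEN 256          /* illustrative message size */
#define MSG_TAG 42           /* illustrative tag */

/* Pre-post several receives per source so that a burst of messages
 * finds matching recvs already waiting. */
void post_initial_recvs(int nsources, char (*bufs)[MSG_LEN], MPI_Request *reqs)
{
    for (int s = 0; s < nsources; ++s)
        for (int k = 0; k < SLOTS_PER_SOURCE; ++k) {
            int i = s * SLOTS_PER_SOURCE + k;
            MPI_Irecv(bufs[i], MSG_LEN, MPI_CHAR, s + 1, /* sources: ranks 1..n */
                      MSG_TAG, MPI_COMM_WORLD, &reqs[i]);
        }
}

/* Called repeatedly by the master thread: one MPI_Testsome can now
 * complete several messages at once; each slot is re-posted as soon
 * as its buffer has been handed off. */
void drain_messages(int nsources, char (*bufs)[MSG_LEN], MPI_Request *reqs)
{
    int nreq = nsources * SLOTS_PER_SOURCE;
    int *indices = malloc(nreq * sizeof *indices);
    MPI_Status *stats = malloc(nreq * sizeof *stats);
    int ndone;

    MPI_Testsome(nreq, reqs, &ndone, indices, stats);
    for (int k = 0; k < ndone; ++k) {
        int i = indices[k];
        /* ... push bufs[i] onto the concurrent work queue here ... */
        MPI_Irecv(bufs[i], MSG_LEN, MPI_CHAR, stats[k].MPI_SOURCE,
                  MSG_TAG, MPI_COMM_WORLD, &reqs[i]);
    }
    free(indices);
    free(stats);
}
```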
Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications
Patrik Jonsson wrote:

Hi all,

I'm seeing performance issues I don't understand in my multithreaded MPI code, and I was hoping someone could shed some light on this.

The code structure is as follows: a computational domain is decomposed into MPI tasks. Each MPI task has a "master thread" that receives messages from the other tasks and puts them into a local, concurrent queue. The tasks then have a few "worker threads" that process the incoming messages and, when necessary, send them on to other tasks. So for each task, there is one thread doing receives and N (typically the number of cores minus 1) threads doing sends. All messages are nonblocking, so the workers just post the sends and continue with computation, and the master repeatedly does a number of test calls to check for incoming messages (there are different flavors of these messages, so it does several tests).

When do you do the MPI_Test on the Isends? I have had performance issues on a number of systems when I used a single queue of MPI_Requests that kept Isends to different ranks, testing them one by one. It appears that some messages are sent out more efficiently if you test them.

I found that either using MPI_Testsome, or having a map (key=rank, value=queue of MPI_Requests) and testing the first MPI_Request for each key, resolved this issue.
Re: [OMPI users] Configure(?) problem building 1.5.3 on Scientific Linux 6.0
Hello Gus, Ralph, Jeff,

A very late answer for this - I just found it in my mailbox.

> Would "cp -rp" help? (To preserve time stamps, instead of "cp -r".)

Yes, the root of the evil was the time stamps. 'cp -a' is the magic wand. Many thanks for your help, and I should wear sackcloth and ashes... :-/

Best,
Paul

Gus Correa wrote:
Anyway, since 1.2.8 here I build 5, sometimes more, versions, all from the same tarball but in separate build directories, as Jeff suggests. [VPATH] Works for me. My two cents.

Jeff Squyres wrote:
Ah -- Ralph pointed out the relevant line to me in your first mail that I initially missed:

> In each case I build 16 versions in total (4 compilers * 32bit/64bit * support for
> multithreading ON/OFF). The same error arises in all 16 versions.

Perhaps you should just expand the tarball once and then do VPATH builds...? Something like this:

tar xf openmpi-1.5.3.tar.bz2
cd openmpi-1.5.3
mkdir build-gcc
cd build-gcc
../configure blah..
make -j 4
make install
cd ..
mkdir build-icc
cd build-icc
../configure CC=icc CXX=icpc FC=ifort F77=ifort ..blah.
make -j 4
make install
cd ..
etc.

This allows you to have one set of source and N different builds from it. Open MPI uses the GNU Autotools correctly to support this kind of build pattern.

On Jul 22, 2011, at 2:37 PM, Jeff Squyres wrote:

Your RUNME script is a *very* strange way to build Open MPI. It starts with a massive copy:

cp -r /home/pk224850/OpenMPI/openmpi-1.5.3/AUTHORS /home/pk224850/OpenMPI/openmpi-1.5.3/CMakeLists.txt <...much snipped...> .

Why are you doing this kind of copy? I suspect that the GNU Autotools' timestamps are getting all out of whack when you do this kind of copy, and therefore when you run "configure", it tries to re-autogen itself.

To be clear: when you expand OMPI from a tarball, you shouldn't need the GNU Autotools installed at all -- the tarball is pre-bootstrapped exactly so that you avoid needing the Autotools (much less any specific version of them).
I suspect that if you do this:

tar xf openmpi-1.5.3.tar.bz2
cd openmpi-1.5.3
./configure
etc.

everything will work just fine.

On Jul 22, 2011, at 11:12 AM, Paul Kapinos wrote:

Dear Open MPI folks,

currently I have a problem building version 1.5.3 of Open MPI on Scientific Linux 6.0 systems, which seems to me to be a configuration problem. After the configure run (which seems to terminate without an error code), the "gmake all" stage produces errors and exits. Typical output is below.

Curiously, the 1.4.3 version can be built on the same computer with no particular trouble, and both the 1.4.3 and 1.5.3 versions can be built on another computer running CentOS 5.6.

In each case I build 16 versions in total (4 compilers * 32bit/64bit * support for multithreading ON/OFF). The same error arises in all 16 versions.

Can someone give a hint about how to avoid this issue? Thanks!

Best wishes,
Paul

Some logs and the configure script are downloadable here:
https://gigamove.rz.rwth-aachen.de/d/id/2jM6MEa2nveJJD
The configure line is in RUNME.sh, and the logs of the configure and build stages are in the log_* files; I also attached the config.log file and configure itself (which is the standard one from the 1.5.3 release).

##
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /tmp/pk224850/linuxc2_11254/openmpi-1.5.3mt_linux64_gcc/config/missing --run aclocal-1.11 -I config
sh: config/ompi_get_version.sh: No such file or directory
/usr/bin/m4: esyscmd subprocess failed
configure.ac:953: warning: OMPI_CONFIGURE_SETUP is m4_require'd but not m4_defun'd
config/ompi_mca.m4:37: OMPI_MCA is expanded from...
configure.ac:953: the top level
configure.ac:953: warning: AC_COMPILE_IFELSE was called before AC_USE_SYSTEM_EXTENSIONS
../../lib/autoconf/specific.m4:386: AC_USE_SYSTEM_EXTENSIONS is expanded from...
opal/mca/paffinity/hwloc/hwloc/config/hwloc.m4:152: HWLOC_SETUP_CORE_AFTER_C99 is expanded from...
../../lib/m4sugar/m4sh.m4:505: AS_IF is expanded from...
opal/mca/paffinity/hwloc/hwloc/config/hwloc.m4:22: HWLOC_SETUP_CORE is expanded from...
opal/mca/paffinity/hwloc/configure.m4:40: MCA_paffinity_hwloc_CONFIG is expanded from...
config/ompi_mca.m4:540: MCA_CONFIGURE_M4_CONFIG_COMPONENT is expanded from...
config/ompi_mca.m4:326: MCA_CONFIGURE_FRAMEWORK is expanded from...
config/ompi_mca.m4:247: MCA_CONFIGURE_PROJECT is expanded from...
configure.ac:953: warning: AC_RUN_IFELSE was called before AC_USE_SYSTEM_EXTENSIONS

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/abo
Re: [OMPI users] Program hangs in mpi_bcast
Dear Jeff,

Thanks so much for your detailed reply and explanations, and sorry for not answering sooner. I'll try to develop a reproducer, and I have some ideas about how this can be done; at least I know the typical scenarios that cause this issue to appear. To be honest, I'm rather busy these days (as probably most of us are), but I'll try to do this as soon as I can.

Just a brief comment on repeated collectives. I know at least two situations in which repeated collectives are either required or beneficial. First, the sizes of the arrays to be (all)reduced can be really large, causing overflow of 32-bit integers, so one has to split the single operation into a sequence of calls. Some MPI implementations support 64-bit integers as arguments for an extended set of functions handling large arrays, but some do not. In addition, such splitting reduces the probability of hangs due to lack of resources on the compute nodes. Second, our experience with every transport, MPI implementation, and CPU type we have tried so far shows that the overall performance of (all)reduce is usually worse on very large arrays than for a sequence of calls. While it is hard to predict the optimal chunk size, it can easily be found experimentally.

> > Some of our users would like to use Firefly with OpenMPI. Usually, we
> > simply answer them that OpenMPI is too buggy to be used.
> This surprises me. Is this with regards to this collective/hang issue, or
> something else?

Yes, this is with regards to the collective hang issue.

All the best,
Alex

- Original Message -
From: "Jeff Squyres"
To: "Alex A. Granovsky" ;
Sent: Saturday, December 03, 2011 3:36 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast

On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:

> I would like to start a discussion on the implementation of collective
> operations within OpenMPI. The reason for this is at least twofold.
> Over the last months, there has been a constantly growing number of
> messages on the list sent by people facing problems with collectives, so I
> believe these issues must be discussed and will hopefully attract the
> proper attention of the OpenMPI developers. The second reason is my
> involvement in the development of the Firefly quantum chemistry package,
> which, of course, uses collectives rather intensively.

Greetings Alex, and thanks for your note. We take it quite seriously, and had a bunch of phone/off-list conversations about it in the past 24 hours. Let me shed a little light on the history of this particular issue...

- This issue was originally brought to light by LANL quite some time ago. They discovered that one of their MPI codes was hanging in the middle of lengthy runs. After some investigation, it was determined that it was hanging in the middle of some collective operations -- MPI_REDUCE, IIRC (maybe MPI_ALLREDUCE? For the purposes of this email, I'll assume MPI_REDUCE).

- It turns out that this application called MPI_REDUCE a *lot*. Which is not uncommon. However, it was actually a fairly poorly architected application, doing things like repeatedly invoking MPI_REDUCE on single variables rather than bundling them into an array and computing them all with a single MPI_REDUCE (for example). Calling MPI_REDUCE a lot is not necessarily a problem per se, however -- MPI guarantees that this is supposed to be OK. I'll explain below why I mention this specific point.

- After some investigating at LANL, they determined that inserting a barrier every N iterations caused the hangs to stop. A little experimentation determined that running a barrier every 1000 collective operations both did not affect performance in any noticeable way and avoided whatever the underlying problem was.

- The user did not want to add the barriers to their code, so we added another collective module that internally counts collective operations and invokes a barrier every N iterations (where N is settable via an MCA parameter). We defaulted N to 1000 because it solved LANL's problem. I do not recall offhand whether we experimented to see if we could make N *more* than 1000 or not.

- Compounding the difficulty of this investigation was the fact that other Open MPI community members had an incredibly difficult time reproducing the problem. I don't think I was able to reproduce the problem at all, for example. I just took Ralph's old reproducers and tried again, and am unable to make OMPI 1.4 or OMPI 1.5 hang. I actually modified his reproducers to make them a bit *more* abusive (i.e., flood rank 0 with even *more* unexpected incoming messages), but I still can't get it to hang.

- To be clear: due to personnel changes at LANL at the time, there was very little experience in the MPI layer at LANL (Ralph, who was at LANL at the time, is the ORTE guy -- he actively stays out of the MPI layer whenever possible). The application that generated the problem was on rest
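The barrier-every-N workaround described in this thread can be pictured as a thin wrapper around the collective. This is an illustrative sketch only, not the actual code of Open MPI's collective module (which does the counting internally, with the interval settable via an MCA parameter):

```c
#include <mpi.h>

/* Call a barrier every SYNC_INTERVAL collective operations so that no
 * rank can run unboundedly far ahead of the others. SYNC_INTERVAL=1000
 * mirrors the default mentioned in the thread. */
#define SYNC_INTERVAL 1000

static long collective_count = 0;

int reduce_with_sync(void *sendbuf, void *recvbuf, int count,
                     MPI_Datatype dtype, MPI_Op op, int root,
                     MPI_Comm comm)
{
    int rc = MPI_Reduce(sendbuf, recvbuf, count, dtype, op, root, comm);
    if (++collective_count % SYNC_INTERVAL == 0)
        MPI_Barrier(comm);
    return rc;
}
```

Every rank must go through the same wrapper, so the barrier is entered collectively on the same iteration everywhere.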