[OMPI users] MPI_COMM_split hanging

2011-12-09 Thread Gary Gorbet

I am attempting to split my application into multiple master+workers
groups using MPI_COMM_split. My MPI revision is shown as:

mpirun --tag-output ompi_info -v ompi full --parsable
[1,0]:package:Open MPI root@build-x86-64 Distribution
[1,0]:ompi:version:full:1.4.3
[1,0]:ompi:version:svn:r23834
[1,0]:ompi:version:release_date:Oct 05, 2010
[1,0]:orte:version:full:1.4.3
[1,0]:orte:version:svn:r23834
[1,0]:orte:version:release_date:Oct 05, 2010
[1,0]:opal:version:full:1.4.3
[1,0]:opal:version:svn:r23834
[1,0]:opal:version:release_date:Oct 05, 2010
[1,0]:ident:1.4.3

The basic problem I am having is that none of the processor instances ever
returns from the MPI_Comm_split call. I am pretty new to MPI and it is
likely I am not doing things quite correctly. I'd appreciate some guidance.

I am working with an application that has functioned nicely for a while
now. It only uses a single MPI_COMM_WORLD communicator. It is standard
stuff:  a master that hands out tasks to many workers, receives output
and keeps track of workers that are ready to receive another task. The
tasks are quite compute-intensive. When running a variation of the
process that uses Monte Carlo iterations, jobs can exceed the 30 hours
they are limited to. The MC iterations are independent of each other -
each just adds random noise to an input - so I would like to run multiple
iterations simultaneously, so that using 4 times the cores finishes in a
quarter of the time. This would entail a supervisor interacting with
multiple master+workers groups.

I had thought that I would just have to declare a communicator for each
group so that broadcasts and syncs would work within a single group.

   MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
   ...
   cores_per_group = total_proc_count / groups_count;
   my_group   = my_rank / cores_per_group;             // e.g., 0, 1, ...
   group_rank = my_rank - my_group * cores_per_group;  // rank within a group
   if ( my_rank == 0 ) continue;    // Do not create a group for the supervisor
   MPI_Comm oldcomm = MPI_COMM_WORLD;
   MPI_Comm my_communicator;        // Actually declared as a class variable
   int sstat = MPI_Comm_split( oldcomm, my_group, group_rank,
                               &my_communicator );

There is never a return from the above _split() call. Do I need to do
something else to set this up? I would have expected perhaps a non-zero
status return, but not that I would get no return at all. I would
appreciate any comments or guidance.
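
For reference, a minimal self-contained sketch (the grouping arithmetic and
names are only illustrative) of the collective pattern in question: every
rank of MPI_COMM_WORLD must call MPI_Comm_split, and a supervisor that should
belong to no group passes MPI_UNDEFINED as the color and gets MPI_COMM_NULL
back:

#include <mpi.h>

int main( int argc, char **argv )
{
    int my_rank, total_proc_count;
    int groups_count = 4;                    /* illustrative value */

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );

    /* Workers are ranks 1..N; rank 0 is the supervisor and joins no group. */
    int cores_per_group = ( total_proc_count - 1 ) / groups_count;
    if ( cores_per_group < 1 ) cores_per_group = 1;

    int color, key;
    if ( my_rank == 0 ) {
        color = MPI_UNDEFINED;               /* supervisor: no new communicator */
        key   = 0;
    } else {
        color = ( my_rank - 1 ) / cores_per_group;
        key   = ( my_rank - 1 ) % cores_per_group;
    }

    /* Collective over MPI_COMM_WORLD: every rank must make this call. */
    MPI_Comm my_communicator;
    MPI_Comm_split( MPI_COMM_WORLD, color, key, &my_communicator );

    if ( my_communicator != MPI_COMM_NULL )
        MPI_Comm_free( &my_communicator );

    MPI_Finalize();
    return 0;
}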

- Gary


Re: [OMPI users] MPI_Allgather problem

2011-12-09 Thread teng ma
I guess your output is from different ranks. You can add the rank inside
the print to tell them apart, like this:

(void) printf("rank %d: gathered[%d].node = %d\n", rank, i,
gathered[i].node);

From my side, I did not see anything wrong with your code in Open MPI
1.4.3. After adding the rank, the output is:
rank 5: gathered[0].node = 0
rank 5: gathered[1].node = 1
rank 5: gathered[2].node = 2
rank 5: gathered[3].node = 3
rank 5: gathered[4].node = 4
rank 5: gathered[5].node = 5
rank 3: gathered[0].node = 0
rank 3: gathered[1].node = 1
rank 3: gathered[2].node = 2
rank 3: gathered[3].node = 3
rank 3: gathered[4].node = 4
rank 3: gathered[5].node = 5
rank 1: gathered[0].node = 0
rank 1: gathered[1].node = 1
rank 1: gathered[2].node = 2
rank 1: gathered[3].node = 3
rank 1: gathered[4].node = 4
rank 1: gathered[5].node = 5
rank 0: gathered[0].node = 0
rank 0: gathered[1].node = 1
rank 0: gathered[2].node = 2
rank 0: gathered[3].node = 3
rank 0: gathered[4].node = 4
rank 0: gathered[5].node = 5
rank 4: gathered[0].node = 0
rank 4: gathered[1].node = 1
rank 4: gathered[2].node = 2
rank 4: gathered[3].node = 3
rank 4: gathered[4].node = 4
rank 4: gathered[5].node = 5
rank 2: gathered[0].node = 0
rank 2: gathered[1].node = 1
rank 2: gathered[2].node = 2
rank 2: gathered[3].node = 3
rank 2: gathered[4].node = 4
rank 2: gathered[5].node = 5

Is that what you expected?

On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:

> Dear all,
>
> I have not used OpenMPI much before, but am maintaining a large legacy
> application. We noticed a bug to do with a call to MPI_Allgather as
> summarised in this post to Stackoverflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed that the
> following function results in strange behaviour.
>
> void test_all_gather() {
>
>     struct _TEST_ALL_GATHER {
>         int node;
>     };
>
>     int ierr, size, rank;
>     ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>     ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     struct _TEST_ALL_GATHER local;
>     struct _TEST_ALL_GATHER *gathered;
>
>     gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>
>     local.node = rank;
>
>     MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   MPI_COMM_WORLD);
>
>     int i;
>     for (i = 0; i < size; ++i) {
>         (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>     }
>
>     free(gathered);
> }
>
> At one point, this function printed the following:
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> Can anyone suggest a place to start looking into why this might be
> happening? There is a section of the code that calls MPI_Comm_split, but I
> am not sure if that is related...
>
> Running on Ubuntu 11.10 and a summary of ompi_info:
> Package: Open MPI buildd@allspice Distribution
> Open MPI: 1.4.3
> Open MPI SVN revision: r23834
> Open MPI release date: Oct 05, 2010
> Open RTE: 1.4.3
> Open RTE SVN revision: r23834
> Open RTE release date: Oct 05, 2010
> OPAL: 1.4.3
> OPAL SVN revision: r23834
> OPAL release date: Oct 05, 2010
> Ident string: 1.4.3
> Prefix: /usr
> Configured architecture: x86_64-pc-linux-gnu
> Configure host: allspice
> Configured by: buildd
>
> Thanks!
> Brett
>



-- 
| Teng Ma          Univ. of Tennessee |
| t...@cs.utk.edu  Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/       |


[OMPI users] MPI_Allgather problem

2011-12-09 Thread Brett Tully
Dear all,

I have not used OpenMPI much before, but am maintaining a large legacy
application. We noticed a bug to do with a call to MPI_Allgather as
summarised in this post to Stackoverflow:
http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results

In the process of looking further into the problem, I noticed that the
following function results in strange behaviour.

void test_all_gather() {

    struct _TEST_ALL_GATHER {
        int node;
    };

    int ierr, size, rank;
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct _TEST_ALL_GATHER local;
    struct _TEST_ALL_GATHER *gathered;

    gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));

    local.node = rank;

    MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  MPI_COMM_WORLD);

    int i;
    for (i = 0; i < size; ++i) {
        (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
    }

    free(gathered);
}

At one point, this function printed the following:
gathered[0].node = 2
gathered[1].node = 3
gathered[2].node = 2
gathered[3].node = 3
gathered[4].node = 4
gathered[5].node = 5

Can anyone suggest a place to start looking into why this might be
happening? There is a section of the code that calls MPI_Comm_split, but I
am not sure if that is related...
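
For what it's worth, a minimal self-contained sketch (the function name is
only illustrative) of the same gather expressed with a committed MPI datatype
for the struct instead of counting raw MPI_BYTEs on both sides; it should
behave identically, but it rules out any send/receive type-signature mismatch:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

struct _TEST_ALL_GATHER { int node; };

void test_all_gather_typed(void) {

    int size, rank, i;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Describe the struct once as an MPI datatype and commit it. */
    MPI_Datatype struct_type;
    MPI_Type_contiguous((int) sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                        &struct_type);
    MPI_Type_commit(&struct_type);

    struct _TEST_ALL_GATHER local;
    struct _TEST_ALL_GATHER *gathered =
        (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));

    local.node = rank;

    /* One element of struct_type per rank on both sides of the call. */
    MPI_Allgather(&local, 1, struct_type, gathered, 1, struct_type,
                  MPI_COMM_WORLD);

    for (i = 0; i < size; ++i) {
        printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);
    }

    free(gathered);
    MPI_Type_free(&struct_type);
}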

Running on Ubuntu 11.10 and a summary of ompi_info:
Package: Open MPI buildd@allspice Distribution
Open MPI: 1.4.3
Open MPI SVN revision: r23834
Open MPI release date: Oct 05, 2010
Open RTE: 1.4.3
Open RTE SVN revision: r23834
Open RTE release date: Oct 05, 2010
OPAL: 1.4.3
OPAL SVN revision: r23834
OPAL release date: Oct 05, 2010
Ident string: 1.4.3
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: allspice
Configured by: buildd

Thanks!
Brett


[OMPI users] mca_btl_sm_component_progress read an unknown type of header

2011-12-09 Thread Patrik Jonsson
Hi all,

This question was buried in an earlier question, and I got no replies,
so I'll try reposting it with a more enticing subject.

I have a multithreaded Open MPI code where each task has N+1 threads:
the N threads send nonblocking messages that are received by the one
remaining thread on the other tasks. When I run this code with 2 tasks
(5+1 threads each) on a single node with 12 cores, after about a million
messages have been exchanged the tasks segfault after printing the
following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress
read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the
default of a switch on fragment type. There's a comment there that
says:

* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.

It seems like whoever wrote that code didn't know quite what was going
on, and I guess the assumption was wrong because dereferencing that
result seems to be what's causing the segfault. Does anyone here know
what could cause this error? If I run the code with the tcp btl
instead of sm, it runs fine, albeit with a bit lower performance.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell
PowerEdge C6100 running linux kernel 2.6.18-194.32.1.el5, using Intel
12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.


ompi_info.gz
Description: GNU Zip compressed data


Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications

2011-12-09 Thread Patrik Jonsson
Hi Yiannis,

On Fri, Dec 9, 2011 at 10:21 AM, Yiannis Papadopoulos wrote:
> Patrik Jonsson wrote:
>>
>> Hi all,
>>
>> I'm seeing performance issues I don't understand in my multithreaded
>> MPI code, and I was hoping someone could shed some light on this.
>>
>> The code structure is as follows: A computational domain is decomposed
>> into MPI tasks. Each MPI task has a "master thread" that receives
>> messages from the other tasks and puts those into a local, concurrent
>> queue. The tasks then have a few "worker threads" that process the
>> incoming messages and, when necessary, send them to other tasks. So for
>> each task, there is one thread doing receives and N (typically number
>> of cores-1) threads doing sends. All messages are nonblocking, so the
>> workers just post the sends and continue with computation, and the
>> master repeatedly does a number of test calls to check for incoming
>> messages (there are different flavors of these messages so it does
>> several tests).
>
> When do you do the MPI_Test on the Isends? I have had performance issues on
> a number of systems when I used a single queue of MPI_Requests that held
> Isends to different ranks and tested them one by one. It appears that
> some messages are sent out more efficiently if you test them.

There are 3 classes of messages that may arrive. The requests for each
are in a vector, and I use boost::mpi::test_some (which I assume just
calls MPI_Testsome) to test them in a round-robin fashion.

>
> I found that either using MPI_Testsome, or having a map (key = rank,
> value = queue of MPI_Requests) and testing the first MPI_Request for each
> key, resolved this issue.

In my case, I know that the overwhelming majority of the traffic is one
kind of message. What I ended up doing was simply to repeat the test for
that message immediately if the preceding test succeeded, up to 1000
times, before checking the other requests again. This appears to enable
the task to keep up with the incoming traffic.
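
A minimal sketch, standing in for the boost::mpi layer and with purely
illustrative names, of draining one request array with MPI_Testsome and
re-testing while completions keep arriving:

#include <mpi.h>
#include <stdlib.h>

/* Sketch: poll an array of outstanding requests with MPI_Testsome,
 * handle whatever has completed, and keep re-testing while completions
 * keep arriving (bounded by max_repeats). */
void poll_requests(MPI_Request *reqs, int nreqs, int max_repeats)
{
    int *indices = (int *) malloc(nreqs * sizeof(int));
    MPI_Status *statuses = (MPI_Status *) malloc(nreqs * sizeof(MPI_Status));
    int repeat, i, ncompleted;

    for (repeat = 0; repeat < max_repeats; ++repeat) {
        MPI_Testsome(nreqs, reqs, &ncompleted, indices, statuses);
        if (ncompleted == 0 || ncompleted == MPI_UNDEFINED)
            break;                    /* nothing completed right now */
        for (i = 0; i < ncompleted; ++i) {
            /* handle the message behind reqs[indices[i]] here; that
             * request is now MPI_REQUEST_NULL, and the matching
             * MPI_Irecv would typically be re-posted at this point */
        }
    }

    free(indices);
    free(statuses);
}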

I guess another possibility would be to have several slots for the
incoming messages. Right now I only post one irecv per source task. By
posting a couple, more messages could arrive with a matching recv
already in place, and one test could match more of them. Since that
makes the logic more complicated, I didn't try it.


Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications

2011-12-09 Thread Yiannis Papadopoulos

Patrik Jonsson wrote:

Hi all,

I'm seeing performance issues I don't understand in my multithreaded
MPI code, and I was hoping someone could shed some light on this.

The code structure is as follows: A computational domain is decomposed
into MPI tasks. Each MPI task has a "master thread" that receives
messages from the other tasks and puts those into a local, concurrent
queue. The tasks then have a few "worker threads" that process the
incoming messages and, when necessary, send them to other tasks. So for
each task, there is one thread doing receives and N (typically number
of cores-1) threads doing sends. All messages are nonblocking, so the
workers just post the sends and continue with computation, and the
master repeatedly does a number of test calls to check for incoming
messages (there are different flavors of these messages so it does
several tests).
When do you do the MPI_Test on the Isends? I have had performance issues on a
number of systems when I used a single queue of MPI_Requests that held
Isends to different ranks and tested them one by one. It appears that some
messages are sent out more efficiently if you test them.


I found that either using MPI_Testsome, or having a map (key = rank, value =
queue of MPI_Requests) and testing the first MPI_Request for each key,
resolved this issue.




Re: [OMPI users] Cofigure(?) problem building /1.5.3 on ScientificLinux6.0

2011-12-09 Thread Paul Kapinos

Hello Gus, Ralph, Jeff

A very late answer to this - I just found it in my mailbox.


Would "cp -rp" help?
(To preserve time stamps, instead of "cp -r".)


Yes, the root of the evil was the time stamps. 'cp -a' is the magic
wand. Many thanks for your help, and I should wear sackcloth and
ashes... :-/


Best,

Paul





Anyway, since 1.2.8 I have built 5 or more versions here,
all from the same tarball but in separate build directories,
as Jeff suggests.
[VPATH] Works for me.

My two cents.
Gus Correa

Jeff Squyres wrote:
Ah -- Ralph pointed out the relevant line to me in your first mail 
that I initially missed:



In each case I build 16 versions in total (4 compilers * 32-bit/64-bit *
multithreading support ON/OFF). The same error arises in all 16
versions.


Perhaps you should just expand the tarball once and then do VPATH 
builds...?


Something like this:

tar xf openmpi-1.5.3.tar.bz2
cd openmpi-1.5.3

mkdir build-gcc
cd build-gcc
../configure blah..
make -j 4
make install
cd ..

mkdir build-icc
cd build-icc
../configure CC=icc CXX=icpc FC=ifort F77=ifort ..blah.
make -j 4
make install
cd ..
etc.

This allows you to have one set of source and have N different builds 
from it.  Open MPI uses the GNU Autotools correctly to support this 
kind of build pattern.





On Jul 22, 2011, at 2:37 PM, Jeff Squyres wrote:

Your RUNME script is a *very* strange way to build Open MPI.  It 
starts with a massive copy:


cp -r /home/pk224850/OpenMPI/openmpi-1.5.3/AUTHORS 
/home/pk224850/OpenMPI/openmpi-1.5.3/CMakeLists.txt <...much 
snipped...> .


Why are you doing this kind of copy?  I suspect that the GNU 
autotools' timestamps are getting all out of whack when you do this 
kind of copy, and therefore when you run "configure", it tries to 
re-autogen itself.


To be clear: when you expand OMPI from a tarball, you shouldn't need 
the GNU Autotools installed at all -- the tarball is pre-bootstrapped 
exactly to avoid you needing to use the Autotools (much less any 
specific version of the Autotools).


I suspect that if you do this:

-
tar xf openmpi-1.5.3.tar.bz2
cd openmpi-1.5.3
./configure etc.
-

everything will work just fine.


On Jul 22, 2011, at 11:12 AM, Paul Kapinos wrote:


Dear OpenMPI folks,
currently I have a problem building version 1.5.3 of OpenMPI on
Scientific Linux 6.0 systems, which seems to me to be a configuration
problem.

After the configure run (which seems to terminate without an error code),
the "gmake all" stage produces errors and exits.

Typical output is shown below.

Oddly, the 1.4.3 version can be built on the same computer with no special
trouble, and both the 1.4.3 and 1.5.3 versions can be built on another
computer running CentOS 5.6.

In each case I build 16 versions in total (4 compilers * 32-bit/64-bit *
multithreading support ON/OFF). The same error arises in all 16
versions.


Can someone give a hint about how to avoid this issue? Thanks!

Best wishes,

Paul


Some logs and configure are downloadable here:
https://gigamove.rz.rwth-aachen.de/d/id/2jM6MEa2nveJJD

The configure line is in RUNME.sh, and the logs of the configure and build
stages are in the log_* files; I also attached the config.log file and the
configure script itself (which is the standard one from the 1.5.3 release).


##


CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh
/tmp/pk224850/linuxc2_11254/openmpi-1.5.3mt_linux64_gcc/config/missing
--run aclocal-1.11 -I config
sh: config/ompi_get_version.sh: No such file or directory
/usr/bin/m4: esyscmd subprocess failed



configure.ac:953: warning: OMPI_CONFIGURE_SETUP is m4_require'd but not
m4_defun'd
config/ompi_mca.m4:37: OMPI_MCA is expanded from...
configure.ac:953: the top level
configure.ac:953: warning: AC_COMPILE_IFELSE was called before
AC_USE_SYSTEM_EXTENSIONS
../../lib/autoconf/specific.m4:386: AC_USE_SYSTEM_EXTENSIONS is expanded from...
opal/mca/paffinity/hwloc/hwloc/config/hwloc.m4:152:
HWLOC_SETUP_CORE_AFTER_C99 is expanded from...
../../lib/m4sugar/m4sh.m4:505: AS_IF is expanded from...
opal/mca/paffinity/hwloc/hwloc/config/hwloc.m4:22: HWLOC_SETUP_CORE is
expanded from...
opal/mca/paffinity/hwloc/configure.m4:40: MCA_paffinity_hwloc_CONFIG is
expanded from...
config/ompi_mca.m4:540: MCA_CONFIGURE_M4_CONFIG_COMPONENT is expanded
from...
config/ompi_mca.m4:326: MCA_CONFIGURE_FRAMEWORK is expanded from...
config/ompi_mca.m4:247: MCA_CONFIGURE_PROJECT is expanded from...
configure.ac:953: warning: AC_RUN_IFELSE was called before
AC_USE_SYSTEM_EXTENSIONS




--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/abo

Re: [OMPI users] Program hangs in mpi_bcast

2011-12-09 Thread Alex A. Granovsky
Dear Jeff,

thanks so much for your detailed reply and explanations, and sorry for not
answering sooner.

I'll try to develop a reproducer, and I have some ideas about how this can be
done. At least I know the typical scenarios that cause this issue to appear.
To be honest, I'm rather busy these days (as probably most of us are), but
I'll try to do this as soon as I can.

Just a brief comment on repeated collectives. I know of at least two
situations in which repeated collectives are either required or beneficial.
First, the arrays to be (all)reduced can be so large that their sizes
overflow 32-bit integers, so one has to split a single operation into a
sequence of calls. I know some MPI implementations support 64-bit integers
as arguments to an extended set of functions for handling large arrays, but
some do not. In addition, such splitting reduces the probability of hangs
due to lack of resources on the compute nodes.

Second, our experience with every transport, MPI implementation, and CPU
type we have tried so far shows that the overall performance of (all)reduce
on very large arrays is usually worse than that of a sequence of calls on
smaller chunks. While it is hard to predict the optimal chunk size, it can
easily be found experimentally.
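
A minimal sketch, assuming a large double buffer reduced in place with
MPI_SUM and an experimentally chosen chunk size, of splitting one large
allreduce into a sequence of smaller calls:

#include <mpi.h>
#include <stddef.h>

/* Sketch: reduce a buffer whose total length may not fit in the 32-bit
 * "count" argument of a single call by issuing the allreduce in
 * fixed-size chunks; CHUNK would be tuned experimentally. */
void chunked_allreduce(double *buf, size_t total, MPI_Comm comm)
{
    const size_t CHUNK = (size_t) 1 << 24;   /* 16M doubles per call */
    size_t done = 0;

    while (done < total) {
        size_t n = total - done;
        if (n > CHUNK)
            n = CHUNK;
        MPI_Allreduce(MPI_IN_PLACE, buf + done, (int) n,
                      MPI_DOUBLE, MPI_SUM, comm);
        done += n;
    }
}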

> >   Some of our users would like to use Firefly with OpenMPI. Usually, we
> > simply answer them that OpenMPI is too buggy to be used.

> This surprises me.  Is this with regards to this collective/hang issue, or 
> something else?

Yes, this is with regard to the collective hang issue.

All the best,
Alex




- Original Message -
From: "Jeff Squyres" 
To: "Alex A. Granovsky" ;
Sent: Saturday, December 03, 2011 3:36 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast


On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:

> I would like to start a discussion on the implementation of collective
> operations within OpenMPI. The reason for this is at least twofold.
> In recent months there has been a constantly growing number of messages
> on the list from people facing problems with collectives, so I believe
> these issues must be discussed and will hopefully attract the proper
> attention of the OpenMPI developers. The second reason is my
> involvement in the development of the Firefly Quantum Chemistry package,
> which, of course, uses collectives rather intensively.

Greetings Alex, and thanks for your note.  We take it quite seriously, and had 
a bunch of phone/off-list conversations about it in
the past 24 hours.

Let me shed a little light on the history with regards to this particular 
issue...

- This issue was originally brought to light by LANL quite some time ago.  They 
discovered that one of their MPI codes was hanging
in the middle of lengthy runs.  After some investigation, it was determined 
that it was hanging in the middle of some collective
operations -- MPI_REDUCE, IIRC (maybe MPI_ALLREDUCE?  For the purposes of this 
email, I'll assume MPI_REDUCE).

- It turns out that this application called MPI_REDUCE a *lot*.  Which is not 
uncommon.  However, it was actually a fairly poorly
architected application, such that it was doing things like repeatedly invoking 
MPI_REDUCE on single variables rather than bundling
them up into an array and computing them all with a single MPI_REDUCE (for 
example).  Calling MPI_REDUCE a lot is not necessarily a
problem, per se, however -- MPI guarantees that this is supposed to be ok.  
I'll bring up below why I mention this specific point.
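
(A minimal sketch of the bundling described above, with purely illustrative
names: three scalars reduced in a single call rather than three
single-element reductions.)

#include <mpi.h>

/* Sketch: pack scalars into one array and reduce them together. */
void reduce_bundled(double a, double b, double c, MPI_Comm comm)
{
    double vals[3] = { a, b, c };
    double sums[3];
    MPI_Reduce(vals, sums, 3, MPI_DOUBLE, MPI_SUM, 0, comm);
    /* rank 0 now holds the three reduced values in sums[0..2] */
}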

- After some investigating at LANL, they determined that putting a barrier in 
every N iterations caused the hangs to stop.  A little
experimentation determined that running a barrier every 1000 collective 
operations both did not affect performance in any noticeable
way and avoided whatever the underlying problem was.

- The user did not want to add the barriers to their code, so we added another 
collective module that internally counts collective
operations and invokes a barrier every N iterations (where N is settable via 
MCA parameter).  We defaulted N to 1000 because it
solved LANL's problems.  I do not recall offhand whether we experimented to see 
if we could make N *more* than 1000 or not.

- Compounding the difficulty of this investigation was the fact that other Open 
MPI community members had an incredibly difficult
time reproducing the problem.  I don't think that I was able to reproduce the 
problem at all, for example.  I just took Ralph's old
reproducers and tried again, and am unable to make OMPI 1.4 or OMPI 1.5 hang.  
I actually modified his reproducers to make them a
bit *more* abusive (i.e., flood rank 0 with even *more* unexpected incoming 
messages), but I still can't get it to hang.

- To be clear: due to personnel changes at LANL at the time, there was very 
little experience in the MPI layer at LANL (Ralph, who
was at LANL at the time, is the ORTE guy -- he actively stays out of the MPI 
layer whenever possible).  The application that
generated the problem was on rest