Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally

Ashley Pittman wrote:


Do you have a stack trace of your hung application to hand? In
particular, when you say "All processes have made the same call to
MPI_Allreduce.  The processes are all in opal_progress, called (with
intervening calls) by MPI_Allreduce.",

do the intervening calls include mca_coll_sync_bcast,
ompi_coll_tuned_barrier_intra_dec_fixed and
ompi_coll_tuned_barrier_intra_recursivedoubling?


I don't have a stack trace handy, and today is pretty full.  I'll try 
and make some time to document what I've got in the next few days.  I 
was able to hang a C translation of Ralph's reproducer as well.


- Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally

Ashley Pittman wrote:


Whilst the fact that it appears to happen only on your machine implies
it's not a general problem with Open MPI, the fact that it happens in the
same location/rep count every time does swing the blame back the other
way.


This sounds a _lot_ like the problem I was seeing; my initial message is
appended here.  If it's the same thing, then it shows up not only on the
big machines here that Ralph was talking about, but also on very vanilla
Fedora 7 and 9 boxes.


I was able to hang Ralph's reproducer on an 8-core Dell running Fedora 9,
kernel 2.6.27(.4-78.2.53.fc9.x86_64).


I don't think it's just the one machine and its configuration.

- Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico

Developers,

This is my first post to the openmpi developers list.  I think I've run 
across a race condition in your latest release.  Since my demonstrator 
is somewhat large and cumbersome, I'd like to know if you already know 
about this issue before we start the process of providing code and details.


Basics: openmpi 1.3.2, Fedora 9, 2 x86_64 quad-core cpus in one machine.

Symptoms: our code hangs, always in the same vicinity, usually at the 
same place, 10-25% of the time.  Sometimes more often, sometimes less.


Our code has run reliably with many MPI implementations for years.  We 
haven't added anything recently that is a likely culprit.  While we have 
our own issues, this doesn't feel like one of ours.


We see that there is new code in the shared memory transport between 
1.3.1 and 1.3.2.  Our code doesn't hang with 1.3.1 (nor 1.2.9).  Only 
with 1.3.2.


If we switch to tcp for the transport (with mpirun --mca btl tcp,self ...)
we don't see any hangs.  Running with --mca btl sm,self results in hangs.


If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the 
problem, we no longer see hangs.


We demonstrate this with 4 processes.  When we attach a debugger to the 
hung processes, we see that the hang results from an MPI_Allreduce.  All 
processes have made the same call to MPI_Allreduce.  The processes are 
all in opal_progress, called (with intervening calls) by MPI_Allreduce.
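
For reference, the shape of the loop involved is nothing exotic.  The
sketch below is not our actual code (our demonstrator is much larger);
the iteration count and the one-double payload are invented purely for
illustration, and the mpirun lines in the comment just mirror the
transport comparison described above:

/* allreduce_loop.c - illustrative sketch only, not the real demonstrator.
 * Build:  mpicc allreduce_loop.c -o allreduce_loop
 * Run with the two transports compared above, e.g.:
 *   mpirun -np 4 --mca btl sm,self  ./allreduce_loop
 *   mpirun -np 4 --mca btl tcp,self ./allreduce_loop
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Tight loop over a single collective; the count is arbitrary. */
    for (i = 0; i < 100000; i++) {
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0 && i % 10000 == 0)
            printf("iteration %d\n", i);
    }

    MPI_Finalize();
    return 0;
}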


My question is, have you seen anything like this before?  If not, what 
do we do next?


Thanks.

- Bryan


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-15 Thread Ashley Pittman
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
> 
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
> 
> https://svn.open-mpi.org/trac/ompi/ticket/1944
> 
> Will be back after I look at the tool.

Have you made any progress?

Whilst the fact that it appears to happen only on your machine implies
it's not a general problem with Open MPI, the fact that it happens in the
same location/rep count every time does swing the blame back the other
way.  Perhaps it's some special configure or runtime option you are
setting?  One thing that springs to mind is that the numa maps could be
exposing some timing problem with shared memory calls; however, this
doesn't sit well with it always failing on the same iteration.

Can you provide stack traces from when it's hung and, crucially, are they
the same for every hang?  If you change the coll_sync_barrier_before
value to make it hang on a different repetition, does this change the
stack trace at all?

Likewise, once you have applied the collectives patch, is the collective
state the same for every hang, and how does this differ when you change
the coll_sync_barrier_before variable?

It would be useful to see stack traces and collective state from the
three collectives you report as causing problems (MPI_Bcast, MPI_Reduce
and MPI_Allgather) because, as I said before, these three collectives
have radically different communication patterns.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey

Hi Ralph,

I managed to get a deadlock after a whole night of runs, but not the same
one you have: after a quick analysis, process 0 seems to be blocked in the
very first send through shared memory. Still possibly a bug, but not the
same as yours IMO.


I also figured out that libnuma support was not in my library, so I
rebuilt the lib, but this doesn't seem to change anything: same execution
speed, same memory footprint, and of course the bug still does not appear
:-(.


So, no luck so far in reproducing your problem. I guess you're the only
one able to make progress on this (since you seem to have a real
reproducer).


Sylvain

On Wed, 10 Jun 2009, Sylvain Jeaugey wrote:

Hum, very glad that padb works with Open MPI; I couldn't live without it.
In my opinion, it's the best debug tool for parallel applications and,
more importantly, the only one that scales.


About the issue, I couldn't reproduce it on my platform (tried 2 nodes with 2 
to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR).


So my feeling is that it may be very hardware related. Especially if you
use the hierarch component, some transactions will be done through RDMA
on one side and read directly through shared memory on the other side,
which can, depending on the hardware, produce very different timings and
bugs. Did you try with a different collective component (i.e. not
hierarch)? Or with another interconnect? [Yes, of course, if it is a race
condition, we might well avoid the bug because timings will be different,
but that's still information.]


Perhaps everything I'm saying makes no sense, or you have already thought
about this; anyway, if you want me to try different things, just let me
know.


Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:


Hi Ashley

Thanks! I would definitely be interested and will look at the tool. 
Meantime, I have filed a bunch of data on this in
ticket #1944, so perhaps you might take a glance at that and offer some 
thoughts?


https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman  
wrote:


  Ralph,

  If I may say this is exactly the type of problem the tool I have been
  working on recently aims to help with and I'd be happy to help you
  through it.

  Firstly I'd say of the three collectives you mention, MPI_Allgather,
  MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
  and the last a one-to-many communication pattern.  The scenario of a
  root process falling behind and getting swamped in comms is a plausible
  one for MPI_Reduce only but doesn't hold water with the other two.  You
  also don't mention if the loop is over a single collective or if you
  have a loop calling a number of different collectives each iteration.

  padb, the tool I've been working on has the ability to look at parallel
  jobs and report on the state of collective comms and should help narrow
  you down on erroneous processes and those simply blocked waiting for
  comms.  I'd recommend using it to look at maybe four or five instances
  where the application has hung and look for any common features between
  them.

  Let me know if you are willing to try this route and I'll talk, the code
  is downloadable from http://padb.pittman.org.uk and if you want the full
  collective functionality you'll need to patch Open MPI with the patch
  from http://padb.pittman.org.uk/extensions.html

  Ashley,

  --

  Ashley Pittman, Bath, UK.

  Padb - A parallel job inspection tool for cluster computing
  http://padb.pittman.org.uk






Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Well, it would - except then -all- the procs would run real slow! :-)

Still, might be a reasonable diagnostic step to try...will give it a shot.

On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <
bogdan.coste...@iwr.uni-heidelberg.de> wrote:

> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  I appreciate the input and have captured it in the ticket. Since this
>> appears to be a NUMA-related issue, the lack of NUMA support in your setup
>> makes the test difficult to interpret.
>>
>
> Based on this reasoning, disabling libnuma support in your Open MPI build
> should also solve the problem, or do I interpret things the wrong way?
>
>
> --
> Bogdan Costescu
>
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu

On Wed, 10 Jun 2009, Ralph Castain wrote:

I appreciate the input and have captured it in the ticket. Since 
this appears to be a NUMA-related issue, the lack of NUMA support in 
your setup makes the test difficult to interpret.


Based on this reasoning, disabling libnuma support in your Open MPI
build should also solve the problem, or do I interpret things the
wrong way?


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Much appreciated!

Per some of my other comments on this thread and on the referenced ticket,
can you tell me what kernel you have on that machine? I assume you have NUMA
support enabled, given that chipset?

Thanks!
Ralph

On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey
wrote:

> Hum, very glad that padb works with Open MPI; I couldn't live without it.
> In my opinion, it's the best debug tool for parallel applications and,
> more importantly, the only one that scales.
>
> About the issue, I couldn't reproduce it on my platform (tried 2 nodes with
> 2 to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR).
>
> So my feeling is that it may be very hardware related.
> Especially if you use the hierarch component, some transactions will be done
> through RDMA on one side and read directly through shared memory on the
> other side, which can, depending on the hardware, produce very different
> timings and bugs. Did you try with a different collective component (i.e.
> not hierarch) ? Or with another interconnect ? [Yes, of course, if it is a
> race condition, we might well avoid the bug because timings will be
> different, but that's still information]
>
> Perhaps everything I'm saying makes no sense, or you have already thought
> about this; anyway, if you want me to try different things, just let me know.
>
> Sylvain
>
>
> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  Hi Ashley
>>
>> Thanks! I would definitely be interested and will look at the tool.
>> Meantime, I have filed a bunch of data on this in
>> ticket #1944, so perhaps you might take a glance at that and offer some
>> thoughts?
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1944
>>
>> Will be back after I look at the tool.
>>
>> Thanks again
>> Ralph
>>
>>
>> On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman 
>> wrote:
>>
>>  Ralph,
>>
>>  If I may say this is exactly the type of problem the tool I have been
>>  working on recently aims to help with and I'd be happy to help you
>>  through it.
>>
>>  Firstly I'd say of the three collectives you mention, MPI_Allgather,
>>  MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
>>  and the last a one-to-many communication pattern.  The scenario of a
>>  root process falling behind and getting swamped in comms is a plausible
>>  one for MPI_Reduce only but doesn't hold water with the other two.  You
>>  also don't mention if the loop is over a single collective or if you
>>  have a loop calling a number of different collectives each iteration.
>>
>>  padb, the tool I've been working on has the ability to look at parallel
>>  jobs and report on the state of collective comms and should help narrow
>>  you down on erroneous processes and those simply blocked waiting for
>>  comms.  I'd recommend using it to look at maybe four or five instances
>>  where the application has hung and look for any common features between
>>  them.
>>
>>  Let me know if you are willing to try this route and I'll talk, the code
>>  is downloadable from http://padb.pittman.org.uk and if you want the full
>>  collective functionality you'll need to patch Open MPI with the patch
>>  from http://padb.pittman.org.uk/extensions.html
>>
>>  Ashley,
>>
>>  --
>>
>>  Ashley Pittman, Bath, UK.
>>
>>  Padb - A parallel job inspection tool for cluster computing
>>  http://padb.pittman.org.uk
>>


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
I appreciate the input and have captured it in the ticket. Since this
appears to be a NUMA-related issue, the lack of NUMA support in your setup
makes the test difficult to interpret.

I agree, though, that this is likely something peculiar to our particular
setup. Of primary concern is that it might be related to the relatively old
kernel (2.6.18) on these machines. There has been a lot of change since that
kernel was released, and some of those changes may be relevant to this
problem.

Unfortunately, upgrading the kernel will take a persuasive argument. We are
going to try and run the reproducers on machines with more modern kernels to
see if we get different behavior.

Please feel free to follow this further on the ticket.

Thanks again!
Ralph


On Wed, Jun 10, 2009 at 11:29 AM, Bogdan Costescu <
bogdan.coste...@iwr.uni-heidelberg.de> wrote:

> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>  Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
>> you might take a glance at that and offer some thoughts?
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1944
>>
>
> I wasn't able to reproduce this. I have run with the following setup:
> - OS is Scientific Linux 5.1 with a custom compiled kernel based on
> 2.6.22.19, but (due to circumstances that I can't control):
>
> checking if MCA component maffinity:libnuma can compile... no
>
> - Intel compiler 10.1
> - OpenMPI 1.3.2
> - nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX IB
> DDR
>
> I've used the platform file that you have provided, but took out the
> references to PanFS and fixed the paths. I've also used the MCA file that
> you have provided.
>
> I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished
> successfully with m=50 several times. This, together with the earlier post
> also describing a negative result, points to a problem related to your
> particular setup...
>
> --
> Bogdan Costescu
>
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.coste...@iwr.uni-heidelberg.de
>


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu

On Wed, 10 Jun 2009, Ralph Castain wrote:


Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944


I wasn't able to reproduce this. I have run with the following setup:
- OS is Scientific Linux 5.1 with a custom compiled kernel based on 
2.6.22.19, but (due to circumstances that I can't control):


checking if MCA component maffinity:libnuma can compile... no

- Intel compiler 10.1
- OpenMPI 1.3.2
- nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX 
IB DDR


I've used the platform file that you have provided, but took out the 
references to PanFS and fixed the paths. I've also used the MCA file 
that you have provided.


I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished 
successfully with m=50 several times. This, together with the earlier 
post also describing a negative result, points to a problem related to 
your particular setup...


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hum, very glad that padb works with Open MPI; I couldn't live without it.
In my opinion, it's the best debug tool for parallel applications and,
more importantly, the only one that scales.


About the issue, I couldn't reproduce it on my platform (tried 2 nodes 
with 2 to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is 
Mellanox QDR).


So my feeling is that it may be very hardware related. Especially if you
use the hierarch component, some transactions will be done through RDMA
on one side and read directly through shared memory on the other side,
which can, depending on the hardware, produce very different timings and
bugs. Did you try with a different collective component (i.e. not
hierarch)? Or with another interconnect? [Yes, of course, if it is a race
condition, we might well avoid the bug because timings will be different,
but that's still information.]


Perhaps everything I'm saying makes no sense, or you have already thought
about this; anyway, if you want me to try different things, just let me
know.


Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:


Hi Ashley

Thanks! I would definitely be interested and will look at the tool. Meantime, I 
have filed a bunch of data on this in
ticket #1944, so perhaps you might take a glance at that and offer some 
thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman  wrote:

  Ralph,

  If I may say this is exactly the type of problem the tool I have been
  working on recently aims to help with and I'd be happy to help you
  through it.

  Firstly I'd say of the three collectives you mention, MPI_Allgather,
  MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
  and the last a one-to-many communication pattern.  The scenario of a
  root process falling behind and getting swamped in comms is a plausible
  one for MPI_Reduce only but doesn't hold water with the other two.  You
  also don't mention if the loop is over a single collective or if you
  have a loop calling a number of different collectives each iteration.

  padb, the tool I've been working on has the ability to look at parallel
  jobs and report on the state of collective comms and should help narrow
  you down on erroneous processes and those simply blocked waiting for
  comms.  I'd recommend using it to look at maybe four or five instances
  where the application has hung and look for any common features between
  them.

  Let me know if you are willing to try this route and I'll talk, the code
  is downloadable from http://padb.pittman.org.uk and if you want the full
  collective functionality you'll need to patch Open MPI with the patch from
  http://padb.pittman.org.uk/extensions.html

  Ashley,

  --

  Ashley Pittman, Bath, UK.

  Padb - A parallel job inspection tool for cluster computing
  http://padb.pittman.org.uk






Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Hi Ashley

Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman wrote:

>
> Ralph,
>
> If I may say this is exactly the type of problem the tool I have been
> working on recently aims to help with and I'd be happy to help you
> through it.
>
> Firstly I'd say of the three collectives you mention, MPI_Allgather,
> MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
> and the last a one-to-many communication pattern.  The scenario of a
> root process falling behind and getting swamped in comms is a plausible
> one for MPI_Reduce only but doesn't hold water with the other two.  You
> also don't mention if the loop is over a single collective or if you
> have a loop calling a number of different collectives each iteration.
>
> padb, the tool I've been working on has the ability to look at parallel
> jobs and report on the state of collective comms and should help narrow
> you down on erroneous processes and those simply blocked waiting for
> comms.  I'd recommend using it to look at maybe four or five instances
> where the application has hung and look for any common features between
> them.
>
> Let me know if you are willing to try this route and I'll talk, the code
> is downloadable from http://padb.pittman.org.uk and if you want the full
> collective functionality you'll need to patch Open MPI with the patch from
> http://padb.pittman.org.uk/extensions.html
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ashley Pittman

Ralph,

If I may say this is exactly the type of problem the tool I have been
working on recently aims to help with and I'd be happy to help you
through it.

Firstly I'd say of the three collectives you mention, MPI_Allgather,
MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
and the last a one-to-many communication pattern.  The scenario of a
root process falling behind and getting swamped in comms is a plausible
one for MPI_Reduce only but doesn't hold water with the other two.  You
also don't mention if the loop is over a single collective or if you
have a loop calling a number of different collectives each iteration.

padb, the tool I've been working on has the ability to look at parallel
jobs and report on the state of collective comms and should help narrow
you down on erroneous processes and those simply blocked waiting for
comms.  I'd recommend using it to look at maybe four or five instances
where the application has hung and look for any common features between
them.

Let me know if you are willing to try this route and I'll talk, the code
is downloadable from http://padb.pittman.org.uk and if you want the full
collective functionality you'll need to patch Open MPI with the patch from
http://padb.pittman.org.uk/extensions.html

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



[OMPI devel] Hang in collectives involving shared memory

2009-06-09 Thread Ralph Castain
Hi folks

As mentioned in today's telecon, we at LANL are continuing to see hangs when
running even small jobs that involve shared memory in collective operations.
This has been the topic of discussion before, but I bring it up again
because (a) the problem is beginning to become epidemic across our
application codes, and (b) repeated testing provides more info and (most
importantly) confirms that this problem -does not- occur under 1.2.x - it is
strictly a 1.3.2 (we haven't checked to see if it is in 1.3.0 or 1.3.1)
problem.

The condition is caused when the application performs a loop over collective
operations such as MPI_Allgather, MPI_Reduce, and MPI_Bcast. This list is
not intended to be exhaustive, but only represents the ones for which we
have solid and repeatable data. The symptoms are a "hanging" job, typically
(but not always!) associated with fully-consumed memory. The loops do not
have to involve substantial amounts of memory (the Bcast loop hangs after
moving a whole 32Mbytes, total), nor involve high loop counts. They only
have to repeatedly call the collective.
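
To be concrete, the kind of loop we mean is nothing more elaborate than the
sketch below. This is not the actual reproducer; the buffer size and
iteration count are invented and are only meant to show the scale involved
(a few KB per broadcast, a few thousand iterations, roughly 32 MBytes in
total):

/* bcast_loop.c - illustrative sketch only, not the actual reproducer.
 * The sizes are invented: 1024 ints (4 KB) broadcast 8192 times moves
 * roughly 32 MBytes in total.  The point is only the pattern: a tight
 * loop over a collective with a small payload. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1024;   /* 4 KB of ints per broadcast */
    const int iters = 8192;   /* ~32 MBytes moved in total  */
    int *buf;
    int i;

    MPI_Init(&argc, &argv);
    buf = (int *)calloc(count, sizeof(int));

    for (i = 0; i < iters; i++)
        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}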

Disabling the shared memory BTL is enough to completely resolve the problem.
However, this creates an undesirable performance penalty we would like to
avoid, if possible.

Our current solution is to use the "sync" collective to occasionally insert
an MPI_Barrier into the code "behind the scenes" - i.e., to add an
MPI_Barrier call every N calls to "problem" collectives. The
argument in favor of this was that the hang is caused by consuming memory
due to "unexpected messages", caused principally by the root process in the
collective running slower than other procs. Thus, the notion goes, the root
process continues to fall further and further behind, consuming ever more
memory until it simply cannot progress. Adding the barrier operation forced
the other procs to "hold" until the root process could catch up, thereby
relieving the memory backlog.
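
In code terms, what the sync collective does for us is roughly equivalent
to writing the following by hand. This is a sketch only: N stands in for
coll_sync_barrier_before, and the value 1000 is invented for illustration.

/* Hand-written equivalent of the "sync" workaround - a sketch only.
 * N plays the role of coll_sync_barrier_before; 1000 is made up. */
#include <mpi.h>

static void bcast_loop_with_sync(int *buf, int count, int iters)
{
    const int N = 1000;
    int i;

    for (i = 0; i < iters; i++) {
        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
        if ((i + 1) % N == 0)
            MPI_Barrier(MPI_COMM_WORLD);  /* let the lagging proc catch up */
    }
}

The difficulty, described below, is finding a value of N that behaves
reliably.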

The sync collective has worked for us, but we are now finding a very
disconcerting behavior - namely, that the precise value of N required to
avoid hanging (a) is very, very sensitive, so that changing the value by
small amounts can still let the app hang, (b) fluctuates between runs on
an unpredictable basis, and (c) can be different for different collectives.

These new problems surfaced this week when we found that a job that
previously ran fine with one value of coll_sync_barrier_before suddenly hung
when a loop over MPI_Bcast was added to the code. Further investigation has
found that the value of N required to make the new loop work is
significantly different from the prior value that made Allgather work,
forcing an exhaustive search for a "sweet spot" for N.

Clearly, as codes grow in complexity, this simply is not going to work.

It seems to me that we have to begin investigating -why- the 1.3.2 code is
encountering this problem whereas the 1.2.x code is not. From our rough
measurements, there is some speed difference between the two releases, so
perhaps we are now getting fast enough to create the problem - I don't think
we know enough yet to really claim this is true. At this time, we really
don't know -why- one process is running slow, or even if it is -always- the
root process that is doing so...nor have we confirmed (to my knowledge) that
our original analysis of the problem is correct!

We would appreciate any help with this problem. I gathered from today's
telecon that others are also encountering this, so perhaps there is enough
general pain to stimulate a team effort to resolve it!

Thanks
Ralph