Dear Jeff,

Thanks so much for your detailed reply and explanations, and sorry for not
answering sooner.

I'll try to develop a reproducer, and I have some ideas about how this can
be done. At least I know the typical scenarios that cause this issue to
appear. To be honest, I'm rather busy these days (as most of us probably
are), but I'll try to do this as soon as I can.

Just a brief comment on repeated collectives. I know of at least two
situations in which repeated collectives are either required or beneficial.
First, the arrays to be (all)reduced can be large enough that their element
counts overflow the 32-bit integer arguments, so a single operation has to
be split into a sequence of calls (a short sketch follows below). I know
that some MPI implementations support 64-bit integer arguments in an
extended set of functions for handling large arrays, but some do not. In
addition, such splitting reduces the probability of hangs due to lack of
resources on the compute nodes.
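
To illustrate what I mean by splitting, here is a minimal sketch (not
Firefly code; the chunk size of 2^24 elements is just an arbitrary
placeholder) of a summing allreduce driven in chunks so that each call's
count stays well within a 32-bit int:

    #include <mpi.h>
    #include <stdlib.h>

    /* Reduce 'total' doubles in place, a limited number of elements per
       MPI call, so the int count argument never overflows. */
    static void chunked_allreduce_sum(double *buf, size_t total, MPI_Comm comm)
    {
        const size_t chunk = (size_t)1 << 24;   /* arbitrary; tune as needed */
        size_t done = 0;
        while (done < total) {
            size_t n = total - done;
            if (n > chunk)
                n = chunk;
            MPI_Allreduce(MPI_IN_PLACE, buf + done, (int)n,
                          MPI_DOUBLE, MPI_SUM, comm);
            done += n;
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* Small demo size; in real runs 'total' may exceed INT_MAX elements. */
        size_t total = (size_t)1 << 20;
        double *buf = malloc(total * sizeof(double));
        for (size_t i = 0; i < total; i++)
            buf[i] = 1.0;
        chunked_allreduce_sum(buf, total, MPI_COMM_WORLD);
        free(buf);
        MPI_Finalize();
        return 0;
    }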

Second, our experience with every transport, MPI implementation, and CPU
type we have tried so far shows that the overall performance of (all)reduce
on very large arrays is usually worse than that of a sequence of calls on
smaller chunks. While it is hard to predict the optimal chunk size, it can
easily be found experimentally (see the second sketch below).
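
As for finding the chunk size, we simply time the same total volume with a
few candidate sizes and keep the fastest. A rough sketch (the array length
and candidate sizes are purely illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Same chunked reduction as above, parameterized by chunk size. */
    static void chunked_allreduce_sum(double *buf, size_t total,
                                      size_t chunk, MPI_Comm comm)
    {
        for (size_t done = 0; done < total; ) {
            size_t n = total - done;
            if (n > chunk)
                n = chunk;
            MPI_Allreduce(MPI_IN_PLACE, buf + done, (int)n,
                          MPI_DOUBLE, MPI_SUM, comm);
            done += n;
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t total = (size_t)1 << 24;   /* ~128 MB of doubles per rank */
        double *buf = malloc(total * sizeof(double));
        for (size_t i = 0; i < total; i++)
            buf[i] = 1.0;

        /* Candidate chunk sizes in elements: one monolithic call, then smaller. */
        const size_t candidates[] = { total, (size_t)1 << 22,
                                      (size_t)1 << 20, (size_t)1 << 18 };

        for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++) {
            MPI_Barrier(MPI_COMM_WORLD);        /* align ranks before timing */
            double t0 = MPI_Wtime();
            chunked_allreduce_sum(buf, total, candidates[c], MPI_COMM_WORLD);
            double t1 = MPI_Wtime();
            if (rank == 0)
                printf("chunk of %zu elements: %.3f s\n", candidates[c], t1 - t0);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }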

> >   Some of our users would like to use Firefly with OpenMPI. Usually, we
> > simply answer them that OpenMPI is too buggy to be used.

> This surprises me.  Is this with regards to this collective/hang issue, or 
> something else?

Yes, this is with regards to the collective/hang issue.

All the best,
Alex




----- Original Message -----
From: "Jeff Squyres" <jsquy...@cisco.com>
To: "Alex A. Granovsky" <g...@classic.chem.msu.su>;
Sent: Saturday, December 03, 2011 3:36 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast


On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:

>    I would like to start a discussion on the implementation of collective
> operations within OpenMPI. The reason for this is at least twofold. In
> recent months, there has been a constantly growing number of messages on
> the list from people facing problems with collectives, so I do believe
> these issues must be discussed and will hopefully finally attract the
> proper attention of the OpenMPI developers. The second reason is my
> involvement in the development of the Firefly Quantum Chemistry package,
> which, of course, uses collectives rather intensively.

Greetings Alex, and thanks for your note.  We take it quite seriously, and had 
a bunch of phone/off-list conversations about it in
the past 24 hours.

Let me shed a little light on the history with regards to this particular 
issue...

- This issue was originally brought to light by LANL quite some time ago.  They 
discovered that one of their MPI codes was hanging
in the middle of lengthy runs.  After some investigation, it was determined 
that it was hanging in the middle of some collective
operations -- MPI_REDUCE, IIRC (maybe MPI_ALLREDUCE?  For the purposes of this 
email, I'll assume MPI_REDUCE).

- It turns out that this application called MPI_REDUCE a *lot*.  Which is not 
uncommon.  However, it was actually a fairly poorly
architected application, such that it was doing things like repeatedly invoking 
MPI_REDUCE on single variables rather than bundling
them up into an array and computing them all with a single MPI_REDUCE (for 
example).  Calling MPI_REDUCE a lot is not necessarily a
problem, per se, however -- MPI guarantees that this is supposed to be ok.  
I'll bring up below why I mention this specific point.

- After some investigating at LANL, they determined that putting a barrier in 
every N iterations caused the hangs to stop.  A little
experimentation determined that running a barrier every 1000 collective 
operations both did not affect performance in any noticeable
way and avoided whatever the underlying problem was.

- The user did not want to add the barriers to their code, so we added another 
collective module that internally counts collective
operations and invokes a barrier every N iterations (where N is settable via 
MCA parameter).  We defaulted N to 1000 because it
solved LANL's problems.  I do not recall offhand whether we experimented to see 
if we could make N *more* than 1000 or not.
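
For reference, the count is controlled by the coll sync MCA parameters
(please double-check the exact names and defaults on your install with
ompi_info), along the lines of:

    # inject a barrier before every 1000th collective (roughly the default)
    mpirun --mca coll_sync_barrier_before 1000 ./your_app

    # or take the sync module out of the picture entirely
    mpirun --mca coll ^sync ./your_app

    # list the sync component's parameters and defaults
    ompi_info --param coll sync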

- Compounding the difficulty of this investigation was the fact that other Open 
MPI community members had an incredibly difficult
time reproducing the problem.  I don't think that I was able to reproduce the 
problem at all, for example.  I just took Ralph's old
reproducers and tried again, and am unable to make OMPI 1.4 or OMPI 1.5 hang.  
I actually modified his reproducers to make them a
bit *more* abusive (i.e., flood rank 0 with even *more* unexpected incoming 
messages), but I still can't get it to hang.

- To be clear: due to personnel changes at LANL at the time, there was very 
little experience in the MPI layer at LANL (Ralph, who
was at LANL at the time, is the ORTE guy -- he actively stays out of the MPI 
layer whenever possible).  The application that
generated the problem was on restricted / un-shareable networks, so no one else 
in the OMPI community could see them.  So:

  - no one else could replicate the problem
  - no OMPI layer expert could see the application that caused the problem

This made it *extremely* difficult to diagnose.  As such, the 
barrier-every-N-iterations solution was deemed sufficient.

- There were some *suppositions* about what the real problem was, but we were 
never able to investigate properly, due to the
conditions listed above.  The suppositions included:

  - some kind of race condition where an incoming message is dropped.  This 
seemed unlikely, however, because if we were dropping
messages, that kind of problem should have showed up long ago
  - resource exhaustion.  There are 3 documented issues with Open MPI running 
out of registered memory (one of which is just about
to get fixed).  See:

       https://svn.open-mpi.org/trac/ompi/ticket/2295
       https://svn.open-mpi.org/trac/ompi/ticket/2155
       https://svn.open-mpi.org/trac/ompi/ticket/2157 (this one is about to be 
fixed)

    It *could* be an issue with running out of registered memory, but 
preliminary investigation indicated that it *might* not have
been.  However, this investigation was hampered by the factors above, and 
therefore was not completed (and therefore was not
definitive).

FWIW, LANL now does have additional OMPI-level experts on staff, but the one 
problematic application that showed this behavior has
been re-written/modernized and no longer exhibits the problem.  Hence, no one 
can justify reviving the old, complex, legacy code to
figure out what, if any, was the actual problem.

- Since no one else was able to replicate the problem, we determined that the 
barrier-every-N-iterations solution was sufficient.
We added the sync module to OMPI v1.4 and v1.5, and made it the default.  It 
solved LANL's problems and didn't affect performance in
a noticeable way: problem solved, let's move on to the next issue.

- The most recent report about this issue had the user claim that they had to 
set the iteration count down to *5* (vs. 1000) before
their code worked right.  This did set off alarm bells in my head -- *5* is 
waaaay too small of a number.  That's why I specifically
asked if there was a way we could get a reproducer for that issue -- it would 
(hopefully) be a smoking gun pointing to whatever the
actual underlying issue was.  Unfortunately, the user had a good enough 
solution and moved on, so a reproducer wasn't possible with
available resources.  That being said, given that the number the user had to 
use was *5*, I wonder if there is some other problem /
race condition in the application itself.  Keep in mind that just because an 
application runs with one MPI implementation doesn't
mean that it is correct / conformant.  But without a detailed analysis of the 
problematic application code, it's impossible to say.

- Per the "the original LANL code was poorly architected" comment above, it 
falls into this same category: we don't actually know if
the application itself was correct.  Since there were no MPI experts available 
at LANL at the time, the MPI application itself was
not analyzed to see if it, itself, was correct.  To be clear: it is *possible* 
that OMPI is correct in hanging because the
application itself is invalid.  That sounds like me avoiding responsibility, 
but it is a possibility that cannot be ignored.  We've
run into *lots* of faulty user applications that, once corrected, run just fine.
 But that being said, we don't *know* that the
application was faulty (nor did we assume it), because a proper analysis could
not be done either on that code or on what was
happening inside OMPI.  So we don't know where the problem was.

So -- that's why we are where we are today.  Basically: a) this issue seemed to 
happen to a *very* small number of users, and b) no
one has created a reproducer that MPI experts can use to reliably diagnose the 
actual problem.

My only point in this lengthy recitation of history: there are (good) reasons 
why we are where we are.

All that being said, however, if a) and/or b) are incorrect -- e.g., if you 
have a simple reproducer code that can exhibit the
problem -- that would be *great*.  I'd also apologize, because we honestly 
thought this was a problem that had affected a very small
number of people and that the coll sync workaround fixed the issue for everyone 
in an un-noticeable way.

>   Some of our users would like to use Firefly with OpenMPI. Usually, we
> simply answer them that OpenMPI is too buggy to be used.

This surprises me.  Is this with regards to this collective/hang issue, or 
something else?  I don't see prior emails from you
indicating any specific bugs -- did I miss them?  It would be good to get 
whatever the issues are fixed.

Do you have some specific issues that you could report to us?

More specifically, do you have a simple reproducer that shows the collective 
hangs when the coll sync module is disabled?  That
would be most helpful.

If you're still reading this super lengthy email :-), many thanks for your time 
for a) reporting the issue, and b) reading my huge
reply!

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



