This is not a problem in the current code base.

Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send call succeeded, but the fragment has not actually gone out on the wire yet (e.g., the openib BTL buffered it because it ran out of credits).
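
For concreteness, a minimal sketch of what the check could look like from the PML's side -- the OMPI_NOT_ON_WIRE name is the one from Galen's proposal; everything else below is illustrative, not actual Open MPI code:

    /* Illustrative only -- not the real Open MPI definitions. */
    enum { OMPI_SUCCESS = 0, OMPI_NOT_ON_WIRE = 1 };

    /* Stub standing in for the real btl_send() entry point. */
    static int btl_send_stub(void *frag) { (void)frag; return OMPI_NOT_ON_WIRE; }

    static void pml_send_fragment(void *frag)
    {
        int rc = btl_send_stub(frag);
        if (OMPI_NOT_ON_WIRE == rc) {
            /* The BTL accepted the fragment but only buffered it (e.g. the
             * openib BTL ran out of credits); the PML must not mark MPI
             * completion until the fragment is actually on the wire. */
        }
    }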

Read these two messages again to get the context:

    http://www.open-mpi.org/community/lists/devel/2007/10/2486.php
    http://www.open-mpi.org/community/lists/devel/2007/10/2487.php

Gleb describes the recursive problem (paired with the concept of NOT_ON_WIRE) nicely in his post.

Make sense?



On Nov 7, 2007, at 1:16 PM, George Bosilca wrote:


On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:

The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, whereas in the "normal case" it gets called from the first level of the recursion. Or maybe I miss something here ...

Right -- it's not the callback that is the problem.  It's that when the recursion unwinds, further up the stack you now have a stale request.

That's exactly the point that I fail to see. If the request is freed in the PML callback, then it should get released in both cases, and therefore lead to problems all the time -- which, obviously, is not what happens when we do not have this deep recursion going on.

Moreover, the request management is based on reference counting. The PML level holds one ref count and the MPI level holds another one. In fact, we cannot release a request until we explicitly call ompi_request_free on it. The place where this call happens differs between the blocking and non-blocking calls. In the non-blocking case ompi_request_free gets called from the *_test (*_wait) functions, while in the blocking case it gets called directly from the MPI_Send function.

Let me summarize: a request cannot reach a stale state without a call to ompi_request_free. This function is never called directly from the PML level. Therefore, the recursion depth should not have any impact on the state of the request!
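
To make the two-reference idea concrete, here is a minimal sketch -- made-up types and names, only meant to illustrate the ownership, not the real ompi_request_t code:

    /* Simplified model: one reference held by the PML, one by the MPI layer. */
    typedef struct {
        int ref_count;     /* starts at 2: PML reference + MPI reference */
        int complete;      /* set when the PML effects completion        */
    } fake_request_t;

    static void request_release(fake_request_t *req)
    {
        if (0 == --req->ref_count) {
            /* only now can the request go back to its free list */
        }
    }

    /* PML completion path: marks completion and drops the PML reference,
     * but can never make the request stale by itself. */
    static void pml_complete(fake_request_t *req)
    {
        req->complete = 1;
        request_release(req);
    }

    /* The MPI reference is dropped via ompi_request_free(): from *_test /
     * *_wait in the non-blocking case, or directly from MPI_Send in the
     * blocking case -- never from inside the PML. */
    static void mpi_request_free(fake_request_t *req)
    {
        request_release(req);
    }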

Is there a simple test case I can run in order to trigger this strange behavior?

 Thanks,
   george.






This is *only* a problem for requests that are involved in the current top-level MPI call.  Requests from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds.

Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better name :-) ).

3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()).

4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request.

5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed.

Note that with this scheme, it becomes irrelevant whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress.
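
Here is a rough sketch of how I read steps 2-5 (placeholder names and types, not a real patch):

    #define PML_REQ_DONT_FREE_ME 0x1         /* placeholder flag name */

    typedef struct {
        int flags;
        int complete;
    } pml_request_t;                         /* stand-in for the real request */

    static void pml_request_free(pml_request_t *req) { (void)req; }

    /* Step 4: the completion path effects completion, but honors the flag. */
    static void pml_completion(pml_request_t *req)
    {
        req->complete = 1;
        if (0 == (req->flags & PML_REQ_DONT_FREE_ME)) {
            pml_request_free(req);           /* what the PML does today */
        }
    }

    /* Steps 3 and 5a: the blocking send sets the flag before btl_send()
     * and unconditionally frees the request once the top-level call is
     * done -- so it no longer matters whether completion happened on the
     * first descent into the BTL or deep inside a recursive progress. */
    static void pml_blocking_send(pml_request_t *req)
    {
        req->flags |= PML_REQ_DONT_FREE_ME;
        /* ... btl_send(), then wait for req->complete ... */
        pml_request_free(req);
    }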

How does that sound?

If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it.



On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:

So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user-level flow control).

One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment, and then MPI_WAITALL and others will do their job properly.
I even implemented this once, but there is a problem. Currently we mark the request as completed at the MPI level and then do btl_send(). Whenever the IB completion happens, the request is marked as complete at the PML level and freed. The fix requires changing the order like this: call btl_send(), check the return value from the BTL, and mark the request complete as necessary. The problem is that, because we allow the BTL to call opal_progress() internally, the request may already be completed at the MPI and PML levels and freed before the call to btl_send() returns.
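
Schematically, the two orderings look like this (illustrative names only, not the actual ob1 code):

    enum { OMPI_SUCCESS = 0, OMPI_NOT_ON_WIRE = 1 };    /* illustrative */

    typedef struct { int mpi_complete; } request_t;     /* made-up types */
    typedef struct { int unused; } frag_t;

    static int  btl_send_stub(frag_t *f) { (void)f; return OMPI_SUCCESS; }
    static void mark_mpi_complete(request_t *r) { r->mpi_complete = 1; }

    /* Current order: MPI completion is marked before the BTL ever sees
     * the fragment, so a buffered (not-on-wire) fragment already looks
     * complete to MPI_WAITALL. */
    static void send_today(request_t *req, frag_t *frag)
    {
        mark_mpi_complete(req);
        btl_send_stub(frag);
    }

    /* Proposed order: mark completion only if the fragment really went
     * out. The catch: btl_send() may call opal_progress() internally,
     * the IB completion for this very fragment can fire inside that
     * progress call, and the request can be completed and freed before
     * btl_send() returns -- so touching req afterwards is use-after-free. */
    static void send_reordered(request_t *req, frag_t *frag)
    {
        int rc = btl_send_stub(frag);        /* may recurse into progress */
        if (OMPI_SUCCESS == rc) {
            mark_mpi_complete(req);          /* req may already be stale! */
        }
    }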

I did a code review to see how hard it would be to get rid of recursion in Open MPI, and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from the BTL and from ULP callbacks that are called by the BTL. There are not many places that break this rule. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if the free list can grow without limit, and this is the most common use of FREE_LIST_WAIT(), so they may be safely changed to FREE_LIST_GET(). After we solve the recursion problem, the fix to the original problem will be a couple of lines of code.
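
For reference, the distinction between the two macros is roughly this (a simplified sketch, not the real FREE_LIST_WAIT()/FREE_LIST_GET() bodies):

    typedef struct { int unused; } item_t;

    static item_t *try_alloc(void)   { static item_t it; return &it; }  /* stub */
    static void    do_progress(void) { /* stands in for opal_progress() */ }

    /* FREE_LIST_GET: try to take/grow an item; on failure just return NULL.
     * Never calls progress, so it is safe inside BTL and ULP callbacks. */
    static item_t *free_list_get(void)
    {
        return try_alloc();
    }

    /* FREE_LIST_WAIT: on failure, spin in progress until an item shows up.
     * This is where the unwanted recursion can start. */
    static item_t *free_list_wait(void)
    {
        item_t *item;
        while (NULL == (item = try_alloc())) {
            do_progress();
        }
        return item;
    }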


- Galen



On 10/11/07 11:26 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:

On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
David --

Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib btl because it doesn't seem to happen on other networks, but who knows...).

Gleb is investigating.
Here is the result of the investigation. The problem is different from the one in ticket #1015. What we have here is one rank calling isend() of a small message and wait_all() in a loop, while another one calls irecv(). The problem is that isend() usually doesn't call opal_progress() anywhere, and wait_all() doesn't call progress if all requests are already completed, so messages are never progressed. We may force opal_progress() to be called by setting btl_openib_free_list_max to 1000; then wait_all() will call progress because not every request will be immediately completed by OB1. Or we can limit the number of uncompleted requests that OB1 can allocate by setting pml_ob1_free_list_max to 1000; then opal_progress() will be called from free_list_wait() when the max is reached. The second option works much faster for me.
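
The communication pattern boils down to something like the sender loop below -- an illustrative snippet, not David's attached bcast-hang.c. With unlimited free lists every isend() is marked MPI-complete right away, so wait_all() has nothing left to wait on and opal_progress() never runs:

    /* Illustrative sender side only; the receiver posts matching irecvs. */
    #include <mpi.h>
    #include <string.h>

    #define NREQ 16

    static void sender_loop(int peer, int iters)
    {
        char        buf[NREQ][64];
        MPI_Request reqs[NREQ];
        int         it, i;

        memset(buf, 0, sizeof(buf));
        for (it = 0; it < iters; ++it) {
            for (i = 0; i < NREQ; ++i) {
                /* Small eager send: ob1 marks it MPI-complete immediately,
                 * even if the openib BTL only buffered the fragment. */
                MPI_Isend(buf[i], 64, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[i]);
            }
            /* All requests already look complete, so this returns without
             * ever calling opal_progress() -- nothing pushes the buffered
             * fragments out, and the receiver side starves. */
            MPI_Waitall(NREQ, reqs, MPI_STATUSES_IGNORE);
        }
    }

With, e.g., "mpirun --mca pml_ob1_free_list_max 1000 ..." the request free list stops growing at 1000 entries, free_list_wait() starts calling progress, and the loop behaves.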




On Oct 5, 2007, at 12:59 AM, David Daniel wrote:

Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast).

The code will rapidly slow to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with sigbus or sigterm.

I see this with

openmpi-1.2.3 or openmpi-1.2.4
ofed 1.2
linux 2.6.19 + patches
gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
4 socket, dual core opterons

run as

mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now uneducated eye it looks as if the root process is rushing ahead and not progressing earlier bcasts.

Anyone else seeing similar?  Any ideas for workarounds?

As a point of reference, mvapich2 0.9.8 works fine.

Thanks, David


<bcast-hang.c>