Re: [OMPI devel] collective problems

2007-11-08 Thread Patrick Geoffray
Hi Gleb, Gleb Natapov wrote: In the case of TCP, kernel is kind enough to progress message for you, but only if there was enough space in a kernel internal buffers. If there was no place there, TCP BTL will also buffer messages in userspace and will, eventually, have the same problem. Occasion

Re: [OMPI devel] collective problems

2007-11-08 Thread George Bosilca
Decrease the latency is the main reason. If we delay the MPI completion, then we always have to call opal_progress at least once in order to allow the BTL to trigger the callback. In the current implementation, we never call opal_progress on small messages, unless there is some kind of reso
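To make the latency trade-off concrete, here is a minimal, self-contained sketch (hypothetical names only, not the actual Open MPI PML/BTL code): marking MPI completion as soon as the fragment is handed to the BTL avoids any progress call on the fast path, while deferring completion until the BTL's local callback forces at least one pass through the progress engine per send.

/* Hedged sketch only: hypothetical, simplified stand-ins for the PML send
 * path discussed in the thread -- not the real Open MPI API or signatures. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { volatile bool complete; } request_t;

static bool btl_callback_fired;          /* set by the (simulated) local completion callback */
static void simulated_progress(void) { btl_callback_fired = true; }  /* stands in for opal_progress() */

/* Current behaviour (as described): mark the request complete as soon as the
 * fragment is handed to the BTL -- no progress call needed on the fast path. */
static void send_eager_mark_now(request_t *req)
{
    /* ... hand fragment to BTL ... */
    req->complete = true;                /* MPI completion marked immediately */
}

/* Alternative being debated: defer completion until the BTL's local
 * completion callback, which forces at least one progress call per send. */
static void send_eager_defer(request_t *req)
{
    /* ... hand fragment to BTL ... */
    while (!btl_callback_fired)          /* extra latency: must spin in progress */
        simulated_progress();
    req->complete = true;
}

int main(void)
{
    request_t a = { false }, b = { false };
    send_eager_mark_now(&a);
    send_eager_defer(&b);
    printf("a complete=%d, b complete=%d\n", a.complete, b.complete);
    return 0;
}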

Re: [OMPI devel] collective problems

2007-11-08 Thread Andrew Friedley
Brian Barrett wrote: > Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not mar

Re: [OMPI devel] collective problems

2007-11-08 Thread Richard Graham
On 11/8/07 4:03 AM, "Gleb Natapov" wrote: > On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote: >> Richard Graham wrote: >>> The real problem, as you and others have pointed out is the lack of >>> predictable time slices for the progress engine to do its work, when relying >>> on

Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote: > Richard Graham wrote: > > The real problem, as you and others have pointed out is the lack of > > predictable time slices for the progress engine to do its work, when relying > > on the ULP to make calls into the library... > > Th

Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 01:16:04PM -0500, George Bosilca wrote: > > On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: > >>> The same callback is called in both cases. In the case that you >>> described, the callback is called just a little bit deeper into the >>> recursion, when in the "normal case"

Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 09:07:23PM -0700, Brian Barrett wrote: > Personally, I'd rather just not mark MPI completion until a local > completion callback from the BTL. But others don't like that idea, so > we came up with a way for back pressure from the BTL to say "it's not > on the wire yet

Re: [OMPI devel] collective problems

2007-11-07 Thread Richard Graham
On 11/8/07 12:25 AM, "Patrick Geoffray" wrote: > Richard Graham wrote: >> The real problem, as you and others have pointed out is the lack of >> predictable time slices for the progress engine to do its work, when relying >> on the ULP to make calls into the library... > > The real, real prob

Re: [OMPI devel] collective problems

2007-11-07 Thread Shipman, Galen M.
The lengths we go to avoid progress :-) On 11/7/07 10:19 PM, "Richard Graham" wrote: > The real problem, as you and others have pointed out is the lack of > predictable time slices for the progress engine to do its work, when relying > on the ULP to make calls into the library... > > Rich >

Re: [OMPI devel] collective problems

2007-11-07 Thread Patrick Geoffray
Richard Graham wrote: The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... The real, real problem is that the BTL should handle progression at their level, s

Re: [OMPI devel] collective problems

2007-11-07 Thread Richard Graham
The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... Rich On 11/8/07 12:07 AM, "Brian Barrett" wrote: > As it stands today, the problem is that we can inject

Re: [OMPI devel] collective problems

2007-11-07 Thread Brian Barrett
As it stands today, the problem is that we can inject things into the BTL successfully that are not injected into the NIC (due to software flow control). Once a message is injected into the BTL, the PML marks completion on the MPI request. If it was a blocking send that got marked as comp
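The hazard described here can be illustrated with a toy model (invented names, not Open MPI internals): the BTL "accepts" a fragment but only queues it because it is out of send credits, yet the PML still marks the MPI request complete, so a blocking send would return with nothing on the wire.

/* Hedged sketch: a toy model of the hazard described above. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool mpi_complete; bool on_wire; } toy_request_t;

static int send_credits = 0;             /* software flow control: no credits left */
static toy_request_t *pending_fragment;  /* BTL's internal queue (one slot for brevity) */

/* Toy btl_send: reports success either way, which is exactly the ambiguity discussed. */
static int toy_btl_send(toy_request_t *req)
{
    if (send_credits > 0) {
        send_credits--;
        req->on_wire = true;             /* really handed to the NIC */
    } else {
        pending_fragment = req;          /* only buffered inside the BTL */
    }
    return 0;                            /* "success" in both cases */
}

static void toy_pml_send(toy_request_t *req)
{
    toy_btl_send(req);
    req->mpi_complete = true;            /* marked complete regardless of on_wire */
}

int main(void)
{
    toy_request_t req = { false, false };
    toy_pml_send(&req);
    printf("MPI complete=%d, actually on wire=%d\n", req.mpi_complete, req.on_wire);
    return 0;
}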

Re: [OMPI devel] collective problems

2007-11-07 Thread Richard Graham
Does this mean that we don't have a queue to store btl level descriptors that are only partially complete? Do we do an all or nothing with respect to btl level requests at this stage? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do

Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres
On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). S

Re: [OMPI devel] collective problems

2007-11-07 Thread Patrick Geoffray
Jeff Squyres wrote: This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ra

Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres
This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits).
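For context, a minimal sketch of the back-pressure idea under discussion (the return-code constant, callback, and all names are hypothetical illustrations, not the actual BTL interface): a "not on the wire" return tells the PML to defer MPI completion until the local completion callback fires, typically from a later progress pass.

/* Hedged sketch of the proposed back pressure; invented names throughout. */
#include <stdbool.h>
#include <stdio.h>

enum { TOY_SENT_ON_WIRE = 0, TOY_NOT_ON_WIRE = 1 };   /* hypothetical codes */

typedef struct toy_req { bool mpi_complete; } toy_req_t;

static toy_req_t *deferred;              /* request waiting for its local callback */

static int toy_btl_send(toy_req_t *req, bool have_credits)
{
    (void)req;
    return have_credits ? TOY_SENT_ON_WIRE : TOY_NOT_ON_WIRE;
}

/* Local completion callback, e.g. driven later by the progress engine. */
static void toy_btl_completion_cb(toy_req_t *req) { req->mpi_complete = true; }

static void toy_pml_send(toy_req_t *req, bool have_credits)
{
    if (toy_btl_send(req, have_credits) == TOY_NOT_ON_WIRE)
        deferred = req;                  /* back pressure: do NOT mark completion yet */
    else
        req->mpi_complete = true;        /* safe to complete immediately */
}

int main(void)
{
    toy_req_t fast = { false }, slow = { false };
    toy_pml_send(&fast, true);
    toy_pml_send(&slow, false);
    if (deferred) toy_btl_completion_cb(deferred);   /* progress later fires the callback */
    printf("fast=%d slow=%d\n", fast.mpi_complete, slow.mpi_complete);
    return 0;
}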

Re: [OMPI devel] collective problems

2007-11-07 Thread George Bosilca
On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss som

Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres
On Nov 7, 2007, at 12:29 PM, George Bosilca wrote: I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right?

Re: [OMPI devel] collective problems

2007-11-07 Thread George Bosilca
On Nov 7, 2007, at 11:06 AM, Jeff Squyres wrote: Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that r

Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres
Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? Specifically, if in an MPI_SEND, the B
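The "request may get freed before recursion unwinds" concern has the following shape, sketched here with invented names (only an illustration of the hazard, not the thread's resolution): the send path recurses into a progress loop, the completion callback frees the request inside that recursion, and the outer frame must not touch the stale pointer afterwards.

/* Hedged sketch of the "request freed before the recursion unwinds" hazard. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { bool freed_by_callback; } toy_req_t;

/* Completion callback: in the scenario discussed, this may free the request. */
static void toy_completion_cb(toy_req_t **req_slot)
{
    (*req_slot)->freed_by_callback = true;
    free(*req_slot);
    *req_slot = NULL;                    /* caller's handle is the only safe record */
}

/* Progress may run callbacks, i.e. it can free requests out from under callers. */
static void toy_progress(toy_req_t **req_slot)
{
    if (*req_slot)
        toy_completion_cb(req_slot);
}

static void toy_pml_send(toy_req_t **req_slot)
{
    /* ... hand fragment to BTL; suppose it must wait for resources ... */
    toy_progress(req_slot);              /* recursion point: callback may fire here */

    /* After this returns, *req_slot may already be gone; touching the old
     * pointer here would be the use-after-free the thread is worried about. */
    if (*req_slot == NULL)
        printf("request completed (and freed) inside the recursion\n");
}

int main(void)
{
    toy_req_t *req = calloc(1, sizeof *req);
    toy_pml_send(&req);
    return 0;
}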

Re: [OMPI devel] collective problems

2007-10-23 Thread Gleb Natapov
On Tue, Oct 23, 2007 at 09:40:45AM -0400, Shipman, Galen M. wrote: > So this problem goes WAY back.. > > The problem here is that the PML marks MPI completion just prior to calling > btl_send and then returns to the user. This wouldn't be a problem if the BTL > then did something, but in the case

Re: [OMPI devel] collective problems

2007-10-23 Thread Shipman, Galen M.
So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user lev
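The user-level flow control being referred to can be modeled in a few lines (all names invented for illustration; this is not the openib BTL code): a fragment that arrives when credits are exhausted sits in a software queue, and only a later progress pass that sees returned credits actually puts it on the wire.

/* Hedged toy model of credit-based flow control and the deferred drain. */
#include <stdio.h>

#define QUEUE_MAX 8

static int credits = 1;                  /* pretend the peer granted one credit */
static int queued[QUEUE_MAX];            /* fragment ids waiting for credits */
static int queue_len, on_wire_count;

static void toy_send_fragment(int frag_id)
{
    if (credits > 0) {
        credits--;
        on_wire_count++;                 /* went straight to the NIC */
    } else if (queue_len < QUEUE_MAX) {
        queued[queue_len++] = frag_id;   /* buffered; caller still sees success */
    }
}

/* Stand-in for a progress pass that notices returned credits and drains. */
static void toy_progress(int credits_returned)
{
    credits += credits_returned;
    while (queue_len > 0 && credits > 0) {
        credits--;
        queue_len--;
        on_wire_count++;                 /* only now does the fragment hit the wire */
    }
}

int main(void)
{
    toy_send_fragment(1);                /* uses the only credit */
    toy_send_fragment(2);                /* queued: not on the wire yet */
    printf("before progress: on wire=%d queued=%d\n", on_wire_count, queue_len);
    toy_progress(1);                     /* ACK returns a credit; queue drains */
    printf("after progress:  on wire=%d queued=%d\n", on_wire_count, queue_len);
    return 0;
}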

Re: [OMPI devel] collective problems

2007-10-11 Thread Gleb Natapov
On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: > David -- > > Gleb and I just actively re-looked at this problem yesterday; we > think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ > 1015. We previously thought this ticket was a different problem, but > our analys

Re: [OMPI devel] collective problems

2007-10-05 Thread Jeff Squyres
David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or

[OMPI devel] collective problems

2007-10-04 Thread David Daniel
Hi Folks, I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast). The code will rapidly slow to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with sigbus or sigterm. I see this w