On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
> If you are going to look at it, I will not bother with this.
I need the code to reproduce the problem. Otherwise I have nothing to
look at. 

> 
> Rich
> 
> 
> On 8/29/07 10:47 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> 
> > On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
> >> Gleb,
> >>   Are you looking at this ?
> > Not today. And I need the code to reproduce the bug. Is this possible?
> > 
> >> 
> >> Rich
> >> 
> >> 
> >> On 8/29/07 9:56 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> >> 
> >>> On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
> >>>> Is this trunk or 1.2?
> >>> Oops. I should read more carefully :) This is trunk.
> >>> 
> >>>> 
> >>>> On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
> >>>>> I have a program that does a simple bucket brigade of sends and receives
> >>>>> where rank 0 is the start and repeatedly sends to rank 1 until a certain
> >>>>> amount of time has passed, and then it sends an all-done packet (a sketch
> >>>>> of this pattern appears after the quoted thread below).
> >>>>> 
> >>>>> Running this under np=2 always works.  However, when I run with more
> >>>>> than 2 processes using only the SM BTL, the program usually hangs and
> >>>>> one of the processes has a deep stack containing many repetitions of
> >>>>> the following three calls:
> >>>>> 
> >>>>>  [25] opal_progress(), line 187 in "opal_progress.c"
> >>>>>   [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
> >>>>>   [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
> >>>>> 
> >>>>> When stepping through the ompi_fifo_write_to_head routine it looks like
> >>>>> the fifo has overflowed.
> >>>>> 
> >>>>> I am wondering whether what is happening is that rank 0 has sent so
> >>>>> many messages that it has exhausted the resources, so that one of the
> >>>>> middle ranks, which is in the middle of a send, cannot complete it and
> >>>>> therefore never gets to the point of trying to receive the messages
> >>>>> from rank 0.
> >>>>> 
> >>>>> Is that a possible scenario, or are messages periodically bled off
> >>>>> the SM BTL's fifos?
> >>>>> 
> >>>>> Note that I have seen np=3 pass sometimes, and I can get it to pass
> >>>>> reliably if I raise the shared memory space used by the BTL.  This is
> >>>>> using the trunk.
> >>>>> 
> >>>>> 
> >>>>> --td
> >>>>> 
> >>>>> 
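
For reference, below is a minimal sketch of the bucket-brigade pattern Terry
describes above. His actual reproducer is not included in this thread, so the
message size, run time, tag values, and forwarding details are assumptions,
not his code; it is only meant to show the shape of the test.

/* Hypothetical reconstruction of the bucket-brigade test described above.
 * Message size, run time, and tags are assumed values. */
#include <mpi.h>
#include <string.h>

#define MSG_LEN   1024          /* assumed payload size */
#define RUN_SECS  10.0          /* assumed run time before the all-done packet */
#define TAG_DATA  1
#define TAG_DONE  2

int main(int argc, char **argv)
{
    int rank, size;
    char buf[MSG_LEN];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, MSG_LEN);

    if (size < 2) {             /* the brigade needs at least two ranks */
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        /* Rank 0 feeds the brigade until the time limit, then signals done. */
        double start = MPI_Wtime();
        while (MPI_Wtime() - start < RUN_SECS) {
            MPI_Send(buf, MSG_LEN, MPI_CHAR, 1, TAG_DATA, MPI_COMM_WORLD);
        }
        MPI_Send(buf, MSG_LEN, MPI_CHAR, 1, TAG_DONE, MPI_COMM_WORLD);
    } else {
        /* Middle ranks forward everything to the next rank; the last rank
         * only receives.  A middle rank that blocks in MPI_Send never posts
         * its next receive, which is the scenario suspected of overflowing
         * the SM BTL fifo. */
        MPI_Status status;
        do {
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, rank - 1, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (rank + 1 < size) {
                MPI_Send(buf, MSG_LEN, MPI_CHAR, rank + 1, status.MPI_TAG,
                         MPI_COMM_WORLD);
            }
        } while (status.MPI_TAG != TAG_DONE);
    }

    MPI_Finalize();
    return 0;
}

Run with something like "mpirun -np 3 --mca btl self,sm ./brigade" to restrict
traffic to the shared-memory BTL, per the description above.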

--
                        Gleb.
