On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
> If you are going to look at it, I will not bother with this.
I need the code to reproduce the problem. Otherwise I have nothing to look at.
>
> Rich
>
>
> On 8/29/07 10:47 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>
> > On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
> >> Gleb,
> >> Are you looking at this ?
> > Not today. And I need the code to reproduce the bug. Is this possible?
> >
> >>
> >> Rich
> >>
> >>
> >> On 8/29/07 9:56 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> >>
> >>> On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
> >>>> Is this trunk or 1.2?
> >>> Oops. I should read more carefully :) This is trunk.
> >>>
> >>>>
> >>>> On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
> >>>>> I have a program that does a simple bucket brigade of sends and receives
> >>>>> where rank 0 is the start and repeatedly sends to rank 1 until a certain
> >>>>> amount of time has passed and then it sends an all-done packet.
> >>>>>
> >>>>> Running this under np=2 always works. However, when I run with greater
> >>>>> than 2 using only the SM btl the program usually hangs and one of the
> >>>>> processes has a long stack that has a lot of the following 3 calls in
> >>>>> it:
> >>>>>
> >>>>> [25] opal_progress(), line 187 in "opal_progress.c"
> >>>>> [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
> >>>>> [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
> >>>>>
> >>>>> When stepping through the ompi_fifo_write_to_head routine it looks like
> >>>>> the fifo has overflowed.
> >>>>>
> >>>>> I am wondering if what is happening is rank 0 has sent a bunch of
> >>>>> messages that have exhausted the
> >>>>> resources such that one of the middle ranks which is in the process of
> >>>>> sending cannot send and therefore
> >>>>> never gets to the point of trying to receive the messages from rank 0?
> >>>>>
> >>>>> Is the above a possible scenario or are messages periodically bled off
> >>>>> the SM BTL's fifos?
> >>>>>
> >>>>> Note, I have seen np=3 pass sometimes and I can get it to pass reliably
> >>>>> if I raise the shared memory space used by the BTL. This is using the
> >>>>> trunk.
> >>>>>
> >>>>>
> >>>>> --td
> >>>>>
> >>>>>

--
Gleb.
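
P.S. Until the real test program is available, the scenario Terry describes can at least be illustrated with a toy model. The sketch below is not Open MPI code and does not use the SM BTL. It assumes a single shared pool of fragments, a greedy rank 0, one fixed scheduling order, and a middle rank that cannot receive while it is waiting for a fragment to send. All names (run_brigade, pool_size, drain_while_blocked) are hypothetical. Whether the real BTL behaves like this is exactly the open question.

```python
def run_brigade(np_ranks, pool_size, n_msgs, drain_while_blocked=False):
    """Toy bucket brigade over a shared fragment pool (hypothetical model).

    Rank 0 sends n_msgs messages to rank 1, middle ranks forward to the
    next rank, and the last rank only receives. Every message in flight
    holds one fragment from the shared pool until its receiver consumes it.
    Returns (delivered, stalled). Requires np_ranks >= 2.
    """
    free = pool_size              # fragments still available in the pool
    inbox = [0] * np_ranks        # messages queued for each rank
    held = [0] * np_ranks         # messages received but not yet forwarded
    remaining = n_msgs            # messages rank 0 still has to send
    delivered = 0
    last = np_ranks - 1

    while delivered < n_msgs:
        progressed = False

        # Rank 0 is greedy: it sends for as long as fragments are available.
        while remaining > 0 and free > 0:
            free -= 1
            inbox[1] += 1
            remaining -= 1
            progressed = True

        # Middle ranks take one action per round.
        for r in range(1, last):
            if held[r] > 0 and free > 0:
                # Forward one held message to the next rank.
                free -= 1
                inbox[r + 1] += 1
                held[r] -= 1
                progressed = True
            elif inbox[r] > 0 and (held[r] == 0 or drain_while_blocked):
                # Receive one message, which returns its fragment to the pool.
                # A rank blocked in a send only does this if draining is on.
                inbox[r] -= 1
                free += 1
                held[r] += 1
                progressed = True

        # The last rank only receives.
        if inbox[last] > 0:
            inbox[last] -= 1
            free += 1
            delivered += 1
            progressed = True

        if not progressed:
            return delivered, True    # nobody can move: the model has stalled

    return delivered, False


if __name__ == "__main__":
    print("np=2, pool=2:        ", run_brigade(2, 2, 10))
    print("np=3, pool=2:        ", run_brigade(3, 2, 10))
    print("np=3, pool=10:       ", run_brigade(3, 10, 10))
    print("np=3, pool=2, drain: ", run_brigade(3, 2, 10, drain_while_blocked=True))
```

In this model np=2 always completes, np=3 with a small pool stalls, and either a larger pool or draining incoming messages while blocked lets np=3 finish. That matches the reported symptoms, but only under the assumptions stated above.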