On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
> Is this trunk or 1.2?
Oops. I should read more carefully :) This is trunk.
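
For anyone following along, the hypothesis in the report quoted below (rank 0
uses up the shared resources while a middle rank is stuck in its send) can be
illustrated with a toy model. This is only a sketch under loud assumptions: it
does not use Open MPI or the real SM BTL code, and run_brigade, pool_size,
inbox, unexpected and drain_while_blocked are all hypothetical names made up
for the illustration. It assumes a single shared pool of fragments, one action
per rank per round, and rank 0 always going first.

```python
from collections import deque


def run_brigade(num_ranks, pool_size, messages, drain_while_blocked,
                max_rounds=10_000):
    """Toy model: return True if every message reaches the last rank,
    False if no rank can make progress (a stall). Needs num_ranks >= 2."""
    free = pool_size                                  # fragments left in the shared pool
    inbox = [deque() for _ in range(num_ranks)]       # each queued entry holds one fragment
    unexpected = [deque() for _ in range(num_ranks)]  # copied out of the pool, holds none
    pending = [None] * num_ranks                      # message a rank still has to forward
    next_msg = 0
    delivered = 0
    last = num_ranks - 1

    for _ in range(max_rounds):
        progressed = False
        for rank in range(num_ranks):
            if rank == 0:
                # Rank 0 only sends; it needs nothing but a free fragment.
                if next_msg < messages and free > 0:
                    free -= 1
                    inbox[1].append(next_msg)
                    next_msg += 1
                    progressed = True
            elif pending[rank] is not None:
                # Middle rank in its send phase.
                if free > 0:
                    free -= 1
                    inbox[rank + 1].append(pending[rank])
                    pending[rank] = None
                    progressed = True
                elif drain_while_blocked and inbox[rank]:
                    # Assumed behaviour: copy an incoming message aside while
                    # blocked, which returns its fragment to the pool.
                    unexpected[rank].append(inbox[rank].popleft())
                    free += 1
                    progressed = True
            else:
                # Receive phase: take the oldest message first.
                if unexpected[rank]:
                    msg = unexpected[rank].popleft()
                elif inbox[rank]:
                    msg = inbox[rank].popleft()
                    free += 1
                else:
                    continue
                if rank == last:
                    delivered += 1
                else:
                    pending[rank] = msg
                progressed = True
        if delivered == messages:
            return True
        if not progressed:
            return False
    return False
```

In this model, two ranks always finish, three ranks with a small pool stall
unless the middle rank drains its inbox while blocked, and a large enough pool
hides the problem. Whether the real SM BTL behaves this way is exactly the
open question in the report.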

> 
> On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
> > I have a program that does a simple bucket brigade of sends and receives, 
> > where rank 0 is the start and repeatedly sends to rank 1 until a certain 
> > amount of time has passed, and then it sends an all-done packet.
> > 
> > Running this with np=2 always works.  However, when I run with more 
> > than 2 processes using only the SM BTL, the program usually hangs, and 
> > one of the processes has a long stack that repeats the following 3 calls:
> > 
> >   [25] opal_progress(), line 187 in "opal_progress.c"
> >   [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
> >   [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
> > 
> > When stepping through the ompi_fifo_write_to_head routine it looks like 
> > the fifo has overflowed.
> > 
> > I am wondering whether rank 0 has sent enough messages to exhaust the 
> > resources, so that one of the middle ranks, which is in the process of 
> > sending, cannot send and therefore never gets to the point of trying to 
> > receive the messages from rank 0.
> > 
> > Is the above a possible scenario or are messages periodically bled off 
> > the SM BTL's fifos?
> > 
> > Note, I have seen np=3 pass sometimes, and I can get it to pass reliably 
> > if I raise the amount of shared memory used by the BTL.  This is using 
> > the trunk.
> > 
> > 
> > --td
> > 
> > 
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> --
>                       Gleb.

--
                        Gleb.
