I have a program that does a simple bucket brigade of sends and receives: rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed, and then it sends an all-done packet.

Running this with np=2 always works. However, when I run with more than 2 processes using only the SM BTL, the program usually hangs, and one of the processes has a long stack that contains many repetitions of the following 3 calls:

 [25] opal_progress(), line 187 in "opal_progress.c"
 [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
 [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine, it looks like the fifo has overflowed.

I am wondering whether rank 0 has sent so many messages that it has exhausted the resources, so that one of the middle ranks, which is in the process of sending, cannot send and therefore never gets to the point of trying to receive the messages from rank 0.

Is the above a possible scenario or are messages periodically bled off the SM BTL's fifos?

Note, I have seen np=3 pass sometimes and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the trunk.


--td

