Yes, this does sound like the classic "assuming MPI buffering" case. Check out this magazine column that I wrote a long time ago about this topic:

    http://cw.squyres.com/columns/2004-08-CW-MPI-Mechanic.pdf

It's #1 on the top 10 list of All-Time Favorite Evils to Avoid in Parallel. :-)

One comment on Mattijs's email: please don't use bsend. Bsend is evil. :-)



On Jun 13, 2008, at 5:27 AM, Mattijs Janssens wrote:

Sounds like a typical deadlock situation. All processors are waiting for one
another.

Not a specialist, but from what I know, if the messages are small enough they'll
be offloaded to kernel/hardware buffering and there is no deadlock. That's why it
might work for small messages and/or with certain MPI implementations.
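To make the failure mode concrete, here is a minimal sketch (not the original
code; the column length, buffer names, and neighbor pairing are illustrative
assumptions) of the pattern being described: every rank calls MPI_Send before
MPI_Recv. It completes only while the message is small enough for the library
to buffer it eagerly; above that threshold both ranks block in MPI_Send and
the program hangs.

    /* Illustrative sketch: two neighboring ranks exchange a column of
     * length n by each calling MPI_Send first and MPI_Recv second.
     * For small n the sends complete eagerly (the library buffers the
     * data) and this appears to work; once n exceeds the eager limit,
     * both ranks block in MPI_Send waiting for a receive that is never
     * posted -> deadlock. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int n = 100000;    /* column length: the size-dependent "magic number" */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendcol = calloc(n, sizeof(double));
        double *recvcol = calloc(n, sizeof(double));
        int neighbor = (rank % 2 == 0) ? rank + 1 : rank - 1;  /* assumed pairing */

        if (neighbor >= 0 && neighbor < size) {
            /* Both ranks send first: correct only if MPI buffers the message. */
            MPI_Send(sendcol, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
            MPI_Recv(recvcol, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        free(sendcol);
        free(recvcol);
        MPI_Finalize();
        return 0;
    }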

Solutions:
- come up with a global communication schedule, such that whenever one processor
sends, the matching receiver is already receiving.
- use mpi_bsend. Might be slower.
- use mpi_isend, mpi_irecv (but then you'll have to make sure the buffers stay
valid for the duration of the communication; see the sketch just after this list)
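A minimal sketch of the mpi_isend/mpi_irecv option, reusing the variables from
the sketch above (again, names are assumed): post the receive and the send
without blocking, then wait on both. The buffers must not be reused or freed
until MPI_Waitall returns. For a simple pairwise exchange like this,
MPI_Sendrecv (not mentioned in the list above) is a one-call alternative that
also cannot deadlock.

    /* Replace the blocking MPI_Send/MPI_Recv pair with non-blocking calls.
     * sendcol, recvcol, n, and neighbor are as in the earlier sketch. */
    MPI_Request reqs[2];

    MPI_Irecv(recvcol, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendcol, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... other work can overlap with the communication here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* buffers safe to reuse after this */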

On Friday 13 June 2008 01:55, zach wrote:
I have a weird problem that shows up when I use LAM or Open MPI but not
MPICH.

I have a parallelized code working on a really large matrix. It partitions the
matrix column-wise and ships the pieces off to processors, so any given
processor is working on a matrix with the same number of rows as the original
but a reduced number of columns. Each processor needs to send a single column
vector from its own matrix to the adjacent processor, and vice versa, as part
of the algorithm.

I have found that depending on the number of rows of the matrix (i.e., the size
of the vector being sent with MPI_Send/MPI_Recv), the simulation will hang.
Only when I reduce this dimension below a certain maximum does the simulation
run properly. I have also found that this magic number differs depending on the
system I am using, e.g. my home quad-core box versus a remote cluster.

As I mentioned, I have not had this issue with MPICH. I would like to
understand why it is happening rather than just defect to MPICH to get by.

Any help would be appreciated!
zach
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
