We have made a *lot* of changes to the run-time support for spawn and some changes to the FLUSH support in the openib BTL for the upcoming v1.3 series.

Would it be possible for you to try a trunk nightly tarball snapshot, perchance?

    http://www.open-mpi.org/nightly/trunk/


On May 29, 2008, at 3:50 AM, Matt Hughes wrote:

I have a program which uses MPI::Comm::Spawn to start processes on
compute nodes (c0-0, c0-1, etc).  The communication between the
compute nodes consists of ISend and IRecv pairs, while communication
between the head node and the compute nodes consists of gather and
bcast operations.
After executing ~80 successful loops (gather/bcast pairs), I get this
error message from the head node process during a gather call:

[0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
from headnode.local to: c0-0 error polling HP CQ with status WORK
REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1

The relevant environment variables:
OMPI_MCA_btl_openib_rd_num=128
OMPI_MCA_btl_openib_verbose=1
OMPI_MCA_btl_base_verbose=1
OMPI_MCA_btl_openib_rd_low=75
OMPI_MCA_btl_base_debug=1
OMPI_MCA_btl_openib_warn_no_hca_params_found=0
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=self,openib
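
For anyone reproducing this: an MCA parameter set through an OMPI_MCA_<name> environment variable can also be passed to mpirun as "--mca <name> <value>". The sketch below only builds that command line from the settings listed above so the two forms can be compared. The program name ./my_app and the "-np 1" argument are placeholders, not taken from the original post.

```python
import shlex

# MCA settings copied from the environment variables listed above,
# with the OMPI_MCA_ prefix removed.
mca_params = {
    "btl_openib_rd_num": "128",
    "btl_openib_verbose": "1",
    "btl_base_verbose": "1",
    "btl_openib_rd_low": "75",
    "btl_base_debug": "1",
    "btl_openib_warn_no_hca_params_found": "0",
    "btl_openib_warn_default_gid_prefix": "0",
    "btl": "self,openib",
}


def mpirun_command(params, app_args):
    """Build an mpirun argv with one '--mca name value' triple per parameter."""
    argv = ["mpirun"]
    for name, value in params.items():
        argv += ["--mca", name, value]
    return argv + list(app_args)


# './my_app' and '-np 1' are hypothetical placeholders for the real job.
cmd = mpirun_command(mca_params, ["-np", "1", "./my_app"])
print(shlex.join(cmd))
```

Dropping the btl_openib_rd_num and btl_openib_rd_low entries from the dictionary gives the command line for the default-value case described next.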

If rd_low and rd_num are left at their default values, the program
simply hangs in the gather call after about 20 iterations (each
iteration being a gather and a bcast).

Can anyone shed any light on what this error message means or what
might be done about it?

Thanks,
mch


--
Jeff Squyres
Cisco Systems
