Hello,

As I understand it, when MPI_Iprobe is called, the code that runs is the 
function pointed to by the attribute

mca_pml_base_module_iprobe_fn_t       pml_iprobe;


in ompi/mca/pml/pml.h
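
The MPI layer reaches that pointer through the MCA_PML_CALL macro. Here is a 
simplified, self-contained sketch of the dispatch pattern (types are 
abbreviated to void * for illustration; the real typedef lives in pml.h):

    /* Illustrative only: the selected PML component exposes its iprobe
       entry point as a function pointer, and MPI_Iprobe calls through it. */
    typedef int (*iprobe_fn_t)(int src, int tag, void *comm,
                               int *matched, void *status);

    typedef struct pml_module {
        iprobe_fn_t pml_iprobe;
        /* ... the other PML entry points ... */
    } pml_module_t;

    static pml_module_t mca_pml;  /* filled in when a PML is selected */

    /* MPI_Iprobe(source, tag, comm, &flag, status) effectively does:
       mca_pml.pml_iprobe(source, tag, comm, &flag, status);          */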


In ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (Open MPI 1.4.3), 
ompi_crcp_bkmrk_pml_iprobe calls drain_message_find_any.


In drain_message_find_any (in the same file), there is a loop over all MPI 
ranks regardless of the peer parameter. For instance, with 256 peers, probing 
for peer 255 requires 256 iterations while probing for peer 0 requires only 1.


As I understand it, the linked list ompi_crcp_bkmrk_pml_peer_refs is populated 
with nprocs entries, where nprocs is presumably the number of MPI ranks in 
MPI_COMM_WORLD.
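
To illustrate the cost, here is a generic sketch of that kind of head-to-tail 
scan (names are illustrative, not the actual 1.4.3 code):

    /* Illustrative only: a linear scan over a linked list of peer
       references, performed from the head even when the caller asked
       for one specific rank. */
    typedef struct peer_item {
        int rank;
        struct peer_item *next;
    } peer_item_t;

    static peer_item_t *find_peer_linear(peer_item_t *head, int peer)
    {
        for (peer_item_t *it = head; it != NULL; it = it->next) {
            if (it->rank == peer) {
                return it;   /* peer 0: 1 step; peer 255: 256 steps */
            }
        }
        return NULL;
    }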


If my understanding is right, here are some suggestions:


1. ompi_crcp_bkmrk_pml_peer_refs should be an array so that, when peer is not 
MPI_ANY_SOURCE, MPI_Iprobe can return in constant time.


2. There should be some sort of round-robin mechanism for the case where the 
peer is MPI_ANY_SOURCE; otherwise lower ranks get probed more often and higher 
ranks suffer from starvation. This could be done by keeping a current position 
in the peer list (or array, see point 1): instead of always starting the loop 
at the first entry, the loop would start at the current position and perform 
at most nprocs iterations. A sketch combining both points follows below.
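
Here is a minimal sketch of both suggestions; peer_refs, nprocs, next_start 
and has_pending_message are hypothetical names for illustration, not Open MPI 
symbols:

    #include <mpi.h>
    #include <stddef.h>

    /* Sketch of points 1 and 2: an array indexed by rank gives O(1)
       lookup for a specific peer, and a rotating cursor spreads
       MPI_ANY_SOURCE probes fairly across all ranks. */
    typedef struct peer_ref {
        int rank;
        /* ... drained-message bookkeeping would go here ... */
    } peer_ref_t;

    static peer_ref_t *peer_refs;  /* nprocs entries, indexed by rank */
    static int nprocs;
    static int next_start = 0;     /* round-robin cursor (point 2) */

    /* Stub: a real version would check the peer's drained queue. */
    static int has_pending_message(const peer_ref_t *p)
    {
        (void)p;
        return 0;
    }

    static peer_ref_t *find_peer(int peer)
    {
        if (peer != MPI_ANY_SOURCE) {
            return &peer_refs[peer];            /* constant time (point 1) */
        }
        /* Start at the cursor and wrap: at most nprocs iterations. */
        for (int i = 0; i < nprocs; ++i) {
            int r = (next_start + i) % nprocs;
            if (has_pending_message(&peer_refs[r])) {
                next_start = (r + 1) % nprocs;  /* resume after r next time */
                return &peer_refs[r];
            }
        }
        return NULL;
    }

Advancing the cursor past the matched rank sends that rank to the back of the 
line, which is what prevents the starvation described above.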


A code review is on my blog: 
http://dskernel.blogspot.com/2011/09/code-review-what-happens-in-open-mpis.html



                                                     Sébastien
