During some experiments we identified several major issues with coll ML in a 
very recent version of Open MPI master (22ab638 Jan 20 13:21:44). Based on the 
description below, I consider these issues major drawbacks that require 
immediate action (or disabling coll ML by default in all versions where it 
ships).

1. Stressing the coll ML selection mechanism leads to deadlocks. For each new 
communicator created, coll ML performs several collective communications to 
figure out the topology of the newly created communicator. Unfortunately this 
algorithm seems to be broken somehow, as a stress test eventually deadlocks. 
Attached is such a test, developed by Thomas, that stresses communicator 
creation in Open MPI by creating hundreds of communicators following a random 
split. Running it over 4 processes with “-a 250” will deadlock. As soon as coll 
ML is disabled, the test completes successfully. When it deadlocks, the 
backtrace is the following:

#6  0x00007ffeb9520009 in mca_pml_ob1_recv (addr=0x7ffff7936780, count=38, 
   datatype=0x7ffec290bb40, src=0, tag=-99, comm=0x3092e40, status=0x0)
   at pml_ob1_irecv.c:109
#7  0x00007ffec2629bc7 in comm_allreduce_pml (sbuf=0x3095c88, rbuf=0x3095c88, 
   count=38, dtype=0x7ffec290bb40, my_rank_in_group=2, op=0x7ffec2924520, 
   n_peers=3, ranks_in_comm=0x30a6d60, comm=0x3092e40)
   at patterns/comm/allreduce.c:215
#8  0x00007ffeb865a151 in ml_module_set_small_msg_thresholds (
   ml_module=0x3093da0) at coll_ml_module.c:1312
#9  0x00007ffeb865aa0f in ml_discover_hierarchy (ml_module=0x3093da0)
   at coll_ml_module.c:1546
#10 0x00007ffeb865f401 in mca_coll_ml_comm_query (comm=0x3092e40, 
   priority=0x7ffff793aa68) at coll_ml_module.c:2970

2. In the lucky cases where the above-mentioned deadlock doesn’t occur, the 
whole selection logic of coll ML is __extremely__ costly. All the collective 
communications during the hierarchy discovery are unnecessarily repeated for 
each communicator; they should be done only when new processes are added to 
the pool (as an example, this should only be done once per MPI_COMM_WORLD).
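
To make the "once per process pool" idea concrete, here is a minimal sketch in 
plain C. It is not coll ML code: every name in it (topo_cache_t, 
discover_world_topology, subcomm_node_count) is hypothetical, and the 
expensive collective discovery is replaced by a local stub that only counts 
how often it runs, so the sketch compiles without MPI. It assumes ranks are 
packed onto nodes in order, which a real discovery would have to learn.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANKS 64

/* Hypothetical cache: the node id of every rank in the process pool. */
typedef struct {
    int valid;                    /* non-zero once discovery has run */
    int nranks;                   /* size of the process pool */
    int node_of_rank[MAX_RANKS];  /* node id for each world rank */
    int discoveries;              /* how many times the costly step ran */
} topo_cache_t;

/* Stand-in for the expensive collective discovery (assumption: ranks are
 * packed ranks_per_node to a node, in rank order). */
void discover_world_topology(topo_cache_t *c, int nranks, int ranks_per_node)
{
    assert(nranks > 0 && nranks <= MAX_RANKS && ranks_per_node > 0);
    for (int r = 0; r < nranks; r++)
        c->node_of_rank[r] = r / ranks_per_node;
    c->nranks = nranks;
    c->valid = 1;
    c->discoveries++;
}

/* Number of distinct nodes spanned by a sub-communicator, given the world
 * ranks of its members. Discovery runs only on first use; every later call
 * is answered from the cache with no communication. */
int subcomm_node_count(topo_cache_t *c, int world_size, int ranks_per_node,
                       const int *world_ranks, int n)
{
    if (!c->valid)
        discover_world_topology(c, world_size, ranks_per_node);

    int seen[MAX_RANKS] = {0};
    int count = 0;
    for (int i = 0; i < n; i++) {
        assert(world_ranks[i] >= 0 && world_ranks[i] < c->nranks);
        int node = c->node_of_rank[world_ranks[i]];
        if (!seen[node]) {
            seen[node] = 1;
            count++;
        }
    }
    return count;
}
```

With a zero-initialized topo_cache_t, any number of calls to 
subcomm_node_count leaves the discoveries counter at 1. In this sketch, 
clearing the valid flag when new processes join the pool would be the only 
event that triggers another discovery.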

The figure in ml.pdf shows the average and the standard deviation of the 
communicator creation cost. As one can see, there is a drastic increase in the 
communicator creation cost, as well as an extreme variation in the standard 
deviation.

 George.


Attachment: manysplit.c
Description: Binary data

Attachment: ml.pdf
Description: Adobe PDF document
