During some experiments we identified several major issues with coll ML in a very recent version of Open MPI master (22ab638, Jan 20 13:21:44). Based on the description below, I consider these issues major drawbacks that require immediate action (or disabling coll ML by default in all versions where it ships).

1. Stressing the coll ML selection mechanism leads to deadlocks. For each newly created communicator, coll ML performs several collective communications to figure out the topology of that communicator. Unfortunately this algorithm appears to be broken, as a stress test eventually deadlocks. Attached is such a test, developed by Thomas, that stresses communicator creation in Open MPI by creating hundreds of communicators following a random split. Running it over 4 processes with "-a 250" will deadlock. As soon as coll ML is disabled, the test completes successfully. When it deadlocks, the backtrace is the following:

   #6  0x00007ffeb9520009 in mca_pml_ob1_recv (addr=0x7ffff7936780, count=38, datatype=0x7ffec290bb40, src=0, tag=-99, comm=0x3092e40, status=0x0) at pml_ob1_irecv.c:109
   #7  0x00007ffec2629bc7 in comm_allreduce_pml (sbuf=0x3095c88, rbuf=0x3095c88, count=38, dtype=0x7ffec290bb40, my_rank_in_group=2, op=0x7ffec2924520, n_peers=3, ranks_in_comm=0x30a6d60, comm=0x3092e40) at patterns/comm/allreduce.c:215
   #8  0x00007ffeb865a151 in ml_module_set_small_msg_thresholds (ml_module=0x3093da0) at coll_ml_module.c:1312
   #9  0x00007ffeb865aa0f in ml_discover_hierarchy (ml_module=0x3093da0) at coll_ml_module.c:1546
   #10 0x00007ffeb865f401 in mca_coll_ml_comm_query (comm=0x3092e40, priority=0x7ffff793aa68) at coll_ml_module.c:2970

2. In the lucky cases where the above-mentioned deadlock doesn't occur, the whole selection logic of coll ML is extremely costly. The collective communications performed during hierarchy discovery are needlessly repeated for each communicator; they should be done only when new processes are added to the pool (for example, only once for MPI_COMM_WORLD). The figure in ml.pdf shows the average and the standard deviation of the communicator creation cost. As one can see, there is a drastic increase in communicator creation cost, as well as extreme variation in the standard deviation.

George.
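
For anyone who wants to reproduce the cost measurement in point 2, here is a minimal sketch of how the average and spread of per-communicator creation times could be accumulated. It is not taken from manysplit.c; the type and function names (timing_stats, timing_stats_add, ...) are hypothetical, and it reports the sample variance, of which the standard deviation is the square root.

```c
#include <assert.h>
#include <stddef.h>

/* Running statistics over timing samples (Welford's online algorithm).
 * Hypothetical helper, not part of Open MPI or of the attached test. */
typedef struct {
    size_t n;     /* number of samples seen so far */
    double mean;  /* running mean of the samples */
    double m2;    /* sum of squared deviations from the running mean */
} timing_stats;

static void timing_stats_init(timing_stats *s)
{
    assert(s != NULL);
    s->n = 0;
    s->mean = 0.0;
    s->m2 = 0.0;
}

/* Fold one timing sample (e.g. seconds spent creating one communicator)
 * into the running mean and squared-deviation sum. */
static void timing_stats_add(timing_stats *s, double sample)
{
    assert(s != NULL);
    double delta = sample - s->mean;
    s->n += 1;
    s->mean += delta / (double)s->n;
    s->m2 += delta * (sample - s->mean);
}

/* Sample variance; the standard deviation is its square root.
 * Returns 0.0 when fewer than two samples have been recorded. */
static double timing_stats_variance(const timing_stats *s)
{
    assert(s != NULL);
    return (s->n > 1) ? s->m2 / (double)(s->n - 1) : 0.0;
}
```

Under that assumption, each rank would read MPI_Wtime() immediately before and after every MPI_Comm_split call and pass the difference to timing_stats_add. Comparing the resulting mean and variance with and without coll ML enabled should show the same kind of gap as the figure.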
manysplit.c
Description: Binary data
ml.pdf
Description: Adobe PDF document