During some experiments we have identified several major issues with coll ML
with a very recent version of Open MPI master (22ab638 Jan 20 13:21:44). Based
on the description below I consider these issues as major drawbacks that
require immediate action (or disabling coll ML by default in all versions where
it ships).
1. Stressing the coll ML selection mechanism leads to deadlocks. For each new
communicator created coll ml will do several collective communications to
figure out the topology of the newly created communicator. Unfortunately this
algorithm seem to be somehow broken as a stress test eventually deadlocks.
Attached is a such a test developed by Thomas that will stress the communicator
creation in Open MPI by creating hundreds of communicators following a random
split. Running it over 4 processes with “-a 250” will deadlock. As soon as coll
ML is disabled, the test successfully completes. When it deadlocks the
backtrace is the following:
#6 0x7ffeb9520009 in mca_pml_ob1_recv (addr=0x77936780, count=38,
datatype=0x7ffec290bb40, src=0, tag=-99, comm=0x3092e40, status=0x0)
at pml_ob1_irecv.c:109
#7 0x7ffec2629bc7 in comm_allreduce_pml (sbuf=0x3095c88, rbuf=0x3095c88,
count=38, dtype=0x7ffec290bb40, my_rank_in_group=2, op=0x7ffec2924520,
n_peers=3, ranks_in_comm=0x30a6d60, comm=0x3092e40)
at patterns/comm/allreduce.c:215
#8 0x7ffeb865a151 in ml_module_set_small_msg_thresholds (
ml_module=0x3093da0) at coll_ml_module.c:1312
#9 0x7ffeb865aa0f in ml_discover_hierarchy (ml_module=0x3093da0)
at coll_ml_module.c:1546
#10 0x7ffeb865f401 in mca_coll_ml_comm_query (comm=0x3092e40,
priority=0x7793aa68) at coll_ml_module.c:2970
2. In the lucky cases where the above mentioned deadlock doesn’t occur, the
whole selection logic of the coll ML is __extremely__ costly. All the
collective communications during the hierarchy discovery are unnecessary done
for each communicator, they should be done only when new processes are added to
the poll (as an example this should only be done once per MPI_COMM_WORLD).
The figure in ml.pdf shows the average and the standard deviation of the
communicator creation cost. As one can see there is a drastic increase in
communicator creation cost, as well as an extreme variation of the standard
deviation.
George.
manysplit.c
Description: Binary data
ml.pdf
Description: Adobe PDF document