Hi all - I have an application where the master node spawns slaves to perform computation (using the singleton Comm_spawn_multiple paradigm available in OpenMPI). The master only decides what work to do and provides the data common to all the computations.
The slaves are multi-threaded and handle load balancing locally via a non-blocking, thread-safe queue. Work is load balanced between nodes like so:

1) The master doles out half the work in a round-robin fashion.
2) The master replaces work when it receives completed work from a slave.

I currently have a paradigm in which I have built OpenMPI with multi-threading enabled, and I allow any thread to send (work) or broadcast (common data and control messages) to the other nodes. I have a dedicated thread handling receives, which can also handle the receive end of a broadcast. The receive thread hands the work data off to the local load-balancing mechanism and sets the common data in a thread-safe fashion.

When worker threads complete work quickly, they pound MPI with sends. This leads to a ton of lock contention. Another issue I'm facing is that sometimes the messages are very small, but there are a lot of them, and I think this may incur a lot of overhead in MPI and/or the various network layers.

I'm thinking of going to a THREAD_FUNNELED design instead of a THREAD_MULTIPLE design, but I'm unsure of the best way to accomplish this. For example, is it advisable to have multiple Isend and/or multiple Irecv in flight at once (essentially allowing the data to be staged concurrently), or is it better to have only one Isend at a time?

If I Iprobe and then Irecv, and then Iprobe again, presumably I will not get the same message, because retrieval of that message has already been started?

Currently, I Isend data to all receiving nodes to describe the details of a broadcast, but I Waitall before calling Bcast. Is there anything to be careful of if I move to more asynchronous communication? If I don't Waitall, are there cases where I can deadlock? I haven't thought of any.
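To make the dispatch policy in steps 1) and 2) concrete, here is a minimal sketch of the master-side bookkeeping in plain C. It contains no MPI calls, and the names (dispatcher, dispatch_initial, dispatch_replacement) are made up for illustration. It assumes that "half the work" rounds down when the item count is odd, and that a real implementation would send the chosen item to the slave at the points marked in the comments.

```c
#include <assert.h>

/* Hypothetical master-side bookkeeping; illustration only, no MPI calls. */
typedef struct {
    int total;    /* total number of work items */
    int next;     /* index of the next item not yet handed out */
    int nslaves;  /* number of slave nodes */
} dispatcher;

/* Step 1: hand out half the work round-robin (assumption: rounds down).
 * owner[i] is set to the slave index that gets item i.
 * Returns the number of items dispatched. */
int dispatch_initial(dispatcher *d, int *owner)
{
    assert(d->nslaves > 0 && d->total >= 0);
    int half = d->total / 2;
    for (int i = 0; i < half; ++i) {
        owner[i] = i % d->nslaves;  /* a real master would send item i here */
    }
    d->next = half;
    return half;
}

/* Step 2: a slave returned completed work, so give it one new item.
 * Returns the item index handed to that slave, or -1 if no work is left. */
int dispatch_replacement(dispatcher *d, int slave, int *owner)
{
    assert(slave >= 0 && slave < d->nslaves);
    if (d->next >= d->total) {
        return -1;                  /* nothing left to hand out */
    }
    int item = d->next++;
    owner[item] = slave;            /* a real master would send the item here */
    return item;
}
```

With 10 items and 3 slaves, dispatch_initial hands items 0 to 4 to slaves 0, 1, 2, 0, 1, and each later call to dispatch_replacement gives the next unassigned item to whichever slave just reported completed work.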
All my communication will be somewhat generic, in the sense that Probe/Iprobe are called with MPI_ANY_SOURCE and MPI_ANY_TAG, and Bcast is only initiated on the receiver side once it has received a control message identifying the sender and the size of the message.

Thanks for any and all suggestions and comments,
Brian