Jeff Squyres wrote:
We get this question so much that I really need to add it to the FAQ. :-\
Open MPI currently always spins for completion for exactly the reason
that Scott cites: lower latency.
Arguably, when using TCP, we could probably get a bit better performance
by blocking and allowing the kernel to make more progress than a single
quick pass through the sockets progress engine, but that involves some
other difficulties such as simultaneously allowing shared memory
progress. We have ideas about how to make this work, but it has unfortunately
remained at a lower priority: the performance difference isn't that
great, and we've been focusing on the other, lower latency interconnects
(shmem, MX, verbs, etc.).
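To put a number on what that spinning looks like from the application
side: a rank that is merely waiting for its next message still occupies
a full core. A minimal (purely illustrative) two-rank reproducer makes
it visible: rank 0 sits in MPI_Recv while rank 1 sleeps, and rank 0's
core shows ~100% in top the whole time.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative reproducer: rank 0 waits in MPI_Recv while rank 1
     * sleeps for 30 seconds.  While waiting, rank 0's core stays at
     * ~100% even though nothing is arriving on the wire. */
    int main(int argc, char **argv)
    {
        int rank, token = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0: received %d\n", token);
        } else if (rank == 1) {
            sleep(30);              /* stand-in for an idle front end */
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }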
Whilst I understand that you have other priorities, and I am grateful for
the leverage I get by using Open MPI, I would like to offer an
alternative use case, which I believe may become more common.
We're developing parallel software which is designed to be used
*interactively* as well as in batch mode. We want the same SIMD code
to run on a user's quad-core workstation as it does on a 1,000-node cluster.
For the former case (a single workstation), it would be *much* more
user-friendly and interactive if the back-end MPI code did not spin at
100% CPU while it is merely waiting for the next front-end command. As
it stands, the GUI thread never gets a look-in.
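Our current workaround (not what I'm asking for, just to show the shape
of the problem) is to poll with MPI_Iprobe and sleep briefly between
polls, trading a little latency for an otherwise idle core. Something
along these lines, where CMD_TAG and the message layout are just
illustrative:

    #include <mpi.h>
    #include <unistd.h>

    #define CMD_TAG 1   /* hypothetical tag used for front-end commands */

    /* Wait for the next front-end command without pinning a core:
     * poll with MPI_Iprobe and sleep ~1 ms between polls. */
    static int wait_for_command(void)
    {
        MPI_Status status;
        int flag = 0, cmd = 0;

        while (!flag) {
            /* Cheap, non-blocking check for a pending command message. */
            MPI_Iprobe(MPI_ANY_SOURCE, CMD_TAG, MPI_COMM_WORLD,
                       &flag, &status);
            if (!flag)
                usleep(1000);       /* give the core back between polls */
        }
        MPI_Recv(&cmd, 1, MPI_INT, status.MPI_SOURCE, CMD_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return cmd;
    }

It works, but it is exactly the polling-versus-latency compromise I'd
rather the library made internally, at a finer grain than an
application reasonably can.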
I can only guess at the difficulties involved, but if the POSIX calls
select() and pthread_cond_wait() can do it for TCP sockets and
shared-memory threads respectively, it can't be impossible!
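To be clear, by "blocking" I mean nothing more exotic than those usual
POSIX primitives, which park the thread in the kernel and burn no CPU
until woken. Very roughly (a sketch of the primitives themselves, not a
proposal for how the progress engine should be structured):

    #include <pthread.h>
    #include <sys/select.h>

    /* Sleep until socket 'fd' becomes readable; the kernel parks the
     * thread, so no CPU is consumed while nothing is arriving. */
    static int wait_readable(int fd)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        return select(fd + 1, &rfds, NULL, NULL, NULL);
    }

    /* Sleep until another thread signals that shared-memory work
     * exists; '*ready' is the usual predicate guarded by 'lock'. */
    static void wait_for_work(pthread_mutex_t *lock, pthread_cond_t *cv,
                              int *ready)
    {
        pthread_mutex_lock(lock);
        while (!*ready)             /* re-check: spurious wakeups happen */
            pthread_cond_wait(cv, lock);
        pthread_mutex_unlock(lock);
    }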
Just my 2c,
Simon