Sorry for the delays in replying.

The central problem is that Open MPI is much more aggressive about its message passing progress than LAM is: it was designed to get the highest performance possible, not to share the processor well.

mpi_yield_when_idle only helps for transports that actively use our event engine, such as the TCP device. Since you're using the LAM sysv RPI, I assume you're using the TCP and shared memory devices in OMPI, right? If you're using InfiniBand, for example, OMPI's event engine is not invoked much, because IB has its own progress engine that is unrelated to OMPI's.

mpi_yield_when_idle also only helps if you're entering the MPI layer often and making message passing progress (i.e., OMPI's event engine is actively being invoked). Is this true for your application?
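
To make that concrete, here's a rough sketch (purely hypothetical, not your actual code) of the kind of wait loop where mpi_yield_when_idle can actually do something, because every MPI_Test call re-enters OMPI's progress engine:

    #include <mpi.h>

    /* Hypothetical sketch: a wait loop that re-enters the MPI layer
       (and therefore OMPI's progress/event engine) on every iteration.
       mpi_yield_when_idle can only take effect while the process is
       inside MPI calls like these. */
    static void wait_for_result(MPI_Request *req, MPI_Status *status)
    {
        int done = 0;
        while (!done) {
            /* Each MPI_Test runs a pass of OMPI's progress engine. */
            MPI_Test(req, &done, status);
            /* ...optionally do a small slice of local work here... */
        }
    }

If your master instead spends most of its time in its own computation between infrequent MPI calls, the yield parameter has nothing to act on.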

If mpi_yield_when_idle really doesn't help much, you may consider sprinkling calls to sched_yield() in your code to force the process to yield the processor.
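
Something like this would do it (again, just a sketch, assuming a polling loop like the one above; put the yield wherever your code spins waiting for messages):

    #include <sched.h>   /* sched_yield() */
    #include <mpi.h>

    /* Hypothetical sketch: explicitly give the processor back to the
       other job's processes while waiting, instead of relying on
       mpi_yield_when_idle. */
    static void wait_for_result_politely(MPI_Request *req, MPI_Status *status)
    {
        int done = 0;
        while (!done) {
            MPI_Test(req, &done, status);
            if (!done) {
                sched_yield();   /* give up the rest of this time slice */
            }
        }
    }

sched_yield() costs you a little latency when a message does arrive, but it lets the other job's processes run while yours has nothing to do.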



On Apr 4, 2008, at 2:30 AM, Lars Andersson wrote:
Hi,

I'm just in the process of moving our application from LAM/MPI to OpenMPI, mainly because OpenMPI makes it easier for a user to run multiple jobs (MPI universes) simultaneously. This is useful if a user wants to run smaller experiments without disturbing a large experiment running in the background. I've been evaluating the performance using a simple test, running on a heterogeneous cluster of 2 x dual-core Opteron machines, a couple of dual-core P4 Xeon machines and an 8-core Core2 machine. The main structure of the application is a master rank distributing job packages to the rest of the ranks and collecting the results. We don't use any fancy MPI features, but rather see MPI as an efficient low-level tool for broadcasting and transferring data.

When a single user runs a job (fully subscribed nodes, but not oversubscribed, i.e., one process per CPU core) on an otherwise unloaded cluster, both LAM/MPI and OpenMPI give average runtimes of about 1m33s (OpenMPI has a slightly lower average).

When I start the same job simultaneously as two different users (thus oversubscribing the nodes 2x) under LAM/MPI, the two jobs finish in an average time of about 3m, scaling very well (we use the -ssi rpi sysv option to mpirun under LAM/MPI to avoid busy waiting).

When running the same oversubscription experiment under OpenMPI, the average runtime jumps up to about 3m30s, with runs occasionally taking more than 4 minutes to complete. I do use the "--mca mpi_yield_when_idle 1" option to mpirun, but it doesn't seem to make any difference. I've also tried setting the environment variable OMPI_MCA_mpi_yield_when_idle=1, but still no change. ompi_info says:

ompi_info --param all all | grep yield
MCA mpi: parameter "mpi_yield_when_idle" (current value: "1")

The cluster is used for various tasks, running MPI applications as well as non-MPI applications, so we would like to avoid spending too many cycles on busy waiting. Any ideas on how to tweak OpenMPI to get better performance and more cooperative behavior in this case would be greatly appreciated.

Cheers,

Lars


--
Jeff Squyres
Cisco Systems
