A user noticed a specific change that we made between 1.4.2 and 1.4.3:
https://svn.open-mpi.org/trac/ompi/changeset/23448
which is from CMR https://svn.open-mpi.org/trac/ompi/ticket/2489, and
originally from trunk https://svn.open-mpi.org/trac/ompi/changeset/23434. I
removed the opal_progress_event_users_decrement() from ompi_mpi_init() because
the ORTE DPM does its own _increment() and _decrement().
However, it seems that there was an unintended consequence of this -- look at
the annotated Ganglia graph that the user sent (see attached). In 1.4.2, all
of the idle time was "user" CPU usage. In 1.4.3, it's split between user and
system CPU usage. The application he used for testing is basically an init /
finalize test (with some additional MPI middleware). See:
http://www.open-mpi.org/community/lists/users/2010/11/14773.php
Can anyone think of why this occurs, and/or if it's a Bad Thing?
If removing this decrement caused a bunch more system CPU time, that would
seem to imply that we're now calling libevent more frequently than we used to
(instead of just occasionally polling the opal event callbacks), and therefore
that there might now be an unmatched increment somewhere.
Right...?
--
Jeff Squyres
[email protected]