Short answer: we need the "extra" decrement at the end of MPI_INIT.

Long answer: Ok, so I was somewhat wrong :).

The count of users is initialized to 0.  If it's greater than zero, the event 
library is polled every time opal_progress() is called, which kills latency 
(I'm surprised this didn't show up in testing).  It's really quite pointless 
for a runtime or portability library not to poll the event library on every 
call (particularly since the primary communication mechanisms in the runtime 
library use the event library), so opal_init() increments the counter to 1.
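
To make the gating concrete, here's a tiny standalone sketch of the idea.  
This is not the real opal/runtime/opal_progress.c -- event_users, 
users_increment(), poll_event_library(), and progress() are just illustrative 
stand-ins for the real OMPI symbols:

  /* Illustrative sketch only; not the actual OMPI implementation. */
  #include <stdio.h>

  static int event_users = 0;          /* the "count of users"; starts at 0 */

  static void users_increment(void) { ++event_users; }

  static void poll_event_library(void) /* stand-in for a libevent poll */
  {
      printf("  polling event library (the expensive path)\n");
  }

  static void progress(void)           /* stand-in for opal_progress() */
  {
      /* Only take the expensive event-library path when at least one
       * user has registered interest; otherwise skip it to keep the
       * per-call latency of progress() low. */
      if (event_users > 0) {
          poll_event_library();
      }
      /* ... poll the cheap, registered progress callbacks here ... */
  }

  int main(void)
  {
      progress();          /* counter is 0: event library is skipped */
      users_increment();   /* what opal_init() effectively does      */
      progress();          /* counter is 1: event library is polled  */
      return 0;
  }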

So by the time anything interesting in MPI_INIT happens, the counter is set to 
1, and every call to opal_progress() results in a call to the event library.  
The decrement in MPI_INIT was there to "undo" the initialization increment, so 
that things would run fast from the end of MPI_INIT to the start of 
MPI_FINALIZE unless some other piece of OMPI knew it needed fast run-time 
interactions (such as the DPM or the TCP-based BTLs).  Of course, during 
MPI_FINALIZE, we need to "undo" the go-fast options we changed at the end of 
MPI_INIT, which is why there's an increment early in finalize.
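
Put differently, the increments and decrements are meant to pair up over the 
life of the process.  The standalone sketch below just tracks the counter 
value through that sequence -- again, the names are illustrative rather than 
the real ompi_mpi_init.c / ompi_mpi_finalize.c code, the component bracketing 
is simply the DPM/TCP BTL example from above, and I'm assuming the final 
decrement happens as part of opal_finalize() tearing things down:

  /* Illustrative sketch of the intended pairing; only the counter
   * values in the output are the point. */
  #include <stdio.h>

  static int event_users = 0;

  static void users_increment(void) { ++event_users; }
  static void users_decrement(void) { --event_users; }

  static void show(const char *where)
  {
      printf("%-32s counter = %d (%s)\n", where, event_users,
             event_users > 0 ? "event library polled" : "fast path");
  }

  int main(void)
  {
      users_increment();  show("opal_init()");
      users_decrement();  show("end of MPI_INIT");
      users_increment();  show("component increment (e.g. DPM)");
      users_decrement();  show("component decrement");
      users_increment();  show("early in MPI_FINALIZE");
      users_decrement();  show("opal_finalize() (assumed)");
      return 0;
  }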

Brian

On Nov 22, 2010, at 12:27 PM, Jeff Squyres wrote:

> On Nov 22, 2010, at 11:35 AM, Barrett, Brian W wrote:
> 
>> Um, the counter starts initialized at one.
> 
> Does that mean that we should or should not leave that extra _decrement() in 
> there?
> 
>> Brian
>> 
>> On Nov 22, 2010, at 9:32 AM, Jeff Squyres wrote:
>> 
>>> A user noticed a specific change that we made between 1.4.2 and 1.4.3:
>>> 
>>>  https://svn.open-mpi.org/trac/ompi/changeset/23448
>>> 
>>> which is from CMR https://svn.open-mpi.org/trac/ompi/ticket/2489, and 
>>> originally from trunk https://svn.open-mpi.org/trac/ompi/changeset/23434.  
>>> I removed the opal_progress_event_users_decrement() from ompi_mpi_init() 
>>> because the ORTE DPM does its own _increment() and _decrement().
>>> 
>>> However, it seems that there was an unintended consequence of this -- look 
>>> at the annotated Ganglia graph that the user sent (see attached).  In 
>>> 1.4.2, all of the idle time was "user" CPU usage.  In 1.4.3, it's split 
>>> between user and system CPU usage.  The application that he used to test is 
>>> basically an init / finalize test (with some additional MPI middleware).  
>>> See:
>>> 
>>>  http://www.open-mpi.org/community/lists/users/2010/11/14773.php
>>> 
>>> Can anyone think of why this occurs, and/or if it's a Bad Thing?
>>> 
>>> If removing this decrement enabled a bunch more system CPU time, that would 
>>> seem to imply that we're calling libevent more frequently than we used to 
>>> (vs. polling the opal event callbacks), and therefore that there might now 
>>> be an unmatched increment somewhere.
>>> 
>>> Right...?
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories


