Re: [OMPI devel] problem in the ORTE notifier framework

Ralph Castain Wed, 27 May 2009 06:59:15 -0400

While that is a good way of minimizing the impact of the counter, you still
have to do an "if-then" to check if the counter exceeds the threshold. This
"if-then" also has to get executed every time, and generally consumes more
than a few cycles.


To be clear: it isn't the output that is the concern. The output only occurs
as an exception case, essentially equivalent to dealing with an error, so it
can be "slow". The concern is with the impact of testing to see if the
output needs to be generated as this testing occurs every time we transit
the code.

I think Jeff and I are probably closer to agreement on design than it might
seem, and may be close to what you might also have had in mind. Basically, I
was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)

#if WANT_NOTIFIER_VERBOSE
opal_atomic_increment(counter);
if (counter > threshold) {
    orte_notifier.api(.....)
}
#endif
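
A more concrete version of that macro, as a minimal sketch only. It uses C11 <stdatomic.h> instead of OPAL's atomics, and NOTIFIER_VERBOSE, notifier_over_threshold and example_notify are hypothetical names, not existing ORTE symbols. The test uses the value returned by the atomic add, so the increment and the comparison see the same count:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#ifndef WANT_NOTIFIER_VERBOSE
#define WANT_NOTIFIER_VERBOSE 0   /* off unless requested at build time */
#endif

/* Bump the counter and report whether this call is past the threshold.
 * atomic_fetch_add returns the previous value, so previous + 1 is the count
 * that belongs to this call, even with several threads incrementing. */
static inline bool notifier_over_threshold(atomic_long *counter, long threshold)
{
    long count = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed) + 1;
    return count > threshold;
}

#if WANT_NOTIFIER_VERBOSE
#define NOTIFIER_VERBOSE(notify_fn, counter, threshold, ...)      \
    do {                                                          \
        if (notifier_over_threshold(&(counter), (threshold))) {   \
            notify_fn(__VA_ARGS__);                               \
        }                                                         \
    } while (0)
#else
/* Disabled build: no increment, no branch, arguments are not evaluated. */
#define NOTIFIER_VERBOSE(notify_fn, counter, threshold, ...) do { } while (0)
#endif

/* Hypothetical stand-in for orte_notifier.api(...). */
void example_notify(const char *msg)
{
    fprintf(stderr, "notifier: %s\n", msg);
}
```

A call site would be a single line such as NOTIFIER_VERBOSE(example_notify, my_counter, my_threshold, "slow path taken"), with my_counter declared as an atomic_long. Like the pseudocode, this fires on every pass once the threshold is exceeded; resetting the counter after a notification would be a separate design choice.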

You would set the specific thresholds for each situation via MCA params, so
this could be tuned to fit specific needs. Those who don't want the penalty
can just build normally - those who want this level of information can
enable it.
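
As a rough illustration of a per-situation threshold, here is a sketch that reads one from an environment variable. This is only a stand-in: real code would register the value through the MCA parameter system, and both threshold_from_env and the variable name in the usage note are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a threshold from the named environment variable. Falls back to
 * default_value when the variable is unset, empty, malformed, or negative. */
static long threshold_from_env(const char *name, long default_value)
{
    const char *text = getenv(name);
    if (text == NULL || *text == '\0') {
        return default_value;
    }

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0) {
        return default_value;
    }
    return value;
}
```

Each instrumented site would do this lookup once at init time, for example threshold_from_env("OMPI_MCA_example_threshold", 10), and keep the result in a variable so that the hot path only pays for the compare.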

We can then see just how much penalty is involved in real world situations.
My guess is that it won't be that big, but it's hard to know without seeing
how frequently we actually insert this code.

Hope that makes sense
Ralph


On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey
<sylvain.jeau...@bull.net> wrote:

> About performance, I may be missing something, but our first goal was to
> track already slow paths.
>
> We imagined that it could be possible to add at the beginning (or end) of
> this "bad path" just one line that would basically do an atomic inc. So, in
> terms of CPU cycles, something like 1 for the inc and maybe 1 jump before.
> Are a couple of cycles really an issue in slow paths (which take at least
> hundreds of cycles), or do you fear out-of-cache memory accesses - or
> something else?
>
> As for outputs, they are indeed slow (and can considerably slow down an
> application if not synchronized), but aggregation on the head node should
> solve our problems. And if not, we can also disable outputs at runtime.
>
> So, in my opinion, no application should notice a difference (unless you
> tune the framework to output every warning).
>
> Sylvain
>
>
> On Tue, 26 May 2009, Jeff Squyres wrote:
>
>> Nadia --
>>
>> Sorry I didn't get to jump in on the other thread earlier.
>>
>> We have made considerable changes to the notifier framework in a branch to
>> better support "SOS" functionality:
>>
>>  https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos
>>
>> Cisco and Indiana U. have been working on this branch for a while.  A
>> description of the SOS stuff is here:
>>
>>  https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
>>
>> As for setting up an external web server with hg, don't bother -- just get
>> an account at bitbucket.org.  They're free and allow you to host hg
>> repositories there.  I've used bitbucket to collaborate on code before it
>> hits OMPI's SVN trunk with both internal and external OMPI developers.
>>
>> We can certainly move the opal-sos repo to bitbucket (or branch again off
>> opal-sos to bitbucket -- whatever makes more sense) to facilitate
>> collaborating with you.
>>
>> Back on topic...
>>
>> I'd actually suggest a combination of what has been discussed in the other
>> thread.  The notifier can be the mechanism that actually sends the output
>> message, but it doesn't have to be the mechanism that tracks the stats and
>> decides when to output a message.  That can be separate logic, and therefore
>> be more fine-grained (and potentially even specific to the MPI layer).
>>
>> The Big Question will be how to do this with zero performance impact when it
>> is not being used. This has always been the difficult issue when trying to
>> implement any kind of monitoring inside the core OMPI performance-sensitive
>> paths.  Even adding individual branches has met with resistance (in
>> performance-critical code paths)...
>>
>>
>>
>> On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:
>>
>>> Hi,
>>>
>>> While having a look at the notifier framework under orte, I noticed that
>>> the way it is written, the init routine for the selected module cannot
>>> be called.
>>>
>>> Attached is a small patch that fixes this issue.
>>>
>>> Regards,
>>> Nadia
>>>
>>> <orte_notifier_fix_select.patch><ATT14046023.txt>
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
