Re: [OMPI devel] problem in the ORTE notifier framework

Nadia Derbey Thu, 28 May 2009 08:15:48 -0400

On Thu, 2009-05-28 at 05:57 -0600, Ralph Castain wrote:
> I agree with Terry here about being careful in pursuing this path.
> What I wouldn't want to have happen is to force anyone wanting to be
> notified of error events to have to also turn on peruse, which impacts
> the non-error code path.


Agreed, I missed that part!

Regards,
Nadia
> 
> Again, I'm not entirely sure what you are trying to do here. As I
> understood the original RFC, it sounded like you wanted to track
> errors but only report them when they occurred a controlled number of
> times (as opposed to every time). I think this would better be done
> outside of peruse.
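
To make the "controlled number of times" idea concrete, here is a minimal
sketch of a tracker that lives outside of peruse. It assumes one possible
policy (report every N-th occurrence); the names error_tracker_t,
error_tracker_init and error_tracker_hit are hypothetical and are not part of
ORTE or OPAL. It uses plain C11 atomics rather than any Open MPI primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-error state; not an Open MPI type. */
typedef struct {
    atomic_uint count;      /* occurrences seen so far */
    unsigned    threshold;  /* report once every `threshold` occurrences */
} error_tracker_t;

/* Set up a tracker that asks for a report every `threshold` occurrences. */
static void error_tracker_init(error_tracker_t *t, unsigned threshold)
{
    assert(threshold > 0);
    atomic_init(&t->count, 0);
    t->threshold = threshold;
}

/* Record one occurrence of the error.
 * Returns true when this occurrence is the one that should be reported,
 * i.e. when the running count is a multiple of the threshold. */
static bool error_tracker_hit(error_tracker_t *t)
{
    unsigned seen = atomic_fetch_add(&t->count, 1) + 1;
    return (seen % t->threshold) == 0;
}
```

The caller would invoke the notifier only when error_tracker_hit() returns
true, so the error path pays for one atomic add and one comparison.
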
> 
> If you are trying to track normal performance (e.g., trying to alert
> sys admins when networks aren't running as fast as they should), then
> that probably should be done inside of peruse. However, that will
> definitely impact the critical code path, so Terry's caution is a
> real concern.
> 
> 
> On Thu, May 28, 2009 at 12:55 AM, Nadia Derbey <nadia.der...@bull.net>
> wrote:
>         On Wed, 2009-05-27 at 14:25 -0400, Jeff Squyres wrote:
>         > Excellent points; Ralph and I chatted about this on the phone today --
>         > we concur with George.
>         >
>         > Bull -- would peruse work for you?  I think you mentioned before that
>         > it didn't seem attractive to you.
>         
>         
>         Well, it didn't because, from what I understood, the MPI program
>         needs to be changed (register a callback routine for the event,
>         activate the event, etc.), and this is something we wanted to avoid.
>         
>         Now, if we are allowed to
>         1. define new "internal" PERUSE events,
>         2. internally set the associated callback routines,
>         why not use peruse? Combined with the ORTE notifier framework,
>         this could do the job, I think.
>         
>         Regards,
>         Nadia
>         
>         
>         >   I think George's point is that we
>         > already have lots of hooks in place in the PML -- and they're called
>         > peruse.  So if we could use those hooks, then a) they're run-time
>         > selectable already, and b) there's no additional cost in performance
>         > critical/not-critical code paths (for the case where these stats are
>         > not being collected) because PERUSE has been in the code base for a
>         > long time.
>         >
>         > I think the idea is that your callbacks could be invoked by the peruse
>         > hooks and then they can do whatever they want -- increment counters,
>         > conditionally invoke the ORTE notifier system, etc.
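
As an illustration of hooks with internally registered callbacks, the sketch
below shows a generic callback table. It is not the PERUSE API: the event
ids, event_register, event_fire and count_event are all hypothetical names
chosen for this example. The point is only that an inactive hook costs a
single pointer test, and that the callback is free to do its own bookkeeping.

```c
#include <stddef.h>

/* Hypothetical internal event ids; these are not real PERUSE event names. */
enum { EV_SEND_RETRY, EV_QUEUE_FULL, EV_COUNT };

typedef void (*event_cb_t)(int event, void *user_data);

static event_cb_t callbacks[EV_COUNT];     /* NULL means "event not active" */
static void      *callback_data[EV_COUNT]; /* opaque pointer handed back to the callback */

/* Attach a callback internally; the MPI application is not involved.
 * Returns 0 on success, -1 if the event id is out of range. */
static int event_register(int event, event_cb_t cb, void *user_data)
{
    if (event < 0 || event >= EV_COUNT) {
        return -1;
    }
    callbacks[event] = cb;
    callback_data[event] = user_data;
    return 0;
}

/* Hook placed in the code path: one pointer test when nothing is registered. */
static void event_fire(int event)
{
    event_cb_t cb = callbacks[event];
    if (cb != NULL) {
        cb(event, callback_data[event]);
    }
}

/* Example callback: bump a counter owned by whoever registered it. */
static void count_event(int event, void *user_data)
{
    (void)event;  /* unused in this example */
    ++*(unsigned long *)user_data;
}
```

A callback like count_event could just as well compare its counter against a
limit and call into the notifier; that decision stays out of the PML code.
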
>         >
>         >
>         >
>         > On May 27, 2009, at 11:34 AM, George Bosilca wrote:
>         >
>         > > What is a generic threshold? And what is a counter? We have a policy
>         > > against such coding standards, and to be honest I would like to stick
>         > > to it. The reason is that the PML is a very complex piece of code, and
>         > > I would like to keep it as easy to understand as possible. If people
>         > > start adding #if/#endif all over the code, we are diverging from this
>         > > goal.
>         > >
>         > > The only way to make this work is to call the notifier or some other
>         > > framework in this "slow path" and let this other framework do its own
>         > > logic to determine what and when to print. Of course the cost of this
>         > > is a function call plus an atomic operation (which is already not
>         > > cheap). It's starting to get expensive, even for a "slow path", which
>         > > in this particular context is just one insertion in an atomic FIFO.
>         > >
>         > > If, instead of counting the number of times we try to send the fragment,
>         > > we switch to a time-based approach, this can be solved with the PERUSE
>         > > calls. There is a callback when the request is created, and another
>         > > callback when the first fragment is pushed successfully into the
>         > > network. Computing the time between these two allows a tool to figure
>         > > out how much time the request was waiting in some internal queues, and
>         > > therefore how much delay this added to the execution time.
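
The time-based measurement described above could be sketched roughly as
follows. The two functions stand in for the "request created" and "first
fragment pushed" callbacks; their names and signatures, and the
request_timing_t record, are assumptions made for this example and do not
match the real PERUSE callback prototype. Timestamps are passed in as
arguments so the logic can be exercised without a real clock.

```c
#include <time.h>

/* Hypothetical per-request record; not a PERUSE or Open MPI type. */
typedef struct {
    double created_at;  /* seconds, set when the request is created */
    double queued_for;  /* seconds spent waiting, set at the first fragment */
} request_timing_t;

/* Current time in seconds, using C11 timespec_get. This is wall-clock
 * time; a monotonic clock would be a better choice in real code. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Stand-in for the callback fired when the request is created. */
static void on_request_created(request_timing_t *r, double now)
{
    r->created_at = now;
    r->queued_for = 0.0;
}

/* Stand-in for the callback fired when the first fragment reaches the
 * network. Stores and returns the time spent in internal queues. */
static double on_first_fragment_sent(request_timing_t *r, double now)
{
    r->queued_for = now - r->created_at;
    return r->queued_for;
}
```

A tool would call both with now_seconds() and could then compare the
returned delay against whatever limit it considers worth reporting.
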
>         > >
>         > >    george.
>         > >
>         > > On May 27, 2009, at 06:59 , Ralph Castain wrote:
>         > >
>         > > > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
>         > > >
>         > > > #if WANT_NOTIFIER_VERBOSE
>         > > > opal_atomic_increment(counter);
>         > > > if (counter > threshold) {
>         > > >     orte_notifier.api(.....)
>         > > > }
>         > > > #endif
>         > >
>         > > _______________________________________________
>         > > devel mailing list
>         > > de...@open-mpi.org
>         > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>         > >
>         >
>         >
>         --
>         
>         Nadia Derbey <nadia.der...@bull.net>
>         
>         
>         
> 
-- 
Nadia Derbey <nadia.der...@bull.net>
