Re: [OMPI devel] problem in the ORTE notifier framework

Ralph Castain Thu, 28 May 2009 07:57:49 -0400

I agree with Terry here about being careful in pursuing this path. What I
wouldn't want to have happen is to force anyone wanting to be notified of
error events to have to also turn on peruse, which impacts the non-error
code path.


Again, I'm not entirely sure what you are trying to do here. As I understood
the original RFC, it sounded like you wanted to track errors but only report
them when they occurred a controlled number of times (as opposed to every
time). I think this would better be done outside of peruse.
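
As a rough illustration of reporting an error only after a controlled number
of occurrences, here is a minimal C sketch. The names (error_tracker_t,
error_tracker_init, error_tracker_record) are hypothetical and are not part of
ORTE or PERUSE, and reporting on every Nth occurrence is only one possible
policy. The counter is touched only on the error path, so the non-error path
is unaffected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tracker; not an ORTE or PERUSE type. */
typedef struct {
    atomic_uint count;     /* how many times the error has occurred */
    unsigned    threshold; /* report once every `threshold` occurrences */
} error_tracker_t;

static void error_tracker_init(error_tracker_t *t, unsigned threshold)
{
    assert(threshold > 0);
    atomic_init(&t->count, 0);
    t->threshold = threshold;
}

/* Call this from the error path only. Returns true when this
 * occurrence is the one that should be passed on to a notifier. */
static bool error_tracker_record(error_tracker_t *t)
{
    /* atomic_fetch_add returns the previous value, so add 1. */
    unsigned seen = atomic_fetch_add(&t->count, 1) + 1;
    return (seen % t->threshold) == 0;
}
```

A caller would invoke error_tracker_record() where the error is detected and
call into the notifier only when it returns true.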

If you are trying to track normal performance (e.g., trying to alert sys
admins when networks aren't running as fast as they should), then that
probably should be done inside of peruse. However, that will certainly
impact the critical code path, so Terry's caution is a real concern.
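
For the time-based measurement George describes in the quoted message below
(one callback when the request is created, another when the first fragment is
pushed into the network), a minimal C sketch could look like this. The type
and function names are hypothetical and this is not the PERUSE API;
timestamps are passed in as nanoseconds so the sketch does not assume any
particular clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-request timing record; not a PERUSE type. */
typedef struct {
    uint64_t created_ns;     /* when the request was created */
    uint64_t queue_delay_ns; /* time spent waiting before the first send */
    bool     sent;           /* true once the first fragment went out */
} request_timing_t;

/* Would be called from the "request created" callback. */
static void on_request_created(request_timing_t *r, uint64_t now_ns)
{
    r->created_ns = now_ns;
    r->queue_delay_ns = 0;
    r->sent = false;
}

/* Would be called from the "first fragment pushed" callback. */
static void on_first_fragment_sent(request_timing_t *r, uint64_t now_ns)
{
    assert(now_ns >= r->created_ns);
    r->queue_delay_ns = now_ns - r->created_ns;
    r->sent = true;
}

/* True if the request waited longer than limit_ns in internal queues. */
static bool queue_delay_exceeds(const request_timing_t *r, uint64_t limit_ns)
{
    return r->sent && r->queue_delay_ns > limit_ns;
}
```

A tool could call queue_delay_exceeds() after the second callback and alert
only when the delay is above whatever limit the site considers abnormal.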


On Thu, May 28, 2009 at 12:55 AM, Nadia Derbey <nadia.der...@bull.net> wrote:

> On Wed, 2009-05-27 at 14:25 -0400, Jeff Squyres wrote:
> > Excellent points; Ralph and I chatted about this on the phone today --
> > we concur with George.
> >
> > Bull -- would peruse work for you?  I think you mentioned before that
> > it didn't seem attractive to you.
>
> Well, it didn't because, from what I understood, the MPI program needs to
> be changed (register a callback routine for the event, activate the
> event, etc.), and this is something we wanted to avoid.
>
> Now, if we are allowed to
> 1. define new "internal" PERUSE events,
> 2. internally set the associated callback routines
> why not use PERUSE? This, combined with the ORTE notifier framework,
> could do the job, I think.
>
> Regards,
> Nadia
>
> >   I think George's point is that we
> > already have lots of hooks in place in the PML -- and they're called
> > peruse.  So if we could use those hooks, then a) they're run-time
> > selectable already, and b) there's no additional cost in performance
> > critical/not-critical code paths (for the case where these stats are
> > not being collected) because PERUSE has been in the code base for a
> > long time.
> >
> > I think the idea is that your callbacks could be invoked by the peruse
> > hooks and then they can do whatever they want -- increment counters,
> > conditionally invoke the ORTE notifier system, etc.
> >
> >
> >
> > On May 27, 2009, at 11:34 AM, George Bosilca wrote:
> >
> > > What is a generic threshold? And what is a counter? We have a policy
> > > against such coding standards, and to be honest I would like to stick
> > > to it. The reason is that the PML is a very complex piece of code, and
> > > I would like to keep it as easy to understand as possible. If people
> > > start adding #if/#endif all over the code, we are diverging from this
> > > goal.
> > >
> > > The only way to make this work is to call the notifier or some other
> > > framework in this "slow path" and let this other framework do its own
> > > logic to determine what and when to print. Of course the cost of this
> > > is a function call plus an atomic operation (which is already not
> > > cheap). It's starting to get expensive, even for a "slow path", which
> > > in this particular context is just one insertion in an atomic FIFO.
> > >
> > > If, instead of counting the number of times we try to send the fragment,
> > > we switch to a time-based approach, this can be solved with the PERUSE
> > > calls. There is a callback when the request is created, and another
> > > callback when the first fragment is pushed successfully into the
> > > network. Computing the time between these two allows a tool to figure
> > > out how much time the request was waiting in some internal queues, and
> > > therefore how much delay this added to the execution time.
> > >
> > >    george.
> > >
> > > On May 27, 2009, at 06:59 , Ralph Castain wrote:
> > >
> > > > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
> > > >
> > > > #if WANT_NOTIFIER_VERBOSE
> > > > opal_atomic_increment(counter);
> > > > if (counter > threshold) {
> > > >     orte_notifier.api(.....)
> > > > }
> > > > #endif
> > >
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> >
> >
> --
> Nadia Derbey <nadia.der...@bull.net>
>
