On Thu, 2009-05-28 at 05:57 -0600, Ralph Castain wrote:
> I agree with Terry here about being careful in pursuing this path.
> What I wouldn't want to have happen is to force anyone wanting to be
> notified of error events to have to also turn on peruse, which impacts
> the non-error code path.

Agreed, I missed that part!

Regards,
Nadia

>
> Again, I'm not entirely sure what you are trying to do here. As I
> understood the original RFC, it sounded like you wanted to track
> errors but only report them when they occurred a controlled number of
> times (as opposed to every time). I think this would better be done
> outside of peruse.
>
> If you are trying to track normal performance (e.g., trying to alert
> sys admins when networks aren't running as fast as they should), then
> that probably should be done inside of peruse. However, that
> definitely will impact the critical code path, so Terry's caution is
> definitely a concern.
>
>
> On Thu, May 28, 2009 at 12:55 AM, Nadia Derbey <[email protected]> wrote:
> On Wed, 2009-05-27 at 14:25 -0400, Jeff Squyres wrote:
> > Excellent points; Ralph and I chatted about this on the phone today --
> > we concur with George.
> >
> > Bull -- would peruse work for you? I think you mentioned before that
> > it didn't seem attractive to you.
>
> Well, it didn't because from what I understood, the MPI program needs to
> be changed (register a callback routine for the event, activate the
> event, etc), and this is something we wanted to avoid.
>
> Now, if we are allowed to
> 1. define new "internal" PERUSE events,
> 2. internally set the associated callback routines
> why not use peruse? This, combined with the orte notifier framework,
> could do the job I think.
>
> Regards,
> Nadia
>
> > I think George's point is that we
> > already have lots of hooks in place in the PML -- and they're called
> > peruse. So if we could use those hooks, then a) they're run-time
> > selectable already, and b) there's no additional cost in performance
> > critical/not-critical code paths (for the case where these stats are
> > not being collected) because PERUSE has been in the code base for a
> > long time.
> >
> > I think the idea is that your callbacks could be invoked by the peruse
> > hooks and then they can do whatever they want -- increment counters,
> > conditionally invoke the ORTE notifier system, etc.
> >
> >
> >
> > On May 27, 2009, at 11:34 AM, George Bosilca wrote:
> >
> > > What is a generic threshold? And what is a counter? We have a policy
> > > against such coding standards, and to be honest I would like to stick
> > > to it. The reason is that the PML is a very complex piece of code, and
> > > I would like to keep it as easy to understand as possible. If people
> > > start adding #if/#endif all over the code, we are diverging from this
> > > goal.
> > >
> > > The only way to make this work is to call the notifier or some other
> > > framework in this "slow path" and let this other framework do its own
> > > logic to determine what and when to print. Of course the cost of this
> > > is a function call plus an atomic operation (which is already not
> > > cheap). It's starting to get expensive, even for a "slow path", which
> > > in this particular context is just one insertion in an atomic FIFO.
> > >
> > > If, instead of counting the number of times we try to send the fragment,
> > > we switch to a time-based approach, this can be solved with the PERUSE
> > > calls. There is a callback when the request is created, and another
> > > callback when the first fragment is pushed successfully into the
> > > network. Computing the time between these two allows a tool to figure
> > > out how much time the request was waiting in some internal queues, and
> > > therefore how much delay this added to the execution time.
> > >
> > > george.
> > >
> > > On May 27, 2009, at 06:59 , Ralph Castain wrote:
> > >
> > > > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
> > > >
> > > > #if WANT_NOTIFIER_VERBOSE
> > > > opal_atomic_increment(counter);
> > > > if (counter > threshold) {
> > > >     orte_notifier.api(.....)
> > > > }
> > > > #endif
> > >
> > > _______________________________________________
> > > devel mailing list
> > > [email protected]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> --
> Nadia Derbey <[email protected]>

--
Nadia Derbey <[email protected]>
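
For reference, here is a minimal sketch of the counter/threshold idea quoted above, kept outside of peruse as Ralph suggests. It is an illustration under stated assumptions, not Open MPI code: it uses C11 <stdatomic.h> instead of the OPAL atomics, and the names event_counter_t, event_counter_init and event_counter_hit are hypothetical. One difference from the quoted macro: the decision uses the value returned by the atomic add rather than re-reading the counter afterwards, so two threads incrementing at the same time each see their own count.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-event counter; not an Open MPI type. */
typedef struct {
    atomic_uint  count;      /* number of occurrences seen so far */
    unsigned int threshold;  /* report only once count exceeds this */
} event_counter_t;

/* Set the counter to zero and record the reporting threshold. */
static void event_counter_init(event_counter_t *ec, unsigned int threshold)
{
    atomic_init(&ec->count, 0u);
    ec->threshold = threshold;
}

/*
 * Record one occurrence of the event. Returns true when the caller
 * should report it, i.e. when this occurrence is past the threshold.
 * atomic_fetch_add returns the value before the add, so +1 gives the
 * count that includes this occurrence.
 */
static bool event_counter_hit(event_counter_t *ec)
{
    unsigned int seen = atomic_fetch_add(&ec->count, 1u) + 1u;
    return seen > ec->threshold;
}
```

On the slow path the caller would wrap the notifier call in `if (event_counter_hit(&ec)) { ... }`. The cost is the one George describes: a function call plus one atomic operation per occurrence.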
