I think it depends on what is being monitored. As I understand it, we could use the peruse hooks to generate notifications based on the number of times someone calls "MPI_Send", for example. I share George's concerns about performance in this area and agree that using the peruse hooks makes sense there.
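As a rough sketch of the counting case: a callback of the kind a hook could invoke keeps a per-event counter and raises a notification once a threshold is reached. Everything named here (`event_stats_t`, `on_send_event`, `notify_stub`, `simulate_sends`) is hypothetical; this does not use the real PERUSE or ORTE notifier APIs, and the hook firing is simulated with a plain loop.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-event state a tool would hand to its callback. */
typedef struct {
    const char *name;         /* label used in the notification */
    unsigned long count;      /* how many times the event fired */
    unsigned long threshold;  /* notify once count reaches this */
    unsigned long notified;   /* number of notifications sent */
} event_stats_t;

/* Stand-in for whatever notification mechanism is actually used. */
static void notify_stub(const event_stats_t *s)
{
    printf("notify: %s called %lu times\n", s->name, s->count);
}

/* Callback body: count, then notify exactly once at the threshold. */
static void on_send_event(void *param)
{
    event_stats_t *s = (event_stats_t *)param;
    assert(s != NULL);
    s->count++;
    if (s->count == s->threshold) {
        s->notified++;
        notify_stub(s);
    }
}

/* Simulate the hook firing n times; returns notifications sent. */
static unsigned long simulate_sends(unsigned long n, unsigned long threshold)
{
    event_stats_t stats = { "MPI_Send", 0, threshold, 0 };
    for (unsigned long i = 0; i < n; i++) {
        on_send_event(&stats);
    }
    return stats.notified;
}
```

The point of the shape is that all of the policy (what to count, when to notify) lives in the callback and its state, so nothing has to be added to the calling code path itself.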
However, if one wants to generate a notification when a possibly non-fatal error occurs (e.g., too many IB retries), but only wants that notification to go out once every N occurrences, then I don't think the peruse option will work. In this scenario, though, I don't think performance is an issue any longer: this code path would only be followed when tracking errors, and thus can afford to be slower.

So I think a combination of the two approaches makes the most sense. All the ORTE_NOTIFIER_VERBOSE method does is provide a means of enabling the second option in a configure-it-in/out way that is fairly clean, as it just mirrors the current OPAL_OUTPUT_VERBOSE methodology. Using peruse for the first option sounds like a reasonable approach.

HTH
Ralph

On Wed, May 27, 2009 at 12:25 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Excellent points; Ralph and I chatted about this on the phone today -- we
> concur with George.
>
> Bull -- would peruse work for you? I think you mentioned before that it
> didn't seem attractive to you. I think George's point is that we already
> have lots of hooks in place in the PML -- and they're called peruse. So if
> we could use those hooks, then a) they're run-time selectable already, and
> b) there's no additional cost in performance critical/not-critical code
> paths (for the case where these stats are not being collected) because
> PERUSE has been in the code base for a long time.
>
> I think the idea is that your callbacks could be invoked by the peruse
> hooks and then they can do whatever they want -- increment counters,
> conditionally invoke the ORTE notifier system, etc.
>
> On May 27, 2009, at 11:34 AM, George Bosilca wrote:
>
>> What is a generic threshold? And what is a counter? We have a policy
>> against such coding standards, and to be honest I would like to stick
>> to it. The reason is that the PML is a very complex piece of code, and
>> I would like to keep it as easy to understand as possible.
>> If people start adding #if/#endif all over the code, we are diverging
>> from this goal.
>>
>> The only way to make this work is to call the notifier or some other
>> framework in this "slow path" and let this other framework do its own
>> logic to determine what and when to print. Of course the cost of this
>> is a function call plus an atomic operation (which is already not
>> cheap). It's starting to get expensive, even for a "slow path", which
>> in this particular context is just one insertion in an atomic FIFO.
>>
>> If, instead of counting the number of times we try to send the
>> fragment, we switch to a time-based approach, this can be solved with
>> the PERUSE calls. There is a callback when the request is created, and
>> another callback when the first fragment is pushed successfully into
>> the network. Computing the time between these two allows a tool to
>> figure out how long the request was waiting in some internal queues,
>> and therefore how much delay this added to the execution time.
>>
>> george.
>>
>> On May 27, 2009, at 06:59 , Ralph Castain wrote:
>>
>> > ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
>> >
>> > #if WANT_NOTIFIER_VERBOSE
>> >     opal_atomic_increment(counter);
>> >     if (counter > threshold) {
>> >         orte_notifier.api(.....)
>> >     }
>> > #endif
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> Cisco Systems
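For illustration, here is a minimal self-contained sketch of the compile-it-in/out idea behind the quoted ORTE_NOTIFIER_VERBOSE pseudocode. It makes several assumptions: C11 `atomic_fetch_add` stands in for `opal_atomic_increment`, `notify_stub` is a hypothetical stand-in for an `orte_notifier` call, and the macro name `NOTIFIER_VERBOSE_SKETCH` and helper `simulate_retries` are made up for the example. It also notifies on every threshold-th occurrence (a modulo test), which is one reading of the "only every so many times" requirement; the quoted pseudocode tests `counter > threshold` instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Build with -DWANT_NOTIFIER_VERBOSE=0 to compile the macro out. */
#ifndef WANT_NOTIFIER_VERBOSE
#define WANT_NOTIFIER_VERBOSE 1
#endif

/* Counts how often the hypothetical notifier stand-in was invoked. */
static atomic_ulong notify_calls;

/* Hypothetical stand-in for an orte_notifier call. */
static void notify_stub(const char *msg, unsigned long seen)
{
    assert(msg != NULL);
    atomic_fetch_add(&notify_calls, 1);
    printf("notify: %s (seen %lu times)\n", msg, seen);
}

#if WANT_NOTIFIER_VERBOSE
/* Bump the counter atomically; notify on every threshold-th occurrence. */
#define NOTIFIER_VERBOSE_SKETCH(counter, threshold, msg)                \
    do {                                                                \
        unsigned long seen_ = atomic_fetch_add(&(counter), 1) + 1;      \
        if ((threshold) > 0 && seen_ % (threshold) == 0) {              \
            notify_stub((msg), seen_);                                  \
        }                                                               \
    } while (0)
#else
/* Compiled out: no counter update and no call in the code path. */
#define NOTIFIER_VERBOSE_SKETCH(counter, threshold, msg) do { } while (0)
#endif

/* Simulate an error path hit n times; returns notifications sent. */
static unsigned long simulate_retries(unsigned long n, unsigned long threshold)
{
    atomic_ulong retries = 0;
    unsigned long before = atomic_load(&notify_calls);
    for (unsigned long i = 0; i < n; i++) {
        NOTIFIER_VERBOSE_SKETCH(retries, threshold, "IB retry");
    }
    return atomic_load(&notify_calls) - before;
}
```

With the default setting, ten simulated retries and a threshold of three produce notifications on the 3rd, 6th, and 9th occurrence. The cost when compiled in is the one atomic operation plus a conditional function call, which matches the trade-off discussed in the thread.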