On Tue, 2009-05-26 at 05:35 -0600, Ralph Castain wrote:
> Hi Nadia
>
> We actually have a framework in the system for this purpose, though it
> might require some minor modifications to do precisely what you
> describe. It is the ORTE "notifier" framework - you will find it at
> orte/mca/notifier. There are several components, each of which
> supports a different notification mechanism (e.g., message into the
> sys log, smtp, and even "twitter").
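To make the component idea above concrete, here is a minimal sketch of
the pluggable-module pattern such a framework follows. Every identifier
below is an illustrative assumption, not the actual orte/mca/notifier
API:

    /* Hedged sketch of a pluggable notifier; none of these names come
     * from the real ORTE code base.  Each component exports a module
     * of function pointers, and the framework dispatches to whichever
     * component was selected at startup.                              */
    #include <stdarg.h>
    #include <syslog.h>

    typedef void (*notifier_log_fn_t)(int severity, const char *fmt,
                                      va_list ap);

    typedef struct {
        const char       *name;   /* component name, e.g. "syslog"     */
        notifier_log_fn_t log;    /* deliver one notification          */
    } notifier_module_t;

    /* One possible component: write the message to the system log.   */
    static void syslog_log(int severity, const char *fmt, va_list ap)
    {
        vsyslog(severity, fmt, ap);
    }

    static notifier_module_t syslog_module = { "syslog", syslog_log };

    /* The single entry point call sites use; which module sits behind
     * it is decided at startup (syslog, smtp, twitter, ...).          */
    static notifier_module_t *selected = &syslog_module;

    static void notifier_log(int severity, const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        selected->log(severity, fmt, ap);
        va_end(ap);
    }

    int main(void)
    {
        notifier_log(LOG_WARNING, "IB send retries exceeded %d", 10);
        return 0;
    }

An smtp or twitter component would simply plug a different log function
in behind the same entry point.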
Ralph,

Thanks a lot for your detailed answer. I'll have a look at the notifier
framework to see if it could serve our purpose. Actually, from what you
describe, it looks like it does.

Regards,
Nadia

> The system works by adding orte_notifier calls to the OMPI code
> wherever we deem it advisable to alert someone. For example, if we
> think a sys admin might want to be alerted when the number of IB send
> retries exceeds some limit, we add a call to orte_notifier to the IB
> code with:
>
>     if (num_retries > threshold) {
>         orte_notifier.xxx(....);
>     }
>
> I believe we could easily extend this to support your proposed
> functionality. A couple of possibilities that immediately spring to
> mind would be:
>
> 1. You could create a new component (or we could modify the existing
> ones) that tracks how many times it is called for a given error, and
> only actually issues a notification for that specific error when the
> count exceeds a threshold. The negative to this approach is that the
> threshold would be uniform across all errors.
>
> 2. We could extend the current notifier APIs to add a threshold count
> upon which the notification is to be sent, perhaps creating a new
> macro ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its
> arguments. We could then let each OMPI framework have a new
> "threshold" MCA param, thus allowing the sys admins to "tune" the
> frequency of error reporting by framework. Of course, we could let
> them get as detailed here as they want - they could even have
> "threshold" params for each component, function, or whatever. This
> would be combined with #1 above to alert only when the count exceeded
> the threshold for that specific error message.
>
> I'm sure you and others will come up with additional (probably better)
> ways of implementing this extension. My point here was simply to
> ensure you knew that the basic mechanism already exists, and to
> stimulate some thought as to how to use it for your proposed purpose.
>
> I would be happy to help you do so, as this is something we (LANL)
> have put at a high priority - our sys admins on the large clusters
> really need the help.
>
> HTH
> Ralph
>
>
> On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <nadia.der...@bull.net> wrote:
>
> What: Warn the administrator when unusual events are occurring too
> frequently.
>
> Why: Such unusual events might be the symptom of some problem that
> can easily be fixed (by better tuning, for example).
>
> Where: Adds a new OMPI framework.
>
> -------------------------------------------------------------------
>
> Description:
>
> The objective of the Open MPI library is to make applications run to
> completion, given that no fatal error is encountered.
>
> In some situations, unusual events may occur. Since these events are
> not considered fatal, the library arbitrarily chooses to work around
> them in software instead of actually stopping the application. But
> even though this choice helps the application run to completion, it
> may result in significant performance degradation. This is not an
> issue if such “unusual events” don't occur too frequently. But if
> they do, that might be the symptom of a real problem that could
> sometimes be easily avoided.
>
> For example, when mca_pml_ob1_send_request_start() starts a send
> request and faces a resource shortage, it silently calls
> add_request_to_send_pending() to queue that send request into the
> list of pending send requests, in order to process it later on.
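In spirit, that code path behaves like the following self-contained
sketch; the identifiers are simplified stand-ins, not the real ob1
source:

    /* Hedged paraphrase of the pattern above: on resource shortage the
     * send is silently queued for later instead of failing -- exactly
     * the kind of "unusual event" this RFC proposes to count.  All
     * identifiers here are stand-ins, not the real ob1 code.          */
    #include <stdio.h>

    #define OUT_OF_RESOURCE (-2)

    static unsigned long pending_count;  /* how often we had to queue */

    static int try_start_send(int request)
    {
        (void)request;
        return OUT_OF_RESOURCE;          /* pretend no send credits   */
    }

    static void add_request_to_pending(int request)
    {
        (void)request;
        pending_count++;                 /* the event worth counting  */
    }

    static void start_send(int request)
    {
        if (OUT_OF_RESOURCE == try_start_send(request)) {
            /* Silent fallback: the application never sees this. */
            add_request_to_pending(request);
        }
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++) {
            start_send(i);
        }
        printf("queued %lu sends due to resource shortage\n",
               pending_count);
        return 0;
    }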
> If an adaptive mechanism is not provided at runtime to increase the
> receive queue length, at least a message can be sent to the
> administrator, letting him do the tuning by hand before the next run.
>
> We had a look at other tracing utilities (like PMPI, PERUSE, VT), but
> found them either too high level or too intrusive at the application
> level.
>
> The “diagnostic framework” we'd like to propose would help capture
> such “unusual events” and trace them, while having a very low impact
> on performance. This is obtained by defining tracing routines that
> can be called from the OMPI code. The collected events are aggregated
> per MPI process and only traced once a threshold has been reached.
> Another threshold (a time threshold) can be used to condition the
> generation of subsequent traces for an already traced event.
>
> This is obtained by defining two MCA parameters and a rule:
>   . the count threshold C
>   . the time delay T
> The rule is: an event will only be traced once it has happened C
> times, and it won't be traced more than once every T seconds.
>
> Thus, events happening at a very low rate will never generate a
> trace, except for a single summary at MPI_Finalize:
>
>   [time] At finalize : 23 times : pre-allocated buffers all full, calling malloc
>
> Those happening "a little too much" will occasionally generate a
> trace saying something like:
>
>   [time] 1000 warnings : could not send in openib now, delaying
>   [time+12345 sec] 1000 warnings : could not send in openib now, delaying
>
> And events occurring at a high frequency will only generate a message
> every T seconds, saying:
>
>   [time] 1000 warnings : adding buffers in the SRQ
>   [time+T] 1,234,567 warnings (in T seconds) : adding buffers in the SRQ
>   [time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ
>
> The count threshold and time delay are defined per event. They can
> also be set as MCA parameters; in that case, the MCA parameter value
> overrides the per-event values.
>
> The following information is traced too:
>   . the job family
>   . the local job id
>   . the job vpid
>
> Another aspect of the performance savings is that a mechanism à la
> show_help() can be used to let the HNP actually do the job.
>
> We have started implementing this feature, so patches are available
> if needed. We are currently trying to set up hgweb on an external
> server.
>
> Since I'm an Open MPI newbie, I'm submitting this RFC to get your
> opinion on its usefulness, or even to find out whether an existing
> mechanism already does this job.
>
> Regards,
> Nadia
>
> --
> Nadia Derbey <nadia.der...@bull.net>

--
Nadia Derbey <nadia.der...@bull.net>
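For illustration, here is a minimal, self-contained sketch of the
count/time rule proposed in the RFC above. Every identifier is
hypothetical, and this is not the actual patch mentioned in the thread:

    /* Hedged sketch of the proposed rule: an event is traced only once
     * its count reaches the threshold C, and no more than once every T
     * seconds thereafter.  Names and layout are hypothetical.         */
    #include <stdio.h>
    #include <time.h>

    typedef struct {
        const char   *msg;        /* human-readable event description  */
        unsigned long count;      /* occurrences since job start       */
        unsigned long traced;     /* value of count at the last trace  */
        time_t        last_trace; /* when the last trace was emitted   */
        unsigned long C;          /* count threshold (MCA-overridable) */
        double        T;          /* time delay, in seconds            */
    } diag_event_t;

    static void diag_event(diag_event_t *ev)
    {
        time_t now = time(NULL);

        ev->count++;
        if (ev->count < ev->C) {
            return;               /* too rare so far: stay silent      */
        }
        if (ev->traced != 0 && difftime(now, ev->last_trace) < ev->T) {
            return;               /* rate-limit: one trace per T secs  */
        }
        printf("[%ld] %lu warnings : %s\n",
               (long)now, ev->count - ev->traced, ev->msg);
        ev->traced     = ev->count;
        ev->last_trace = now;
    }

    /* At MPI_Finalize time, summarize events that stayed under C.     */
    static void diag_finalize(const diag_event_t *ev)
    {
        if (ev->count > 0 && ev->count < ev->C) {
            printf("[%ld] At finalize : %lu times : %s\n",
                   (long)time(NULL), ev->count, ev->msg);
        }
    }

    int main(void)
    {
        diag_event_t srq = { "adding buffers in the SRQ",
                             0, 0, 0, 1000, 60.0 };
        for (int i = 0; i < 5000; i++) {
            diag_event(&srq);
        }
        diag_finalize(&srq);
        return 0;
    }

Tracking traced alongside count lets each trace report only the
warnings accumulated since the previous one, matching the sample
output in the RFC.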