What: Warn the administrator when unusual events are occurring too frequently.
Why: Such unusual events might be the symptom of a problem that can easily be fixed (by better tuning, for example)
Where: Adds a new ompi framework
-------------------------------------------------------------------

Description:

The objective of the Open MPI library is to make applications run to completion, provided that no fatal error is encountered. In some situations, unusual events may occur. Since these events are not considered fatal, the library silently works around them with a software mechanism instead of actually stopping the application. But even though this choice helps the application complete, it may result in significant performance degradation. This is not an issue if such "unusual events" don't occur too frequently. But if they actually do, that might be the sign of a real problem that could sometimes be easily avoided.

For example, when mca_pml_ob1_send_request_start() starts a send request and faces a resource shortage, it silently calls add_request_to_send_pending() to queue that send request on the list of pending send requests, so that it can be processed later on. If no adaptive mechanism is provided at runtime to increase the receive queue length, a message can at least be sent to the administrator, letting them do the tuning by hand before the next run.

We had a look at other tracing utilities (like PMPI, PERUSE, VT), but found them either too high level or too intrusive at the application level.

The "diagnostic framework" we'd like to propose would help capture such "unusual events" and trace them, while having a very low impact on performance. This is obtained by defining tracing routines that can be called from the ompi code. The collected events are aggregated per MPI process and only traced once a threshold has been reached. Another threshold (a time threshold) can be used to condition the generation of subsequent traces for an already traced event.

This is obtained by defining 2 mca parameters and a rule:
. the count threshold C
. the time delay T
The rule is: an event will only be traced if it happened C times, and it won't be traced more than once every T seconds (a rough sketch of such a routine is given at the end of this description).

Thus, events happening at a very low rate will never generate a trace, except one at MPI_Finalize summarizing:
[time] At finalize : 23 times : pre-allocated buffers all full, calling malloc

Those happening "a little too much" will sometimes generate a trace saying something like:
[time] 1000 warnings : could not send in openib now, delaying
[time+12345 sec] 1000 warnings : could not send in openib now, delaying

And events occurring at a high frequency will only generate a message every T seconds, saying:
[time] 1000 warnings : adding buffers in the SRQ
[time+T] 1,234,567 warnings (in T seconds) : adding buffers in the SRQ
[time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ

The count threshold and time delay are defined per event. They can also be defined as MCA parameters; in that case, the MCA parameter value overrides the per-event values.

The following information is traced too:
. the job family
. the local job id
. the job vpid

Another aspect of the performance savings is that a mechanism à la show_help() can be used to let the HNP actually do the tracing job.

We have started implementing this feature, so patches are available if needed. We are currently trying to set up hgweb on an external server.
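To make this more concrete, here is a rough sketch (in C) of what the per-event bookkeeping and the tracing routine could look like. All the names in it (ompi_diag_event_t, ompi_diag_trace(), ompi_diag_finalize()) are hypothetical and only illustrate the count threshold / time delay rule described above; the actual patches may of course differ.

/*
 * Hypothetical sketch of the proposed tracing routine: each event
 * carries its own count threshold and time delay, which the MCA
 * parameters would override when set.
 */
#include <stdio.h>
#include <time.h>

typedef struct {
    const char *msg;             /* human readable description       */
    unsigned long count;         /* occurrences since the last trace */
    unsigned long total;         /* occurrences since MPI_Init       */
    unsigned long count_thresh;  /* count threshold C                */
    time_t time_delay;           /* time delay T (in seconds)        */
    time_t last_trace;           /* when the last trace was emitted  */
} ompi_diag_event_t;

/* Called from the ompi code each time the unusual event occurs. */
static void ompi_diag_trace(ompi_diag_event_t *ev)
{
    ev->count++;
    ev->total++;

    /* The rule, part 1: trace only once the event has happened C
     * times.  The time() call is deferred until then, to keep the
     * common path as cheap as possible. */
    if (ev->count >= ev->count_thresh) {
        time_t now = time(NULL);

        /* The rule, part 2: no more than one trace every T seconds.
         * Meanwhile the count keeps accumulating, which produces the
         * growing "N warnings (in T seconds)" messages shown above. */
        if (now - ev->last_trace >= ev->time_delay) {
            fprintf(stderr, "[%ld] %lu warnings : %s\n",
                    (long) now, ev->count, ev->msg);
            ev->count = 0;
            ev->last_trace = now;
        }
    }
}

/* Called from MPI_Finalize: summarize the events that never reached
 * the count threshold, as in the "23 times" example above. */
static void ompi_diag_finalize(ompi_diag_event_t *ev)
{
    if (ev->total > 0) {
        fprintf(stderr, "[%ld] At finalize : %lu times : %s\n",
                (long) time(NULL), ev->total, ev->msg);
    }
}

A caller would then, for instance, declare a static ompi_diag_event_t for the "send request queued" event and call ompi_diag_trace() from add_request_to_send_pending(); when nothing is traced, the cost on the fast path is just the two increments and one comparison.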
Since I'm an Open MPI newbie, I'm submitting this RFC to get your opinion on its usefulness, or even to find out whether an existing mechanism already does this job.

Regards,
Nadia

--
Nadia Derbey <nadia.der...@bull.net>