What: Warn the administrator when unusual events occur too frequently.

Why: Such unusual events might be the symptom of a problem that can
easily be fixed (by better tuning, for example).

Where: Adds a new ompi framework

-------------------------------------------------------------------

Description:

The objective of the Open MPI library is to make applications run to
completion, provided no fatal error is encountered.
In some situations, unusual events may occur. Since these events are not
considered fatal, the library chooses to work around them in software
instead of actually stopping the application. But even though this
choice helps the application complete, it may result in significant
performance degradation. This is not an issue if such “unusual events”
don't occur too frequently. But if they do, that may be the symptom of a
real problem that could sometimes easily be avoided.

For example, when mca_pml_ob1_send_request_start() starts a send request
and faces a resource shortage, it silently calls
add_request_to_send_pending() to queue that send request into the list
of pending send requests, so that it can be processed later on. If no
adaptive mechanism is provided at runtime to increase the receive queue
length, a message can at least be sent to the administrator, letting
them do the tuning by hand before the next run.
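
As an illustration, a hypothetical hook at that point could look like
the sketch below. OMPI_DIAG_TRACE() and its event id are invented names
for the proposed tracing entry point, and the surrounding logic
paraphrases the existing resource-shortage path rather than quoting it:

    /* Paraphrase of the ob1 resource-shortage path (not verbatim
     * code): the request is silently queued for later processing.
     * OMPI_DIAG_TRACE() and DIAG_OB1_SEND_PENDING are hypothetical
     * names for the proposed tracing entry point. */
    if (OMPI_ERR_OUT_OF_RESOURCE == rc) {
        OMPI_DIAG_TRACE(DIAG_OB1_SEND_PENDING,
                        "could not start send request now, delaying");
        add_request_to_send_pending(sendreq,
                                    MCA_PML_OB1_SEND_PENDING_START,
                                    true);
    }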

We had a look at other tracing utilities (PMPI, PERUSE, VT), but found
them either too high-level or too intrusive at the application level.

The “diagnostic framework” we'd like to propose would help capture such
“unusual events” and trace them, while having a very low impact on
performance. This is obtained by defining tracing routines that can be
called from the ompi code. The collected events are aggregated per MPI
process and only traced once a threshold has been reached. Another
threshold (a time threshold) can be used to condition the generation of
subsequent traces for an already traced event.

This is obtained by defining 2 MCA parameters and a rule:
. the count threshold C
. the time delay T
The rule is: an event will only be traced if it has happened C times,
and it won't be traced more than once every T seconds.
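
As a minimal sketch of that rule (all names here are illustrative, not
the framework's actual symbols; time() granularity is just for the
example):

    #include <stdio.h>
    #include <time.h>

    /* Illustrative per-event bookkeeping for the count/time rule. */
    typedef struct {
        const char   *msg;        /* message to trace                 */
        unsigned long count;      /* occurrences since the last trace */
        unsigned long threshold;  /* count threshold C                */
        time_t        delay;      /* time delay T, in seconds         */
        time_t        last;       /* time of the last emitted trace   */
    } diag_event_t;

    /* Count one occurrence of the event; emit a trace only once C
     * occurrences have accumulated AND at least T seconds have
     * elapsed since the previous trace. */
    static void diag_trace(diag_event_t *ev)
    {
        time_t now = time(NULL);

        ev->count++;
        if (ev->count >= ev->threshold &&
            (now - ev->last) >= ev->delay) {
            printf("[%ld] %lu warnings : %s\n",
                   (long) now, ev->count, ev->msg);
            ev->count = 0;
            ev->last  = now;
        }
    }

Resetting the count after each trace is what produces the "N warnings
(in T seconds)" lines shown below.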

Thus, events happening at a very low rate will never generate a trace,
except for a single summary at MPI_Finalize:
[time] At finalize : 23 times : pre-allocated buffers all full, calling
malloc

Those happening "a little too much" will sometimes generate a trace
saying something like:
[time] 1000 warnings : could not send in openib now, delaying
[time+12345 sec] 1000 warnings : could not send in openib now, delaying

And events occurring at a high frequency will only generate a message
every T seconds saying:
[time]     1000 warnings : adding buffers in the SRQ
[time+T]   1,234,567 warnings (in T seconds) : adding buffers in the SRQ
[time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ

The count threshold and time delay are defined per event.
They can also be defined as MCA parameters; in that case, the MCA
parameter value overrides the per-event values.
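
For instance, with the registration API of that era
(mca_base_param_reg_int() is the existing routine; the component
handle, parameter name and the negative "unset" sentinel are
assumptions of this sketch):

    #include "opal/mca/base/mca_base_param.h"

    static int mca_count_threshold = -1;

    /* Register a framework-wide count threshold; -1 is our "unset"
     * sentinel, meaning the per-event values are kept.
     * mca_diag_component is assumed to be the new framework's
     * mca_base_component_t. */
    static void diag_register_params(void)
    {
        mca_base_param_reg_int(&mca_diag_component,
                               "count_threshold",
                               "Occurrences of an event before it is traced",
                               false, false, -1, &mca_count_threshold);
    }

    /* Apply the override rule when an event is registered: the MCA
     * parameter, when set, wins over the per-event value. */
    static void diag_apply_overrides(diag_event_t *ev)
    {
        if (mca_count_threshold >= 0) {
            ev->threshold = (unsigned long) mca_count_threshold;
        }
    }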

The following information is traced too:
  . the job family
  . the local job id
  . the job vpid

Another aspect of the performance savings is that a mechanism à la
show_help() can be used to let the HNP actually do the job.
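
For example (the help file and topic names are invented for this
sketch; orte_show_help() is the existing routine whose output is
forwarded to the HNP):

    #include "orte/util/show_help.h"

    static void diag_emit(const diag_event_t *ev)
    {
        /* "help-ompi-diag.txt" and the topic are hypothetical names;
         * the orte_show_help() machinery forwards the message to the
         * HNP, which prints it and aggregates duplicates across
         * processes. */
        orte_show_help("help-ompi-diag.txt", "event-threshold-reached",
                       true, ev->msg, ev->count);
    }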

We have started implementing this feature, so patches are available if
needed. We are currently trying to set up hgweb on an external server.

Since I'm an Open MPI newbie, I'm submitting this RFC to get your
opinion about its usefulness, or even to find out whether an existing
mechanism already does this job.

Regards,
Nadia

-- 
Nadia Derbey <nadia.der...@bull.net>
