Re: [OMPI devel] RFC: Diagnostic framework for MPI

Ralph Castain Tue, 26 May 2009 07:35:08 -0400

Hi Nadia

We actually have a framework in the system for this purpose, though it might
require some minor modifications to do precisely what you describe. It is
the ORTE "notifier" framework - you will find it at orte/mca/notifier. There
are several components, each of which supports a different notification
mechanism (e.g., message into the sys log, smtp, and even "twitter").

The system works by adding orte_notifier calls to the OMPI code wherever we
deem it advisable to alert someone. For example, if we think a sys admin
might want to be alerted when the number of IB send retries exceeds some
limit, we add a call to orte_notifier to the IB code with:

if (num_retries > threshold) {
    /* xxx stands for whichever notifier call applies; arguments elided */
    orte_notifier.xxx(....);
}

I believe we could easily extend this to support your proposed
functionality. A couple of possibilities that immediately spring to mind
would be:

1. you could create a new component (or we could modify the existing ones)
that tracks how many times it is called for a given error, and only actually
issues a notification for that specific error when the count exceeds a
threshold. The drawback of this approach is that the threshold would be
uniform across all errors.

2. we could extend the current notifier APIs to add a threshold count upon
which the notification is to be sent, perhaps creating a new macro
ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its arguments. We
could then let each OMPI framework have a new "threshold" MCA param, thus
allowing the sys admins to "tune" the frequency of error reporting by
framework. Of course, we could let them get as detailed here as you want -
they could even have "threshold" params for each component, function, or
whatever. This would be combined with #1 above to alert only when the count
exceeded the threshold for that specific error message.
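
To make the two ideas above a bit more concrete, here is a minimal sketch in C
of a per-error counter that only notifies once its own threshold is exceeded.
All names (error_counter_t, report_error, notify_stub) are hypothetical and are
not part of the ORTE notifier API; notify_stub merely stands in for the actual
orte_notifier call, and resetting the count after a notification is a choice
made for this sketch rather than something settled above.

```c
#include <stdio.h>

/* Hypothetical per-error record; none of these names exist in ORTE. */
typedef struct {
    const char *msg;          /* text of the error message */
    unsigned long count;      /* occurrences since the last notification */
    unsigned long threshold;  /* notify only once count exceeds this value */
} error_counter_t;

/* Number of notifications issued so far (kept only for illustration). */
static int notifications_sent = 0;

/* Stand-in for the real notifier call, whose name and arguments are
 * not specified here. */
static void notify_stub(const char *msg, unsigned long count)
{
    notifications_sent++;
    fprintf(stderr, "notify: %s (seen %lu times)\n", msg, count);
}

/* Record one occurrence of the error. A notification is issued only when
 * the count exceeds the threshold for this specific error; the count is
 * then reset (an assumption of this sketch).
 * Returns 1 if a notification was issued, 0 otherwise. */
static int report_error(error_counter_t *ec)
{
    ec->count++;
    if (ec->count > ec->threshold) {
        notify_stub(ec->msg, ec->count);
        ec->count = 0;
        return 1;
    }
    return 0;
}
```

With a threshold of 2, for instance, the first two calls to report_error()
stay silent and the third one notifies. The threshold field is where a
per-framework (or per-component) "threshold" MCA param value could be plugged
in, which is what removes the uniform-threshold limitation of option 1.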

I'm sure you and others will come up with additional (probably better) ways
of implementing this extension. My point here was simply to ensure you knew
that the basic mechanism already exists, and to stimulate some thought as to
how to use it for your proposed purpose.

I would be happy to help you do so as this is something we (LANL) have put
at a high priority - our sys admins on the large clusters really need the
help.

HTH
Ralph

On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <[email protected]> wrote:

> What: Warn the administrator when unusual events are occurring too
> frequently.
>
> Why: Such unusual events might be the symptom of some problem that can
> easily be fixed (by better tuning, for example).
>
> Where: Adds a new ompi framework
>
> -------------------------------------------------------------------
>
> Description:
>
> The objective of the Open MPI library is to make applications run to
> completion, given that no fatal error is encountered.
> In some situations, unusual events may occur. Since these events are not
> considered to be fatal enough, the library arbitrarily chooses to bypass
> them using a software mechanism, instead of actually stopping the
> application. But even though this choice helps in completing the
> application, it may frequently result in significant performance
> degradation. This is not an issue if such “unusual events” don't occur
> too frequently. But if they actually do, that might be representative of
> a real problem that could sometimes be easily avoided.
>
> For example, when mca_pml_ob1_send_request_start() starts a send request
> and faces a resource shortage, it silently calls
> add_request_to_send_pending() to queue that send request into the list
> of pending send requests in order to process it later on. If no adaptive
> mechanism is provided at runtime to increase the receive queue
> length, at least a message can be sent to the administrator to let him
> do the tuning by hand before the next run.
>
> We had a look at other tracing utilities (like PMPI, PERUSE, VT), but
> found them either too high level or too intrusive at the application
> level.
>
> The “diagnostic framework” we'd like to propose would help capture
> such “unusual events” and trace them, while having a very low impact
> on performance. This is obtained by defining tracing routines that
> can be called from the ompi code. The collected events are aggregated
> per MPI process and only traced if a threshold has been reached. Another
> threshold (time threshold) can be used to condition the generation of
> subsequent traces for an already traced event.
>
> This is obtained by defining 2 MCA parameters and a rule:
> . the count threshold C
> . the time delay T
> The rule is: an event will only be traced if it happened C times, and it
> won't be traced more than once every T seconds.
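
The rule above could be sketched roughly as follows in C. This is only an
illustration, not the code from the patches mentioned in the RFC:
diag_event_t and diag_event_record are hypothetical names, the current time is
passed in by the caller, and resetting the count after each trace is an
assumption based on the sample traces in the RFC. The summary printed at
MPI_Finalize is left out.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical per-event state; not taken from the actual patches. */
typedef struct {
    const char *msg;                /* event description */
    unsigned long count;            /* occurrences since the last trace */
    unsigned long count_threshold;  /* C: minimum count before tracing */
    double delay_sec;               /* T: minimum seconds between traces */
    time_t last_trace;              /* time of the previous trace */
    int traced_once;                /* 0 until the first trace is emitted */
} diag_event_t;

/* Record one occurrence at time `now`. A trace is emitted only if the
 * event has happened at least C times since the last trace and, when a
 * trace was already emitted, at least T seconds have passed since then.
 * Returns 1 if a trace was emitted, 0 otherwise. */
static int diag_event_record(diag_event_t *ev, time_t now)
{
    ev->count++;
    if (ev->count < ev->count_threshold) {
        return 0;  /* not frequent enough yet */
    }
    if (ev->traced_once && difftime(now, ev->last_trace) < ev->delay_sec) {
        return 0;  /* traced too recently; keep accumulating */
    }
    printf("[%ld] %lu warnings : %s\n", (long)now, ev->count, ev->msg);
    ev->last_trace = now;
    ev->traced_once = 1;
    ev->count = 0;
    return 1;
}
```

With C = 3 and T = 10, three occurrences at the same instant produce one
trace, and further occurrences stay silent until at least 10 seconds have
passed, at which point the accumulated count is reported in a single line.
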
>
> Thus, events happening at a very low rate will never generate a trace
> except one at MPI_Finalize summarizing:
> [time] At finalize : 23 times : pre-allocated buffers all full, calling
> malloc
>
> Those happening "a little too much" will sometimes generate a trace
> saying something like:
> [time] 1000 warnings : could not send in openib now, delaying
> [time+12345 sec] 1000 warnings : could not send in openib now, delaying
>
> And events occurring at a high frequency will only generate a message
> every T seconds saying:
> [time]     1000 warnings : adding buffers in the SRQ
> [time+T]   1,234,567 warnings (in T seconds) : adding buffers in the SRQ
> [time+2*T] 2,345,678 warnings (in T seconds) : adding buffers in the SRQ
>
> The count threshold and time delay are defined per event.
> They can also be defined as MCA parameters. In that case, the MCA
> parameter value overrides the per-event values.
>
> The following information is traced too:
>  . job family
>  . the local job id
>  . the job vpid
>
> Another aspect of performance savings is that a mechanism à la
> show_help() can be used in order to let the HNP actually do the job.
>
> We have started implementing this feature, so patches are available if
> needed. We are currently trying to set up hgweb on an external server.
>
> Since I'm an Open MPI newbie, I'm submitting this RFC to have your
> opinion about its usefulness, or even to know if there's an already
> existing mechanism to do this job.
>
> Regards,
> Nadia
>
> --
> Nadia Derbey <[email protected]>
>
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
