On Tue, 2009-05-26 at 05:35 -0600, Ralph Castain wrote:
> Hi Nadia
> 
> We actually have a framework in the system for this purpose, though it
> might require some minor modifications to do precisely what you
> describe. It is the ORTE "notifier" framework - you will find it at
> orte/mca/notifier. There are several components, each of which
> supports a different notification mechanism (e.g., message into the
> sys log, smtp, and even "twitter").

Ralph,

Thanks a lot for your detailed answer. I'll have a look at the
notifier framework to see if it could serve our purpose. Actually,
from what you describe, it looks like it does.

Regards,
Nadia
> 
> The system works by adding orte_notifier calls to the OMPI code
> wherever we deem it advisable to alert someone. For example, if we
> think a sys admin might want to be alerted when the number of IB send
> retries exceeds some limit, we add a call to orte_notifier to the IB
> code with:
> 
> if (#retries > threshold) {
>     orte_notifier.xxx(....);
> }
> 
> I believe we could easily extend this to support your proposed
> functionality. A couple of possibilities that immediately spring to
> mind would be:
> 
> 1. you could create a new component (or we could modify the existing
> ones) that tracks how many times it is called for a given error, and
> only actually issues a notification for that specific error when the
> count exceeds a threshold. The negative to this approach is that the
> threshold would be uniform across all errors.
> 
> 2. we could extend the current notifier APIs to add a threshold count
> upon which the notification is to be sent, perhaps creating a new
> macro ORTE_NOTIFIER_VERBOSE that takes the threshold as one of its
> arguments. We could then let each OMPI framework have a new
> "threshold" MCA param, thus allowing the sys admins to "tune" the
> frequency of error reporting by framework. Of course, we could let
> them get as detailed here as they want - they could even have
> "threshold" params for each component, function, or whatever. This
> would be combined with #1 above to alert only when the count exceeded
> the threshold for that specific error message.
> 
> I'm sure you and others will come up with additional (probably better)
> ways of implementing this extension. My point here was simply to
> ensure you knew that the basic mechanism already exists, and to
> stimulate some thought as to how to use it for your proposed purpose.
> 
> I would be happy to help you do so as this is something we (LANL) have
> put at a high priority - our sys admins on the large clusters really
> need the help.
> 
> HTH
> Ralph
> 
> 
> On Mon, May 25, 2009 at 11:33 PM, Nadia Derbey <nadia.der...@bull.net>
> wrote:
>         What: Warn the administrator when unusual events are occurring
>         too
>         frequently.
>         
>         Why: Such unusual events might be the symptom of some problem
>         that can
>         easily be fixed (by a better tuning, for example)
>         
>         Where: Adds a new ompi framework
>         
>         -------------------------------------------------------------------
>         
>         Description:
>         
>         The objective of the Open MPI library is to make applications
>         run to
>         completion, given that no fatal error is encountered.
>         In some situations, unusual events may occur. Since these
>         events are not considered fatal, the library works around
>         them in software instead of stopping the application.
>         While this choice helps the application run to completion,
>         it can result in significant performance degradation. That
>         is not an issue if such “unusual events” occur rarely; but
>         when they occur frequently, they may point to a real
>         problem that could often be easily avoided.
>         
>         For example, when mca_pml_ob1_send_request_start() starts
>         a send request and hits a resource shortage, it silently
>         calls add_request_to_send_pending() to queue that send
>         request on the list of pending send requests and process
>         it later. If no adaptive mechanism is available at runtime
>         to increase the receive queue length, a message can at
>         least be sent to the administrator so that the tuning can
>         be done by hand before the next run.
>         
>         We had a look at other tracing utilities (like PMPI, PERUSE,
>         VT), but
>         found them either too high level or too intrusive at the
>         application
>         level.
>         
>         The “diagnostic framework” we'd like to propose would help
>         capture such “unusual events” and trace them, while having
>         a very low impact on performance. This is achieved by
>         defining tracing routines that can be called from the OMPI
>         code. The collected events are aggregated per MPI process
>         and only traced once a threshold has been reached. Another
>         threshold (a time threshold) conditions subsequent trace
>         generation for an already traced event.
>         
>         This is obtained by defining 2 MCA parameters and a rule:
>         . the count threshold C
>         . the time delay T
>         The rule is: an event is only traced once it has happened
>         C times, and it won't be traced more than once every T
>         seconds.
>         
>         Thus, events happening at a very low rate will never generate
>         a trace
>         except one at MPI_Finalize summarizing:
>         [time] At finalize : 23 times : pre-allocated buffers all
>         full, calling
>         malloc
>         
>         Those happening "a little too much" will sometimes generate a
>         trace
>         saying something like:
>         [time] 1000 warnings : could not send in openib now, delaying
>         [time+12345 sec] 1000 warnings : could not send in openib now,
>         delaying
>         
>         And events occurring at a high frequency will only generate a
>         message
>         every T seconds saying:
>         [time]     1000 warnings : adding buffers in the SRQ
>         [time+T]   1,234,567 warnings (in T seconds) : adding buffers
>         in the SRQ
>         [time+2*T] 2,345,678 warnings (in T seconds) : adding buffers
>         in the SRQ
>         
>         The count threshold and time delay are defined per event.
>         They can also be set as MCA parameters; in that case, the
>         MCA parameter value overrides the per-event values.
>         
>         The following information is traced as well:
>          . the job family
>          . the local job id
>          . the job vpid
>         
>         Another performance saving is that a mechanism à la
>         show_help() can be used to let the HNP do the actual
>         work.
>         
>         We have started implementing this feature, so patches are
>         available if needed. We are currently trying to set up
>         hgweb on an external server.
>         
>         Since I'm an Open MPI newbie, I'm submitting this RFC to
>         get your opinion on its usefulness, and to find out
>         whether an existing mechanism already does this job.
>         
>         Regards,
>         Nadia
>         
>         --
>         Nadia Derbey <nadia.der...@bull.net>
>         
>         _______________________________________________
>         devel mailing list
>         de...@open-mpi.org
>         http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
-- 
Nadia Derbey <nadia.der...@bull.net>