Hello,

I would like to introduce the OMPI timing framework that was merged into the
trunk yesterday (r32738). The code is new, so if you hit any bugs, just
let me know.

The framework consists of a set of macros and routines for internal OMPI
use, plus the standalone tool mpisync and a few additional scripts: mpirun_prof
and ompi_timing_post. The feature set is very basic, and I am open to
discussing new features that would be desirable.

To enable compilation of the framework, configure OMPI with the
--enable-timing option. If the option was passed to ./configure, the standalone
tools and scripts will be installed into <prefix>/bin.

The timing code is located in OPAL (opal/util/timing.[ch]). A set of macros
is provided so that all mentions of the timing code are preprocessed out when
--enable-timing was not requested:
OPAL_TIMING_DECLARE(t) - declares a timing handler structure named "t".
OPAL_TIMING_DECLARE_EXT(x, t) - external declaration of a timing handler
"t".
OPAL_TIMING_INIT(t) - initializes the timing handler "t".
OPAL_TIMING_EVENT(x) - printf-like event declaration, similar to OPAL_OUTPUT.
The information about the event is quickly appended to a linked list. The
maximum length of an event description is limited by OPAL_TIMING_DESCR_MAX.
Memory is allocated in buckets (OPAL_TIMING_BUFSIZE elements at a time), and
the overhead (the time to malloc and prepare a bucket) is recorded in the
corresponding list element; it can be excluded from the timing results
(controlled by the OMPI_MCA_opal_timing_overhead parameter).
OPAL_TIMING_REPORT(enable, t, prefix) - prepares and prints out the timing
information. If OMPI_MCA_opal_timing_file was specified, the output goes
to that file; otherwise it is directed through opal_output, with each line
prefixed by "prefix" to ease grep'ing. "enable" is a boolean/integer
variable used for runtime selection of what should be reported.
OPAL_TIMING_RELEASE(t) - the counterpart of OPAL_TIMING_INIT.

There are several examples in the OMPI code. Here is another simple one:
    OPAL_TIMING_DECLARE(tm);
    OPAL_TIMING_INIT(&tm);
    ...
    OPAL_TIMING_EVENT((&tm, "Begin of timing: %s", ORTE_NAME_PRINT(&(peer->name))));
    ...
    OPAL_TIMING_EVENT((&tm, "Next timing event with condition x = %d", x));
    ...
    OPAL_TIMING_EVENT((&tm, "Finish"));
    OPAL_TIMING_REPORT(enable_var, &tm, "MPI Init");
    OPAL_TIMING_RELEASE(&tm);


Output from all OMPI processes (mpirun, orted's, user processes) is
merged together. NTP provides precision at the 1 millisecond to 100 microsecond
level, which may not be sufficient to order events globally.
To help developers extract the most realistic picture of what is going on,
additional time synchronisation can be performed before profiling. The
mpisync program should be run with one user process per node to produce a
file with each node's time offset relative to the HNP. Over Gigabit
Ethernet the precision is 30-50 microseconds; over InfiniBand it is about
4 microseconds. The output file produced by mpisync can be read and used by
the timing framework (the OMPI_MCA_opal_clksync_file parameter).
The bad news is that this one-shot synchronisation is not enough, because
the clock skew differs from node to node, so additional periodic
synchronisation is needed. This is planned for the near future (Ralph and I
are discussing possible ways now).

The mpirun_prof and ompi_timing_post scripts can be used to automate clock
synchronisation in the following manner:
export OMPI_MCA_ompi_timing=true
export OMPI_MCA_orte_oob_timing=true
export OMPI_MCA_orte_rml_timing=true
export OMPI_MCA_opal_timing_file=timing.out
mpirun_prof <ompi-params> ./mpiprog
ompi_timing_post timing.out

ompi_timing_post simply sorts the events and makes all timestamps relative
to the first one.

-- 
Best regards, Artem Y. Polyakov
