[OMPI devel] RFC: new OMPI RTE define:
WHAT: New OMPI_RTE_EVENT_BASE define

WHY: The usnic BTL needs to run some events asynchronously; the ORTE event base already exists and is running asynchronously in MPI processes

WHERE: in ompi/mca/rte/rte.h and rte_orte.h

TIMEOUT: COB Friday, 21 Feb 2014

MORE DETAIL:

The WHY line described it pretty well: we want to run some things asynchronously in the usnic BTL and we don't really want to re-invent the wheel (or add yet another thread in each MPI process). The ORTE event base is already there, there's already a thread servicing it, and Ralph tells me that it is safe to add our own events on to it.

The patch below adds the new OMPI_RTE_EVENT_BASE #define.

diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
index 3c88c6d..3ceadb8 100644
--- a/ompi/mca/rte/orte/rte_orte.h
+++ b/ompi/mca/rte/orte/rte_orte.h
@@ -142,6 +142,9 @@ typedef struct {
 } ompi_orte_tracker_t;
 OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
 
+/* define the event base that the RTE exports */
+#define OMPI_RTE_EVENT_BASE orte_event_base
+
 END_C_DECLS
 
 #endif /* MCA_OMPI_RTE_ORTE_H */
diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
index 69ad488..de10dff 100644
--- a/ompi/mca/rte/rte.h
+++ b/ompi/mca/rte/rte.h
@@ -150,7 +150,9 @@
  *a. OMPI_DB_HOSTNAME
  *b. OMPI_DB_LOCALITY
  *
- * (g) Communication support
+ * (g) Asynchronous / event support
+ * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
+ *separate thread
  *
  */
 
@@ -162,6 +164,7 @@
 #include "opal/dss/dss_types.h"
 #include "opal/mca/mca.h"
 #include "opal/mca/base/base.h"
+#include "opal/mca/event/event.h"
 
 BEGIN_C_DECLS
Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process
It looks fine except that the restart state is not flagged. When a process is restarted, does it resume execution inside the criu_dump() function? If so, is there a way to tell from its return code (or some other mechanism) that it is being restarted versus continuing after checkpointing?

On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain wrote:

> Great - looks fine to me!!
>
> On Feb 17, 2014, at 11:39 AM, Adrian Reber wrote:
>
> > I have prepared a patch I would like to commit which adds the code to
> > actually checkpoint a process. Thanks for the pointers about the string
> > variables; I tried to implement it correctly.
> >
> > CRIU currently has problems with the new OOB usock but I will contact
> > the CRIU developers about this error. Using tcp, checkpointing works.
> >
> > CRIU also has problems with --np > 1, but I am sure this can also be
> > resolved.
> >
> > The patch is at:
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> >
> > Adrian

--
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
Re: [OMPI devel] OPAL_CRS_* meaning
These values indicate the current state of the checkpointing lifecycle. In particular, CONTINUE/RESTART are set by the checkpointer in the CRS (all others are used by the INC mechanism). In the opal_crs.checkpoint() call the checkpointer will capture the program state, and it is possible to emerge from this function in one of two scenarios: either we are continuing execution in the original process (Continue state), or we are resuming execution from a checkpointed state (Restart state).

So if the checkpoint was successful and you are not restarting the process, then you want OPAL_CRS_CONTINUE. If the process is being restarted from a checkpoint file, then we should emerge from this function setting the state to OPAL_CRS_RESTART.

The OPAL_CRS_CHECKPOINT state is used in the INC mechanism to notify all of the components to prepare for a checkpoint (we probably should have called it OPAL_CRS_PREPARE_FOR_CKPT), so it is not really used by the CRS mechanisms at all. You can see it used in the opal_cr_inc_core_prep() function in opal/runtime/opal_cr.c.

-- Josh

On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber wrote:

> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
>
> They are probably used to communicate the state of the CRS modules.
> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> CRS module supposed to set this to if the checkpoint was successful?
>
> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
>
> Adrian

--
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process
Great - looks fine to me!!

On Feb 17, 2014, at 11:39 AM, Adrian Reber wrote:

> I have prepared a patch I would like to commit which adds the code to
> actually checkpoint a process. Thanks for the pointers about the string
> variables; I tried to implement it correctly.
>
> CRIU currently has problems with the new OOB usock but I will contact
> the CRIU developers about this error. Using tcp, checkpointing works.
>
> CRIU also has problems with --np > 1, but I am sure this can also be
> resolved.
>
> The patch is at:
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
>
> Adrian
[OMPI devel] CRS/CRIU: add code to actually checkpoint a process
I have prepared a patch I would like to commit which adds the code to actually checkpoint a process. Thanks for the pointers about the string variables; I tried to implement it correctly.

CRIU currently has problems with the new OOB usock but I will contact the CRIU developers about this error. Using tcp, checkpointing works.

CRIU also has problems with --np > 1, but I am sure this can also be resolved.

The patch is at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492

Adrian
Re: [OMPI devel] [PATCH] Fix typo defining macro _WORD_MASK_
+1

On Feb 16, 2014, at 4:55 PM, Andreas Schwab wrote:

> diff --git a/opal/util/crc.c b/opal/util/crc.c
> index 9cfae94..c2112de 100644
> --- a/opal/util/crc.c
> +++ b/opal/util/crc.c
> @@ -41,7 +41,7 @@
>  #elif (OPAL_ALIGNMENT_LONG == 4)
>  #define _WORD_MASK_ 0x3
>  #else
> -#define _WORD_MASK 0x
> +#define _WORD_MASK_ 0x
>  #endif
>
>
> --
> 1.9.0
>
> --
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
> "And now for something completely different."

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] How to prefer oob/tcp over oob/usock
Sure: "-mca oob tcp"

On Feb 17, 2014, at 8:10 AM, Adrian Reber wrote:

> With the newly added oob/usock, checkpointing with CRIU stopped working.
> Is there a way I can prefer oob/tcp on the command line?
>
> Adrian
[OMPI devel] How to prefer oob/tcp over oob/usock
With the newly added oob/usock, checkpointing with CRIU stopped working. Is there a way I can prefer oob/tcp on the command line?

Adrian
[OMPI devel] OPAL_CRS_* meaning
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?

They are probably used to communicate the state of the CRS modules. OPAL_CRS_ERROR seems to be used in case an error happened. What is the CRS module supposed to set this to if the checkpoint was successful?

OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?

Adrian
Re: [OMPI devel] How to read OPAL_OUTPUT-ed strings
Looking at your cmd line, it looks like you are trying to get diagnostic output from the mapper? If so, that cmd line is totally wrong.

First, there are no "OPAL_OUTPUT" calls (at least, that I know of) in the orte layer as I studiously avoid them. Instead, everything is either the upper-case OPAL_OUTPUT_VERBOSE or the lower-case opal_output_verbose. The upper-case version exists only in debug builds.

Regardless, you are almost certainly not seeing any output because you aren't passing the right param. You need something like this:

oshrun -map-by node -np 2 -mca rmaps_base_verbose 10 ./examples/ring_oshmem

That will output the diagnostics from the mapper framework.

On Feb 17, 2014, at 3:40 AM, Jeff Squyres (jsquyres) wrote:

> OPAL_OUTPUT is the exact equivalent of opal_output(), except that it is
> compiled out for non-debug builds.
>
> So if you did a production build (e.g., a vpath build), OPAL_OUTPUT() will be
> compiled out. Otherwise, we typically use stream 0 for debugging stuff.
>
> On Feb 17, 2014, at 3:21 AM, Alex Margolin wrote:
>
>> Hi,
>>
>> I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the
>> following command line (with no success):
>>
>> `pwd`/osh_install/bin/oshrun --map-by node -np 2 -mca orte_debug true -mca
>> orte_debug_verbose 100 -mca orte_report_silent_errors true -mca
>> orte_map_stddiag_to_stderr true ./examples/ring_oshmem
>>
>> How can I get it to print these strings? Online search was surprisingly
>> fruitless.
>>
>> Thanks,
>> Alex
>>
>> P.S. all the mca params are available if I look at "oshmem_info -a", so I
>> suppose I can use them, but there are a lot more params so I'm not sure what
>> I need to add here...
Re: [OMPI devel] How to read OPAL_OUTPUT-ed strings
OPAL_OUTPUT is the exact equivalent of opal_output(), except that it is compiled out for non-debug builds.

So if you did a production build (e.g., a vpath build), OPAL_OUTPUT() will be compiled out. Otherwise, we typically use stream 0 for debugging stuff.

On Feb 17, 2014, at 3:21 AM, Alex Margolin wrote:

> Hi,
>
> I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the following
> command line (with no success):
>
> `pwd`/osh_install/bin/oshrun --map-by node -np 2 -mca orte_debug true -mca
> orte_debug_verbose 100 -mca orte_report_silent_errors true -mca
> orte_map_stddiag_to_stderr true ./examples/ring_oshmem
>
> How can I get it to print these strings? Online search was surprisingly
> fruitless.
>
> Thanks,
> Alex
>
> P.S. all the mca params are available if I look at "oshmem_info -a", so I
> suppose I can use them, but there are a lot more params so I'm not sure what
> I need to add here...

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] How to read OPAL_OUTPUT-ed strings
Hi,

I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the following command line (with no success):

`pwd`/osh_install/bin/oshrun --map-by node -np 2 -mca orte_debug true -mca orte_debug_verbose 100 -mca orte_report_silent_errors true -mca orte_map_stddiag_to_stderr true ./examples/ring_oshmem

How can I get it to print these strings? Online search was surprisingly fruitless.

Thanks,
Alex

P.S. all the mca params are available if I look at "oshmem_info -a", so I suppose I can use them, but there are a lot more params so I'm not sure what I need to add here...