[OMPI devel] RFC: new OMPI RTE define:

2014-02-17 Thread Jeff Squyres (jsquyres)
WHAT: New OMPI_RTE_EVENT_BASE define

WHY: The usnic BTL needs to run some events asynchronously; the ORTE event base 
already exists and is running asynchronously in MPI processes

WHERE: in ompi/mca/rte/rte.h and rte_orte.h

TIMEOUT: COB Friday, 21 Feb 2014

MORE DETAIL:

The WHY line described it pretty well: we want to run some things 
asynchronously in the usnic BTL and we don't really want to re-invent the wheel 
(or add yet another thread in each MPI process).  The ORTE event base is 
already there, there's already a thread servicing it, and Ralph tells me that 
it is safe to add our own events on to it.

The patch below adds the new OMPI_RTE_EVENT_BASE #define.


diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
index 3c88c6d..3ceadb8 100644
--- a/ompi/mca/rte/orte/rte_orte.h
+++ b/ompi/mca/rte/orte/rte_orte.h
@@ -142,6 +142,9 @@ typedef struct {
 } ompi_orte_tracker_t;
 OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
 
+/* define the event base that the RTE exports */
+#define OMPI_RTE_EVENT_BASE orte_event_base
+
 END_C_DECLS
 
 #endif /* MCA_OMPI_RTE_ORTE_H */
diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
index 69ad488..de10dff 100644
--- a/ompi/mca/rte/rte.h
+++ b/ompi/mca/rte/rte.h
@@ -150,7 +150,9 @@
  *a. OMPI_DB_HOSTNAME
  *b. OMPI_DB_LOCALITY
  *
- * (g) Communication support
+ * (g) Asynchronous / event support
+ * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
+ *separate thread
  *
  */
 
@@ -162,6 +164,7 @@
 #include "opal/dss/dss_types.h"
 #include "opal/mca/mca.h"
 #include "opal/mca/base/base.h"
+#include "opal/mca/event/event.h"
 
 BEGIN_C_DECLS
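
As a rough illustration of the idea in this RFC (one event base owned by the runtime, exported to other layers under a neutral macro name), here is a toy model in plain C. Every name prefixed demo_ or DEMO_ is hypothetical and stands in for the real libevent-backed types; demo_base_loop_once() plays the role of the RTE thread that already services the base. It is a sketch of the pattern only, not the usnic BTL code and not the Open MPI event API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: none of these names exist in Open MPI. */
#define DEMO_MAX_EVENTS 8

typedef void (*demo_cb_fn)(void *arg);

typedef struct {
    demo_cb_fn cb[DEMO_MAX_EVENTS];   /* pending callbacks */
    void      *arg[DEMO_MAX_EVENTS];  /* argument for each callback */
    size_t     count;                 /* number of pending callbacks */
} demo_event_base_t;

/* The "runtime" owns exactly one base... */
static demo_event_base_t demo_rte_base;

/* ...and exports it under a layer-neutral name, in the same spirit as the
 * patch mapping OMPI_RTE_EVENT_BASE to orte_event_base. */
#define DEMO_RTE_EVENT_BASE (&demo_rte_base)

/* Queue a callback on a base; returns 0 on success, -1 if the base is full. */
static int demo_event_add(demo_event_base_t *base, demo_cb_fn cb, void *arg)
{
    assert(base != NULL && cb != NULL);
    if (base->count >= DEMO_MAX_EVENTS) {
        return -1;
    }
    base->cb[base->count]  = cb;
    base->arg[base->count] = arg;
    base->count++;
    return 0;
}

/* One pass of the loop that the runtime's progress thread would run.
 * Returns the number of callbacks that were executed. */
static int demo_base_loop_once(demo_event_base_t *base)
{
    assert(base != NULL);
    int ran = 0;
    for (size_t i = 0; i < base->count; i++) {
        base->cb[i](base->arg[i]);
        ran++;
    }
    base->count = 0;
    return ran;
}

/* Example component callback: increments the int that arg points to. */
static void demo_bump(void *arg)
{
    ++*(int *)arg;
}
```

A component would queue work with demo_event_add(DEMO_RTE_EVENT_BASE, demo_bump, &counter) and never needs to know which runtime owns the base, which is the point of hiding orte_event_base behind a define.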
 




Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Josh Hursey
It looks fine except that the restart state is not flagged. When a process
is restarted does it resume execution inside the criu_dump() function? If
so, is there a way to tell from its return code (or some other mechanism)
that it is being restarted versus continuing after checkpointing?


On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain  wrote:

> Great - looks fine to me!!
>
>
> On Feb 17, 2014, at 11:39 AM, Adrian Reber  wrote:
>
> > I have prepared a patch I would like to commit which adds the code to
> > actually checkpoint a process. Thanks for the pointers about the string
> > variables; I tried to implement them correctly.
> >
> > CRIU currently has problems with the new OOB usock but I will contact
> > the CRIU developers about this error. Using tcp, checkpointing works.
> >
> > CRIU also has problems with --np > 1, but I am sure this can also be
> > resolved.
> >
> > The patch is at:
> >
> >
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> >
> >   Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-17 Thread Josh Hursey
These values indicate the current state of the checkpointing lifecycle. In
particular, CONTINUE/RESTART are set by the checkpointer in the CRS (all
others are used by the INC mechanism). In the opal_crs.checkpoint() call,
the checkpointer captures the program state, and it is possible to
emerge from this function in one of two scenarios: either we are continuing
execution in the original process (Continue state), or we are resuming
execution from a checkpointed state (Restart state).

So if the checkpoint was successful, and you are not restarting the process
then you want OPAL_CRS_CONTINUE.

If the process is being restarted from a checkpoint file, then we should
emerge from this function setting the state to OPAL_CRS_RESTART.
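
The two ways of emerging from the checkpoint call can be sketched in plain C as below. This is a minimal sketch under stated assumptions: the demo_ names are hypothetical and are not the real OPAL_CRS_* definitions, and the return convention of the low-level dump routine (negative on failure, 0 when the original process continues, positive when execution resumes from a checkpoint image, similar in spirit to fork()) is assumed for illustration rather than taken from any particular checkpointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state names modeled on the ERROR/CONTINUE/RESTART states
 * discussed above; these are not the real OPAL_CRS_* values. */
typedef enum {
    DEMO_CRS_ERROR,
    DEMO_CRS_CONTINUE,
    DEMO_CRS_RESTART
} demo_crs_state_t;

/* Hypothetical low-level "dump" routine. Assumed convention:
 *   < 0  the checkpoint failed
 *   == 0 the checkpoint succeeded and the original process continues
 *   > 0  execution is resuming from a checkpoint image */
typedef int (*demo_dump_fn)(void);

/* Run the checkpointer and translate its result into a lifecycle state. */
static demo_crs_state_t demo_crs_checkpoint(demo_dump_fn dump)
{
    assert(dump != NULL);
    int rc = dump();
    if (rc < 0) {
        return DEMO_CRS_ERROR;
    }
    return (rc == 0) ? DEMO_CRS_CONTINUE : DEMO_CRS_RESTART;
}

/* Stubs that simulate the three possible outcomes. */
static int demo_dump_continues(void) { return 0; }
static int demo_dump_restarted(void) { return 1; }
static int demo_dump_failed(void)    { return -1; }
```

Calling demo_crs_checkpoint(demo_dump_continues) models the successful checkpoint where the process keeps running, while demo_crs_checkpoint(demo_dump_restarted) models waking up from a checkpoint file.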

The OPAL_CRS_CHECKPOINT state is used in the INC mechanism to notify all of
the components to prepare for a checkpoint (we probably should have called it
OPAL_CR_PREPARE_FOR_CKPT). So it is not really used by the CRS mechanisms at
all. You can see it used in the opal_cr_inc_core_prep() function in
opal/runtime/opal_cr.c.

-- Josh



On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber  wrote:

> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
>
> They are probably used to communicate the state of the CRS modules.
> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> CRS module supposed to set this to if the checkpoint was successful:
>
> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
>
> Adrian



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Ralph Castain
Great - looks fine to me!!


On Feb 17, 2014, at 11:39 AM, Adrian Reber  wrote:

> I have prepared a patch I would like to commit which adds the code to
> actually checkpoint a process. Thanks for the pointers about the string
> variables; I tried to implement them correctly.
> 
> CRIU currently has problems with the new OOB usock but I will contact
> the CRIU developers about this error. Using tcp, checkpointing works.
> 
> CRIU also has problems with --np > 1, but I am sure this can also be
> resolved.
> 
> The patch is at:
> 
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> 
>   Adrian



[OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Adrian Reber
I have prepared a patch I would like to commit which adds the code to
actually checkpoint a process. Thanks for the pointers about the string
variables; I tried to implement them correctly.

CRIU currently has problems with the new OOB usock but I will contact
the CRIU developers about this error. Using tcp, checkpointing works.

CRIU also has problems with --np > 1, but I am sure this can also be
resolved.

The patch is at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492

Adrian


Re: [OMPI devel] [PATCH] Fix typo defining macro _WORD_MASK_

2014-02-17 Thread Jeff Squyres (jsquyres)
+1

On Feb 16, 2014, at 4:55 PM, Andreas Schwab  wrote:

> diff --git a/opal/util/crc.c b/opal/util/crc.c
> index 9cfae94..c2112de 100644
> --- a/opal/util/crc.c
> +++ b/opal/util/crc.c
> @@ -41,7 +41,7 @@
> #elif (OPAL_ALIGNMENT_LONG == 4)
> #define _WORD_MASK_ 0x3
> #else
> -#define _WORD_MASK 0x
> +#define _WORD_MASK_ 0x
> #endif
> 
> 
> -- 
> 1.9.0
> 
> -- 
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] How to prefer oob/tcp over oob/usock

2014-02-17 Thread Ralph Castain
Sure: "-mca oob tcp"


On Feb 17, 2014, at 8:10 AM, Adrian Reber  wrote:

> With the newly added oob/usock, checkpointing with CRIU stopped working.
> Is there a way I can prefer oob/tcp on the command line?
> 
>   Adrian



[OMPI devel] How to prefer oob/tcp over oob/usock

2014-02-17 Thread Adrian Reber
With the newly added oob/usock, checkpointing with CRIU stopped working.
Is there a way I can prefer oob/tcp on the command line?

Adrian


[OMPI devel] OPAL_CRS_* meaning

2014-02-17 Thread Adrian Reber
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?

They are probably used to communicate the state of the CRS modules.
OPAL_CRS_ERROR seems to be used in case an error happened. What is the
CRS module supposed to set this to if the checkpoint was successful:

OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?

Adrian


Re: [OMPI devel] How to read OPAL_OUTPUT-ed strings

2014-02-17 Thread Ralph Castain
Looking at your cmd line, it looks like you are trying to get diagnostic output 
from the mapper? If so, that cmd line is totally wrong. First, there are no 
"OPAL_OUTPUT" calls (at least, that I know of) in the orte layer as I 
studiously avoid them. Instead, everything is either upper-case 
OPAL_OUTPUT_VERBOSE or lower-case opal_output_verbose. The upper-case version 
is only compiled into debug builds.

Regardless, you are almost certainly not seeing any output because you aren't 
passing the right param. You need something like this:

oshrun -map-by node -np 2 -mca rmaps_base_verbose 10 ./examples/ring_oshmem

That will output the diagnostics from the mapper framework.


On Feb 17, 2014, at 3:40 AM, Jeff Squyres (jsquyres)  wrote:

> OPAL_OUTPUT is the exact equivalent of opal_output(), except that it is 
> compiled out for non-debug builds.
> 
> So if you did a production build (e.g., a VPATH build), OPAL_OUTPUT() will be 
> compiled out.  Otherwise, we typically use stream 0 for debugging stuff.
> 
> On Feb 17, 2014, at 3:21 AM, Alex Margolin  
> wrote:
> 
>> Hi,
>> 
>> I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the 
>> following command line (with no success):
>> 
>> `pwd`/osh_install/bin/oshrun --map-by node  -np 2 -mca orte_debug true -mca 
>> orte_debug_verbose 100 -mca orte_report_silent_errors true -mca 
>> orte_map_stddiag_to_stderr true ./examples/ring_oshmem
>> 
>> How can I get it to print these strings? Online search was surprisingly 
>> fruitless.
>> 
>> Thanks,
>> Alex
>> 
>> P.S. all the mca params are available if I look at "oshmem_info -a", so I 
>> suppose I can use them, but there are a lot more params so I'm not sure what 
>> I need to add here...
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] How to read OPAL_OUTPUT-ed strings

2014-02-17 Thread Jeff Squyres (jsquyres)
OPAL_OUTPUT is the exact equivalent of opal_output(), except that it is 
compiled out for non-debug builds.

So if you did a production build (e.g., a VPATH build), OPAL_OUTPUT() will be 
compiled out.  Otherwise, we typically use stream 0 for debugging stuff.
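
To make "compiled out" concrete, here is a minimal sketch of a debug-only output macro in plain C. The names DEMO_ENABLE_DEBUG, demo_output and DEMO_OUTPUT are hypothetical and are not Open MPI's definitions; the double-parenthesis calling style is assumed here only because it lets a macro forward a whole argument list to a function without variadic macros.

```c
#include <stdio.h>

/* Hypothetical build switch: 1 for a debug build, 0 for a production build. */
#ifndef DEMO_ENABLE_DEBUG
#define DEMO_ENABLE_DEBUG 0
#endif

/* Counts how many messages were actually emitted. */
static int demo_output_calls = 0;

/* Always-available output function (the lower-case flavor). */
static void demo_output(int stream, const char *msg)
{
    demo_output_calls++;
    fprintf(stderr, "[stream %d] %s\n", stream, msg);
}

/* Upper-case flavor: forwards to demo_output() in debug builds and expands
 * to nothing otherwise, so the call and its arguments vanish entirely. */
#if DEMO_ENABLE_DEBUG
#define DEMO_OUTPUT(a) demo_output a
#else
#define DEMO_OUTPUT(a)
#endif

/* Emits one debug-only message and one unconditional message, then returns
 * the total number of messages emitted so far. */
static int demo_do_work(void)
{
    DEMO_OUTPUT((0, "debug-only message"));
    demo_output(0, "always printed");
    return demo_output_calls;
}
```

Built with -DDEMO_ENABLE_DEBUG=1, both messages appear on stderr; with the default of 0, only the unconditional one does, and no run-time parameter can bring the other back because the call no longer exists in the binary.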

On Feb 17, 2014, at 3:21 AM, Alex Margolin  
wrote:

> Hi,
> 
> I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the following 
> command line (with no success):
> 
> `pwd`/osh_install/bin/oshrun --map-by node  -np 2 -mca orte_debug true -mca 
> orte_debug_verbose 100 -mca orte_report_silent_errors true -mca 
> orte_map_stddiag_to_stderr true ./examples/ring_oshmem
> 
> How can I get it to print these strings? Online search was surprisingly 
> fruitless.
> 
> Thanks,
> Alex
> 
> P.S. all the mca params are available if I look at "oshmem_info -a", so I 
> suppose I can use them, but there are a lot more params so I'm not sure what 
> I need to add here...


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] How to read OPAL_OUTPUT-ed strings

2014-02-17 Thread Alex Margolin
Hi,

I'm having trouble getting the OPAL_OUTPUT to print. I'm trying the
following command line (with no success):

`pwd`/osh_install/bin/oshrun --map-by node  -np 2 -mca orte_debug true -mca
orte_debug_verbose 100 -mca orte_report_silent_errors true -mca
orte_map_stddiag_to_stderr true ./examples/ring_oshmem

How can I get it to print these strings? Online search was surprisingly
fruitless.

Thanks,
Alex

P.S. all the mca params are available if I look at "oshmem_info -a", so I
suppose I can use them, but there are a lot more params so I'm not sure
what I need to add here...