Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Hmm... but in fact the MPI_Comm_spawn of the parents and the MPI_Init of the children 
never returned!

I configured MPI with 

./configure --prefix=/dir/ --enable-debug --with-tm=/usr/local/


On Feb 22, 2014, at 12:53 AM, Ralph Castain wrote:

> Strange - it all looks just fine. How was OMPI configured?
> 
> On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran  
> wrote:
> 
>> Ok, I figured out that it was not a problem with the node grsacc04, because I 
>> have now run the same test on a totally different set of nodes. 
>> 
>> I must say that with the --bind-to none option, the program completed "many" 
>> more times than before, but it still "sometimes" hangs! Attaching now the 
>> output of the same case run on a different set of nodes with the 
>> --bind-to none option.
>> 
>> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
>> grpcomm_base_verbose 5 --bind-to none -np 3 ./example
>> 
>> Best,
>> Suraj
>> 
>> 
>> 
>> 
>> On Feb 21, 2014, at 5:03 PM, Ralph Castain wrote:
>> 
>>> Well, that all looks fine. However, I note that the procs on grsacc04 all 
>>> stopped making progress at the same point, which is why the job hung. All 
>>> the procs on the other nodes were just fine.
>>> 
>>> So let's try a couple of things:
>>> 
>>> 1. add "--bind-to none" to your cmd line so we avoid any contention issues
>>> 
>>> 2. if possible, remove grsacc04 from the allocation (you can just use the 
>>> -host option on the cmd line to ignore it), and/or replace that host with 
>>> another one. Let's see if the problem has something to do with that 
>>> specific node.
>>> 
>>> 
>>> On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran 
>>>  wrote:
>>> 
 Right, so I have the output here. Same case, 
 
 mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
 grpcomm_base_verbose 5  -np 3 ./simple_spawn
 
 Output attached. 
 
 Best,
 Suraj
 
 
 
 On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
 
> 
> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran 
>  wrote:
> 
>> Thanks Ralph!
>> 
>> I should have mentioned, though: without the Torque environment, spawning 
>> with ssh works fine. But under the Torque environment, it does not. 
> 
> Ah, no - you forgot to mention that point.
> 
>> 
>> I started the simple_spawn with 3 processes and spawned 9 processes (3 
>> per node on 3 nodes). 
>> 
>> There is no problem with the Torque environment itself, because all 9 
>> processes are started on the respective nodes. But the MPI_Comm_spawn of 
>> the parent and the MPI_Init of the children "sometimes" don't return!
> 
> Seems odd - the launch environment has nothing to do with MPI_Init, so if 
> the processes are indeed being started, they should run. One possibility 
> is that they aren't correctly getting some wireup info.
> 
> Can you configure OMPI --enable-debug and then rerun the example with 
> "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
> grpcomm_base_verbose 5" on the command line?
> 
> 
>> 
>> This is the output of simple_spawn - which confirms the above statement. 
>> 
>> [pid 31208] starting up!
>> [pid 31209] starting up!
>> [pid 31210] starting up!
>> 0 completed MPI_Init
>> Parent [pid 31208] about to spawn!
>> 1 completed MPI_Init
>> Parent [pid 31209] about to spawn!
>> 2 completed MPI_Init
>> Parent [pid 31210] about to spawn!
>> [pid 28630] starting up!
>> [pid 28631] starting up!
>> [pid 9846] starting up!
>> [pid 9847] starting up!
>> [pid 9848] starting up!
>> [pid 6363] starting up!
>> [pid 6361] starting up!
>> [pid 6362] starting up!
>> [pid 28632] starting up!
>> 
>> Any hints?
>> 
>> Best,
>> Suraj
>> 
>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>> 
>>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>>> "doesn't work"? Is there some specific behavior you see?
>>> 
>>> You might try the attached program. It's a simple spawn test we use - 
>>> 1.7.4 seems happy with it.
>>> 
>>> 
>>> 
>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>>  wrote:
>>> 
 I am using 1.7.4! 
 
 On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
 
> What OMPI version are you using?
> 
> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>  wrote:
> 
>> Hello!
>> 
>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't 
>> work when spawning more than 12 processes across multiple nodes. To be 
>> more precise, "sometimes" it works, and "sometimes" it doesn't!
>> 
>> Here is my case: I obtain 5 nodes with 3 cores per node, and my 
>> $PBS_NODEFILE looks like this:
>> 
>> node1
>> node1
>> node1
>> node2
>>

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Ralph Castain
Strange - it all looks just fine. How was OMPI configured?

On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran  
wrote:

> Ok, I figured out that it was not a problem with the node grsacc04, because I 
> have now run the same test on a totally different set of nodes. 
> 
> I must say that with the --bind-to none option, the program completed "many" 
> more times than before, but it still "sometimes" hangs! Attaching now the 
> output of the same case run on a different set of nodes with the 
> --bind-to none option.
> 
> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
> grpcomm_base_verbose 5 --bind-to none -np 3 ./example
> 
> Best,
> Suraj
> 
> 
> 
> 
> On Feb 21, 2014, at 5:03 PM, Ralph Castain wrote:
> 
>> Well, that all looks fine. However, I note that the procs on grsacc04 all 
>> stopped making progress at the same point, which is why the job hung. All 
>> the procs on the other nodes were just fine.
>> 
>> So let's try a couple of things:
>> 
>> 1. add "--bind-to none" to your cmd line so we avoid any contention issues
>> 
>> 2. if possible, remove grsacc04 from the allocation (you can just use the 
>> -host option on the cmd line to ignore it), and/or replace that host with 
>> another one. Let's see if the problem has something to do with that specific 
>> node.
>> 
>> 
>> On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran  
>> wrote:
>> 
>>> Right, so I have the output here. Same case, 
>>> 
>>> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
>>> grpcomm_base_verbose 5  -np 3 ./simple_spawn
>>> 
>>> Output attached. 
>>> 
>>> Best,
>>> Suraj
>>> 
>>> 
>>> 
>>> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
>>> 
 
 On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran 
  wrote:
 
> Thanks Ralph!
> 
> I should have mentioned, though: without the Torque environment, spawning 
> with ssh works fine. But under the Torque environment, it does not. 
 
 Ah, no - you forgot to mention that point.
 
> 
> I started the simple_spawn with 3 processes and spawned 9 processes (3 
> per node on 3 nodes). 
> 
> There is no problem with the Torque environment itself, because all 9 
> processes are started on the respective nodes. But the MPI_Comm_spawn of 
> the parent and the MPI_Init of the children "sometimes" don't return!
 
 Seems odd - the launch environment has nothing to do with MPI_Init, so if 
 the processes are indeed being started, they should run. One possibility 
 is that they aren't correctly getting some wireup info.
 
 Can you configure OMPI --enable-debug and then rerun the example with 
 "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 
 5" on the command line?
 
 
> 
> This is the output of simple_spawn - which confirms the above statement. 
> 
> [pid 31208] starting up!
> [pid 31209] starting up!
> [pid 31210] starting up!
> 0 completed MPI_Init
> Parent [pid 31208] about to spawn!
> 1 completed MPI_Init
> Parent [pid 31209] about to spawn!
> 2 completed MPI_Init
> Parent [pid 31210] about to spawn!
> [pid 28630] starting up!
> [pid 28631] starting up!
> [pid 9846] starting up!
> [pid 9847] starting up!
> [pid 9848] starting up!
> [pid 6363] starting up!
> [pid 6361] starting up!
> [pid 6362] starting up!
> [pid 28632] starting up!
> 
> Any hints?
> 
> Best,
> Suraj
> 
> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
> 
>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>> "doesn't work"? Is there some specific behavior you see?
>> 
>> You might try the attached program. It's a simple spawn test we use - 
>> 1.7.4 seems happy with it.
>> 
>> 
>> 
>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>  wrote:
>> 
>>> I am using 1.7.4! 
>>> 
>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>> 
 What OMPI version are you using?
 
 On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
  wrote:
 
> Hello!
> 
> I am having a problem using MPI_Comm_spawn under Torque. It doesn't 
> work when spawning more than 12 processes across multiple nodes. To be 
> more precise, "sometimes" it works, and "sometimes" it doesn't!
> 
> Here is my case: I obtain 5 nodes with 3 cores per node, and my 
> $PBS_NODEFILE looks like this:
> 
> node1
> node1
> node1
> node2
> node2
> node2
> node3
> node3
> node3
> node4
> node4
> node4
> node5
> node5
> node5
> 
> I started a hello program (which just spawns itself; the children, of 
> course, don't spawn again), with 
> 
> mpiexec -np 3 ./hello
> 
> Spawning 3 more pro

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Ok, I figured out that it was not a problem with the node grsacc04, because I 
have now run the same test on a totally different set of nodes. 

I must say that with the --bind-to none option, the program completed "many" 
more times than before, but it still "sometimes" hangs! Attaching now the 
output of the same case run on a different set of nodes with the --bind-to 
none option.

mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
grpcomm_base_verbose 5 --bind-to none -np 3 ./example

Best,
Suraj

[grsacc06:19901] mca:base:select:(  ess) Querying component [env]
[grsacc06:19901] mca:base:select:(  ess) Skipping component [env]. Query failed to return a module
[grsacc06:19901] mca:base:select:(  ess) Querying component [hnp]
[grsacc06:19901] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[grsacc06:19901] mca:base:select:(  ess) Querying component [singleton]
[grsacc06:19901] mca:base:select:(  ess) Skipping component [singleton]. Query failed to return a module
[grsacc06:19901] mca:base:select:(  ess) Querying component [slurm]
[grsacc06:19901] mca:base:select:(  ess) Skipping component [slurm]. Query failed to return a module
[grsacc06:19901] mca:base:select:(  ess) Querying component [tm]
[grsacc06:19901] mca:base:select:(  ess) Skipping component [tm]. Query failed to return a module
[grsacc06:19901] mca:base:select:(  ess) Querying component [tool]
[grsacc06:19901] mca:base:select:(  ess) Skipping component [tool]. Query failed to return a module
[grsacc06:19901] mca:base:select:(  ess) Selected component [hnp]
[grsacc06:19901] [[INVALID],INVALID] Topology Info:
[grsacc06:19901] Type: Machine Number of child objects: 3
	Name=NULL
	total=25156656KB
	DMIProductName=X8DTT-H
	DMIProductVersion=1234567890
	DMIBoardVendor=Supermicro
	DMIBoardName=X8DTT-H
	DMIBoardVersion=1234567890
	DMIBoardAssetTag=1234567890
	DMIChassisVendor=Supermicro
	DMIChassisType=17
	DMIChassisVersion=1234567890
	DMIChassisAssetTag="To Be Filled By O.E.M."
	DMIBIOSVendor="American Megatrends Inc."
	DMIBIOSVersion="080015 "
	DMIBIOSDate=12/10/2009
	DMISysVendor=Supermicro
	Backend=Linux
	OSName=Linux
	OSRelease=2.6.35-30-generic
	OSVersion="#56-Ubuntu SMP Mon Jul 11 20:01:08 UTC 2011"
	Architecture=x86_64
	Cpuset:  0x00ff
	Online:  0x00ff
	Allowed: 0x00ff
	Bind CPU proc:   TRUE
	Bind CPU thread: TRUE
	Bind MEM proc:   FALSE
	Bind MEM thread: TRUE
	Type: NUMANode Number of child objects: 1
		Name=NULL
		local=12573744KB
		total=12573744KB
		Cpuset:  0x000f
		Online:  0x000f
		Allowed: 0x000f
		Type: Socket Number of child objects: 1
			Name=NULL
			CPUModel="Intel(R) Xeon(R) CPU   X5570  @ 2.93GHz"
			Cpuset:  0x000f
			Online:  0x000f
			Allowed: 0x000f
			Type: L3Cache Number of child objects: 4
Name=NULL
size=8192KB
linesize=64
ways=16
Cpuset:  0x000f
Online:  0x000f
Allowed: 0x000f
Type: L2Cache Number of child objects: 1
	Name=NULL
	size=256KB
	linesize=64
	ways=8
	Cpuset:  0x0001
	Online:  0x0001
	Allowed: 0x0001
	Type: L1dCache Number of child objects: 1
		Name=NULL
		size=32KB
		linesize=64
		ways=8
		Cpuset:  0x0001
		Online:  0x0001
		Allowed: 0x0001
		Type: Core Number of child objects: 1
			Name=NULL
			Cpuset:  0x0001
			Online:  0x0001
			Allowed: 0x0001
			Type: PU Number of child objects: 0
Name=NULL
Cpuset:  0x0001
Online:  0x0001
Allowed: 0x0001
Type: L2Cache Number of child objects: 1
	Name=NULL
	size=256KB
	linesize=64
	ways=8
	Cpuset:  0x0002
	Online:  0x0002
	Allowed: 0x0002
	Type: L1dCache Number of child objects: 1
		Name=NULL
		size=32KB
		linesize=64
		ways=8
		Cpuset:  0x0002
		Online:  0x0002
		Allowed: 0x0002
		Type: Core Number of child objects: 1
			Name=NULL
			Cpuset:  0x0002
			Online:  0x0002
			Allowed: 0x0002
			Type: PU Number of child objects: 0
Name=NULL
Cpuset:  0x0002
Online:  0x0002
Allowed: 0x0002
Type: L2Cache Number of child objects: 1
	Name=NULL
	size=256KB
	linesize=64
	ways=8
	Cpuset:  0x0004
	Online:  0x0004
	Allowed: 0x0004
	Type: L1dCache Number of child objects: 1
		Name=NULL
		size=32KB
		linesize=64
		ways=8

Re: [OMPI devel] 1.7.5 status

2014-02-21 Thread Paul Hargrove
On Fri, Feb 21, 2014 at 1:18 PM, Ralph Castain  wrote:

> Still on the table:
>
[...]

> * SGI xpmem support
>


To the best of my knowledge I am the only one with platform access to test
this.
Nathan hasn't sent me anything new recently.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] 1.7.5 status

2014-02-21 Thread Ralph Castain
Hi folks

Just an end-of-week status update on the 1.7.5 branch. With most CMRs applied, 
it doesn't look too bad. We still have failures in the following MPI functions:

* intercomm_create - was supposed to be fixed by the coll/ml CMR, but 
apparently was not

* datatype/idx_null

* collective/ireduce_loc 

* collective/ibcast_struct

* topology/distgraph1

The shmem code is showing roughly a 20% failure rate on its test suite. Some of 
those are due to running with too many processors (test errors out with that 
message), but the majority of them are OSHMEM calling abort for some reason. 
This is with TCP under CentOS using gcc.

My test results are here:  http://mtt.open-mpi.org/index.php?do_redir=2154

I'd like to see us resolve the MPI problems, or at least certify that they are 
not a regression from 1.7.4. I'm comfortable releasing the shmem code in an "as 
good as we can get" mode (since there is no apparent damage to the MPI side), 
with an accompanying "known defects" file and a plan for fixing the problems.

Still on the table:

* usnic UDP upgrade

* ob1 optimization

* SGI xpmem support

* direct modex option

* atomics selection

HTH
Ralph



Re: [OMPI devel] startup sstore orte/mca/ess/base/ess_base_std_tool.c

2014-02-21 Thread Josh Hursey
+1


On Fri, Feb 21, 2014 at 10:04 AM, Ralph Castain  wrote:

> looks fine to me
>
>
> On Feb 21, 2014, at 6:23 AM, Adrian Reber  wrote:
>
> > To restart a process using orte-restart I need sstore initialized when
> > running as a tool. This is currently missing. The new code is guarded by
> >
> > #if OPAL_ENABLE_FT_CR == 1
> >
> > and should only affect --with-ft builds. The following is the change I
> > want to make:
> >
> > diff --git a/orte/mca/ess/base/ess_base_std_tool.c
> b/orte/mca/ess/base/ess_base_std_tool.c
> > index 93aed89..b102e6d 100644
> > --- a/orte/mca/ess/base/ess_base_std_tool.c
> > +++ b/orte/mca/ess/base/ess_base_std_tool.c
> > @@ -43,6 +43,7 @@
> > #include "orte/mca/state/base/base.h"
> > #if OPAL_ENABLE_FT_CR == 1
> > #include "orte/mca/snapc/base/base.h"
> > +#include "orte/mca/sstore/base/base.h"
> > #endif
> > #include "orte/util/proc_info.h"
> > #include "orte/util/session_dir.h"
> > @@ -175,11 +176,22 @@ int orte_ess_base_tool_setup(void)
> > error = "orte_snapc_base_open";
> > goto error;
> > }
> > +if (ORTE_SUCCESS != (ret =
> mca_base_framework_open(&orte_sstore_base_framework, 0))) {
> > +ORTE_ERROR_LOG(ret);
> > +error = "orte_sstore_base_open";
> > +goto error;
> > +}
> > +
> > if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP,
> ORTE_PROC_IS_APP))) {
> > ORTE_ERROR_LOG(ret);
> > error = "orte_snapc_base_select";
> > goto error;
> > }
> > +if (ORTE_SUCCESS != (ret = orte_sstore_base_select())) {
> > +ORTE_ERROR_LOG(ret);
> > +error = "orte_sstore_base_select";
> > +goto error;
> > +}
> >
> > /* Tools do not need all the OPAL CR stuff */
> > opal_cr_set_enabled(false);
> > @@ -201,6 +213,7 @@ int orte_ess_base_tool_finalize(void)
> >
> > #if OPAL_ENABLE_FT_CR == 1
> > mca_base_framework_close(&orte_snapc_base_framework);
> > +mca_base_framework_close(&orte_sstore_base_framework);
> > #endif
> >
> > /* if I am a tool, then all I will have done is
> >
> >
> >   Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-02-21 Thread Nathan Hjelm
On Fri, Feb 21, 2014 at 05:21:10PM +0100, Adrian Reber wrote:
> There is a variable in the FT code which is not defined and therefore
> currently #ifdef'd out.
> 
> #if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
> #ifdef ENABLE_FT_FIXED
> /* FIXME_FT
>  *
>  * the variable mca_base_component_distill_checkpoint_ready
>  * was removed by commit 8181c8273c486bba59b3dead324939eac1a58b8c (r28237)
>  * "Introduce the MCA framework system. This formalizes the interface 
> frameworks must provide."
>  *
>  * */
> if (mca_base_component_distill_checkpoint_ready) {
> open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
> }
> #endif /* ENABLE_FT_FIXED */
> #endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */
> 
> 
> The variable 'mca_base_component_distill_checkpoint_ready' used to exist but 
> was removed
> with commit 'r28237':
> 
> -#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
> -{
> -int param_id = -1;
> -int param_val = 0;
> -/*
> - * Extract supported mca parameters for selection contraints
> - * Supported Options:
> - *   - mca_base_component_distill_checkpoint_ready = Checkpoint Ready
> - */
> -param_id = mca_base_param_reg_int_name("mca", 
> "base_component_distill_checkpoint_ready",
> -   "Distill only those 
> components that are Checkpoint Ready", 
> -   false, false,
-   0, &param_val);
> -if( 0 != param_val ) { /* Select Checkpoint Ready */
> -open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
> -}
> -}
> -#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */
> 
> The variable is defined in contrib/amca-param-sets/ft-enable-cr
> 
> mca_base_component_distill_checkpoint_ready=1
> 
> Looking at the names of the other variables, I would say it should be called
> 
> opal_base_distill_checkpoint_ready
> 
> and probably created with mca_base_var_register() or 
> mca_base_component_var_register().
> 
> What would be the best place to create the variable so that it can be used 
> again in
> the FT code?

Some variables are registered in opal/runtime/opal_params.c. That might
be a good place to add it.

-Nathan
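
For illustration only (this is not code from the thread): a registration along the
lines suggested above might look roughly like the sketch below. It uses the
mca_base_var_register() interface; the variable name follows Adrian's suggestion,
while the helper function name, the project/framework/component arguments, the flag
and info-level choices are assumptions, not a committed patch.

#include "opal_config.h"
#include "opal/mca/base/mca_base_var.h"

/* Hypothetical sketch: register the replacement variable (e.g. in
 * opal/runtime/opal_params.c) so the FT code can test it again.
 * Argument choices here are guesses for illustration only. */
static bool opal_base_distill_checkpoint_ready = false;

int opal_register_distill_checkpoint_ready(void)
{
    /* returns the variable index, or a negative error code */
    return mca_base_var_register("opal", "opal", "base", "distill_checkpoint_ready",
                                 "Distill only those components that are Checkpoint Ready",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
                                 &opal_base_distill_checkpoint_ready);
}

The FT code could then test opal_base_distill_checkpoint_ready directly in place of
the removed mca_base_component_distill_checkpoint_ready.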




[OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-02-21 Thread Adrian Reber
There is a variable in the FT code which is not defined and therefore
currently #ifdef'd out.

#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
#ifdef ENABLE_FT_FIXED
/* FIXME_FT
 *
 * the variable mca_base_component_distill_checkpoint_ready
 * was removed by commit 8181c8273c486bba59b3dead324939eac1a58b8c (r28237)
 * "Introduce the MCA framework system. This formalizes the interface 
frameworks must provide."
 *
 * */
if (mca_base_component_distill_checkpoint_ready) {
open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
}
#endif /* ENABLE_FT_FIXED */
#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */


The variable 'mca_base_component_distill_checkpoint_ready' used to exist but 
was removed
with commit 'r28237':

-#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
-{
-int param_id = -1;
-int param_val = 0;
-/*
- * Extract supported mca parameters for selection contraints
- * Supported Options:
- *   - mca_base_component_distill_checkpoint_ready = Checkpoint Ready
- */
-param_id = mca_base_param_reg_int_name("mca", 
"base_component_distill_checkpoint_ready",
-   "Distill only those components 
that are Checkpoint Ready", 
-   false, false,
-   0, &param_val);
-if( 0 != param_val ) { /* Select Checkpoint Ready */
-open_only_flags |= MCA_BASE_METADATA_PARAM_CHECKPOINT;
-}
-}
-#endif  /* (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) */

The variable is defined in contrib/amca-param-sets/ft-enable-cr

mca_base_component_distill_checkpoint_ready=1

Looking at the names of the other variables, I would say it should be called

opal_base_distill_checkpoint_ready

and probably created with mca_base_var_register() or 
mca_base_component_var_register().

What would be the best place to create the variable so that it can be used 
again in
the FT code?

Adrian


Re: [OMPI devel] startup sstore orte/mca/ess/base/ess_base_std_tool.c

2014-02-21 Thread Ralph Castain
looks fine to me


On Feb 21, 2014, at 6:23 AM, Adrian Reber  wrote:

> To restart a process using orte-restart I need sstore initialized when
> running as a tool. This is currently missing. The new code is guarded by
> 
> #if OPAL_ENABLE_FT_CR == 1
> 
> and should only affect --with-ft builds. The following is the change I
> want to make:
> 
> diff --git a/orte/mca/ess/base/ess_base_std_tool.c 
> b/orte/mca/ess/base/ess_base_std_tool.c
> index 93aed89..b102e6d 100644
> --- a/orte/mca/ess/base/ess_base_std_tool.c
> +++ b/orte/mca/ess/base/ess_base_std_tool.c
> @@ -43,6 +43,7 @@
> #include "orte/mca/state/base/base.h"
> #if OPAL_ENABLE_FT_CR == 1
> #include "orte/mca/snapc/base/base.h"
> +#include "orte/mca/sstore/base/base.h"
> #endif
> #include "orte/util/proc_info.h"
> #include "orte/util/session_dir.h"
> @@ -175,11 +176,22 @@ int orte_ess_base_tool_setup(void)
> error = "orte_snapc_base_open";
> goto error;
> }
> +if (ORTE_SUCCESS != (ret = 
> mca_base_framework_open(&orte_sstore_base_framework, 0))) {
> +ORTE_ERROR_LOG(ret);
> +error = "orte_sstore_base_open";
> +goto error;
> +}
> +
> if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, 
> ORTE_PROC_IS_APP))) {
> ORTE_ERROR_LOG(ret);
> error = "orte_snapc_base_select";
> goto error;
> }
> +if (ORTE_SUCCESS != (ret = orte_sstore_base_select())) {
> +ORTE_ERROR_LOG(ret);
> +error = "orte_sstore_base_select";
> +goto error;
> +}
> 
> /* Tools do not need all the OPAL CR stuff */
> opal_cr_set_enabled(false);
> @@ -201,6 +213,7 @@ int orte_ess_base_tool_finalize(void)
> 
> #if OPAL_ENABLE_FT_CR == 1
> mca_base_framework_close(&orte_snapc_base_framework);
> +mca_base_framework_close(&orte_sstore_base_framework);
> #endif
> 
> /* if I am a tool, then all I will have done is
> 
> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Ralph Castain
Well, that all looks fine. However, I note that the procs on grsacc04 all 
stopped making progress at the same point, which is why the job hung. All the 
procs on the other nodes were just fine.

So let's try a couple of things:

1. add "--bind-to none" to your cmd line so we avoid any contention issues

2. if possible, remove grsacc04 from the allocation (you can just use the -host 
option on the cmd line to ignore it), and/or replace that host with another 
one. Let's see if the problem has something to do with that specific node.
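
As a sketch only (with hypothetical node names standing in for the real allocation),
that could look like:

mpiexec -host grsacc05,grsacc06,grsacc07 -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 --bind-to none -np 3 ./example

i.e. list every allocated node except grsacc04.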


On Feb 21, 2014, at 4:08 AM, Suraj Prabhakaran  
wrote:

> Right, so I have the output here. Same case, 
> 
> mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
> grpcomm_base_verbose 5  -np 3 ./simple_spawn
> 
> Output attached. 
> 
> Best,
> Suraj
> 
> 
> 
> On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:
> 
>> 
>> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran  
>> wrote:
>> 
>>> Thanks Ralph!
>>> 
>>> I should have mentioned, though: without the Torque environment, spawning with 
>>> ssh works fine. But under the Torque environment, it does not. 
>> 
>> Ah, no - you forgot to mention that point.
>> 
>>> 
>>> I started the simple_spawn with 3 processes and spawned 9 processes (3 per 
>>> node on 3 nodes). 
>>> 
>>> There is no problem with the Torque environment itself, because all 9 processes 
>>> are started on the respective nodes. But the MPI_Comm_spawn of the parent 
>>> and the MPI_Init of the children "sometimes" don't return!
>> 
>> Seems odd - the launch environment has nothing to do with MPI_Init, so if 
>> the processes are indeed being started, they should run. One possibility is 
>> that they aren't correctly getting some wireup info.
>> 
>> Can you configure OMPI --enable-debug and then rerun the example with "-mca 
>> plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on 
>> the command line?
>> 
>> 
>>> 
>>> This is the output of simple_spawn - which confirms the above statement. 
>>> 
>>> [pid 31208] starting up!
>>> [pid 31209] starting up!
>>> [pid 31210] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 31208] about to spawn!
>>> 1 completed MPI_Init
>>> Parent [pid 31209] about to spawn!
>>> 2 completed MPI_Init
>>> Parent [pid 31210] about to spawn!
>>> [pid 28630] starting up!
>>> [pid 28631] starting up!
>>> [pid 9846] starting up!
>>> [pid 9847] starting up!
>>> [pid 9848] starting up!
>>> [pid 6363] starting up!
>>> [pid 6361] starting up!
>>> [pid 6362] starting up!
>>> [pid 28632] starting up!
>>> 
>>> Any hints?
>>> 
>>> Best,
>>> Suraj
>>> 
>>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>>> 
 Hmmm...I don't see anything immediately glaring. What do you mean by 
 "doesn't work"? Is there some specific behavior you see?
 
 You might try the attached program. It's a simple spawn test we use - 
 1.7.4 seems happy with it.
 
 
 
 On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
  wrote:
 
> I am using 1.7.4! 
> 
> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
> 
>> What OMPI version are you using?
>> 
>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>>  wrote:
>> 
>>> Hello!
>>> 
>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work 
>>> when spawning more than 12 processes across multiple nodes. To be more 
>>> precise, "sometimes" it works, and "sometimes" it doesn't!
>>> 
>>> Here is my case: I obtain 5 nodes with 3 cores per node, and my 
>>> $PBS_NODEFILE looks like this:
>>> 
>>> node1
>>> node1
>>> node1
>>> node2
>>> node2
>>> node2
>>> node3
>>> node3
>>> node3
>>> node4
>>> node4
>>> node4
>>> node5
>>> node5
>>> node5
>>> 
>>> I started a hello program (which just spawns itself; the children, of 
>>> course, don't spawn again), with 
>>> 
>>> mpiexec -np 3 ./hello
>>> 
>>> Spawning 3 more processes (on node 2) - works!
>>> spawning 6 more processes (node 2 and 3) - works!
>>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>>> 
>>> Ideally, I want to spawn about 32 processes across a large number of nodes, 
>>> but at the moment this is impossible. I have attached my hello program 
>>> to this email. 
>>> 
>>> I will be happy to provide any more info or verbose outputs if you 
>>> could please tell me what exactly you would like to see.
>>> 
>>> Best,
>>> Suraj
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> deve

[OMPI devel] startup sstore orte/mca/ess/base/ess_base_std_tool.c

2014-02-21 Thread Adrian Reber
To restart a process using orte-restart I need sstore initialized when
running as a tool. This is currently missing. The new code is guarded by

#if OPAL_ENABLE_FT_CR == 1

and should only affect --with-ft builds. The following is the change I
want to make:

diff --git a/orte/mca/ess/base/ess_base_std_tool.c b/orte/mca/ess/base/ess_base_std_tool.c
index 93aed89..b102e6d 100644
--- a/orte/mca/ess/base/ess_base_std_tool.c
+++ b/orte/mca/ess/base/ess_base_std_tool.c
@@ -43,6 +43,7 @@
 #include "orte/mca/state/base/base.h"
 #if OPAL_ENABLE_FT_CR == 1
 #include "orte/mca/snapc/base/base.h"
+#include "orte/mca/sstore/base/base.h"
 #endif
 #include "orte/util/proc_info.h"
 #include "orte/util/session_dir.h"
@@ -175,11 +176,22 @@ int orte_ess_base_tool_setup(void)
         error = "orte_snapc_base_open";
         goto error;
     }
+    if (ORTE_SUCCESS != (ret = mca_base_framework_open(&orte_sstore_base_framework, 0))) {
+        ORTE_ERROR_LOG(ret);
+        error = "orte_sstore_base_open";
+        goto error;
+    }
+
     if (ORTE_SUCCESS != (ret = orte_snapc_base_select(ORTE_PROC_IS_HNP, ORTE_PROC_IS_APP))) {
         ORTE_ERROR_LOG(ret);
         error = "orte_snapc_base_select";
         goto error;
     }
+    if (ORTE_SUCCESS != (ret = orte_sstore_base_select())) {
+        ORTE_ERROR_LOG(ret);
+        error = "orte_sstore_base_select";
+        goto error;
+    }
 
     /* Tools do not need all the OPAL CR stuff */
     opal_cr_set_enabled(false);
@@ -201,6 +213,7 @@ int orte_ess_base_tool_finalize(void)
 
 #if OPAL_ENABLE_FT_CR == 1
     mca_base_framework_close(&orte_snapc_base_framework);
+    mca_base_framework_close(&orte_sstore_base_framework);
 #endif
 
     /* if I am a tool, then all I will have done is


Adrian


Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Right, so I have the output here. Same case, 

mpiexec  -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca 
grpcomm_base_verbose 5  -np 3 ./simple_spawn

Output attached. 

Best,
Suraj



output
Description: Binary data


On Feb 21, 2014, at 5:30 AM, Ralph Castain wrote:

> 
> On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran  
> wrote:
> 
>> Thanks Ralph!
>> 
>> I should have mentioned, though: without the Torque environment, spawning with 
>> ssh works fine. But under the Torque environment, it does not. 
> 
> Ah, no - you forgot to mention that point.
> 
>> 
>> I started the simple_spawn with 3 processes and spawned 9 processes (3 per 
>> node on 3 nodes). 
>> 
>> There is no problem with the Torque environment itself, because all 9 processes 
>> are started on the respective nodes. But the MPI_Comm_spawn of the parent 
>> and the MPI_Init of the children "sometimes" don't return!
> 
> Seems odd - the launch environment has nothing to do with MPI_Init, so if the 
> processes are indeed being started, they should run. One possibility is that 
> they aren't correctly getting some wireup info.
> 
> Can you configure OMPI --enable-debug and then rerun the example with "-mca 
> plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on 
> the command line?
> 
> 
>> 
>> This is the output of simple_spawn - which confirms the above statement. 
>> 
>> [pid 31208] starting up!
>> [pid 31209] starting up!
>> [pid 31210] starting up!
>> 0 completed MPI_Init
>> Parent [pid 31208] about to spawn!
>> 1 completed MPI_Init
>> Parent [pid 31209] about to spawn!
>> 2 completed MPI_Init
>> Parent [pid 31210] about to spawn!
>> [pid 28630] starting up!
>> [pid 28631] starting up!
>> [pid 9846] starting up!
>> [pid 9847] starting up!
>> [pid 9848] starting up!
>> [pid 6363] starting up!
>> [pid 6361] starting up!
>> [pid 6362] starting up!
>> [pid 28632] starting up!
>> 
>> Any hints?
>> 
>> Best,
>> Suraj
>> 
>> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>> 
>>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>>> "doesn't work"? Is there some specific behavior you see?
>>> 
>>> You might try the attached program. It's a simple spawn test we use - 1.7.4 
>>> seems happy with it.
>>> 
>>> 
>>> 
>>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>>  wrote:
>>> 
 I am using 1.7.4! 
 
 On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
 
> What OMPI version are you using?
> 
> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>  wrote:
> 
>> Hello!
>> 
>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work 
>> when spawning more than 12 processes across multiple nodes. To be more 
>> precise, "sometimes" it works, and "sometimes" it doesn't!
>> 
>> Here is my case: I obtain 5 nodes with 3 cores per node, and my $PBS_NODEFILE 
>> looks like this:
>> 
>> node1
>> node1
>> node1
>> node2
>> node2
>> node2
>> node3
>> node3
>> node3
>> node4
>> node4
>> node4
>> node5
>> node5
>> node5
>> 
>> I started a hello program (which just spawns itself; the children, of 
>> course, don't spawn again), with 
>> 
>> mpiexec -np 3 ./hello
>> 
>> Spawning 3 more processes (on node 2) - works!
>> spawning 6 more processes (node 2 and 3) - works!
>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>> 
>> Ideally, I want to spawn about 32 processes across a large number of nodes, 
>> but at the moment this is impossible. I have attached my hello program 
>> to this email. 
>> 
>> I will be happy to provide any more info or verbose outputs if you could 
>> please tell me what exactly you would like to see.
>> 
>> Best,
>> Suraj
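
The attached hello.c itself is not preserved in the archive. As a rough sketch only
(not Suraj's actual program), a self-spawning hello of the kind described above could
look like this; the spawn count of 9 mirrors the reported case and is otherwise
arbitrary:

/* Sketch of a self-spawning "hello" (illustrative only, not the attached file).
 * The parent job calls MPI_Comm_spawn on itself; spawned children detect a
 * parent communicator and simply report that MPI_Init completed. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent job: collectively spawn 9 copies of this binary */
        printf("parent %d about to spawn\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("parent %d: spawn returned\n", rank);
    } else {
        /* child job: do not spawn again, just report */
        printf("child %d completed MPI_Init\n", rank);
    }

    MPI_Finalize();
    return 0;
}

The hang described in the thread corresponds to MPI_Comm_spawn in the parents and
MPI_Init in the children not returning, even though all child processes are launched.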
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel