Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103

2014-03-18 Thread Jeff Squyres (jsquyres)
This seems to be working, but I think we now have a process group problem -- we 
need to setpgid() right after the fork.  Otherwise, when we kill the 
group, we might end up killing much more than just the one MPI process 
(including the orted and/or the orted's parent!).

Ping me on IM -- I'm testing this idea and it seems to work properly.


On Mar 18, 2014, at 4:11 PM, Ralph Castain  wrote:

> Okay, fixed and cmr'd to you
> 
> 
> On Mar 18, 2014, at 11:00 AM, Ralph Castain  wrote:
> 
>> 
>> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell)  
>> wrote:
>> 
>>> Ralph,
>>> 
>>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
>>> HEAD):
>>> 
>>> 8<
>>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>>> --
>>> The user-provided time limit for job execution has been
>>> reached:
>>> 
>>> MPIEXEC_TIMEOUT: 8 seconds
>>> 
>>> The job will now be aborted. Please check your code and/or
>>> adjust/remove the job execution time limit (as specified
>>> by MPIEXEC_TIMEOUT in your environment).
>>> 
>>> --
>>> srun: error: mpi015: task 0: Killed
>>> srun: Terminating job step 689585.2
>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
>>> mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) 
>>> [sd = 16]
>>> [savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
>>> mca_oob_tcp_peer_send_handler: unable to send header
>>> 
>>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly 
>>> terminate
>>> 
>>> ^C
>>> 8<
>>> 
>>> Where each "^C" is a ctrl-c; an arbitrary amount of time was allowed to 
>>> pass beforehand (several minutes before the first two, <5s before the third).
>>> 
>>> Where "sleeper" is just an MPI program that does:
>>> 
>>> 8<
>>>  MPI_Init(&argc, &argv);
>>>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>  MPI_Comm_size(MPI_COMM_WORLD, &size);
>>> 
>>>  while (1) {
>>>      sleep(60);
>>>  }
>>> 
>>>  MPI_Finalize();
>>> 8<
>>> 
>>> It happens under slurm and SSH.  If I launch on localhost (no 
>>> --host/--hostfile option, no slurm, etc.) then it exits just fine.  The 
>>> example output I gave above used the "usnic" BTL, but "tcp" has identical 
>>> behavior.
>>> 
>>> This worked fine in v1.7.4.  I've bisected the change in behavior down to 
>>> r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>> 
>>> Should I file a ticket?
>>> 
>> 
>> Crud - no, I'll take a look in a little bit
>> 
>> 
>>> -Dave
>>> 
>> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14367.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103

2014-03-18 Thread Ralph Castain
Okay, fixed and cmr'd to you


On Mar 18, 2014, at 11:00 AM, Ralph Castain  wrote:

> 
> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell)  
> wrote:
> 
>> Ralph,
>> 
>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
>> HEAD):
>> 
>> 8<
>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>> --
>> The user-provided time limit for job execution has been
>> reached:
>> 
>> MPIEXEC_TIMEOUT: 8 seconds
>> 
>> The job will now be aborted. Please check your code and/or
>> adjust/remove the job execution time limit (as specified
>> by MPIEXEC_TIMEOUT in your environment).
>> 
>> --
>> srun: error: mpi015: task 0: Killed
>> srun: Terminating job step 689585.2
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
>> mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd 
>> = 16]
>> [savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
>> mca_oob_tcp_peer_send_handler: unable to send header
>> 
>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly 
>> terminate
>> 
>> ^C
>> 8<
>> 
>> Where each "^C" is a ctrl-c; an arbitrary amount of time was allowed to 
>> pass beforehand (several minutes before the first two, <5s before the third).
>> 
>> Where "sleeper" is just an MPI program that does:
>> 
>> 8<
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>> 
>>   while (1) {
>>       sleep(60);
>>   }
>> 
>>   MPI_Finalize();
>> 8<
>> 
>> It happens under slurm and SSH.  If I launch on localhost (no 
>> --host/--hostfile option, no slurm, etc.) then it exits just fine.  The 
>> example output I gave above used the "usnic" BTL, but "tcp" has identical 
>> behavior.
>> 
>> This worked fine in v1.7.4.  I've bisected the change in behavior down to 
>> r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>> 
>> Should I file a ticket?
>> 
> 
> Crud - no, I'll take a look in a little bit
> 
> 
>> -Dave
>> 
> 



[hwloc-devel] === CREATE FAILURE (dev-135-g73e55a7) ===

2014-03-18 Thread MPI Team

ERROR: Command returned a non-zero exit status (dev-135-g73e55a7):
   ./autogen.sh

Start time: Tue Mar 18 14:36:21 EDT 2014
End time:   Tue Mar 18 14:36:28 EDT 2014

===
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I ./config
configure.ac:151: warning: macro `AM_ENABLE_SHARED' not found in library
configure.ac:152: warning: macro `AM_DISABLE_STATIC' not found in library
configure.ac:153: warning: macro `AM_PROG_LIBTOOL' not found in library
configure.ac:40: error: libtool version 2.2.6 or higher is required
configure.ac:40: the top level
autom4te: /usr/bin/m4 failed with exit status: 63
aclocal: autom4te failed with exit status: 63
autoreconf: aclocal failed with exit status: 63
===

Your friendly daemon,
Cyrador


[hwloc-devel] Create success (hwloc git 1.8.1-11-g969ae06)

2014-03-18 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc 1.8.1-11-g969ae06
Start time: Tue Mar 18 14:34:08 EDT 2014
End time:   Tue Mar 18 14:36:21 EDT 2014

Your friendly daemon,
Cyrador


[hwloc-devel] Create success (hwloc git dev-135-g6cad2d3)

2014-03-18 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc dev-135-g6cad2d3
Start time: Tue Mar 18 14:32:05 EDT 2014
End time:   Tue Mar 18 14:34:04 EDT 2014

Your friendly daemon,
Cyrador


Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103

2014-03-18 Thread Ralph Castain

On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell)  
wrote:

> Ralph,
> 
> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
> HEAD):
> 
> 8<
> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
> --
> The user-provided time limit for job execution has been
> reached:
> 
>  MPIEXEC_TIMEOUT: 8 seconds
> 
> The job will now be aborted. Please check your code and/or
> adjust/remove the job execution time limit (as specified
> by MPIEXEC_TIMEOUT in your environment).
> 
> --
> srun: error: mpi015: task 0: Killed
> srun: Terminating job step 689585.2
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
> mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd 
> = 16]
> [savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
> mca_oob_tcp_peer_send_handler: unable to send header
> 
> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly 
> terminate
> 
> ^C
> 8<
> 
> Where each "^C" is a ctrl-c; an arbitrary amount of time was allowed to 
> pass beforehand (several minutes before the first two, <5s before the third).
> 
> Where "sleeper" is just an MPI program that does:
> 
> 8<
>    MPI_Init(&argc, &argv);
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>    MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>    while (1) {
>        sleep(60);
>    }
> 
>    MPI_Finalize();
> 8<
> 
> It happens under slurm and SSH.  If I launch on localhost (no 
> --host/--hostfile option, no slurm, etc.) then it exits just fine.  The 
> example output I gave above used the "usnic" BTL, but "tcp" has identical 
> behavior.
> 
> This worked fine in v1.7.4.  I've bisected the change in behavior down to 
> r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
> 
> Should I file a ticket?
> 

Crud - no, I'll take a look in a little bit


> -Dave
> 



[OMPI devel] DNS migration of open-mpi.org

2014-03-18 Thread Jeff Squyres (jsquyres)
Tomorrow at 9am US Eastern, IU will be changing the IP address of open-mpi.org 
(and all of its associated services: email, web, etc.).

They're hoping it causes no downtime -- there should be proxies in place to 
relay traffic from the old IP addresses for the next week or two, so that no 
one should notice anything different while the DNS change is propagating.

svn.open-mpi.org will not be affected.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103

2014-03-18 Thread Dave Goodell (dgoodell)
Ralph,

I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
HEAD):

8<
MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
--
The user-provided time limit for job execution has been
reached:

  MPIEXEC_TIMEOUT: 8 seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment).

--
srun: error: mpi015: task 0: Killed
srun: Terminating job step 689585.2
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 
16]
[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
mca_oob_tcp_peer_send_handler: unable to send header

^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate

^C
8<

Where each "^C" is a ctrl-c; an arbitrary amount of time was allowed to pass 
beforehand (several minutes before the first two, <5s before the third).

Where "sleeper" is just an MPI program that does:

8<
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    while (1) {
        sleep(60);
    }

    MPI_Finalize();
8<

It happens under slurm and SSH.  If I launch on localhost (no --host/--hostfile 
option, no slurm, etc.) then it exits just fine.  The example output I gave 
above used the "usnic" BTL, but "tcp" has identical behavior.

This worked fine in v1.7.4.  I've bisected the change in behavior down to 
r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981

Should I file a ticket?

-Dave



Re: [OMPI devel] Hang in comm_spawn

2014-03-18 Thread Ralph Castain
It's on the trunk, but I imagine it is on 1.7 as well. I use the "simple_spawn" 
program in orte/test/mpi, and the cmd line is just "mpirun -np 2 ./simple_spawn"


On Mar 18, 2014, at 7:42 AM, Nathan Hjelm  wrote:

> Is this trunk or 1.7? Can you give me your mpirun command?
> 
> -Nathan
> 
> On Tue, Mar 18, 2014 at 07:35:01AM -0700, Ralph Castain wrote:
>>   I'm seeing comm_spawn hang here:
>>   [bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query]
>>   COLL-ML ml_coll_schedule_setup exit with error
>>   [bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query]
>>   COLL-ML ml_coll_schedule_setup exit with error
>>   Setting -mca coll ^ml allows things to run to completion just fine, so it
>>   appears that coll/ml is having a problem with comm_spawn.
>>   Ralph
> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/03/14361.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14362.php



Re: [OMPI devel] Hang in comm_spawn

2014-03-18 Thread Nathan Hjelm
Is this trunk or 1.7? Can you give me your mpirun command?

-Nathan

On Tue, Mar 18, 2014 at 07:35:01AM -0700, Ralph Castain wrote:
>I'm seeing comm_spawn hang here:
>[bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query]
>COLL-ML ml_coll_schedule_setup exit with error
>[bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query]
>COLL-ML ml_coll_schedule_setup exit with error
>Setting -mca coll ^ml allows things to run to completion just fine, so it
>appears that coll/ml is having a problem with comm_spawn.
>Ralph

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14361.php





[OMPI devel] Hang in comm_spawn

2014-03-18 Thread Ralph Castain
I'm seeing comm_spawn hang here:

[bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML 
ml_coll_schedule_setup exit with error
[bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML 
ml_coll_schedule_setup exit with error

Setting -mca coll ^ml allows things to run to completion just fine, so it 
appears that coll/ml is having a problem with comm_spawn.

Ralph



Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-18 Thread Adrian Reber
Thanks for your fix.

You say that the environment is only taken into account during
registration. There is another variable set in the
environment in opal-restart.c. Does the following still work:

opal-restart.c:

(void) mca_base_var_env_name("crs", &tmp_env_var);
opal_setenv(tmp_env_var,
            expected_crs_comp,
            true, &environ);
free(tmp_env_var);
tmp_env_var = NULL;

The preferred checkpointer is selected like this and in
opal_crs_base_select() the following happens:

if( OPAL_SUCCESS != mca_base_select("crs",
                                    opal_crs_base_framework.framework_output,
                                    &opal_crs_base_framework.framework_components,
                                    (mca_base_module_t **) &best_module,
                                    (mca_base_component_t **) &best_component) ) {
    /* This will only happen if no component was selected */
    exit_status = OPAL_ERROR;
    goto cleanup;
}

Does the mca_base_var_env_name() call influence which crs module
is selected during mca_base_select()? Or do I also have to change it
to mca_base_var_set_value() to select the preferred crs module?

Adrian


On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote:
> Good catch. Fixing now.
> 
> -Nathan
> 
> On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote:
> > On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
> > > The preferred way is to use mca_base_var_find and then call 
> > > mca_base_var_[set|get]_value. For performance sake we only look at the 
> > > environment when the variable is registered.
> > 
> > I believe I found a bug in mca_base_var_set_value using bool variables:
> > 
> > #0  0x7f6e0d8fb800 in mca_base_var_enum_bool_sfv (self=0x7f6e0dbabc20 
> > , value=0, 
> > string_value=0x0) at ../../../../opal/mca/base/mca_base_var_enum.c:82
> > #1  0x7f6e0d8f45d6 in mca_base_var_set_value (vari=120, value=0x4031e6, 
> > size=0, source=MCA_BASE_VAR_SOURCE_DEFAULT, 
> > source_file=0x0) at ../../../../opal/mca/base/mca_base_var.c:636
> > #2  0x00401e44 in main (argc=7, argv=0x7fffa72a0a78) at 
> > ../../../../opal/tools/opal-restart/opal-restart.c:223
> > 
> > I am using set_value like this:
> > 
> > bool test = false;
> > mca_base_var_set_value(idx, &test, 0, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);
> > 
> > As the size is ignored I am just setting it to '0'.
> > 
> > mca_base_var_set_value() does 
> > 
> > ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,((int *) 
> > value)[0], NULL);
> > 
> > which calls mca_base_var_enum_bool_sfv() with the last parameter set to 
> > NULL:
> > 
> > static int mca_base_var_enum_bool_sfv (mca_base_var_enum_t *self, const int 
> > value,
> >const char **string_value)
> > {
> > *string_value = value ? "true" : "false";
> > 
> > return OPAL_SUCCESS;
> > }
> > 
> > and here it dereferences the last parameter (string_value), which was
> > passed in as NULL. As I cannot find any other usage of mca_base_var_set_value()
> > with bool variables, this code path has probably never been exercised before.
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/03/14354.php



> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14355.php


Adrian

-- 
Adrian Reber http://lisas.de/~adrian/
printk(KERN_ERR "msp3400: chip reset failed, penguin on i2c bus?\n");
2.2.16 /usr/src/linux/drivers/char/msp3400.c

