Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
This seems to be working, but I think we now have a pid group problem -- I think we need to setpgid() right after the fork. Otherwise, when we kill the group, we might end up killing much more than just the one MPI process (including the orted and/or the orted's parent!). Ping me on IM -- I'm testing this idea and it seems to work properly.

On Mar 18, 2014, at 4:11 PM, Ralph Castain wrote:

> Okay, fixed and cmr'd to you
>
> On Mar 18, 2014, at 11:00 AM, Ralph Castain wrote:
>
>> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) wrote:
>>
>>> Ralph,
>>>
>>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>>>
>>> 8<
>>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>>> --
>>> The user-provided time limit for job execution has been reached:
>>>
>>> MPIEXEC_TIMEOUT: 8 seconds
>>>
>>> The job will now be aborted. Please check your code and/or adjust/remove
>>> the job execution time limit (as specified by MPIEXEC_TIMEOUT in your
>>> environment).
>>> --
>>> srun: error: mpi015: task 0: Killed
>>> srun: Terminating job step 689585.2
>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>>>
>>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>>>
>>> ^C
>>> 8<
>>>
>>> Each "^C" is a ctrl-c; an arbitrary amount of time was allowed to pass beforehand (several minutes before the first two, <5s before the third).
>>>
>>> "sleeper" is just an MPI program that does:
>>>
>>> 8<
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>> while (1) {
>>>     sleep(60);
>>> }
>>>
>>> MPI_Finalize();
>>> 8<
>>>
>>> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>>>
>>> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>>
>>> Should I file a ticket?
>>
>> Crud - no, I'll take a look in a little bit
>>
>>> -Dave
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14367.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
Okay, fixed and cmr'd to you

On Mar 18, 2014, at 11:00 AM, Ralph Castain wrote:

> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) wrote:
>
>> Ralph,
>>
>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>>
>> 8<
>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>> --
>> The user-provided time limit for job execution has been reached:
>>
>> MPIEXEC_TIMEOUT: 8 seconds
>>
>> The job will now be aborted. Please check your code and/or adjust/remove
>> the job execution time limit (as specified by MPIEXEC_TIMEOUT in your
>> environment).
>> --
>> srun: error: mpi015: task 0: Killed
>> srun: Terminating job step 689585.2
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>>
>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>>
>> ^C
>> 8<
>>
>> Each "^C" is a ctrl-c; an arbitrary amount of time was allowed to pass beforehand (several minutes before the first two, <5s before the third).
>>
>> "sleeper" is just an MPI program that does:
>>
>> 8<
>> MPI_Init(&argc, &argv);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>> while (1) {
>>     sleep(60);
>> }
>>
>> MPI_Finalize();
>> 8<
>>
>> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>>
>> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>
>> Should I file a ticket?
>
> Crud - no, I'll take a look in a little bit
>
>> -Dave
[hwloc-devel] === CREATE FAILURE (dev-135-g73e55a7) ===
ERROR: Command returned a non-zero exit status (dev-135-g73e55a7): ./autogen.sh

Start time: Tue Mar 18 14:36:21 EDT 2014
End time: Tue Mar 18 14:36:28 EDT 2014

===
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I ./config
configure.ac:151: warning: macro `AM_ENABLE_SHARED' not found in library
configure.ac:152: warning: macro `AM_DISABLE_STATIC' not found in library
configure.ac:153: warning: macro `AM_PROG_LIBTOOL' not found in library
configure.ac:40: error: libtool version 2.2.6 or higher is required
configure.ac:40: the top level
autom4te: /usr/bin/m4 failed with exit status: 63
aclocal: autom4te failed with exit status: 63
autoreconf: aclocal failed with exit status: 63
===

Your friendly daemon,
Cyrador
[hwloc-devel] Create success (hwloc git 1.8.1-11-g969ae06)
Creating nightly hwloc snapshot git tarball was a success.

Snapshot: hwloc 1.8.1-11-g969ae06
Start time: Tue Mar 18 14:34:08 EDT 2014
End time: Tue Mar 18 14:36:21 EDT 2014

Your friendly daemon,
Cyrador
[hwloc-devel] Create success (hwloc git dev-135-g6cad2d3)
Creating nightly hwloc snapshot git tarball was a success.

Snapshot: hwloc dev-135-g6cad2d3
Start time: Tue Mar 18 14:32:05 EDT 2014
End time: Tue Mar 18 14:34:04 EDT 2014

Your friendly daemon,
Cyrador
Re: [OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) wrote:

> Ralph,
>
> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>
> 8<
> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
> --
> The user-provided time limit for job execution has been reached:
>
> MPIEXEC_TIMEOUT: 8 seconds
>
> The job will now be aborted. Please check your code and/or adjust/remove
> the job execution time limit (as specified by MPIEXEC_TIMEOUT in your
> environment).
> --
> srun: error: mpi015: task 0: Killed
> srun: Terminating job step 689585.2
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>
> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>
> ^C
> 8<
>
> Each "^C" is a ctrl-c; an arbitrary amount of time was allowed to pass beforehand (several minutes before the first two, <5s before the third).
>
> "sleeper" is just an MPI program that does:
>
> 8<
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
>
> while (1) {
>     sleep(60);
> }
>
> MPI_Finalize();
> 8<
>
> It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.
>
> This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>
> Should I file a ticket?

Crud - no, I'll take a look in a little bit

> -Dave
[OMPI devel] DNS migration of open-mpi.org
Tomorrow at 9am US Eastern, IU will be changing the IP address of open-mpi.org (and all of its associated services: email, web, etc.). They're hoping it causes no downtime -- there should be proxies in place to relay traffic from the old IP addresses for the next week or two, so that no one should notice anything different while the DNS change is propagating.

svn.open-mpi.org will not be affected.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] MPIEXEC_TIMEOUT broken in v1.7 branch @ r31103
Ralph,

I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):

8<
MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
--
The user-provided time limit for job execution has been reached:

    MPIEXEC_TIMEOUT: 8 seconds

The job will now be aborted. Please check your code and/or adjust/remove
the job execution time limit (as specified by MPIEXEC_TIMEOUT in your
environment).
--
srun: error: mpi015: task 0: Killed
srun: Terminating job step 689585.2
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
[savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header

^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate

^C
8<

Each "^C" is a ctrl-c; an arbitrary amount of time was allowed to pass beforehand (several minutes before the first two, <5s before the third).

"sleeper" is just an MPI program that does:

8<
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

while (1) {
    sleep(60);
}

MPI_Finalize();
8<

It happens under slurm and SSH. If I launch on localhost (no --host/--hostfile option, no slurm, etc.) then it exits just fine. The example output I gave above used the "usnic" BTL, but "tcp" has identical behavior.

This worked fine in v1.7.4. I've bisected the change in behavior down to r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981

Should I file a ticket?

-Dave
Re: [OMPI devel] Hang in comm_spawn
It's on the trunk, but I imagine it is on 1.7 as well. I use the "simple_spawn" program in orte/test/mpi, and the cmd line is just "mpirun -np 2 ./simple_spawn"

On Mar 18, 2014, at 7:42 AM, Nathan Hjelm wrote:

> Is this trunk or 1.7? Can you give me your mpirun command?
>
> -Nathan
>
> On Tue, Mar 18, 2014 at 07:35:01AM -0700, Ralph Castain wrote:
>> I'm seeing comm_spawn hang here:
>> [bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error
>> [bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error
>> Setting -mca coll ^ml allows things to run to completion just fine, so it appears that coll/ml is having a problem with comm_spawn.
>> Ralph
>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14361.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14362.php
Re: [OMPI devel] Hang in comm_spawn
Is this trunk or 1.7? Can you give me your mpirun command?

-Nathan

On Tue, Mar 18, 2014 at 07:35:01AM -0700, Ralph Castain wrote:
> I'm seeing comm_spawn hang here:
> [bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error
> [bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error
> Setting -mca coll ^ml allows things to run to completion just fine, so it appears that coll/ml is having a problem with comm_spawn.
> Ralph

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14361.php
[OMPI devel] Hang in comm_spawn
I'm seeing comm_spawn hang here:

[bend001][[52890,1],0][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error
[bend001][[52890,1],1][coll_ml_module.c:3030:mca_coll_ml_comm_query] COLL-ML ml_coll_schedule_setup exit with error

Setting -mca coll ^ml allows things to run to completion just fine, so it appears that coll/ml is having a problem with comm_spawn.

Ralph
Re: [OMPI devel] usage of mca variables in orte-restart
Thanks for your fix. You say that the environment is only taken into account during register. There is another variable set in the environment in opal-restart.c. Does the following still work?

opal-restart.c:

    (void) mca_base_var_env_name("crs", &tmp_env_var);
    opal_setenv(tmp_env_var, expected_crs_comp, true, );
    free(tmp_env_var);
    tmp_env_var = NULL;

The preferred checkpointer is selected like this, and in opal_crs_base_select() the following happens:

    if( OPAL_SUCCESS != mca_base_select("crs", opal_crs_base_framework.framework_output,
                                        &opal_crs_base_framework.framework_components,
                                        (mca_base_module_t **) &best_module,
                                        (mca_base_component_t **) &best_component) ) {
        /* This will only happen if no component was selected */
        exit_status = OPAL_ERROR;
        goto cleanup;
    }

Does the mca_base_var_env_name() influence which crs module is selected during mca_base_select()? Or do I have to change it also to mca_base_var_set_value() to select the preferred crs module?

		Adrian

On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote:

> Good catch. Fixing now.
>
> -Nathan
>
> On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote:
>> On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
>>> The preferred way is to use mca_base_var_find and then call mca_base_var_[set|get]_value. For performance sake we only look at the environment when the variable is registered.
>>
>> I believe I found a bug in mca_base_var_set_value using bool variables:
>>
>> #0 0x7f6e0d8fb800 in mca_base_var_enum_bool_sfv (self=0x7f6e0dbabc20, value=0, string_value=0x0) at ../../../../opal/mca/base/mca_base_var_enum.c:82
>> #1 0x7f6e0d8f45d6 in mca_base_var_set_value (vari=120, value=0x4031e6, size=0, source=MCA_BASE_VAR_SOURCE_DEFAULT, source_file=0x0) at ../../../../opal/mca/base/mca_base_var.c:636
>> #2 0x00401e44 in main (argc=7, argv=0x7fffa72a0a78) at ../../../../opal/tools/opal-restart/opal-restart.c:223
>>
>> I am using set_value like this:
>>
>> bool test = false;
>> mca_base_var_set_value(idx, &test, 0, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);
>>
>> As the size is ignored I am just setting it to '0'.
>>
>> mca_base_var_set_value() does
>>
>> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, ((int *) value)[0], NULL);
>>
>> which calls mca_base_var_enum_bool_sfv() with the last parameter set to NULL:
>>
>> static int mca_base_var_enum_bool_sfv (mca_base_var_enum_t *self, const int value,
>>                                        const char **string_value)
>> {
>>     *string_value = value ? "true" : "false";
>>
>>     return OPAL_SUCCESS;
>> }
>>
>> and here it tries to access the last parameter (string_value), which has been set to NULL. As I cannot find any usage of mca_base_var_set_value() with bool variables, this code path has probably not been used until now.
>>
>> 		Adrian
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14354.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14355.php

		Adrian

--
Adrian Reber    http://lisas.de/~adrian/
    printk(KERN_ERR "msp3400: chip reset failed, penguin on i2c bus?\n");
        2.2.16 /usr/src/linux/drivers/char/msp3400.c