[OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
Something is broken in the trunk. # mpirun -np 2 -H host1,host2 ./osu_latency -- Some of the requested hosts are not included in the current allocation. The requested hosts were specified with --host as: host1,host2 Please

Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Tim Prins
Sorry about that. I removed a field in a structure, then 'svn up' seems to have added it back, so we were using a field that should not even exist in a couple places. Should be fixed in r17757 Tim Gleb Natapov wrote: Something is broken in the trunk. # mpirun -np 2 -H host1,host2 ./osu_lat

Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
On Thu, Mar 06, 2008 at 07:49:13AM -0500, Tim Prins wrote: > Sorry about that. I removed a field in a structure, then 'svn up' seems > to have added it back, so we were using a field that should not even > exist in a couple places. > > Should be fixed in r17757 Works again. Thanks --

Re: [OMPI devel] [RFC] Reduce the number of tests run by make check

2008-03-06 Thread Jeff Squyres
Tim and I talked about this on IM. We'd like to amend the proposal: 1. Remove these tests from make check, but leave them in SVN per the original proposal. 2. File a ticket to make carto selection not fail when no components are found (I filed https://svn.open-mpi.org/trac/ompi/ticket/1232).

Re: [OMPI devel] Orte cleanup

2008-03-06 Thread Ralph Castain
I believe I have at least helped reduce this with r17761. I added the ability for procs to detect that their "lifeline" connection (either the HNP for unity routed, or their local daemon for tree) has been lost and gracefully abort. Let me know if that helps Ralph On 3/4/08 9:37 PM, "Aurélien B

[OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Hello, I've been doing some work on fault response within the system, and finally realized something I should probably have seen a while back. Perhaps I am misunderstanding somewhere, so forgive the ignorance if so. When we designed ORTE some time in the deep, dark past, we had envisioned that peop

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Josh Hursey
The checkpoint/restart work that I have integrated does not respond to failure at the moment. If a failure happens, I want ORTE to terminate the entire job. I will then restart the entire job from a checkpoint file. This follows the 'all fall down' approach that users typically expect when

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't sure if/where it fit into anyone's future plans. Thanks Ralph On 3/6/08 9:13 AM, "Josh Hursey" wrote: > The checkpoint/restart work that I have integrated does not respond to > failure at the moment. If a failure happen

[OMPI devel] 1.2.6rc2 posted

2008-03-06 Thread Jeff Squyres
In the usual place: http://www.open-mpi.org/software/ompi/v1.2/ It contains a few changes, such as the new pml_ob1_use_early_completion MCA parameter: http://svn.open-mpi.org/svn/ompi/branches/v1.2/NEWS -- Jeff Squyres Cisco Systems

[OMPI devel] Open MPI v1.2.6rc2 has been posted

2008-03-06 Thread Tim Mattox
Hi All, The "first" (actually rc2) release candidate of Open MPI v1.2.6 is now up: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can. -- Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright..

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Aurélien Bouteiller
Aside from what Josh said, we are working right now at UTK on orted/MPI recovery (without killing/respawning all). So far we have had no use for the errmgr, but I'm quite sure this would be the smartest place to put all the mechanisms we are trying now. Aurelien On 6 March 08 at 11:17, Ralph Casta

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17766

2008-03-06 Thread Tim Mattox
This still has a race condition... which can be dealt with using opal_atomic stuff. See below. On Thu, Mar 6, 2008 at 2:35 PM, wrote: > Author: rhc > Date: 2008-03-06 14:35:57 EST (Thu, 06 Mar 2008) > New Revision: 17766 > URL: https://svn.open-mpi.org/trac/ompi/changeset/17766 > > Log: > F

Re: [OMPI devel] Fwd: OpenMPI changes

2008-03-06 Thread Jeff Squyres
On Mar 5, 2008, at 1:50 PM, Greg Watson wrote: Looking back through the mailing list, I can only see two references that seem relevant to this. One was titled "Major reduction in ORTE" and does allude to the event model changes. The other "OMPI/ORTE and tools" talks about "alternative methods of

[OMPI devel] use of AC_CACHE_CHECK in otf

2008-03-06 Thread Ralf Wildenhues
In ompi/contrib/vt/vt/extlib/otf/acinclude.m4, in the macros WITH_DEBUG and WITH_VERBOSE, dubious constructs such as AC_CACHE_CHECK([debug], [debug], [debug=]) are used. These have the following problems: * Cache variables need to match *_cv_* in order to actually be saved (

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17766

2008-03-06 Thread Ralph H Castain
Thanks Tim - good suggestion! Had to modify your proposed code a tad to get it to compile and work, but it is definitely a cleaner solution. Ralph On 3/6/08 1:34 PM, "Tim Mattox" wrote: > This still has a race condition... which can be dealt with using > opal_atomic stuff. > See below. > > On

[OMPI devel] libevent vs. libev

2008-03-06 Thread Jeff Squyres
FYI: since I was the one who stirred up the hornet's nest a while ago :-), I thought I'd update everyone -- we're actually *not* going to use libev anymore. We're simply going to update to a newer version of libevent, which seems to have all the things that we care about (better performanc

[OMPI devel] 3 test failures

2008-03-06 Thread Ralf Wildenhues
Hello, I've just stumbled over three testsuite failures on GNU/Linux x86, with an out-of-tree build (mkdir build; cd build; ../ompi_trunk/configure -C). Hope I'm not completely off-topic here... Cheers, Ralf PASS: ompi_bitmap -

Re: [OMPI devel] 3 test failures

2008-03-06 Thread Jeff Squyres
Nope, you're not off-topic at all. This has been a debate among us developers for a few days now... :-) The issue is that these tests are now doing something that assumes that OMPI has been installed. We've sent an RFC around to the developers proposing how to fix it (easy solution: just r