[OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
Something is broken in the trunk.

# mpirun -np 2 -H host1,host2  ./osu_latency
--------------------------------------------------------------------------
Some of the requested hosts are not included in the current allocation.

The requested hosts were specified with --host as:
host1,host2

Please check your allocation or your request.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--------------------------------------------------------------------------

If I create a hostfile with host1 and host2 and use it instead of -H,
mpirun works.

--
Gleb.


Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Tim Prins
Sorry about that. I removed a field in a structure, then 'svn up' seems 
to have added it back, so we were using a field that should not even 
exist in a couple places.


Should be fixed in r17757

Tim

Gleb Natapov wrote:

[snip]
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] orte can't launch process

2008-03-06 Thread Gleb Natapov
On Thu, Mar 06, 2008 at 07:49:13AM -0500, Tim Prins wrote:
> Sorry about that. I removed a field in a structure, then 'svn up' seems 
> to have added it back, so we were using a field that should not even 
> exist in a couple places.
> 
> Should be fixed in r17757
Works again. Thanks

--
Gleb.


Re: [OMPI devel] [RFC] Reduce the number of tests run by make check

2008-03-06 Thread Jeff Squyres

Tim and I talked about this on IM.  We'd like to amend the proposal:

1. Remove these tests from make check, but leave them in SVN per the  
original proposal.
2. File a ticket to make carto selection not fail when no components  
are found (I filed https://svn.open-mpi.org/trac/ompi/ticket/1232).
3. File a ticket to amend "make check" (or similar) with some scripty-foo to do the following:

- find all components in the build tree
- sym link them all into a single tree
- setenv OMPI_MCA_component_path to that tree
- then run the tests
This will allow actually testing the components in the build tree (without an OMPI installation).


Tim and I don't have time to do #3 in the near future -- perhaps  
someone else can do it.  It will pave the way for future, more  
comprehensive tests in the tree (since we won't be bound by the "must  
have OMPI installed" limitation).



On Mar 4, 2008, at 1:13 PM, Tim Prins wrote:


WHAT: Reduce the number of tests run by make check

WHY: Some of the tests will not work properly until Open MPI is
installed. Also, many of the tests do not really test anything.

WHERE: See below.

TIMEOUT: COB Friday March 14

DESCRIPTION:
We have been having many problems with make check over the years. People
tend to change things and not update the tests, which leads to tarball
generation failures and nightly test run failures. Furthermore, many of
the tests test things which have not changed for years.

So with this in mind, I propose only running the following tests when
'make check' is run:
asm/atomic_barrier
asm/atomic_barrier_noinline
asm/atomic_spinlock
asm/atomic_spinlock_noinline
asm/atomic_math
asm/atomic_math_noinline
asm/atomic_cmpset
asm/atomic_cmpset_noinline

We would no longer run the following tests:
class/ompi_bitmap_t
class/opal_hash_table_t
class/opal_list_t
class/opal_value_array_t
class/opal_pointer_array
class/ompi_rb_tree_t
memory/opal_memory_basic
memory/opal_memory_speed
memory/opal_memory_cxx
threads/opal_thread
threads/opal_condition
datatype/ddt_test
datatype/checksum
datatype/position
peruse/mpi_peruse

These tests would not be deleted from the repository, just made so they
do not run by default.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Orte cleanup

2008-03-06 Thread Ralph Castain
I believe I have at least helped reduce this with r17761. I added the
ability for procs to detect that their "lifeline" connection (either the HNP
for unity routed, or their local daemon for tree) has been lost and
gracefully abort.

Let me know if that helps
Ralph



On 3/4/08 9:37 PM, "Aurélien Bouteiller"  wrote:

> I noticed that the new release of orte is not as good as it used to be
> at cleaning up the mess left by crashed/aborted MPI processes. Recently we
> have been experiencing a lot of zombie or live-locked processes
> running on the cluster nodes and disturbing subsequent experiments. I
> didn't really have time to investigate the issue; maybe Ralph can open a
> ticket if he is able to reproduce this.
> 
> Aurelien
> --
> * Dr. Aurélien Bouteiller
> * Sr. Research Associate at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321
> 
> 
> 
> 
> 





[OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Hello

I've been doing some work on fault response within the system, and finally
realized something I should probably have seen awhile back. Perhaps I am
misunderstanding somewhere, so forgive the ignorance if so.

When we designed ORTE some time in the deep, dark past, we had envisioned
that people might want multiple ways of responding to process faults and/or
abnormal terminations. You might want to just abort the job, attempt to
restart just that proc, attempt to restart the job, etc. To support these
multiple options, and to provide a means for people to simply try new ones,
we created the errmgr framework.

Our thought was that a process and/or daemon would call the errmgr when we
detected something abnormal happening, and that the selected errmgr
component could then do whatever fault response was desired.

However, I now see that the fault tolerance mechanisms inside of OMPI do not
seem to be using that methodology. Instead, we have hard-coded a particular
response into the system.

If we configure without FT, we just abort the entire job since that is the
only errmgr component that exists.

If we configure with FT, then we execute the hard-coded C/R methodology.
This is built directly into the code, so there is no option as to what
happens.

Is there a reason why the errmgr framework was not used? Did the FT team
decide that this was not a useful tool to support multiple FT strategies?
Can we modify it to better serve those needs, or is it simply not feasible?

If it isn't going to be used for that purpose, then I might as well remove
it. As things stand, there really is no purpose served by the errmgr
framework - might as well replace it with just a function call.

Appreciate any insights
Ralph




Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Josh Hursey
The checkpoint/restart work that I have integrated does not respond to  
failure at the moment. If a failures happens I want ORTE to terminate  
the entire job. I will then restart the entire job from a checkpoint  
file. This follows the 'all fall down' approach that users typically  
expect when using a global C/R technique.


Eventually I want to integrate something better where I can respond to  
a failure with a recovery from inside ORTE. I'm not there yet, but  
hopefully in the near future.


I'll let the UTK group talk about what they are doing with ORTE, but I  
suspect they will be taking advantage of the errmgr to help respond to  
failure and restart a single process.



It is important to consider in this context that we do *not* always
want ORTE to abort whenever it detects a process failure. Aborting is
the default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should
be supported. But there is another mode, MPI_ERRORS_RETURN, in which we
would like ORTE to keep running:

 http://www.mpi-forum.org/docs/mpi-11-html/node148.html

It is known that certain standards-conformant MPI "fault tolerant"
programs do not work in Open MPI for various reasons, some in the
runtime and some external. Here we are mostly talking about
disconnected fates of intra-communicator groups. I have a test in the
ompi-tests repository that illustrates this problem, but I do not have
time to fix it at the moment.



So in short keep the errmgr around for now. I suspect we will be using  
it, and possibly tweaking it in the nearish future.


Thanks for the observation.

Cheers,
Josh

On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:


[snip]




Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't
sure if/where it fit into anyone's future plans.

Thanks
Ralph



On 3/6/08 9:13 AM, "Josh Hursey"  wrote:

> [snip]




[OMPI devel] 1.2.6rc2 posted

2008-03-06 Thread Jeff Squyres

In the usual place:

http://www.open-mpi.org/software/ompi/v1.2/

It contains a few changes, such as the new  
pml_ob1_use_early_completion MCA parameter:


http://svn.open-mpi.org/svn/ompi/branches/v1.2/NEWS

--
Jeff Squyres
Cisco Systems



[OMPI devel] Open MPI v1.2.6rc2 has been posted

2008-03-06 Thread Tim Mattox
Hi All,
The "first" (actually rc2) release candidate of Open MPI v1.2.6 is now up:

 http://www.open-mpi.org/software/ompi/v1.2/

Please run it through its paces as best you can.
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Aurélien Bouteiller
Aside from what Josh said, we are working right now at UTK on orted/MPI
recovery (without killing/respawning everything). For now we have had no
use for the errmgr, but I'm quite sure this would be the smartest place
to put all the mechanisms we are trying now.


Aurelien
On Mar 6, 2008, at 11:17, Ralph Castain wrote:

[snip]





Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17766

2008-03-06 Thread Tim Mattox
This still has a race condition... which can be dealt with using
opal_atomic stuff.
See below.

On Thu, Mar 6, 2008 at 2:35 PM,   wrote:
> Author: rhc
>  Date: 2008-03-06 14:35:57 EST (Thu, 06 Mar 2008)
>  New Revision: 17766
>  URL: https://svn.open-mpi.org/trac/ompi/changeset/17766
>
>  Log:
>  Fix a race condition - ensure we don't call terminate in orterun more than 
> once, even if the timeout fires while we are doing so
[snip]
>  Modified: trunk/orte/tools/orterun/orterun.c
>  
> ==
>  --- trunk/orte/tools/orterun/orterun.c  (original)
>  +++ trunk/orte/tools/orterun/orterun.c  2008-03-06 14:35:57 EST (Thu, 06 Mar 
> 2008)
>  @@ -112,14 +112,15 @@
>   static bool want_prefix_by_default = (bool) 
> ORTE_WANT_ORTERUN_PREFIX_BY_DEFAULT;
>   static opal_event_t *orterun_event, *orteds_exit_event;
>   static char *ompi_server=NULL;
>  +static bool terminating=false;
>
[snip]
>  @@ -644,6 +638,12 @@
>  orte_proc_t **procs;
>  orte_vpid_t i;
>
>  +/* flag that we are here to avoid doing it twice */
>  +if (terminating) {
>  +return;
>  +}
>  +terminating = true;
>  +
[snip]

I think this race condition should be dealt with like this:

#include "opal/sys/atomic.h"

static opal_atomic_lock_t terminating = OPAL_ATOMIC_UNLOCKED;

...

if (opal_atomic_trylock(&terminating)) { /* returns 1 if already locked */
    return;
}


-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] Fwd: OpenMPI changes

2008-03-06 Thread Jeff Squyres

On Mar 5, 2008, at 1:50 PM, Greg Watson wrote:


Looking back through the mailing list, I can only see two references
that seem relevant to this. One was titled "Major reduction in ORTE"
and does allude to the event model changes. The other "OMPI/ORTE and
tools" talks about "alternative methods of interaction". Neither
mentions changes to the spawning and


I thought that the subject "major reduction in ORTE" would have been  
an eyebrow-raiser.  I'm not trying to be snarky; my only point is that  
if you have a stake in using ORTE, it would probably be worthwhile to  
monitor what is happening and raise your hand / be part of the  
community to help shape its direction.  We all know that open source !=
free.


Perhaps you and Brad can have lunch every once in a while to discuss  
ORTE.  :-)



I/O forwarding functionality
(that I can see),


FWIW: nothing has changed with regards to I/O forwarding functionality  
or APIs (other than a big pile of bug fixes somewhere in the middle of  
the 1.2 series).  Ralph mentioned recently that it doesn't work beyond  
the model that mpirun uses (e.g., having multiple taps for the same  
stdout), but it *never* has.  We have some open bugs in trac about  
this, but no one has fixed them yet.


If I ever get the time, I'd like to do *lots* of things with IOF, but  
I don't know when that will happen...



or that this would be the exclusive mechanism for
interaction. In the future (assuming there are more changes), it would
be helpful if there was at least some information about what specific
API's are being removed.


I can't speak for Ralph, but I think if anyone had asked, I'm guessing  
that he would have been happy to have provided whatever information he  
had.  However, I'm not entirely sure that it was possible to know  
everything that was going to happen when first embarking on this "ORTE  
reduction" journey -- I seem to recall that questions and problems  
arose along the way that caused shifting of ORTE plans during the  
reduction / reorganization.


I think there were *some* updates about this stuff on the mailing list  
and on the weekly teleconferences, but I wasn't aware that anyone  
outside of OMPI cared about the ORTE underneath OMPI -- so at least I  
never bothered to re-broadcast outside of our group...  :-\


--
Jeff Squyres
Cisco Systems



[OMPI devel] use of AC_CACHE_CHECK in otf

2008-03-06 Thread Ralf Wildenhues
In ompi/contrib/vt/vt/extlib/otf/acinclude.m4, in the macros WITH_DEBUG
and WITH_VERBOSE, dubious constructs such as

AC_CACHE_CHECK([debug],
[debug],
[debug=])

are used.  These have the following problems:

* Cache variables need to match *_cv_* in order to actually be saved
(where the bit before _cv_ is preferably a package or author prefix,
for namespace cleanliness).  The next Autoconf version will warn about
this.

* There is little need to cache information that the user provided on
the configure command line.  If configure is rerun by './config.status
--recheck', it remembers the original configure command line.  Only if
the user manually reruns configure (and keeps config.cache) does this
make a difference.

So I suggest you remove those two instances of AC_CACHE_CHECK usage,
or forward this information to the author of otf.

Thanks,
Ralf


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17766

2008-03-06 Thread Ralph H Castain
Thanks Tim - good suggestion! Had to modify your proposed code a tad to get
it to compile and work, but it is definitely a cleaner solution.

Ralph


On 3/6/08 1:34 PM, "Tim Mattox"  wrote:

> [snip]



[OMPI devel] libevent vs. libev

2008-03-06 Thread Jeff Squyres
FYI: since I was the one who stirred up the hornet's nest a while  
ago :-), I thought I'd update everyone -- we're actually *not* going  
to use libev anymore.  We're simply going to update to a newer version  
of libevent, which seems to have all the things that we care about  
(better performance, smaller footprint, etc.).


George/UTK has done a bunch of the work for upgrading (based on a pile  
of information provided by Brian at the Paris meeting); I'm helping  
them integrate it into the trunk over the next week or so.


--
Jeff Squyres
Cisco Systems



[OMPI devel] 3 test failures

2008-03-06 Thread Ralf Wildenhues
Hello,

I've just stumbled over three testsuite failures on GNU/Linux x86,
with an out-of-tree build (mkdir build; cd build;
../ompi_trunk/configure -C).  Hope I'm not completely off-topic here...

Cheers,
Ralf

PASS: ompi_bitmap
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
opal_init:startup:internal-failure
from the file:
help-opal-runtime.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
 Failure : Comparison failure
 Expected result: 0
 Test result: -13
SUPPORT: OMPI Test failed: opal_hash_table_t (1 of 1 failed)
FAIL: opal_hash_table
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
opal_init:startup:internal-failure
from the file:
help-opal-runtime.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
 Failure : Comparison failure
 Expected result: 0
 Test result: -13
SUPPORT: OMPI Test failed: (null) (1 of 1 failed)
FAIL: opal_list
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
opal_init:startup:internal-failure
from the file:
help-opal-runtime.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
 Failure : Comparison failure
 Expected result: 0
 Test result: -13
SUPPORT: OMPI Test failed: opal_value_array_t (1 of 1 failed)
FAIL: opal_value_array



Re: [OMPI devel] 3 test failures

2008-03-06 Thread Jeff Squyres
Nope, you're not off-topic at all.  This has been a debate among us  
developers for a few days now... :-)


The issue is that these tests are now doing something that assume that  
OMPI has been installed.  We've sent an RFC around to the developers  
proposing how to fix it (easy solution: just remove these tests from  
"make check"), and have a longer-term fix filed as a trac ticket  
(allow carto to fail gracefully when there are no components  
available: https://svn.open-mpi.org/trac/ompi/ticket/1232).



On Mar 6, 2008, at 4:54 PM, Ralf Wildenhues wrote:


[snip]



--
Jeff Squyres
Cisco Systems