Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Brian Barrett

On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:


We've just run across a rather tricky issue. We're calling
opal_event_loop() to dispatch orte events to an orted that has been
launched separately. However if the orted dies for some reason (gets
a signal or whatever) then opal_event_loop() is calling exit().
Needless to say, this is not good behavior for us. Any suggestions on how
to get around this problem?


Is the orted you are connecting to the "seed" daemon?  I think the  
only time we should be exiting like that is if the orted was the seed  
daemon.  I'm not sure what we want to do if that's the case -- it  
looks like we're calling errmgr.abort() when badness happens.  I  
wonder if your application can provide its own errmgr component that  
provides an abort that doesn't actually abort?  Just some off-the-cuff
ideas -- Ralph could probably give a better idea of exactly what
is happening...
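
One way to picture the errmgr idea above is a table of function pointers whose abort entry the application swaps for a version that records the failure instead of exiting. The sketch below is a self-contained mock, not ORTE code: `errmgr_module_t`, `default_abort`, `recording_abort`, `report_lost_seed`, and `last_abort_status` are hypothetical names, and how a real errmgr component gets selected is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an error-manager module: a table of hooks. */
typedef struct {
    void (*abort_fn)(int status);
} errmgr_module_t;

/* Default policy in this mock: terminate the whole process. */
void default_abort(int status)
{
    fprintf(stderr, "errmgr: aborting with status %d\n", status);
    exit(status);
}

/* Application-supplied policy: remember the failure and return. */
int last_abort_status = 0;

void recording_abort(int status)
{
    last_abort_status = status;
}

/* The active module; the mock library consults it on fatal errors. */
errmgr_module_t errmgr = { default_abort };

/* Hypothetical library-side path taken when contact with the seed is lost. */
void report_lost_seed(int status)
{
    assert(errmgr.abort_fn != NULL);
    errmgr.abort_fn(status);
}
```

In this mock, installing the replacement is a single assignment, `errmgr.abort_fn = recording_abort;`, after which `report_lost_seed(7)` returns to the caller and leaves 7 in `last_abort_status`.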



Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Ralph Castain




Well, I actually don't know much about opal_event_loop and/or how it is
intended to work. My guess is that:

(a) your remote orted is acting as the seed and your local process (the
one in Eclipse) is running as a client to that seed - at least, that
was the case last I talked to Nathan

(b) when the seed orted dies, it is the oob in your local client that
actually detects socket closure and decides that - since it is the seed
that has lost contact - the local application must abort.

(c) the errmgr.abort function does exactly what it was supposed to do -
it provides an immediate way of killing the local process.

I'd be a little hesitant to recommend overloading the errmgr.abort
function as you really do want the local processes to die when losing
connection to the seed (at least, until we develop a recovery
capability for the seed orted - which is some ways off), and (given the
way you are running) I'm not sure you can have a different errmgr for
your process while leaving the other one for everyone else.

Probably the best solution for now would be for us to insert a (yet
another) MCA parameter into the errmgr that would (if set) have
errmgr.abort do something other than exit. The question then is: what
would you want it to do? We need to have it tell the rest of the
system to stop trying to send messages, etc. Right now, I don't think
the infrastructure exists to do that short of killing orte.

We could try to have errmgr.abort do an orte_finalize - that would kill
the orte system without impacting your host program, I suspect. You
would then have to re-initialize, so we'd have to find some way to let
you know that we had finalized. I can't swear this will work, though -
we might well generate a segfault since this is happening deep down
inside the system. We could try it, though.

Would any of that be of help? Do you have any suggestions on how we
might let you know that we had finalized?
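
To make the finalize-instead-of-exit idea concrete, here is a minimal mock in which the abort path shuts the runtime down, later sends fail fast, and the host polls a flag to learn that it must re-initialize. Every name here (`rt_init`, `rt_abort`, `rt_send`, `rt_is_finalized`, `RT_ERR_FINALIZED`) is hypothetical and none of it is existing ORTE API. The mock only flips state, so it says nothing about whether a real teardown from deep inside the system would be safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { RT_SUCCESS = 0, RT_ERR_FINALIZED = -1 };

/* Whether the mock runtime is currently usable. */
bool rt_initialized = false;

int rt_init(void)
{
    rt_initialized = true;
    return RT_SUCCESS;
}

/* Abort path: shut the runtime down but leave the host process alive. */
void rt_abort(void)
{
    rt_initialized = false;
}

/* Lets the host discover that it has to call rt_init() again. */
bool rt_is_finalized(void)
{
    return !rt_initialized;
}

/* Sends are refused once the runtime is down, so nothing keeps
 * trying to talk to a seed that no longer exists. */
int rt_send(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (!rt_initialized)
        return RT_ERR_FINALIZED;
    /* A real runtime would hand the buffer to its messaging layer here. */
    return RT_SUCCESS;
}
```

A polled flag is only one possible notification channel; it was picked for the sketch because it needs no cooperation from the code that triggered the abort.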

Ralph


  





Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Ralph Castain




You make a good point about the library not calling exit(). I'll have
to recruit some help to look at the notion of opal_event_loop returning
an error value - it isn't entirely clear who it would return it to in
our system. Even though I understand how someone in your situation
would handle it, I have to ensure that it doesn't cause the base system
problems, or force a major code revision that would need to be
scheduled into the project.

We'll have to get back to you on this - most of the folks are at a
workshop this week, so it will probably be next week before we can
discuss it.

Ralph


Greg Watson wrote:
The simplest thing for us would be for opal_event_loop()
to return an error value. That way we can detect the situation and
clean up our system. At the moment we're not trying to restart orted,
so clean recovery of orte is not that important, though ultimately I
would think it is desirable. Other alternatives are to pass you an
error handler that you call, or you could send a signal that we can
trap.
  
  
From our perspective, we're simply calling a library that does stuff.
Having the library call exit() at any point is a major problem for
applications trying to do more than run a single job.
  
  
Greg
  
  
  

___

devel mailing list

de...@open-mpi.org

http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  
  





Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Greg Watson

The simplest thing for us would be for opal_event_loop() to return an
error value. That way we can detect the situation and clean up our
system. At the moment we're not trying to restart orted, so clean
recovery of orte is not that important, though ultimately I would
think it is desirable. Other alternatives are to pass you an error
handler that you call, or you could send a signal that we can trap.


From our perspective, we're simply calling a library that does  
stuff. Having the library call exit() at any point is a major problem  
for applications trying to do more than run a single job.
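
Of the alternatives listed above, the signal variant can be sketched with plain POSIX calls: the application traps a signal, and the library raises it instead of calling exit(). This is a mock under stated assumptions. The choice of SIGUSR1 and the names `runtime_lost`, `on_runtime_lost`, `install_runtime_lost_handler`, `notify_runtime_lost`, and `demo_signal_notification` are all hypothetical, and nothing here is taken from Open MPI.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written from the handler, so it must be a volatile sig_atomic_t. */
volatile sig_atomic_t runtime_lost = 0;

/* Handler: only record that the notification arrived. */
void on_runtime_lost(int signo)
{
    (void)signo;
    runtime_lost = 1;
}

/* Application side: trap SIGUSR1. Returns 0 on success, -1 on error. */
int install_runtime_lost_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_runtime_lost;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical library side: notify the host instead of calling exit(). */
int notify_runtime_lost(void)
{
    return raise(SIGUSR1);
}

/* Usage: install the handler, trigger the notification, observe the flag. */
void demo_signal_notification(void)
{
    int rc = install_runtime_lost_handler();
    assert(rc == 0);
    rc = notify_runtime_lost();
    assert(rc == 0);
    assert(runtime_lost == 1);
}
```

The handler does nothing except set a flag, which keeps it async-signal-safe; the application would check `runtime_lost` from its main loop and do the actual cleanup there.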


Greg





Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Greg Watson

Ok, thanks.

For clarification, the model we're using at the moment looks roughly  
like this:


    orte_init();

    for (;;) {
        if (do_our_stuff() == GAME_OVER)
            break;
        opal_event_loop(OPAL_EVLOOP_ONCE);  /* dispatch pending events once */
    }

    orte_finalize();

The simplest change for us would be something like:

    orte_init();

    for (;;) {
        if (do_our_stuff() == GAME_OVER)
            break;
        /* proposed: the event loop reports failure instead of calling exit() */
        if (opal_event_loop(OPAL_EVLOOP_ONCE) != ORTE_SUCCESS) {
            clean_up_our_stuff();
            break;
        }
    }

    orte_finalize();
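
The proposed loop can be exercised without ORTE by stubbing the event loop so that it fails after a few calls. In the sketch below, `do_our_stuff`, `clean_up_our_stuff`, and `GAME_OVER` come from the loop above; everything else (`mock_event_loop_once`, `MOCK_SUCCESS`, `MOCK_ERROR`, `events_until_failure`, `cleaned_up`, `run_client`) is a hypothetical stand-in and not part of OPAL or ORTE.

```c
#include <assert.h>
#include <stdbool.h>

enum { MOCK_SUCCESS = 0, MOCK_ERROR = -1 };
enum { KEEP_GOING = 0, GAME_OVER = 1 };

/* Mock state: the stub loop succeeds this many times, then fails. */
int events_until_failure = 3;
bool cleaned_up = false;

/* Stub event loop that reports failure by return value, not exit(). */
int mock_event_loop_once(void)
{
    assert(events_until_failure >= 0);
    if (events_until_failure == 0)
        return MOCK_ERROR;
    events_until_failure--;
    return MOCK_SUCCESS;
}

/* Application work; in this mock it never asks to stop. */
int do_our_stuff(void)
{
    return KEEP_GOING;
}

void clean_up_our_stuff(void)
{
    cleaned_up = true;
}

/* Same shape as the proposed loop; returns the number of
 * iterations whose event dispatch succeeded. */
int run_client(void)
{
    int iterations = 0;
    for (;;) {
        if (do_our_stuff() == GAME_OVER)
            break;
        if (mock_event_loop_once() != MOCK_SUCCESS) {
            clean_up_our_stuff();
            break;
        }
        iterations++;
    }
    return iterations;
}
```

With the initial state shown, `run_client()` returns 3 and leaves `cleaned_up` set, which is the behavior the application wants in place of a process exit.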

Greg




Re: [OMPI devel] opal_event_loop exiting

2006-04-20 Thread Ralph Castain




Looks reasonable - let me see what can be done.

Thanks
Ralph



[OMPI devel] Pack data mismatch in file dps_unpack.c 95/121

2006-04-20 Thread Galen M. Shipman

Hey Guys,

Not sure what is going on here. Has anyone seen this before?

- Galen




Hi Galen,

Sorry to bother you.

I have installed the latest stable version of Open MPI (1.0) on two of
the spider nodes (s7, s4) for some experiments, but there seems to be a
configuration error or something else which I don't understand. After
installing, as a test I ran a simple MPI program, but it complains with
the following errors.

[s4:10685] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 121
[s4:10685] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 95


Further digging with gdb prints the following errors:
[s7:07005] ERROR: A daemon on node s4 failed to start as expected.
[s7:07005] ERROR: There may be more information available from
[s7:07005] ERROR: the remote shell (see above).
[s7:07005] The daemon received a signal 5.
[s7:07005] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 121
[s7:07005] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 95
[s7:07005] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 121
[s7:07005] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch in file dps_unpack.c at line 95

Any clue as to what I am doing wrong?

thanks,
-Manjunath