[OMPI devel] ORTE Scaling results: updated
Hello all,

The wiki page has been updated with the latest test results from a new branch that implemented inbound collectives on the modex and barrier operations. As you will see from the graphs, ORTE/OMPI now exhibits a negative 2nd derivative on the launch-time curve for mpi_no_op (i.e., MPI_Init/MPI_Finalize).

Some cleanup of the branch code is required before insertion into the trunk. I'll send out a note when that occurs.

The wiki page is at: https://svn.open-mpi.org/trac/ompi/wiki/ORTEScalabilityTesting

Ralph
Re: [OMPI devel] MPI_Comm_connect/Accept
Still no luck here. I launch these three processes:

term1$ ompi-server -d --report-uri URIFILE
term2$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_accept
term3$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_connect

The output of ompi-server shows a successful publish and lookup, and I get the correct port on the client side. However, the result is the same as when not using the Publish/Lookup mechanism: the connect fails, saying the port cannot be reached.

Found port < 1940389889.0;tcp://160.36.252.99:49777;tcp6://2002:a024:ed65:9:21b:63ff:fecb:28:49778;tcp6://fec0::9:21b:63ff:fecb:28:49778;tcp6://2002:a024:ff7f:9:21b:63ff:fecb:28:49778:300 >
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../trunk/orte/mca/rml/oob/rml_oob_send.c at line 140
[abouteil.nomad.utk.edu:60339] [[29620,1],0] attempted to send to [[29608,1],0]
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../trunk/ompi/mca/dpm/orte/dpm_orte.c at line 455
[abouteil.nomad.utk.edu:60339] *** An error occurred in MPI_Comm_connect
[abouteil.nomad.utk.edu:60339] *** on communicator MPI_COMM_SELF
[abouteil.nomad.utk.edu:60339] *** MPI_ERR_UNKNOWN: unknown error
[abouteil.nomad.utk.edu:60339] *** MPI_ERRORS_ARE_FATAL (goodbye)

I took a look at the source code, and I think the problem comes from a conceptual mistake in MPI_Comm_connect. The function "connect_accept" in dpm_orte.c takes an orte_process_name_t as the destination port. This structure only contains the jobid and the vpid (always set to 0, which I guess means you plan to contact the HNP of that job). Obviously, if the accepting process does not share the same HNP with the connecting process, there is no way for MPI_Comm_connect to fill this field correctly.
The whole purpose of the port_name string is to provide a consistent way to access the remote endpoint without a complicated name-resolution service. I think this function should take the port_name instead (the string returned by open_port) and contact that endpoint directly over OOB to get the contact information it needs from there, not from the local HNP.

Aurelien

Le 4 avr. 08 à 15:21, Ralph H Castain a écrit :

Okay, I have a partial fix in there now. You'll have to use -mca routed unity, as I still need to fix it for routed tree. A couple of things:

1. I fixed the --debug flag so it automatically turns on the debug output from the data server code itself. Now ompi-server will tell you when it is accessed.

2. Remember, we added an MPI_Info key that specifies whether you want the data stored locally (on your own mpirun) or globally (on the ompi-server). If you specify nothing, there is a precedence built into the code that defaults to "local". So you have to tell us that this data is to be published "global" if you want to connect multiple mpiruns. I believe Jeff wrote all that up somewhere - could be in an email thread, though. Been too long ago for me to remember... ;-) You can look it up in the code as a last resort - it is in ompi/mca/pubsub/orte/pubsub_orte.c.

Ralph

On 4/4/08 12:55 PM, "Ralph H Castain" wrote:

Well, something got borked in here - will have to fix it, so this will probably not get done until next week.

On 4/4/08 12:26 PM, "Ralph H Castain" wrote:

Yeah, you didn't specify the file correctly... plus I found a bug in the code when I looked (a little out-of-date in orterun). I am updating orterun (commit soon) and will include a better help message about the proper format of the orterun cmd-line option. The syntax is:

-ompi-server uri
or
-ompi-server file:filename-where-uri-exists

The problem here is that you gave it a uri of "test", which means nothing. ;-) Should have it up-and-going soon.
Ralph

On 4/4/08 12:02 PM, "Aurélien Bouteiller" wrote:

Ralph,

I've not been very successful at using ompi-server. I tried this:

xterm1$ ompi-server --debug-devel -d --report-uri test
[grosse-pomme.local:01097] proc_info: hnp_uri NULL daemon uri NULL
[grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!

xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
Port name: 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300

xterm3$ mpirun -ompi-server test -np 1 simple_connect
--
Process rank 0 attempted to lookup from a global ompi_server that could not be contacted. This is typically caused by either not specifying the contact info for the server, or by the server not currently executing. If you did specify the contact info for a server, please check to see that the server is running and start it again (or have your sys admin s
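[Editor's note: for readers following this thread, the accept side of the test programs presumably follows the standard MPI-2 dynamic-process pattern sketched below. The service name "ompi_test" and the "ompi_global_scope" info key are assumptions on my part - check ompi/mca/pubsub/orte/pubsub_orte.c, as Ralph suggests, for the authoritative key name.]

```c
/* Sketch of the accept side (compile with mpicc). The connect side does
 * MPI_Lookup_name() to recover the port string, then MPI_Comm_connect().
 * "ompi_test" and the "ompi_global_scope" info key are assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);   /* port string carries the endpoint contact info */

    /* Publish globally so a second mpirun (via ompi-server) can look it up;
     * without this the publish defaults to "local" scope per Ralph's note. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");  /* assumed key name */
    MPI_Publish_name("ompi_test", info, port);
    printf("Port name: %s\n", port);

    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... communicate over 'client' ... */

    MPI_Unpublish_name("ompi_test", info, port);
    MPI_Info_free(&info);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}
```

Note that the port string printed above already contains the tcp contact addresses, which is exactly Aurelien's point: connect_accept should be able to reach the endpoint from the port_name alone, without routing through the local HNP.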
Re: [OMPI devel] Signals
On 4/8/08 2:19 PM, "Ralph H Castain" wrote:

> On 4/8/08 12:10 PM, "Pak Lui" wrote:
>
>> Richard Graham wrote:
>>> What happens if I deliver sigusr2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the mpi procs, the signal handler is invoked, and the job continues to run.
>>
>> This is exactly what I have observed previously when I made the gridengine change. It is due to the fact that orterun (aka mpirun) is the process fork-and-exec'ing the executables on the HNP; on the remote nodes, you don't have this problem. So the wait_daemon function picks up the signal from mpirun on the HNP, then kills off the children.
>
> I'll look into this, but I don't believe this is true UNLESS something exits. The wait_daemon function only gets called when a proc terminates - it doesn't "pick up" a signal on its own. Perhaps we are just having a language problem here...
>
> In the rsh situation, the daemon "daemonizes" and closes the ssh session during launch. If the ssh session closed on a signal, then that would return and indicate that a daemon had failed to start, causing the abort. But that session is successfully closed PRIOR to the launch of any MPI procs. I note that we don't "deregister" the waitpid, though, so there may be some issue there.
>
> However, we most certainly do NOT look for such things in Torque. My guess is that something is causing a proc/daemon to abort, which then causes the system to abort the job.
>
> I have tried this on my Mac (got other things going on at the moment on the distributed machines), and all works as expected. However, that doesn't mean there isn't a problem in general.

Interesting - I do most of my development work on the Mac, and this is where I also see the problem. I have not updated in a couple of days, so maybe things have been fixed since then.

Rich

> Will investigate when I have time shortly.
>
>>> So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.
>>>
>>> Rich
>>>
>>> On 4/8/08 1:47 PM, "Ralph H Castain" wrote:
>>>
>>>> I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted. So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen. However, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.
>>>>
>>>> We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.
>>>>
>>>> On 4/8/08 11:36 AM, "Pak Lui" wrote:
>>>>
>>>>> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>>>>>
>>>>> from signal(3HEAD):
>>>>>
>>>>>     Name     Value  Default  Event
>>>>>     SIGUSR1  16     Exit     User Signal 1
>>>>>     SIGUSR2  17     Exit     User Signal 2
>>>>>
>>>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>>>>>
>>>>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>>>>>
>>>>> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>>>>>
>>>>> Richard Graham wrote:
>>>>>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there
Re: [OMPI devel] Signals
On 4/8/08 12:10 PM, "Pak Lui" wrote:

> Richard Graham wrote:
>> What happens if I deliver sigusr2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the mpi procs, the signal handler is invoked, and the job continues to run.
>
> This is exactly what I have observed previously when I made the gridengine change. It is due to the fact that orterun (aka mpirun) is the process fork-and-exec'ing the executables on the HNP; on the remote nodes, you don't have this problem. So the wait_daemon function picks up the signal from mpirun on the HNP, then kills off the children.

I'll look into this, but I don't believe this is true UNLESS something exits. The wait_daemon function only gets called when a proc terminates - it doesn't "pick up" a signal on its own. Perhaps we are just having a language problem here...

In the rsh situation, the daemon "daemonizes" and closes the ssh session during launch. If the ssh session closed on a signal, then that would return and indicate that a daemon had failed to start, causing the abort. But that session is successfully closed PRIOR to the launch of any MPI procs. I note that we don't "deregister" the waitpid, though, so there may be some issue there.

However, we most certainly do NOT look for such things in Torque. My guess is that something is causing a proc/daemon to abort, which then causes the system to abort the job.

I have tried this on my Mac (got other things going on at the moment on the distributed machines), and all works as expected. However, that doesn't mean there isn't a problem in general.

Will investigate when I have time shortly.

>> So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.
>>
>> Rich
>>
>> On 4/8/08 1:47 PM, "Ralph H Castain" wrote:
>>
>>> I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted. So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen. However, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.
>>>
>>> We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.
>>>
>>> On 4/8/08 11:36 AM, "Pak Lui" wrote:
>>>
>>>> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>>>>
>>>> from signal(3HEAD):
>>>>
>>>>     Name     Value  Default  Event
>>>>     SIGUSR1  16     Exit     User Signal 1
>>>>     SIGUSR2  17     Exit     User Signal 2
>>>>
>>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>>>>
>>>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>>>>
>>>> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>>>>
>>>> Richard Graham wrote:
>>>>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>>>>>
>>>>> Thanks,
>>>>> Rich
Re: [OMPI devel] Signals
Richard Graham wrote:
> What happens if I deliver sigusr2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the mpi procs, the signal handler is invoked, and the job continues to run.

This is exactly what I have observed previously when I made the gridengine change. It is due to the fact that orterun (aka mpirun) is the process fork-and-exec'ing the executables on the HNP; on the remote nodes, you don't have this problem. So the wait_daemon function picks up the signal from mpirun on the HNP, then kills off the children.

> So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.
>
> Rich
>
> On 4/8/08 1:47 PM, "Ralph H Castain" wrote:
>
>> I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted. So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen. However, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.
>>
>> We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.
>>
>> On 4/8/08 11:36 AM, "Pak Lui" wrote:
>>
>>> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>>>
>>> from signal(3HEAD):
>>>
>>>     Name     Value  Default  Event
>>>     SIGUSR1  16     Exit     User Signal 1
>>>     SIGUSR2  17     Exit     User Signal 2
>>>
>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>>>
>>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>>>
>>> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>>>
>>> Richard Graham wrote:
>>>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>>>>
>>>> Thanks,
>>>> Rich

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
- Pak Lui
pak@sun.com
Re: [OMPI devel] Signals
Hmmm... well, I'll take a look. I haven't seen that behavior, but I haven't checked it in some time.

On 4/8/08 11:54 AM, "Richard Graham" wrote:

> What happens if I deliver sigusr2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the mpi procs, the signal handler is invoked, and the job continues to run.
>
> So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.
>
> Rich
>
> On 4/8/08 1:47 PM, "Ralph H Castain" wrote:
>
>> I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted.
>>
>> So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen - however, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.
>>
>> We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.
>>
>> On 4/8/08 11:36 AM, "Pak Lui" wrote:
>>
>>> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>>>
>>> from signal(3HEAD):
>>>
>>>     Name     Value  Default  Event
>>>     SIGUSR1  16     Exit     User Signal 1
>>>     SIGUSR2  17     Exit     User Signal 2
>>>
>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>>>
>>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>>>
>>> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>>>
>>> Richard Graham wrote:
>>>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>>>>
>>>> Thanks,
>>>> Rich
Re: [OMPI devel] Signals
What happens if I deliver sigusr2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the mpi procs, the signal handler is invoked, and the job continues to run.

So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.

Rich

On 4/8/08 1:47 PM, "Ralph H Castain" wrote:

> I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted.
>
> So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen - however, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.
>
> We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.
>
> On 4/8/08 11:36 AM, "Pak Lui" wrote:
>
>> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>>
>> from signal(3HEAD):
>>
>>     Name     Value  Default  Event
>>     SIGUSR1  16     Exit     User Signal 1
>>     SIGUSR2  17     Exit     User Signal 2
>>
>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>>
>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>>
>> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>>
>> Richard Graham wrote:
>>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>>>
>>> Thanks,
>>> Rich
Re: [OMPI devel] Signals
I found what Pak said a little confusing, as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted.

So if the proc continues alive, the fact that it was hit with SIGUSR2 will not be detected by ORTE, nor will anything happen - however, if the OS uses SIGUSR2 to terminate the proc, or if the proc terminates when it gets that signal, we will see that proc terminate due to signal and abort the rest of the job.

We could change it if that is what people want - it is trivial to insert code to say "kill everything except if it died due to a certain signal". Up to you folks. Current behavior is what you said you wanted a long time ago - nothing has changed in this regard for several years.

On 4/8/08 11:36 AM, "Pak Lui" wrote:

> First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.
>
> from signal(3HEAD):
>
>     Name     Value  Default  Event
>     SIGUSR1  16     Exit     User Signal 1
>     SIGUSR2  17     Exit     User Signal 2
>
> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.
>
> I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.
>
> SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.
>
> Richard Graham wrote:
>> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>>
>> Thanks,
>> Rich
Re: [OMPI devel] Signals
First, can your user executable create a signal handler to catch the SIGUSR2 and not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing.

from signal(3HEAD):

    Name     Value  Default  Event
    SIGUSR1  16     Exit     User Signal 1
    SIGUSR2  17     Exit     User Signal 2

The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would kill all the user processes on that node once it receives a signal.

I worked around this for the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can decide what to do with them themselves.

SGE needed this so that job-kill or job-suspension notification works properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh plm.

Richard Graham wrote:
> I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.
>
> Thanks,
> Rich

--
- Pak Lui
pak@sun.com
[OMPI devel] Signals
I am running into a situation where I am trying to deliver a signal to the mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the mpi procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where I don't have a debugger, and am trying to use it to turn on debugging when I hit a hang, so killing the mpi procs is really not what I want to have happen.

Thanks,
Rich
Re: [OMPI devel] mpirun return code problems
I'm aware - as we discussed on a recent telecon, I put it on my list of things to resolve. The solution is known - just busy with other things at the moment.

On 4/8/08 6:06 AM, "Tim Prins" wrote:

> Hi all,
>
> I reported this before, but it seems that the report got lost. I have found some situations where mpirun will return '0' when there is an error.
>
> An easy way to reproduce this is to edit the file 'orte/mca/plm/base/plm_base_launch_support.c' and on line 154 put in 'return ORTE_ERROR;' (or apply the attached diff).
>
> Then recompile and run mpirun. mpirun will indicate there was an error, but will still return 0. The reason this is concerning to me is that MTT only looks at return codes, so our tests may be failing and we wouldn't know it.
>
> Thanks,
>
> Tim
>
> Index: orte/mca/plm/base/plm_base_launch_support.c
> ===
> --- orte/mca/plm/base/plm_base_launch_support.c (revision 18092)
> +++ orte/mca/plm/base/plm_base_launch_support.c (working copy)
> @@ -151,7 +151,7 @@
>                               ORTE_JOBID_PRINT(job), ORTE_ERROR_NAME(rc)));
>          return rc;
>      }
> -
> +    return ORTE_ERROR;
>      /* complete wiring up the iof */
>      OPAL_OUTPUT_VERBOSE((5, orte_plm_globals.output,
>                           "%s plm:base:launch wiring up iof",
[OMPI devel] mpirun return code problems
Hi all,

I reported this before, but it seems that the report got lost. I have found some situations where mpirun will return '0' when there is an error.

An easy way to reproduce this is to edit the file 'orte/mca/plm/base/plm_base_launch_support.c' and on line 154 put in 'return ORTE_ERROR;' (or apply the attached diff).

Then recompile and run mpirun. mpirun will indicate there was an error, but will still return 0. The reason this is concerning to me is that MTT only looks at return codes, so our tests may be failing and we wouldn't know it.

Thanks,

Tim

Index: orte/mca/plm/base/plm_base_launch_support.c
===
--- orte/mca/plm/base/plm_base_launch_support.c (revision 18092)
+++ orte/mca/plm/base/plm_base_launch_support.c (working copy)
@@ -151,7 +151,7 @@
                              ORTE_JOBID_PRINT(job), ORTE_ERROR_NAME(rc)));
         return rc;
     }
-
+    return ORTE_ERROR;
     /* complete wiring up the iof */
     OPAL_OUTPUT_VERBOSE((5, orte_plm_globals.output,
                          "%s plm:base:launch wiring up iof",