Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Very good - that is pretty much all that the signal_job API does. On Mar 17, 2010, at 4:11 PM, Leonardo Fialho wrote: > Anyway, to signal another job I have sent an RML message with the > ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP. > > Leonardo > > On Mar 17, 2010, at 9:59 PM, Ral

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Anyway, to signal another job I have sent an RML message with the ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP. Leonardo On Mar 17, 2010, at 9:59 PM, Ralph Castain wrote: > Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly > realized the problem :-/ > > Terry i

Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly realized the problem :-/ Terry is correct - we don't initialize the plm framework in application processes. However, there is a default proxy module for that framework so that applications can call comm_spawn. Unfortunately,

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Yes, I know the difference :) I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM stayed resident after launching, but it does so only for mpirun and orted; you're right. On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote: > On 03/17/2010 10:10 AM, Leonardo Fialho wrote: >> >>

Re: [OMPI devel] Signals

2010-03-17 Thread Terry Dontje
On 03/17/2010 10:10 AM, Leonardo Fialho wrote: Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of view? It might be because plm's are really only used at launch time not in MPI processes. Note plm != pml. --td Leonardo On Mar 17, 2010, at 2:52 PM, Leonardo Fialh

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of view? Leonardo On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote: > To clarify a little bit more: I'm calling orte_plm.signal_job from a PML > component, I know that ORTE is below OMPI, but I think that this funct

Re: [OMPI devel] Signals

2010-03-17 Thread Terry Dontje
Can you print out what the value of orte_plm.signal_job is? I bet it is pointing to address 0. So the question is: is orte_plm actually initialized in an MPI process? My guess would be no but I am sure Ralph will be able to answer more definitively. --td On 03/17/2010 09:52 AM, Leonardo Fialho wrote:
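
The check Terry suggests amounts to testing a function pointer in a module struct before calling through it. A minimal sketch of that guard follows. It deliberately does not use the ORTE headers: every demo_* name and both return codes are hypothetical stand-ins, and the only thing taken from this thread is the (jobid, signal) argument order of signal_job.

```c
#include <stddef.h>

/* Hypothetical return codes for this sketch (not ORTE constants). */
#define DEMO_SUCCESS            0
#define DEMO_ERR_NOT_AVAILABLE (-1)

/* Stand-in for a module entry point taking (jobid, signal). */
typedef int (*demo_signal_job_fn)(unsigned int jobid, int signo);

/* Stand-in for a framework module struct; an entry the selected
 * module does not implement is left as NULL. */
typedef struct {
    demo_signal_job_fn signal_job;
} demo_plm_module_t;

/* Call signal_job only if the module actually provides it, instead of
 * jumping through a NULL pointer and crashing. */
int demo_signal_job(const demo_plm_module_t *plm, unsigned int jobid, int signo)
{
    if (plm == NULL || plm->signal_job == NULL) {
        return DEMO_ERR_NOT_AVAILABLE;
    }
    return plm->signal_job(jobid, signo);
}

/* Trivial implementation used to show the non-NULL path. */
int demo_fake_signal_job(unsigned int jobid, int signo)
{
    (void)jobid;
    (void)signo;
    return DEMO_SUCCESS;
}
```

With a module whose signal_job entry is NULL, demo_signal_job reports DEMO_ERR_NOT_AVAILABLE rather than segfaulting, which is the situation described for application processes in this thread.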

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
To clarify a little bit more: I'm calling orte_plm.signal_job from a PML component, I know that ORTE is below OMPI, but I think that this function could not be available, or something like this. I can't figure out where this snprintf is either, in my code there is only opal_output(0, "receive

Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Thanks for clarifying - guess I won't chew just yet. :-) I still don't see in your trace where it is failing in signal_job. I didn't see the message indicating it was sending the signal cmd out in your prior debug output, and there isn't a printf in that code loop other than the debug output. C

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Ralph, don't swallow your message yet... The two jobs are not running under the same mpirun. There are two instances of mpirun in which one runs with "-report-uri ../contact.txt" and the other receives its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with

Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
I'm going to have to eat my last message. It slipped past me that your other job was started via comm_spawn. Since both "jobs" are running under the same mpirun, there shouldn't be a problem sending a signal between them. I don't know why this would be crashing. Are you sure it is crashing in

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Well, thank you anyway :) On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote: > Yeah, that probably won't work. The current code isn't intended to cross jobs > like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure > it won't support it. > > I don't currently know any way

Re: [OMPI devel] Signals

2010-03-16 Thread Ralph Castain
Yeah, that probably won't work. The current code isn't intended to cross jobs like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure it won't support it. I don't currently know any way to do what you are trying to do. We could extend the signal code to handle it, I would

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Yes... but something wrong is going on... maybe the problem is that the jobid is different from the process' jobid, I don't know. I'm trying to send a signal to another process running under another job. The other process jumps into an accept_connect on the MPI comm. So I wrote code like this (I

Re: [OMPI devel] Signals

2010-03-16 Thread Ralph Castain
Sure! So long as you add the include, you are okay as the ORTE layer is "below" the OMPI one. On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote: > Thanks Ralph, the last question... is orte_plm.signal_job exposed/available > to be called by a PML component? Yes, I have the orte/mca/plm/plm.h i

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Thanks Ralph, the last question... is orte_plm.signal_job exposed/available to be called by a PML component? Yes, I have the orte/mca/plm/plm.h include line. Leonardo On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote: > It's just the orte_process_name_t jobid field. So if you have an > orte_pr

Re: [OMPI devel] Signals

2010-03-16 Thread Ralph Castain
It's just the orte_process_name_t jobid field. So if you have an orte_process_name_t *pname, then it would just be orte_plm.signal_job(pname->jobid, sig) On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote: > Hum and to signal a job probably the function is > orte_plm.signal_job(jobid, sig

Re: [OMPI devel] Signals

2010-03-16 Thread Leonardo Fialho
Hmm, and to signal a job the function is probably orte_plm.signal_job(jobid, signal); right? Now my dummy question is how to obtain the jobid part from an orte_process_name_t variable? Is there any magical function in the names_fns.h? Thanks, Leonardo On Mar 16, 2010, at 10:12 PM, Ralph Castai

Re: [OMPI devel] Signals

2010-03-16 Thread Ralph Castain
Afraid not - you can signal a job, but not a specific process. We used to have such an API, but nobody ever used it. Easy to restore if someone has a need. On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote: > Hi, > > Is there any function in Open MPI's frameworks to send a signal to other ORTE

Re: [OMPI devel] Signals

2008-04-08 Thread Richard Graham
On 4/8/08 2:19 PM, "Ralph H Castain" wrote: > > > > On 4/8/08 12:10 PM, "Pak Lui" wrote: > >> Richard Graham wrote: >>> What happens if I deliver sigusr2 to mpirun ? What I observe (for both >>> ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does >>> get propagated

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
On 4/8/08 12:10 PM, "Pak Lui" wrote: > Richard Graham wrote: >> What happens if I deliver sigusr2 to mpirun ? What I observe (for both >> ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does >> get propagated to the mpi procs, which do invoke the signal handler I >> regi

Re: [OMPI devel] Signals

2008-04-08 Thread Pak Lui
Richard Graham wrote: What happens if I deliver SIGUSR2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. H

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
Hmmm...well, I'll take a look. I haven't seen that behavior, but I haven't checked it in some time. On 4/8/08 11:54 AM, "Richard Graham" wrote: > What happens if I deliver SIGUSR2 to mpirun? What I observe (for both > ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does

Re: [OMPI devel] Signals

2008-04-08 Thread Richard Graham
What happens if I deliver SIGUSR2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
I found what Pak said a little confusing as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted. So if the proc continues alive,
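
The distinction Ralph draws here (the waiter never receives the signal itself; it only sees that a child exited and then asks whether a signal caused the exit) can be illustrated in plain POSIX terms. This is a sketch only, not the ORTE wait_daemon code; child_death_signal is a hypothetical helper written for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that either exits normally (signo == 0) or terminates
 * itself with `signo`. The parent only waits for the exit and then
 * inspects the status.
 * Returns the terminating signal number, 0 for a normal exit,
 * or -1 if fork/waitpid failed. */
int child_death_signal(int signo)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        if (signo > 0) {
            signal(signo, SIG_DFL);  /* make sure the default action applies */
            raise(signo);
        }
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    /* The parent was never signalled; it learns about the signal only
     * from the child's exit status. */
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);
    }
    return 0;
}
```

A child that catches the signal and keeps running never produces an exit status at all, so a check of this kind has nothing to flag until the process actually dies.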

Re: [OMPI devel] Signals

2008-04-08 Thread Pak Lui
First, can your user executable install a signal handler that catches SIGUSR2 so that it does not exit? By default on Solaris it is going to exit, unless you catch the signal and have the process do nothing. from signal(3HEAD) Name Value Default Event SIGUSR1 16 Ex
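
For reference, catching SIGUSR2 so that the default action (exit) does not apply could look roughly like the following. It is a minimal POSIX sketch; the function and variable names are made up for this example and do not come from Open MPI.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_sigusr2 = 0;

/* Handler that only records the signal, so the process keeps running. */
static void on_sigusr2(int signo)
{
    (void)signo;
    got_sigusr2 = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigusr2_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Returns 1 if SIGUSR2 has been caught since the handler was installed. */
int sigusr2_seen(void)
{
    return got_sigusr2 != 0;
}
```

Calling install_sigusr2_handler() early in main() and then sending the process SIGUSR2 (for example with kill -USR2 <pid>) should leave it running, with sigusr2_seen() reporting 1 afterwards.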