Very good - that is pretty much all that the signal_job API does.

On Mar 17, 2010, at 4:11 PM, Leonardo Fialho wrote:

> Anyway, to signal another job I have sent an RML message with the 
> ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP.
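> 
> For reference, the core of it looks roughly like this (a sketch from 
> memory - the pack types, the payload layout, and the send_buffer 
> signature vary between versions, so check every name here against your 
> tree):
> 
>     /* build the command buffer for the remote job's HNP */
>     opal_buffer_t *cmd = OBJ_NEW(opal_buffer_t);
>     orte_daemon_cmd_flag_t command = ORTE_DAEMON_SIGNAL_LOCAL_PROCS;
>     int32_t sig = SIGUSR1;
> 
>     opal_dss.pack(cmd, &command, 1, ORTE_DAEMON_CMD);   /* which command */
>     opal_dss.pack(cmd, &el_proc.jobid, 1, ORTE_JOBID);  /* which job (layout may vary) */
>     opal_dss.pack(cmd, &sig, 1, OPAL_INT32);            /* which signal */
> 
>     /* hnp_name is the other job's HNP, recovered from the parsed hnp_uri */
>     orte_rml.send_buffer(&hnp_name, cmd, ORTE_RML_TAG_DAEMON, 0);
>     OBJ_RELEASE(cmd);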
> 
> Leonardo
> 
> On Mar 17, 2010, at 9:59 PM, Ralph Castain wrote:
> 
>> Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly 
>> realized the problem :-/
>> 
>> Terry is correct - we don't initialize the plm framework in application 
>> processes. However, there is a default proxy module for that framework so 
>> that applications can call comm_spawn. Unfortunately, I never filled in the 
>> rest of the module function pointers because (a) there was no known reason 
>> for apps to be using them (as Jeff points out), and (b) there is no MPI call 
>> that interfaces to them.
>> 
>> I can (and will) make it work over the next day or two - there is no reason 
>> why this can't be done. It just wasn't implemented due to lack of reason to 
>> do so.
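>> 
>> Until then, a guard like the following avoids the segfault at 0x0 (just a 
>> sketch - the error code and message are up to the caller):
>> 
>>     if (NULL == orte_plm.signal_job) {
>>         opal_output(0, "plm module has no signal_job in this process");
>>         return ORTE_ERR_NOT_IMPLEMENTED;
>>     }
>>     return orte_plm.signal_job(el_proc.jobid, SIGUSR1);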
>> 
>> Sorry for the confusion - old man brain fizzing out again.
>> 
>> On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:
>> 
>>> Yes, I know the difference :)
>>> 
>>> I'm trying to call orte_plm.signal_job from a PML component. I thought 
>>> the PLM stayed resident after launching, but it lives only in mpirun and 
>>> orted - you're right.
>>> 
>>> On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:
>>> 
>>>> On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
>>>>> 
>>>>> Wow... orte_plm.signal_job points to NULL. Is that expected from the 
>>>>> PML's point of view?
>>>> It might be because PLMs are really only used at launch time, not in MPI 
>>>> processes.  Note plm != pml.
>>>> 
>>>> --td
>>>>> 
>>>>> Leonardo
>>>>> 
>>>>> On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
>>>>> 
>>>>>> To clarify a little bit more: I'm calling orte_plm.signal_job from a 
>>>>>> PML component. I know that ORTE is below OMPI, but maybe this function 
>>>>>> is simply not available, or something like that. I can't figure out 
>>>>>> where this snprintf comes from either; in my code there is only
>>>>>> 
>>>>>>     /* log the target, then signal its job */
>>>>>>     opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
>>>>>>                 SIGUSR1, ORTE_NAME_PRINT(&el_proc));
>>>>>>     orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>>> 
>>>>>> And the first output/printf works fine. Well... I ran the program 
>>>>>> under gdb, and I see this:
>>>>>> 
>>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>>>>> Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
>>>>>> 0x0000000000000000 in ?? ()
>>>>>> (gdb) backtrace
>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>> #1  0x000000010065c319 in vprotocol_receiver_eventlog_connect 
>>>>>> (el_comm=0x10065d178) at 
>>>>>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
>>>>>> #2  0x000000010065ba9a in mca_vprotocol_receiver_send (buf=0x100500000, 
>>>>>> count=262144, datatype=0x100263d60, dst=1, tag=1, 
>>>>>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at 
>>>>>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
>>>>>> #3  0x0000000100077d44 in MPI_Send ()
>>>>>> #4  0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
>>>>>> 
>>>>>> Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job 
>>>>>> call. After that, only zeros and question marks... is the signal_job 
>>>>>> function actually available? I really don't understand what all those 
>>>>>> zeros mean.
>>>>>> 
>>>>>> Leonardo
>>>>>> 
>>>>>> On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
>>>>>> 
>>>>>>> Thanks for clarifying - guess I won't chew just yet. :-)
>>>>>>> 
>>>>>>> I still don't see in your trace where it is failing in signal_job. I 
>>>>>>> didn't see the message indicating it was sending the signal cmd out in 
>>>>>>> your prior debug output, and there isn't a printf in that code loop 
>>>>>>> other than the debug output. Can you attach to the process and get more 
>>>>>>> info?
>>>>>>> 
>>>>>>> On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
>>>>>>> 
>>>>>>>> Ralph, don't swallow your message yet... The two jobs are not 
>>>>>>>> running under the same mpirun. There are two instances of mpirun: 
>>>>>>>> one runs with "-report-uri ../contact.txt" and the other receives 
>>>>>>>> its contact info using "-ompi-server file:../contact.txt". And yes, 
>>>>>>>> both processes are running with plm_base_verbose activated. When I 
>>>>>>>> deactivate plm_base_verbose the error is practically the same:
>>>>>>>> 
>>>>>>>> [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger 
>>>>>>>> <[[47640,1],0]>
>>>>>>>> [aopclf:54106] *** Process received signal ***
>>>>>>>> [aopclf:54106] Signal: Segmentation fault (11)
>>>>>>>> [aopclf:54106] Signal code: Address not mapped (1)
>>>>>>>> [aopclf:54106] Failing at address: 0x0
>>>>>>>> [aopclf:54106] [ 0] 2   libSystem.B.dylib                   
>>>>>>>> 0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>> [aopclf:54106] [ 1] 3   libSystem.B.dylib                   
>>>>>>>> 0x00007fff83a210b7 snprintf + 496
>>>>>>>> [aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so           
>>>>>>>> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>> [aopclf:54106] [ 3] 5   libmpi.0.dylib                      
>>>>>>>> 0x0000000100077d44 MPI_Send + 734
>>>>>>>> [aopclf:54106] [ 4] 6   ping                                
>>>>>>>> 0x0000000100000a97 main + 431
>>>>>>>> [aopclf:54106] [ 5] 7   ping                                
>>>>>>>> 0x00000001000008e0 start + 52
>>>>>>>> [aopclf:54106] *** End of error message ***
>>>>>>>> 
>>>>>>>> Leonardo
>>>>>>>> 
>>>>>>>> On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>>> I'm going to have to eat my last message. It slipped past me that 
>>>>>>>>> your other job was started via comm_spawn. Since both "jobs" are 
>>>>>>>>> running under the same mpirun, there shouldn't be a problem sending a 
>>>>>>>>> signal between them.
>>>>>>>>> 
>>>>>>>>> I don't know why this would be crashing. Are you sure it is crashing 
>>>>>>>>> in signal_job? Your trace indicates it is crashing in a print 
>>>>>>>>> statement, yet there is no print statement in signal_job. Or did you 
>>>>>>>>> run this with plm_base_verbose set so that the verbose prints are 
>>>>>>>>> trying to run (could be we have a bug in one of them)?
>>>>>>>>> 
>>>>>>>>> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>>>>>>>>> 
>>>>>>>>>> Well, thank you anyway :)
>>>>>>>>>> 
>>>>>>>>>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yeah, that probably won't work. The current code isn't intended to 
>>>>>>>>>>> cross jobs like that - I'm sure nobody ever tested it for that 
>>>>>>>>>>> idea, and I'm pretty sure it won't support it.
>>>>>>>>>>> 
>>>>>>>>>>> I don't currently know any way to do what you are trying to do. We 
>>>>>>>>>>> could extend the signal code to handle it, I would think...but I'm 
>>>>>>>>>>> not sure how soon that might happen.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes... but something wrong is going on... maybe the problem is 
>>>>>>>>>>>> that the jobid is different from the process' jobid, I don't know.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm trying to send a signal to another process running under 
>>>>>>>>>>>> another job. The other process jumps into an accept/connect on 
>>>>>>>>>>>> the MPI comm. So I wrote code like this (I removed verification 
>>>>>>>>>>>> code and comments; this is just a summary of a successful 
>>>>>>>>>>>> execution):
>>>>>>>>>>>> 
>>>>>>>>>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);  /* split the port string */
>>>>>>>>>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);       /* recover the peer's ORTE name */
>>>>>>>>>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);               /* route via the peer's HNP */
>>>>>>>>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);             /* <-- crashes here */
>>>>>>>>>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>>>>>>>>> 
>>>>>>>>>>>> el_proc is defined as an orte_process_name_t, not a pointer to 
>>>>>>>>>>>> one. And signal.h has been included for SIGUSR1's sake. But when 
>>>>>>>>>>>> the code enters the signal_job function it crashes. I'm trying 
>>>>>>>>>>>> to debug it right now... the crash is the following:
>>>>>>>>>>>> 
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: found port 
>>>>>>>>>>>> <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: HNP URI 
>>>>>>>>>>>> <784793600.0;tcp://192.168.1.200:54071>, RML URI 
>>>>>>>>>>>> <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC 
>>>>>>>>>>>> Event Logger <[[11975,1],0]>
>>>>>>>>>>>> [Fialho-2:51377] *** Process received signal ***
>>>>>>>>>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>>>>>>>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>>>>>>>>>> [Fialho-2:51377] Failing at address: 0x0
>>>>>>>>>>>> [Fialho-2:51377] [ 0] 2   libSystem.B.dylib                   
>>>>>>>>>>>> 0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>>>>>> [Fialho-2:51377] [ 1] 3   libSystem.B.dylib                   
>>>>>>>>>>>> 0x00007fff83a210b7 snprintf + 496
>>>>>>>>>>>> [Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so           
>>>>>>>>>>>> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>>>>>> [Fialho-2:51377] [ 3] 5   libmpi.0.dylib                      
>>>>>>>>>>>> 0x0000000100077d44 MPI_Send + 734
>>>>>>>>>>>> [Fialho-2:51377] [ 4] 6   ping                                
>>>>>>>>>>>> 0x0000000100000a97 main + 431
>>>>>>>>>>>> [Fialho-2:51377] [ 5] 7   ping                                
>>>>>>>>>>>> 0x00000001000008e0 start + 52
>>>>>>>>>>>> [Fialho-2:51377] [ 6] 8   ???                                 
>>>>>>>>>>>> 0x0000000000000003 0x0 + 3
>>>>>>>>>>>> [Fialho-2:51377] *** End of error message ***
>>>>>>>>>>>> 
>>>>>>>>>>>> Except for the signal_job call the code works - I have tested it 
>>>>>>>>>>>> by forcing an accept on the other process and skipping the 
>>>>>>>>>>>> signal_job. But I want to send the signal to wake up the other 
>>>>>>>>>>>> side and to be able to manage multiple connect/accepts.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Leonardo
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sure! So long as you add the include, you are okay, as the ORTE 
>>>>>>>>>>>>> layer is "below" the OMPI one.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job 
>>>>>>>>>>>>>> exposed/available to be called from a PML component? Yes, I 
>>>>>>>>>>>>>> have the orte/mca/plm/plm.h include line.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It's just the orte_process_name_t jobid field. So if you have 
>>>>>>>>>>>>>>> an orte_process_name_t *pname, then it would just be
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> orte_plm.signal_job(pname->jobid, sig)
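>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Putting it together, something like this (a sketch - the 
>>>>>>>>>>>>>>> helper name and the wrapper itself are just for illustration, 
>>>>>>>>>>>>>>> not an existing API):
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>     #include <signal.h>
>>>>>>>>>>>>>>>     #include "orte/mca/plm/plm.h"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>     /* hypothetical helper: signal the job that owns pname */
>>>>>>>>>>>>>>>     static int signal_peers_job(orte_process_name_t *pname, int sig)
>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>         return orte_plm.signal_job(pname->jobid, sig);
>>>>>>>>>>>>>>>     }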
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hum... and to signal a job the function is presumably 
>>>>>>>>>>>>>>>> orte_plm.signal_job(jobid, signal), right?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Now my dummy question is how to obtain the jobid part from 
>>>>>>>>>>>>>>>> an orte_process_name_t variable. Is there any magical 
>>>>>>>>>>>>>>>> function in name_fns.h?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Afraid not - you can signal a job, but not a specific 
>>>>>>>>>>>>>>>>> process. We used to have such an API, but nobody ever used 
>>>>>>>>>>>>>>>>> it. Easy to restore if someone has a need.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Is there any function in Open MPI's frameworks to send a 
>>>>>>>>>>>>>>>>>> signal to other ORTE proc?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> For example, ORTE process [[1234,1],1] wants to send a 
>>>>>>>>>>>>>>>>>> signal to process [[1234,1],4] located on another node. 
>>>>>>>>>>>>>>>>>> I'm looking for this kind of function, but I have only 
>>>>>>>>>>>>>>>>>> found functions that send a signal to all procs on a node.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Leonardo