Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I didn't see 
the message indicating it was sending the signal cmd out in your prior debug 
output, and there isn't a printf in that code loop other than the debug output. 
Can you attach to the process and get more info?
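
If gdb is handy, attach before the MPI_Send and see where it actually dies - 
something like this (PID taken from your output):

  gdb -p 54106
  (gdb) break mca_vprotocol_receiver_send
  (gdb) continue

and then "bt" once it stops or the segfault hits.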

On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:

> Ralph, don't swallow your message yet... The two jobs are not running under the 
> same mpirun. There are two instances of mpirun: one runs with 
> "-report-uri ../contact.txt" and the other receives its contact info using 
> "-ompi-server file:../contact.txt". And yes, both processes are running with 
> plm_base_verbose activated. When I deactivate plm_base_verbose the error 
> is practically the same:
> 
> [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger 
> <[[47640,1],0]>
> [aopclf:54106] *** Process received signal ***
> [aopclf:54106] Signal: Segmentation fault (11)
> [aopclf:54106] Signal code: Address not mapped (1)
> [aopclf:54106] Failing at address: 0x0
> [aopclf:54106] [ 0] 2   libSystem.B.dylib                   
> 0x00007fff83a6eeaa _sigtramp + 26
> [aopclf:54106] [ 1] 3   libSystem.B.dylib                   
> 0x00007fff83a210b7 snprintf + 496
> [aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so           
> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
> [aopclf:54106] [ 3] 5   libmpi.0.dylib                      
> 0x0000000100077d44 MPI_Send + 734
> [aopclf:54106] [ 4] 6   ping                                
> 0x0000000100000a97 main + 431
> [aopclf:54106] [ 5] 7   ping                                
> 0x00000001000008e0 start + 52
> [aopclf:54106] *** End of error message ***
> 
> Leonardo
> 
> On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
> 
>> I'm going to have to eat my last message. It slipped past me that your other 
>> job was started via comm_spawn. Since both "jobs" are running under the same 
>> mpirun, there shouldn't be a problem sending a signal between them.
>> 
>> I don't know why this would be crashing. Are you sure it is crashing in 
>> signal_job? Your trace indicates it is crashing in a print statement, yet 
>> there is no print statement in signal_job. Or did you run this with 
>> plm_base_verbose set so that the verbose prints are trying to run (it could 
>> be that we have a bug in one of them)?
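>> 
>> For instance, if one of those prints handed a NULL string to a %s, snprintf 
>> could die at address 0x0 exactly the way your trace shows. A hypothetical 
>> illustration, not the actual code:
>> 
>> char buf[256];
>> char *name = NULL;  /* e.g. a process-name string that never got set */
>> /* can segfault inside snprintf while it walks the NULL string */
>> snprintf(buf, sizeof(buf), "sending signal to %s", name);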
>> 
>> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>> 
>>> Well, thank you anyway :)
>>> 
>>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>> 
>>>> Yeah, that probably won't work. The current code isn't intended to cross 
>>>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm 
>>>> pretty sure it won't support it.
>>>> 
>>>> I don't currently know any way to do what you are trying to do. We could 
>>>> extend the signal code to handle it, I would think...but I'm not sure how 
>>>> soon that might happen.
>>>> 
>>>> 
>>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>> 
>>>>> Yes... but something is going wrong... maybe the problem is that the 
>>>>> target's jobid is different from my own process's jobid, I don't know.
>>>>> 
>>>>> I'm trying to send a signal to another process running under a different 
>>>>> job. The other process then jumps into a connect/accept on the MPI comm. 
>>>>> So I wrote code like this (I removed verification code and comments; this 
>>>>> is just a summary of the happy path):
>>>>> 
>>>>> /* split the port string into its HNP URI, RML URI, and tag */
>>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>>> /* extract the peer's process name from its RML URI */
>>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>>> /* set up a route to the peer through its HNP */
>>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>>> /* wake the peer's job up so it enters the accept */
>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>> /* finally, connect to the peer */
>>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>> 
>>>>> el_proc is declared as an orte_process_name_t, not a pointer to one. And 
>>>>> signal.h has been included for SIGUSR1's sake. But when the code enters 
>>>>> the signal_job function it crashes. I'm trying to debug it right now... 
>>>>> the crash is the following:
>>>>> 
>>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>>> [Fialho-2.local:51377] receiver: found port 
>>>>> <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>>> [Fialho-2.local:51377] receiver: HNP URI 
>>>>> <784793600.0;tcp://192.168.1.200:54071>, RML URI 
>>>>> <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event 
>>>>> Logger <[[11975,1],0]>
>>>>> [Fialho-2:51377] *** Process received signal ***
>>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>>> [Fialho-2:51377] Failing at address: 0x0
>>>>> [Fialho-2:51377] [ 0] 2   libSystem.B.dylib                   
>>>>> 0x00007fff83a6eeaa _sigtramp + 26
>>>>> [Fialho-2:51377] [ 1] 3   libSystem.B.dylib                   
>>>>> 0x00007fff83a210b7 snprintf + 496
>>>>> [Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so           
>>>>> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>> [Fialho-2:51377] [ 3] 5   libmpi.0.dylib                      
>>>>> 0x0000000100077d44 MPI_Send + 734
>>>>> [Fialho-2:51377] [ 4] 6   ping                                
>>>>> 0x0000000100000a97 main + 431
>>>>> [Fialho-2:51377] [ 5] 7   ping                                
>>>>> 0x00000001000008e0 start + 52
>>>>> [Fialho-2:51377] [ 6] 8   ???                                 
>>>>> 0x0000000000000003 0x0 + 3
>>>>> [Fialho-2:51377] *** End of error message ***
>>>>> 
>>>>> Except for the signal_job call, the code works; I have tested it by 
>>>>> forcing an accept on the other process and skipping the signal_job. But I 
>>>>> want to send the signal to wake up the other side and to be able to manage 
>>>>> multiple connect/accepts, along the lines of the sketch below.
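>>>>> 
>>>>> Just a hedged sketch of the event-logger side (serve_connects and the 
>>>>> polling loop are hypothetical names/structure; the real code will live in 
>>>>> the RADIC component):
>>>>> 
>>>>> #include <signal.h>
>>>>> #include <unistd.h>
>>>>> #include <mpi.h>
>>>>> 
>>>>> static volatile sig_atomic_t wake_up = 0;
>>>>> static void usr1_handler(int sig) { (void)sig; wake_up = 1; }
>>>>> 
>>>>> /* Sleep until SIGUSR1 arrives, then accept one connection on the
>>>>>  * already-opened port; repeat for each new client. */
>>>>> static void serve_connects(char *port)
>>>>> {
>>>>>     MPI_Comm newcomm;
>>>>>     signal(SIGUSR1, usr1_handler);
>>>>>     for (;;) {
>>>>>         while (!wake_up)
>>>>>             sleep(1);   /* crude poll; sigsuspend would avoid the race */
>>>>>         wake_up = 0;
>>>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &newcomm);
>>>>>         /* ... service newcomm, then MPI_Comm_disconnect(&newcomm) ... */
>>>>>     }
>>>>> }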
>>>>> 
>>>>> Thanks,
>>>>> Leonardo
>>>>> 
>>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>> 
>>>>>> Sure! So long as you add the include, you are okay as the ORTE layer is 
>>>>>> "below" the OMPI one.
>>>>>> 
>>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>> 
>>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job 
>>>>>>> exposed/available to be called from a PML component? Yes, I have the 
>>>>>>> orte/mca/plm/plm.h include line.
>>>>>>> 
>>>>>>> Leonardo
>>>>>>> 
>>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>> It's just the orte_process_name_t jobid field. So if you have an 
>>>>>>>> orte_process_name_t *pname, then it would just be
>>>>>>>> 
>>>>>>>> orte_plm.signal_job(pname->jobid, sig)
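>>>>>>>> 
>>>>>>>> i.e., something like this (a hedged sketch - wake_peer is just an 
>>>>>>>> illustrative name, and SIGUSR1 an example signal):
>>>>>>>> 
>>>>>>>> #include <signal.h>
>>>>>>>> #include "orte/mca/plm/plm.h"
>>>>>>>> 
>>>>>>>> static void wake_peer(orte_process_name_t *pname)
>>>>>>>> {
>>>>>>>>     orte_plm.signal_job(pname->jobid, SIGUSR1);
>>>>>>>> }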
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>> 
>>>>>>>>> Hum... and to signal a job the function is probably 
>>>>>>>>> orte_plm.signal_job(jobid, signal); right?
>>>>>>>>> 
>>>>>>>>> Now my dumb question is: how do I obtain the jobid part from an 
>>>>>>>>> orte_process_name_t variable? Is there some magical function in 
>>>>>>>>> name_fns.h?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Leonardo
>>>>>>>>> 
>>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>> 
>>>>>>>>>> Afraid not - you can signal a job, but not a specific process. We 
>>>>>>>>>> used to have such an API, but nobody ever used it. Easy to restore 
>>>>>>>>>> if someone has a need.
>>>>>>>>>> 
>>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> Is there any function in Open MPI's frameworks to send a signal to 
>>>>>>>>>>> another ORTE proc?
>>>>>>>>>>> 
>>>>>>>>>>> For example, the ORTE process [[1234,1],1] wants to send a signal 
>>>>>>>>>>> to process [[1234,1],4] located on another node. I'm looking for 
>>>>>>>>>>> this kind of function, but I have only found functions that send a 
>>>>>>>>>>> signal to all procs on a node.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Leonardo