Ralph, don't swallow your message yet... Both jobs are not running over the same mpirun. There are two instances of mpirun: one runs with "-report-uri ../contact.txt" and the other receives its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with plm_base_verbose activated. When I deactivate plm_base_verbose the error is practically the same:
[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2   libSystem.B.dylib          0x00007fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3   libSystem.B.dylib          0x00007fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so  0x000000010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5   libmpi.0.dylib             0x0000000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6   ping                       0x0000000100000a97 main + 431
[aopclf:54106] [ 5] 7   ping                       0x00000001000008e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

> I'm going to have to eat my last message. It slipped past me that your other
> job was started via comm_spawn. Since both "jobs" are running under the same
> mpirun, there shouldn't be a problem sending a signal between them.
>
> I don't know why this would be crashing. Are you sure it is crashing in
> signal_job? Your trace indicates it is crashing in a print statement, yet
> there is no print statement in signal_job. Or did you run this with
> plm_base_verbose set so that the verbose prints are trying to run (could be
> we have a bug in one of them)?
>
> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>
>> Well, thank you anyway :)
>>
>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>
>>> Yeah, that probably won't work. The current code isn't intended to cross
>>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm
>>> pretty sure it won't support it.
>>>
>>> I don't currently know any way to do what you are trying to do. We could
>>> extend the signal code to handle it, I would think... but I'm not sure how
>>> soon that might happen.
>>>
>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>
>>>> Yes... but something wrong is going on... maybe the problem is that the
>>>> jobid is different from the process' jobid, I don't know.
>>>>
>>>> I'm trying to send a signal to another process running under another job.
>>>> The other process jumps into an accept_connect on the MPI comm. So I wrote
>>>> code like this (I removed verification code and comments; this is just a
>>>> summary of a happy execution):
>>>>
>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>
>>>> el_proc is defined as an orte_process_name_t, not a pointer to one. And
>>>> signal.h has been included for SIGUSR1's sake. But when the code enters the
>>>> signal_job function it crashes. I'm trying to debug it just now... the
>>>> crash is the following:
>>>>
>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>>> [Fialho-2:51377] *** Process received signal ***
>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>> [Fialho-2:51377] Failing at address: 0x0
>>>> [Fialho-2:51377] [ 0] 2   libSystem.B.dylib          0x00007fff83a6eeaa _sigtramp + 26
>>>> [Fialho-2:51377] [ 1] 3   libSystem.B.dylib          0x00007fff83a210b7 snprintf + 496
>>>> [Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so  0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>> [Fialho-2:51377] [ 3] 5   libmpi.0.dylib             0x0000000100077d44 MPI_Send + 734
>>>> [Fialho-2:51377] [ 4] 6   ping                       0x0000000100000a97 main + 431
>>>> [Fialho-2:51377] [ 5] 7   ping                       0x00000001000008e0 start + 52
>>>> [Fialho-2:51377] [ 6] 8   ???                        0x0000000000000003 0x0 + 3
>>>> [Fialho-2:51377] *** End of error message ***
>>>>
>>>> Except for the signal_job call, the code works; I have tested it by forcing
>>>> an accept on the other process and avoiding the signal_job. But I want to
>>>> send the signal to wake up the other side and to be able to manage
>>>> multiple connect/accept.
>>>>
>>>> Thanks,
>>>> Leonardo
>>>>
>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>
>>>>> Sure! So long as you add the include, you are okay, as the ORTE layer is
>>>>> "below" the OMPI one.
>>>>>
>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>
>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job
>>>>>> exposed/available to be called by a PML component? Yes, I have the
>>>>>> orte/mca/plm/plm.h include line.
>>>>>>
>>>>>> Leonardo
>>>>>>
>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> It's just the orte_process_name_t jobid field. So if you have an
>>>>>>> orte_process_name_t *pname, then it would just be
>>>>>>>
>>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>>>
>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>
>>>>>>>> Hum... and to signal a job the function is probably
>>>>>>>> orte_plm.signal_job(jobid, signal); right?
>>>>>>>>
>>>>>>>> Now my dummy question is how to obtain the jobid part from an
>>>>>>>> orte_process_name_t variable? Is there any magical function in
>>>>>>>> names_fns.h?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Leonardo
>>>>>>>>
>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Afraid not - you can signal a job, but not a specific process. We
>>>>>>>>> used to have such an API, but nobody ever used it. Easy to restore if
>>>>>>>>> someone has a need.
>>>>>>>>>
>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is there any function in Open MPI's frameworks to send a signal to
>>>>>>>>>> another ORTE proc?
>>>>>>>>>>
>>>>>>>>>> For example, the ORTE process [[1234,1],1] wants to send a signal to
>>>>>>>>>> process [[1234,1],4] located on another node. I'm looking for this
>>>>>>>>>> kind of function, but I have only found functions that send a signal
>>>>>>>>>> to all procs on a node.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Leonardo
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel