Very good - that is pretty much all that the signal_job API does.
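For the archives: the message described below (pack an ORTE_DAEMON_SIGNAL_LOCAL_PROCS command and send it over the RML to the target job's HNP) might look roughly like the sketch that follows. This is only an illustration, not code from this thread - the helper name send_signal_cmd, the header list, and the payload layout (command, then jobid, then signal) are assumptions about the usual daemon-command convention of this code base.

    #include <signal.h>
    #include "opal/dss/dss.h"
    #include "orte/mca/rml/rml.h"
    #include "orte/mca/rml/rml_types.h"      /* ORTE_RML_TAG_DAEMON */
    #include "orte/mca/odls/odls_types.h"    /* ORTE_DAEMON_SIGNAL_LOCAL_PROCS */

    /* Hypothetical helper: ask the HNP named by 'hnp' to deliver 'sig' to the
     * local procs of job 'job'.  The payload layout is an assumption. */
    static int send_signal_cmd(orte_process_name_t *hnp, orte_jobid_t job, int32_t sig)
    {
        opal_buffer_t buf;
        orte_daemon_cmd_flag_t cmd = ORTE_DAEMON_SIGNAL_LOCAL_PROCS;
        int rc;

        OBJ_CONSTRUCT(&buf, opal_buffer_t);
        if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, &cmd, 1, ORTE_DAEMON_CMD)) ||
            ORTE_SUCCESS != (rc = opal_dss.pack(&buf, &job, 1, ORTE_JOBID)) ||
            ORTE_SUCCESS != (rc = opal_dss.pack(&buf, &sig, 1, OPAL_INT32))) {
            OBJ_DESTRUCT(&buf);
            return rc;
        }
        /* blocking send to the daemon command tag on the HNP */
        rc = orte_rml.send_buffer(hnp, &buf, ORTE_RML_TAG_DAEMON, 0);
        OBJ_DESTRUCT(&buf);
        return (rc < 0) ? rc : ORTE_SUCCESS;
    }

The daemon command processor on the HNP side (under orte/orted) is the authoritative reference for what the receiver actually expects to unpack.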
On Mar 17, 2010, at 4:11 PM, Leonardo Fialho wrote:

> Anyway, to signal another job I have sent an RML message with the
> ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP.
>
> Leonardo
>
> On Mar 17, 2010, at 9:59 PM, Ralph Castain wrote:
>
>> Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly
>> realized the problem :-/
>>
>> Terry is correct - we don't initialize the plm framework in application
>> processes. However, there is a default proxy module for that framework so
>> that applications can call comm_spawn. Unfortunately, I never filled in the
>> rest of the module function pointers because (a) there was no known reason
>> for apps to be using them (as Jeff points out), and (b) there is no MPI call
>> that interfaces to them.
>>
>> I can (and will) make it work over the next day or two - there is no reason
>> why this can't be done. It just wasn't implemented due to lack of a reason
>> to do so.
>>
>> Sorry for the confusion - old man brain fizzing out again.
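A note on the consequence of the above: in an application process the PLM module table only has the entries needed for comm_spawn filled in, so a slot such as signal_job is left NULL and calling through it jumps straight to address 0x0 - which is what the backtraces further down show. Purely as an illustration (made-up function names, abbreviated field list, not the actual Open MPI source), such a partially-populated module might look like:

    /* Illustration only.  Any table slot left NULL, such as signal_job here,
     * sends its caller straight to address 0x0. */
    orte_plm_base_module_t app_proxy_module = {
        .init       = proxy_init,       /* hypothetical */
        .spawn      = proxy_spawn,      /* enough to support MPI_Comm_spawn */
        .signal_job = NULL,             /* never implemented for app processes */
        .finalize   = proxy_finalize    /* hypothetical */
    };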
>> On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:
>>
>>> Yes, I know the difference :)
>>>
>>> I'm trying to call orte_plm.signal_job from a PML component. I thought the
>>> PLM stays resident after launching, but it does so only for mpirun and
>>> orted - you're right.
>>>
>>> On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:
>>>
>>>> On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
>>>>>
>>>>> Wow... orte_plm.signal_job points to zero. Is that correct from the PML
>>>>> point of view?
>>>>
>>>> It might be, because plm's are really only used at launch time, not in
>>>> MPI processes. Note plm != pml.
>>>>
>>>> --td
>>>>>
>>>>> Leonardo
>>>>>
>>>>> On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
>>>>>
>>>>>> To clarify a little bit more: I'm calling orte_plm.signal_job from a PML
>>>>>> component. I know that ORTE is below OMPI, but I think this function
>>>>>> might not be available, or something like that. I can't figure out where
>>>>>> this snprintf is either; in my code there is only
>>>>>>
>>>>>> opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
>>>>>>             SIGUSR1, ORTE_NAME_PRINT(&el_proc));
>>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>>>
>>>>>> And the first output/printf works fine. Well... I used gdb to run the
>>>>>> program, and I see this:
>>>>>>
>>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>>>>> Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
>>>>>> 0x0000000000000000 in ?? ()
>>>>>> (gdb) backtrace
>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>> #1 0x000000010065c319 in vprotocol_receiver_eventlog_connect (el_comm=0x10065d178)
>>>>>>    at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
>>>>>> #2 0x000000010065ba9a in mca_vprotocol_receiver_send (buf=0x100500000, count=262144,
>>>>>>    datatype=0x100263d60, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0)
>>>>>>    at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
>>>>>> #3 0x0000000100077d44 in MPI_Send ()
>>>>>> #4 0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
>>>>>>
>>>>>> Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job
>>>>>> call. After that, zeros and question marks... is the signal_job function
>>>>>> actually available? I really don't understand what all those zeros mean.
>>>>>>
>>>>>> Leonardo
>>>>>>
>>>>>> On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Thanks for clarifying - guess I won't chew just yet. :-)
>>>>>>>
>>>>>>> I still don't see in your trace where it is failing in signal_job. I
>>>>>>> didn't see the message indicating it was sending the signal cmd out in
>>>>>>> your prior debug output, and there isn't a printf in that code loop
>>>>>>> other than the debug output. Can you attach to the process and get more
>>>>>>> info?
>>>>>>>
>>>>>>> On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
>>>>>>>
>>>>>>>> Ralph, don't swallow your message yet... The two jobs are not running
>>>>>>>> under the same mpirun. There are two instances of mpirun: one runs
>>>>>>>> with "-report-uri ../contact.txt" and the other receives its contact
>>>>>>>> info using "-ompi-server file:../contact.txt". And yes, both processes
>>>>>>>> are running with plm_base_verbose activated. When I deactivate
>>>>>>>> plm_base_verbose the error is practically the same:
>>>>>>>>
>>>>>>>> [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
>>>>>>>> [aopclf:54106] *** Process received signal ***
>>>>>>>> [aopclf:54106] Signal: Segmentation fault (11)
>>>>>>>> [aopclf:54106] Signal code: Address not mapped (1)
>>>>>>>> [aopclf:54106] Failing at address: 0x0
>>>>>>>> [aopclf:54106] [ 0] 2 libSystem.B.dylib         0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>> [aopclf:54106] [ 1] 3 libSystem.B.dylib         0x00007fff83a210b7 snprintf + 496
>>>>>>>> [aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>> [aopclf:54106] [ 3] 5 libmpi.0.dylib            0x0000000100077d44 MPI_Send + 734
>>>>>>>> [aopclf:54106] [ 4] 6 ping                      0x0000000100000a97 main + 431
>>>>>>>> [aopclf:54106] [ 5] 7 ping                      0x00000001000008e0 start + 52
>>>>>>>> [aopclf:54106] *** End of error message ***
>>>>>>>>
>>>>>>>> Leonardo
>>>>>>>>
>>>>>>>> On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> I'm going to have to eat my last message. It slipped past me that
>>>>>>>>> your other job was started via comm_spawn. Since both "jobs" are
>>>>>>>>> running under the same mpirun, there shouldn't be a problem sending a
>>>>>>>>> signal between them.
>>>>>>>>>
>>>>>>>>> I don't know why this would be crashing. Are you sure it is crashing
>>>>>>>>> in signal_job? Your trace indicates it is crashing in a print
>>>>>>>>> statement, yet there is no print statement in signal_job. Or did you
>>>>>>>>> run this with plm_base_verbose set so that the verbose prints are
>>>>>>>>> trying to run (could be we have a bug in one of them)?
>>>>>>>>>
>>>>>>>>> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>>>>>>>>>
>>>>>>>>>> Well, thank you anyway :)
>>>>>>>>>>
>>>>>>>>>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Yeah, that probably won't work. The current code isn't intended to
>>>>>>>>>>> cross jobs like that - I'm sure nobody ever tested it for that
>>>>>>>>>>> idea, and I'm pretty sure it won't support it.
>>>>>>>>>>>
>>>>>>>>>>> I don't currently know any way to do what you are trying to do. We
>>>>>>>>>>> could extend the signal code to handle it, I would think... but I'm
>>>>>>>>>>> not sure how soon that might happen.
>>>>>>>>>>>
>>>>>>>>>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes... but something wrong is going on... maybe the problem is
>>>>>>>>>>>> that the jobid is different from the process' jobid, I don't know.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to send a signal to another process running under a
>>>>>>>>>>>> different job. The other process jumps into an accept/connect on
>>>>>>>>>>>> the MPI comm. So I wrote code like this (I removed verification
>>>>>>>>>>>> code and comments; this is just a summary of a happy execution):
>>>>>>>>>>>>
>>>>>>>>>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>>>>>>>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>>>>>>>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>>>>>>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>>>>>>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>>>>>>>>>
>>>>>>>>>>>> el_proc is defined as an orte_process_name_t, not a pointer to
>>>>>>>>>>>> one. And signal.h has been included for SIGUSR1's sake. But when
>>>>>>>>>>>> the code enters the signal_job function it crashes. I'm trying to
>>>>>>>>>>>> debug it just now... the crash is the following:
>>>>>>>>>>>>
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>>>>>>>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>>>>>>>>>>> [Fialho-2:51377] *** Process received signal ***
>>>>>>>>>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>>>>>>>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>>>>>>>>>> [Fialho-2:51377] Failing at address: 0x0
>>>>>>>>>>>> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib         0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>>>>>> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib         0x00007fff83a210b7 snprintf + 496
>>>>>>>>>>>> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>>>>>> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib            0x0000000100077d44 MPI_Send + 734
>>>>>>>>>>>> [Fialho-2:51377] [ 4] 6 ping                      0x0000000100000a97 main + 431
>>>>>>>>>>>> [Fialho-2:51377] [ 5] 7 ping                      0x00000001000008e0 start + 52
>>>>>>>>>>>> [Fialho-2:51377] [ 6] 8 ???                       0x0000000000000003 0x0 + 3
>>>>>>>>>>>> [Fialho-2:51377] *** End of error message ***
>>>>>>>>>>>>
>>>>>>>>>>>> Except for the signal_job call the code works; I have tested it by
>>>>>>>>>>>> forcing an accept on the other process and skipping the
>>>>>>>>>>>> signal_job. But I want to send the signal to wake up the other
>>>>>>>>>>>> side and to be able to manage multiple connect/accepts.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sure! So long as you add the include, you are okay, as the ORTE
>>>>>>>>>>>>> layer is "below" the OMPI one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job
>>>>>>>>>>>>>> exposed/available to be called by a PML component? Yes, I have
>>>>>>>>>>>>>> the orte/mca/plm/plm.h include line.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Leonardo
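On the question just above: the orte_plm symbol is visible from a PML component once orte/mca/plm/plm.h is included, but - per the diagnosis at the top of this thread - the application-side module never fills in signal_job, so the pointer itself has to be checked before use. A minimal sketch of such a guard; the wrapper name signal_event_logger is made up and the fallback is just a placeholder:

    #include <signal.h>
    #include "opal/util/output.h"
    #include "orte/mca/plm/plm.h"
    #include "orte/util/name_fns.h"

    /* Guard against an unimplemented module entry before calling through it. */
    static void signal_event_logger(orte_process_name_t *el_proc)
    {
        if (NULL == orte_plm.signal_job) {
            opal_output(0, "receiver: orte_plm.signal_job not available in this "
                        "process; cannot signal %s", ORTE_NAME_PRINT(el_proc));
            return;  /* fall back: skip the wake-up signal or report upward */
        }
        orte_plm.signal_job(el_proc->jobid, SIGUSR1);
    }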
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's just the orte_process_name_t jobid field. So if you have
>>>>>>>>>>>>>>> an orte_process_name_t *pname, then it would just be
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hmm... and to signal a job, presumably the function is
>>>>>>>>>>>>>>>> orte_plm.signal_job(jobid, signal), right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now my dumb question is how to obtain the jobid part from an
>>>>>>>>>>>>>>>> orte_process_name_t variable. Is there any magical function in
>>>>>>>>>>>>>>>> name_fns.h?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Afraid not - you can signal a job, but not a specific
>>>>>>>>>>>>>>>>> process. We used to have such an API, but nobody ever used
>>>>>>>>>>>>>>>>> it. Easy to restore if someone has a need.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there any function in Open MPI's frameworks to send a
>>>>>>>>>>>>>>>>>> signal to another ORTE proc?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For example, the ORTE process [[1234,1],1] wants to send a
>>>>>>>>>>>>>>>>>> signal to process [[1234,1],4] located on another node. I'm
>>>>>>>>>>>>>>>>>> looking for this kind of function, but I have only found
>>>>>>>>>>>>>>>>>> functions to send a signal to all procs on a node.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Leonardo
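Pulling the answers in this thread together: an ORTE name such as [[1234,1],1] is an orte_process_name_t carrying a jobid plus a vpid, and signal_job takes only the jobid, so the signal reaches every process in that job rather than a single rank. A small wrapper sketch - the name signal_peer_job is made up, and the NULL check repeats the caveat noted earlier:

    #include <signal.h>
    #include "opal/util/output.h"
    #include "orte/constants.h"
    #include "orte/mca/plm/plm.h"
    #include "orte/util/name_fns.h"

    /* Signal the whole job that 'pname' belongs to; there is no per-process
     * signalling entry point at this time, only per-job. */
    static int signal_peer_job(orte_process_name_t *pname, int32_t sig)
    {
        if (NULL == orte_plm.signal_job) {
            return ORTE_ERR_NOT_IMPLEMENTED;  /* not wired up in app processes */
        }
        opal_output(0, "sending signal %d to the job of %s",
                    (int)sig, ORTE_NAME_PRINT(pname));
        return orte_plm.signal_job(pname->jobid, sig);
    }

Per-process delivery would require carrying the vpid in the request and matching it on the daemon side - the extension mentioned above as easy to restore if someone needs it.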