Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly realized the problem :-/

Terry is correct - we don't initialize the plm framework in application processes. However, there is a default proxy module for that framework so that applications can call comm_spawn. Unfortunately, I never filled in the rest of the module function pointers because (a) there was no known reason for apps to be using them (as Jeff points out), and (b) there is no MPI call that interfaces to them.

I can (and will) make it work over the next day or two - there is no reason why this can't be done; it just wasn't implemented because nothing needed it. Sorry for the confusion - old man brain fizzing out again.
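In the meantime, a caller can protect itself with something like the sketch below. This is untested and purely illustrative - the helper name is made up, and it just reuses the el_proc name from Leonardo's snippet further down - but it shows the kind of NULL check that would avoid jumping through the unfilled proxy pointer:

    #include <signal.h>

    #include "opal/util/output.h"
    #include "orte/constants.h"
    #include "orte/util/name_fns.h"
    #include "orte/mca/plm/plm.h"

    /* The proxy plm module loaded in application processes currently leaves
     * signal_job unset, so calling it blindly jumps through a NULL pointer.
     * Check the pointer before using it. */
    static int signal_event_logger(orte_process_name_t *el_proc)
    {
        if (NULL == orte_plm.signal_job) {
            opal_output(0, "plm proxy has no signal_job - skipping signal");
            return ORTE_ERROR;
        }
        opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
                    SIGUSR1, ORTE_NAME_PRINT(el_proc));
        return orte_plm.signal_job(el_proc->jobid, SIGUSR1);
    }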
On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:

> Yes, I know the difference :)
>
> I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM stayed resident after launch, but it doesn't - it's only there for mpirun and orted, you're right.
>
> On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:
>
>> On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
>>>
>>> Wow... orte_plm.signal_job points to zero. Is that correct from the PML point of view?
>>
>> It might be, because PLMs are really only used at launch time, not in MPI processes. Note plm != pml.
>>
>> --td
>>>
>>> Leonardo
>>>
>>> On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
>>>
>>>> To clarify a little bit more: I'm calling orte_plm.signal_job from a PML component. I know that ORTE is below OMPI, but maybe this function isn't available, or something like that. I can't figure out where that snprintf is either; in my code there is only
>>>>
>>>> opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
>>>>             SIGUSR1, ORTE_NAME_PRINT(&el_proc));
>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>
>>>> and the first output/printf works fine. Well... I ran the program under gdb and I can see this:
>>>>
>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>>> Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
>>>> 0x0000000000000000 in ?? ()
>>>> (gdb) backtrace
>>>> #0  0x0000000000000000 in ?? ()
>>>> #1  0x000000010065c319 in vprotocol_receiver_eventlog_connect (el_comm=0x10065d178) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
>>>> #2  0x000000010065ba9a in mca_vprotocol_receiver_send (buf=0x100500000, count=262144, datatype=0x100263d60, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
>>>> #3  0x0000000100077d44 in MPI_Send ()
>>>> #4  0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
>>>>
>>>> Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job call. After that, zeros and question marks... is the signal_job function actually available? I really don't understand what all those zeros mean.
>>>>
>>>> Leonardo
>>>>
>>>> On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
>>>>
>>>>> Thanks for clarifying - guess I won't chew just yet. :-)
>>>>>
>>>>> I still don't see in your trace where it is failing in signal_job. I didn't see the message indicating it was sending the signal cmd out in your prior debug output, and there isn't a printf in that code loop other than the debug output. Can you attach to the process and get more info?
>>>>>
>>>>> On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
>>>>>
>>>>>> Ralph, don't swallow your message yet... The two jobs are not running under the same mpirun. There are two instances of mpirun: one runs with "-report-uri ../contact.txt" and the other picks up its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with plm_base_verbose activated. When I deactivate plm_base_verbose the error is practically the same:
>>>>>>
>>>>>> [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
>>>>>> [aopclf:54106] *** Process received signal ***
>>>>>> [aopclf:54106] Signal: Segmentation fault (11)
>>>>>> [aopclf:54106] Signal code: Address not mapped (1)
>>>>>> [aopclf:54106] Failing at address: 0x0
>>>>>> [aopclf:54106] [ 0] 2 libSystem.B.dylib          0x00007fff83a6eeaa _sigtramp + 26
>>>>>> [aopclf:54106] [ 1] 3 libSystem.B.dylib          0x00007fff83a210b7 snprintf + 496
>>>>>> [aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so  0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>> [aopclf:54106] [ 3] 5 libmpi.0.dylib             0x0000000100077d44 MPI_Send + 734
>>>>>> [aopclf:54106] [ 4] 6 ping                       0x0000000100000a97 main + 431
>>>>>> [aopclf:54106] [ 5] 7 ping                       0x00000001000008e0 start + 52
>>>>>> [aopclf:54106] *** End of error message ***
>>>>>>
>>>>>> Leonardo
>>>>>>
>>>>>> On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> I'm going to have to eat my last message. It slipped past me that your other job was started via comm_spawn. Since both "jobs" are running under the same mpirun, there shouldn't be a problem sending a signal between them.
>>>>>>>
>>>>>>> I don't know why this would be crashing. Are you sure it is crashing in signal_job? Your trace indicates it is crashing in a print statement, yet there is no print statement in signal_job. Or did you run this with plm_base_verbose set so that the verbose prints are trying to run (could be we have a bug in one of them)?
>>>>>>>
>>>>>>> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>>>>>>>
>>>>>>>> Well, thank you anyway :)
>>>>>>>>
>>>>>>>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Yeah, that probably won't work. The current code isn't intended to cross jobs like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure it won't support it.
>>>>>>>>>
>>>>>>>>> I don't currently know any way to do what you are trying to do. We could extend the signal code to handle it, I would think... but I'm not sure how soon that might happen.
>>>>>>>>>
>>>>>>>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>>>>>>>
>>>>>>>>>> Yes... but something wrong is going on... maybe the problem is that the jobid is different from the process's jobid, I don't know.
>>>>>>>>>>
>>>>>>>>>> I'm trying to send a signal to another process running under a different job. The other process then jumps into an accept_connect on the MPI comm.
>>>>>>>>>> So I wrote code like this (I removed the error checking and comments; this is just a summary of the happy path):
>>>>>>>>>>
>>>>>>>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>>>>>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>>>>>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>>>>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>>>>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>>>>>>>
>>>>>>>>>> el_proc is defined as an orte_process_name_t, not a pointer to one, and signal.h has been included for SIGUSR1's sake. But when the code enters the signal_job function it crashes. I'm trying to debug it right now... the crash is the following:
>>>>>>>>>>
>>>>>>>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>>>>>>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>>>>>>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>>>>>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>>>>>>>>> [Fialho-2:51377] *** Process received signal ***
>>>>>>>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>>>>>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>>>>>>>> [Fialho-2:51377] Failing at address: 0x0
>>>>>>>>>> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib          0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>>>> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib          0x00007fff83a210b7 snprintf + 496
>>>>>>>>>> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so  0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>>>> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib             0x0000000100077d44 MPI_Send + 734
>>>>>>>>>> [Fialho-2:51377] [ 4] 6 ping                       0x0000000100000a97 main + 431
>>>>>>>>>> [Fialho-2:51377] [ 5] 7 ping                       0x00000001000008e0 start + 52
>>>>>>>>>> [Fialho-2:51377] [ 6] 8 ???                        0x0000000000000003 0x0 + 3
>>>>>>>>>> [Fialho-2:51377] *** End of error message ***
>>>>>>>>>>
>>>>>>>>>> Except for the signal_job call the code works; I have tested it by forcing an accept on the other process and skipping the signal_job. But I want to send the signal to wake up the other side and to be able to manage multiple connect/accepts.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Leonardo
>>>>>>>>>>
>>>>>>>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Sure! So long as you add the include, you are okay, as the ORTE layer is "below" the OMPI one.
>>>>>>>>>>>
>>>>>>>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job exposed/available to be called from a PML component? Yes, I have the orte/mca/plm/plm.h include line.
>>>>>>>>>>>>
>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> It's just the orte_process_name_t jobid field. So if you have an orte_process_name_t *pname, then it would just be
>>>>>>>>>>>>>
>>>>>>>>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hum.... and to signal a job, the function is probably orte_plm.signal_job(jobid, signal); right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now my dumb question is how to obtain the jobid part from an orte_process_name_t variable. Is there some magical function in name_fns.h?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Afraid not - you can signal a job, but not a specific process. We used to have such an API, but nobody ever used it. Easy to restore if someone has a need.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there any function in Open MPI's frameworks to send a signal to another ORTE proc?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, the ORTE process [[1234,1],1] wants to send a signal to process [[1234,1],4] located on another node. I'm looking for that kind of function, but I have only found functions that send a signal to all procs on a node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Leonardo