Can you print out the value of orte_plm.signal_job? I bet it is pointing to address 0. So the question is: is orte_plm actually initialized in an MPI process? My guess would be no, but I am sure Ralph will be able to answer more definitively.
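
To illustrate the suspicion above, here is a minimal sketch of what happens when a module is a struct of function pointers and one slot was never filled in. The names demo_plm_module_t, demo_plm, demo_signal_job_impl and demo_safe_signal_job are hypothetical stand-ins, not the real ORTE definitions; the only point is that calling through a NULL slot jumps to address 0, which would match a "Failing at address: 0x0" report, and that a check before the call turns the crash into an error return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a module's function-pointer table.
 * This is NOT the real orte_plm module definition. */
typedef int (*demo_signal_job_fn_t)(unsigned int jobid, int signo);

typedef struct {
    demo_signal_job_fn_t signal_job;
} demo_plm_module_t;

/* Zero-initialized, as if no component had ever been selected. */
demo_plm_module_t demo_plm = { NULL };

/* A dummy implementation that a component could install in the slot. */
int demo_signal_job_impl(unsigned int jobid, int signo)
{
    printf("signalling job %u with signal %d\n", jobid, signo);
    return 0;
}

/* Call through the pointer only if it is set; return -1 otherwise. */
int demo_safe_signal_job(const demo_plm_module_t *plm,
                         unsigned int jobid, int signo)
{
    assert(plm != NULL);
    if (NULL == plm->signal_job) {
        fprintf(stderr, "signal_job is NULL; was the module ever initialized?\n");
        return -1;
    }
    return plm->signal_job(jobid, signo);
}
```

With the slot left at NULL, demo_safe_signal_job(&demo_plm, 1, 30) returns -1 instead of crashing; after assigning demo_plm.signal_job = demo_signal_job_impl the same call goes through and returns 0.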

--td

On 03/17/2010 09:52 AM, Leonardo Fialho wrote:
To clarify a little bit more: I'm calling orte_plm.signal_job from a PML component. I know that ORTE is below OMPI, but I think this function might not be available, or something like that. I can't figure out where this snprintf comes from either; in my code there is only

opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
            SIGUSR1, ORTE_NAME_PRINT(&el_proc));
orte_plm.signal_job(el_proc.jobid, SIGUSR1);

And the first output/printf works fine. Well... I ran the program under gdb, and I can see this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
0x0000000000000000 in ?? ()
(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x000000010065c319 in vprotocol_receiver_eventlog_connect (el_comm=0x10065d178) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2  0x000000010065ba9a in mca_vprotocol_receiver_send (buf=0x100500000, count=262144, datatype=0x100263d60, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
#3  0x0000000100077d44 in MPI_Send ()
#4  0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45

Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job call. After that there are only zeros and question marks... is the signal_job function actually available? I really don't understand what all those zeros mean.

Leonardo

On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:

Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I didn't see the message indicating it was sending the signal cmd out in your prior debug output, and there isn't a printf in that code loop other than the debug output. Can you attach to the process and get more info?

On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:

Ralph, don't swallow your message yet... The two jobs are not running under the same mpirun. There are two instances of mpirun: one runs with "-report-uri ../contact.txt" and the other receives its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with plm_base_verbose activated. When I deactivate plm_base_verbose the error is practically the same:

[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2 libSystem.B.dylib 0x00007fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3 libSystem.B.dylib 0x00007fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so 0x000000010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5 libmpi.0.dylib 0x0000000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6 ping 0x0000000100000a97 main + 431
[aopclf:54106] [ 5] 7 ping 0x00000001000008e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

I'm going to have to eat my last message. It slipped past me that your other job was started via comm_spawn. Since both "jobs" are running under the same mpirun, there shouldn't be a problem sending a signal between them.

I don't know why this would be crashing. Are you sure it is crashing in signal_job? Your trace indicates it is crashing in a print statement, yet there is no print statement in signal_job. Or did you run this with plm_base_verbose set so that the verbose prints are trying to run (could be we have a bug in one of them)?

On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:

Well, thank you anyway :)

On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:

Yeah, that probably won't work. The current code isn't intended to cross jobs like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure it won't support it.

I don't currently know any way to do what you are trying to do. We could extend the signal code to handle it, I would think...but I'm not sure how soon that might happen.


On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:

Yes... but something is going wrong... maybe the problem is that the jobid is different from the calling process's jobid, I don't know.

I'm trying to send a signal to another process running under another job. The other process then jumps into an accept_connect on the MPI comm. So I wrote code like this (I removed verification code and comments; this is just a summary of the happy path):

ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
ompi_dpm.route_to_port(hnp_uri, &el_proc);
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);

el_proc is defined as an orte_process_name_t, not a pointer to one. And signal.h has been included for SIGUSR1's sake. But when the code enters the signal_job function it crashes. I'm trying to debug it right now... the crash is the following:

[Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
[Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
[Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
[Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
[Fialho-2:51377] *** Process received signal ***
[Fialho-2:51377] Signal: Segmentation fault (11)
[Fialho-2:51377] Signal code: Address not mapped (1)
[Fialho-2:51377] Failing at address: 0x0
[Fialho-2:51377] [ 0] 2 libSystem.B.dylib 0x00007fff83a6eeaa _sigtramp + 26
[Fialho-2:51377] [ 1] 3 libSystem.B.dylib 0x00007fff83a210b7 snprintf + 496
[Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so 0x000000010065ba0a mca_vprotocol_receiver_send + 177
[Fialho-2:51377] [ 3] 5 libmpi.0.dylib 0x0000000100077d44 MPI_Send + 734
[Fialho-2:51377] [ 4] 6 ping 0x0000000100000a97 main + 431
[Fialho-2:51377] [ 5] 7 ping 0x00000001000008e0 start + 52
[Fialho-2:51377] [ 6] 8 ??? 0x0000000000000003 0x0 + 3
[Fialho-2:51377] *** End of error message ***

Apart from the signal_job call the code works; I have tested it by forcing an accept on the other process and skipping the signal_job. But I want to send the signal to wake up the other side and to be able to manage multiple connect/accept operations.
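
For the "wake up the other side" part, a minimal sketch of how the receiving process could react to SIGUSR1 might look like the following. It uses only POSIX sigaction; the names demo_wakeup_requested, demo_on_sigusr1, demo_install_wakeup_handler and demo_consume_wakeup are hypothetical, and how the real event logger handles the signal is an assumption, not something taken from its code. The handler only sets a flag, and the main loop would poll that flag and then enter its accept.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Flag set by the handler; sig_atomic_t is safe to write from a handler. */
volatile sig_atomic_t demo_wakeup_requested = 0;

/* Handler: do the minimum possible, just record that a wake-up arrived. */
void demo_on_sigusr1(int signo)
{
    (void)signo;
    demo_wakeup_requested = 1;
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on failure. */
int demo_install_wakeup_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 once per pending wake-up and clears the flag, else 0.
 * A main loop would call this and, on 1, go accept a connection. */
int demo_consume_wakeup(void)
{
    if (demo_wakeup_requested) {
        demo_wakeup_requested = 0;
        return 1;
    }
    return 0;
}
```

Because the flag is a single bit, several signals arriving before the loop polls collapse into one wake-up, so handling multiple connect/accept requests would need the loop to keep accepting until no peer is waiting.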

Thanks,
Leonardo

On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:

Sure! So long as you add the include, you are okay as the ORTE layer is "below" the OMPI one.

On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:

Thanks Ralph, one last question... is orte_plm.signal_job exposed/available to be called by a PML component? Yes, I have the orte/mca/plm/plm.h include line.

Leonardo

On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:

It's just the orte_process_name_t jobid field. So if you have an orte_process_name_t *pname, then it would just be

orte_plm.signal_job(pname->jobid, sig)


On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:

Hum.... and to signal a job the function is probably orte_plm.signal_job(jobid, signal); right?

Now my dumb question is: how do I obtain the jobid part from an orte_process_name_t variable? Is there any magic function in names_fns.h?

Thanks,
Leonardo

On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:

Afraid not - you can signal a job, but not a specific process. We used to have such an API, but nobody ever used it. Easy to restore if someone has a need.

On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:

Hi,

Is there any function in Open MPI's frameworks to send a signal to another ORTE proc?

For example, the ORTE process [[1234,1],1] wants to send a signal to process [[1234,1],4] located on another node. I'm looking for this kind of function, but I have only found functions to send a signal to all procs on a node.

Thanks,
Leonardo
_______________________________________________
devel mailing list
de...@open-mpi.org <mailto:de...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/devel

