Can you print out what the orte_plm.signal_job value is? I bet it is
pointing to address 0. So the question is: is orte_plm actually
initialized in an MPI process? My guess would be no, but I am sure
Ralph will be able to answer more definitively.
--td
On 03/17/2010 09:52 AM, Leonardo Fialho wrote:
To clarify a little bit more: I'm calling orte_plm.signal_job from a
PML component. I know that ORTE is below OMPI, but I think this
function might not be available, or something like that. I can't
figure out where this snprintf comes from either; in my code there is only:
opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
            SIGUSR1, ORTE_NAME_PRINT(&el_proc));
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
And the first output/printf works fine. Well... I ran the program
under gdb, and I can see this:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
0x0000000000000000 in ?? ()
(gdb) backtrace
#0 0x0000000000000000 in ?? ()
#1 0x000000010065c319 in vprotocol_receiver_eventlog_connect
(el_comm=0x10065d178) at
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2 0x000000010065ba9a in mca_vprotocol_receiver_send
(buf=0x100500000, count=262144, datatype=0x100263d60, dst=1, tag=1,
sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
#3 0x0000000100077d44 in MPI_Send ()
#4 0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job
call. After that, zeros and question marks... is the signal_job
function actually available? I really don't understand what all
those zeros mean.
Leonardo
On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
Thanks for clarifying - guess I won't chew just yet. :-)
I still don't see in your trace where it is failing in signal_job. I
didn't see the message indicating it was sending the signal cmd out
in your prior debug output, and there isn't a printf in that code
loop other than the debug output. Can you attach to the process and
get more info?
On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
Ralph, don't swallow your message yet... The two jobs are not running
under the same mpirun. There are two instances of mpirun: one runs
with "-report-uri ../contact.txt" and the other receives its contact
info via "-ompi-server file:../contact.txt". And yes, both processes
are running with plm_base_verbose activated. When I deactivate
plm_base_verbose the error is practically the same:
[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger
<[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2 libSystem.B.dylib
0x00007fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3 libSystem.B.dylib
0x00007fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so
0x000000010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5 libmpi.0.dylib
0x0000000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6 ping
0x0000000100000a97 main + 431
[aopclf:54106] [ 5] 7 ping
0x00000001000008e0 start + 52
[aopclf:54106] *** End of error message ***
Leonardo
On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
I'm going to have to eat my last message. It slipped past me that
your other job was started via comm_spawn. Since both "jobs" are
running under the same mpirun, there shouldn't be a problem sending
a signal between them.
I don't know why this would be crashing. Are you sure it is
crashing in signal_job? Your trace indicates it is crashing in a
print statement, yet there is no print statement in signal_job. Or
did you run this with plm_base_verbose set so that the verbose
prints are trying to run (could be we have a bug in one of them)?
On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
Well, thank you anyway :)
On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
Yeah, that probably won't work. The current code isn't intended
to cross jobs like that - I'm sure nobody ever tested it for that
idea, and I'm pretty sure it won't support it.
I don't currently know any way to do what you are trying to do.
We could extend the signal code to handle it, I would think...but
I'm not sure how soon that might happen.
On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
Yes... but something is going wrong... maybe the problem is that
the jobid is different from the process's jobid, I don't know.
I'm trying to send a signal to another process running under a
different job. The other process jumps into an accept/connect on
the MPI comm. So I wrote code like this (I removed the verification
code and comments; this is just a summary of the happy path):
ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
ompi_dpm.route_to_port(hnp_uri, &el_proc);
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
el_proc is defined as an orte_process_name_t, not a pointer to one.
And signal.h is included for SIGUSR1's sake. But when the code
enters the signal_job function it crashes. I'm trying to debug it
right now... the crash is the following:
[Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
[Fialho-2.local:51377] receiver: found port
<784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
[Fialho-2.local:51377] receiver: HNP URI
<784793600.0;tcp://192.168.1.200:54071>, RML URI
<784793601.0;tcp://192.168.1.200:54072>, TAG <300>
[Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC
Event Logger <[[11975,1],0]>
[Fialho-2:51377] *** Process received signal ***
[Fialho-2:51377] Signal: Segmentation fault (11)
[Fialho-2:51377] Signal code: Address not mapped (1)
[Fialho-2:51377] Failing at address: 0x0
[Fialho-2:51377] [ 0] 2 libSystem.B.dylib
0x00007fff83a6eeaa _sigtramp + 26
[Fialho-2:51377] [ 1] 3 libSystem.B.dylib
0x00007fff83a210b7 snprintf + 496
[Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so
0x000000010065ba0a mca_vprotocol_receiver_send + 177
[Fialho-2:51377] [ 3] 5 libmpi.0.dylib
0x0000000100077d44 MPI_Send + 734
[Fialho-2:51377] [ 4] 6 ping
0x0000000100000a97 main + 431
[Fialho-2:51377] [ 5] 7 ping
0x00000001000008e0 start + 52
[Fialho-2:51377] [ 6] 8 ???
0x0000000000000003 0x0 + 3
[Fialho-2:51377] *** End of error message ***
With the exception of the signal_job call, the code works; I have
tested it by forcing an accept on the other process and skipping
signal_job. But I want to send the signal to wake up the other
side and to be able to manage multiple connect/accepts.
Thanks,
Leonardo
On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
Sure! So long as you add the include, you are okay as the ORTE
layer is "below" the OMPI one.
On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
Thanks Ralph, one last question... is orte_plm.signal_job
exposed/available to be called from a PML component? Yes, I have
the orte/mca/plm/plm.h include line.
Leonardo
On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
It's just the orte_process_name_t jobid field. So if you have
an orte_process_name_t *pname, then it would just be
orte_plm.signal_job(pname->jobid, sig)
On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
Hum... and to signal a job, presumably the function is
orte_plm.signal_job(jobid, signal); right?
Now my dumb question is how to obtain the jobid part from an
orte_process_name_t variable. Is there some magical function
in names_fns.h?
Thanks,
Leonardo
On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
Afraid not - you can signal a job, but not a specific
process. We used to have such an API, but nobody ever used
it. Easy to restore if someone has a need.
On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
Hi,
Is there any function in Open MPI's frameworks to send a
signal to another ORTE proc?
For example, the ORTE process [[1234,1],1] wants to send a
signal to process [[1234,1],4] located on another node. I'm
looking for this kind of function, but so far I have only found
functions that send a signal to all procs on a node.
Thanks,
Leonardo
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel