Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread George Bosilca
Yaohui,

The whole infrastructure at the level where you're looking is similar to Active 
Messages. The register function is used to register a callback for a specific 
tag. A tag is a uint8_t, and thus there are 256 callbacks possible. However, 
there are some rules regarding which level is allowed to register callbacks in 
a specific range, in order to avoid conflicts between several modules loaded at 
the same time.

Anyway, as far as I understood, you're looking at writing a new BTL. Every time 
a message is drained from the network, the BTL is supposed to know the tag it 
was sent to and trigger the corresponding callback (this happens only on the 
receiver side). How this "tag" is moved around depends on the BTL capabilities. 
Some will have to push it explicitly through the network (TCP as an example), 
while others have other means to move it around (for MX this tag is part of the 
64-bit key used for each message). Therefore, the first thing you should make 
sure of is that you really have a way to retrieve this tag on the receiver 
side. Once you have the tag and the content of the message, you should call the 
callback corresponding to the tag (using the simple addition you noticed), and 
pass the correct arguments. This should at least let you start the eager 
protocol.
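
To make the mechanism concrete, here is a minimal, self-contained sketch of a 
tag-indexed callback table. It is not the Open MPI source: the names 
am_callback_t, am_trigger, am_register, am_dispatch and count_cb are 
hypothetical stand-ins, and refusing a second registration for the same tag is 
a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real types; only the shape matters here. */
typedef void (*am_cbfunc_t)(void *btl, uint8_t tag, void *msg, void *cbdata);

typedef struct {
    am_cbfunc_t cbfunc;  /* function to run when a message with this tag arrives */
    void       *cbdata;  /* opaque pointer handed back to cbfunc */
} am_callback_t;

/* One slot per possible uint8_t tag value, zero-initialized (all empty). */
static am_callback_t am_trigger[256];

/* Register side: an upper layer claims a tag.
 * Returns 0 on success, -1 if the tag is already taken (sketch-only policy). */
static int am_register(uint8_t tag, am_cbfunc_t cbfunc, void *cbdata)
{
    if (NULL != am_trigger[tag].cbfunc) {
        return -1;
    }
    am_trigger[tag].cbfunc = cbfunc;
    am_trigger[tag].cbdata = cbdata;
    return 0;
}

/* Receive side: called by the transport once it has recovered the tag and the
 * message content. Returns 0 if a callback ran, -1 if none is registered. */
static int am_dispatch(void *btl, uint8_t tag, void *msg)
{
    am_callback_t *reg = am_trigger + tag;  /* same as &am_trigger[tag] */
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(btl, tag, msg, reg->cbdata);
    return 0;
}

/* Example callback: counts delivered messages through its cbdata pointer. */
static void count_cb(void *btl, uint8_t tag, void *msg, void *cbdata)
{
    (void)btl; (void)tag; (void)msg;
    ++*(int *)cbdata;
}
```

The "simple addition" is the pointer arithmetic in am_dispatch: adding the tag 
to the base of the array selects the slot for that tag. A transport would call 
am_dispatch from wherever it finishes reading a message off the wire.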

  george.

On Mar 16, 2010, at 23:22 , hu yaohui wrote:

> Hi Jeff & All
> Yes, you are right, I was just a little confused then. I need to modify the 
> send function of the self component in the btl framework.
> I have just run into a problem.
> When I browse the function mca_btl_self_send 
> (~/ompi/mca/btl/self/btl_self.c), I think it uses this to send the data:
> 
> 303    reg = mca_btl_base_active_message_trigger + tag;
> 304    reg->cbfunc( btl, tag, des, reg->cbdata );
> 
> I traced "mca_btl_base_active_message_trigger" back to the function where it 
> gets its value, and found the function mca_bml_r2_register 
> (~/ompi/mca/bml/bml_r2.c), which looks like this:
> 
> 728    mca_btl_base_active_message_trigger[tag].cbfunc = cbfunc;
> 729    mca_btl_base_active_message_trigger[tag].cbdata = data;
> 
> When I traced mca_bml_r2_register in the same file, I found this structure:
> 
> mca_bml_r2_module_t mca_bml_r2 = {
>     {
>         &mca_bml_r2_component,
>         mca_bml_r2_add_procs,
>         mca_bml_r2_del_procs,
>         mca_bml_r2_add_btl,
>         mca_bml_r2_del_btl,
>         mca_bml_r2_del_proc_btl,
>         mca_bml_r2_register,        <--
>         mca_bml_r2_register_error,
>         mca_bml_r2_finalize,
>         mca_bml_r2_ft_event
>     }
> };
> 
> After this, I found the place where mca_bml_r2 is initialized, but I cannot 
> find anything related to mca_bml_r2_register. I just want to know what
> reg = mca_btl_base_active_message_trigger + tag;
> really is. I also want to modify the send function of self. Is this the right 
> way? If not, can you tell me the right way to modify the send function of the 
> self component?
>  
> Thanks & Regards
> Yaohui Hu
>  
> On Wed, Mar 17, 2010 at 12:52 AM, Jeff Squyres  wrote:
> On Mar 16, 2010, at 9:45 AM, hu yaohui wrote:
> 
> > It just said I had a wrong command format. When I use mpirun --help, I 
> > really didn't find the --mca parameter. Why does the TCP FAQ section list 
> > these command lines when they can't execute successfully on my machine? Is 
> > there another way to control which specific btl components are used?
> 
> Make sure you're using the right mpirun -- you might have multiple installed 
> on your machine.
> 
> OMPI's "mpirun --help" definitely includes a description of the --mca 
> parameter:
> 
>   -mca|--mca <arg0> <arg1>
>       Pass context-specific MCA parameters; they are
>       considered global if --gmca is not used and only
>       one context is specified (arg0 is the parameter
>       name; arg1 is the parameter value)
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 




Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread hu yaohui
Hi George,
What I want to do is modify the self component to meet my needs. I just want to
modify the send function of the self component to test whether my own send
function, which is based on an emulation platform, is correct. So I copied all
of the self component code, changed the component name to mine, and then wanted
to substitute my own send/receive functions for its send and receive. I don't
know whether this is the right approach; if it is not, or if you need more
information, please let me know.

Thanks & Regards
Yaohui Hu .


Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread George Bosilca
Yaohui,

The self component is special. While it does behave as a "normal" BTL, it takes 
a lot of shortcuts as all operations are in the memory of a single process. 
However, as one of the simplest BTLs in Open MPI, I guess it is a good starting 
point.

As stated previously, the self BTL exhibits a lot of differences compared with 
the other BTLs. In your case, in the self BTL the send function triggers the 
receiver callback, as there is no other simple way to drain the "network". This 
explains why we compute the btl_active_message_callback_t directly in the send 
function. Usually, this is done in the progress function, once some data has 
been extracted from the network. Basically, everything in the mca_btl_self_send 
function starting from the "/* upcall */" comment is the receive operation.
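
The difference between the two styles can be sketched as follows. This is an 
illustration under stated assumptions, not the self BTL code: recv_reg_t, 
trigger, loopback_send, net_send, net_progress, the one-slot "wire" and 
sum_len_cb are all hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*recv_cb_t)(uint8_t tag, const void *data, size_t len, void *cbdata);
typedef struct { recv_cb_t cbfunc; void *cbdata; } recv_reg_t;

/* Tag-indexed callback table, filled in by whoever consumes the messages. */
static recv_reg_t trigger[256];

/* "self"-style send: sender and receiver are the same process, so there is
 * nothing to drain later and the receive callback runs inside send. */
static int loopback_send(uint8_t tag, const void *data, size_t len)
{
    /* upcall: from here on this is really the receive operation */
    recv_reg_t *reg = trigger + tag;
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, data, len, reg->cbdata);
    return 0;
}

/* Network-style transport, faked with a one-slot "wire". */
typedef struct { uint8_t tag; const void *data; size_t len; int full; } wire_t;
static wire_t wire;

/* Send only pushes tag and data onto the wire; no callback runs here. */
static int net_send(uint8_t tag, const void *data, size_t len)
{
    if (wire.full) {
        return -1;
    }
    wire.tag = tag; wire.data = data; wire.len = len; wire.full = 1;
    return 0;
}

/* Progress drains the wire, recovers the tag and triggers the callback.
 * Returns the number of messages delivered (0 or 1 in this sketch). */
static int net_progress(void)
{
    if (!wire.full) {
        return 0;
    }
    wire.full = 0;
    recv_reg_t *reg = trigger + wire.tag;
    if (NULL == reg->cbfunc) {
        return 0;
    }
    reg->cbfunc(wire.tag, wire.data, wire.len, reg->cbdata);
    return 1;
}

/* Example callback: accumulates the number of bytes received. */
static void sum_len_cb(uint8_t tag, const void *data, size_t len, void *cbdata)
{
    (void)tag; (void)data;
    *(size_t *)cbdata += len;
}
```

With loopback_send the callback has already run when send returns; with 
net_send nothing is delivered until net_progress is called. A new BTL for a 
real or emulated network would most likely follow the second pattern.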

  george.


Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
I'm going to have to eat my last message. It slipped past me that your other 
job was started via comm_spawn. Since both "jobs" are running under the same 
mpirun, there shouldn't be a problem sending a signal between them.

I don't know why this would be crashing. Are you sure it is crashing in 
signal_job? Your trace indicates it is crashing in a print statement, yet there 
is no print statement in signal_job. Or did you run this with plm_base_verbose 
set so that the verbose prints are trying to run (it could be that we have a 
bug in one of them)?

On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:

> Well, thank you anyway :)
> 
> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
> 
>> Yeah, that probably won't work. The current code isn't intended to cross 
>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm 
>> pretty sure it won't support it.
>> 
>> I don't currently know any way to do what you are trying to do. We could 
>> extend the signal code to handle it, I would think...but I'm not sure how 
>> soon that might happen.
>> 
>> 
>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>> 
>>> Yes... but something is going wrong... maybe the problem is that the 
>>> jobid is different from the process' jobid, I don't know.
>>> 
>>> I'm trying to send a signal to another process running under another job. 
>>> The other process jumps into an accept_connect on the MPI comm. So I wrote 
>>> code like this (I removed the verification code and comments; this is just 
>>> a summary of the error-free path):
>>> 
>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>> 
>>> el_proc is defined as an orte_process_name_t, not a pointer to one. And 
>>> signal.h has been included for SIGUSR1's sake. But when the code enters the 
>>> signal_job function it crashes. I'm trying to debug it right now... the 
>>> crash is the following:
>>> 
>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>> [Fialho-2:51377] *** Process received signal ***
>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>> [Fialho-2:51377] Failing at address: 0x0
>>> [Fialho-2:51377] [ 0] 2   libSystem.B.dylib            0x7fff83a6eeaa _sigtramp + 26
>>> [Fialho-2:51377] [ 1] 3   libSystem.B.dylib            0x7fff83a210b7 snprintf + 496
>>> [Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so    0x00010065ba0a mca_vprotocol_receiver_send + 177
>>> [Fialho-2:51377] [ 3] 5   libmpi.0.dylib               0x000100077d44 MPI_Send + 734
>>> [Fialho-2:51377] [ 4] 6   ping                         0x00010a97 main + 431
>>> [Fialho-2:51377] [ 5] 7   ping                         0x000108e0 start + 52
>>> [Fialho-2:51377] [ 6] 8   ???                          0x0003 0x0 + 3
>>> [Fialho-2:51377] *** End of error message ***
>>> 
>>> Apart from the signal_job call the code works; I have tested it by forcing 
>>> an accept on the other process and skipping signal_job. But I want to 
>>> send the signal to wake up the other side and to be able to manage multiple 
>>> connect/accept operations.
>>> 
>>> Thanks,
>>> Leonardo
>>> 
>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>> 
>>>> Sure! So long as you add the include, you are okay as the ORTE layer is 
>>>> "below" the OMPI one.
>>>> 
>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>> 
>>>>> Thanks Ralph, one last question... is orte_plm.signal_job 
>>>>> exposed/available to be called by a PML component? Yes, I have the 
>>>>> orte/mca/plm/plm.h include line.
>>>>> 
>>>>> Leonardo
>>>>> 
>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>> 
>>>>>> It's just the orte_process_name_t jobid field. So if you have an 
>>>>>> orte_process_name_t *pname, then it would just be
>>>>>> 
>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>> 
>>>>>> 
>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>> 
>>>>>>> Hum, and to signal a job the function is probably 
>>>>>>> orte_plm.signal_job(jobid, signal); right?
>>>>>>> 
>>>>>>> Now my dummy question is how to obtain the jobid part from an 
>>>>>>> orte_proc_name_t variable? Is there any magical function in the 
>>>>>>> names_fns.h?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Leonardo
>>>>>>> 
>>>>>>> On Mar 16, 2010, at 10:12 

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread hu yaohui
Hi George,
Thank you very much!
I understand that the send function mca_btl_self_send really contains a receive
callback. What I want to know is where this callback function (line 303)

303    reg = mca_btl_base_active_message_trigger + tag;
304    reg->cbfunc( btl, tag, des, reg->cbdata );

is mapped to, that is, where this function is initialized: in which file and in
which function is mca_bml_r2_register called?

Thanks & Regards
Yaohui

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread hu yaohui
Hi George,
Do you have a Gmail or MSN account? I would really like to talk to you
directly; that would be much faster.

Thanks & Regards
Yaohui Hu

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread George Bosilca
Yaohui,

The callback functions are registered by any module that can handle network 
communications. In your specific case I would guess it is the PML. Look in 
mca/pml/ob1/pml_ob1.c starting from line 364 to see which callbacks are 
registered by OB1.
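
As a rough illustration of what such a registration step can look like, the 
sketch below has an upper layer fill several slots of the tag table when it is 
enabled. It does not reproduce pml_ob1.c: the tag range (UPPER_TAG_BASE, 
UPPER_TAG_LIMIT), the header types (HDR_MATCH, HDR_RNDV, HDR_ACK) and all 
function names are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*cbfunc_t)(uint8_t tag, void *msg, void *cbdata);
typedef struct { cbfunc_t cbfunc; void *cbdata; } reg_t;

/* Tag-indexed table that the transport consults on every received message. */
static reg_t trigger[256];

/* Hypothetical range reserved for this upper layer, so that several layers
 * loaded at the same time do not collide on tag values. */
enum { UPPER_TAG_BASE = 0x40, UPPER_TAG_LIMIT = 0x60 };

/* Hypothetical message kinds, one tag each, all inside the reserved range. */
enum {
    HDR_MATCH = UPPER_TAG_BASE + 1,
    HDR_RNDV  = UPPER_TAG_BASE + 2,
    HDR_ACK   = UPPER_TAG_BASE + 3
};

/* Register one callback; tags outside the layer's range are refused. */
static int upper_register(uint8_t tag, cbfunc_t cb, void *cbdata)
{
    if (tag < UPPER_TAG_BASE || tag >= UPPER_TAG_LIMIT) {
        return -1;
    }
    trigger[tag].cbfunc = cb;
    trigger[tag].cbdata = cbdata;
    return 0;
}

/* One handler per message kind; each records which kind it saw. */
static void on_match(uint8_t tag, void *msg, void *cbdata)
{
    (void)tag; (void)msg;
    *(int *)cbdata = HDR_MATCH;
}

static void on_rndv(uint8_t tag, void *msg, void *cbdata)
{
    (void)tag; (void)msg;
    *(int *)cbdata = HDR_RNDV;
}

static void on_ack(uint8_t tag, void *msg, void *cbdata)
{
    (void)tag; (void)msg;
    *(int *)cbdata = HDR_ACK;
}

/* Called once when the layer is enabled: this is where the table gets its
 * values. Stops at the first failed registration and returns its error. */
static int upper_enable(int *last_seen)
{
    int rc = upper_register(HDR_MATCH, on_match, last_seen);
    if (0 == rc) rc = upper_register(HDR_RNDV, on_rndv, last_seen);
    if (0 == rc) rc = upper_register(HDR_ACK, on_ack, last_seen);
    return rc;
}
```

After upper_enable has run, a lookup such as trigger + tag in the transport 
resolves to one of these handlers, which is why the table appears to be filled 
"from nowhere" when reading the BTL code alone.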

  george.

On Mar 17, 2010, at 01:22 , hu yaohui wrote:

> Hi Geogre,
> Thank you very much!
> i know ,it's really a receive callback in this send function 
> mca_btl_self_send,what i want to know is where  this callback function(line 
> 303).
> 
> 303reg = mca_btl_base_active_message_trigger + tag;
> 304reg->cbfunc( btl, tag, des, reg->cbdata );
> 
> mapped to,where this function is initialized.in which file,which function,the 
> mca_bml_r2_register was called.
>  
> Thanks & Regards
> Yaohui
> 
> On Wed, Mar 17, 2010 at 12:42 PM, George Bosilca  wrote:
> Yoahui,
> 
> The self component is special. While is does behave as a "normal" BTL, it 
> takes a lot of shortcuts as all operations are in the memory of a single 
> process. However, as the simplest BTLs in Open MPI, I guess it is a good 
> starting point.
> 
> As stated previously, the self BTL exhibit a lot of differences compared with 
> the others BTL. For your case, in the self BTL the send function trigger the 
> receiver callback, as there is other simple way to drain the "network". This 
> explain why we compute the btl_active_message_callback_t directly in the send 
> function. Usually, this is done on the progress function, once some data have 
> been extracted from the network. Basically, everything in the 
> mca_btl_self_send function starting from the "/* upcall */" comment is the 
> receive operation.
> 
>  george.
> 
> On Mar 17, 2010, at 00:30 , hu yaohui wrote:
> 
> > Hi George,
> > what i want to do is to modify the self component to meet my needs,i just 
> > want to modify the send function of the self component to test whether my 
> > implemented send function ,which based on some emulation platform, is 
> > right.so i copied all the self component code,modified the component name 
> > to mine ,the i wanted to subsitude its send and receive to my implemented 
> > send/receive function.i dont know whether this is right,if not ,or you need 
> > more information ,please let me know.
> >
> > Thanks & Regards
> > Yaohui Hu .
> >
> > On Wed, Mar 17, 2010 at 12:05 PM, George Bosilca  
> > wrote:
> > Yaohui,
> >
> > The whole infrastructure at the level where you're looking is similar to 
> > Active Messages. The register function is used to register callback for a 
> > specific tag. A tag is a uint8_t, and thus there are 256 callbacks 
> > possible. However, there are some rules regarding which level is allowed to 
> > register callbacks in a specific range, in order to avoid conflict between 
> > several modules loaded in same time.
> >
> > Anyway, as far as I understood you're looking at writing a new BTL. Every 
> > time a message is drained from the network, the BTL is supposed to know 
> > that tag it was send to and trigger the corresponding callback (this only 
> > on the receiver side). How this "tag" is moved around depends on the BTL 
> > capabilities. Some will have to push it explicitly through the network (TCP 
> > as an example), while others have other means to move it around (for MX 
> > this tag is part of the 64 bits key used for each message). Therefore, the 
> > first thing you should make sure is that you really have a way to retrieve 
> > this tag on the receiver side. Once you have the tag and the content of the 
> > message, you should call the callback corresponding to the tag (using the 
> > simple addition you noticed), and pass the correct arguments. This should 
> > at least let you start the eager protocol.
> >
> >  george.
> >
> > On Mar 16, 2010, at 23:22 , hu yaohui wrote:
> >
> > > Hi Jeff & All
> > > Yes,you are right,i was just a little dizzy then. i need to modify the 
> > > send function of component self in btl framework.
> > > i just met a problem right now.
> > > when i browse the function 
> > > mca_btl_self_send(~/ompi/mca/btl/self/btl_self.c),i think it use this to 
> > > send the data
> > > 
> > > > 303    reg = mca_btl_base_active_message_trigger + tag;
> > > > 304    reg->cbfunc( btl, tag, des, reg->cbdata );
> > > 
> > > i trace through the "mca_btl_base_active_message_trigger" to the function 
> > > where it get its value ,then i find function 
> > > mca_bml_r2_register(~/ompi/mca/bml/bml_r2.c),it like this:
> > > 
> > > > 728    mca_btl_base_active_message_trigger[tag].cbfunc = cbfunc;
> > > > 729    mca_btl_base_active_message_trigger[tag].cbdata = data;
> > > 
> > > when i trace through mca_bml_r2_register ,in the same file,i get this 
> > > structure:
> > > 
> > > mca_bml_r2_module_t mca_bml_r2 = {
> > > {
> > > &mca_bml_r2_component,
> > > mca_bml_r2_add_procs,
> > > mca_bml_r2_del_procs,
> > > mca_bml_r2_add_btl,
> > > mca_bml_r2_del_btl,
> > > mca_bml_r2_del_proc_btl,
> > > mc

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread George Bosilca
For the sake of completeness, and for the enlightenment of all interested 
developers, I would prefer if we keep the discussion going on this mailing list 
(so we will have a searchable trace for the future).

  george.

On Mar 17, 2010, at 01:25 , hu yaohui wrote:

> Hi George,
> did you have a gmail or msn? i really want to talk to you directly.That's 
> much fast.
>  
> Thanks & Regards
> Yaohui Hu
> 
> On Wed, Mar 17, 2010 at 12:42 PM, George Bosilca  wrote:
> Yoahui,
> 
> The self component is special. While is does behave as a "normal" BTL, it 
> takes a lot of shortcuts as all operations are in the memory of a single 
> process. However, as the simplest BTLs in Open MPI, I guess it is a good 
> starting point.
> 
> As stated previously, the self BTL exhibit a lot of differences compared with 
> the others BTL. For your case, in the self BTL the send function trigger the 
> receiver callback, as there is other simple way to drain the "network". This 
> explain why we compute the btl_active_message_callback_t directly in the send 
> function. Usually, this is done on the progress function, once some data have 
> been extracted from the network. Basically, everything in the 
> mca_btl_self_send function starting from the "/* upcall */" comment is the 
> receive operation.
> 
>  george.
> 
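The shortcut described above for the self BTL (the send path performing the receive upcall itself) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names am_register, am_self_send and count_bytes are hypothetical and are not Open MPI functions, and the callback signature is simplified compared with the real BTL one. Only the table-plus-tag lookup mirrors the two lines quoted from btl_self.c.

```c
#include <stddef.h>
#include <stdint.h>

#define AM_TAG_MAX 256  /* a uint8_t tag gives 256 possible slots */

/* Hypothetical, simplified callback type (not the real BTL signature). */
typedef void (*am_cbfunc_t)(uint8_t tag, const void *payload, size_t len,
                            void *cbdata);

typedef struct {
    am_cbfunc_t cbfunc;
    void *cbdata;
} am_callback_t;

/* One slot per tag; static storage, so every slot starts out NULL. */
static am_callback_t am_trigger[AM_TAG_MAX];

/* Register the receive callback for one tag. */
static void am_register(uint8_t tag, am_cbfunc_t cbfunc, void *cbdata)
{
    am_trigger[tag].cbfunc = cbfunc;
    am_trigger[tag].cbdata = cbdata;
}

/* "self"-style send: there is no network to drain, so the send itself
 * performs the receive upcall. Returns 0 on success, -1 if nothing is
 * registered for the tag. */
static int am_self_send(uint8_t tag, const void *payload, size_t len)
{
    am_callback_t *reg = am_trigger + tag;  /* the "simple addition" */
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, payload, len, reg->cbdata);
    return 0;
}

/* Example callback: adds the payload length to a counter passed as cbdata. */
static void count_bytes(uint8_t tag, const void *payload, size_t len,
                        void *cbdata)
{
    (void)tag;
    (void)payload;
    *(size_t *)cbdata += len;
}
```

Calling am_register(7, count_bytes, &total) and then am_self_send(7, "abc", 3) runs the callback synchronously inside the send, which is the behaviour that makes the self BTL different from a BTL with a real network and a progress function.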

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread hu yaohui
Hi George,
Thank you very much!
I had seen these functions before, but it was a long time ago and I couldn't
find them again.
You saved me a lot of time.

Thanks & Regards,
Yaohui Hu

On Wed, Mar 17, 2010 at 1:28 PM, George Bosilca wrote:

> Yaohui,
>
> The callback functions are registered by any modules that can handle
> network communications. In your specific case I would guess it is the PML.
> Look in mca/pml/ob1/pml_ob1.c starting from line 364 to see what callbacks
> are registered by OB1.
>
>  george.
>
> On Mar 17, 2010, at 01:22 , hu yaohui wrote:
>
> > Hi Geogre,
> > Thank you very much!
> > i know ,it's really a receive callback in this send function
> mca_btl_self_send,what i want to know is where  this callback function(line
> 303).
> > 
> > 303    reg = mca_btl_base_active_message_trigger + tag;
> > 304    reg->cbfunc( btl, tag, des, reg->cbdata );
> > 
> > mapped to,where this function is initialized.in which file,which
> function,the mca_bml_r2_register was called.
> >
> > Thanks & Regards
> > Yaohui
> >
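The division of labour in this exchange (an upper layer registers its callbacks once, and the transport only looks them up by tag when data arrives) can be sketched as follows. This is a rough illustration, not Open MPI code: rx_register, frame_pack, progress_one and record_cb are hypothetical names, and carrying the tag as the first byte of each frame is just one assumed way of pushing it explicitly through the network.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified receive-callback type. */
typedef void (*rx_cbfunc_t)(uint8_t tag, const uint8_t *data, size_t len,
                            void *cbdata);

typedef struct {
    rx_cbfunc_t cbfunc;
    void *cbdata;
} rx_slot_t;

static rx_slot_t rx_table[256];  /* one slot per uint8_t tag */

/* Called once by the upper layer (the role the PML plays in the thread). */
static void rx_register(uint8_t tag, rx_cbfunc_t cbfunc, void *cbdata)
{
    rx_table[tag].cbfunc = cbfunc;
    rx_table[tag].cbdata = cbdata;
}

/* Sender side: build a frame of one tag byte followed by the payload.
 * Returns the frame length, or 0 if the buffer is too small. */
static size_t frame_pack(uint8_t tag, const void *payload, size_t len,
                         uint8_t *frame, size_t cap)
{
    if (0 == cap || len > cap - 1) {
        return 0;
    }
    frame[0] = tag;
    if (len > 0) {
        memcpy(frame + 1, payload, len);
    }
    return len + 1;
}

/* Receiver side: what a progress function would do for one drained frame.
 * Returns 0 after the upcall, -1 for an empty frame or an unknown tag. */
static int progress_one(const uint8_t *frame, size_t frame_len)
{
    rx_slot_t *reg;
    if (frame_len < 1) {
        return -1;
    }
    reg = rx_table + frame[0];
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(frame[0], frame + 1, frame_len - 1, reg->cbdata);
    return 0;
}

/* Example callback: remembers the last tag and payload length seen. */
typedef struct {
    uint8_t tag;
    size_t len;
} rx_record_t;

static void record_cb(uint8_t tag, const uint8_t *data, size_t len,
                      void *cbdata)
{
    rx_record_t *rec = cbdata;
    (void)data;
    rec->tag = tag;
    rec->len = len;
}
```

The point of the sketch is that the transport never needs to know which function it is calling: as long as the tag survives the trip, the lookup on the receiver side finds whatever the upper layer registered.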

Re: [OMPI devel] how to add a component in the ompi?

2010-03-17 Thread hu yaohui
ok, got it !

On Wed, Mar 17, 2010 at 1:31 PM, George Bosilca wrote:

> For the sake of completeness, and for the enlightenment of all interested
> developers, I would prefer if we keep the discussion going on this mailing
> list (so we will have a searchable trace for the future).
>
>  george.
>

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Ralph, don't swallow your message yet... The two jobs are not running under the
same mpirun. There are two instances of mpirun: one runs with "-report-uri
../contact.txt" and the other receives its contact info using "-ompi-server
file:../contact.txt". And yes, both processes were running with plm_base_verbose
activated. When I deactivate plm_base_verbose the error is practically the
same:

[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2   libSystem.B.dylib           0x7fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3   libSystem.B.dylib           0x7fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so   0x00010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5   libmpi.0.dylib              0x000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6   ping                        0x00010a97 main + 431
[aopclf:54106] [ 5] 7   ping                        0x000108e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

> I'm going to have to eat my last message. It slipped past me that your other 
> job was started via comm_spawn. Since both "jobs" are running under the same 
> mpirun, there shouldn't be a problem sending a signal between them.
> 
> I don't know why this would be crashing. Are you sure it is  crashing in 
> signal_job? Your trace indicates it is crashing in a print statement, yet 
> there is no print statement in signal_job. Or did you run this with 
> plm_base_verbose set so that the verbose prints are trying to run (could be 
> we have a bug in one of them)?
> 
> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
> 
>> Well, thank you anyway :)
>> 
>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>> 
>>> Yeah, that probably won't work. The current code isn't intended to cross 
>>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm 
>>> pretty sure it won't support it.
>>> 
>>> I don't currently know any way to do what you are trying to do. We could 
>>> extend the signal code to handle it, I would think...but I'm not sure how 
>>> soon that might happen.
>>> 
>>> 
>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>> 
>>>> Yes... but something wrong is going on... maybe the problem is that the 
>>>> jobid is different than the process' jobid, I don't know.
>>>> 
>>>> I'm trying to send a signal to other process running under a another job. 
>>>> The other process jump into an accept_connect to the MPI comm. So i did a 
>>>> code like this (I removed verification code and comments, this is just a 
>>>> summary for a happy execution):
>>>> 
>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>> 
>>>> el_proc is defined as orte_process_name_t, not a pointer to this. And 
>>>> signal.h has been included for SIGUSR1's sake. But when the code enter in 
>>>> signal_job function it crashes. I'm trying to debug it just now... the 
>>>> crash is the following:
>>>> 
>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>>> [Fialho-2:51377] *** Process received signal ***
>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>> [Fialho-2:51377] Failing at address: 0x0
>>>> [Fialho-2:51377] [ 0] 2   libSystem.B.dylib           0x7fff83a6eeaa _sigtramp + 26
>>>> [Fialho-2:51377] [ 1] 3   libSystem.B.dylib           0x7fff83a210b7 snprintf + 496
>>>> [Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so   0x00010065ba0a mca_vprotocol_receiver_send + 177
>>>> [Fialho-2:51377] [ 3] 5   libmpi.0.dylib              0x000100077d44 MPI_Send + 734
>>>> [Fialho-2:51377] [ 4] 6   ping                        0x00010a97 main + 431
>>>> [Fialho-2:51377] [ 5] 7   ping                        0x000108e0 start + 52
>>>> [Fialho-2:51377] [ 6] 8   ???                         0x0003 0x0 + 3
>>>> [Fialho-2:51377] *** End of error message *

Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I didn't see 
the message indicating it was sending the signal cmd out in your prior debug 
output, and there isn't a printf in that code loop other than the debug output. 
Can you attach to the process and get more info?


Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
To clarify a little bit more: I'm calling orte_plm.signal_job from a PML 
component. I know that ORTE is below OMPI, but I suspect that this function 
may not be available there, or something like that. I can't figure out where 
this snprintf is either; in my code there is only

opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
SIGUSR1, ORTE_NAME_PRINT(&el_proc));
orte_plm.signal_job(el_proc.jobid, SIGUSR1);

And the first output/printf works fine. Well... I ran the program under gdb, 
and I can see this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x
0x in ?? ()
(gdb) backtrace
#0  0x in ?? ()
#1  0x00010065c319 in vprotocol_receiver_eventlog_connect 
(el_comm=0x10065d178) at 
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2  0x00010065ba9a in mca_vprotocol_receiver_send (buf=0x10050, 
count=262144, datatype=0x100263d60, dst=1, tag=1, 
sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at 
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
#3  0x000100077d44 in MPI_Send ()
#4  0x00010a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45

Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job call. 
After that there are only zeros and question marks... Is the signal_job 
function actually available at this point? I really don't understand what all 
those zeros mean.

Leonardo


Re: [OMPI devel] Signals

2010-03-17 Thread Terry Dontje
Can you print out what the value of orte_plm.signal_job is?  I bet it is 
pointing to address 0.  So the question is: is orte_plm actually initialized 
in an MPI process?  My guess would be no, but I am sure Ralph will be 
able to answer more definitively.


--td
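
A frame #0 at address 0 directly under the caller is what a call through a NULL function pointer typically looks like. A small C sketch of checking such a pointer before calling through it is below. It is only an illustration: launcher_module_t, safe_signal_job and stub_signal_job are hypothetical stand-ins, not the ORTE types or API, and the assumption is simply that the module is a struct of function pointers that stays zeroed until some component fills it in.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a framework module: a table of function
 * pointers that is only filled in once a component has been selected. */
typedef int (*signal_job_fn_t)(unsigned int jobid, int signo);

typedef struct {
    signal_job_fn_t signal_job;
} launcher_module_t;

/* Static storage: every pointer is NULL until someone assigns it. */
static launcher_module_t launcher;

#define LAUNCHER_ERR_NOT_AVAILABLE (-1)

/* Call signal_job only if the module actually provides it. */
static int safe_signal_job(const launcher_module_t *mod,
                           unsigned int jobid, int signo)
{
    if (NULL == mod->signal_job) {
        fprintf(stderr, "signal_job is NULL: module not initialized "
                        "in this process\n");
        return LAUNCHER_ERR_NOT_AVAILABLE;
    }
    return mod->signal_job(jobid, signo);
}

/* Stub used only to demonstrate the non-NULL path: echoes the signal. */
static int stub_signal_job(unsigned int jobid, int signo)
{
    (void)jobid;
    return signo;
}
```

With the module left zeroed, safe_signal_job returns the error code instead of jumping to address 0; once a function is assigned, the call goes through normally.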

On 03/17/2010 09:52 AM, Leonardo Fialho wrote:
To clarify a little bit more: I'm calling orte_plm.signal_job from a 
PML component, I know that ORTE is bellow OMPI, but I think that this 
function could not be available, or something like this. I can't 
figure out where is this snprintf too, in my code there is only


opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event 
Logger <%s>",

SIGUSR1, ORTE_NAME_PRINT(&el_proc));
orte_plm.signal_job(el_proc.jobid, SIGUSR1);

And the first output/printf works fine. Well... I used gdb to run the 
program, I can see this:


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x
0x in ?? ()
(gdb) backtrace
#0  0x in ?? ()
#1  0x00010065c319 in vprotocol_receiver_eventlog_connect 
(el_comm=0x10065d178) at 
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2  0x00010065ba9a in mca_vprotocol_receiver_send 
(buf=0x10050, count=262144, datatype=0x100263d60, dst=1, tag=1, 
sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at 
../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46

#3  0x000100077d44 in MPI_Send ()
#4  0x00010a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45

Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job 
call. After that there are only zeros and question marks... is the 
signal_job function available at this point? I really don't understand 
what all those zeros mean.


Leonardo

On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:


Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I 
didn't see the message indicating it was sending the signal cmd out 
in your prior debug output, and there isn't a printf in that code 
loop other than the debug output. Can you attach to the process and 
get more info?


On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:

Ralph, don't swallow your message yet... The two jobs are not running 
under the same mpirun. There are two instances of mpirun: one 
runs with "-report-uri ../contact.txt" and the other receives its 
contact info using "-ompi-server file:../contact.txt". And yes, both 
processes are running with plm_base_verbose activated. When I 
deactivate plm_base_verbose the error is practically the same:


[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2   libSystem.B.dylib           0x7fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3   libSystem.B.dylib           0x7fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so   0x00010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5   libmpi.0.dylib              0x000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6   ping                        0x00010a97 main + 431
[aopclf:54106] [ 5] 7   ping                        0x000108e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

I'm going to have to eat my last message. It slipped past me that 
your other job was started via comm_spawn. Since both "jobs" are 
running under the same mpirun, there shouldn't be a problem sending 
a signal between them.


I don't know why this would be crashing. Are you sure it is 
crashing in signal_job? Your trace indicates it is crashing in a 
print statement, yet there is no print statement in signal_job. Or 
did you run this with plm_base_verbose set so that the verbose 
prints are trying to run (could be we have a bug in one of them)?


On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:


Well, thank you anyway :)

On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:

Yeah, that probably won't work. The current code isn't intended 
to cross jobs like that - I'm sure nobody ever tested it for that 
idea, and I'm pretty sure it won't support it.


I don't currently know any way to do what you are trying to do. 
We could extend the signal code to handle it, I would think...but 
I'm not sure how soon that might happen.



On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:

Yes... but something wrong is going on... maybe the problem is 
that the jobid is different than the process' jobid, I don't know.


I'm trying to send a signal to another process running under 
another job. The other proc

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Wow... orte_plm.signal_job points to zero. Is that expected from the PML point 
of view?

Leonardo


Re: [OMPI devel] Signals

2010-03-17 Thread Terry Dontje

On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
> Wow... orte_plm.signal_job points to zero. Is it correct from the PML 
> point of view?

It might be because plm's are really only used at launch time not in MPI 
processes.  Note plm != pml.


--td



Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Yes, I know the difference :)

I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM 
stayed resident after launching, but it does so only for mpirun and orted; 
you're right.

On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:

> On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
>> 
>> Wow... orte_plm.signal_job points to zero. Is it correct from the PML point 
>> of view?
> It might be because plm's are really only used at launch time not in MPI 
> processes.  Note plm != pml.
> 
> --td

[OMPI devel] Problem with MPI_Type_indexed and hole (defined with MPI_Type_create_resized )

2010-03-17 Thread Pascal Deveze

Hi all,

I use a very simple datatype defined as follows:

   lng[0]= 1;
   dsp[0]= 1;
   err=MPI_Type_indexed(1, lng, dsp, MPI_CHAR, &offtype);
   err=MPI_Type_create_resized(offtype, 0, 2, &filetype);
   MPI_Type_commit(&filetype);

This datatype consists of a hole (of length 1 char) followed by a char.
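
To make the expected layout concrete, here is a small plain-C sketch (no MPI calls) of which file offsets a view tiled with this filetype should visit. It assumes a one-byte element, a displacement of 1 inside the type, an extent of 2, and a file whose byte at offset i holds the value i. The helper names tiled_offset and expected_read are hypothetical and are not taken from ROMIO.

```c
#include <stddef.h>

/*
 * File offset (in bytes) of the k-th visible element when a filetype that
 * holds one element at displacement `disp` is tiled with extent `extent`,
 * starting at view displacement `view_disp`.
 */
long tiled_offset(long view_disp, long disp, long extent, long k)
{
    return view_disp + k * extent + disp;
}

/*
 * Fill out[0..n-1] with the bytes that a read of n elements should return
 * from a file whose byte at offset i holds the value (char)i.
 */
void expected_read(char *out, size_t n, long disp, long extent)
{
    size_t k;
    for (k = 0; k < n; k++) {
        out[k] = (char)tiled_offset(0, disp, extent, (long)k);
    }
}
```

Calling expected_read(out, 5, 1, 2) yields 1, 3, 5, 7, 9. If the leading displacement is dropped (disp = 0), the same call yields 0, 2, 4, 6, 8.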

A datatype with a hole at the beginning is not handled correctly by the 
ROMIO integrated in OpenMPI (I tried with MPICH2 and it worked fine).

You will see below a program to reproduce the problem.

After investigation, I see that the difference between OpenMPI and 
MPICH2 appears at line 542 in the file romio/adio/comm/flatten.c:


   case MPI_COMBINER_RESIZED:
   /* This is done similar to a type_struct with an lb, datatype, ub */

   /* handle the Lb */
   j = *curr_index;
   flat->indices[j] = st_offset + adds[0];
   flat->blocklens[j] = 0;

   (*curr_index)++;

   /* handle the datatype */

   MPI_Type_get_envelope(types[0], &old_nints, &old_nadds,
                         &old_ntypes, &old_combiner);
   ADIOI_Datatype_iscontig(types[0], &old_is_contig); <== line 542


For MPICH2, the datatype is not contiguous, but for OpenMPI it is. The 
routine ADIOI_Datatype_iscontig is quite different in OpenMPI because the 
datatypes are handled very differently. If I reset old_is_contig just after 
line 542, the problem disappears (of course, this is not a solution).

I am not able to propose a proper solution. Can somebody help?

Pascal

Program to reproduce the problem:

#include <stdio.h>
#include "mpi.h"

char filename[256]="VIEW_TEST";
char buffer[100];
int err, i, myid, dsp[3], lng[3];
MPI_Status status;
MPI_File fh;
MPI_Datatype filetype, offtype;
MPI_Aint lb, extent;

int main(int argc, char **argv) {

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   /* Fill the buffer so that the byte at offset i holds the value i. */
   for (i=0; i<(int)sizeof(buffer); i++) buffer[i] = i;
   if (myid == 0) {
      /* Write the reference file. */
      MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
                    MPI_INFO_NULL, &fh);
      MPI_File_write(fh, buffer, sizeof(buffer), MPI_CHAR, &status);
      MPI_File_close(&fh);

      /* One char at displacement 1, resized to an extent of 2:
         a one-char hole followed by one char. */
      lng[0]= 1;
      dsp[0]= 1;
      MPI_Type_indexed(1, lng, dsp, MPI_CHAR, &offtype);
      MPI_Type_create_resized(offtype, 0, 2, &filetype);
      MPI_Type_commit(&filetype);

      /* Read the file back through the view. */
      MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_RDONLY,
                    MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
      MPI_File_read(fh, buffer, 5, MPI_CHAR, &status);

      printf("Data: ");
      for (i=0 ; i<5 ; i++) printf(" %x ", buffer[i]);
      if (buffer[1] != 3) printf("\n ===>  test KO : buffer[1]=%d instead of %d \n", buffer[1], 4);
      else printf("\n ===> test OK\n");
      MPI_Type_free(&filetype);
      MPI_File_close(&fh);
   }
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

The result of the program with MPICH2:
Data:  1  3  5  7  9
===> test OK

The result of the program with OpenMPI:
Data:  0  2  4  6  8
===>  test KO : buffer[1]=2 instead of 4

Comment: Only the first hole is omitted.





Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly realized 
the problem :-/

Terry is correct - we don't initialize the plm framework in application 
processes. However, there is a default proxy module for that framework so that 
applications can call comm_spawn. Unfortunately, I never filled in the rest of 
the module function pointers because (a) there was no known reason for apps to 
be using them (as Jeff points out), and (b) there is no MPI call that 
interfaces to them.

I can (and will) make it work over the next day or two - there is no reason why 
this can't be done. It just wasn't implemented due to lack of reason to do so.
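
As a rough illustration of the gap described above, the C sketch below shows a proxy-style module table in which the entries that are not implemented point at stubs returning an error code instead of being left NULL. The struct layout, field names, and DEMO_ codes are hypothetical and are not the actual plm module definition.

```c
#include <stddef.h>

#define DEMO_SUCCESS              0
#define DEMO_ERR_BAD_PARAM       -1
#define DEMO_ERR_NOT_IMPLEMENTED -2

/* Hypothetical module table; the field names are illustrative only. */
typedef struct {
    int (*spawn)(const char *cmd);
    int (*signal_job)(unsigned int jobid, int signo);
    int (*terminate_job)(unsigned int jobid);
} demo_proxy_module_t;

/* The one entry the proxy really provides in this sketch. */
static int demo_spawn(const char *cmd)
{
    return (NULL != cmd) ? DEMO_SUCCESS : DEMO_ERR_BAD_PARAM;
}

/* Stubs for entries that are not implemented yet. */
static int demo_signal_job_stub(unsigned int jobid, int signo)
{
    (void)jobid;
    (void)signo;
    return DEMO_ERR_NOT_IMPLEMENTED;
}

static int demo_terminate_job_stub(unsigned int jobid)
{
    (void)jobid;
    return DEMO_ERR_NOT_IMPLEMENTED;
}

/* Every slot is populated, so no call can jump to address 0. */
const demo_proxy_module_t demo_proxy_module = {
    demo_spawn,
    demo_signal_job_stub,
    demo_terminate_job_stub
};
```

A caller then gets DEMO_ERR_NOT_IMPLEMENTED back from demo_proxy_module.signal_job(...) and can report it, rather than taking a segmentation fault.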

Sorry for the confusion - old man brain fizzing out again.

On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:

> Yes, I know the difference :)
> 
> I'm trying to call orte_plm.signal_job from a PML component. I think PLM 
> stays resident after launching but it doesn't only for mpirun and orted, 
> you're right.

Re: [OMPI devel] Signals

2010-03-17 Thread Leonardo Fialho
Anyway, to signal another job I sent an RML message with the 
ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP.
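
The general shape of such a command message can be sketched in plain C as below. This is only an illustration under assumptions: it supposes the payload carries a command code, a job id, and a signal number, and it uses a hand-rolled byte buffer instead of the real ORTE packing and RML send calls. The command value and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command code; not the real ORTE_DAEMON_SIGNAL_LOCAL_PROCS value. */
#define DEMO_CMD_SIGNAL_LOCAL_PROCS 1u
#define DEMO_SIGNAL_MSG_LEN 12u

/* Store a 32-bit value in big-endian (network) byte order. */
static void demo_put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Read back a 32-bit big-endian value. */
uint32_t demo_get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Pack command, job id, and signal number into buf.
 * Returns the number of bytes written, or 0 if buf is NULL or too small.
 */
size_t demo_pack_signal_cmd(uint8_t *buf, size_t cap, uint32_t jobid, int32_t signo)
{
    if (NULL == buf || cap < DEMO_SIGNAL_MSG_LEN) {
        return 0;
    }
    demo_put_u32(buf,     DEMO_CMD_SIGNAL_LOCAL_PROCS);
    demo_put_u32(buf + 4, jobid);
    demo_put_u32(buf + 8, (uint32_t)signo);
    return DEMO_SIGNAL_MSG_LEN;
}
```

The receiver would read the three fields back in the same order, dispatch on the command code, and deliver the signal to its local processes of that job.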

Leonardo

On Mar 17, 2010, at 9:59 PM, Ralph Castain wrote:

> Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly 
> realized the problem :-/
> 
> Terry is correct - we don't initialize the plm framework in application 
> processes. However, there is a default proxy module for that framework so 
> that applications can call comm_spawn. Unfortunately, I never filled in the 
> rest of the module function pointers because (a) there was no known reason 
> for apps to be using them (as Jeff points out), and (b) there is no MPI call 
> that interfaces to them.
> 
> I can (and will) make it work over the next day or two - there is no reason 
> why this can't be done. It just wasn't implemented due to lack of reason to 
> do so.
> 
> Sorry for the confusion - old man brain fizzing out again.

Re: [OMPI devel] Signals

2010-03-17 Thread Ralph Castain
Very good - that is pretty much all that the signal_job API does.

On Mar 17, 2010, at 4:11 PM, Leonardo Fialho wrote:

> Anyway, to signal another job I have sent a RML message with the 
> ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP.
> 
> Leonardo
> 
> On Mar 17, 2010, at 9:59 PM, Ralph Castain wrote:
> 
>> Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly 
>> realized the problem :-/
>> 
>> Terry is correct - we don't initialize the plm framework in application 
>> processes. However, there is a default proxy module for that framework so 
>> that applications can call comm_spawn. Unfortunately, I never filled in the 
>> rest of the module function pointers because (a) there was no known reason 
>> for apps to be using them (as Jeff points out), and (b) there is no MPI call 
>> that interfaces to them.
>> 
>> I can (and will) make it work over the next day or two - there is no reason 
>> why this can't be done. It just wasn't implemented due to lack of reason to 
>> do so.
>> 
>> Sorry for the confusion - old man brain fizzing out again.
>> 
>> On Mar 17, 2010, at 8:29 AM, Leonardo Fialho wrote:
>> 
>>> Yes, I know the difference :)
>>> 
>>> I'm trying to call orte_plm.signal_job from a PML component. I think PLM 
>>> stays resident after launching but it doesn't only for mpirun and orted, 
>>> you're right.
>>> 
>>> On Mar 17, 2010, at 3:15 PM, Terry Dontje wrote:
>>> 
 On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
> 
> Wow... orte_plm.signal_job points to zero. Is it correct from the PML 
> point of view?
 It might be because plm's are really only used at launch time not in MPI 
 processes.  Note plm != pml.
 
 --td
> 
> Leonardo
> 
> On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
> 
>> To clarify a little bit more: I'm calling orte_plm.signal_job from a PML 
>> component, I know that ORTE is bellow OMPI, but I think that this 
>> function could not be available, or something like this. I can't figure 
>> out where is this snprintf too, in my code there is only
>> 
>> opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger 
>> <%s>",
>> SIGUSR1, ORTE_NAME_PRINT(&el_proc));
>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>> 
>> And the first output/printf works fine. Well... I used gdb to run the 
>> program, I can see this:
>> 
>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>> Reason: KERN_INVALID_ADDRESS at address: 0x
>> 0x in ?? ()
>> (gdb) backtrace
>> #0  0x in ?? ()
>> #1  0x00010065c319 in vprotocol_receiver_eventlog_connect 
>> (el_comm=0x10065d178) at 
>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
>> #2  0x00010065ba9a in mca_vprotocol_receiver_send (buf=0x10050, 
>> count=262144, datatype=0x100263d60, dst=1, tag=1, 
>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at 
>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
>> #3  0x000100077d44 in MPI_Send ()
>> #4  0x00010a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
>> 
>> Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job 
>> call. After that there are only zeros and question marks... Is the 
>> signal_job function available at that point? I really don't understand 
>> what all those zeros mean.
>> 
>> Leonardo
>> 
>> On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
>> 
>>> Thanks for clarifying - guess I won't chew just yet. :-)
>>> 
>>> I still don't see in your trace where it is failing in signal_job. I 
>>> didn't see the message indicating it was sending the signal cmd out in 
>>> your prior debug output, and there isn't a printf in that code loop 
>>> other than the debug output. Can you attach to the process and get more 
>>> info?
>>> 
>>> On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
>>> 
 Ralph, don't swallow your message yet... The two jobs are not running under 
 the same mpirun. There are two instances of mpirun: one runs 
 with "-report-uri ../contact.txt" and the other receives its contact 
 info using "-ompi-server file:../contact.txt". And yes, both processes 
 are running with plm_base_verbose activated. When I deactivate 
 plm_base_verbose the error is practically the same:
 
 [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger 
 <[[47640,1],0]>
 [aopclf:54106] *** Process received signal ***
 [aopclf:54106] Signal: Segmentation fault (11)
 [aopclf:54106] Signal code: Address not mapped (1)
 [aopclf:54106] Failing at address: 0x0
 [aopclf:54106] [ 0] 2   libSystem.B.dylib  
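
The crash discussed in this thread comes from calling through a module function pointer (orte_plm.signal_job) that is zero inside an MPI process. As a minimal sketch of the general defensive pattern, and not of the actual ORTE code, the example below checks the pointer before calling through it. Every name in it (demo_plm_module_t, demo_try_signal_job, the DEMO_* return codes) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a module made of function pointers.
 * These names are illustrative only; they are NOT the ORTE API. */
typedef int (*demo_signal_job_fn_t)(unsigned int jobid, int signo);

typedef struct {
    demo_signal_job_fn_t signal_job;  /* NULL when the module is not set up */
} demo_plm_module_t;

#define DEMO_SUCCESS         0
#define DEMO_NOT_AVAILABLE (-1)

/* Example implementation, as a launcher-like process might install it. */
static int demo_signal_job_impl(unsigned int jobid, int signo)
{
    assert(signo > 0);  /* signal numbers are positive */
    printf("signalling job %u with signal %d\n", jobid, signo);
    return DEMO_SUCCESS;
}

/* Call through the pointer only if it was populated; otherwise report
 * an error instead of jumping to address 0x0. */
static int demo_try_signal_job(const demo_plm_module_t *plm,
                               unsigned int jobid, int signo)
{
    if (NULL == plm || NULL == plm->signal_job) {
        fprintf(stderr, "signal_job is not available in this process\n");
        return DEMO_NOT_AVAILABLE;
    }
    return plm->signal_job(jobid, signo);
}
```

With a module whose signal_job member is NULL, demo_try_signal_job returns DEMO_NOT_AVAILABLE instead of faulting; with demo_signal_job_impl installed, it forwards the call. A check like this only turns the segmentation fault into a reportable error; it does not make the function available where the module was never initialized.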

[OMPI devel] Migrate OpenMPI to the VxWorks

2010-03-17 Thread 张晶
Hello all,

In order to add some real-time features to Open MPI for research, I need a
version of Open MPI running on VxWorks. But after going through the
Open MPI website, I can’t find any indication that it supports VxWorks.

Following the thread posted by Ralph Castain,
http://www.open-mpi.org/community/lists/users/2006/06/1371.php, I read some
papers about OpenRTE, such as “Creating a transparent, distributed, and
resilient computing environment: the OpenRTE project” and “The Open Run-Time
Environment (OpenRTE): A Transparent Multi-cluster Environment for
High-Performance Computing”, written by Ralph H. Castain, Jeffrey M. Squyres,
and others.

Now I have a basic understanding of OpenRTE; however, there is too little
documentation describing its implementation. I don’t know where or how to
begin the migration. Any advice will be appreciated.

Thanks,

Jing Zhang