Re: [OMPI devel] how to add a component in the ompi?
Yaohui,

The whole infrastructure at the level you are looking at is similar to Active Messages. The register function registers a callback for a specific tag. A tag is a uint8_t, so 256 callbacks are possible. However, there are rules about which level is allowed to register callbacks in a given range, in order to avoid conflicts between several modules loaded at the same time.

Anyway, as far as I understand, you are looking at writing a new BTL. Every time a message is drained from the network, the BTL is supposed to know which tag it was sent to and trigger the corresponding callback (on the receiver side only). How this "tag" is moved around depends on the BTL capabilities. Some have to push it explicitly through the network (TCP, for example), while others have other means to move it around (for MX the tag is part of the 64-bit key used for each message). Therefore, the first thing you should make sure of is that you really have a way to retrieve this tag on the receiver side. Once you have the tag and the content of the message, you should call the callback corresponding to the tag (using the simple addition you noticed) and pass the correct arguments. This should at least let you start the eager protocol.

george.

On Mar 16, 2010, at 23:22 , hu yaohui wrote:

> Hi Jeff & All
> Yes, you are right, I was just a little dizzy then. I need to modify the send
> function of the self component in the btl framework.
> I have just run into a problem.
> When I browse the function mca_btl_self_send
> (~/ompi/mca/btl/self/btl_self.c), I think it uses this to send the data:
>
> 303 reg = mca_btl_base_active_message_trigger + tag;
> 304 reg->cbfunc( btl, tag, des, reg->cbdata );
>
> I traced "mca_btl_base_active_message_trigger" back to the function where it
> gets its value, and found the function mca_bml_r2_register
> (~/ompi/mca/bml/bml_r2.c), which looks like this:
>
> 728 mca_btl_base_active_message_trigger[tag].cbfunc = cbfunc;
> 729 mca_btl_base_active_message_trigger[tag].cbdata = data;
>
> When I trace mca_bml_r2_register, I find this structure in the same file:
>
> mca_bml_r2_module_t mca_bml_r2 = {
>     {
>         &mca_bml_r2_component,
>         mca_bml_r2_add_procs,
>         mca_bml_r2_del_procs,
>         mca_bml_r2_add_btl,
>         mca_bml_r2_del_btl,
>         mca_bml_r2_del_proc_btl,
>         mca_bml_r2_register,    <
>         mca_bml_r2_register_error,
>         mca_bml_r2_finalize,
>         mca_bml_r2_ft_event
>     }
>
> };
>
> After this, I found the place where mca_bml_r2 is initialized, but I cannot
> find anything related to mca_bml_r2_register. I just want to know what
> reg = mca_btl_base_active_message_trigger + tag;
> really is. I also want to modify the send function of self. Is this the right
> way? Or can you tell me the right way to modify the send function of the self
> component?
>
> Thanks & Regards
> Yaohui Hu
>
> On Wed, Mar 17, 2010 at 12:52 AM, Jeff Squyres wrote:
> On Mar 16, 2010, at 9:45 AM, hu yaohui wrote:
>
> > It just said I had a wrong command format. When I use mpirun --help, I
> > really didn't find the --mca parameter. Why does the TCP FAQ list these
> > command lines when they cannot execute successfully on my machine? Is
> > there another way to control which specific btl components are used?
>
> Make sure you're using the right mpirun -- you might have multiple installed
> on your machine.
>
> OMPI's "mpirun --help" definitely includes a description of the --mca
> parameter:
>
> -mca|--mca <arg0> <arg1>
>                       Pass context-specific MCA parameters; they are
>                       considered global if --gmca is not used and only
>                       one context is specified (arg0 is the parameter
>                       name; arg1 is the parameter value)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
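The tag-indexed callback table George describes can be sketched in a few lines of C. This is only an illustration under stated assumptions: every name below (am_callback_t, am_trigger, am_register, am_dispatch, count_bytes) is hypothetical and the callback signature is simplified, so it should not be read as the actual Open MPI declarations. It shows why "table + tag" is just a lookup of the slot registered for that tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI declarations. */
#define AM_TAG_MAX 256               /* a uint8_t tag gives 256 slots */

typedef void (*am_cbfunc_t)(uint8_t tag, const void *payload, size_t len,
                            void *cbdata);

typedef struct {
    am_cbfunc_t cbfunc;              /* function to run for this tag */
    void       *cbdata;              /* opaque pointer handed back to it */
} am_callback_t;

/* One slot per possible tag value, zero-initialized (no callbacks yet). */
static am_callback_t am_trigger[AM_TAG_MAX];

/* Registration: done once by an upper layer, before any traffic flows. */
static void am_register(uint8_t tag, am_cbfunc_t cbfunc, void *cbdata)
{
    assert(NULL != cbfunc);
    am_trigger[tag].cbfunc = cbfunc;
    am_trigger[tag].cbdata = cbdata;
}

/* Receiver side: find the slot by pointer arithmetic and fire it.
 * Returns 0 on success, -1 if nothing is registered for the tag. */
static int am_dispatch(uint8_t tag, const void *payload, size_t len)
{
    am_callback_t *reg = am_trigger + tag;   /* same as &am_trigger[tag] */
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, payload, len, reg->cbdata);
    return 0;
}

/* Example callback: adds the payload length to a counter passed as cbdata. */
static void count_bytes(uint8_t tag, const void *payload, size_t len,
                        void *cbdata)
{
    (void)tag;
    (void)payload;
    *(size_t *)cbdata += len;
}
```

A caller would register count_bytes for some tag with a pointer to a size_t counter, then call am_dispatch with that tag whenever a message for it arrives. Dispatching a tag nobody registered returns -1 here; a real transport would need its own policy for that case.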
Re: [OMPI devel] how to add a component in the ompi?
Hi George,

What I want to do is modify the self component to meet my needs. I just want to modify the send function of the self component to test whether my own send function, which is based on an emulation platform, is correct. So I copied all the self component code, changed the component name to mine, and then wanted to substitute its send and receive with my own send/receive functions. I don't know whether this is the right approach. If it is not, or if you need more information, please let me know.

Thanks & Regards
Yaohui Hu
Re: [OMPI devel] how to add a component in the ompi?
Yaohui,

The self component is special. While it does behave as a "normal" BTL, it takes a lot of shortcuts, as all operations happen in the memory of a single process. However, as the simplest BTL in Open MPI, I guess it is a good starting point.

As stated previously, the self BTL exhibits a lot of differences compared with the other BTLs. In your case, in the self BTL the send function triggers the receiver callback, as there is no other simple way to drain the "network". This explains why we compute the btl_active_message_callback_t directly in the send function. Usually, this is done in the progress function, once some data has been extracted from the network. Basically, everything in the mca_btl_self_send function starting from the "/* upcall */" comment is the receive operation.

george.
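To make the contrast concrete, here is a minimal sketch of the two delivery styles. It assumes a toy one-frame "wire" and hypothetical names (recv_table, self_send, net_send, net_progress); none of this is Open MPI code. The self-style send performs the receive upcall itself, while the network-style send only queues the frame together with its tag, and the progress function later drains it and triggers the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names here are hypothetical; this is not Open MPI code. */
typedef void (*recv_cb_t)(uint8_t tag, const char *data, void *cbdata);
typedef struct { recv_cb_t cbfunc; void *cbdata; } recv_slot_t;
static recv_slot_t recv_table[256];

/* A toy "wire" holding one frame; the tag travels in front of the payload. */
typedef struct { uint8_t tag; char data[32]; int full; } wire_t;
static wire_t wire;

/* Receive upcall: look up the slot for the tag and run its callback. */
static void upcall(uint8_t tag, const char *data)
{
    recv_slot_t *reg = recv_table + tag;
    assert(NULL != reg->cbfunc);
    reg->cbfunc(tag, data, reg->cbdata);
}

/* "self"-style send: sender and receiver are the same process, so the
 * send path performs the receive upcall directly. */
static void self_send(uint8_t tag, const char *data)
{
    upcall(tag, data);
}

/* Network-style send: only places the frame on the wire.
 * Returns 0 on success, -1 if the wire is still occupied. */
static int net_send(uint8_t tag, const char *data)
{
    if (wire.full) {
        return -1;
    }
    wire.tag = tag;
    strncpy(wire.data, data, sizeof(wire.data) - 1);
    wire.data[sizeof(wire.data) - 1] = '\0';
    wire.full = 1;
    return 0;
}

/* Progress function: drain the wire, recover the tag, trigger the callback.
 * Returns the number of frames delivered (0 or 1). */
static int net_progress(void)
{
    if (!wire.full) {
        return 0;
    }
    wire.full = 0;
    upcall(wire.tag, wire.data);
    return 1;
}

/* Example callback: counts how many times it has been invoked. */
static void count_calls(uint8_t tag, const char *data, void *cbdata)
{
    (void)tag;
    (void)data;
    ++*(int *)cbdata;
}
```

With count_calls registered in recv_table for a tag, self_send runs the callback before it returns, whereas after net_send the callback does not run until net_progress is called. A component copied from self and pointed at a real or emulated transport would presumably need to move its upcall into a progress path of this second kind.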
Re: [OMPI devel] Signals
I'm going to have to eat my last message. It slipped past me that your other job was started via comm_spawn. Since both "jobs" are running under the same mpirun, there shouldn't be a problem sending a signal between them. I don't know why this would be crashing.

Are you sure it is crashing in signal_job? Your trace indicates it is crashing in a print statement, yet there is no print statement in signal_job. Or did you run this with plm_base_verbose set so that the verbose prints are trying to run (could be we have a bug in one of them)?

On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:

> Well, thank you anyway :)
>
> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>
>> Yeah, that probably won't work. The current code isn't intended to cross
>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm
>> pretty sure it won't support it.
>>
>> I don't currently know any way to do what you are trying to do. We could
>> extend the signal code to handle it, I would think...but I'm not sure how
>> soon that might happen.
>>
>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>
>>> Yes... but something is going wrong... maybe the problem is that the
>>> jobid is different from the process' jobid, I don't know.
>>>
>>> I'm trying to send a signal to another process running under another job.
>>> The other process jumps into an accept_connect on the MPI comm. So I wrote
>>> code like this (I removed verification code and comments, this is just a
>>> summary for a happy execution):
>>>
>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>
>>> el_proc is defined as orte_process_name_t, not a pointer to one. And
>>> signal.h has been included for SIGUSR1's sake. But when the code enters
>>> the signal_job function it crashes. I'm trying to debug it just now... the
>>> crash is the following:
>>>
>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>> [Fialho-2:51377] *** Process received signal ***
>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>> [Fialho-2:51377] Failing at address: 0x0
>>> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib 0x7fff83a6eeaa _sigtramp + 26
>>> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib 0x7fff83a210b7 snprintf + 496
>>> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so 0x00010065ba0a mca_vprotocol_receiver_send + 177
>>> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib 0x000100077d44 MPI_Send + 734
>>> [Fialho-2:51377] [ 4] 6 ping 0x00010a97 main + 431
>>> [Fialho-2:51377] [ 5] 7 ping 0x000108e0 start + 52
>>> [Fialho-2:51377] [ 6] 8 ??? 0x0003 0x0 + 3
>>> [Fialho-2:51377] *** End of error message ***
>>>
>>> Except for the signal_job call, the code works. I have tested it by
>>> forcing an accept on the other process and skipping signal_job. But I want
>>> to send the signal to wake up the other side and to be able to manage
>>> multiple connect/accept.
>>>
>>> Thanks,
>>> Leonardo
>>>
>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>
>>>> Sure! So long as you add the include, you are okay as the ORTE layer is
>>>> "below" the OMPI one.
>>>>
>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>
>>>>> Thanks Ralph, the last question... is orte_plm.signal_job
>>>>> exposed/available to be called by a PML component? Yes, I have the
>>>>> orte/mca/plm/plm.h include line.
>>>>>
>>>>> Leonardo
>>>>>
>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>
>>>>>> It's just the orte_process_name_t jobid field. So if you have an
>>>>>> orte_process_name_t *pname, then it would just be
>>>>>>
>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>>
>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>
>>>>>>> Hum and to signal a job probably the function is
>>>>>>> orte_plm.signal_job(jobid, signal); right?
>>>>>>>
>>>>>>> Now my dummy question is how to obtain the jobid part from an
>>>>>>> orte_proc_name_t variable? Is there any magical function in the
>>>>>>> names_fns.h?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Leonardo
>>>>>>>
>>>>>>> On Mar 16, 2010, at 10:12
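The point about the jobid being a plain field of the process name can be illustrated with stand-in types. Everything below is an assumption made for the sketch: demo_process_name_t, demo_signal_job and signal_job_of are hypothetical, the field types are guesses, and the stand-in merely records its arguments instead of signalling anything. It does not reproduce the real ORTE declarations or the behavior of orte_plm.signal_job.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real ORTE types and orte_plm.signal_job differ. */
typedef uint32_t demo_jobid_t;
typedef uint32_t demo_vpid_t;

typedef struct {
    demo_jobid_t jobid;   /* which job the process belongs to */
    demo_vpid_t  vpid;    /* rank of the process inside that job */
} demo_process_name_t;

/* What the stand-in launcher was last asked to do. */
static demo_jobid_t last_jobid;
static int last_signal;

/* Stand-in for the launcher call: it only records its arguments. */
static int demo_signal_job(demo_jobid_t jobid, int sig)
{
    last_jobid = jobid;
    last_signal = sig;
    return 0;
}

/* Given a pointer to a process name, signal the whole job it belongs to. */
static int signal_job_of(const demo_process_name_t *pname, int sig)
{
    assert(NULL != pname);
    return demo_signal_job(pname->jobid, sig);
}
```

With a plain variable rather than a pointer, as in the el_proc case quoted above, the field is reached with a dot (el_proc.jobid), or the variable's address can be passed to a helper like signal_job_of.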
Re: [OMPI devel] how to add a component in the ompi?
Hi George,

Thank you very much!
I understand that what runs inside the send function mca_btl_self_send is really a receive callback. What I want to know is what the callback invoked at line 303 maps to:

303 reg = mca_btl_base_active_message_trigger + tag;
304 reg->cbfunc( btl, tag, des, reg->cbdata );

Where is this function initialized? In which file, and from which function, is mca_bml_r2_register called?

Thanks & Regards
Yaohui
Re: [OMPI devel] how to add a component in the ompi?
Hi George,

Do you have a Gmail or MSN account? I would really like to talk to you directly. That would be much faster.

Thanks & Regards
Yaohui Hu
Re: [OMPI devel] how to add a component in the ompi?
Yaohui,

The callback functions are registered by any module that can handle network communications. In your specific case I would guess it is the PML. Look in mca/pml/ob1/pml_ob1.c, starting from line 364, to see which callbacks are registered by OB1.

george.
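One possible reason a search for a direct call to the registration function comes up empty is that the function is only reachable through a slot in the module structure. The sketch below assumes exactly that arrangement, with hypothetical names throughout (module_t, register_cb, r2_like_register, upper_layer_enable) and made-up tag values; it is not the code found in pml_ob1.c or bml_r2.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the Open MPI source. */
typedef void (*cb_t)(uint8_t tag, void *cbdata);
typedef struct { cb_t cbfunc; void *cbdata; } slot_t;
static slot_t trigger_table[256];

/* A module exposes its operations as a table of function pointers. */
typedef struct {
    int (*register_cb)(uint8_t tag, cb_t cbfunc, void *cbdata);
} module_t;

/* The function that actually fills the table. It is never called by name
 * from outside; callers only see the slot in the module structure. */
static int r2_like_register(uint8_t tag, cb_t cbfunc, void *cbdata)
{
    assert(NULL != cbfunc);
    trigger_table[tag].cbfunc = cbfunc;
    trigger_table[tag].cbdata = cbdata;
    return 0;
}

static module_t module = { r2_like_register };

/* Made-up tag values standing in for an upper layer's header types. */
enum { HDR_MATCH = 65, HDR_ACK = 66 };

static void on_match(uint8_t tag, void *cbdata)
{
    (void)tag;
    *(int *)cbdata += 1;
}

static void on_ack(uint8_t tag, void *cbdata)
{
    (void)tag;
    *(int *)cbdata += 10;
}

/* The upper layer registers one callback per header type when enabled,
 * always going through the module's function pointer. */
static int upper_layer_enable(module_t *m, int *counter)
{
    int rc = m->register_cb(HDR_MATCH, on_match, counter);
    if (0 != rc) {
        return rc;
    }
    return m->register_cb(HDR_ACK, on_ack, counter);
}
```

In a layout like this, searching the source for the name of the registration function finds only its definition and the struct initializer. The call sites mention the struct member instead, so they are found by searching for the member name.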
Re: [OMPI devel] how to add a component in the ompi?
For the sake of completeness, and for the enlightenment of all interested developers, I would prefer that we keep the discussion on this mailing list (so we will have a searchable trace for the future).

george.

On Mar 17, 2010, at 01:25 , hu yaohui wrote:

> Hi George,
> Do you have a Gmail or MSN account? I would really like to talk to you
> directly. That would be much faster.
>
> Thanks & Regards
> Yaohui Hu
>
> On Wed, Mar 17, 2010 at 12:42 PM, George Bosilca wrote:
> Yaohui,
>
> The self component is special. While it does behave as a "normal" BTL, it
> takes a lot of shortcuts, as all operations are in the memory of a single
> process. However, as the simplest BTL in Open MPI, I guess it is a good
> starting point.
>
> As stated previously, the self BTL exhibits a lot of differences compared
> with the other BTLs. In your case, in the self BTL the send function
> triggers the receiver callback, as there is no other simple way to drain
> the "network". This explains why we compute the
> btl_active_message_callback_t directly in the send function. Usually, this
> is done in the progress function, once some data has been extracted from
> the network. Basically, everything in the mca_btl_self_send function
> starting from the "/* upcall */" comment is the receive operation.
>
> george.
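The point about the self BTL, that its send function performs the receive upcall itself, can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names am_callback_t, am_register, am_deliver and self_send are hypothetical, and the callback signature is simplified compared with the real one (which takes the BTL module and a descriptor). Only the table-indexed-by-tag idea and the "pointer plus tag" lookup come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a receive callback. */
typedef void (*am_cbfunc_t)(uint8_t tag, const void *payload, size_t len,
                            void *cbdata);

typedef struct {
    am_cbfunc_t cbfunc;  /* function to run when a message with this tag arrives */
    void *cbdata;        /* opaque pointer handed back to the callback */
} am_callback_t;

/* One slot per possible uint8_t tag value, zero-initialized (no callbacks). */
static am_callback_t am_trigger[256];

/* Register a callback for one tag, overwriting any previous registration. */
static void am_register(uint8_t tag, am_cbfunc_t cbfunc, void *cbdata)
{
    am_trigger[tag].cbfunc = cbfunc;
    am_trigger[tag].cbdata = cbdata;
}

/* Receive path: look the tag up and run its callback.
 * Returns 0 on success, -1 if nothing is registered for the tag. */
static int am_deliver(uint8_t tag, const void *payload, size_t len)
{
    am_callback_t *reg = am_trigger + tag;  /* same as &am_trigger[tag] */
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, payload, len, reg->cbdata);
    return 0;
}

/* "self"-style send: there is no network to drain, so sending is delivering. */
static int self_send(uint8_t tag, const void *payload, size_t len)
{
    return am_deliver(tag, payload, len);
}

/* Example callback: add the payload length to a counter passed as cbdata. */
static void count_bytes(uint8_t tag, const void *payload, size_t len,
                        void *cbdata)
{
    (void)tag;
    (void)payload;
    *(size_t *)cbdata += len;
}
```

With count_bytes registered on some tag, a call to self_send on that tag runs the callback before returning, which is the shortcut described for the self BTL. A BTL with a real network would call something like am_deliver from its progress function instead.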
Re: [OMPI devel] how to add a component in the ompi?
Hi George,

Thank you very much! I had seen these functions before, but that was a long time ago and I could not find them again. You have saved me a lot of time.

Thanks & Regards,
Yaohui Hu

On Wed, Mar 17, 2010 at 1:28 PM, George Bosilca wrote:
> Yaohui,
>
> The callback functions are registered by any module that can handle
> network communications. In your specific case I would guess it is the PML.
> Look in mca/pml/ob1/pml_ob1.c starting from line 364 to see which callbacks
> are registered by OB1.
>
> george.
>
> On Mar 17, 2010, at 01:22 , hu yaohui wrote:
>
> > Hi George,
> > Thank you very much!
> > I understand that the send function mca_btl_self_send really contains a
> > receive callback. What I want to know is what the callback invoked at
> > line 303
> >
> > 303 reg = mca_btl_base_active_message_trigger + tag;
> > 304 reg->cbfunc( btl, tag, des, reg->cbdata );
> >
> > is mapped to: where this function is initialized, and in which file and
> > which function mca_bml_r2_register is called.
> >
> > Thanks & Regards
> > Yaohui
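To make the division of labour concrete (an upper layer such as the PML registers the callbacks, and the BTL only looks them up by tag when data arrives), here is a minimal sketch. Everything in it is an assumption made for illustration: the tag values TAG_MATCH and TAG_ACK, the one-byte wire header, and the function names are hypothetical and are not taken from pml_ob1.c or from any real BTL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive callback: message body, its length, user data. */
typedef void (*recv_cb_t)(const uint8_t *body, size_t len, void *cbdata);

typedef struct {
    recv_cb_t cbfunc;
    void *cbdata;
} slot_t;

/* Callback table indexed by the 8-bit tag; empty until someone registers. */
static slot_t table[256];

/* Hypothetical tag values chosen for this sketch only. */
enum { TAG_MATCH = 65, TAG_ACK = 66 };

/* State owned by the upper layer; the lower layer never inspects it. */
typedef struct {
    int matches;
    int acks;
} upper_state_t;

static void on_match(const uint8_t *body, size_t len, void *cbdata)
{
    (void)body;
    (void)len;
    ((upper_state_t *)cbdata)->matches++;
}

static void on_ack(const uint8_t *body, size_t len, void *cbdata)
{
    (void)body;
    (void)len;
    ((upper_state_t *)cbdata)->acks++;
}

/* Upper layer start-up: register one callback per message kind it handles. */
static void upper_init(upper_state_t *st)
{
    table[TAG_MATCH] = (slot_t){ on_match, st };
    table[TAG_ACK]   = (slot_t){ on_ack, st };
}

/* Lower layer progress step for one frame taken off the "network".
 * In this sketch the tag is pushed explicitly as the first byte.
 * Returns 0 if a callback ran, -1 for an empty frame or unknown tag. */
static int lower_progress(const uint8_t *frame, size_t len)
{
    if (len < 1) {
        return -1;
    }
    slot_t *reg = table + frame[0];
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(frame + 1, len - 1, reg->cbdata);
    return 0;
}
```

The lower layer never needs to know what on_match or on_ack do. That is why tracing the table from the BTL side leads to the registration function, while the callbacks themselves live in whichever module called it.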
Re: [OMPI devel] how to add a component in the ompi?
OK, got it!

On Wed, Mar 17, 2010 at 1:31 PM, George Bosilca wrote:
> For the sake of completeness, and for the enlightenment of all interested
> developers, I would prefer that we keep the discussion on this mailing
> list (so we will have a searchable trace for the future).
>
> george.
Re: [OMPI devel] Signals
Ralph, don't swallow your message yet... The two jobs are not running under the same mpirun. There are two instances of mpirun: one runs with "-report-uri ../contact.txt" and the other receives its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with plm_base_verbose activated. When I deactivate plm_base_verbose the error is practically the same:

[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2 libSystem.B.dylib 0x7fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3 libSystem.B.dylib 0x7fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so 0x00010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5 libmpi.0.dylib 0x000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6 ping 0x00010a97 main + 431
[aopclf:54106] [ 5] 7 ping 0x000108e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

> I'm going to have to eat my last message. It slipped past me that your other
> job was started via comm_spawn. Since both "jobs" are running under the same
> mpirun, there shouldn't be a problem sending a signal between them.
>
> I don't know why this would be crashing. Are you sure it is crashing in
> signal_job? Your trace indicates it is crashing in a print statement, yet
> there is no print statement in signal_job. Or did you run this with
> plm_base_verbose set so that the verbose prints are trying to run (could be
> we have a bug in one of them)?
>
> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>
>> Well, thank you anyway :)
>>
>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>
>>> Yeah, that probably won't work. The current code isn't intended to cross
>>> jobs like that - I'm sure nobody ever tested it for that idea, and I'm
>>> pretty sure it won't support it.
>>>
>>> I don't currently know any way to do what you are trying to do. We could
>>> extend the signal code to handle it, I would think...but I'm not sure how
>>> soon that might happen.
>>>
>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>
>>>> Yes... but something is going wrong... maybe the problem is that the
>>>> jobid is different from the process' own jobid, I don't know.
>>>>
>>>> I'm trying to send a signal to another process running under another job.
>>>> The other process jumps into an accept_connect on the MPI comm. So I wrote
>>>> code like this (I removed the verification code and comments; this is just
>>>> a summary of the happy path):
>>>>
>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>
>>>> el_proc is defined as an orte_process_name_t, not a pointer to one, and
>>>> signal.h has been included for SIGUSR1's sake. But when the code enters the
>>>> signal_job function it crashes. I'm trying to debug it right now... the
>>>> crash is the following:
>>>>
>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
>>>> [Fialho-2:51377] *** Process received signal ***
>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>> [Fialho-2:51377] Failing at address: 0x0
>>>> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib 0x7fff83a6eeaa _sigtramp + 26
>>>> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib 0x7fff83a210b7 snprintf + 496
>>>> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so 0x00010065ba0a mca_vprotocol_receiver_send + 177
>>>> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib 0x000100077d44 MPI_Send + 734
>>>> [Fialho-2:51377] [ 4] 6 ping 0x00010a97 main + 431
>>>> [Fialho-2:51377] [ 5] 7 ping 0x000108e0 start + 52
>>>> [Fialho-2:51377] [ 6] 8 ??? 0x0003 0x0 + 3
>>>> [Fialho-2:51377] *** End of error message ***
Re: [OMPI devel] Signals
Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I didn't see the message indicating it was sending the signal cmd out in your prior debug output, and there isn't a printf in that code loop other than the debug output. Can you attach to the process and get more info?
Re: [OMPI devel] Signals
To clarify a little more: I'm calling orte_plm.signal_job from a PML component. I know that ORTE is below OMPI, but I suspect that this function may not be available there, or something like that. I can't figure out where this snprintf comes from either; in my code there is only

opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
            SIGUSR1, ORTE_NAME_PRINT(&el_proc));
orte_plm.signal_job(el_proc.jobid, SIGUSR1);

And the first output/printf works fine. Well... I ran the program under gdb, and I can see this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
0x0000000000000000 in ?? ()
(gdb) backtrace
#0 0x0000000000000000 in ?? ()
#1 0x00010065c319 in vprotocol_receiver_eventlog_connect (el_comm=0x10065d178) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2 0x00010065ba9a in mca_vprotocol_receiver_send (buf=0x10050, count=262144, datatype=0x100263d60, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
#3 0x000100077d44 in MPI_Send ()
#4 0x00010a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45

Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job call. After that there are only zeros and question marks... Is the signal_job function actually available at that point? I really don't understand what all those zeros mean.

Leonardo
Re: [OMPI devel] Signals
Can you print out the value of orte_plm.signal_job? I bet it is pointing to address 0. So the question is: is orte_plm actually initialized in an MPI process? My guess would be no, but I am sure Ralph will be able to answer more definitively.

--td
Re: [OMPI devel] Signals
Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of view?

Leonardo
Re: [OMPI devel] Signals
On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
> Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of view?

It might be because plm's are really only used at launch time, not in MPI processes. Note plm != pml.

--td
Re: [OMPI devel] Signals
Yes, I know the difference :)

I'm trying to call orte_plm.signal_job from a PML component. I thought the PLM stayed resident after launching, but you're right: it does so only for mpirun and orted.
[OMPI devel] Problem with MPI_Type_indexed and hole (defined with MPI_Type_create_resized )
Hi all,

I use a very simple datatype defined as follows:

    lng[0] = 1;
    dsp[0] = 1;
    err = MPI_Type_indexed(1, lng, dsp, MPI_CHAR, &offtype);
    err = MPI_Type_create_resized(offtype, 0, 2, &filetype);
    MPI_Type_commit(&filetype);

This datatype consists of a hole (of length 1 char) followed by a char. A datatype with a hole at the beginning is not correctly handled by the ROMIO integrated in OpenMPI (I tried with MPICH2 and it worked fine). A program that reproduces the problem is given below.

After investigation, I see that the difference between OpenMPI and MPICH appears at line 542 in the file romio/adio/comm/flatten.c:

    case MPI_COMBINER_RESIZED:
        /* This is done similar to a type_struct with an lb, datatype, ub */

        /* handle the Lb */
        j = *curr_index;
        flat->indices[j] = st_offset + adds[0];
        flat->blocklens[j] = 0;

        (*curr_index)++;

        /* handle the datatype */
        MPI_Type_get_envelope(types[0], &old_nints, &old_nadds,
                              &old_ntypes, &old_combiner);
        ADIOI_Datatype_iscontig(types[0], &old_is_contig);   <== line 542

For MPICH2, the datatype is not contiguous, but for OpenMPI it is. The routine ADIOI_Datatype_iscontig is quite different in OpenMPI because the datatypes are handled very differently. If I reset old_is_contig just after line 542, the problem disappears (of course, this is not a solution). I am not able to propose a proper solution. Can somebody help?

Pascal

Program to reproduce the problem:

    #include <stdio.h>
    #include "mpi.h"

    char filename[256] = "VIEW_TEST";
    char buffer[100];
    int err, i, myid, dsp[3], lng[3];
    MPI_Status status;
    MPI_File fh;
    MPI_Datatype filetype, offtype;
    MPI_Aint lb, extent;

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        /* The loop bound, the loop body and the rank guard below were lost
           in the archived copy. They are reconstructed here as an assumption
           from the reported output (byte i holds the value i) and from the
           unmatched closing brace before MPI_Barrier. */
        for (i = 0; i < (int)sizeof(buffer); i++)
            buffer[i] = i;

        if (myid == 0) {
            /* Write the reference data with the default view. */
            MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
                          MPI_INFO_NULL, &fh);
            MPI_File_write(fh, buffer, sizeof(buffer), MPI_CHAR, &status);
            MPI_File_close(&fh);

            /* Filetype: a one-char hole followed by one char, extent 2. */
            lng[0] = 1;
            dsp[0] = 1;
            MPI_Type_indexed(1, lng, dsp, MPI_CHAR, &offtype);
            MPI_Type_create_resized(offtype, 0, 2, &filetype);
            MPI_Type_commit(&filetype);

            /* Read the file back through the view. */
            MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
            MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
            MPI_File_read(fh, buffer, 5, MPI_CHAR, &status);

            printf("Data: ");
            for (i = 0; i < 5; i++)
                printf(" %x ", buffer[i]);

            /* The test compares against 3; the message prints 4, as in the
               original report. */
            if (buffer[1] != 3)
                printf("\n ===> test KO : buffer[1]=%d instead of %d \n", buffer[1], 4);
            else
                printf("\n ===> test OK\n");

            MPI_Type_free(&filetype);
            MPI_File_close(&fh);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

The result of the program with MPICH2:

    Data: 1 3 5 7 9
    ===> test OK

The result of the program with OpenMPI:

    Data: 0 2 4 6 8
    ===> test KO : buffer[1]=2 instead of 4

Comment: only the first hole is omitted.
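The offsets that the file view should expose can be worked out by hand. The following is a minimal sketch in plain C with no MPI calls; view_offsets is a hypothetical helper introduced only for illustration, not part of MPI or ROMIO. It assumes a filetype made of a single block of chars placed at a displacement inside a fixed extent, tiled from file offset 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not an MPI or ROMIO function): list the byte offsets
 * visible through a view whose filetype is one block of `blocklen` chars
 * placed `disp` chars into a tile of `extent` chars.
 * Writes at most `count` offsets into `out` and returns how many it wrote.
 */
size_t view_offsets(size_t disp, size_t blocklen, size_t extent,
                    size_t count, size_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    if (blocklen == 0 || extent == 0)
        return 0;                       /* nothing is visible, avoid looping */

    for (size_t tile = 0; n < count; tile++) {
        /* Each tile starts at tile * extent; the hole is the first `disp` bytes. */
        for (size_t k = 0; k < blocklen && n < count; k++)
            out[n++] = tile * extent + disp + k;
    }
    return n;
}
```

With disp = 1, blocklen = 1 and extent = 2, the first five offsets are 1, 3, 5, 7, 9, which is what MPICH2 reads if byte i of the file holds the value i. With disp = 0 (the leading hole dropped) they are 0, 2, 4, 6, 8, which matches the OpenMPI result.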
Re: [OMPI devel] Signals
Sorry, I was out snowshoeing today - and about 3 miles out, I suddenly realized the problem :-/

Terry is correct - we don't initialize the plm framework in application processes. However, there is a default proxy module for that framework so that applications can call comm_spawn. Unfortunately, I never filled in the rest of the module function pointers because (a) there was no known reason for apps to be using them (as Jeff points out), and (b) there is no MPI call that interfaces to them.

I can (and will) make it work over the next day or two - there is no reason why this can't be done. It just wasn't implemented due to lack of reason to do so.

Sorry for the confusion - old man brain fizzing out again.
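The crash at address 0 discussed in this thread is what a call through an unset function pointer looks like. Below is a minimal sketch of the situation, assuming a module that is just a table of function pointers; demo_module_t, demo_proxy and demo_signal_job are hypothetical names for illustration and are not ORTE's actual types or API.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SUCCESS              0
#define DEMO_ERR_NOT_IMPLEMENTED (-1)

/* Hypothetical stand-in for a framework module: a table of function pointers. */
typedef struct {
    int (*spawn)(int jobid);
    int (*signal_job)(int jobid, int signo);
} demo_module_t;

static int demo_spawn(int jobid)             { (void)jobid; return DEMO_SUCCESS; }
static int demo_do_signal(int jobid, int sg) { (void)jobid; (void)sg; return DEMO_SUCCESS; }

/* Proxy-style module: only spawn is filled in, so signal_job is NULL. */
const demo_module_t demo_proxy = { .spawn = demo_spawn };

/* Fully populated module, for comparison. */
const demo_module_t demo_full = { .spawn = demo_spawn, .signal_job = demo_do_signal };

/*
 * Guarded call: report "not implemented" instead of jumping to address 0
 * when the module leaves the entry unset.
 */
int demo_signal_job(const demo_module_t *m, int jobid, int signo)
{
    assert(m != NULL);                  /* a missing module is a programming error */
    if (m->signal_job == NULL)
        return DEMO_ERR_NOT_IMPLEMENTED;
    return m->signal_job(jobid, signo);
}
```

Calling demo_signal_job(&demo_proxy, ...) returns DEMO_ERR_NOT_IMPLEMENTED, whereas calling demo_proxy.signal_job(...) directly would dereference a null function pointer, which is consistent with the "Failing at address: 0x0" report above.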
Re: [OMPI devel] Signals
Anyway, to signal another job I have sent an RML message with the ORTE_DAEMON_SIGNAL_LOCAL_PROCS command to the proc's HNP.

Leonardo
Re: [OMPI devel] Signals
Very good - that is pretty much all that the signal_job API does.
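The workaround amounts to sending a small command message (which command, which job, which signal) to the process that manages the target job. The sketch below only illustrates that idea of packing and unpacking such a message. It is not ORTE's buffer format, and it does not use the RML or any ORTE call; the message layout, the command value and all demo_ names are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_CMD_SIGNAL_LOCAL_PROCS = 1 };   /* hypothetical command value */
#define DEMO_MSG_SIZE 9                      /* 1 byte command + 4 jobid + 4 signal */

typedef struct {
    uint8_t  command;
    uint32_t jobid;
    int32_t  signo;
} demo_signal_msg_t;

/* Store and load 32-bit values in big-endian order, independent of the host. */
static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}

static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
size_t demo_pack(const demo_signal_msg_t *m, uint8_t *buf, size_t len)
{
    assert(m != NULL && buf != NULL);
    if (len < DEMO_MSG_SIZE)
        return 0;
    buf[0] = m->command;
    put_u32(buf + 1, m->jobid);
    put_u32(buf + 5, (uint32_t)m->signo);
    return DEMO_MSG_SIZE;
}

/* Returns 0 on success, -1 if the buffer is too short to hold a message. */
int demo_unpack(const uint8_t *buf, size_t len, demo_signal_msg_t *m)
{
    assert(m != NULL && buf != NULL);
    if (len < DEMO_MSG_SIZE)
        return -1;
    m->command = buf[0];
    m->jobid   = get_u32(buf + 1);
    m->signo   = (int32_t)get_u32(buf + 5);
    return 0;
}
```

The receiver would unpack the message, check the command, and deliver the signal to its local processes of the given job; a fixed byte order keeps the message readable when sender and receiver run on different architectures.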
[OMPI devel] Migrate OpenMPI to the VxWorks
Hello all,

In order to add some real-time features to OpenMPI for research purposes, I need a version of OpenMPI running on VxWorks. But after going through the Open-MPI website, I can’t find any indication that it supports VxWorks.

Following the thread posted by Ralph Castain, http://www.open-mpi.org/community/lists/users/2006/06/1371.php, I read some papers about OpenRTE, such as “Creating a transparent, distributed, and resilient computing environment: the OpenRTE project” and “The Open Run-Time Environment (OpenRTE): A Transparent Multi-cluster Environment for High-Performance Computing”, written by Ralph H. Castain, Jeffrey M. Squyres and others.

Now I have a basic understanding of OpenRTE; however, there is too little documentation describing its implementation. I don’t know where and how to begin the migration.

Any advice will be appreciated.

Thanks,
Jing Zhang