[OMPI devel] 1.5.2rc4 is posted
There was only 1 very minor change (to the FCA coll) since rc3. We expect to do minor sanity tests on this tarball and release it as 1.5.2 final.

http://www.open-mpi.org/software/ompi/v1.5/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] Communication Failure with orted_comm.c
Hello all.

I've got a problem in a communication between v_protocol_receiver_component.c and orted_comm.c.

In mca_vprotocol_receiver_component_init I've added a request that is received correctly by orte_daemon_process_commands, but when I try to reply to the sender I get the following error:

[clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2bb03d40]
[clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2ad760db]
[clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2ad75aa4]
[clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2e2d2fdd]
[clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2ad42cb0]
[clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2ad19ca6]
[clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2ad18a55]
[clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2ad9710e]
[clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2ad974bb]
[clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2ad972ad]
[clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2ad97166]
[clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2ad17556]
[clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
[clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2bd2d8a4]
[clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
[clus1:15593] *** End of error message ***

The code I've added to v_protocol_receiver_component.c is (the failing recv is the orte_rml.recv_buffer() call):

int mca_vprotocol_receiver_request_protector(void) {
    orte_daemon_cmd_flag_t command;
    opal_buffer_t *buffer = NULL;
    int n = 1;

    command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;

    buffer = OBJ_NEW(opal_buffer_t);
    opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);

    orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);

    orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);

    orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
    orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;

    OBJ_RELEASE(buffer);

    return OMPI_SUCCESS;
}

The code I've added to orted_comm.c is (the failing send is the orte_rml.send_buffer() call):

    case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
        if (orte_debug_daemons_flag) {
            opal_output(0, "%s orted_recv: received request protector from local proc %s",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                        ORTE_NAME_PRINT(sender));
        }
        /* Define the protector */
        protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
        if (protector >= (uint32_t)orte_process_info.num_procs) {
            protector = 0;
        }

        /* Pack the protector data */
        answer = OBJ_NEW(opal_buffer_t);

        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (orte_debug_daemons_flag) {
            opal_output(0, "EL PROTECTOR ASIGNADO para %s ES: %d\n",
                        ORTE_NAME_PRINT(sender), protector);
        }

        /* Send the protector data */
        if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {
            ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
            ret = ORTE_ERR_COMM_FAILURE;
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        OBJ_RELEASE(answer);

From my tests I assume the error is in those send/recv calls, maybe because I'm missing some statement when I try to communicate, or maybe this communication cannot be done.

Any help will be appreciated. Thanks a lot.

Hugo Meyer
Re: [OMPI devel] Communication Failure with orted_comm.c
What value did you set for this new command? Did you look at the cmds in orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?

On Mar 8, 2011, at 6:15 AM, Hugo Meyer wrote:

> I've got a problem in a communication between v_protocol_receiver_component.c and orted_comm.c.
> [...]
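For context, the daemon commands are collected in orte/mca/odls/odls_types.h as small integer defines. A hypothetical sketch of how the new command would be declared there, following the pattern of the existing entries (the value shown is an assumption and must not collide with any command already in that file):

    /* hypothetical addition to orte/mca/odls/odls_types.h */
    #define ORTE_DAEMON_REQUEST_PROTECTOR_CMD    (orte_daemon_cmd_flag_t) 31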
Re: [OMPI devel] Communication Failure with orted_comm.c
Yes, I set the value 31 and it is not duplicated.

2011/3/8 Ralph Castain

> What value did you set for this new command? Did you look at the cmds in
> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
> [...]
Re: [OMPI devel] Communication Failure with orted_comm.c
The comm can most certainly be done - there are other sections of that code that also send messages. I can't see the end of your new code section, but I assume you ended it properly with a "break"? Otherwise, you'll execute whatever lies below it as well.

On Mar 8, 2011, at 8:19 AM, Hugo Meyer wrote:

> Yes, I set the value 31 and it is not duplicated.
> [...]
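To illustrate Ralph's point, a minimal sketch of how the new case in the daemon's command switch would be terminated; everything except the final break is abbreviated from the snippet in the original message, and the surrounding switch and CLEANUP label are assumed from that code rather than reproduced here:

    case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
        /* ... pack the protector data and send the answer,
         *     as in the snippet from the original message ... */
        OBJ_RELEASE(answer);
        break;  /* without this, execution falls through into whatever follows the case */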
Re: [OMPI devel] Communication Failure with orted_comm.c
Yes, after the release there is a break. I'm sending all my output now; maybe that helps more. But the code is basically the one I sent. The normal execution reaches the send/receive between the orted_comm and the receiver.

Best regards.

Hugo

2011/3/8 Ralph Castain

> The comm can most certainly be done - there are other sections of that code
> that also send messages. I can't see the end of your new code section, but I
> assume you ended it properly with a "break"?
> [...]
Re: [OMPI devel] Communication Failure with orted_comm.c
Hmmm... well, the output indicates both daemons crashed, but doesn't really indicate where the crash occurs. If you have a core file, perhaps you can get a line number.

Are you perhaps trying to send to someone who died?

One nit: in your vprotocol code, you re-use buffer in the send and recv. That's okay, but you need to OBJ_RELEASE the buffer after the send and before calling recv.

On Mar 8, 2011, at 8:45 AM, Hugo Meyer wrote:

> Yes, after the release there is a break. I'm sending all my output now; maybe that
> helps more. But the code is basically the one I sent. The normal execution reaches
> the send/receive between the orted_comm and the receiver.
> [...]
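One way to read that advice, sketched against the mca_vprotocol_receiver_request_protector() snippet above; only the buffer handling between the send and the recv changes, the pack/unpack calls and the tag arguments are left exactly as in the original:

    buffer = OBJ_NEW(opal_buffer_t);
    opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
    orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);

    /* release the buffer used for the send ... */
    OBJ_RELEASE(buffer);
    /* ... and create a fresh one to receive the reply into */
    buffer = OBJ_NEW(opal_buffer_t);

    orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);
    /* unpack jobid/vpid as in the original snippet, then release the buffer */
    OBJ_RELEASE(buffer);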
Re: [OMPI devel] Communication Failure with orted_comm.c
The stack trace indicates that your orted segfaulted in orte_odls_base_notify_iof_complete, which means it received a message that was interpreted as an ORTE_DAEMON_IOF_COMPLETE (21). Nothing more to get out of your output, unfortunately.

  george.

On Mar 8, 2011, at 08:15 , Hugo Meyer wrote:

> In mca_vprotocol_receiver_component_init I've added a request that is received
> correctly by orte_daemon_process_commands, but when I try to reply to the sender
> I get the following error:
> [...]
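Following George's observation, one quick way to check which command the daemon actually decoded is to trace it right after the unpack in orte_daemon_process_commands(). The sketch below mirrors the pack call from the original snippet; it is an assumption about the surrounding daemon code, not a verbatim excerpt:

    orte_daemon_cmd_flag_t command;
    int32_t n = 1;

    /* unpack the command flag the sender packed with ORTE_DAEMON_CMD */
    if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &command, &n, ORTE_DAEMON_CMD))) {
        ORTE_ERROR_LOG(ret);
        goto CLEANUP;
    }
    /* temporary trace: which command value did this daemon really receive? */
    opal_output(0, "%s orted received command %d",
                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)command);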
[OMPI devel] BTL preferred_protocol , large message
Hi Jeff,

I'm working on large-message exchange optimization. My optimization consists in "choosing the best protocol for each large message". In fact:
- for each device, the way to choose the best protocol is different;
- the fastest protocol for a given device depends on that device's hardware and on the message specifications.

So the device/BTL itself is the best place to dynamically select the fastest protocol.

Presently, for large messages, the protocol selection is based only on device capabilities. My optimization consists in asking the device/BTL for a "preferred protocol" and then making a choice based on the device capabilities and the BTL's recommendation.

Technical view:
The optimization is located in mca_pml_ob1_send_request_start_btl(), after the device/btl selection. In the large-message section, I call a new function:
    mca_pml_ob1_preferred_protocol() => mca_bml_base_preferred_protocol()
This one will try to call btl->btl_preferred_protocol(). So, selecting a protocol before a large message is not in the critical path. It is the BTL's responsibility to define this function to select a preferred protocol. If this function is not defined, nothing changes in the code path.

To do this optimization, I had to add an interface to the BTL module structure in btl.h; this is the drawback.

I have already used this feature to optimize the "shared memory" device/BTL: I use the "preferred protocol" feature to enable/disable KNEM according to intra/inter-socket communication. This optimization increases IMB pingping benchmark bandwidth by ~36%.

The next step is now to use the "preferred protocol" feature with openib (with many IB cards).

Attached 2 patches:
1) BTL_preferred.patch: introduces the new preferred protocol interface
2) SM_KNEM_intra_socket.patch: defines the preferred protocol for the sm btl

Note: Since the "ess" framework can't give us the socket locality information, I used hitopo, which was proposed in an RFC some time ago:
http://www.open-mpi.org/community/lists/devel/2010/11/8677.php

diff -r 486ca4bfca95 ompi/mca/bml/bml.h
--- a/ompi/mca/bml/bml.h    Mon Feb 07 15:40:31 2011 +0100
+++ b/ompi/mca/bml/bml.h    Tue Mar 08 15:50:13 2011 +0100
@@ -291,6 +291,17 @@ static inline int mca_bml_base_send_stat
     return btl->btl_send(btl, bml_btl->btl_endpoint, des, tag);
 }
 
+static inline int mca_bml_base_preferred_protocol( mca_bml_base_btl_t* bml_btl, size_t size)
+{
+    mca_btl_base_module_t* btl = bml_btl->btl;
+    /* On selected btl, if btl_preferred_protocol() is defined, use it */
+    if(btl->btl_preferred_protocol != NULL)
+        return btl->btl_preferred_protocol( bml_btl->btl, bml_btl->btl_endpoint, size);
+    else
+        /* No preferred protocol. Protocol must be selected from device capabilities only */
+        return MCA_BTL_FLAGS_NONE;
+}
+
 static inline int mca_bml_base_sendi( mca_bml_base_btl_t* bml_btl,
                                       struct opal_convertor_t* convertor,
                                       void* header,
diff -r 486ca4bfca95 ompi/mca/btl/btl.h
--- a/ompi/mca/btl/btl.h    Mon Feb 07 15:40:31 2011 +0100
+++ b/ompi/mca/btl/btl.h    Tue Mar 08 15:50:13 2011 +0100
@@ -169,6 +169,7 @@ typedef uint8_t mca_btl_base_tag_t;
 #define MCA_BTL_TAG_UDAPL (MCA_BTL_TAG_BTL + 1)
 
 /* prefered protocol */
+#define MCA_BTL_FLAGS_NONE 0x0000
 #define MCA_BTL_FLAGS_SEND 0x0001
 #define MCA_BTL_FLAGS_PUT  0x0002
 #define MCA_BTL_FLAGS_GET  0x0004
@@ -752,6 +753,16 @@ typedef void (*mca_btl_base_module_dump_
     int verbose
 );
 
+/**
+ * query preferred_protocol for current message
+ */
+
+typedef int (*mca_btl_base_module_preferred_protocol_fn_t)(
+    struct mca_btl_base_module_t* btl,
+    struct mca_btl_base_endpoint_t* endpoint,
+    size_t size
+);
+
 /**
  * Fault Tolerance Event Notification Function
  * @param state Checkpoint Status
@@ -792,6 +803,7 @@ struct mca_btl_base_module_t {
     mca_btl_base_module_put_fn_t       btl_put;
     mca_btl_base_module_get_fn_t       btl_get;
     mca_btl_base_module_dump_fn_t      btl_dump;
+    mca_btl_base_module_preferred_protocol_fn_t btl_preferred_protocol;
 
     /** the mpool associated with this btl (optional) */
     mca_mpool_base_module_t*           btl_mpool;
diff -r 486ca4bfca95 ompi/mca/btl/elan/btl_elan.c
--- a/ompi/mca/btl/elan/btl_elan.c    Mon Feb 07 15:40:31 2011 +0100
+++ b/ompi/mca/btl/elan/btl_elan.c    Tue Mar 08 15:50:13 2011 +0100
@@ -654,6 +654,7 @@ mca_btl_elan_module_t mca_btl_elan_modul
         mca_btl_elan_put,
         mca_btl_elan_get,
         mca_btl_elan_dump,
+        NULL, /* preferred protocol */
         NULL, /* mpool */
         mca_btl_elan_register_error, /* register error cb */
         mca_btl_elan_ft_event /* mca_btl_elan_ft_event*/
diff -r 486ca
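To illustrate how a BTL would use this hook, here is a minimal, hypothetical sketch of what an sm-style btl_preferred_protocol implementation could look like; it is not the contents of SM_KNEM_intra_socket.patch. The on_same_socket() helper and the 4 KB threshold are placeholders (the real patch gets socket locality from hitopo), and only the function signature is taken from the typedef added to btl.h above:

/* Hypothetical locality query -- stands in for whatever hitopo provides. */
static int on_same_socket(struct mca_btl_base_endpoint_t* endpoint);

static int mca_btl_sm_preferred_protocol( struct mca_btl_base_module_t* btl,
                                          struct mca_btl_base_endpoint_t* endpoint,
                                          size_t size )
{
    /* Intra-socket large messages: recommend the single-copy (KNEM) get path. */
    if( on_same_socket(endpoint) && size >= 4096 ) {
        return MCA_BTL_FLAGS_GET;
    }
    /* Inter-socket (or smaller) messages: recommend the copy-in/copy-out send path. */
    return MCA_BTL_FLAGS_SEND;
}

The module would then publish this function in its mca_btl_base_module_t instance, in the new btl_preferred_protocol slot shown in the btl.h hunk above.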
Re: [OMPI devel] BTL preferred_protocol , large message
On Mar 8, 2011, at 12:12 , Damien Guinier wrote:

> Hi Jeff

Sorry, your email went to the devel mailing list of Open MPI.

> Presently, for large messages, the protocol selection is based only on device
> capabilities. My optimization consists in asking the device/BTL for a
> "preferred protocol" and then making a choice based on the device capabilities
> and the BTL's recommendation.
> [...]

As a BTL will not randomly change its preferred protocol, one can assume it will depend on the peer. Here is a similar approach to the one you describe in your email, but without modification of the BTL interface.

https://fs.hlrs.de/projects/eurompi2010/TALKS/WEDNESDAY_AFTERNOON/george_bosilca_locality_and_topology_aware.pdf

  george.

"I disapprove of what you say, but I will defend to the death your right to say it"
  -- Evelyn Beatrice Hall
[OMPI devel] multi-threaded test
I've been assigned CMR 2728, which is to apply some thread-support changes to 1.5.x. The trac ticket has amusing language about "needs testing". I'm not sure what that means. We rather consistently say that we don't promise anything with regard to true thread support. We specifically say certain BTLs are off limits, and we say things are poorly tested and can be expected to break. Given all that, what does it mean to test thread support in OMPI?

One option, specifically in the context of this CMR, is to test only configuration options and so on. I've done this.

Another possibility is to confirm that simple run-time tests of multi-threaded message passing succeed. I'm having trouble with this. Attached is a simple test. It passes over sm but fails over TCP. (One or both of the initial messages is not received.) How high should I set my sights on this?

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <string.h> /* memset */

#define N 1

int main(int argc, char **argv) {
    int np, me, buf[2][N], provided;

    /* init some stuff */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if ( provided < MPI_THREAD_MULTIPLE ) MPI_Abort(MPI_COMM_WORLD, -1);

    /* initialize the buffers */
    memset(buf[0], 0, N * sizeof(int));
    memset(buf[1], 0, N * sizeof(int));

    /* test */
#pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        MPI_Status st;

        printf("%d %d in parallel region\n", me, id); fflush(stdout);

        /* pingpong */
        if ( me == 0 ) {
            MPI_Send(buf[id], N, MPI_INT, 1, 7+id, MPI_COMM_WORLD);
            printf("%d %d sent\n", me, id); fflush(stdout);
            MPI_Recv(buf[id], N, MPI_INT, 1, 7+id, MPI_COMM_WORLD, &st);
            printf("%d %d recd\n", me, id); fflush(stdout);
        } else {
            MPI_Recv(buf[id], N, MPI_INT, 0, 7+id, MPI_COMM_WORLD, &st);
            printf("%d %d recd\n", me, id); fflush(stdout);
            MPI_Send(buf[id], N, MPI_INT, 0, 7+id, MPI_COMM_WORLD);
            printf("%d %d sent\n", me, id); fflush(stdout);
        }
    }

    MPI_Finalize();
    return 0;
}

#!/bin/csh
mpicc -xopenmp -m64 -O5 main.c
mpirun -np 2 --mca btl self,sm  ./a.out
mpirun -np 2 --mca btl self,tcp ./a.out