[OMPI devel] Problem with BTL while allocating window
Hi everyone,

I've been facing issues with the creation of windows (MPI_Win_create). Maybe it's an already known issue, or maybe you will be able to tell me where to look to find the problem.

I've been developing some benchmarks to evaluate the overhead of a monitoring module. Everything works fine for PML-based operations (coll and point-to-point), but I get errors while creating windows, even without the monitoring component: I launch my applications with --mca pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring, so my components shouldn't be loaded.

From what I've tracked, while initializing the osc_rdma module, a btl is selected that can't be found again when calling ompi_osc_rdma_peer_btl_endpoint().

Here are the traces of a problematic example (with 4 processes; the current process is 1). All processes are on one node:

Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850) at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735 *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30

Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0) at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54 return NULL;
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30

It seems that for odd proc_ids (peers 1 and 3), the selected btl (0x759e30) can be retrieved, but not for the even ones (peers 0 and 2). I haven't looked deeply into the library to explain this behavior yet. Do you have any idea where to look?

Thank you in advance,

Clément FOYER
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
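To make the observation in the trace easier to check, here is a small stand-alone C sketch. It is not Open MPI code: the type `peer_rdma_btls_t`, the tables `PEERS` and `SELECTED_BTL`, and the function `find_selected_btl` are hypothetical names, and it assumes the failing lookup amounts to scanning a peer's btl_rdma array for the selected module pointer. The addresses are copied from the gdb session.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one peer's btl_rdma array (hypothetical type,
 * not the Open MPI structure). */
typedef struct {
    size_t   count;
    uint64_t btls[2];
} peer_rdma_btls_t;

/* Module pointer printed as selected_btl in the trace. */
static const uint64_t SELECTED_BTL = 0x759e30;

/* btl_rdma contents for peers 0..3, copied from the gdb output. */
static const peer_rdma_btls_t PEERS[4] = {
    { 2, { 0x76cab0, 0x72a680 } },        /* peer 0 */
    { 2, { 0x759e30, 0x7fffec275200 } },  /* peer 1 */
    { 2, { 0x76cab0, 0x72a680 } },        /* peer 2 */
    { 1, { 0x759e30, 0 } },               /* peer 3 */
};

/* Return the index of `selected` in the peer's array, or -1 if the
 * peer has no entry for it (assumed to be the "return NULL" case). */
static int find_selected_btl(const peer_rdma_btls_t *peer, uint64_t selected)
{
    assert(peer != NULL);
    for (size_t i = 0; i < peer->count; ++i) {
        if (peer->btls[i] == selected) {
            return (int) i;
        }
    }
    return -1;
}
```

Calling `find_selected_btl(&PEERS[p], SELECTED_BTL)` for p = 0..3 gives -1, 0, -1, 0: only the odd peers carry the selected module in their array.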
Re: [OMPI devel] Problem with BTL while allocating window
Update:

From what I've tracked, while initializing the osc_rdma module, a btl is selected whose endpoint can't be found again when calling ompi_osc_rdma_peer_btl_endpoint().

It seems that for even peers, the available btl endpoints are tcp, even though we only find openib and ugni in ompi_osc_rdma_btl_names.

(gdb) p ompi_osc_rdma_btl_names
$29 = 0x7981e0 "openib,ugni"
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$39 = "self", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module
$40 = (mca_btl_base_module_t *) 0x7fffed102100
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$41 = "openib", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module
$42 = (mca_btl_base_module_t *) 0x759e30
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$43 = "sm", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module
$44 = (mca_btl_base_module_t *) 0x7fffec89e200
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$46 = "tcp", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$47 = (mca_btl_base_module_t *) 0x76cab0
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$48 = "tcp", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$49 = (mca_btl_base_module_t *) 0x72a680
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$50 = "vader", '\000'
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$51 = (mca_btl_base_module_t *) 0x7fffec275200
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$52 = (mca_btl_base_module_t *) 0x0

Sorry for the noise in your mailboxes. I thought it might be valuable to know what these pointers point to.

Clement FOYER

On 02/02/2017 11:17 AM, Clement FOYER wrote:
> Hi everyone, I've been facing issues with the creations of windows (MPI_Win_create).
> [...]
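The pointer-to-name mapping in the update can be cross-checked mechanically. The following C sketch is only an illustration under stated assumptions: `btl_entry_t`, `INITIALIZED`, `btl_name`, `name_in_list`, and `count_allowed` are hypothetical helpers, not Open MPI functions, and the table simply restates the addresses and component names printed by gdb. It resolves each module address to its component name and counts how many of a peer's BTLs appear in a comma-separated list such as "openib,ugni".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One initialized BTL module: address and component name (hypothetical type). */
typedef struct {
    uint64_t    addr;
    const char *name;
} btl_entry_t;

/* Modules in the order they were printed in the gdb session. */
static const btl_entry_t INITIALIZED[] = {
    { 0x7fffed102100, "self"   },
    { 0x759e30,       "openib" },
    { 0x7fffec89e200, "sm"     },
    { 0x76cab0,       "tcp"    },
    { 0x72a680,       "tcp"    },
    { 0x7fffec275200, "vader"  },
};

/* Map a module address to its component name, or NULL if unknown. */
static const char *btl_name(uint64_t addr)
{
    for (size_t i = 0; i < sizeof INITIALIZED / sizeof INITIALIZED[0]; ++i) {
        if (INITIALIZED[i].addr == addr) {
            return INITIALIZED[i].name;
        }
    }
    return NULL;
}

/* True if `name` is an exact element of the comma-separated `list`. */
static bool name_in_list(const char *name, const char *list)
{
    size_t len = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t tok = end ? (size_t) (end - p) : strlen(p);
        if (tok == len && strncmp(p, name, len) == 0) {
            return true;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return false;
}

/* Count how many of a peer's BTL addresses belong to an allowed component. */
static size_t count_allowed(const uint64_t *btls, size_t n, const char *allowed)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i) {
        const char *name = btl_name(btls[i]);
        if (name != NULL && name_in_list(name, allowed)) {
            ++hits;
        }
    }
    return hits;
}
```

With the peer arrays from the original trace, `count_allowed` returns 0 for peer 0 (0x76cab0 and 0x72a680, both tcp) and 1 for peer 1 (0x759e30 is openib, 0x7fffec275200 is vader), which matches the even/odd pattern reported above.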
[OMPI devel] Fun fact: comparisons between Open MPI versions
I stumbled across this this morning; I didn't even know such a thing existed:

    https://fossies.org/diffs/openmpi/

For example:

    https://fossies.org/diffs/openmpi/2.0.1_vs_2.0.2/index.html

Nifty.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] OMPI v1.10.6rc1 ready for test
Sorry for the delayed response.
I have completed my normal RC testing and have nothing to report.

-Paul

On Mon, Jan 30, 2017 at 1:03 PM, r...@open-mpi.org wrote:
> Usual place: https://www.open-mpi.org/software/ompi/v1.10/
>
> Scheduled release: Fri Feb 3rd
>
> 1.10.6
> --
> - Fix bug in timer code that caused problems at optimization settings
>   greater than 2
> - OSHMEM: make mmap allocator the default instead of sysv or verbs
> - Support MPI_Dims_create with dimension zero
> - Update USNIC support
> - Prevent 64-bit overflow on timer counter
> - Add support for forwarding signals
> - Fix bug that caused truncated messages on large sends over TCP BTL
> - Fix potential infinite loop when printing a stacktrace

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900