[OMPI devel] Problem with BTL while allocating window

2017-02-02 Thread Clement FOYER

Hi everyone,

I've been facing issues with the creation of windows (MPI_Win_create). 
Maybe it's an already known issue, or maybe you will be able to tell me 
where to look to find the problem.


I've been developing a benchmark to evaluate the overhead of a 
monitoring module. Everything works fine for PML-based operations (coll 
and point-to-point), but I get errors while creating windows, even 
without the monitoring component: I launch my applications with --mca 
pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring, so my 
components shouldn't be loaded.


From what I've tracked, while initializing the osc_rdma module, a btl 
is selected, but its endpoint can't be found again when calling 
ompi_osc_rdma_peer_btl_endpoint().


Here are the traces of a problematic example (with 4 processes; the 
current process is 1). All processes are on one node:


Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850)
    at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735 *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30

Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0)
    at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54  return NULL;
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680

(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200

(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680

(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30

It seems that for odd proc_ids, the selected btl can be retrieved, 
but not for the even ones. I haven't yet looked deeply into the 
library to explain this behavior. Do you have any idea where to 
look?
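
The pattern in the trace can be reproduced with a small standalone model. The sketch below is not the Open MPI code: the peer_rdma_btls_t type, the find_selected_btl() helper, and the peers table are hypothetical names introduced for illustration, and the only data taken from the trace are the module addresses. It assumes that the lookup simply scans a peer's btl_rdma array for the selected module and gives up (the "return NULL" at osc_rdma_peer.c:54) when no entry matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PEERS 4
#define MAX_BTLS  2

/* Hypothetical stand-in for a peer's btl_rdma array: only the module
 * addresses printed in the gdb session are kept. */
typedef struct {
    size_t   count;
    uint64_t btls[MAX_BTLS];
} peer_rdma_btls_t;

/* Address reported for selected_btl in ompi_osc_rdma_query_btls(). */
static const uint64_t selected_btl = 0x759e30;

/* Values of btl_rdma.bml_btls[i].btl for peers 0..3, as seen from
 * process 1 in the trace. */
static const peer_rdma_btls_t peers[NUM_PEERS] = {
    { 2, { 0x76cab0, 0x72a680 } },
    { 2, { 0x759e30, 0x7fffec275200 } },
    { 2, { 0x76cab0, 0x72a680 } },
    { 1, { 0x759e30 } },
};

/* Assumed lookup: scan the peer's array for the selected module.
 * Returns the matching index, or -1 when the module is absent. */
static int find_selected_btl(int peer_id)
{
    assert(peer_id >= 0 && peer_id < NUM_PEERS);
    for (size_t i = 0; i < peers[peer_id].count; ++i) {
        if (peers[peer_id].btls[i] == selected_btl) {
            return (int) i;
        }
    }
    return -1;
}
```

With this table, find_selected_btl() returns 0 for peers 1 and 3 and -1 for peers 0 and 2, which matches the odd/even split seen in the trace.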


Thank you in advance,

Clément FOYER

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Problem with BTL while allocating window

2017-02-02 Thread Clement FOYER

Update:

From what I've tracked, while initializing the osc_rdma module, a btl 
is selected whose endpoint can't be found again when calling 
ompi_osc_rdma_peer_btl_endpoint().


It seems that for even peers, the only available btl endpoints are tcp, 
even though ompi_osc_rdma_btl_names contains only openib and ugni.


(gdb) p ompi_osc_rdma_btl_names
$29 = 0x7981e0 "openib,ugni"

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$39 = "self", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module
$40 = (mca_btl_base_module_t *) 0x7fffed102100 

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$41 = "openib", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module
$42 = (mca_btl_base_module_t *) 0x759e30

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$43 = "sm", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module
$44 = (mca_btl_base_module_t *) 0x7fffec89e200 

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$46 = "tcp", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$47 = (mca_btl_base_module_t *) 0x76cab0

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$48 = "tcp", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$49 = (mca_btl_base_module_t *) 0x72a680

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$50 = "vader", '\000' 
(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$51 = (mca_btl_base_module_t *) 0x7fffec275200 

(gdb) p 
((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$52 = (mca_btl_base_module_t *) 0x0
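
To make the dump easier to read, the sketch below pairs each module address with the component name printed above and checks that name against the ompi_osc_rdma_btl_names string. The btl_entry_t table, component_of(), and name_in_list() are hypothetical helpers written for this illustration; they are not part of Open MPI, and the comma-separated matching is only an assumption about how such a list might be interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pairing of a module address with its component name. */
typedef struct {
    uint64_t    addr;
    const char *component;
} btl_entry_t;

/* Addresses and names copied from the gdb dump, in list order. */
static const btl_entry_t initialized_btls[] = {
    { 0x7fffed102100, "self"   },
    { 0x759e30,       "openib" },
    { 0x7fffec89e200, "sm"     },
    { 0x76cab0,       "tcp"    },
    { 0x72a680,       "tcp"    },
    { 0x7fffec275200, "vader"  },
};

/* Return the component name for a module address, or NULL if unknown. */
static const char *component_of(uint64_t addr)
{
    size_t n = sizeof(initialized_btls) / sizeof(initialized_btls[0]);
    for (size_t i = 0; i < n; ++i) {
        if (initialized_btls[i].addr == addr) {
            return initialized_btls[i].component;
        }
    }
    return NULL;
}

/* Return 1 if name equals one of the comma-separated items in list. */
static int name_in_list(const char *name, const char *list)
{
    assert(name != NULL && list != NULL);
    size_t len = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t item_len = end ? (size_t) (end - p) : strlen(p);
        if (item_len == len && strncmp(p, name, len) == 0) {
            return 1;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return 0;
}
```

Applied to the first trace, peers 0 and 2 hold 0x76cab0 and 0x72a680, which both resolve to tcp, while the selected module 0x759e30 resolves to openib, the only component in this dump that also appears in "openib,ugni".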

Sorry for the noise in your mailboxes. I thought it might be valuable 
to know where these pointers point.


Clement FOYER

 



[OMPI devel] Fun fact: comparisons between Open MPI versions

2017-02-02 Thread Jeff Squyres (jsquyres)
I stumbled across this this morning; I didn't even know such a thing existed:

https://fossies.org/diffs/openmpi/

For example:

https://fossies.org/diffs/openmpi/2.0.1_vs_2.0.2/index.html

Nifty.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] OMPI v1.10.6rc1 ready for test

2017-02-02 Thread Paul Hargrove
Sorry for the delayed response.
I have completed my normal RC testing and have nothing to report.

-Paul

On Mon, Jan 30, 2017 at 1:03 PM, r...@open-mpi.org  wrote:

> Usual place: https://www.open-mpi.org/software/ompi/v1.10/
>
> Scheduled release: Fri Feb 3rd
>
> 1.10.6
> ------
> - Fix bug in timer code that caused problems at optimization settings
>   greater than 2
> - OSHMEM: make mmap allocator the default instead of sysv or verbs
> - Support MPI_Dims_create with dimension zero
> - Update USNIC support
> - Prevent 64-bit overflow on timer counter
> - Add support for forwarding signals
> - Fix bug that caused truncated messages on large sends over TCP BTL
> - Fix potential infinite loop when printing a stacktrace



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900