> On Jun 19, 2019, at 2:44 PM, George Bosilca <[email protected]> wrote:
>
> To completely disable UCX you need to disable the UCX MTL and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory
issue. That’s a very helpful workaround, if nothing else.
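For concreteness, that amounts to an mpirun invocation along these lines (the
executable name here is just a placeholder):

    mpirun --mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1 ./my_app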
Using ucx 1.5.1 downloaded from the ucx web site at runtime (just by putting its lib
directory into LD_LIBRARY_PATH, without recompiling openmpi) doesn't seem to fix the
problem.
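(By "putting it into LD_LIBRARY_PATH" I mean roughly the following; the ucx install
prefix here is just a placeholder for wherever I unpacked 1.5.1, and the ldd line is
only a sanity check of which libucp/libucs/libuct actually get resolved at runtime:

    export LD_LIBRARY_PATH=/path/to/ucx-1.5.1/lib:$LD_LIBRARY_PATH
    ldd /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so | grep libuc
)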
>
> As you have a gdb session on the processes you can try to break on some of
> the memory allocations function (malloc, realloc, calloc).
Good idea. I set breakpoints on all three of those (roughly the gdb sequence sketched
just below), then did "c" three times. Does this mean anything to anyone? I'm
investigating the upstream calls (not included below) that generate these calls to
mpi_bcast, but given that the same code works on other types of nodes, I doubt those
are the problem.
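A sketch of that gdb sequence, reconstructed from memory ("bt" at each of the three
stops produced the backtraces that follow):

    (gdb) break malloc
    (gdb) break realloc
    (gdb) break calloc
    (gdb) continue
    (gdb) bt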
#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651f358a in ucs_rcache_create_region (region_p=0x7fff82806da0,
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070,
rcache=0xb341a50) at sys/rcache.c:500
#2 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072,
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#3 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at
ib/base/ib_md.c:990
#4 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>,
uct_flags=uct_flags@entry=96,
alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST,
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0,
md_map_p=md_map_p@entry=0xbc409a8)
at core/ucp_mm.c:100
#5 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4,
buffer=0x2b9e76102070, length=131072, datatype=128,
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
uct_flags@entry=0) at core/ucp_request.c:218
#6 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>,
req=0xbc40940) at
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#7 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#8 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
req=<optimized out>) at tag/tag_send.c:78
#9 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192,
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#10 0x00002b9e64465fa6 in mca_pml_ucx_isend () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#11 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#12 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#13 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#14 0x00002b9e521dbb79 in PMPI_Bcast () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e51f623df in pmpi_bcast__ () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at
datastruct/pgtable.c:69
#2 ucs_pgtable_insert_page (region=0xc6919d0, order=12,
address=47959585718272, pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3 ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8,
region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4 0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0,
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070,
rcache=0xb341a50) at sys/rcache.c:511
#5 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072,
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at
ib/base/ib_md.c:990
#7 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>,
uct_flags=uct_flags@entry=96,
alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST,
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0,
md_map_p=md_map_p@entry=0xbc409a8)
at core/ucp_mm.c:100
#8 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4,
buffer=0x2b9e76102070, length=131072, datatype=128,
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
uct_flags@entry=0) at core/ucp_request.c:218
#9 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>,
req=0xbc40940) at
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192,
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at
datastruct/pgtable.c:69
#2 ucs_pgtable_insert_page (region=0xc6919d0, order=4, address=47959585726464,
pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3 ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8,
region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4 0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0,
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070,
rcache=0xb341a50) at sys/rcache.c:511
#5 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072,
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at
ib/base/ib_md.c:990
#7 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>,
uct_flags=uct_flags@entry=96,
alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST,
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0,
md_map_p=md_map_p@entry=0xbc409a8)
at core/ucp_mm.c:100
#8 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4,
buffer=0x2b9e76102070, length=131072, datatype=128,
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
uct_flags@entry=0) at core/ucp_request.c:218
#9 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>,
req=0xbc40940) at
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192,
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
#19 0x000000000040d442 in m_bcast_z_from (comm=..., vec=..., n=55826, inode=2)
at mpi.F:1781
Noam
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil