Well, it looks like we found the problem.
The memory free callback was incorrect: we were only looking up the
base address of the freed buffer in the tree. Here is why this didn't
work (this probably wouldn't compile, but it works as an explanation):
buf = malloc(40*1024); /* malloc 10 pages */
/* send the second half of the buffer to the peer */
/* note that leave_pinned will register and cache only what it sees
in the send */
MPI_Send(buf+5*4*1024, 5*4*1024, ........ );
/* free the buffer; MPI will try to find the registration in the tree
based on the address buf, but won't find it, so the registration
remains */
free(buf);
Since the registration is left in the tree, a future malloc may
obtain a virtual address that falls within the base and bound of that
stale registration. When this memory is later freed, we try to
deregister the entire registration, part of which might be in use by
another buffer; it could even be in the middle of an RDMA operation.
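
To make the failure concrete, here is a minimal sketch of the two lookup
strategies. It is not the actual Open MPI code or the actual fix: reg_t,
find_by_base and find_overlap are hypothetical names, and a flat array
stands in for the registration tree. It only shows why an exact
base-address match misses a registration that starts in the middle of
the freed buffer, while a range-overlap check finds it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration record: an inclusive [base, bound] range. */
typedef struct {
    uintptr_t base;
    uintptr_t bound;
} reg_t;

/* Buggy idea: match only when the freed address equals a registration base. */
static const reg_t *find_by_base(const reg_t *regs, size_t n, uintptr_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].base == addr) {
            return &regs[i];
        }
    }
    return NULL;
}

/* Overlap idea: match any registration that intersects the freed range
 * [addr, addr + len - 1]. Returns the first hit, or NULL if none. */
static const reg_t *find_overlap(const reg_t *regs, size_t n,
                                 uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t last = addr + len - 1;
    for (size_t i = 0; i < n; i++) {
        if (regs[i].base <= last && addr <= regs[i].bound) {
            return &regs[i];
        }
    }
    return NULL;
}
```

With a 10-page buffer at some address buf and a single registration
covering [buf + 5*4096, buf + 10*4096 - 1], find_by_base(regs, 1, buf)
returns NULL, whereas find_overlap(regs, 1, buf, 10*4096) returns the
registration that free(buf) would otherwise leave behind.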
Anyway, I have modified the code and we are now passing a smaller
linpack run with leave_pinned and the mem hooks enabled without using
any mallopt trickiness.
Thanks,
Galen
On Sep 25, 2005, at 10:58 AM, Galen M. Shipman wrote:
Well, after adding a bunch of debugging output, I have found the
following.
With both leave_pinned and use_mem_hook enabled on a linpack run we
get the assertion error on the memory callback in linpack. That is
to say, there is a free occurring in the middle of a registration.
At the point of assert we have NOT resized any registrations.
The existing registrations in the tree are:
Existing registrations:
Base       Bound      Length
241615360  244841607  3226248
244841608  246428807  1587200
246428808  248016007  1587200
248019648  251245895  3226248
Trying to free
247917216
From
Base       Bound
246428808  248016007
When we hit the assert, we are trying to free 247917216, which is
in the middle of that registration. Note that we have NOT resized any
registrations, so I am confident there is no issue with either
the tree or the resize, at least as far as linpack is concerned.
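
As a sanity check on the numbers above, here is a small sketch that
replays the lookup with the values from the debugging output. The names
reg_entry_t, regs_from_log, containing_reg and lengths_consistent are
made up for illustration, and a plain array stands in for the tree; the
base, bound and length values are copied from the listing, and the
address 0xec6eaa0 and size 31624 come from frame #3 of the callstack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record mirroring the Base / Bound / Length columns. */
typedef struct {
    uint64_t base;
    uint64_t bound;   /* inclusive */
    uint64_t length;
} reg_entry_t;

/* Values copied from the debugging output in this message. */
static const reg_entry_t regs_from_log[4] = {
    { 241615360ULL, 244841607ULL, 3226248ULL },
    { 244841608ULL, 246428807ULL, 1587200ULL },
    { 246428808ULL, 248016007ULL, 1587200ULL },
    { 248019648ULL, 251245895ULL, 3226248ULL },
};

/* Index of the registration whose [base, bound] contains addr, or -1. */
static int containing_reg(const reg_entry_t *regs, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].base <= addr && addr <= regs[i].bound) {
            return (int)i;
        }
    }
    return -1;
}

/* Returns 1 if every entry satisfies length == bound - base + 1. */
static int lengths_consistent(const reg_entry_t *regs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].bound - regs[i].base + 1 != regs[i].length) {
            return 0;
        }
    }
    return 1;
}
```

Calling containing_reg(regs_from_log, 4, 0xec6eaa0) selects the third
entry (index 2), and the freed range of 31624 bytes ends at 247948839,
still below that entry's bound of 248016007, so the whole free sits
inside a single registration rather than at its base.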
Here is the callstack:
#0 0x0000002a95f079c9 in raise () from /lib/libc.so.6
#1 0x0000002a95f08e6e in abort () from /lib/libc.so.6
#2 0x0000002a95f01690 in __assert_fail () from /lib/libc.so.6
#3 0x0000002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0,
size=31624,
cbdata=0x0) at mpool_base_mem_cb.c:53
#4 0x0000002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0,
length=31624) at memory.c:121
#5 0x0000002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0,
caller=0x42b052) at memory_malloc_hooks.c:66
#6 0x000000000042b052 in ATL_dmmIJK ()
#7 0x000000000064f9b1 in ATL_dgemmNN ()
#8 0x000000000057722b in ATL_dgemmNN_RB ()
#9 0x0000000000577fc3 in ATL_rtrsmRUN ()
#10 0x000000000042c63c in ATL_dtrsm ()
#11 0x0000000000423c1e in atl_f77wrap_dtrsm__ ()
#12 0x0000000000423a94 in dtrsm_ ()
#13 0x0000000000411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8,
TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1,
A=0x7fbfffefa0, LDA=0,
B=0x202, LDB=0) at HPL_dtrsm.c:949
#14 0x000000000040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0,
PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
#15 0x000000000041936f in HPL_pdgesvK2 (GRID=0x7fbffff4a0,
ALGO=0x7fbffff460,
A=0x7fbffff260) at HPL_pdgesvK2.c:178
#16 0x000000000040d6f7 in HPL_pdgesv (GRID=0x7fbffff4a0, ALGO=0x460d,
A=0x7fbffff260) at HPL_pdgesv.c:107
#17 0x0000000000405b10 in HPL_pdtest (TEST=0x7fbffff430,
GRID=0x7fbffff4a0,
ALGO=0x7fbffff460, N=10000, NB=80) at HPL_pdtest.c:193
#18 0x0000000000401840 in main (ARGC=1, ARGV=0x7fbffff928)
at HPL_pddriver.c:223
Note that the free occurs in the ATLAS libraries. I will look into
rebuilding linpack with another BLAS library to see what happens.
Any other suggestions?
Thanks,
Galen