Well, after adding a bunch of debugging output, I have found the
following.
With both leave_pinned and use_mem_hook enabled on a linpack run, we
hit the assertion in the memory callback in linpack. That is to
say, a free is occurring in the middle of a registration.
At the point of assert we have NOT resized any registrations.
The existing registrations in the tree are:
Existing registrations:
Base       Bound      Length
241615360  244841607  3226248
244841608  246428807  1587200
246428808  248016007  1587200
248019648  251245895  3226248
Trying to free
247917216
From
Base       Bound
246428808  248016007
When we get the assert, we are trying to free 247917216 (0xec6eaa0,
the buf in frame #4 of the callstack), which is in the middle of the
registration [246428808, 248016007]. Note that we have NOT resized any
registrations, so I am confident there is no issue with either the
tree or the resize, at least as far as linpack is concerned.
Here is the callstack:
#0 0x0000002a95f079c9 in raise () from /lib/libc.so.6
#1 0x0000002a95f08e6e in abort () from /lib/libc.so.6
#2 0x0000002a95f01690 in __assert_fail () from /lib/libc.so.6
#3 0x0000002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0,
    size=31624, cbdata=0x0) at mpool_base_mem_cb.c:53
#4 0x0000002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0,
    length=31624) at memory.c:121
#5 0x0000002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0,
    caller=0x42b052) at memory_malloc_hooks.c:66
#6 0x000000000042b052 in ATL_dmmIJK ()
#7 0x000000000064f9b1 in ATL_dgemmNN ()
#8 0x000000000057722b in ATL_dgemmNN_RB ()
#9 0x0000000000577fc3 in ATL_rtrsmRUN ()
#10 0x000000000042c63c in ATL_dtrsm ()
#11 0x0000000000423c1e in atl_f77wrap_dtrsm__ ()
#12 0x0000000000423a94 in dtrsm_ ()
#13 0x0000000000411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8,
    TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1,
    A=0x7fbfffefa0, LDA=0, B=0x202, LDB=0) at HPL_dtrsm.c:949
#14 0x000000000040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0,
    PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
#15 0x000000000041936f in HPL_pdgesvK2 (GRID=0x7fbffff4a0,
    ALGO=0x7fbffff460, A=0x7fbffff260) at HPL_pdgesvK2.c:178
#16 0x000000000040d6f7 in HPL_pdgesv (GRID=0x7fbffff4a0, ALGO=0x460d,
    A=0x7fbffff260) at HPL_pdgesv.c:107
#17 0x0000000000405b10 in HPL_pdtest (TEST=0x7fbffff430,
    GRID=0x7fbffff4a0, ALGO=0x7fbffff460, N=10000, NB=80)
    at HPL_pdtest.c:193
#18 0x0000000000401840 in main (ARGC=1, ARGV=0x7fbffff928)
    at HPL_pddriver.c:223
Note that the free occurs inside the ATLAS library. I will look into
rebuilding linpack with another BLAS library to see what happens. Any
other suggestions?
Thanks,
Galen