Well, after adding a bunch of debugging output, I have found the following.

With both leave_pinned and use_mem_hook enabled on a linpack run, we hit the assertion in the memory callback. That is to say, a free is occurring in the middle of a registration.

At the point of the assert, we have NOT resized any registrations.
The existing registrations in the tree are:

Existing registrations:
Base            Bound           Length
241615360       244841607       3226248
244841608       246428807       1587200
246428808       248016007       1587200
248019648       251245895       3226248

Trying to free: 247917216
From registration:
Base            Bound
246428808       248016007



When we get the assert, we are trying to free 247917216 (0xec6eaa0, length 31624 per the callstack), which is 1488408 bytes into the registration 246428808-248016007, so the freed range lies entirely inside it. Note that we have NOT resized any registrations, so I am confident there is not an issue with either the tree or the resize, at least as far as linpack is concerned.
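
For anyone who wants to double-check the arithmetic, here is a minimal sketch of the overlap test using the numbers above. The type reg_t and the functions free_overlaps_reg and find_overlap are hypothetical names made up for this illustration; this is NOT the actual code in mpool_base_mem_cb.c, just the condition I believe the assert is guarding against, assuming Bound is the last byte of the registration (inclusive), which is what the Length column implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registration entry; not the real mpool type. */
typedef struct {
    uintptr_t base;   /* first byte of the registration */
    uintptr_t bound;  /* last byte of the registration (inclusive) */
} reg_t;

/* Returns 1 if the freed range [addr, addr + len - 1] touches the registration. */
int free_overlaps_reg(const reg_t *r, uintptr_t addr, size_t len)
{
    assert(r->base <= r->bound && len > 0);
    uintptr_t last = addr + len - 1;   /* last byte being freed */
    return addr <= r->bound && last >= r->base;
}

/* Returns the index of the first registration the free overlaps, or -1 if none. */
int find_overlap(const reg_t *regs, size_t n, uintptr_t addr, size_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (free_overlaps_reg(&regs[i], addr, len)) {
            return (int)i;
        }
    }
    return -1;
}
```

Feeding in the four registrations listed above with addr = 0xec6eaa0 and len = 31624 should return index 2, i.e. the 246428808-248016007 registration, while a free that falls in the gap between the third and fourth registrations returns -1.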
Here is the callstack:

#0  0x0000002a95f079c9 in raise () from /lib/libc.so.6
#1  0x0000002a95f08e6e in abort () from /lib/libc.so.6
#2  0x0000002a95f01690 in __assert_fail () from /lib/libc.so.6
#3  0x0000002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0, size=31624,
    cbdata=0x0) at mpool_base_mem_cb.c:53
#4  0x0000002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0,
    length=31624) at memory.c:121
#5  0x0000002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0,
    caller=0x42b052) at memory_malloc_hooks.c:66
#6  0x000000000042b052 in ATL_dmmIJK ()
#7  0x000000000064f9b1 in ATL_dgemmNN ()
#8  0x000000000057722b in ATL_dgemmNN_RB ()
#9  0x0000000000577fc3 in ATL_rtrsmRUN ()
#10 0x000000000042c63c in ATL_dtrsm ()
#11 0x0000000000423c1e in atl_f77wrap_dtrsm__ ()
#12 0x0000000000423a94 in dtrsm_ ()
#13 0x0000000000411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8,
    TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1, A=0x7fbfffefa0, LDA=0,
    B=0x202, LDB=0) at HPL_dtrsm.c:949
#14 0x000000000040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0,
    PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
#15 0x000000000041936f in HPL_pdgesvK2 (GRID=0x7fbffff4a0, ALGO=0x7fbffff460,
    A=0x7fbffff260) at HPL_pdgesvK2.c:178
#16 0x000000000040d6f7 in HPL_pdgesv (GRID=0x7fbffff4a0, ALGO=0x460d,
    A=0x7fbffff260) at HPL_pdgesv.c:107
#17 0x0000000000405b10 in HPL_pdtest (TEST=0x7fbffff430, GRID=0x7fbffff4a0,
    ALGO=0x7fbffff460, N=10000, NB=80) at HPL_pdtest.c:193
#18 0x0000000000401840 in main (ARGC=1, ARGV=0x7fbffff928)
    at HPL_pddriver.c:223

Note that the free occurs in the ATLAS libraries. I will look into rebuilding linpack with another BLAS library to see what happens. Any other suggestions?

Thanks,

Galen
