Nicholas Clark wrote:
Sorry about the delay in replying. I've had a lot on, and wanted to find
the time to confirm my suspicions.

On Sat, Jul 30, 2005 at 09:50:40PM -0400, John E. Malmberg wrote:


What I have found is that in this case is that the allocation size for the sv_u.svu_array member appears to be too small.


I don't believe this conclusion to be correct.

I refactored the hash code about 2 months ago and introduced the xpvhv_aux
structure. Prior to this, the malloc()ed region pointed to by the
sv_u.svu_array member was only used for the HE*s. For the case of no the
xpvhv_aux structure, the part of the code that calculates the size to allocate
has not changed.


The memory that is allocated is used to store an array of HE * pointers.
Not HE structures. The code isn't clear, and when I started trying to
investigate I realised that I could add another member to the sv_u
union, of type HE **, to make things much clearer.

Had the calculation been wrong in the past, we would also have been allocating
4*n bytes on many systems, yet using 12*n bytes, for every hash, which should
have shown up with memory checking tools (and random corruption) left right
and centre.


So I'm assuming that the problem is in or related to the changes I made to
tag a xpvhv_aux structure on the end, but not all the time. Hence I suspect
that it's my fault and been trying to confirm where, and nail the cause, if
not find the solution. But I admit that I have to give up. I don't know
where. Even if I try this change to make the xpvhv_aux structure rather
larger:

==== //depot/perl/hv.h#76 - /home/nick/p4perl/perl/hv.h ====
--- /tmp/tmp.89859.0    Fri Aug 19 21:38:03 2005
+++ /home/nick/p4perl/perl/hv.h Wed Aug 17 13:54:08 2005
@@ -37,10 +37,14 @@ struct shared_he {
    Don't access this directly.
 */
 struct xpvhv_aux {
+    char splat1[10000];
     HEK                *xhv_name;      /* name, if a symbol table */
     HE         *xhv_eiter;     /* current entry of iterator */
     I32                xhv_riter;      /* current root of iterator */
+    char splat2[10000];
 };
+
+#define SPLAT(hv) memset(HvAUX(hv)->splat1, 0xFE, 10000); 
memset(HvAUX(hv)->splat2, 0xFE, 10000);
/* hash structure: */
 /* This structure must match the beginning of struct xpvmg in sv.h. */


and put calls to SPLAT everywhere that the members of xpvhv_aux are written
to, I can't trigger the corruption on x86 Linux.


The xpvhv_aux structure is placed in memory directly after the array of
pointers to HE*s. However, memory to hold it is not always allocated.

My assumption is that either

1: There's a code path that starts writing to the location it should be in,
   without first calling realloc() to allocate memory for it.
2: There's an error in the code that works out the offset of the xpvhv_aux
   structure given the pointer accessed via the macro HvARRAY()

but I can't find it by inspection, and I can't reproduce it, even by tweaking
the code to try to provoke the problem. Running regression tests under
valgrind doesn't find anything amiss either. I'm stuck. But I'm confident
that it's not a three-fold error in the calculation for the size of memory to
allocate for the array of HE **s.

I can only reproduce this on one instance of one test, but on that test it can be reproduced every time.

Over-allocating the structure did make the problem go away, so there is a clear relationship.

The condition where the corruption is detected has the characteristic that the xhv_max = 7.

DBG> exam hv->sv_any->xhv_max
HV\S_hv_auxinit\hv->sv_any->xhv_max:    7

DBG> exam hv->sv_u.svu_array
HV\S_hv_auxinit\hv->sv_u.svu_array:     0

The base of the allocated memory is below, by the old calculation a malloc of 44 bytes.

HV\S_hv_auxinit\array:  10065784

The array will end up being stored at the address:

DBG> eval/add hv->sv_u.svu_array
9964796
DBG> exam hv->sv_u.svu_array
HV\S_hv_auxinit\hv->sv_u.svu_array:     10065784

The upper bound of memory is:
DBG> eval 10065784+44
10065828

Which looks like it should be correct.

So I will undo my local change and see if I can capture some more data corruptions to zero in on this.

Is there some quick way that I can turn on the DEBUG_m function once I start reproducing the problem? It would be interesting to see if that gave any more clues.

-John
[EMAIL PROTECTED]
Personal Opinion Only



Reply via email to