Re: dwarf_edit_output testcase sanity checking

Mark Wielaard Thu, 04 Nov 2010 15:39:50 -0700

On Wed, 2010-10-27 at 10:44 -0400, Roland McGrath wrote: 
> That will indeed be horribly inefficient.  But, in nontrivial cases, the
> whole plan of hash=0 for circular references is problematically inefficient
> too.  So this suggests reconsidering that whole idea in the algorithm.
> This is the sort of thing I wanted your fresh brainpower on.
> 
> Consider a <pointer_type type=.../> DIE.  There are zillions of those
> All of those that are part of a circular reference chain hash as the same.
> (See e.g. the --stats output on any C++ case, where there are these and
> also zillions of the equivalent reference_type cases.)
> 
> So we wind up comparing against all those in the same hash bucket, which is
> far from optimal.  The best the current (buggy) plan gets is to do negative
> caching of each individual comparison.  But even if all that caching code
> were perfectly correct, it stills winds up being n separate hash lookups
> to do the n "quick" comparisons.  So this is not really an adequate plan.


So the problem is the hash of a reference attribute not being "stable"
in the face of circularity. The dirty solution of just always using a
hash of zero for a reference makes things indeed a lot worse, too many
things hash together to be practical.

I tried to hack together a "solution" that used the referent's
non-reference attributes and tag. That showed a bug which looks
suspiciously like what we tried to debug before:
https://fedorahosted.org/pipermail/elfutils-devel/2010-September/001584.html
So I will try to get to the bottom of that first (somewhere when we try
to finalize a die the attributes get misplaced). But in practice this
might also not be enough to create very distinct hashes.

Another idea would be to use not use a "perfect hash" for reference
attributes. Allow some equal reference attributes to not hash the same.
We could use the identity of the originally copied die we are referring
too as hash for the reference attribute. Since we do still have that in
the die_info_pair. This should be unique at least inside a CU, but not
across CUs in the same file. So then we would miss marking reference
attributes equals if they point to dies that are equal across CUs. So
unless we "fix that up" in some later pass that also doesn't seem
practical. At least I couldn't come up with a good idea for doing a pass
over the result that would merge things across CUs. It seems that cannot
be a simple pass over the dwarf_output result since merging two
reference attributes might make other references that point to dies
containing these reference attributes equal, etc.

The nice thing about most of the hashing done in dwarf_output is that it
"uniqifies" value/die groups and so creates a perfect unique hash. This
can be done with ordered sets of values or die trees. But the reference
attributes introduce arbitrary (possibly circular) graphs between
elements. We get in trouble because we want to cut the circularity, but
don't know "the start of the circle". So we pick an arbitrary attribute
reference target and declare that as being the circular node. Then we
assign a different hash code to a reference to such a circular node then
we normally would do. Normally we would like to assign the "unique
identifier object" for that referred to die. But we cannot in the
circular case since we are still constructing that.

So the problem comes from including the hash of all reference attributes
in the construction of the "unique identifier object" for a attribute
reference that points to a die. That means that references inside
circular structures cannot really be completed since during the
construction of the "unique identifier object" of reference to that
circular structure itself it can never be fully resolved. So it seems
that we should declare two "resolved" states for DIEs. One "tree
resolved" that takes into account just the basic attributes plus all the
children "tree resolved". And another "fully resolved" that all
attributes, including reference values into account, plus "fully
resolved" children. A reference value then only needs its referent DIE
to be "tree resolved" to create the "unique identifier object" for it.
That "breaks the circle" at the start of a reference attribute value so
you don't need to track circularity anymore.

The disadvantage is that you end up with two "pending queues" which
seems to effectively mean you will scan the whole DIE tree twice (or at
least calculate the hash over a DIE twice, without and then with taking
reference values into account). And since some "tree resolved" DIEs
might have the same hash, but are not equal if you take the value
references into account, you don't have perfect hashes for the attribute
reference values. But they only collide when referring to the same
structural trees with all basic attribute in all included DIEs matching
completely.

Cheers,

Mark

_______________________________________________
elfutils-devel mailing list
[email protected]
https://fedorahosted.org/mailman/listinfo/elfutils-devel

Re: dwarf_edit_output testcase sanity checking

Reply via email to