> Right, so for consistency ignoring the reference attribute types
> completely seems simplest. Although I admit I hadn't realized there
> were also the canonicalized attribute map hacks.
It was just an idea I mentioned in the prior email, and it's nonobvious
exactly how to implement it in the niggling C++ details.

> To my surprise most DIEs actually do have reference attributes. A quick
> random sampling of some test binaries showed ~2/3 of all DIEs have one.
> Only ~1/3 of the DIEs are pure basic attributes. Still I think it makes
> sense to finalize these immediately. Since for the second pass location
> in the tree doesn't matter, we can then just take the list of ~2/3rd
> pending DIEs and walk over them directly.

A further idea I didn't mention might merit consideration. That is,
also finalize in the first pass a DIE that has only references to final
DIEs. That catches all pure backward references, which might be most of
the common cases not involving cycles. (It depends mostly on whether
the compiler emits a <pointer_type type="#refN"/> after refN or before
it.) These are the cases that are finalized quickly in the current
reference-chasing scheme, i.e., when chasing references finds only a
DIE you already finalized, then you finalize right there without
recursing. But we should be able to do this change easily after we
prove your basic scheme. So let's not try to roll in any more
complexity in the first cut.

> hmmm, ** void and ** char might occur frequently. So one extra pass to
> catch those might be in order if that matters.

Let's start with your plan as it is, and then we can see how many hash
collisions we are getting in real-world data.

> We could even chase the whole reference chain if we could prove it
> always terminates in some basic-attributes-only DIE.

I guess it's relatively easy to be sure when it has no cycles at all.

> My suspicion is that the cycles only occur because we chase references
> and children at the same time. But I don't think the dwarf format
> guarantees it, it just looks to be true for all sane type systems I
> can imagine (where a reference type cannot directly (recursively)
> point to itself, only through children of structure types). So I don't
> think we can rely on that in general.

I concur on both points. There is certainly no theoretical limitation
like this in the structure of DWARF. But I can't think off hand of any
case where an all-attributes cycle like that comes up in actual
practice.

> > By "complete", I mean that it does guarantee perfect detection of
> > all duplicates.
>
> I think it does guarantee that. Each duplicate DIE hashes to the same
> value.

I wasn't sure how to think about that in the presence of cycles. But I
guess it's fairly simple, in that, for this purpose, a cycle is
equivalent to an infinite chain of references to identical copies.

> > By "sound" I mean that, if the [...]
>
> It doesn't guarantee that. Although I think in practice the chances of
> that happening are low.

Ok. Indeed, the chances seem low off hand. And moreover, we will look
at the statistics on a large corpus of data, and if we do have any
collisions at all, we will tweak our hash functions to try to make them
more rare.

> If we could prove (or maybe dwarflint could check, but as said above I
> don't think the dwarf spec guarantees it) that all reference attribute
> chains (ignoring DIE children) are really DAGs (directed acyclic
> graphs), then we could make the second pass hash unique too, which
> would give us the guarantee theoretically, I think.

We could make dwarflint complain about this, and that might be
interesting, but it's only ever going to be an "unexpected" DWARF
construction, not something that is technically invalid in the
low-level "well-formed DWARF tree" sense. So we wouldn't want
dwarf_output to assume it.

> > Also, do you think you can calculate the formal algorithmic
> > complexity?
> I believe worst case (ignoring hash collisions) you have to walk over
> the whole DIE tree once for each pass. So it should just be O(p * n),
> where p is the number of passes and n is the number of DIEs in all
> CUs.

Since p is always going to be a constant (I assume), that's still just
O(n) for the formal complexity. We should have the algorithm write-up
and the calculations about this stuff somewhere on the wiki
(eventually).

> That assumes all partial results we store per DIE/attribute in
> temporary tables are just constant lookups.

I think that's true.

> We will have to compare the two DIEs completely, including children
> and references. The hope is that the multi-pass hashing is pretty fast
> itself, and strong enough to make that happen rarely.

Ok. In a certain sense that is good. That is, it means we still need a
decent dwarf_comparator implementation just for dwarf_output
correctness, not only for dwarfcmp. So in a perverse way that sort of
justifies more effort spent on that implementation, since it is not
just a testing thing.

> Since there will be hash collisions we have to do a complete
> comparison in such cases to make sure it is sound. But that full
> comparison can be "dumb" and just do the comparison one-on-one for
> each of the DIE's children and reference attributes. With dwarfcmp we
> would like to catch identical DIEs which might have equal, but
> different length, reference chains. We can ignore those tricky cases
> where we just have two concrete DIEs which happen to hash together.
> Just proving that they are structurally completely identical or not is
> enough to be sound (it just means we aren't complete in the sense that
> two identical DIEs which hash together but have different, but
> semantically identical, circular reference chains won't actually be
> recognized as truly equal).

Ah, I see.
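As a concreteness check on the O(p * n) shape of this, here is a toy
sketch of the multi-pass hashing idea. Everything here is invented for
illustration (the `die` struct, `multipass_hash`, the mixing constant);
it is not the real dwarf_output code. Pass 0 hashes only the basic
attributes, and each later pass mixes in the referents' previous-pass
hashes, so each pass is one linear walk, and a cycle behaves exactly
like an infinite chain of references to identical copies: structurally
identical DIEs can never hash apart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a DIE: basic-attribute data plus reference
// attributes given as indices into the owning vector.
struct die
{
  size_t attrs;			// stand-in for the basic-attribute hash
  std::vector<size_t> refs;	// referent indices
};

// p passes over n DIEs: O(p * n) total, ignoring hash collisions.
static std::vector<size_t>
multipass_hash (const std::vector<die> &dies, unsigned passes)
{
  std::vector<size_t> h (dies.size ());
  for (size_t i = 0; i < dies.size (); ++i)
    h[i] = dies[i].attrs;	// pass 0: basic attributes only

  for (unsigned p = 0; p < passes; ++p)
    {
      std::vector<size_t> next (dies.size ());
      for (size_t i = 0; i < dies.size (); ++i)
	{
	  next[i] = h[i];
	  // Mix in each referent's previous-pass hash, so pass k sees
	  // the referent as it looked at pass k-1.
	  for (size_t r : dies[i].refs)
	    next[i] = next[i] * 0x9e3779b97f4a7c15ULL ^ h[r];
	}
      h = next;
    }
  return h;
}
```

Note that this only illustrates the "complete" direction: duplicates,
even ones lying on cycles, always get the same hash. Nothing stops two
different DIEs from colliding, which is exactly why a full comparison
on hash matches is still needed.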
Since hash collisions are possible, we must always do at least one
"complete comparison" to match a duplicate with the existing one
already in the collector. That being so, we want that comparison to be
quite efficient. To start with, we could make that punt on
non-identical cycles, as you say.

But we do want to detect those strange equivalent-cycles cases for
dwarfcmp anyway, and we'll want that really quite soon. We already know
from our experience with real-world data that in some cases the
duplicate-reduced tree and the original tree will differ in exactly
this way. By our testing plan, we won't declare dwarf_output "believed
ready for prime time" until dwarfcmp has declared reduced trees
equal-enough to original trees for everything in the corpus.

One simplifying approach would be to use the same dwarf_ref_tracker
implementation in dwarfcmp and for the dwarf_output comparison. That
would let us just work on dwarfcmp and get it thoroughly right for the
difficult cases. However, we want the dwarf_output incarnation of this
comparison to be as efficient as possible, in both computation and
storage. The reason I tried to fold the tracker work into the copier
data structures in the existing code was for this efficiency. So it's
not entirely clear how to proceed on the comparison code. That is, how
to proceed with the code once your implementation of the new hashing
scheme is ready.

I thought I had a correct algorithm for equivalent cycles in
dwarf_ref_tracker before I tried to make it more efficient with the
caching plan that introduced correctness problems. But the last attempt
at using this (i.e. the dwarf branch code as it is now, with
dwarf_ref_tracker::notice_match commented out) seemed to produce a
situation where plain dwarfcmp was not merely way too slow on complex
cyclic cases, but actually did not terminate.
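For the "dumb but sound" shape of that comparison, something like the
following toy sketch is what I have in mind (invented types and names,
not the real dwarf_comparator or dwarf_ref_tracker): recurse one-on-one
over reference attributes, and when the same node pair is revisited,
treat the cycle as having closed consistently. Identical cycles then
compare equal, any structural difference is hit as a local mismatch
first, and the walk terminates because each pair is entered at most
once.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy stand-in for a DIE: basic-attribute data plus positional
// reference attributes (indices into the owning vector).
struct node
{
  size_t attr;
  std::vector<size_t> refs;
};

// One-on-one structural comparison.  A revisited (i, j) pair means the
// reference chains closed consistently, so equality is assumed there;
// a false result anywhere aborts the whole comparison immediately.
static bool
same (const std::vector<node> &a, size_t i,
      const std::vector<node> &b, size_t j,
      std::set<std::pair<size_t, size_t>> &seen)
{
  if (a[i].attr != b[j].attr || a[i].refs.size () != b[j].refs.size ())
    return false;
  if (!seen.insert (std::make_pair (i, j)).second)
    return true;		// cycle closed on an already-walked pair
  for (size_t k = 0; k < a[i].refs.size (); ++k)
    if (!same (a, a[i].refs[k], b, b[j].refs[k], seen))
      return false;
  return true;
}

static bool
same_die (const std::vector<node> &a, size_t i,
	  const std::vector<node> &b, size_t j)
{
  std::set<std::pair<size_t, size_t>> seen;
  return same (a, i, b, j, seen);
}
```

Because the comparison is positional, equivalent cycles that are shaped
differently can still be missed; that is the incompleteness we are
deliberately accepting for the first cut.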
We probably need to start over on the thinking for the reference
equality algorithm there and prove to ourselves that we think it's
correct, even without worrying about efficiency. If we get something
for plain dwarfcmp that has no false positives and doesn't take hours
to terminate, that is a starting place. I think the no-false-positives
part is relatively easy, if we don't do any kind of caching during
recursions. It's unclear whether that will yield something with even
vaguely usable performance. (This would be something that still has
false negatives, for cycles that are equivalent but not identical.)

With that, we could start by just using dwarf_ref_tracker directly for
the dwarf_output hash collision comparisons. That is, the copier holds
a dwarf_ref_tracker object, and gives it walk/step calls in lock-step
with the copier's own walk. Then the comparison for the hash table just
uses a dwarf_comparator instantiated with that tracker object. That
will be correct enough for the purpose of comparison in the first cut.
It just has some computation and storage overhead of maintaining
dwarf_ref_tracker's tables, but that's mostly going to be memory bloat.

From there, we can reduce the storage bloat by folding the data
structures for comparison into what we already have for copying. I'm
assuming that's a significant win, but I'm not really sure it is.

To keep it saner, we could try to refactor the dwarf_ref_tracker
implementation to separate the storage of the bookkeeping from the
cycle-detection algorithm itself. Then dwarf_output::copier would be
replacing only the storage portion by integrating that with its own
data structures, rather than also reimplementing the algorithm
separately there too.

Then we'd have something that is incomplete but correct for duplicate
detection, as you described above.
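One way that storage/algorithm split might be shaped (purely
illustrative; these names and interfaces are invented, not the real
dwarf_ref_tracker): parameterize the walk logic on a storage policy, so
dwarf_output::copier could substitute a storage class backed by its own
tables while plain dwarfcmp keeps a self-contained one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Default, self-contained bookkeeping: the set of node pairs whose
// comparison is currently in progress.
struct set_storage
{
  std::set<std::pair<size_t, size_t>> walking;
  bool enter (size_t a, size_t b)	// false if already being walked
  { return walking.insert (std::make_pair (a, b)).second; }
  void leave (size_t a, size_t b)
  { walking.erase (std::make_pair (a, b)); }
};

// The algorithm half: knows only the walk discipline, not how the
// in-progress pairs are stored.  A copier-specific storage class with
// the same enter/leave shape could be dropped in instead.
template <typename storage>
struct ref_tracker
{
  storage store;

  template <typename step>
  bool walk (size_t a, size_t b, step &&compare)
  {
    if (!store.enter (a, b))
      return true;		// cycle closed consistently: assume equal
    bool ok = compare (a, b);
    store.leave (a, b);
    return ok;
  }
};
```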
That's enough to bang on the whole dwarf_output part with real-world
data, get some stats about the hash collisions, tune hash functions,
and try out the things about early finalization and so forth. It should
also be enough for the actual writer code, imported_unit generation,
and so forth, to be developed in parallel with refinements to
hashing/copying and comparison. It doesn't seem like any of that
refinement would change the aspects of the data structures that are
important to the writer side.

This plan of attack hinges on the "simple enough to be sure it's
correct" comparison implementation not being unusably slow on our
real-world data. I'm not really very sure about that at all. An
alternative plan is to stub it out to assume that all hash collision
comparisons are false, and try to proceed in parallel on those other
things from there. But that is going to conflate all a ** with b ** and
suchlike, so it seems unlikely to be very usable with real data.

What do you think?


Thanks,
Roland

_______________________________________________
elfutils-devel mailing list
[email protected]
https://fedorahosted.org/mailman/listinfo/elfutils-devel
