> Right, so for consistency ignoring the reference attribute types
> completely seems simplest. Although I admit I hadn't realized there
> were also the canonicalized attribute map hacks.
It was just an idea I mentioned in the prior email, and it's nonobvious
exactly how to implement it in the niggling C++ details.

> To my surprise most DIEs actually do have reference attributes. A quick
> random sampling of some test binaries showed ~2/3 of all DIEs have one.
> Only ~1/3 of the DIEs are pure basic attributes. Still I think it makes
> sense to finalize these immediately. Since for the second pass location
> in the tree doesn't matter, we can then just take the list of ~2/3rd
> pending DIEs and walk over them directly.

A further idea I didn't mention might merit consideration. That is,
also finalize in the first pass a DIE that has only references to final
DIEs. That catches all pure backward references, which might be most of
the common cases not involving cycles. (It depends mostly on whether
the compiler emits a <pointer_type type="#refN"/> after refN or before
it.) These are the cases that are finalized quickly in the current
reference-chasing scheme, i.e., when chasing references finds only a
DIE you already finalized, then you finalize right there without
recursing. But we should be able to do this change easily after we
prove your basic scheme. So let's not try to roll in any more
complexity in the first cut.

> hmmm, ** void and ** char might occur frequently. So one extra pass to
> catch those might be in order if that matters.

Let's start with your plan as it is, and then we can see how many hash
collisions we are getting in real-world data.

> We could even chase the whole reference chain if we could prove it
> always terminates in some basic-attributes-only DIE.

I guess it's relatively easy to be sure when it has no cycles at all.

> My suspicion is that the cycles only occur because we chase references
> and children at the same time. But I don't think the dwarf format
> guarantees it, it just looks to be true for all sane type systems I
> can imagine (where a reference type cannot directly (recursively)
> point to itself, only through children of structure types). So I don't
> think we can rely on that in general.

I concur on both points. There is certainly no theoretical limitation
like this in the structure of DWARF. But I can't think off hand of any
case where an all-attributes cycle like that comes up in actual
practice.

> > By "complete", I mean that it does guarantee perfect detection of
> > all duplicates.
>
> I think it does guarantee that. Each duplicate DIE hashes to the same
> value.

I wasn't sure how to think about that in the presence of cycles. But I
guess it's fairly simple, in that, for this purpose, a cycle is
equivalent to an infinite chain of references to identical copies.

> > By "sound" I mean that, if the [...]
>
> It doesn't guarantee that. Although I think in practice the chances of
> that happening are low.

Ok. Indeed, the chances seem low off hand. And moreover, we will look
at the statistics on a large corpus of data, and if we do have any
collisions at all, we will tweak our hash functions to try to make them
more rare.

> If we could prove (or maybe dwarflint could check, but as said above I
> don't think the dwarf spec guarantees it) that all reference attribute
> chains (ignoring DIE children) are really DAGs (directed acyclic
> graphs), then we could make the second pass hash unique too, which
> would give us the guarantee theoretically, I think.

We could make dwarflint complain about this, and that might be
interesting, but it's only ever going to be an "unexpected" DWARF
construction, not something that is technically invalid in the
low-level "well-formed DWARF tree" sense. So we wouldn't want
dwarf_output to assume it.

> > Also, do you think you can calculate the formal algorithmic
> > complexity?
> I believe worst case (ignoring hash collisions) you have to walk over
> the whole DIE tree once for each pass. So it should just be O(p * n),
> where p is the number of passes and n is the number of DIEs in all
> CUs.

Since p is always going to be a constant (I assume), that's still just
O(n) for the formal complexity. We should have the algorithm write-up
and the calculations about this stuff somewhere on the wiki
(eventually).

> That assumes all partial results we store per DIE/attribute in
> temporary tables are just constant lookups.

I think that's true.

> We will have to compare the two DIEs completely, including children
> and references. The hope is that the multi-pass hashing is pretty fast
> itself, and strong enough to make that happen rarely.

Ok. In a certain sense that is good. That is, it means we still need a
decent dwarf_comparator implementation just for dwarf_output
correctness, not only for dwarfcmp. So in a perverse way that sort of
justifies more effort spent on that implementation, since it is not
just a testing thing.

> Since there will be hash collisions we have to do a complete
> comparison in such cases to make sure it is sound. But that full
> comparison can be "dumb" and just do the comparison one-on-one for
> each of the DIE's children and reference attributes. With dwarfcmp we
> would like to catch identical DIEs which might have equal, but
> different length, reference chains. We can ignore those tricky cases
> where we just have two concrete DIEs which happen to hash together.
> Just proving that they are structurally completely identical or not is
> enough to be sound (it just means we aren't complete in the sense that
> two identical DIEs which hash together but have different, but
> semantically identical, circular reference chains won't actually be
> recognized as truly equal).

Ah, I see.
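As a concreteness check on the O(p * n) shape of this, here is a toy
sketch of the multi-pass hashing idea. Everything here is invented for
illustration (the `die` struct, `multipass_hash`, the mixing constant);
it is not the real dwarf_output code. Pass 0 hashes only the basic
attributes, and each later pass mixes in the referents' previous-pass
hashes, so each pass is one linear walk, and a cycle behaves exactly
like an infinite chain of references to identical copies: structurally
identical DIEs can never hash apart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a DIE: basic-attribute data plus reference
// attributes given as indices into the owning vector.
struct die
{
  size_t attrs;			// stand-in for the basic-attribute hash
  std::vector<size_t> refs;	// referent indices
};

// p passes over n DIEs: O(p * n) total, ignoring hash collisions.
static std::vector<size_t>
multipass_hash (const std::vector<die> &dies, unsigned passes)
{
  std::vector<size_t> h (dies.size ());
  for (size_t i = 0; i < dies.size (); ++i)
    h[i] = dies[i].attrs;	// pass 0: basic attributes only

  for (unsigned p = 0; p < passes; ++p)
    {
      std::vector<size_t> next (dies.size ());
      for (size_t i = 0; i < dies.size (); ++i)
	{
	  next[i] = h[i];
	  // Mix in each referent's previous-pass hash, so pass k sees
	  // the referent as it looked at pass k-1.
	  for (size_t r : dies[i].refs)
	    next[i] = next[i] * 0x9e3779b97f4a7c15ULL ^ h[r];
	}
      h = next;
    }
  return h;
}
```

Note that this only illustrates the "complete" direction: duplicates,
even ones lying on cycles, always get the same hash. Nothing stops two
different DIEs from colliding, which is exactly why a full comparison
on hash matches is still needed.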
Since hash collisions are possible, we must always do at least one
"complete comparison" to match a duplicate with the existing one
already in the collector. That being so, we want that comparison to be
quite efficient. To start with, we could make that punt on
non-identical cycles, as you say.

But we do want to detect those strange equivalent-cycles cases for
dwarfcmp anyway, and we'll want that really quite soon. We already know
from our experience with real-world data that in some cases the
duplicate-reduced tree and the original tree will differ in exactly
this way. By our testing plan, we won't declare dwarf_output "believed
ready for prime time" until dwarfcmp has declared reduced trees
equal-enough to original trees for everything in the corpus.

One simplifying approach would be to use the same dwarf_ref_tracker
implementation in dwarfcmp and for the dwarf_output comparison. That
would let us just work on dwarfcmp and get it thoroughly right for the
difficult cases. However, we want the dwarf_output incarnation of this
comparison to be as efficient as possible, in both computation and
storage. The reason I tried to fold the tracker work into the copier
data structures in the existing code was for this efficiency. So it's
not entirely clear how to proceed on the comparison code. That is, how
to proceed with the code once your implementation of the new hashing
scheme is ready.

I thought I had a correct algorithm for equivalent cycles in
dwarf_ref_tracker before I tried to make it more efficient with the
caching plan that introduced correctness problems. But the last attempt
at using this (i.e. the dwarf branch code as it is now, with
dwarf_ref_tracker::notice_match commented out) seemed to produce a
situation where plain dwarfcmp was not merely way too slow on complex
cyclic cases, but actually did not terminate.
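For the "dumb but sound" shape of that comparison, something like the
following toy sketch is what I have in mind (invented types and names,
not the real dwarf_comparator or dwarf_ref_tracker): recurse one-on-one
over reference attributes, and when the same node pair is revisited,
treat the cycle as having closed consistently. Identical cycles then
compare equal, any structural difference is hit as a local mismatch
first, and the walk terminates because each pair is entered at most
once.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy stand-in for a DIE: basic-attribute data plus positional
// reference attributes (indices into the owning vector).
struct node
{
  size_t attr;
  std::vector<size_t> refs;
};

// One-on-one structural comparison.  A revisited (i, j) pair means the
// reference chains closed consistently, so equality is assumed there;
// a false result anywhere aborts the whole comparison immediately.
static bool
same (const std::vector<node> &a, size_t i,
      const std::vector<node> &b, size_t j,
      std::set<std::pair<size_t, size_t>> &seen)
{
  if (a[i].attr != b[j].attr || a[i].refs.size () != b[j].refs.size ())
    return false;
  if (!seen.insert (std::make_pair (i, j)).second)
    return true;		// cycle closed on an already-walked pair
  for (size_t k = 0; k < a[i].refs.size (); ++k)
    if (!same (a, a[i].refs[k], b, b[j].refs[k], seen))
      return false;
  return true;
}

static bool
same_die (const std::vector<node> &a, size_t i,
	  const std::vector<node> &b, size_t j)
{
  std::set<std::pair<size_t, size_t>> seen;
  return same (a, i, b, j, seen);
}
```

Because the comparison is positional, equivalent cycles that are shaped
differently can still be missed; that is the incompleteness we are
deliberately accepting for the first cut.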
We probably need to start over on the thinking for the reference
equality algorithm there and prove to ourselves that we think it's
correct, even without worrying about efficiency. If we get something
for plain dwarfcmp that has no false positives and doesn't take hours
to terminate, that is a starting place. I think the no-false-positives
part is relatively easy, if we don't do any kind of caching during
recursions. It's unclear whether that will yield something with even
vaguely usable performance. (This would be something that still has
false negatives, for cycles that are equivalent but not identical.)

With that, we could start by just using dwarf_ref_tracker directly for
the dwarf_output hash collision comparisons. That is, the copier holds
a dwarf_ref_tracker object, and gives it walk/step calls in lock-step
with the copier's own walk. Then the comparison for the hash table just
uses a dwarf_comparator instantiated with that tracker object. That
will be correct enough for the purpose of comparison in the first cut.
It just has some computation and storage overhead of maintaining
dwarf_ref_tracker's tables, but that's mostly going to be memory bloat.

From there, we can reduce the storage bloat by folding the data
structures for comparison into what we already have for copying. I'm
assuming that's a significant win, but I'm not really sure it is.

To keep it saner, we could try to refactor the dwarf_ref_tracker
implementation to separate the storage of the bookkeeping from the
cycle-detection algorithm itself. Then dwarf_output::copier would be
replacing only the storage portion by integrating that with its own
data structures, rather than also reimplementing the algorithm
separately there too.

Then we'd have something that is incomplete but correct for duplicate
detection, as you described above.
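One way that storage/algorithm split might be shaped (purely
illustrative; these names and interfaces are invented, not the real
dwarf_ref_tracker): parameterize the walk logic on a storage policy, so
dwarf_output::copier could substitute a storage class backed by its own
tables while plain dwarfcmp keeps a self-contained one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Default, self-contained bookkeeping: the set of node pairs whose
// comparison is currently in progress.
struct set_storage
{
  std::set<std::pair<size_t, size_t>> walking;
  bool enter (size_t a, size_t b)	// false if already being walked
  { return walking.insert (std::make_pair (a, b)).second; }
  void leave (size_t a, size_t b)
  { walking.erase (std::make_pair (a, b)); }
};

// The algorithm half: knows only the walk discipline, not how the
// in-progress pairs are stored.  A copier-specific storage class with
// the same enter/leave shape could be dropped in instead.
template <typename storage>
struct ref_tracker
{
  storage store;

  template <typename step>
  bool walk (size_t a, size_t b, step &&compare)
  {
    if (!store.enter (a, b))
      return true;		// cycle closed consistently: assume equal
    bool ok = compare (a, b);
    store.leave (a, b);
    return ok;
  }
};
```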
That's enough to bang on the whole dwarf_output part with real-world
data, get some stats about the hash collisions, tune hash functions,
and try out the things about early finalization and so forth. It should
also be enough for the actual writer code, imported_unit generation,
and so forth, to be developed in parallel with refinements to
hashing/copying and comparison. It doesn't seem like any of that
refinement would change the aspects of the data structures that are
important to the writer side.

This plan of attack hinges on the "simple enough to be sure it's
correct" comparison implementation not being unusably slow on our
real-world data. I'm not really very sure about that at all. An
alternative plan is to stub it out to assume that all hash collision
comparisons are false, and try to proceed in parallel on those other
things from there. But that is going to conflate all a ** with b ** and
suchlike, so it seems unlikely to be very usable with real data.

What do you think?


Thanks,
Roland

_______________________________________________
elfutils-devel mailing list
[email protected]
https://fedorahosted.org/mailman/listinfo/elfutils-devel
