On Wed, 2010-10-27 at 10:44 -0400, Roland McGrath wrote: > That will indeed be horribly inefficient. But, in nontrivial cases, the > whole plan of hash=0 for circular references is problematically inefficient > too. So this suggests reconsidering that whole idea in the algorithm. > This is the sort of thing I wanted your fresh brainpower on. > > Consider a <pointer_type type=.../> DIE. There are zillions of those > All of those that are part of a circular reference chain hash as the same. > (See e.g. the --stats output on any C++ case, where there are these and > also zillions of the equivalent reference_type cases.) > > So we wind up comparing against all those in the same hash bucket, which is > far from optimal. The best the current (buggy) plan gets is to do negative > caching of each individual comparison. But even if all that caching code > were perfectly correct, it stills winds up being n separate hash lookups > to do the n "quick" comparisons. So this is not really an adequate plan.
So the problem is the hash of a reference attribute not being "stable" in the face of circularity. The dirty solution of just always using a hash of zero for a reference makes things indeed a lot worse, too many things hash together to be practical. I tried to hack together a "solution" that used the referent's non-reference attributes and tag. That showed a bug which looks suspiciously like what we tried to debug before: https://fedorahosted.org/pipermail/elfutils-devel/2010-September/001584.html So I will try to get to the bottom of that first (somewhere when we try to finalize a die the attributes get misplaced). But in practice this might also not be enough to create very distinct hashes. Another idea would be to use not use a "perfect hash" for reference attributes. Allow some equal reference attributes to not hash the same. We could use the identity of the originally copied die we are referring too as hash for the reference attribute. Since we do still have that in the die_info_pair. This should be unique at least inside a CU, but not across CUs in the same file. So then we would miss marking reference attributes equals if they point to dies that are equal across CUs. So unless we "fix that up" in some later pass that also doesn't seem practical. At least I couldn't come up with a good idea for doing a pass over the result that would merge things across CUs. It seems that cannot be a simple pass over the dwarf_output result since merging two reference attributes might make other references that point to dies containing these reference attributes equal, etc. The nice thing about most of the hashing done in dwarf_output is that it "uniqifies" value/die groups and so creates a perfect unique hash. This can be done with ordered sets of values or die trees. But the reference attributes introduce arbitrary (possibly circular) graphs between elements. We get in trouble because we want to cut the circularity, but don't know "the start of the circle". So we pick an arbitrary attribute reference target and declare that as being the circular node. Then we assign a different hash code to a reference to such a circular node then we normally would do. Normally we would like to assign the "unique identifier object" for that referred to die. But we cannot in the circular case since we are still constructing that. So the problem comes from including the hash of all reference attributes in the construction of the "unique identifier object" for a attribute reference that points to a die. That means that references inside circular structures cannot really be completed since during the construction of the "unique identifier object" of reference to that circular structure itself it can never be fully resolved. So it seems that we should declare two "resolved" states for DIEs. One "tree resolved" that takes into account just the basic attributes plus all the children "tree resolved". And another "fully resolved" that all attributes, including reference values into account, plus "fully resolved" children. A reference value then only needs its referent DIE to be "tree resolved" to create the "unique identifier object" for it. That "breaks the circle" at the start of a reference attribute value so you don't need to track circularity anymore. The disadvantage is that you end up with two "pending queues" which seems to effectively mean you will scan the whole DIE tree twice (or at least calculate the hash over a DIE twice, without and then with taking reference values into account). And since some "tree resolved" DIEs might have the same hash, but are not equal if you take the value references into account, you don't have perfect hashes for the attribute reference values. But they only collide when referring to the same structural trees with all basic attribute in all included DIEs matching completely. Cheers, Mark _______________________________________________ elfutils-devel mailing list [email protected] https://fedorahosted.org/mailman/listinfo/elfutils-devel
