On Fri, Dec 30, 2016 at 10:30 PM, Nathaniel Smith <n...@pobox.com> wrote:
> ... "Most hash schemes depend on having a "good" hash function, in the
> sense of simulating randomness. Python doesn't ..."
> https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133

Thanks for that link, fascinating to read the rest of that comment! By the way, the part you quoted reads more like a defense of what follows, i.e. the choice to make hash(some_int) == some_int. I'm not sure how much it applies generally. For example, the https://docs.python.org/3/reference/datamodel.html#object.__hash__ docs don't say, "Don't worry about your __hash__ implementation; dict's collision resolution strategy is so good it probably doesn't matter anyway." But they also don't discuss any of the tradeoffs you mentioned that might be worth considering.

What should you do when there are arbitrarily many "components of the object that play a part in comparison of objects"? The hash((self._name, self._nick, self._color)) snippet is the only example the docs give. That leaves people with use cases like mine wondering whether it's still advised to scale this up to the arbitrarily many items that instances of their class can contain. If not, then what is advised? Passing a tuple of fewer items to a single hash() call, e.g. hash(tuple(islice(self, CUTOFF)))? Ned's recipe of pairwise-accumulating hash() results over all the items? Pairwise-accumulating only up to a certain cutoff? Stephen J. Turnbull's idea of using fewer accumulation steps and larger-than-2-tuples? Passing all the items into some other cheap, built-in hash algorithm that actually has an incremental update API (crc32?)?

Still hoping someone can give some authoritative advice, and I hope it's still reasonable to be asking. If not, I'll cut it out.

On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <ros...@gmail.com> wrote:
> How likely is it that you'll have this form of collision, rather than
> some other? Remember, collisions *will* happen, so don't try to
> eliminate them all; just try to minimize the chances of *too many*
> collisions. So if you're going to be frequently dealing with (1,2,3)
> and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about
> order; but otherwise, one possible cause of a collision is no worse
> than any other. Keep your algorithm simple, and don't sweat the
> details that you aren't sure matter.

In the context of writing a collections library rather than an application, the code needs to work well across the diversity of workloads users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
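For concreteness, here's a sketch of the three candidate strategies I mentioned above. This is just illustration, not a recommendation: the function and names (hash_prefix, hash_pairwise, hash_crc32, CUTOFF) are hypothetical, and the crc32 variant assumes repr() is a faithful proxy for item equality, which won't hold for every type.

```python
from itertools import islice
from zlib import crc32

CUTOFF = 16  # hypothetical cutoff; would need tuning for real workloads

def hash_prefix(items):
    # Strategy 1: hash a tuple of only the first CUTOFF items.
    # O(CUTOFF) work, but items beyond the cutoff never affect the hash.
    return hash(tuple(islice(items, CUTOFF)))

def hash_pairwise(items):
    # Strategy 2: pairwise-accumulate hash() over all the items
    # (the shape of Ned's recipe). O(n) work, order-sensitive,
    # every item contributes.
    h = 0
    for item in items:
        h = hash((h, item))
    return h

def hash_crc32(items):
    # Strategy 3: feed each item's repr bytes into an incremental
    # checksum. zlib.crc32 accepts a running value, so this equals
    # the CRC of the concatenated byte stream. Note crc32 is cheap
    # but weak, and this is only valid if equal items always have
    # equal reprs.
    h = 0
    for item in items:
        h = crc32(repr(item).encode(), h)
    return h
```

The tradeoff Chris is pointing at shows up directly here: hash_prefix is the "don't sweat it" option and collides for any two containers sharing a prefix, while the other two pay O(n) to let every item, and item order, influence the result.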
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/