j...@math.brown.edu writes:

 > But as you showed, it's certainly possible to do some exploration in
 > the meantime.  Prompted by your helpful comparison, I just put
 > together
 > https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674
 > to further compare hash_tuple vs. hash_incremental.
It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here.  The ideal is effectively a uniform
distribution on a circle, which has no mean.  Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² test against a uniform
distribution.  The scatter plots (of hash against test data) simply
show the same thing, without the precision of a statistical test.
(Note: do not deprecate the computer visualization + human eye + human
brain system.  It is the best known machine for detecting significant
patterns and anomalies, though YMMV.)  But it's not very good at
detecting high-dimensional patterns.

However, a uniform distribution on random data is nowhere near good
enough for a hash function.  It actually needs to be "chaotic" in the
sense that it also spreads "nearby" data out, where "nearby" here
would probably mean that, viewed as 4000-bit strings, less than m% of
the bits (for small m) differ.  In real data you usually expect a
certain amount of structure (continuity in time, clustering around a
mean, line noise in a communication channel), so you'd be likely to
get lots of "clusters" of very similar data.  You don't want them
pounding on a small subset of buckets.

 > I'm not sure if the methodology there is sound, as I'm new to
 > analysis like this.

Even decades later, starting with Knuth[1] can't hurt. :-)

 > Given sufficiently good distribution, I'd expect there to be
 > unanimous agreement that an immutable collection, which could
 > contain arbitrarily many items, should strongly prefer
 > hash_incremental(self) over hash(tuple(self)), for the same reason
 > we use generator comprehensions instead of list comprehensions when
 > appropriate.  Please correct me if I'm wrong.

I think that's a misleading analogy.
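To make the χ² suggestion concrete, here is a minimal sketch using only
the stdlib.  The bucket count, sample size, and use of random integers
as "data" are all illustrative assumptions, not a fixed methodology:

```python
# Sketch: chi-squared goodness-of-fit test of hash-bucket uniformity.
# Bucket count, sample size, and the random "data" are assumptions
# for illustration only.
import random

BUCKETS = 64
N = 64_000  # expected count per bucket is N / BUCKETS = 1000

random.seed(0)
counts = [0] * BUCKETS
for _ in range(N):
    value = random.getrandbits(64)        # stand-in for real data
    counts[hash(value) % BUCKETS] += 1    # bucket by hash

expected = N / BUCKETS
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# For a uniform hash, chi2 should be near its degrees of freedom
# (BUCKETS - 1 = 63); values far above that suggest clustering.
print(f"chi2 = {chi2:.1f} (df = {BUCKETS - 1})")
```

Note this only tests uniformity on random inputs; per the point above,
you would also want to run it on "nearby" inputs (a few bits flipped)
to probe the chaotic-spreading property.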
For very large collections where we already have the data, duplicating
the data is very expensive.  Furthermore, since the purpose of hashing
is making equality comparisons cheap, this is likely to happen in a
loop.  On the other hand, if there are going to be a *lot* of "large"
collections being stored, then they're actually not all that large
compared to the system, and you might not actually care that much
about the cost of the ephemeral tuples, because the real cost is in an
n^2 set of comparisons.  From the point of view of NumPy, this is an
"are you kidding?" argument because large datasets are its stock in
trade, but for core Python, this may be sufficiently esoteric that it
should be delegated to a third-party package.  On balance, though, the
arguments that Steven d'Aprano advanced for having a statistics module
in the stdlib vs. importing pandas apply here.

In particular, I think there are a huge number of options for an
iterative hash.  For example, Ned chained 2-tuples.  But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples.  I don't know -- and that's the point.  If we have a TOOWTDI
API like hash.from_iterable, then smart people (and people with time
on their hands to do exhaustive experiments!) can tune that over time,
as has been done with the sort API.

Another option is yielding partials, as Steven says:

 > > itertools.iterhash(iterable)  # yield incremental hashes

That's a very interesting idea, though I suspect it rarely would make
a big performance improvement.

I'm not sure I like the "hash.from_iterable" name for this API.  The
problem is that if it's not a concrete collection[3], then you throw
away the data.  If the hash algorithm happens to suck for certain
data, you'll get a lot of collisions, and you'll conclude that your
data is much more concentrated than it actually is.  I find it hard to
imagine a use case where you have large data but only care about
whether two data points are different (cf. equality comparisons for
floats).
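For readers following along, here is what the two APIs above might look
like.  Neither name exists in Python today; hash_from_iterable and
iterhash are hypothetical, and the chained-2-tuple scheme is just one of
the "huge number of options" mentioned, chosen because it runs in O(1)
space:

```python
# Hypothetical sketches of the two proposed APIs; the names and the
# chained-2-tuple scheme are illustrative assumptions, not stdlib APIs.

def hash_from_iterable(iterable):
    """Hash an iterable incrementally, without materializing a tuple."""
    h = hash(())  # fixed seed: the hash of the empty sequence
    for item in iterable:
        h = hash((h, item))  # fold each element in via a small tuple
    return h

def iterhash(iterable):
    """Yield the running hash after each element (the 'partials' idea)."""
    h = hash(())
    for item in iterable:
        h = hash((h, item))
        yield h

# Unlike hash(tuple(xs)), this never builds an ephemeral copy of xs,
# so memory stays constant even for a large input.
h_big = hash_from_iterable(range(100_000))
partials = list(iterhash([1, 2, 3]))
```

The final value yielded by iterhash is the same as hash_from_iterable's
result, so the generator form subsumes the one-shot form at the cost of
a slightly clumsier call site.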
You want to know how they're different.  So I think I would prefer to
name it "hash.from_collection" or similar.  Of course, whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.

Footnotes:
[1]  Of course I mean The Art of Computer Programming, Ch. 3.

[2]  Including the newly ordered dicts, maybe?  Somebody tweet
     @dabeaz!  What evil can he do with this?

[3]  Yeah, I know, "re-iterables".  But we don't have a definition,
     let alone an API for identifying, those Things.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/