On 1 May 2017 at 20:46, Robert Haas <robertmh...@gmail.com> wrote:
> One problem is that Bloom filters assume you can get n independent
> hash functions for a given value, which we have not got. That problem
> would need to be solved somehow. If you only have one hash function,
> the size of the required bloom filter probably gets very large.
There's a simple formula for the optimal number of hash functions and
filter size given a target false-positive rate. But I don't think this
is as big a problem as you imagine.

a) We don't really have only one hash function; we have a 32-bit hash
function, and we could expand that to a larger bit size if we wanted.
Bloom filters are never 2^32-bit arrays, for obvious reasons. A 1-kbit
Bloom filter only needs 10 bits per index, so a 32-bit hash already
gives you three fully independent hash functions. If we changed to a
64-bit or 128-bit hash function, you would have enough bits for a
larger set of hash functions and a larger array.

b) You can get a poor man's universal hash out of hash_any or hash_int
by tweaking the input value in a way that doesn't interact in a simple
way with the hash function. Even something as simple as xoring it with
a random number (i.e. a vector of random numbers that identify your
randomly chosen distinct "hash functions") seems to work fine.

However, for future-proofing and security hardening, I think Postgres
should really implement a mathematically rigorous universal hashing
scheme, which provides a family of hash functions to pick from
randomly. That would prevent users from generating data that
intentionally performs poorly in hash data structures, for example.
But it also means you have a whole family of hash functions to pick
from for Bloom filters.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
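[Editor's note: the "simple formula" mentioned at the top of the reply is the standard Bloom filter sizing formula. A minimal Python sketch, with a hypothetical helper name `bloom_params` that is not from any Postgres code:]

```python
import math

def bloom_params(n, p):
    """Standard Bloom filter formulas: for n expected elements and a
    target false-positive rate p, the optimal bit-array size is
    m = -n * ln(p) / (ln 2)^2 and the optimal number of hash
    functions is k = (m / n) * ln 2."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# For example, 1000 elements at a 1% false-positive target:
m, k = bloom_params(1000, 0.01)  # roughly 9.6 kbits and 7 hash functions
```

Note that k is small (single digits) for practical false-positive rates, which is why only a handful of independent hash functions are ever needed.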
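[Editor's note: point (a) above — carving several independent filter indices out of one wide hash — can be sketched as follows. This is an illustrative Python sketch with a made-up function name, not Postgres code:]

```python
def split_hash(h32, index_bits=10, k=3):
    """Treat one 32-bit hash value as k independent index_bits-wide
    Bloom filter indices. A 1-kbit (1024-bit) filter needs 10-bit
    indices, so a single 32-bit hash supplies three of them."""
    assert k * index_bits <= 32, "not enough hash bits for k indices"
    mask = (1 << index_bits) - 1
    return [(h32 >> (i * index_bits)) & mask for i in range(k)]

# The low, middle, and high 10-bit slices become three indices:
split_hash(0x3FF)  # -> [1023, 0, 0]
```

With a 64- or 128-bit hash the same slicing yields more indices, or wider ones for a larger bit array, exactly as the post suggests.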
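[Editor's note: the "mathematically rigorous universal hashing scheme" the post calls for is commonly the Carter–Wegman construction h_{a,b}(x) = ((a*x + b) mod p) mod m, where p is a prime larger than any key and (a, b) are drawn at random to pick one member of the family. A hypothetical Python sketch, not an actual Postgres implementation:]

```python
import random

class UniversalHashFamily:
    """One randomly chosen member of the Carter-Wegman universal
    family h_{a,b}(x) = ((a*x + b) mod p) mod m. Drawing fresh (a, b)
    pairs gives the independent hash functions a Bloom filter needs,
    and an adversary who cannot see (a, b) cannot precompute
    pathological inputs."""

    P = (1 << 61) - 1  # a Mersenne prime comfortably above 32-bit keys

    def __init__(self, m, rng=None):
        rng = rng if rng is not None else random.Random()
        self.m = m
        self.a = rng.randrange(1, self.P)  # a must be nonzero
        self.b = rng.randrange(0, self.P)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.P) % self.m
```

The xor-with-a-random-salt trick in point (b) is a cheap approximation of the same idea; the multiply-add-mod-prime family is the version with actual pairwise-independence guarantees.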