On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> I have signed up as a reviewer of this patch, and I have looked at the
> bloom filter implementation for now. This is the kind of facility that
> people have asked for on this list for many years.
>
> One first thing striking me is that there is no test for this
> implementation, which would be a base stone for other things, it would
> be nice to validate that things are working properly before moving on
> with 0002, and 0001 is a feature on its own. I don't think that it
> would be complicated to have a small module in src/test/modules which
> plugs in a couple of SQL functions on top of bloomfilter.h.

0001 is just a library utility. None of the backend lib utilities
(like HyperLogLog, discrete knapsack, etc.) have dedicated test
frameworks. Coding up an SQL interface for these things is a
non-trivial project in itself. I have tested the implementation
myself, but that was done with C code. Can you tell me in detail what
you have in mind here? For example, should there be a custom datatype
that encapsulates the current state of the bloom filter? Or should
there be an aggregate function that takes a containment argument,
which is tested at the end?

> +#define MAX_HASH_FUNCS 10
> Being able to define the number of hash functions used at
> initialization would be nicer. Usually this number is kept way lower
> than the number of elements to check as part of a set, but I see no
> reason to not allow people to play with this API in a more extended
> way. You can then let your module decide what it wants to use.

The number of hash functions used is itself a function of the amount
of memory available, and an estimate of the overall size of the set
made ahead of time (in the case of amcheck, this is
pg_class.reltuples). The typical interface is that the caller either
specifies the amount of memory or the required false positive rate
(either way, they must also provide that estimate). The value of
MAX_HASH_FUNCS, 10, was chosen because we never actually use more
than that many hash functions in practice, given the limit on the
total amount of memory you can use (128MB). The formula for
determining the optimum number of hash functions is pretty standard
stuff.
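
To make "pretty standard" concrete, this is roughly what the textbook
formula looks like (a sketch only -- optimal_k and MAX_HASH_FUNCS are
the identifiers visible in the quoted hunks, but the body below is
illustrative, not the patch's code):

#include "postgres.h"

#include <math.h>

/*
 * Sketch of the textbook sizing rule, not the patch's code: the
 * optimal number of hash functions is k = (m / n) * ln(2), where m is
 * the bitset size in bits and n is the estimated number of distinct
 * elements, clamped to [1, MAX_HASH_FUNCS].
 */
static int
optimal_k(uint64 bitset_bits, int64 total_elems)
{
	int			k = (int) rint((double) bitset_bits / total_elems * 0.693147);

	return Max(1, Min(k, MAX_HASH_FUNCS));
}

For example, an 8MB bitset (2^26 bits) with a 10-million-element
estimate gives k = rint(6.71 * ln 2) = 5.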

> + * work_mem is sized in KB, in line with the general convention.
> In what is that a general convention? Using bytes would be more
> intuitive IMO.. Still I think that this could be removed, see below
> points.

Both tuplesort and tuplestore do this. They are infrastructure that
is passed work_mem or maintenance_work_mem by convention, and those
are sized in KB.

> +/*
> + * Hash function is taken from sdbm, a public-domain reimplementation
> + * of the ndbm database library.
> + */
> Reference link?

http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf

I'm probably going to end up using murmurhash32() instead of the sdbm
hash function anyway, now that Andres has exposed it in a header
file. So this probably won't matter in the next version.

> +     bitset_bytes = bitset_bits / BITS_PER_BYTE;
> +     filter->bitset = palloc0(bitset_bytes);
> +     filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
> +     filter->seed = seed;
> I think that doing the allocation within the initialization phase is a
> mistake. Being able to use a DSA would be nice, but I predict as well
> cases where a module may want to have a bloom filter that persists as
> well across multiple transactions, so the allocation should be able to
> live across memory contexts.

Why not just switch memory contexts before calling? Again, other
comparable utilities don't provide this in their interface. As for
DSM, I think that can come later, and can be written by somebody
closer to that problem. There can be more than one initialization
function.
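
In other words, something like this at the call site (a sketch; the
bloom_init() argument list here is my assumption, pieced together from
the quoted hunk, rather than the patch's actual signature):

/*
 * Sketch: allocate the filter in a long-lived context chosen by the
 * caller, so that it can survive across transactions. The
 * bloom_filter type name and bloom_init() signature are assumed for
 * illustration.
 */
MemoryContext oldcxt;
bloom_filter *filter;

oldcxt = MemoryContextSwitchTo(TopMemoryContext);
filter = bloom_init(total_elems, work_mem, seed);
MemoryContextSwitchTo(oldcxt);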

> What I think you should do instead to make this bloom implementation
> more modular is to let the caller give a pointer to a memory area as
> well as its size. Then what bloom_init should do is to just initialize
> this area of memory with zeros. This approach would give a lot of
> freedom. Not linking a bloom definition to work_mem would be nice as
> well.

The implementation is probably always going to be bound in size by
work_mem in practice, just like tuplesort and tuplestore. I would say
that's a natural fit.

> +     hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
> I am wondering if it would make sense to be able to enforce the hash
> function being used. The default does not look bad to me though, so we
> could live with that.

I prefer to keep things simple. I'm not aware of any use case that
calls for the user to supply a custom hash function. That said, I
could believe that someone would want to pass in their own hash value
for each bloom_add_element(), when they have one close at hand anyway
-- much like addHyperLogLog(). Again, that seems like work for the
ultimate consumer of that functionality. It's a trivial tweak that can
happen later.

--
Peter Geoghegan