On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to use a bloom filter for parallel hash join at some point. I've tried to keep this one as simple as possible. I think that there is a good chance that it will be usable for parallel hash join with multiple batches. You'll need to update the interface a little bit to make that work (e.g., bring-your-own-hash interface), but hopefully the changes you'll need will be limited to a few small details. > +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len) > > I think the plan is to use size_t for new stuff[1]. I'd forgotten. > This is another my_log2(), right? > > It'd be nice to replace both with fls() or flsl(), though it's > annoying to have to think about long vs int64 etc. We already use > fls() in two places and supply an implementation in src/port/fls.c for > platforms that lack it (Windows?), but not the long version. win64 longs are only 32-bits, so my_log2() would do the wrong thing for me on that platform. pow2_truncate() is provided with a number of bits as its argument, not a number of bytes (otherwise this would work). Ideally, we'd never use long integers, because its width is platform dependent, and yet it is only ever used as an alternative to int because it is wider than int. One example of where this causes trouble: logtape.c uses long ints, so external sorts can use half the temp space on win64. > +/* > + * Hash function is taken from sdbm, a public-domain reimplementation of the > + * ndbm database library. > + */ > +static uint32 > +sdbmhash(unsigned char *elem, Size len) > +{ > I see that this is used in gawk, BerkeleyDB and all over the place[2]. > Nice. I understand that this point of this is to be a hash function > that is different from our usual one, for use by k_hashes(). Right. It's only job is to be a credible hash function that isn't derivative of hash_any(). > Do you > think it belongs somewhere more common than this? It seems a bit like > our hash-related code is scattered all over the place but should be > consolidated, but I suppose that's a separate project. Unsure. In its defense, there is also a private murmurhash one-liner within tidbitmap.c. I don't mind changing this, but it's slightly odd to expose a hash function whose only job is to be completely unrelated to hash_any(). > Unnecessary braces here and elsewhere for single line body of for loops. > > +bloom_prop_bits_set(bloom_filter *filter) > +{ > + int bitset_bytes = NBITS(filter) / BITS_PER_BYTE; > + int64 bits_set = 0; > + int i; > + > + for (i = 0; i < bitset_bytes; i++) > + { > + unsigned char byte = filter->bitset[i]; > + > + while (byte) > + { > + bits_set++; > + byte &= (byte - 1); > + } > + } I don't follow what you mean here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers