On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

> >While playing around with the numbers I had an epiphany: At the
> >defaults, the filter already takes up ~4.3kB, over half the page.
> >There is no room for another tuple, so if we're only indexing one
> >column, we might as well take up the whole page.
>
> Hmm, yeah. I may be wrong but IIRC indexes don't support external
> storage but compression is still allowed. So even if those defaults are
> a bit higher than needed that should make the bloom filters a bit more
> compressible, and thus fit multiple BRIN tuples on a single page.

> Not sure about how much we want to rely on these optimizations, though,
> considering multi-column indexes kinda break this.

Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With the default p, that's about 1.1 kB per BRIN tuple, so one
BRIN page can index 224 heap pages, much more than with the default 128.

Hmm, how ugly would it be to change the default range size depending
on the opclass?

If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.

There is already this item from the brin readme:

* Different-size page ranges?
  In the current design, each "index entry" in a BRIN index covers the same
  number of pages.  There's no hard reason for this; it might make sense to
  allow the index to self-tune so that some index entries cover smaller page
  ranges, if this allows the summary values to be more compact.  This would incur
  larger BRIN overhead for the index itself, but might allow better pruning of
  page ranges during scan.  In the limit of one index tuple per page, the index
  itself would occupy too much space, even though we would be able to skip
  reading the most heap pages, because the summary values are tight; in the
  opposite limit of a single tuple that summarizes the whole table, we wouldn't
  be able to prune anything even though the index is very small.  This can
  probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.

Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.

> >We could also generate the primes via a sieve instead, which is really
> >fast (and more code). That would be more robust, but that would require
> >the filter to store the actual primes used, so 20 more bytes at max k =
> >10. We could hard-code space for that, or to keep from hard-coding
> >maximum k and thus lowest possible false positive rate, we'd need more
> >bookkeeping.
> >
>
> I don't think the efficiency of this code matters too much - it's only
> used once when creating the index, so the simpler the better. Certainly
> for now, while testing the partitioning approach.

To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity, so I've done it that way.

> >So, with the partition approach, we'd likely have to set in stone
> >either max nbits, or min target false positive rate. The latter option
> >seems more principled, not only for the block size, but also since the
> >target fp rate is already fixed by the reloption, and as I've
> >demonstrated, we can often go above and beyond the reloption even
> >without changing k.
> >
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

 n     k   m      p
 928    7   8895  0.01
 928   10  13343  0.001   (lowest p supported in patch set)
 928   13  17790  0.0001
 928   10  18280  0.0001  (still works with lower k, needs higher m)
 928   10  17790  0.00012 (even keeping m from row #3, capping k doesn't
                           degrade p much)

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of
primes variable length, which could be done by putting the filter
offset calculation into a macro. But having two variable-length arrays
seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
> >bits, but we can't set/read off the end of the filter. If we don't
> >wrap, we could just skip reading/writing that bit. So a tiny portion of
> >access would be at k - 1. The paper is not clear on what to do here,
> >but they are explicit in minimizing the absolute value, which could go
> >on either side.
> >
>
> What I meant is that is that the paper says this:
>
>      Given a planned overall length mp for a Bloom filter, we usually
>      cannot get k prime numbers to make their sum mf to be exactly mp. As
>      long as the difference between mp and mf is small enough, it neither
>      causes any trouble for the software implementation nor noticeably
>      shifts the false positive ratio.
>
> Which I think means we can pick mp, generate k primes with sum mf close
> to mp, and just use that with mf bits.

Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
Yeah, we can bump it up as well for the other case. I've done it that
way.

> >+ add_local_real_reloption(relopts, "false_positive_rate",
> >+ "desired false-positive rate for the bloom filters",
> >+ BLOOM_DEFAULT_FALSE_POSITIVE_RATE, 0.001, 1.0,
> >+ offsetof(BloomOptions, falsePositiveRate));
> >
> >When I set fp to 1.0, the reloption code is okay with that, but then
> >later the assertion gets triggered.
> >
>
> Hmm, yeah. I wonder what to do about that, considering indexes with fp
> 1.0 are essentially useless.

Not just useless -- they're degenerate: since log(1.0) = 0, the sizing
formula gives m = 0 and hence k = 0, so we cannot accept this value
from the user. Looking upthread, 0.1 was suggested as a limit. That
might be a good starting point.

This is interesting work! Having gone this far, I'm going to turn more
attention to the multi-minmax patch and actually start performance
testing.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin_bloom.c b/src/backend/access/brin/brin_bloom.c
index 03e9d4b713..5dc213727b 100644
--- a/src/backend/access/brin/brin_bloom.c
+++ b/src/backend/access/brin/brin_bloom.c
@@ -203,6 +203,9 @@ typedef struct BloomOptions
  */
 #define                BLOOM_DEFAULT_FALSE_POSITIVE_RATE       0.01    /* 1% fp rate */
 
+/* With the minimum allowed false positive rate of 0.001, we need up to 10 hashes */
+#define                BLOOM_MAX_NUM_PARTITIONS        10
+
 #define BloomGetNDistinctPerRange(opts) \
        ((opts) && (((BloomOptions *) (opts))->nDistinctPerRange != 0) ? \
         (((BloomOptions *) (opts))->nDistinctPerRange) : \
@@ -270,6 +273,7 @@ typedef struct BloomFilter
        uint8   nhashes;        /* number of hash functions */
        uint32  nbits;          /* number of bits in the bitmap (size) */
        uint32  nbits_set;      /* number of bits set to 1 */
+       uint16  partlens[BLOOM_MAX_NUM_PARTITIONS];     /* partition lengths */
 
        /* data of the bloom filter (used both for sorted and hashed phase) */
        char    data[FLEXIBLE_ARRAY_MEMBER];
@@ -279,6 +283,114 @@ typedef struct BloomFilter
 static BloomFilter *bloom_switch_to_hashing(BloomFilter *filter);
 
 
+/*
+ * generate_primes
+ *             returns array of all primes less than limit
+ *
+ * WIP: very naive prime sieve; could be optimized using segmented ranges
+ */
+static uint16 *
+generate_primes(int limit)
+{
+       /* upper bound of number of primes below limit */
+       /* WIP: reference for this number */
+       int numprimes = 1.26 * limit / log(limit);
+
+       bool *is_composite = (bool *) palloc0(limit * sizeof(bool));
+       uint16 *primes = (uint16 *) palloc0(numprimes * sizeof(uint16));
+
+       int maxfactor = floor(sqrt(limit));
+       int factor = 2; /* first prime */
+
+       /* mark the sieve where the index is composite */
+       while (factor < maxfactor)
+       {
+               for (int i = factor * factor; i < limit; i += factor)
+                        is_composite[i] = true;
+               do { factor++; } while (is_composite[factor]);
+       }
+
+       /* the unmarked numbers are prime, so copy over */
+       for (int i = 2, j = 0; i < limit && j < numprimes; i++)
+       {
+               if (!is_composite[i])
+                       primes[j++] = i;
+       }
+
+       /* there should still be some zeroes at the end, but make sure */
+       primes[numprimes - 1] = 0;
+
+       /* pretty large, so free it now (segmented ranges would make it smaller) */
+       pfree(is_composite);
+       return primes;
+}
+
+/*
+ * set_bloom_partitions
+ *             Calculate k moduli for one-hashing bloom filter.
+ *
+ * Find consecutive primes whose sum is close to nbits and
+ * return the sum. Copy the primes to the filter to use as
+ * partition lengths.
+ * WIP: one-hashing bf paper ref somewhere
+ */
+static uint32
+set_bloom_partitions(int nbits, uint8 nhashes, uint16 *partlens)
+{
+       int             min, diff, incr;
+       int             pidx = 0;
+       int             sum = 0;
+       int             target_partlen = nbits / nhashes;
+
+       /*
+        * Increase the limit to ensure we have some primes higher than
+        * the target partition length. The 100 value is arbitrary, but
+        * should be well over what we need.
+        */
+       uint16 *primes = generate_primes(target_partlen + 100);
+
+       /*
+        * In our array of primes, find a sequence of length nhashes, whose
+        * last item is close to our target partition length. The end of the
+        * array will be filled with zeros, so we need to guard against that.
+        */
+       while (primes[pidx + nhashes - 1] <= target_partlen &&
+                  primes[pidx] > 0)
+               pidx++;
+
+       for (int i = 0; i < nhashes; i++)
+               sum += primes[pidx + i];
+
+       /*
+        * Since all the primes are less than or equal the desired partition
+        * length, the sum is somewhat less than nbits. Increment the starting
+        * point until we find the sequence of primes whose sum is closest to
+        * nbits. It doesn't matter whether it's higher or lower.
+        */
+       min = abs(nbits - sum);
+       for (;;)
+       {
+               incr = primes[pidx + nhashes] - primes[pidx];
+               diff = abs(nbits - (sum + incr));
+               if (diff >= min)
+                       break;
+
+               min = diff;
+               sum += incr;
+               pidx++;
+       }
+
+       memcpy(partlens, &primes[pidx], nhashes * sizeof(uint16));
+
+       /* WIP: assuming it's not important to pfree primes */
+
+       /*
+        * The actual filter length will be the sum of the partition lengths
+        * rounded up to the nearest byte.
+        */
+       return (uint32) ((sum + 7) / 8) * 8;
+}
+
 /*
  * bloom_init
  *             Initialize the Bloom Filter, allocate all the memory.
@@ -301,15 +413,13 @@ bloom_init(int ndistinct, double false_positive_rate)
 
        m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 / (pow(2.0, log(2.0)))));
 
-       /* round m to whole bytes */
-       m = ((m + 7) / 8) * 8;
-
        /*
         * round(log(2.0) * m / ndistinct), but assume round() may not be
         * available on Windows
         */
        k = log(2.0) * m / ndistinct;
        k = (k - floor(k) >= 0.5) ? ceil(k) : floor(k);
+       k = Min(k, BLOOM_MAX_NUM_PARTITIONS);
 
        /*
         * Allocate the bloom filter with a minimum size 64B (about 40B in the
@@ -323,7 +433,9 @@ bloom_init(int ndistinct, double false_positive_rate)
 
        filter->flags = 0;      /* implies SORTED phase */
        filter->nhashes = (int) k;
-       filter->nbits = m;
+
+       /* calculate the partition lengths and adjust m to match */
+       filter->nbits = set_bloom_partitions(m, k, filter->partlens);
 
        SET_VARSIZE(filter, len);
 
@@ -445,7 +557,7 @@ static BloomFilter *
 bloom_add_value(BloomFilter *filter, uint32 value, bool *updated)
 {
        int             i;
-       uint32  big_h, h, d;
+       int             part_boundary = 0;
 
        /* assume 'not updated' by default */
        Assert(filter);
@@ -514,17 +626,16 @@ bloom_add_value(BloomFilter *filter, uint32 value, bool *updated)
        /* we better be in the hashing phase */
        Assert(BLOOM_IS_HASHED(filter));
 
-       /* compute the hashes, used for the bloom filter */
-       big_h = ((uint32) DatumGetInt64(hash_uint32(value)));
-
-       h = big_h % filter->nbits;
-       d = big_h % (filter->nbits - 1);
-
        /* compute the requested number of hashes */
        for (i = 0; i < filter->nhashes; i++)
        {
-               int byte = (h / 8);
-               int bit  = (h % 8);
+               int partlen = filter->partlens[i];
+               int bitloc = part_boundary + (value % partlen);
+
+               Assert(bitloc < filter->nbits);
+
+               int byte = (bitloc / 8);
+               int bit  = (bitloc % 8);
 
                /* if the bit is not set, set it and remember we did that */
                if (! (filter->data[byte] & (0x01 << bit)))
@@ -536,12 +647,7 @@ bloom_add_value(BloomFilter *filter, uint32 value, bool *updated)
                }
 
                /* next bit */
-               h += d++;
-               if (h >= filter->nbits)
-                       h -= filter->nbits;
-
-               if (d == filter->nbits)
-                       d = 0;
+               part_boundary += partlen;
        }
 
        return filter;
@@ -571,6 +677,7 @@ bloom_switch_to_hashing(BloomFilter *filter)
 
        newfilter->nhashes = filter->nhashes;
        newfilter->nbits = filter->nbits;
+       memcpy(newfilter->partlens, filter->partlens, filter->nhashes * sizeof(uint16));
        newfilter->flags |= BLOOM_FLAG_PHASE_HASH;
 
        SET_VARSIZE(newfilter, len);
@@ -595,7 +702,7 @@ static bool
 bloom_contains_value(BloomFilter *filter, uint32 value)
 {
        int             i;
-       uint32  big_h, h, d;
+       int             part_boundary = 0;
 
        Assert(filter);
 
@@ -624,28 +731,23 @@ bloom_contains_value(BloomFilter *filter, uint32 value)
        /* now the regular hashing mode */
        Assert(BLOOM_IS_HASHED(filter));
 
-       big_h = ((uint32) DatumGetInt64(hash_uint32(value)));
-
-       h = big_h % filter->nbits;
-       d = big_h % (filter->nbits - 1);
-
        /* compute the requested number of hashes */
        for (i = 0; i < filter->nhashes; i++)
        {
-               int byte = (h / 8);
-               int bit  = (h % 8);
+               int partlen = filter->partlens[i];
+               int bitloc = part_boundary + (value % partlen);
+
+               Assert(bitloc < filter->nbits);
+
+               int byte = (bitloc / 8);
+               int bit  = (bitloc % 8);
 
                /* if the bit is not set, the value is not there */
                if (! (filter->data[byte] & (0x01 << bit)))
                        return false;
 
                /* next bit */
-               h += d++;
-               if (h >= filter->nbits)
-                       h -= filter->nbits;
-
-               if (d == filter->nbits)
-                       d = 0;
+               part_boundary += partlen;
        }
 
        /* all hashes found in bloom filter */
@@ -1065,8 +1167,14 @@ brin_bloom_summary_out(PG_FUNCTION_ARGS)
 
        if (BLOOM_IS_HASHED(filter))
        {
-               appendStringInfo(&str, "mode: hashed  nhashes: %u  nbits: %u  nbits_set: %u",
+               appendStringInfo(&str,
+                                                "mode: hashed  nhashes: %u  nbits: %u  nbits_set: %u  partition lengths:  [",
                                                 filter->nhashes, filter->nbits, filter->nbits_set);
+               for (int i = 0; i < filter->nhashes - 1; i++)
+               {
+                       appendStringInfo(&str, "%u, ", filter->partlens[i]);
+               }
+               appendStringInfo(&str, "%u]", filter->partlens[filter->nhashes - 1]);
        }
        else
        {
