On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:

>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)

and

   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by six
orders of magnitude.


Hmm, that's a good example. That definitely suggests that we should be
trying to minimise the relative error, but also perhaps that what we
should be looking at is actually just the ratio freq / base_freq,
rather than their difference.


Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistic target
values). The 0002 part picks the items based on

  Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it needs
more testing. I've also tried using

  freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 430ee6e6739d5f8fb2e25657f0196a5c413c2021 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sat, 22 Jun 2019 13:09:42 +0200
Subject: [PATCH 1/2] fix mcv build perf issue

---
 src/backend/statistics/mcv.c | 114 ++++++++++++++++++++++++++++++++---
 1 file changed, 106 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..04a4f17b01 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -78,6 +78,9 @@ static MultiSortSupport build_mss(VacAttrStats **stats, int 
numattrs);
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
                                                                           
MultiSortSupport mss, int *ndistinct);
 
+static SortItem **build_column_frequencies(SortItem *groups, int ngroups,
+                                                                               
   MultiSortSupport mss, int *ncounts);
+
 static int     count_distinct_groups(int numrows, SortItem *items,
                                                                  
MultiSortSupport mss);
 
@@ -172,6 +175,8 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
        SortItem   *groups;
        MCVList    *mcvlist = NULL;
        MultiSortSupport mss;
+       SortItem  **freqs;
+       int                *nfreqs;
 
        attnums = build_attnums_array(attrs, &numattrs);
 
@@ -188,6 +193,10 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
        /* transform the sorted rows into groups (sorted by frequency) */
        groups = build_distinct_groups(nitems, items, mss, &ngroups);
 
+       /* compute frequencies for values in each column */
+       nfreqs = (int *) palloc0(sizeof(int) * numattrs);
+       freqs = build_column_frequencies(groups, ngroups, mss, nfreqs);
+
        /*
         * Maximum number of MCV items to store, based on the attribute with the
         * largest stats target (and the number of groups we have available).
@@ -242,6 +251,16 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
        if (nitems > 0)
        {
                int                     j;
+               SortItem        key;
+               MultiSortSupport        tmp;
+
+               /* used to search values */
+               tmp = (MultiSortSupport) palloc(offsetof(MultiSortSupportData, 
ssup)
+                                                                               
+ sizeof(SortSupportData));
+
+               /* space for search key */
+               key.values = palloc(sizeof(Datum));
+               key.isnull = palloc(sizeof(bool));
 
                /*
                 * Allocate the MCV list structure, set the global parameters.
@@ -281,22 +300,28 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
                        item->base_frequency = 1.0;
                        for (j = 0; j < numattrs; j++)
                        {
-                               int                     count = 0;
-                               int                     k;
+                               SortItem   *freq;
 
-                               for (k = 0; k < ngroups; k++)
-                               {
-                                       if (multi_sort_compare_dim(j, 
&groups[i], &groups[k], mss) == 0)
-                                               count += groups[k].count;
-                               }
+                               /* single dimension */
+                               tmp->ndims = 1;
+                               tmp->ssup[0] = mss->ssup[j];
 
-                               item->base_frequency *= (double) count / 
numrows;
+                               /* fill search key */
+                               key.values[0] = groups[i].values[j];
+                               key.isnull[0] = groups[i].isnull[j];
+
+                               freq = (SortItem *) bsearch_arg(&key, freqs[j], 
nfreqs[j],
+                                                                               
                sizeof(SortItem),
+                                                                               
                multi_sort_compare, tmp);
+
+                               item->base_frequency *= ((double) freq->count) 
/ numrows;
                        }
                }
        }
 
        pfree(items);
        pfree(groups);
+       pfree(freqs);
 
        return mcvlist;
 }
@@ -419,6 +444,79 @@ build_distinct_groups(int numrows, SortItem *items, 
MultiSortSupport mss,
        return groups;
 }
 
+/* compare sort items (with a single datum) */
+static int
+sort_item_compare(const void *a, const void *b, void *arg)
+{
+       SortSupport     ssup = (SortSupport) arg;
+       SortItem   *ia = (SortItem *) a;
+       SortItem   *ib = (SortItem *) b;
+
+       return ApplySortComparator(ia->values[0], ia->isnull[0],
+                                                          ib->values[0], 
ib->isnull[0],
+                                                          ssup);
+}
+
+static SortItem **
+build_column_frequencies(SortItem *groups, int ngroups, MultiSortSupport mss,
+                                                int *ncounts)
+{
+       int                     i,
+                               j;
+       SortItem  **result;
+       char       *ptr;
+
+       /* allocate arrays for all columns as a single chunk */
+       ptr = palloc(MAXALIGN(sizeof(SortItem *) * mss->ndims) +
+                 mss->ndims * MAXALIGN(sizeof(SortItem) * ngroups));
+
+       /* initial array of pointers */
+       result = (SortItem **) ptr;
+       ptr += MAXALIGN(sizeof(SortItem *) * mss->ndims);
+
+       for (i = 0; i < mss->ndims; i++)
+       {
+               SortSupport     ssup = &mss->ssup[i];
+
+               /* array of values for a single column */
+               result[i] = (SortItem *) ptr;
+               ptr += MAXALIGN(sizeof(SortItem) * ngroups);
+
+               /* extract data for the dimension */
+               for (j = 0; j < ngroups; j++)
+               {
+                       result[i][j].values = palloc(sizeof(Datum));
+                       result[i][j].isnull = palloc(sizeof(bool));
+
+                       result[i][j].values[0] = groups[j].values[i];
+
+                       result[i][j].isnull[0] = groups[j].isnull[i];
+                       result[i][j].count = groups[j].count;
+               }
+
+               /* sort the values, deduplicate */
+               qsort_arg((void *) result[i], ngroups, sizeof(SortItem),
+                                 sort_item_compare, ssup);
+
+               ncounts[i] = 1;
+               for (j = 1; j < ngroups; j++)
+               {
+                       if (sort_item_compare(&result[i][(ncounts[i] - 1)], 
&result[i][j], ssup) == 0)
+                       {
+                               result[i][(ncounts[i] - 1)].count += 
result[i][j].count;
+                               continue;
+                       }
+
+                       /* */
+                       if (ncounts[i] != j)
+                               result[i][ncounts[i]] = result[i][j];
+
+                       ncounts[i]++;
+               }
+       }
+
+       return result;
+}
 
 /*
  * statext_mcv_load
-- 
2.20.1

>From b12cab38c3e8ccc2be057d5ef3f3da3d368f258f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sat, 22 Jun 2019 15:37:40 +0200
Subject: [PATCH 2/2] pick MCV items by relative error

---
 src/backend/statistics/mcv.c | 149 +++++++++++++++++++++++++----------
 1 file changed, 109 insertions(+), 40 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 04a4f17b01..3056b659e7 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -78,6 +78,8 @@ static MultiSortSupport build_mss(VacAttrStats **stats, int 
numattrs);
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
                                                                           
MultiSortSupport mss, int *ndistinct);
 
+static int sort_item_compare(const void *a, const void *b, void *arg);
+
 static SortItem **build_column_frequencies(SortItem *groups, int ngroups,
                                                                                
   MultiSortSupport mss, int *ncounts);
 
@@ -147,6 +149,63 @@ get_mincount_for_mcv_list(int samplerows, double totalrows)
        return numer / denom;
 }
 
+/*
+ * Represents a group of values, with both the observed and base frequency
+ * (as expected from product of individual columns).
+ */
+typedef struct SortGroup {
+       Datum  *values;
+       bool   *isnull;
+       double  frequency;
+       double  base_frequency;
+} SortGroup;
+
+/*
+ * compare_relative_error
+ *             compare MCV groups by relative estimation error
+ *
+ * We simply compute relative estimation error
+ *
+ *     Max(estimate/actual, actual/estimate)
+ *
+ * and then use that to pick the most mis-estimated groups.
+ */
+static int
+compare_relative_error(const void *a, const void *b)
+{
+       SortGroup  *sa = (SortGroup     *) a;
+       SortGroup  *sb = (SortGroup     *) b;
+
+       double          ea = Max(sa->frequency / sa->base_frequency,
+                                                sa->base_frequency / 
sa->frequency),
+                               eb = Max(sb->frequency / sb->base_frequency,
+                                                sb->base_frequency / 
sb->frequency);
+/*
+       double          ea = sa->frequency * abs(sa->base_frequency - 
sa->base_frequency),
+                               eb = sb->frequency * abs(sb->base_frequency - 
sb->base_frequency);
+*/
+       if (ea > eb)
+               return -1;
+       else if (ea < eb)
+               return 1;
+
+       return 0;
+}
+
+static int
+compare_frequency(const void *a, const void *b)
+{
+       SortGroup  *sa = (SortGroup     *) a;
+       SortGroup  *sb = (SortGroup     *) b;
+
+       if (sa->frequency > sb->frequency)
+               return -1;
+       else if (sa->frequency < sb->frequency)
+               return 1;
+
+       return 0;
+}
+
 /*
  * Builds MCV list from the set of sampled rows.
  *
@@ -177,6 +236,13 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
        MultiSortSupport mss;
        SortItem  **freqs;
        int                *nfreqs;
+       SortGroup  *sort_groups;
+       int                     nsort_groups;
+       SortItem        key;
+
+               /* space for search key */
+       key.values = palloc(sizeof(Datum));
+       key.isnull = palloc(sizeof(bool));
 
        attnums = build_attnums_array(attrs, &numattrs);
 
@@ -229,20 +295,55 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
         */
        mincount = get_mincount_for_mcv_list(numrows, totalrows);
 
+       nsort_groups = 0;
+       sort_groups = (SortGroup *) palloc(sizeof(SortGroup) * ngroups);
+
        /*
         * Walk the groups until we find the first group with a count below the
         * mincount threshold (the index of that group is the number of groups 
we
-        * want to keep).
+        * will consider to keep).
         */
-       for (i = 0; i < nitems; i++)
+       for (i = 0; i < ngroups; i++)
        {
+               int     j;
+
                if (groups[i].count < mincount)
-               {
-                       nitems = i;
                        break;
+
+               sort_groups[i].values = groups[i].values;
+               sort_groups[i].isnull = groups[i].isnull;
+               sort_groups[i].frequency = (double) groups[i].count / numrows;
+
+               /* base frequency, if the attributes were independent */
+               sort_groups[i].base_frequency = 1.0;
+               for (j = 0; j < numattrs; j++)
+               {
+                       SortItem   *freq;
+
+                       /* fill search key */
+                       key.values[0] = groups[i].values[j];
+                       key.isnull[0] = groups[i].isnull[j];
+
+                       freq = (SortItem *) bsearch_arg(&key, freqs[j], 
nfreqs[j],
+                                                                               
        sizeof(SortItem),
+                                                                               
        sort_item_compare, &mss->ssup[j]);
+
+                       sort_groups[i].base_frequency *= ((double) freq->count) 
/ numrows;
                }
+
+               nsort_groups = i;
        }
 
+       /* sort the groups by relative error */
+       pg_qsort(sort_groups, nsort_groups, sizeof(SortGroup), 
compare_relative_error);
+
+       /* make sure we only consider groups that are frequent enough */
+       if (nitems > nsort_groups)
+               nitems = nsort_groups;
+
+       /* sort the first groups by frequency (descending) */
+       pg_qsort(sort_groups, nitems, sizeof(SortGroup), compare_frequency);
+
        /*
         * At this point we know the number of items for the MCV list. There 
might
         * be none (for uniform distribution with many groups), and in that case
@@ -250,18 +351,6 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
         */
        if (nitems > 0)
        {
-               int                     j;
-               SortItem        key;
-               MultiSortSupport        tmp;
-
-               /* used to search values */
-               tmp = (MultiSortSupport) palloc(offsetof(MultiSortSupportData, 
ssup)
-                                                                               
+ sizeof(SortSupportData));
-
-               /* space for search key */
-               key.values = palloc(sizeof(Datum));
-               key.isnull = palloc(sizeof(bool));
-
                /*
                 * Allocate the MCV list structure, set the global parameters.
                 */
@@ -287,35 +376,15 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset 
*attrs,
                        item->isnull = (bool *) palloc(sizeof(bool) * numattrs);
 
                        /* copy values for the group */
-                       memcpy(item->values, groups[i].values, sizeof(Datum) * 
numattrs);
-                       memcpy(item->isnull, groups[i].isnull, sizeof(bool) * 
numattrs);
+                       memcpy(item->values, sort_groups[i].values, 
sizeof(Datum) * numattrs);
+                       memcpy(item->isnull, sort_groups[i].isnull, 
sizeof(bool) * numattrs);
 
                        /* groups should be sorted by frequency in descending 
order */
                        Assert((i == 0) || (groups[i - 1].count >= 
groups[i].count));
 
                        /* group frequency */
-                       item->frequency = (double) groups[i].count / numrows;
-
-                       /* base frequency, if the attributes were independent */
-                       item->base_frequency = 1.0;
-                       for (j = 0; j < numattrs; j++)
-                       {
-                               SortItem   *freq;
-
-                               /* single dimension */
-                               tmp->ndims = 1;
-                               tmp->ssup[0] = mss->ssup[j];
-
-                               /* fill search key */
-                               key.values[0] = groups[i].values[j];
-                               key.isnull[0] = groups[i].isnull[j];
-
-                               freq = (SortItem *) bsearch_arg(&key, freqs[j], 
nfreqs[j],
-                                                                               
                sizeof(SortItem),
-                                                                               
                multi_sort_compare, tmp);
-
-                               item->base_frequency *= ((double) freq->count) 
/ numrows;
-                       }
+                       item->frequency = sort_groups[i].frequency;
+                       item->base_frequency = sort_groups[i].base_frequency;
                }
        }
 
-- 
2.20.1

Reply via email to