Re: PoC/WIP: Extended statistics on expressions

Tomas Vondra Wed, 24 Mar 2021 09:39:01 -0700


On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
>


Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +       Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like 
> a
> "schema-containing" table.  Say "Name of the schema which contains the table" 
> ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +       Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +       Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +       Owner of the extended statistics
> 
> ..object
> 

OK

>> +       Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +       of random access to the disk.  (This expression is null if the 
>> expression
>> +       data type does not have a <literal>&lt;</literal> operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, 
>> so it
> 
> the relationship
> 

OK

>> +   assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> +   conditions are independent, and multiplies their selectivities together 
>> to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for 
>> the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, 
>> the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +                    if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> +     * expressions. It's either expensive or very easy to defeat for
>> +     * determined used, and there's no risk if we allow such statistics (the
>> +     * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> +     * If there are no simply-referenced columns, give the statistics an 
>> auto
>> +     * dependency on the whole table.  In most cases, this will be 
>> redundant,
>> +     * but it might not be if the statistics expressions contain no Vars
>> +     * (which might seem strange but possible).
>> +     */
>> +    if (!nattnums)
>> +    {
>> +            ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +            recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> +    }
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior copied from index_create, so
I'd prefer keeping it the same for consistency reason. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> +     * Translate the array of indexs to regular attnums for the dependency 
>> (we
> 
> sp: indexes
> 

OK

>> +                                     * Not found a matching expression, so 
>> we can simply skip
> 
> Found no matching expr
> 

OK

>> +                            /* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using line number, for example.

>> +    stats->anl_context = CurrentMemoryContext;      /* XXX should be using
>> +                                                                            
>>                  * something else? */
> 
>> +            bool            nulls[Natts_pg_statistic];
> ...
>> +             * Construct a new pg_statistic tuple
>> +             */
>> +            for (i = 0; i < Natts_pg_statistic; ++i)
>> +            {
>> +                    nulls[i] = false;
>> +            }
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what update_attstats() does, so I prefer
keeping it the same.

>> +                             * We don't store collations used to build the 
>> statistics, but
>> +                             * we can use the collation for the attribute 
>> itself, as
>> +                             * stored in varcollid. We do reset the 
>> statistics after a
>> +                             * type change (including collation change), so 
>> this is OK. We
>> +                             * may need to relax this after allowing 
>> extended statistics
>> +                             * on expressions.
> 
> This text should be updated or removed ?
> 

Yeah, the last sentence is obsolete. Updated.

>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>>              }
>>  
>>              /* print any extended statistics */
>> -            if (pset.sversion >= 100000)
>> +            if (pset.sversion >= 140000)
>> +            {
>> +                    printfPQExpBuffer(&buf,
>> +                                                      "SELECT oid, "
>> +                                                      
>> "stxrelid::pg_catalog.regclass, "
>> +                                                      
>> "stxnamespace::pg_catalog.regnamespace AS nsp, "
>> +                                                      "stxname,\n"
>> +                                                      
>> "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
>> +                                                      "  'd' = any(stxkind) 
>> AS ndist_enabled,\n"
>> +                                                      "  'f' = any(stxkind) 
>> AS deps_enabled,\n"
>> +                                                      "  'm' = any(stxkind) 
>> AS mcv_enabled,\n");
>> +
>> +                    if (pset.sversion >= 130000)
>> +                            appendPQExpBufferStr(&buf, "  stxstattarget\n");
>> +                    else
>> +                            appendPQExpBufferStr(&buf, "  -1 AS 
>> stxstattarget\n");
> 
>  >= 130000 is fully determined by >= 14000 :)
> 

Ah, right.

>> +     * type of the opclass, which is not interesting for our purposes.  
>> (Note:
>> +     * if we did anything with non-expression index columns, we'd need to
> 
> index is wrong ?
> 

Fixed

> I mentioned a bunch of other references to "index" and "predicate" which are
> still around:
> 

Whooops, sorry. Fixed.


I'll post a cleaned-up version of the patch addressing Dean's review
comments too.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PoC/WIP: Extended statistics on expressions

Reply via email to