Re: Selectivity and row count estimates for JSONB columns

2021-11-03 Thread Tom Lane
Joel Perren  writes:
> - generic_restriction_selectivity() returns the default value (0.01) for
> data types that Postgres doesn't collect standard MCV and/or histogram
> statistics for. I think this is what happens with Table B which (quite
> correctly) does not have these statistics in pg_stats

There's nothing "correct" about that.  JSONB does have sorting support
(admittedly with a pretty arbitrary sort order), so I'd expect ANALYZE
to collect a histogram as well as MCV values for it.  Perhaps for your
Table B it's omitting the histogram because the MCV list captures the
entire contents of the column?  Or you've got auto-analyze disabled
for some reason?
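
A quick way to check what ANALYZE actually stored is to query the pg_stats
view directly (a sketch; the table and column names below are placeholders
for your own):

SELECT tablename, attname, n_distinct,
       most_common_vals IS NOT NULL AS has_mcv,
       histogram_bounds IS NOT NULL AS has_histogram
FROM pg_stats
WHERE tablename IN ('table_a', 'table_b')
  AND attname = 'properties';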

Anyway, assuming you're using v14, the idea of matchingsel() is to
apply the given restriction clause to all the MCV and histogram
entries to see how many satisfy it [1].  In principle this'll produce
a far better estimate than any fixed default could provide.  If
you're coming out with a crappy estimate, you might be able to
improve matters by increasing the column's statistics target so
that more MCV and histogram entries are collected.
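
For example, to raise the target for just that column (the default is 100;
the table name here is illustrative) and then re-gather statistics:

ALTER TABLE table_b ALTER COLUMN properties SET STATISTICS 1000;
ANALYZE table_b;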

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/12237.1582833074%40sss.pgh.pa.us




Re: Selectivity and row count estimates for JSONB columns

2021-10-20 Thread Joel Perren
After some further investigation I think I might have actually got this the
wrong way round. I now strongly suspect that what I am seeing is related to
the presence of histogram_bounds for Table A (possibly a bug? or me being
silly).

Looking into the code
(https://github.com/postgres/postgres/blob/8c1144ba73478b818d9cebe8ecd64a14b7d45bde/src/backend/utils/adt/selfuncs.c):
- The matchingsel() function calls the generic_restriction_selectivity()
function
- generic_restriction_selectivity() returns the default value (0.01) for
data types that Postgres doesn't collect standard MCV and/or histogram
statistics for. I think this is what happens with Table B which (quite
correctly) does not have these statistics in pg_stats
- Table A does have histogram_bounds for some reason, so I suspect it gets
further through the function, operating on garbage, to this line:

if (selec < 0.0001)
    selec = 0.0001;

which ultimately ends up getting returned.

Therefore, Table B gets the 'correct' default value of 0.01 (which is
actually 1%, silly me). Table A meanwhile gets an erroneous value of 0.0001
(0.01%).

So my question now is: why is Postgres creating histogram_bounds on some
jsonb columns but not others? I've looked at my CREATE TABLE scripts and
they are identical. We have the database deployed in another environment,
and the issue recurs there too.
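
One thing worth checking (a sketch; table and column names are
placeholders): ANALYZE only builds a histogram from sampled values that
didn't make it into the MCV list, so if the MCV frequencies for one column
already sum to roughly 1.0, there may simply be nothing left over to put in
a histogram.

SELECT tablename, n_distinct,
       (SELECT sum(f) FROM unnest(most_common_freqs) AS f) AS mcv_freq_sum,
       histogram_bounds IS NOT NULL AS has_histogram
FROM pg_stats
WHERE tablename IN ('table_a', 'table_b')
  AND attname = 'properties';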

Any help would be appreciated :)

Thanks
Joel



Selectivity and row count estimates for JSONB columns

2021-10-20 Thread Joel Perren
Hi all

I have a Postgres/PostGIS database with two separate database tables each
of which has the following column structure:

- identifier (text)
- geometry (geometry(GeometryZ, 27700))
- properties (jsonb)

I have created a GIN index (jsonb_path_ops) over the properties column on
both tables.
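
i.e. something along these lines (table and index names are illustrative):

CREATE INDEX table_a_properties_idx
    ON table_a USING gin (properties jsonb_path_ops);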

As I understand it, Postgres is unable to calculate statistics on the
contents of JSONB columns and so should use hard-coded estimates of
selectivity when planning queries. However, despite both tables having
identical structures, similar row counts, identical indexes, and both
having been recently ANALYZEd, I am finding that the query planner is
producing different row count estimates between the tables.

To expand: both tables have a 'description' field within their JSONB
columns. I am executing the following query:

EXPLAIN
SELECT identifier, geometry, properties FROM table
WHERE properties @@ '$.description == "test"'
ORDER BY identifier;

I am expecting the selectivity value used for such queries to be 0.010,
which is the default result of the matchingsel selectivity function that
the @@ operator uses when operating on jsonb.

For both tables, the planner opts for a Bitmap Index Scan -> Sort -> Gather
Merge. However, the estimated number of rows returned from the Index Scan
node differs substantially.

- Table A (1,611,752 rows): 159 estimated rows (this is roughly 0.01% and
makes sense);
- Table B (1,656,110 rows): 16566 estimated rows (roughly 1% - why??)
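
Spelling out the arithmetic behind those percentages (estimated rows
divided by the table row counts):

SELECT round(159.0 / 1611752, 6)   AS table_a_selectivity,  -- 0.000099, i.e. ~0.01%
       round(16566.0 / 1656110, 6) AS table_b_selectivity;  -- 0.010003, i.e. ~1%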

This difference is causing the planner to come up with some strange plans
for queries on Table B which it does not do for Table A. I know that both
estimates are just that and that neither is 'better', but I would really
prefer it to be consistent and use the default selectivity of 0.01 for
both tables, as it should be doing.

I know this is getting long now, so apologies. But one thing I did notice,
and wondered if it could be related, is that the pg_stats view shows values
for histogram_bounds and correlation for Table A, but nulls for these
columns for Table B.

Any ideas?

Thanks
Joel