Experiments are here if you would like to try them.

https://github.com/postgis/postgis/pull/713

On Wed, Dec 7, 2022 at 2:22 PM Paul Ramsey <pram...@cleverelephant.ca> wrote:
>
> So, this is rather hard to explain without pictures, but you have to first of 
> all recognize that the selectivity estimator is an "estimator", so you cannot 
> construct synthetic situations and expect things to match exactly. You can 
> only expect them to be better or worse.
>
> In particular, things like "how many points are selected by this box" are a 
> little fraught.
>
> First, understand that underlying the spatial stats system is (in the 2D 
> case) a simple 2D grid. The database draws a sample and fills in the counts 
> of features lying in each grid cell. Simple, yes? Except that spatial data 
> sets can get quite big. There's definitely a point at which you don't want 
> to make the grid any larger, for fear of making it too expensive to work 
> with. Right now we tend towards grids with about 20K cells at the highest 
> end, and very few at the lowest end.
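>
> (If you want to peek at the grid that ANALYZE actually built for a column, 
> there is an internal debugging function for that; signature from memory, so 
> treat it as approximate:
>
> ANALYZE test;
> SELECT _postgis_stats('test'::regclass, 'p');
>
> The text it returns should include the grid dimensions and the per-cell 
> counts.)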
>
> As the code currently stands, it's possible for very small input tables to 
> generate grids of 1x1 cell... which is what your case generates with the 
> current code. In that case, the selectivity will end up being the proportion 
> of the cell that the query box overlaps. So you can get (as you did) a 
> selectivity of about 5% when the query box neatly covered 80% of the input 
> rows.
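>
> (Rough numbers for your data, ignoring any padding the estimator puts around 
> the sample extent: the single cell spans roughly x in [1,51] by y in [1,2], 
> area 50; your first query box overlaps it over x in [1,3] by y in [1,2], 
> area 2; and 2/50 = 0.04, which is about the 5% you saw.)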
>
> To an approximation, this can be improved by tweaking the analysis code to 
> increase the number of cells in the histogram for small tables. However, the 
> selectivity calculations are still necessarily proportional to the area of 
> cell overlapped, and in your example data, for, say, a 2x2 grid, which does 
> capture the outlying point in its own cell, the cell boundary will be halfway 
> between the extrema, so your first small query box will only pick up a 
> selectivity of 0.2. Better than 0.05! But still not really right.
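>
> (You can also ask the stats system directly what selectivity it would assign 
> to a given query box, without running the query; again an internal debugging 
> function, signature from memory:
>
> SELECT _postgis_selectivity('test'::regclass, 'p',
>        ST_GeomFromText('POLYGON((0 0,3 0,3 3,0 3,0 0))'));
>
> Handy for comparing the numbers before and after the tweaks in the pull 
> request above.)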
>
> The reasoning behind the shrinking dimensions will strike you as odd: why 
> not just a minimum 100x100 grid? Well, that goes down to our (my) philosophy 
> of trying to extract as much analytical power out of each cell as possible. 
> Basically, your data is quite variable on the X axis and not really variable 
> on the Y axis... should we allocate as many cells in X in our stats grid as 
> we do in Y? If you answer "no", then you're off down the rabbit hole with me: 
> for each dimension, statistically figure out how much variability there is in 
> coverage, and then parcel out the cell budget so that the most variable 
> dimensions get the most cells and the homogeneous dimensions get fewer. This 
> means that, in the limit, with a data set that is quite variable in X and 
> quite homogeneous in Y, you could end up with a 1x20000 grid, with lots of 
> resolution in the axis we care about and none in the homogeneous axis.
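>
> (Very roughly: each dimension d gets a variability score v_d from the sample, 
> and the cell count in dimension d is allocated more or less in proportion to 
> v_d, subject to the product of the per-dimension counts staying under the 
> overall cell budget of ~20K. That's a caricature of the actual code, not a 
> spec.)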
>
> Very clever, but lots of moving parts, and in the limit, as the number of 
> incoming records goes down, it's not going to yield estimates that are as 
> close to "reality" as we might like. But in return it extracts more precise 
> estimates in the case where the number of rows is very, very large.
>
> At least in theory. One possibility might be to push up the current maximum 
> number of cells, and use some of that budget to put a higher floor on the 
> minimum number of cells per dimension, so that the minimum isn't 1, as it is 
> currently, but something more like 10, to cut down on the kind of divergence 
> you're seeing in your small-table case. I have some tweaks in hand that do 
> that already, but now that I'm into the code I'm trying to dream up ways to 
> cut it back while still maintaining a certain measure of dynamism.
>
> How much operational difference does the mis-estimate create? For a small 
> table, I really wonder. Because the number of tuples will always be small, 
> it's not like you're likely to be dropped into an index scan by accident, 
> even if the selectivity is coming back too high.
>
> I am way more interested in cases with larger numbers of rows generating 
> mis-estimates. I assume that my code might also be vulnerable to tables with 
> a lot of duplication, and also tables of points versus tables of polygons / 
> lines.
>
> ATB,
> P
>
> > On Dec 6, 2022, at 6:06 PM, Regina Obe <l...@pcorp.us> wrote:
> >
> > Correction.  I was mistaken, I tested the wrong example for rows=1.
> >
> > My ST_Contains returns the same answer as &&
> >
> > explain analyze select * from test where 
> > ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> > time=0.025..0.027 rows=1 loops=1)
> >   Filter: 
> > st_contains('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry,
> >  p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.121 ms
> > Execution Time: 0.047 ms
> >
> > So the fact that it works for me and not for you might just be dumb luck, 
> > in that I ran the && first and the stats are cached and being used.
> >
> > The difference in answers might also be because you modified your 
> > default_statistics_target.
> >
> > SHOW default_statistics_target;
> >
> > Mine shows 100.
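> >
> > If you want more detailed stats for just the geometry column rather than 
> > globally, something like this should also work; it's standard PostgreSQL, 
> > not PostGIS-specific:
> >
> > ALTER TABLE test ALTER COLUMN p SET STATISTICS 1000;
> > ANALYZE test;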
> >
> > Thanks,
> > Regina
> >
> >
> > From: Regina Obe [mailto:l...@pcorp.us]
> > Sent: Tuesday, December 6, 2022 8:59 PM
> > To: 'PostGIS Users Discussion' <postgis-users@lists.osgeo.org>
> > Subject: RE: [postgis-users] Row estimations
> >
> > Igor,
> >
> > The stats estimation for PostGIS has never been the greatest.  As I 
> > understand it, it’s more of a limitation of PostgreSQL.
> > It’s been getting better over the years though.  Even in the best-case 
> > scenario only the bounding-box estimation happens, and Paul may be better 
> > able to answer what happens with functions such as ST_Contains (which use 
> > the function instrumentation).
> >
> > So a better test to confirm your stats are working correctly is an && test, 
> > which I have below.
> >
> > I’m running  PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit 
> > POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1" 
> > SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 
> > 3.4.3, released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" 
> > LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
> >
> > explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 
> > -0,3 3,0 3,0 0))') &&  p;
> >
> >
> > which gives me the expected answer of:
> >
> > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual 
> > time=0.019..0.024 rows=4 loops=1)
> >   Filter: 
> > ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry
> >  && p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.131 ms
> > Execution Time: 0.046 ms
> >
> > And 4 rows are indeed returned for me.
> >
> > explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 
> > 0,52 2,50 2,50 0))') && p;
> >
> > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual 
> > time=0.022..0.024 rows=1 loops=1)
> >   Filter: 
> > ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry
> >  && p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.124 ms
> > Execution Time: 0.047 ms
> >
> > I confirm that I get the same answer as below when doing
> >
> > explain analyze select * from test where 
> > ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
> >
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> > time=0.026..0.038 rows=4 loops=1)
> >   Filter: 
> > st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry,
> >  p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.132 ms
> > Execution Time: 0.061 ms
> >
> >
> > So it does seem odd to me that ST_Contains would give a higher estimate 
> > than the && operation.  That could be a bug in the function instrumentation 
> > of  ST_Contains and ST_Intersects.
> >
> >
> > Thanks,
> > Regina
> >
> > From: postgis-users [mailto:postgis-users-boun...@lists.osgeo.org] On 
> > Behalf Of Igor ALBUQUERQUE SILVA
> > Sent: Thursday, December 1, 2022 4:37 AM
> > To: postgis-users@lists.osgeo.org
> > Subject: [postgis-users] Row estimations
> >
> > Hello everyone,
> >
> > I have a question regarding the row estimations/gist indexes. Here's a 
> > minimal reproduction of it:
> >
> > create table test(p geometry(point));
> > insert into test(p) values (st_makepoint(1,1));
> > insert into test(p) values (st_makepoint(1,2));
> > insert into test(p) values (st_makepoint(2,1));
> > insert into test(p) values (st_makepoint(2,2));
> > insert into test(p) values (st_makepoint(51,1));
> > analyze test;
> > explain analyze select * from test where 
> > ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> > explain analyze select * from test where 
> > ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > The two queries get the same cost/row estimation, of 1 row. This is the 
> > EXPLAIN ANALYZE of the first query:
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual 
> > time=0.015..0.022 rows=4 loops=1)
> >    Filter: 
> > st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry,
> >  p)
> >    Rows Removed by Filter: 1
> >  Planning Time: 0.072 ms
> >  Execution Time: 0.035 ms
> > (5 rows)
> >
> > What I was expecting is the first query to estimate 4 rows and the second 
> > to estimate 1, like what I get If I try the same thing using integers.
> >
> > create table test(x integer, y integer);
> > insert into test(x, y) values (0, 0);
> > insert into test(x, y) values (0, 1);
> > insert into test(x, y) values (1, 0);
> > insert into test(x, y) values (1, 1);
> > insert into test(x, y) values (50, 0);
> > analyze test;
> > explain analyze select * from test where x between 0 and 1 and y between 0 
> > and 1;
> > explain analyze select * from test where x between 50 and 51 and y between 
> > 0 and 1;
> >
> > My question is: is this expected behaviour? I actually have a much larger 
> > table with a GiST index where I found this occurring, and it causes the 
> > planner to make bad decisions: every query that I do gets the same 
> > estimation, and whenever this estimation is very wrong, the planner does 
> > not make the optimal decision when it has to compare against other indexes' 
> > costs in a more complicated query.
> >
> > I'm using the official docker image, PostgreSQL 15.1 (Debian 
> > 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 
> > 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" 
> > GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" 
> > LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
> >
> > Best regards,
> > Igor
>
_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users
