Experiments are here if you would like to try them. https://github.com/postgis/postgis/pull/713
On Wed, Dec 7, 2022 at 2:22 PM Paul Ramsey <pram...@cleverelephant.ca> wrote:
>
> So, this is rather hard to explain without pictures, but you have to first of all recognize that the selectivity estimator is an "estimator", so you cannot construct synthetic situations and expect things to match exactly. You can just expect them to be worse or better.
>
> In particular, things like "how many points are selected by this box" are a little fraught.
>
> First, understand that underlying the spatial stats system is (in the 2d case) a simple 2d grid. The database draws a sample and fills in the counts of features lying in each grid cell. Simple, yes? Except that spatial data sets can get quite big. There's definitely a point at which you don't want to make the grid any larger, for fear of making working with it too expensive. Right now we tend towards grids with about 20K cells at the highest end, and very few at the lowest end.
>
> As the code currently stands, it's possible for very small input tables to generate grids of 1x1 cell... which is what your case generates with the current code. In that case, the selectivity will end up being the proportion of the cell that the query box overlaps. So you can get (as you did) a selectivity of about 5% when the query box neatly covered 80% of the input rows.
>
> To an approximation, this can be improved by tweaking the analysis code to increase the number of cells in the histogram for small tables. However, the selectivity calculations are still necessarily proportional to the area of cell overlapped, and in your example data, for, say, a 2x2 grid, which does capture the outlying point in its own cell, the cell boundary will be halfway between the extrema, so your first small query box will only pick up a selectivity of 0.2. Better than 0.05! But still not really right.
>
> The reasoning behind having the shrinking dimensions will strike you as odd: why not just a minimum 100x100 grid? Well, that goes down to our (my) philosophy of trying to extract as much analytical power out of each cell as possible. Basically, your data is quite variable on the X axis and not really variable on the Y axis... should we allocate as many cells in X in our stats grid as we do in Y? If you answer "no", then you're off down the rabbit hole with me: for each dimension, statistically figure out how much variability there is in coverage, and then parcel out grid cells in each dimension so as to allocate the budget to the most variable dimensions, and less to homogeneous dimensions. This means that, in the limit, with a data set that is quite variable in X and quite homogeneous in Y, you could end up with a 1x20000 grid, with lots of resolution in the axis we care about and none in the homogeneous axis.
>
> Very clever, but lots of moving parts, and in the limit as the number of incoming records goes down, not going to yield estimates that are as close to "reality" as we might like. But in return it extracts more precise estimates in the case where the number of rows is very, very large.
>
> At least in theory. One possibility might be to push up the current maximum number of cells, and use some of that budget to put a higher floor on the minimum number of cells per dimension, so that the minimum isn't 1, as currently, but something more like 10, to cut down on the kind of divergence you're seeing in your small-table case.
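To make the 1x1-cell arithmetic concrete: for the five example points the sampled extent is roughly X in [1,51] and Y in [1,2], so a single-cell histogram credits the (0,0)-(3,3) query box with a selectivity of about (3-1)/(51-1) = 0.04, which is the "about 5%" figure above. A minimal sketch for poking at this directly; _postgis_stats and _postgis_selectivity are undocumented debug helpers shipped with PostGIS (3.x series), so their presence and exact signatures are an assumption here:

-- Dump the spatial histogram ANALYZE built for test.p (debug helper, may vary by version)
SELECT _postgis_stats('test', 'p');

-- Ask the estimator directly what selectivity it assigns to a query box (debug helper)
SELECT _postgis_selectivity('test', 'p', ST_MakeEnvelope(0, 0, 3, 3));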
> I have some tweaks in hand that do that already, but now that I'm into the code I'm trying to dream up ways to cut it back while still maintaining a certain measure of dynamism.
>
> How much operational difference does the mis-estimate create? For a small table, I really wonder. Because the number of tuples will always be small, it's not like you're likely to be dropped into an index scan by accident, even if the selectivity is coming back too high.
>
> I am way more interested in cases with larger numbers of rows generating mis-estimates. I assume that my code might also be vulnerable to tables with a lot of duplication, and also to tables of points versus tables of polygons / lines.
>
> ATB,
> P
>
> > On Dec 6, 2022, at 6:06 PM, Regina Obe <l...@pcorp.us> wrote:
> >
> > Correction: I was mistaken, I tested the wrong example for rows=1.
> >
> > My ST_Contains returns the same answer as &&:
> >
> > explain analyze select * from test where
> > ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=1)
> >   Filter: st_contains('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry, p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.121 ms
> > Execution Time: 0.047 ms
> >
> > So the fact that this works for me and not for you might just be dumb luck: I ran the && query first, and the stats are cached and being used.
> >
> > The difference in answer might also be because you modified your default_statistics_target.
> >
> > SHOW default_statistics_target;
> >
> > Mine shows 100.
> >
> > Thanks,
> > Regina
> >
> >
> > From: Regina Obe [mailto:l...@pcorp.us]
> > Sent: Tuesday, December 6, 2022 8:59 PM
> > To: 'PostGIS Users Discussion' <postgis-users@lists.osgeo.org>
> > Subject: RE: [postgis-users] Row estimations
> >
> > Igor,
> >
> > The stats estimation for PostGIS has never been the greatest. As I understand it, it's more of a limitation of PostgreSQL. It's been getting better over the years, though. Even in the best-case scenario only the bounding-box estimation would happen, and Paul may be better able to answer what happens with functions such as ST_Contains (which use the function instrumentation).
> >
> > So a better test to confirm your stats are working correctly is an && test, which I have below.
> >
> > I'm running PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit
> > POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1" SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3, released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
> >
> > explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') && p;
> >
> > which gives me the expected answer of:
> >
> > Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024 rows=4 loops=1)
> >   Filter: ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry && p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.131 ms
> > Execution Time: 0.046 ms
> >
> > And 4 rows are indeed returned for me.
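A quick way to test the default_statistics_target theory without changing the global setting is a per-column override, which is stock PostgreSQL: raising the target enlarges the sample ANALYZE draws (300 rows per target point), which is also the sample the spatial analyzer works from. The value 1000 below is just an arbitrary choice for illustration:

SHOW default_statistics_target;                       -- 100 is the PostgreSQL default
ALTER TABLE test ALTER COLUMN p SET STATISTICS 1000;  -- per-column override
ANALYZE test;                                         -- rebuild the stats from the larger sample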
> > explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))') && p;
> >
> > Seq Scan on test (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)
> >   Filter: ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry && p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.124 ms
> > Execution Time: 0.047 ms
> >
> > I confirm that I get the same answer as below when doing:
> >
> > explain analyze select * from test where
> > ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> >
> > Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.026..0.038 rows=4 loops=1)
> >   Filter: st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.132 ms
> > Execution Time: 0.061 ms
> >
> > So it does seem odd to me that ST_Contains would give a higher cost estimate than the && operation. That could be a bug in the function instrumentation of ST_Contains and ST_Intersects.
> >
> > Thanks,
> > Regina
> >
> > From: postgis-users [mailto:postgis-users-boun...@lists.osgeo.org] On Behalf Of Igor ALBUQUERQUE SILVA
> > Sent: Thursday, December 1, 2022 4:37 AM
> > To: postgis-users@lists.osgeo.org
> > Subject: [postgis-users] Row estimations
> >
> > Hello everyone,
> >
> > I have a question regarding the row estimations/GiST indexes. Here's a minimal reproduction of it:
> >
> > create table test(p geometry(point));
> > insert into test(p) values (st_makepoint(1,1));
> > insert into test(p) values (st_makepoint(1,2));
> > insert into test(p) values (st_makepoint(2,1));
> > insert into test(p) values (st_makepoint(2,2));
> > insert into test(p) values (st_makepoint(51,1));
> > analyze test;
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > The two queries get the same cost/row estimation of 1 row. This is the EXPLAIN ANALYZE of the first query:
> >
> > Seq Scan on test (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022 rows=4 loops=1)
> >   Filter: st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry, p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.072 ms
> > Execution Time: 0.035 ms
> > (5 rows)
> >
> > What I was expecting is the first query to estimate 4 rows and the second to estimate 1, like what I get if I try the same thing using integers.
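One place to look for the instrumentation Regina mentions: PostGIS 3 attaches a PostgreSQL planner support function to functions such as ST_Contains and ST_Intersects, which is the hook that lets a bare function call participate in index planning and row estimation. Whether it is wired up on a given install can be checked from the catalogs; this is a sketch, assuming PostgreSQL 12+ (where pg_proc.prosupport exists), and the PostGIS support-function name in the comment is taken from the 3.x source:

-- prosupport should name the PostGIS support function (postgis_index_supportfn in 3.x)
SELECT proname, prosupport
FROM pg_proc
WHERE proname IN ('st_contains', 'st_intersects');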
> > create table test(x integer, y integer);
> > insert into test(x, y) values (0, 0);
> > insert into test(x, y) values (0, 1);
> > insert into test(x, y) values (1, 0);
> > insert into test(x, y) values (1, 1);
> > insert into test(x, y) values (50, 0);
> > analyze test;
> > explain analyze select * from test where x between 0 and 1 and y between 0 and 1;
> > explain analyze select * from test where x between 50 and 51 and y between 0 and 1;
> >
> > My question is: is this expected behaviour? I actually have a much larger table with a GiST index where I found this occurring, and it causes the planner to make bad decisions: every query I run gets the same estimation, and whenever that estimation is very wrong, the planner does not make the optimal decision when it has to weigh the costs of other indexes in a more complicated query.
> >
> > I'm using the official Docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
> >
> > Best regards,
> > Igor
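The integer version estimates well because ANALYZE stores a per-column MCV list and histogram for x and for y, which the planner combines per predicate; a single geometry column gets no such per-axis breakdown from the standard stats, which is the gap the grid histogram described above fills. The standard slots are visible through a stock PostgreSQL catalog view (a sketch against the integer table; with only five rows the histogram slot may simply be empty):

SELECT attname, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'test';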