Hi folks, thanks for the e-mails. Unfortunately I did not have access to my e-mail at the weekend. I have, however, created a solution after reading www.postgresql.org/docs/8.3/static/explicit-joins.html - a workaround which might only be required pre-GEOS 3.1 (see the comments at the bottom). The query plan comments may simply not be of interest if 3.1 addresses this issue... anyway, the query plans that my current database comes up with are pretty bad.
set from_collapse_limit = 1;

SELECT possible_geography.report_id
FROM (SELECT gl.report_id, g.the_geom
      FROM gdors_geography g, gdors_geolink gl
      WHERE gl.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
        AND g.gid = gl.geom_id) AS possible_geography,
     (SELECT g.the_geom
      FROM gdors_geography g
      WHERE g.gid = 3) AS selected_geography
WHERE ST_Intersects(possible_geography.the_geom, selected_geography.the_geom);

Nested Loop  (cost=0.00..1448.63 rows=1 width=4) (actual time=93.485..16959.766 rows=81 loops=1)
  Join Filter: ((g.the_geom && g.the_geom) AND _st_intersects(g.the_geom, g.the_geom))
  ->  Index Scan using gdors_geography_pkey on gdors_geography g  (cost=0.00..8.27 rows=1 width=3014) (actual time=0.017..0.020 rows=1 loops=1)
        Index Cond: (gid = 3)
  ->  Nested Loop  (cost=0.00..1436.61 rows=250 width=3018) (actual time=0.037..14.822 rows=275 loops=1)
        ->  Seq Scan on gdors_geolink gl  (cost=0.00..214.00 rows=250 width=8) (actual time=0.019..4.098 rows=275 loops=1)
              Filter: ((temp_report_date >= '2008-01-06'::date) AND (temp_report_date <= '2009-01-06'::date))
        ->  Index Scan using gdors_geography_pkey on gdors_geography g  (cost=0.00..4.88 rows=1 width=3018) (actual time=0.018..0.026 rows=1 loops=275)
              Index Cond: (g.gid = gl.geom_id)
Total runtime: 16960.739 ms

(An index on the geolink date field decreased the time by 0.5 seconds.) Not bad for 81 matches! Basically (as I'm sure you know), from_collapse_limit=1 prevents the subqueries from being flattened into one bigger query, so the planner literally executes the query as I wrote it. Can anyone write an explicit join that produces the same plan (see the sketch below)? That would remove the need to change from_collapse_limit. Don't worry about this if GEOS 3.1 solves the underlying problem, though!

For those following, I wanted to clarify a couple of things. First, I have loaded the ESRI world dataset, which is only 18MiB, so I/O limitations are unlikely to be the problem given a correct query plan. (Outside the scope of the issue, but FYI: the same spatial intersect in ArcMap with shapefiles takes approximately one second, and with SDE binary/Oracle ~3 seconds.) Secondly, although ST_Intersects with open geometries may not be the most efficient method, my view was that a better query plan would significantly improve performance regardless: the date restriction limits geolink to about 10 results to join with geometries, so all the database really needs to do is intersect the selected geometry (Asia) with those 10 geometries.
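On the explicit-join question above: the docs page I linked suggests the same ordering can be pinned with explicit JOIN syntax plus join_collapse_limit=1 instead of from_collapse_limit. A sketch - I have not verified that it produces the same plan on this schema:

    SET join_collapse_limit = 1;

    SELECT gl.report_id
    FROM gdors_geolink gl
         JOIN gdors_geography g
           ON g.gid = gl.geom_id            -- forced first: look up each report's geometry
         JOIN gdors_geography selected_geography
           ON ST_Intersects(g.the_geom, selected_geography.the_geom)  -- forced second: intersect the small set
    WHERE gl.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
      AND selected_geography.gid = 3;

With join_collapse_limit=1 the planner keeps explicit JOINs in the written order, so the date-restricted geolink/geography join should happen before the intersection.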
Explaining some of the PostgreSQL query plans (the one below uses a new 'continent' table):

-- Additional table: continent has no index and contains eight continent geometries
-- (Africa, Antarctica, Asia, Australia, Europe, North America, Oceania and South America).
-- 2 rows, ~7000ms.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, continent selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
  AND geolink.geom_id = geography.gid
  AND selected_geography.gid = 3
  AND ST_Intersects(selected_geography.the_geom, geography.the_geom);

Hash Join  (cost=31.91..247.09 rows=21 width=8) (actual time=8226.836..8229.534 rows=2 loops=1)
  Hash Cond: (geolink.geom_id = geography.gid)
  ->  Seq Scan on gdors_geolink geolink  (cost=0.00..214.00 rows=257 width=8) (actual time=0.046..2.432 rows=275 loops=1)
        Filter: ((temp_report_date >= '2008-01-06'::date) AND (temp_report_date <= '2009-01-06'::date))
  ->  Hash  (cost=9.39..9.39 rows=1802 width=4) (actual time=8226.653..8226.653 rows=104 loops=1)
        ->  Nested Loop  (cost=0.00..9.39 rows=1802 width=4) (actual time=1568.557..8226.289 rows=104 loops=1)
              Join Filter: _st_intersects(selected_geography.the_geom, geography.the_geom)
              ->  Seq Scan on continent selected_geography  (cost=0.00..1.10 rows=1 width=364553) (actual time=0.012..0.018 rows=1 loops=1)
                    Filter: (gid = 3)
              ->  Index Scan using gdors_geography_the_geom on gdors_geography geography  (cost=0.00..8.27 rows=1 width=2914) (actual time=2.402..53.014 rows=188 loops=1)
                    Index Cond: (selected_geography.the_geom && geography.the_geom)
                    Filter: (selected_geography.the_geom && geography.the_geom)
Total runtime: 8229.772 ms

This plan looks better, although my understanding is that it is still doing a lot of work: it builds a hash of every geometry that intersects the selection (many geometries), and then probes that hash for each geolink row to create the join. To give an example, assume 100 geometries fall within Asia but only 10 of them have reports. The plan collects the 10 geolink rows and, at the same time, collects all 100 intersecting geometries - running the full intersection test on every one of them - only to join the 10 reports against 10 of those 100 geometries!?! It would be quicker to join the geolink rows to their geometries first and then intersect that small set (intersect 10 geometries, not 100, or thousands if there were more). Please correct me if I am wrong. If I am right, the problem will only be exacerbated by greater numbers of geometries; better to collect a very small subset of reports first, look up their geometries, and then do the spatial intersection.
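For anyone wanting to test that 'small set first' ordering without changing from_collapse_limit, I believe the old OFFSET 0 fence should also work, since (as far as I know) the planner will not pull up a subquery containing OFFSET. A sketch, untested on this data:

    SELECT possible_geography.report_id
    FROM (SELECT gl.report_id, g.the_geom
          FROM gdors_geolink gl
          JOIN gdors_geography g ON g.gid = gl.geom_id
          WHERE gl.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
          OFFSET 0   -- fence: stops the planner flattening the subquery into the outer query
         ) AS possible_geography,
         gdors_geography selected_geography
    WHERE selected_geography.gid = 3
      AND ST_Intersects(possible_geography.the_geom, selected_geography.the_geom);

The fence keeps the date-restricted join inside the subquery, so only its handful of rows should reach the intersection test.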
FYI, I did manage to get the following query to run on PostgreSQL 8.2, with a spatial index on geography.the_geom:

EXPLAIN ANALYZE
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
  AND geolink.geom_id = geography.gid
  AND selected_geography.gid = 3
  AND ST_Intersects(selected_geography.the_geom, geography.the_geom);

Nested Loop  (cost=0.00..91.85 rows=1 width=8) (actual time=19204.738..1418353.900 rows=81 loops=1)
  ->  Nested Loop  (cost=0.00..16.56 rows=1 width=4) (actual time=16360.043..1415704.994 rows=6392 loops=1)
        Join Filter: _st_intersects(selected_geography.the_geom, geography.the_geom)
        ->  Index Scan using gdors_geography_pkey on gdors_geography selected_geography  (cost=0.00..8.27 rows=1 width=2910) (actual time=0.041..0.045 rows=1 loops=1)
              Index Cond: (gid = 3)
        ->  Index Scan using gdors_geography_the_geom on gdors_geography geography  (cost=0.00..8.27 rows=1 width=2914) (actual time=3.584..22581.913 rows=19432 loops=1)
              Index Cond: (selected_geography.the_geom && geography.the_geom)
              Filter: (selected_geography.the_geom && geography.the_geom)
  ->  Index Scan using gdors_geolink_pkey on gdors_geolink geolink  (cost=0.00..75.28 rows=1 width=8) (actual time=0.401..0.403 rows=0 loops=6392)
        Index Cond: (geolink.geom_id = geography.gid)
        Filter: ((temp_report_date >= '2008-01-06'::date) AND (temp_report_date <= '2009-01-06'::date))
Total runtime: 1418355.448 ms

Clearly that query was just slow: it intersected geometries of which roughly 99% were going to be thrown away.

## Other Questions ##

(i...@paragon Corp). Yes, I have tried && - it is shown in the previous e-mails. I also believe that plans which execute _st_intersects on all of the geometries while index scanning for reports by date are really quite suboptimal; I assume I am reading the plans correctly :).

(Paul Ramsey). Your query, below, does not finish either in any reasonable amount of time (~23 mins at a guess). That is not too surprising given the number of geometries and the plans my database comes up with, but it does demonstrate the importance of a good query plan. Your 'Introduction to PostGIS' presentation is great, btw.

SELECT COUNT(*)
FROM gdors_geography g1, gdors_geography g2
WHERE ST_Intersects(g1.the_geom, g2.the_geom)
  AND g2.gid = 3;

Also, your request for a full EXPLAIN ANALYZE is, I think, answered by the 'continent' table example above.

(Regina Obe). I have not changed the postgresql.conf file, although I am interested in what settings you would change, especially given the small amount of data (my own guesses are sketched at the bottom); FYI the Asia shapefile is ~600k - quite big for this dataset, but still pretty small. I did experiment with increasing shared_buffers to 512MB. The Windows install that I currently have, PostgreSQL 8.2, reports the following postgis_full_version() output:

POSTGIS="1.3.5" GEOS="3.0.3-CAPI-1.4.2" PROJ="Rel. 4.6.1, 21 August 2008" USE_STATS

My Ubuntu machine has:

POSTGIS="1.3.3" GEOS="3.0.0-CAPI-1.4.1" PROJ="Rel. 4.6.0, 21 Dec 2007" USE_STATS

Clearly neither is GEOS 3.1; I guess that could be the primary cause of my problem? It doesn't appear that 3.1 is available on Windows, so I suppose it will be easiest for me to compile PostgreSQL on Ubuntu and then apply the latest PostGIS? Comments welcome :)
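P.S. Regina - so you can see my starting point, below are the only postgresql.conf settings I would have thought to touch myself. The values are illustrative guesses for a machine with ~2GB of RAM, not tested numbers, so do say if you would change something else entirely:

    shared_buffers = 512MB        # the one setting I have already experimented with
    work_mem = 32MB               # per sort/hash memory; might help the hash join above
    effective_cache_size = 1GB    # planner hint about the OS file cache, not an allocation
    random_page_cost = 2.0        # below the 4.0 default; favours index scans when data is cached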
