Re: [postgis-users] ST_Intersection very slow
This would be awesome I'm keen to experiment with the dumped points, using my recursive quadgrid function, so the vector grids subdivide to a maximum depth or number of points per cell. So this will be applied at the point level to give an optimized number of grid cells. If I understand you correctly, you would subdivide like you do currently, but the depth of subdivision would be a function of the number of points remaining in the square? It would be a great feature, at a reasonable cost (you would need to change only a few lines in your code). Then the challenge is to rebuild the intersecting points or lines. I hadn't thought of that (namely I was cutting the square into pieces, then finding which pieces are truly in the big bad polygon, but I didn't think of directly reconstructing). It is an excellent idea, which solves the problem that was slowing me down (which piece of the square is really in the big bad polygon, which involves using the full outer ring and inner ring geom!). I might rely on ring point order to find which piece is inside or outside the ring. I'm going to test it tomorrow, I would expect to break the 10-minute mark (1 core), but I'm not certain. In fact we could combine our approaches, I'll start thinking about that. Cheers, Rémi-C 2015-03-05 11:54 GMT+01:00 Mark Wynter m...@dimensionaledge.com: Hi Remi - I followed most of it - the squarised representation - approaching the form of a raster - that was another approach I was contemplating - rasterising the vectors and using inbuilt raster functions. Pixel-in-polygon uses the pixel centroid, I recall. With your approach, the squares always overlap the boundary, which means neighboring land classes will share squares in common. I think I can see that in one of the images. Yes? Pixel-in-polygon avoids this. Anyway, I like your dumping to segments and points. I'm keen to experiment with the dumped points, using my recursive quadgrid function, so the vector grids subdivide to a maximum depth or number of points per cell. So this will be applied at the point level to give an optimized number of grid cells. Then the challenge is to rebuild the intersecting points or lines, which is where I think you have got to. I'm thinking of just running ST_Intersection, like with my version 2 solution, but with the quadgrid. Hmm, just thinking aloud - keeping the parent polygon id of every point, so the quadgrid knows not only how many points, but which distinct polygons - so ST_Intersection only evaluates those we already know to intersect. Cool!! Sent from my iPhone
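For anyone who wants to sketch along before the quadgrid function is shared: below is a rough, untested illustration of the recursive subdivision idea in plain SQL. It is not the actual quadgrid function; the point table name (lancover_pts), the depth limit of 8 and the 1000-points-per-cell threshold are placeholder assumptions.

-- recursively split a cell into four children while it still holds too many points
WITH RECURSIVE quadgrid (depth, cell) AS (
    -- seed: one cell covering the whole point cloud
    SELECT 0, ST_SetSRID(ST_Extent(geom)::geometry, min(ST_SRID(geom)))
    FROM lancover_pts
  UNION ALL
    SELECT q.depth + 1,
           ST_MakeEnvelope(
             ST_XMin(q.cell) + dx       * (ST_XMax(q.cell) - ST_XMin(q.cell)) / 2.0,
             ST_YMin(q.cell) + dy       * (ST_YMax(q.cell) - ST_YMin(q.cell)) / 2.0,
             ST_XMin(q.cell) + (dx + 1) * (ST_XMax(q.cell) - ST_XMin(q.cell)) / 2.0,
             ST_YMin(q.cell) + (dy + 1) * (ST_YMax(q.cell) - ST_YMin(q.cell)) / 2.0,
             ST_SRID(q.cell))
    FROM quadgrid q,
         generate_series(0, 1) AS dx,
         generate_series(0, 1) AS dy
    WHERE q.depth < 8                                      -- max depth (assumed)
      AND (SELECT count(*) FROM lancover_pts p
           WHERE p.geom && q.cell) > 1000                  -- max points per cell (assumed)
)
SELECT row_number() OVER () AS cell_id, cell AS geom
FROM quadgrid q
WHERE q.depth = 8
   OR (SELECT count(*) FROM lancover_pts p WHERE p.geom && q.cell) <= 1000;

The final SELECT keeps only the leaf cells (those that hit the depth limit or fall under the point threshold), so dense areas end up with small cells and empty areas with big ones.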
Re: [postgis-users] ST_Intersection very slow
Yes Remi - you have understood correctly how the quadgrid function works - I will have a play with the quadgrid tomorrow, then let's share notes. If I understand you correctly, you would subdivide like you do currently, but the depth of subdivision would be a function of the number of points remaining in the square? It would be a great feature, at a reasonable cost (you would need to change only a few lines in your code). I think this is the step where we are either going to get a performance lift or not. I'm going to test it tomorrow, I would expect to break the 10-minute mark (1 core), but I'm not certain. Cool! In fact we could combine our approaches, I'll start thinking about that. Cheers, Rémi-C
Re: [postgis-users] ST_Intersection very slow
John, I've got a variation in mind that works solely using lc_class32 - it should take the same time as what I sent through earlier. I'm guessing 2 minutes!!! A different SQL statement in the data prep stage - it is now becoming very clear to me how to tile lc_class32 in the purest sense. Which means universal application for other use cases. Let me test version 2 before sharing. Sent from my iPhone On 26 Feb 2015, at 4:19 am, John Abraham j...@hbaspecto.com wrote: Wow guys. 131 seconds is a lot faster than 10 days. Remi, if you can work out a fast intersection method that uses points and lines as a preprocessor to dealing with the polygons, I think it would be a great addition to PostGIS. ST_Intersection in PostGIS is often quite a bit slower than the implementation in Geomedia, MapInfo, and (I hear) ArcGIS, so any functionality that results in speed improvements would be great. Mark, I can't wait to figure out why your system was fast! I was following your (preliminary) tutorial and gridding the data was progressing very slowly. I have a provincial boundary file but there seems to be much ambiguity in GIS representations of the provincial boundary, so I won't send you the one I have. I can try to assemble one from other sources. -- John Abraham j...@hbaspecto.com 403-232-1060 On Feb 25, 2015, at 6:28 AM, Mark Wynter m...@dimensionaledge.com wrote: Hi John, I’ve just crunched your whole dataset. The process takes 131 seconds for the vector tiling (using a 16 CPU machine). Plus another 170 seconds for data prep at the start, including making the polys valid. For a 2 CPU machine, it will take circa 15 minutes, or double that using a single CPU. Only one small issue outstanding - and that relates to clipping the regular grid prior to tiling. For clipping I used the union of the abmiw2wlcv_48tiles as supplied with the data - the problem is the abmiw2wlcv_48tiles are rough and ready, which produces voids. The voids using my method unfortunately get the same feature class as lc_class = 32. You’ll see this clearly on the second screenshot. The way around this is to clip the regular grid using a high-res shapefile of the Alberta provincial boundary prior to tile-crunching the lancover_polygons_2010 table. This is easily done - I just didn’t have the boundary data. I need to get some sleep. John, Remi, I will share the code with you tomorrow. For others, I’ll be posting a tutorial that steps through the methods touched upon in this thread. John, the only difference between my tutorial and its application to your land cover data was a few tweaks to the data prep stage. Otherwise the same code pattern (no modification at all needed to the worker_function). It was great to test the code with your data. Speak tomorrow. Mark Screen Shot 2015-02-25 at 11.29.15 pm.png Screen Shot 2015-02-25 at 11.39.04 pm.png Screen Shot 2015-02-25 at 11.41.41 pm.png
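In case it helps whoever picks this up: the grid-clipping step described above can be done in a single pass once a clean boundary is to hand. The table and column names below (a regular grid table named grid and a dissolved high-res boundary table named ab_boundary) are illustrative only, not the actual tutorial code.

-- keep tiles wholly inside the boundary as-is, clip the edge tiles, drop the rest
CREATE TABLE grid_clipped AS
SELECT g.tile_id,
       CASE WHEN ST_CoveredBy(g.geom, b.geom) THEN g.geom
            ELSE ST_Intersection(g.geom, b.geom)
       END AS geom
FROM grid g
JOIN ab_boundary b ON ST_Intersects(g.geom, b.geom);
CREATE INDEX grid_clipped_geom_idx ON grid_clipped USING gist (geom);

Tiles that fall entirely outside the boundary simply never join, which is the fix described above for the voids that were picking up lc_class = 32.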
Re: [postgis-users] ST_Intersection very slow
I’ll also try and take a look - I’ve just downloaded the dataset in question - I’m keen to see it. Will revert.
Re: [postgis-users] ST_Intersection very slow
Where is the data set? Cheers, Rémi-C
Re: [postgis-users] ST_Intersection very slow
Damn, I'm publishing a QGIS plugin for PostGIS, so I can't write you the fast intersection by casting the polygon to points, intersecting, then casting back to a polygon. I'll try to squeeze it in somehow. Cheers, Rémi-C
Re: [postgis-users] ST_Intersection very slow.
Thanks Mark. I will indeed do the st_dump, but I don't think it will help much (very few multis). I think the tiling will help a lot. What I wonder, though, is how long it will take to tile? After all, it's still an st_intersection operation to tile each geometry against each tile. Is there a quad tree type tiling algorithm in a function? If I do 256 x 256 tiles, doing it all at once would be 65536 operations of st_intersection(complex_geom, square). With a quad tree I'll only have 4 operations of st_intersection(complex_geom, square), and then 16 of (a_little_less_complex_geom, square), and then 64 of (even_less_complex_geom, square) and so on, for 8 nests. The last one will still be 65536 operations, but the complex geoms should be a lot simpler by the time I get down to that level. What do you think, is this worth trying? Or is intersecting with a square fairly simple regardless of how complex the other geometry is? (Your reply initially fell into my junk mail, so I missed it, and I also haven't tried these things yet, partly because I missed your email.) -- John Abraham j...@hbaspecto.com 403-232-1060
Re: [postgis-users] ST_Intersection very slow
Following advice from Mark and Remi, I am now tiling the troublesome layer. It's not going quickly. I'm trying an intersection of the layer with only 5 of my 10km x 10km grids and it has been running for 30 min. I have 8680 grids, which is more features than the layer I was originally working with (2709 polygons). I was hoping that intersection with squares would go much faster than an intersection with arbitrary polygons, but I can confirm that it's definitely not orders of magnitude faster. So, I think my idea of simplifying things recursively with a quad grid is worth a shot. Mark, can you point me to your quadgrid function? Mark suggested counting points. Counting points (divided by 100 and rounded to the nearest 100) gives: select round(st_npoints(geom)/100,-2) as point100, count(*) from lancover_polygons_snap group by round(st_npoints(geom)/100,-2) order by point100 desc; 26300;1 1500;1 1100;1 800;1 600;2 500;1 400;5 300;10 200;22 100;141 0;997846 The one with about 2.63 million points represents the developed land in the region (towns and cities including the roads that connect them). Mark also suggested counting interior rings. select round(ST_NumInteriorRings(st_geometryN(geom,1)),-1) as rings10, avg(st_numpoints(geom)) as avgpts, count(*) as numfeatures from lancover_polygons_snap group by round(ST_NumInteriorRings(st_geometryN(geom,1)),-1) order by rings10 desc; 43140;2629905;1 4230;147121;1 3800;110083;1 2480;76515;1 1760;50261;1 1720;55576;1 1680;64588;1 1350;41826;1 1130;38059;1 970;36425;1 950;37686;1 860;32952;1 850;34657;1 820;35285;2 810;24769;1 800;28113;1 760;30726;1 740;26296;1 720;24788;2 670;18874;1 660;22428;1 640;26258;1 630;27213;2 620;22056;2 610;15771;1 560;19942;2 540;23988;1 500;15763;2 480;16598;1 470;19865;2 410;15466;1 400;16023;1 380;13536;4 370;14646;3 360;14122;1 340;12910;1 330;13273;1 320;15222;2 300;13421;1 290;9072;3 280;10025;4 270;12505;4 260;11002;5 250;9843;3 240;9980;6 230;11456;3 220;9049;3 210;9194;6 200;8586;5 190;7089;5 180;6796;8 170;6913;8 160;6955;6 150;5911;10 140;5769;15 130;5377;13 120;5301;17 110;5246;17 100;4422;28 90;4332;28 80;3664;35 70;3292;43 60;2860;66 50;2236;74 40;1882;141 30;1493;215 20;993;618 10;445;3325 0;20;993265 -- John Abraham j...@hbaspecto.com 403-232-1060 POSTGIS=2.0.3 r11128 GEOS=3.3.8-CAPI-1.7.8 PROJ=Rel. 4.8.0, 6 March 2012 GDAL=GDAL 1.9.2, released 2012/10/08 LIBXML=2.7.8 LIBJSON=UNKNOWN (core procs from 2.0.3 r11132 need upgrade) TOPOLOGY (topology procs from 2.0.3 r11132 need upgrade) RASTER (raster procs from 2.0.3 r11132 need upgrade)
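Given that distribution (a handful of monster polygons and roughly 998,000 trivial ones), one option short of a full quadgrid is to tile only the polygons above a size cut-off and pass everything else through untouched. A sketch, assuming the table has a gid primary key and that a regular grid table grid_1km(tile_id, geom) exists; the 10000-point threshold is arbitrary:

CREATE TABLE lancover_split AS
SELECT gid, lc_class, geom                                  -- small polygons: pass through unchanged
FROM lancover_polygons_snap
WHERE ST_NPoints(geom) < 10000
UNION ALL
SELECT p.gid, p.lc_class,
       CASE WHEN ST_CoveredBy(g.geom, p.geom) THEN g.geom   -- tile wholly inside the big polygon
            ELSE ST_Intersection(p.geom, g.geom)
       END AS geom
FROM lancover_polygons_snap p
JOIN grid_1km g ON ST_Intersects(p.geom, g.geom)
WHERE ST_NPoints(p.geom) >= 10000;
CREATE INDEX lancover_split_geom_idx ON lancover_split USING gist (geom);

Keeping the original gid on every piece means the big polygons can be rebuilt later with ST_Union(geom) GROUP BY gid, exactly as in the tiling approach discussed elsewhere in this thread.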
Re: [postgis-users] ST_Intersection very slow
Hey, I hijack this interesting discussion, sorry =) If you have one single massive polygon (no multi), you can generate a point table (keeping the path, of course) with ST_DumpPoints. Then you construct an index on this table. Then you perform your intersection with the indexed points, which will be very fast. Then you reconstruct the parts of the polygon. It will require some advanced to very advanced SQL (like window functions) to make it work, though. I think it would be faster than the other alternatives, because basically it amounts to using the optimized R-tree index on points, vs. a custom and non-optimized quadtree index on the polygon. Moreover, when using a recursive strategy, you end up computing the same thing a lot with GEOS. Cheers, Rémi-C
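For anyone wanting to try this point route, the dump-and-index half is short; the polygon reconstruction (window functions over the dump path) is the hard part and is left out here. Names below are illustrative, and a gid primary key on the polygon table is assumed:

-- explode the polygons into vertices, keeping the path so ring order is preserved
CREATE TABLE lancover_pts AS
SELECT gid,
       (dp).path AS path,      -- position of the vertex within its ring(s)
       (dp).geom AS geom
FROM (SELECT gid, ST_DumpPoints(geom) AS dp
      FROM lancover_polygons_snap) AS d;
CREATE INDEX lancover_pts_geom_idx ON lancover_pts USING gist (geom);

-- the point-level intersection is then a plain indexed join
SELECT n.taz, p.gid, p.path
FROM lancover_pts p
JOIN tazjan2 n ON ST_Intersects(p.geom, n.the_geom);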
Re: [postgis-users] ST_Intersection very slow
The land cover data you’re working with - does it come as a raster? We need a strategy for busting up that single, monolithic poly with lots of holes. If it's a raster, then I think that tiling on import would be of huge benefit. I’m just under the pump the next few days.
Re: [postgis-users] ST_Intersection very slow
Thanks Mark. I will indeed do the st_dump, but I don't think it will help much (very few multis). I think the tiling will help a lot. What I wonder, though, is how long it will take to tile? After all, it's still an st_intersection operation to tile each geometry against each tile. I’ve almost finished writing the tutorial - where I address many of these points. The variables that affect performance are: * how you’ve written your ST_Intersection query * multi vs. non-multi * size and complexity of geoms * no. of available CPUs (for parallelisation) * tile batch size - important!!! All strategies in combination may be necessary if your queries are taking forever. For the demonstration dataset (a multipolygon representing the whole of Australia), my tutorial tiling query incorporates ST_Intersection and ST_Difference subqueries to produce tiled features representing land and water. I achieved a 49x reduction in the query time to tile the whole of Australia, starting with a single multipolygon. The more complex the query, the more significant this time saving is in absolute terms. Is there a quad tree type tiling algorithm in a function? If I do 256 x 256 tiles, doing it all at once would be 65536 operations of st_intersection(complex_geom, square). With a quad tree I'll only have 4 operations of st_intersection(complex_geom, square), and then 16 of (a_little_less_complex_geom, square), and then 64 of (even_less_complex_geom, square) and so on, for 8 nests. The last one will still be 65536 operations, but the complex geoms should be a lot simpler by the time I get down to that level. What do you think, is this worth trying? Or is intersecting with a square fairly simple regardless of how complex the other geometry is? I do have a SQL quadgrid tiling function - where a cell divides recursively subject to a maximum number of levels or “value thresholds” - but I’m not sure if that’s the right approach.
[postgis-users] ST_Intersection very slow.
So I'd been running this query for 866000 s (10 days) before I decided to kill it: create table taz_and_lancover_10_fast_2 as SELECT p.lc_class, n.taz , CASE WHEN ST_CoveredBy(p.geom, n.the_geom) THEN p.geom ELSE ST_Multi( ST_Intersection(p.geom,n.geom) ) END AS geom FROM lancover_polygons_snap AS p INNER JOIN tazjan2 AS n ON (ST_Intersects(p.geom, n.the_geom) AND NOT ST_Touches(p.geom, n.the_geom) ); Explain shows this, it's using the spatial index: Nested Loop (cost=0.00..310492.35 rows=483105 width=9973) Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom && n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom)))) -> Seq Scan on tazjan2 n (cost=0.00..843.09 rows=2709 width=9493) -> Index Scan using lancover_polygons_snap_geom_idx on lancover_polygons_snap p (cost=0.00..21.67 rows=5 width=480) Index Cond: (geom && n.the_geom) There are 2709 rows in tazjan2 and 998031 rows in lancover_polygons_snap, so I appreciate that it's a bit of a large problem. But MapInfo was able to do it interactively in a few days and Geomedia was also able to do it in about a day. Both MapInfo and Geomedia ran out of memory (8GB machines) until the problem was broken into two regions (North and South), but postgresql seems to be chewing on the problem using only 400MB. The interactive approach in MapInfo was to divide lancover_polygons_snap by lc_class, and to further divide each region into about 10 subproblems. Perhaps breaking it into subproblems is the way to go? Can't the query do that based on the indices, or would I have to do it manually? One potential thing I've realized is that a few of the geometries in tazjan2 are multipolygons, not single polygons. But it's only a few. There are a few very large and complex polygons in lancover_polygons_snap, but again, most of the 998031 are simple small polygons, about half would be ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap two or three of the polygons in tazjan2. I must be doing something wrong. Any hints? I have max_connections set to 100 (currently only 11 connections active). work_mem was defaulting to 1MB, I just bumped it to 256MB (machine has 32GB but it's PostgreSQL 9.1 (x86), i.e. the 32-bit version). shared_buffers was 32MB, I am trying 512MB, could go higher. effective_cache_size was defaulting to 128MB, I am trying 20GB. random_page_cost was 4.0, this is a VMware virtual server. I am trying 2.0. I'm trying an explain analyze with just 5 rows of tazjan2; a 5 x 998031 intersection problem with indices shouldn't take that long, should it? It's been running for a little over an hour. I'm wondering if the few complex polygons in lancover_polygons_snap are causing the slowness? Is there some fast way to divide complex polygons, e.g. apply a 1km grid over them? postgis_full_version(): POSTGIS=2.0.3 r11128 GEOS=3.3.8-CAPI-1.7.8 PROJ=Rel. 4.8.0, 6 March 2012 GDAL=GDAL 1.9.2, released 2012/10/08 LIBXML=2.7.8 LIBJSON=UNKNOWN (core procs from 2.0.3 r11132 need upgrade) TOPOLOGY (topology procs from 2.0.3 r11132 need upgrade) RASTER (raster procs from 2.0.3 r11132 need upgrade) -- John Abraham
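On the closing question about applying a 1km grid: a regular grid can be generated straight from the layer's extent. A sketch is below; the SRID 3400 is just a placeholder for whatever projected, metre-based CRS the data is actually in, and the table names are illustrative.

CREATE TABLE grid_1km AS
SELECT row_number() OVER () AS tile_id,
       ST_MakeEnvelope(xs.x, ys.y, xs.x + 1000, ys.y + 1000, 3400) AS geom
FROM (SELECT generate_series(
               (SELECT floor(ST_XMin(ST_Extent(geom)))::int FROM lancover_polygons_snap),
               (SELECT ceil(ST_XMax(ST_Extent(geom)))::int  FROM lancover_polygons_snap),
               1000) AS x) AS xs,
     (SELECT generate_series(
               (SELECT floor(ST_YMin(ST_Extent(geom)))::int FROM lancover_polygons_snap),
               (SELECT ceil(ST_YMax(ST_Extent(geom)))::int  FROM lancover_polygons_snap),
               1000) AS y) AS ys;
CREATE INDEX grid_1km_geom_idx ON grid_1km USING gist (geom);

A grid table like this can then be used to chop the few complex polygons into tile-sized pieces before the big join, along the lines discussed later in the thread.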
Re: [postgis-users] ST_Intersection very slow.
Hey, you could try not using CASE (i.e. separate the spatial join from the processing), which is easy to parallelize (assuming you have more than one core). First generate the table with CREATE TABLE spatial_mapping_between_lancover_and_taz AS SELECT row_number() over() as row_id, p.your_primary_key AS p_id, n.your_primary_key AS n_id FROM lancover_polygons_snap AS p INNER JOIN tazjan2 AS n ON (ST_Intersects(p.geom, n.the_geom) AND NOT ST_Touches(p.geom, n.the_geom) ); Then when you have your table, you can do your processing (put it into a SQL function, it is cleaner), and you parallelize on the row_id. Basically you get min(row_id), max(row_id), separate the range into N parts, then execute the parts with your K processes (both bash and python have utilities for this). Of course if it is actually the spatial join that is slow, you can also parallelise this (by cutting lancover into N pieces, for instance). Cheers, Rémi-C
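To make the second half concrete, here is one way to slice the work (a modulo on row_id rather than explicit min/max ranges; the effect is the same). Names are illustrative: gid primary keys, the p_id/n_id columns from the mapping table, and a pre-created taz_and_lancover_pieces output table are all assumptions. Each worker gets its slice via psql variables, so several copies of the same script can run side by side, e.g. psql -v nchunks=8 -v chunk=3 -f worker.sql.

-- worker.sql: process one slice of the mapping table
INSERT INTO taz_and_lancover_pieces
SELECT p.lc_class, n.taz,
       CASE WHEN ST_CoveredBy(p.geom, n.the_geom) THEN p.geom
            ELSE ST_Multi(ST_Intersection(p.geom, n.the_geom))
       END AS geom
FROM spatial_mapping_between_lancover_and_taz m
JOIN lancover_polygons_snap p ON p.gid = m.p_id
JOIN tazjan2 n               ON n.gid = m.n_id
WHERE m.row_id % :nchunks = :chunk;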
Re: [postgis-users] ST_Intersection very slow.
*I will write up a tutorial explaining the benefits of vector tiling in PostGIS, with examples and parallelisation code patterns. If not today, hopefully over the next week.* Would be nice if you have the time! Best, Andre -- André Mano http://opussig.blogspot.com/
[postgis-users] ST_Intersection very slow.
So I'd been running this query for 866000 s (10 days) before I decided to kill it: One potential thing I've realized is that a few of the geometries in tazjan2 are multipolygons, not single polygons. But it's only a few. There are a few very large and complex polygons in lancover_polygons_snap, but again, most of the 998031 are simple small polygons, about half would be ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap two or three of the polygons in tazjan2. A common offender - the multipolygons. You can get an immediate performance improvement by ST_Dump(wkb_geometry).geom into a new table, with a new serial id, but retaining the original poly_id so you don't lose the multipolygon provenance of each polygon. Complex operations on the multipolygon table, like ST_Intersection(), are extremely slow, because the query has to access and compute the operation on each of the geometry elements contained within the multi. Another massive performance gain can be achieved by “tiling” your dumped polygon table - which has the effect of breaking your geometries into smaller features, which you can then “union” back together again after running ST_Intersection on your vector tiles. Don’t try to tile your multipolygon table - you’ll still be waiting days. Take the coastline of Australia - a single multipolygon: SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11; num_rows | num_geoms | num_points: 1 | 6718 | 1622598 If you need to query the intersection of a polygon with the land, and the water, it's near impossible. Many, many days. Hence we need to take a different approach. The first thing you can do is to dump the multipolygon table. Then create a regular vector grid. Then tile both your polygon tables using the vector grid. Put indexes on the resultant tiled tables. Then run your ST_Intersects. The result is many more geometries, and points, but now we get the benefit of geometry indexes. As a result of tiling the Australia coastline, we now get: SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled; num_rows | num_geoms | num_points: 17222 | 29020 | 6513770 Each geometry also has a tile_id, and the original poly_id. The tile_ids are really useful for subsetting your queries, and for using tile_ids as an additional join condition. At any time you can rapidly rebuild your original geometries doing ST_Union() GROUP BY poly_id. Using the approach of dumping and tiling, queries that once took days now take minutes at most. And as Remi mentioned, you can parallelise the process. The concept of vector tiling in PostGIS is analogous to MapReduce and assigning key-value pairs. It's about chunking your data, doing lots of fast operations on the smaller bits, and then consolidating your outputs. You don’t necessarily have to write a SQL function to do the SQL parallelisation. My preference is Bash and GNU Parallel because you can write vanilla SQL and execute it via psql in parallel. For me it's now about the speed at which I can write the SQL. So we’re now starting to apply Big Data concepts within PostGIS…. Holy smoke… and you can apply the same concepts to a wide range of PostGIS operations - scale up, scale out…. I will write up a tutorial explaining the benefits of vector tiling in PostGIS, with examples and parallelisation code patterns. If not today, hopefully over the next week.
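A condensed sketch of the dump-and-tile pattern described above, reusing the table names from the Australia example but with guessed columns and a pre-built regular grid table grid(tile_id, geom). It is a reconstruction of the idea, not the tutorial code itself.

-- 1. dump the multipolygon into single polygons, keeping provenance
CREATE TABLE abs_aus11_dumped AS
SELECT row_number() OVER () AS gid, ogc_fid AS poly_id,
       (ST_Dump(wkb_geometry)).geom AS wkb_geometry
FROM abs_aus11;
CREATE INDEX abs_aus11_dumped_geom_idx ON abs_aus11_dumped USING gist (wkb_geometry);

-- 2. tile the dumped polygons against the regular grid
CREATE TABLE abs_aus11_tiled AS
SELECT row_number() OVER () AS tid, d.poly_id, g.tile_id,
       CASE WHEN ST_CoveredBy(d.wkb_geometry, g.geom) THEN d.wkb_geometry
            ELSE ST_Intersection(d.wkb_geometry, g.geom)
       END AS wkb_geometry
FROM abs_aus11_dumped d
JOIN grid g ON ST_Intersects(d.wkb_geometry, g.geom);
CREATE INDEX abs_aus11_tiled_geom_idx ON abs_aus11_tiled USING gist (wkb_geometry);

-- 3. the originals can be rebuilt at any time
SELECT poly_id, ST_Union(wkb_geometry) AS wkb_geometry
FROM abs_aus11_tiled
GROUP BY poly_id;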
Mark
Re: [postgis-users] ST_Intersection very slow.
Thanks for the hint, Rémi. I'll give it a try, to see if the st_intersection is slow (as compared to just st_intersects). Here's my EXPLAIN ANALYZE for my 5 zone test: Nested Loop (cost=1.56..80.91 rows=8 width=560) (actual time=3929258.738..19756123.479 rows=29 loops=1) Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom && n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom)))) CTE sometaz -> Limit (cost=0.00..1.56 rows=5 width=9537) (actual time=0.433..0.576 rows=5 loops=1) -> Seq Scan on tazjan2 (cost=0.00..843.09 rows=2709 width=9537) (actual time=0.415..0.511 rows=5 loops=1) -> CTE Scan on sometaz n (cost=0.00..0.10 rows=5 width=80) (actual time=0.460..0.724 rows=5 loops=1) -> Index Scan using lancover_polygons_snap_geom_idx on lancover_polygons_snap p (cost=0.00..12.47 rows=5 width=480) (actual time=0.242..65.640 rows=16 loops=5) Index Cond: (geom && n.the_geom) Total runtime: 19756143.935 ms Here it is explained in layspeak: http://explain.depesz.com/s/dbVi So, am I reading this right, that it took 5.5 hours to st_intersects 5 (multi)polygons with 29 (multi)polygons, and find the 8 that intersected? That the index scan was fast (328ms), and returned only a few (16, or maybe 29) of the 998031 polygons, but that the actual intersection afterwards, between just a few polygons, was so very slow? (The query was this: with sometaz as (select * from tazjan2 limit 5) SELECT p.lc_class, n.taz , CASE WHEN ST_CoveredBy(p.geom, n.the_geom) THEN p.geom ELSE ST_Multi( ST_Intersection(p.geom,n.geom) ) END AS geom FROM lancover_polygons_snap AS p INNER JOIN sometaz AS n ON (ST_Intersects(p.geom, n.the_geom) AND NOT ST_Touches(p.geom, n.the_geom) ); ) -- John Abraham j...@hbaspecto.com 403-232-1060
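One cheap way to answer that question is to run the same join with the expensive geometry construction stripped out: if this comes back quickly, the index and ST_Intersects side is fine and ST_Intersection on the big polygons is the culprit. Same tables and LIMIT 5 as the test above; this is just a diagnostic, not a fix.

EXPLAIN ANALYZE
WITH sometaz AS (SELECT * FROM tazjan2 LIMIT 5)
SELECT count(*)
FROM lancover_polygons_snap AS p
INNER JOIN sometaz AS n
   ON ST_Intersects(p.geom, n.the_geom)
  AND NOT ST_Touches(p.geom, n.the_geom);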