Re: [postgis-users] ST_Intersection very slow

2015-03-05 Thread Rémi Cura
This would be awesome 
I'm keen to experiment with the dumped points, using my recursive quadgrid
function, so the vector grids subdivide to a maximum depth or number of
points per cell. So this will be applied at the point level to give an
optimized number of grid cells.

If I understand you correctly, you would subdivide like you do currently,
but the depth of subdivision would be a function of the number of points
remaining in the square?
It would be a great feature, at a reasonable cost (you would need to change
only a few lines in your code).

Then the challenge is to rebuild the intersecting points or lines.
I hadn't thought of that (namely, I was cutting the square into pieces, then
finding which pieces are truly in the big bad polygon, but I hadn't thought of
directly reconstructing).
It is an excellent idea, and it solves the problem that was slowing me down
(deciding which piece of the square is really in the big bad polygon, which
involves using the full outer ring and inner ring geometries!).
I might rely on ring point order to find which piece is inside or outside
the ring.

I'm going to test it tomorrow; I would expect to break the 10 minute mark
(1 core), but I'm not certain.


In fact we could combine our approaches, I'll start thinking about that.

Cheers,
Rémi-C


2015-03-05 11:54 GMT+01:00 Mark Wynter m...@dimensionaledge.com:

 Hi Remi - I followed most of it - the squarised representation -
 approaching the form of a raster - that was another approach I was
 contemplating - rasterising the vectors and using inbuilt raster functions.
 Pixel-in-polygon uses the pixel centroid, I recall. With your approach, the
 squares always overlap the boundary, which means neighboring land classes
 will share squares in common. I think I can see that in one of the images.
 Yes? Pixel-in-polygon avoids this.

 Anyway, I like your dumping to segments and points.

 I'm keen to experiment with the dumped points, using my recursive quadgrid
 function, so the vector grids subdivide to a maximum depth or number of
 points per cell. So this will be applied at the point level to give an
 optimized number of grid cells.

 Then the challenge is to rebuild the intersecting points or lines, which is
 where I think you have got to. I'm thinking of just running
 ST_Intersection, like with my version 2 solution, but with the quadgrid.

 Hmm, just thinking aloud - keeping the parent polygon id of every point, so
 the quadgrid knows not only how many points, but which distinct polygons -
 so ST_Intersection only evaluates the pairs we already know to intersect.
 Cool!!

 Sent from my iPhone

  On 5 Mar 2015, at 8:32 pm, Rémi Cura remi.c...@gmail.com wrote:
 


Re: [postgis-users] ST_Intersection very slow

2015-03-05 Thread Mark Wynter
Yes Remi - you have understood correctly how the quadgrid function works - I 
will have a play with the quadgrid tomorrow, then let's share notes.

 If I understand you correctly, you would subdivide like you do currently, but 
 the depth of subdivision would be a function of the number of points remaining 
 in the square?
 It would be a great feature, at a reasonable cost (you would need to change 
 only a few lines in your code).

I think this is the step where we are either going to get a performance lift or 
not.

 
 I'm going to test it tomorrow; I would expect to break the 10 minute mark 
 (1 core), but I'm not certain.

Cool!  

 In fact we could combine our approaches, I'll start thinking about that.
 
 Cheers,
 Rémi-C
 
 

Re: [postgis-users] ST_Intersection very slow

2015-02-26 Thread Mark Wynter
John, I've got a variation in mind that works solely using lc_class32 - it should 
take the same time as what I sent through earlier. I'm guessing 2 minutes!!!  A 
different SQL statement in the data prep stage - it is now becoming very clear 
to me as to how to tile lc_class32 in the purest sense. Which means universal 
application for other use cases. Let me test version 2 before sharing.

Sent from my iPhone

 On 26 Feb 2015, at 4:19 am, John Abraham j...@hbaspecto.com wrote:
 
 Wow guys.  131 seconds is a lot faster than 10 days.
 
 Remi, if you can work out a fast intersection method that uses points and 
 lines as a preprocessor to dealing with the polygons, I think it would be a 
 great addition to PostGIS.  ST_Intersection in PostGIS is often quite a bit 
 slower than the implementation in Geomedia, Mapinfo, and (I hear) ArcGIS, so 
 any functionality that results in speed improvements would be great.
 
 Mark, I can't wait to figure out why your system was fast!  I was following 
 your (preliminary) tutorial and gridding the data was progressing very 
 slowly.  
 
 I have a provincial boundary file but there seems to be much ambiguity in GIS 
 representations of the provincial boundary, so I won't send you the one I 
 have.  I can try to assemble one from other sources.
 
 --
 John Abraham
 j...@hbaspecto.com
 403-232-1060
 
 On Feb 25, 2015, at 6:28 AM, Mark Wynter m...@dimensionaledge.com wrote:
 
 Hi John
 
 I’ve just crunched your whole dataset.  The process takes 131 seconds for 
 the vector tiling (using a 16 CPU machine), plus another 170 seconds for 
 data prep at the start, including making the polys valid.
 For a 2 CPU machine, it will take circa 15 minutes, or double that using a 
 single CPU.
 
 Only one small issue outstanding - and that relates to clipping the regular 
 grid prior to tiling.  For clipping I used the union of the 
 abmiw2wlcv_48tiles as supplied with the data - the problem is the 
 abmiw2wlcv_48tiles are rough and ready, which produces voids. The voids 
 using my method unfortunately get the same feature class as lc_class = 32.  
 You’ll see this clearly on the second screenshot.
 The way around this is to clip the regular grid using a high-res shapefile 
 of the Alberta provincial boundary prior to tile crunching the 
 lancover_polygons_2010 table.  This is easily done - I just didn’t have the 
 boundary data.
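 
 A minimal sketch of that clip step, assuming a hypothetical boundary table 
 ab_boundary(geom) and the regular grid in grid(tile_id, geom):
 
 -- keep only the grid cells that touch the provincial boundary polygon
 CREATE TABLE grid_clipped AS
 SELECT g.tile_id, g.geom
 FROM grid AS g
 JOIN ab_boundary AS b ON ST_Intersects(g.geom, b.geom);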
 
 I need to get some sleep. John, Remi, I will share the code with you 
 tomorrow.  For others, I’ll be posting a tutorial that steps through the 
 methods touched upon in this thread.
 
 John, the only difference between my tutorial and its application to your 
 land cover data was a few tweaks to the data prep stage. Otherwise the same 
 code pattern (no modification at all needed to the worker_function).  It was 
 great to test the code with your data.
 
 Speak tomorrow.
 Mark
 
 
 
 
 [Attachments: Screen Shot 2015-02-25 at 11.29.15 pm.png, Screen Shot 
 2015-02-25 at 11.39.04 pm.png, Screen Shot 2015-02-25 at 11.41.41 pm.png]
 

Re: [postgis-users] ST_Intersection very slow

2015-02-25 Thread Mark Wynter
I’ll also try and take a look - I’ve just downloaded the dataset in question - 
I’m keen to see it.  Will revert.

On 25 Feb 2015, at 6:41 pm, Rémi Cura remi.c...@gmail.com wrote:

 Damn, I'm publishing a QGIS plugin for PostGIS, so 
 I can't write you the fast intersection by casting polygons to points, 
 intersecting, then casting back to polygons.
 I'll try to squeeze it in somehow.
 Cheers,
 Rémi-C
 
 2015-02-25 1:44 GMT+01:00 Mark Wynter m...@dimensionaledge.com:
 The land cover data you’re working with - does it come as a raster?   We need 
 a strategy for busting up that single, monolithic poly with lots of holes.
 If it's a raster, then I think that tiling on import would be of huge benefit.
 I’m just under the pump the next few days.
 


Re: [postgis-users] ST_Intersection very slow

2015-02-25 Thread Rémi Cura
Where is the data set?
Cheers,
Rémi-C

2015-02-25 9:18 GMT+01:00 Mark Wynter m...@dimensionaledge.com:

 I’ll also try and take a look - I’ve just downloaded the dataset in
 question - I’m keen to see it.  Will revert.

 On 25 Feb 2015, at 6:41 pm, Rémi Cura remi.c...@gmail.com wrote:

 Damn, I'm publishing a QGIS plugin for PostGIS, so
 I can't write you the fast intersection by casting polygons to points,
 intersecting, then casting back to polygons.
 I'll try to squeeze it in somehow.
 Cheers,
 Rémi-C

 2015-02-25 1:44 GMT+01:00 Mark Wynter m...@dimensionaledge.com:

 The land cover data you’re working with - does it come as a raster?   We
 need a strategy for busting up that single, monolithic poly with lots of
 holes.
 If it's a raster, then I think that tiling on import would be of huge
 benefit.
 I’m just under the pump the next few days.





Re: [postgis-users] ST_Intersection very slow

2015-02-25 Thread Rémi Cura
Damn, I'm publishing a QGIS plugin for PostGIS, so
I can't write you the fast intersection by casting polygons to points,
intersecting, then casting back to polygons.
I'll try to squeeze it in somehow.
Cheers,
Rémi-C

2015-02-25 1:44 GMT+01:00 Mark Wynter m...@dimensionaledge.com:

 The land cover data you’re working with - does it come as a raster?   We
 need a strategy for busting up that single, monolithic poly with lots of
 holes.
 If it's a raster, then I think that tiling on import would be of huge
 benefit.
 I’m just under the pump the next few days.

Re: [postgis-users] ST_Intersection very slow.

2015-02-24 Thread John Abraham
Thanks Mark.  I will indeed do the st_dump, but I don't think it will help much 
(very few multis).  I think the tiling will help a lot.  What I wonder, though, 
is how long it will take to tile?  After all, it's still an st_intersection 
operation to tile each geometry against each tile.

Is there a quad tree type tiling algorithm in a function?  If I do 256 x 256 
tiles, doing it all at once would be 65536 operations of 
st_intersection(complex_geom, square).  With a quad tree I'll only have 4 
operations of st_intersection(complex_geom, square), and then 16 of 
(a_little_less_complex_geom, square), and then 64 of (even_less_complex_geom, 
square) and so on, for 8 nests.  The last one will still be 65536 operations, 
but the complex geoms should be a lot simpler by the time I get down to that 
level.  What do you think, is this worth trying?  Or is intersecting with a 
square fairly simple regardless of how complex the other geometry is?

(Your reply initially fell into my junk mail, so I missed it, and also haven't 
tried these things yet, partly because I missed your email.)

--
John Abraham
j...@hbaspecto.com
403-232-1060

On Feb 19, 2015, at 6:52 PM, Mark Wynter m...@dimensionaledge.com wrote:

 
 So I was running this query for 866000 s (10 days) before I decided to 
 kill it:
 
 
 One potential thing I've realized is that a few of the geometries in tazjan2 
 are multipolygons, not single polygons.  But it's only a few.  There are a 
 few very large and complex polygons in lancover_polygons_snap, but again, 
 most of the 998031 are simple small polygons, about half would be 
 ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap 
 two or three of the polygons in tazjan2.
 
 
 A common offender - the multipolygons.
 
 You can get an immediate performance improvement by dumping 
 (ST_Dump(wkb_geometry)).geom into a new table, with a new serial id, but 
 retaining the original poly_id so you don’t lose the multipolygon 
 provenance of each polygon.
 Complex operations on the multipolygon table, like ST_Intersection(), are 
 extremely slow, because the query has to access and compute the operation on 
 each of the geometry elements contained within the multi.
 
 Another massive performance gain can be achieved by “Tiling” your dumped 
 polygon table - which has the effect of breaking your geometries into smaller 
 features, which you can then “union” back together again, after running 
 ST_Intersection on your vector tiles.  Don’t try to tile your multi polygon 
 table - you’ll still be waiting days.
 
 Take the coastline of Australia - a single multi polygon
 
 SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as 
 num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11;
  num_rows | num_geoms | num_points 
 ----------+-----------+------------
         1 |      6718 |    1622598
 
 If you need to query the intersection of a polygon with land, and the water, 
 it's near impossible.  Many, many days.
 
 Hence we need to take a different approach.
 
 The first thing you can do is to dump the multi polygon table.
 Then create a regular vector grid.
 Then tile both your polygon tables using the vector grid.
 Put indexes on the resultant tiled tables.
 Then run your ST_Intersects.
 The result is many more geometries, and points, but now we get the benefit of 
 geometry indexes.
 
 As a result of tiling the Australian coastline, we now get
 
 SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as 
 num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled;
  num_rows | num_geoms | num_points 
 ----------+-----------+------------
     17222 |     29020 |    6513770
 
 Each geometry also has a tile_id, and the original poly_id.  The tile_ids 
 are really useful for subsetting your queries, and for using tile_ids as an 
 additional join condition.  At any time you can rapidly rebuild your original 
 geometries by doing ST_Union() GROUP BY poly_id.
 
 Using the approach of dumping and tiling, queries that once took days now 
 take minutes at most.  And as Remi mentioned, you can parallelise the 
 process.  The concept of vector tiling in PostGIS is analogous to Map Reduce 
 and assigning Key Value Pairs.  It's about chunking your data, doing lots of 
 fast operations on the smaller bits, and then consolidating your outputs.  
 You don’t necessarily have to write a SQL function to do the SQL 
 parallelisation.   My preference is Bash and GNU Parallel because you can 
 write vanilla SQL and execute it via psql in parallel.   For me it's now about 
 the speed at which I can write the SQL.
 
 So we’re now starting to apply Big Data concepts within PostGIS….Holy 
 smoke… and you can apply the same concepts to a wide range of PostGIS 
 operations - scale up, scale out….
 
 I will write up a tutorial explaining the benefits of vector tiling in 
 PostGIS, with examples and parallelisation code patterns.  If not today, 
 hopefully over the next week.

Re: [postgis-users] ST_Intersection very slow

2015-02-24 Thread John Abraham
Following advice from Mark and Remi, I am now tiling the troublesome layer.  
It's not going quickly. I'm trying an intersection of the layer with only 5 of 
my 10km x 10km grid cells and it has been running for 30 min.  I have 8680 grid 
cells, which is more features than the layer I was originally working with (2709 
polygons). I was hoping that intersection with squares would go much faster 
than an intersection with arbitrary polygons, but I can confirm that it's 
definitely not orders of magnitude faster.

So, I think my idea of simplifying things recursively with a quad grid is worth 
a shot.  Mark, can you point me to your Quadgrid function?

Mark suggested counting points.   Counting points (npoints divided by 100, then 
rounded to the nearest 100) gives:

select round(st_npoints(geom)/100,-2) as point100, count(*)
from lancover_polygons_snap group by round(st_npoints(geom)/100,-2) 
order by point100 desc;

point100;count
26300;1
1500;1
1100;1
800;1
600;2
500;1
400;5
300;10
200;22
100;141
0;997846

The one with point100 = 26300 (about 2.63 million points) represents the 
developed land in the region (towns and cities including roads that connect them).

Mark also suggested counting interior rings.   

select round(ST_NumInteriorRings(st_geometryN(geom,1)),-1) as rings10, 
avg(st_numpoints(geom)) as avgpts,
count(*) as numfeatures
from lancover_polygons_snap group by 
round(ST_NumInteriorRings(st_geometryN(geom,1)),-1)
order by rings10 desc;

rings10;avgpts;numfeatures
43140;2629905;1
4230;147121;1
3800;110083;1
2480;76515;1
1760;50261;1
1720;55576;1
1680;64588;1
1350;41826;1
1130;38059;1
970;36425;1
950;37686;1
860;32952;1
850;34657;1
820;35285;2
810;24769;1
800;28113;1
760;30726;1
740;26296;1
720;24788;2
670;18874;1
660;22428;1
640;26258;1
630;27213;2
620;22056;2
610;15771;1
560;19942;2
540;23988;1
500;15763;2
480;16598;1
470;19865;2
410;15466;1
400;16023;1
380;13536;4
370;14646;3
360;14122;1
340;12910;1
330;13273;1
320;15222;2
300;13421;1
290;9072;3
280;10025;4
270;12505;4
260;11002;5
250;9843;3
240;9980;6
230;11456;3
220;9049;3
210;9194;6
200;8586;5
190;7089;5
180;6796;8
170;6913;8
160;6955;6
150;5911;10
140;5769;15
130;5377;13
120;5301;17
110;5246;17
100;4422;28
90;4332;28
80;3664;35
70;3292;43
60;2860;66
50;2236;74
40;1882;141
30;1493;215
20;993;618
10;445;3325
0;20;993265


--
John Abraham
j...@hbaspecto.com
403-232-1060


POSTGIS="2.0.3 r11128" GEOS="3.3.8-CAPI-1.7.8" PROJ="Rel. 4.8.0, 6 March 2012" 
GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.8" LIBJSON="UNKNOWN" (core 
procs from "2.0.3 r11132" need upgrade) TOPOLOGY (topology procs from "2.0.3 
r11132" need upgrade) RASTER (raster procs from "2.0.3 r11132" need upgrade)


On Feb 24, 2015, at 3:31 PM, Rémi Cura remi.c...@gmail.com wrote:

 Hey,
 I hijack this interesting discussion, sorry =)
 
 If you have one single massive polygon (no multi),
 you can generate a point table (keeping the path, of course) with ST_DumpPoints.
 
 Then you construct an index on this table.
 Then you perform your Intersection with indexed points, which will be very 
 fast.
 Then you reconstruct parts of polygon.
 It will require some advanced to very advanced SQL (like window functions) to 
 make it work, though.
 
 I think it would be faster than the other alternatives, because basically it 
 amounts to using the optimized R-tree index on points, vs. a custom and 
 non-optimized quadtree index on polygons.
 Moreover, when using a recursive strategy, you end up computing the same thing 
 a lot with GEOS.
 
 Cheers,
 Rémi-C
 
 2015-02-24 23:10 GMT+01:00 Mark Wynter m...@dimensionaledge.com:
 
  Thanks Mark.  I will indeed do the st_dump, but I don't think it will help 
  much (very few multis).  I think the tiling will help a lot.  What I 
  wonder, though, is how long it will take to tile?  After all, it's still an 
  st_intersection operation to tile each geometry against each tile.
 
 I’ve almost finished writing the tutorial - where I address many of these 
 points. The variables that affect performance are:
 * how you’ve written your ST_Intersection query
 * multi vs. non-multi
 * size and complexity of geoms
 * no. available CPUs (for parallelisation)
 * tile batch size - important!!!
 
 All strategies in combination may be necessary if your queries are taking 
 forever.
 
 For the demonstration dataset (a multi polygon representing the whole of 
 Australia), my tutorial tiling query incorporates ST_Intersection and 
 ST_Difference subqueries to produce tiled features representing land and 
 water.
 I achieved a 49x reduction in the query time to tile the whole of Australia, 
 starting with a single multi polygon.
 
 The more complex the query, the more significant this time saving is in 
 absolute terms.
 
  Is there a quad tree type tiling algorithm in a function?  If I do 256 x 
  256 tiles, doing it all at once would be 65536 operations of 
  st_intersection(complex_geom, square).  With a quad tree I'll only have 4 
  operations of st_intersection(complex_geom, square), and then 16 of 
  (a_little_less_complex_geom, square), and then 64 of 
  (even_less_complex_geom, square) and so on, for 8 nests.

Re: [postgis-users] ST_Intersection very slow

2015-02-24 Thread Rémi Cura
Hey,
I hijack this interesting discussion, sorry =)

If you have one single massive polygon (no multi),
you can generate a point table (keeping the path, of course) with ST_DumpPoints.

Then you construct an index on this table.
Then you perform your Intersection with indexed points, which will be very
fast.
Then you reconstruct parts of polygon.
It will require some advanced to very advanced SQL (like window functions)
to make it work, though.
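
A minimal sketch of the dump-and-index step, with hypothetical table names
big_poly(gid, geom) and other_layer(geom):

-- dump every vertex, keeping gid and path so the polygon can be rebuilt later
CREATE TABLE big_poly_points AS
SELECT gid,
       (dp).path AS path,   -- (ring index, vertex index): preserves order
       (dp).geom AS geom
FROM (SELECT gid, ST_DumpPoints(geom) AS dp FROM big_poly) AS sub;

CREATE INDEX big_poly_points_geom_idx ON big_poly_points USING gist (geom);

-- the cheap, index-backed part: which vertices fall inside the other layer
SELECT p.gid, p.path
FROM big_poly_points AS p
JOIN other_layer AS o ON ST_Intersects(p.geom, o.geom);

The reconstruction step (grouping runs of inside/outside vertices back into
rings) is the advanced-SQL part mentioned above and is not shown here.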

I think it would be faster than the other alternatives, because basically it
amounts to using the optimized R-tree index on points, vs. a custom and
non-optimized quadtree index on polygons.
Moreover, when using a recursive strategy, you end up computing the same
thing a lot with GEOS.

Cheers,
Rémi-C

2015-02-24 23:10 GMT+01:00 Mark Wynter m...@dimensionaledge.com:


  Thanks Mark.  I will indeed do the st_dump, but I don't think it will
 help much (very few multis).  I think the tiling will help a lot.  What I
 wonder, though, is how long it will take to tile?  After all, it's still an
 st_intersection operation to tile each geometry against each tile.

 I’ve almost finished writing the tutorial - where I address many of these
 points. The variables that affect performance are:
 * how you’ve written your ST_Intersection query
 * multi vs. non-multi
 * size and complexity of geoms
 * no. available CPUs (for parallelisation)
 * tile batch size - important!!!

 All strategies in combination may be necessary if your queries are taking
 forever.

 For the demonstration dataset (a multi polygon representing the whole of
 Australia), my tutorial tiling query incorporates ST_Intersection and
 ST_Difference subqueries to produce tiled features representing land and
 water.
 I achieved a 49x reduction in the query time to tile the whole of
 Australia, starting with a single multi polygon.

 The more complex the query, the more significant this time saving is in
 absolute terms.

  Is there a quad tree type tiling algorithm in a function?  If I do 256 x
 256 tiles, doing it all at once would be 65536 operations of
 st_intersection(complex_geom, square).  With a quad tree I'll only have 4
 operations of st_intersection(complex_geom, square), and then 16 of
 (a_little_less_complex_geom, square), and then 64 of
 (even_less_complex_geom, square) and so on, for 8 nests.  The last one will
 still be 65536 operations, but the complex geoms should be a lot simpler by
 the time I get down to that level.  What do you think, is this worth
 trying?  Or is intersecting with a square fairly simple regardless of how
 complex the other geometry is?

 I do have a SQL quadgrid tiling function - where a cell divides
 recursively subject to a maximum number of levels or “value thresholds” -
 but I’m not sure if that’s the right approach.

Re: [postgis-users] ST_Intersection very slow

2015-02-24 Thread Mark Wynter
The land cover data you’re working with - does it come as a raster?   We need a 
strategy for busting up that single, monolithic poly with lots of holes.
If it's a raster, then I think that tiling on import would be of huge benefit.
I’m just under the pump the next few days.


Re: [postgis-users] ST_Intersection very slow

2015-02-24 Thread Mark Wynter

 Thanks Mark.  I will indeed do the st_dump, but I don't think it will help 
 much (very few multis).  I think the tiling will help a lot.  What I wonder, 
 though, is how long it will take to tile?  After all, it's still an 
 st_intersection operation to tile each geometry against each tile.

I’ve almost finished writing the tutorial - where I address many of these 
points. The variables that affect performance are:
* how you’ve written your ST_Intersection query
* multi vs. non-multi
* size and complexity of geoms
* no. available CPUs (for parallelisation)
* tile batch size - important!!!

All strategies in combination may be necessary if your queries are taking 
forever.

For the demonstration dataset (a multi polygon representing the whole of 
Australia), my tutorial tiling query incorporates ST_Intersection and 
ST_Difference subqueries to produce tiled features representing land and water.
I achieved a 49x reduction in the query time to tile the whole of Australia, 
starting with a single multi polygon.  

The more complex the query, the more significant this time saving is in 
absolute terms.

 Is there a quad tree type tiling algorithm in a function?  If I do 256 x 256 
 tiles, doing it all at once would be 65536 operations of 
 st_intersection(complex_geom, square).  With a quad tree I'll only have 4 
 operations of st_intersection(complex_geom, square), and then 16 of 
 (a_little_less_complex_geom, square), and then 64 of (even_less_complex_geom, 
 square) and so on, for 8 nests.  The last one will still be 65536 operations, 
 but the complex geoms should be a lot simpler by the time I get down to that 
 level.  What do you think, is this worth trying?  Or is intersecting with a 
 square fairly simple regardless of how complex the other geometry is?

I do have a SQL quadgrid tiling function - where a cell divides recursively 
subject to a maximum number of levels or “value thresholds” - but I’m not sure 
if that’s the right approach.
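
For concreteness, a minimal recursive-CTE sketch of such a quadgrid (this is
not the actual function; it assumes a hypothetical point table pts(geom), a
hard-coded root extent in SRID 4326, a maximum depth of 8, and a threshold of
500 points per cell):

WITH RECURSIVE quadgrid AS (
    -- root cell: the overall extent (hard-coded here for the sketch)
    SELECT 1 AS depth, ST_MakeEnvelope(0, 0, 1024, 1024, 4326) AS cell
    UNION ALL
    -- split any cell still holding more than 500 points, down to depth 8;
    -- i = 0..3 selects the quadrant (i % 2 -> column, i / 2 -> row)
    SELECT q.depth + 1,
           ST_MakeEnvelope(
             ST_XMin(q.cell) + (i % 2) * (ST_XMax(q.cell) - ST_XMin(q.cell)) / 2,
             ST_YMin(q.cell) + (i / 2) * (ST_YMax(q.cell) - ST_YMin(q.cell)) / 2,
             ST_XMin(q.cell) + ((i % 2) + 1) * (ST_XMax(q.cell) - ST_XMin(q.cell)) / 2,
             ST_YMin(q.cell) + ((i / 2) + 1) * (ST_YMax(q.cell) - ST_YMin(q.cell)) / 2,
             4326)
    FROM quadgrid AS q, generate_series(0, 3) AS i
    WHERE q.depth < 8
      AND (SELECT count(*) FROM pts WHERE pts.geom && q.cell) > 500
)
-- keep only the leaves: cells at max depth, or cells that were not split
SELECT depth, cell
FROM quadgrid AS q
WHERE depth = 8
   OR (SELECT count(*) FROM pts WHERE pts.geom && q.cell) <= 500;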


[postgis-users] ST_Intersection very slow.

2015-02-19 Thread John Abraham
So I was running this query for 866000 s (10 days) before I decided to kill 
it:

create table taz_and_lancover_10_fast_2 as
SELECT p.lc_class, n.taz,
       CASE
         WHEN ST_CoveredBy(p.geom, n.the_geom)
         THEN p.geom
         ELSE ST_Multi(ST_Intersection(p.geom, n.the_geom))
       END AS geom
FROM lancover_polygons_snap AS p
INNER JOIN tazjan2 AS n
  ON (ST_Intersects(p.geom, n.the_geom)
      AND NOT ST_Touches(p.geom, n.the_geom));

Explain shows this, it's using the spatial index:

Nested Loop  (cost=0.00..310492.35 rows=483105 width=9973)
  Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom && 
n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom))))
  ->  Seq Scan on tazjan2 n  (cost=0.00..843.09 rows=2709 width=9493)
  ->  Index Scan using lancover_polygons_snap_geom_idx on 
lancover_polygons_snap p  (cost=0.00..21.67 rows=5 width=480)
        Index Cond: (geom && n.the_geom)

There are 2709 rows in tazjan2 and 998031 rows in lancover_polygons_snap, so I 
appreciate that it's a bit of a large problem.  But MapInfo was able to do it 
interactively in a few days and Geomedia was also able to do it in about a day. 
 

Both MapInfo and Geomedia ran out of memory (8GB machines) until the problem 
was broken into two regions (North and South), but postgresql seems to be 
chewing on the problem using only 400MB.  The interactive approach in MapInfo 
was to divide lancover_polygons_snap by lc_class, to further divide each region 
into about 10 subproblems.  Perhaps subprobleming this is the way to go?  Can't 
the query subproblem it based on the indices or would I have to do that 
manually?

One potential thing I've realized is that a few of the geometries in tazjan2 
are multipolygons, not single polygons.  But it's only a few.  There are a few 
very large and complex polygons in lancover_polygons_snap, but again, most of 
the 998031 are simple small polygons, about half would be ST_CoveredBy the 
polygons in tazjan2 and most of the rest would only overlap two or three of the 
polygons in tazjan2.

I must be doing something wrong.  Any hints?

I have
 max_connections set to 100 (currently only 11 connections active)
 work_mem was defaulting to 1MB, I just bumped it to 256MB (machine has 32GB 
but it's PostgreSQL 9.1 (x86), i.e. the 32-bit version)
 shared_buffers was 32MB, I am trying 512MB, could go higher
 effective_cache_size was defaulting to 128MB, I am trying 20GB.
 random_page_cost was 4.0, this is a VMWare virtual server.  I am trying 2.0
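
(For reference: work_mem, effective_cache_size and random_page_cost can be
tried per session without touching postgresql.conf, e.g.

SET work_mem = '256MB';
SET effective_cache_size = '20GB';
SET random_page_cost = 2.0;

shared_buffers and max_connections require a postgresql.conf edit and a
server restart.)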

I'm trying an explain analyze with just 5 rows of tazjan2, a 5 x 998031 
intersection problem with indices shouldn't take that long, should it?  It's 
been running for a little over an hour.  I'm wondering if the few complex 
polygons in lancover_polygons_snap are causing the slowness?  Is there some 
fast way to divide complex polygons, e.g. apply a 1km grid over them?  

PostGIS_full_version(): POSTGIS="2.0.3 r11128" GEOS="3.3.8-CAPI-1.7.8" 
PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" 
LIBXML="2.7.8" LIBJSON="UNKNOWN" (core procs from "2.0.3 r11132" need upgrade) 
TOPOLOGY (topology procs from "2.0.3 r11132" need upgrade) RASTER (raster procs 
from "2.0.3 r11132" need upgrade)

--
John Abraham



Re: [postgis-users] ST_Intersection very slow.

2015-02-19 Thread Rémi Cura
Hey,
you could try to not use CASE (so you separate the spatial join from the
processing, which is easy to parallelize (assuming you have more than one
core)).

First generate the table with

CREATE TABLE spatial_mapping_between_lancover_and_taz AS
SELECT row_number() over() as row_id,
       p.your_primary_key AS p_key, n.your_primary_key AS n_key
 FROM lancover_polygons_snap AS p
   INNER JOIN tazjan2 AS n
ON (ST_Intersects(p.geom, n.the_geom)
  AND NOT ST_Touches(p.geom, n.the_geom) );


Then when you have your table, you can do your processing (put it into a SQL
function, it is cleaner),
and you parallelize on the row_id.

Basically you get min(row_id) and max(row_id), you separate the range into N
parts, then execute the parts with your K processes.
(Both bash and python have utilities for this.)
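
A hypothetical worker query for one such part (p_key/n_key as aliased above);
each of the K processes would run it via psql with its own range, e.g.
psql -v lo=1 -v hi=250000 -f worker.sql:

-- process one row_id slice of the precomputed pair table
SELECT m.row_id,
       ST_Multi(ST_Intersection(p.geom, n.the_geom)) AS geom
FROM spatial_mapping_between_lancover_and_taz AS m
JOIN lancover_polygons_snap AS p ON p.your_primary_key = m.p_key
JOIN tazjan2 AS n ON n.your_primary_key = m.n_key
WHERE m.row_id >= :lo AND m.row_id < :hi;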

Of course if it is actually the spatial join that is slow, you can also
parallelise this
(by cutting lancover into N pieces, for instance).
Cheers,
Rémi-C

2015-02-19 19:45 GMT+01:00 John Abraham j...@hbaspecto.com:

 So I was running this query for 866000 s (10 days) before I decided to
 kill it:

 create table taz_and_lancover_10_fast_2 as
 SELECT p.lc_class, n.taz,
        CASE
          WHEN ST_CoveredBy(p.geom, n.the_geom)
          THEN p.geom
          ELSE ST_Multi(ST_Intersection(p.geom, n.the_geom))
        END AS geom
 FROM lancover_polygons_snap AS p
 INNER JOIN tazjan2 AS n
   ON (ST_Intersects(p.geom, n.the_geom)
       AND NOT ST_Touches(p.geom, n.the_geom));

 Explain shows this, it's using the spatial index:

 Nested Loop  (cost=0.00..310492.35 rows=483105 width=9973)
   Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom &&
 n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom))))
   ->  Seq Scan on tazjan2 n  (cost=0.00..843.09 rows=2709 width=9493)
   ->  Index Scan using lancover_polygons_snap_geom_idx on
 lancover_polygons_snap p  (cost=0.00..21.67 rows=5 width=480)
       Index Cond: (geom && n.the_geom)

 There are 2709 rows in tazjan2 and 998031 rows in lancover_polygons_snap,
 so I appreciate that it's a bit of a large problem.  But MapInfo was able
 to do it interactively in a few days and Geomedia was also able to do it in
 about a day.

 Both MapInfo and Geomedia ran out of memory (8GB machines) until the
 problem was broken into two regions (North and South), but postgresql seems
 to be chewing on the problem using only 400MB.  The interactive approach in
 MapInfo was to divide lancover_polygons_snap by lc_class, to further divide
 each region into about 10 subproblems.  Perhaps subprobleming this is the
 way to go?  Can't the query subproblem it based on the indices or would I
 have to do that manually?

 One potential thing I've realized is that a few of the geometries in
 tazjan2 are multipolygons, not single polygons.  But it's only a few.
 There are a few very large and complex polygons in lancover_polygons_snap,
 but again, most of the 998031 are simple small polygons, about half would
 be ST_CoveredBy the polygons in tazjan2 and most of the rest would only
 overlap two or three of the polygons in tazjan2.

 I must be doing something wrong.  Any hints?

 I have
  max_connections set to 100 (currently only 11 connections active)
  work_mem was defaulting to 1MB, I just bumped it to 256MB (machine has
 32GB but it's PostgreSQL 9.1 (x86), i.e. the 32-bit version)
  shared_buffers was 32MB, I am trying 512MB, could go higher
  effective_cache_size was defaulting to 128MB, I am trying 20GB.
  random_page_cost was 4.0, this is VMWare virtual server.  I am trying 2.0

 I'm trying an explain analyze with just 5 rows of tazjan2, a 5 x 998031
 intersection problem with indices shouldn't take that long, should it?
 It's been running for a little over an hour.  I'm wondering if the few
 complex polygons in lancover_polygons_snap are causing the slowness?  Is
 there some fast way to divide complex polygons, e.g. apply a 1km grid over
 them?

 PostGIS_full_version(): POSTGIS="2.0.3 r11128" GEOS="3.3.8-CAPI-1.7.8"
 PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08"
 LIBXML="2.7.8" LIBJSON="UNKNOWN" (core procs from "2.0.3 r11132" need
 upgrade) TOPOLOGY (topology procs from "2.0.3 r11132" need upgrade) RASTER
 (raster procs from "2.0.3 r11132" need upgrade)

 --
 John Abraham




Re: [postgis-users] ST_Intersection very slow.

2015-02-19 Thread Andre Mano
 I will write up a tutorial explaining the benefits of vector tiling in
 PostGIS, with examples and parallelisation code patterns.  If not today,
 hopefully over the next week.

 Would be nice if you have the time!

Best,

Andre

On Fri, Feb 20, 2015 at 2:52 AM, Mark Wynter m...@dimensionaledge.com
wrote:


 So I was running this query for 866000 s (10 days) before I decided to
 kill it:



 One potential thing I've realized is that a few of the geometries in
 tazjan2 are multipolygons, not single polygons.  But it's only a few.
 There are a few very large and complex polygons in lancover_polygons_snap,
 but again, most of the 998031 are simple small polygons, about half would
 be ST_CoveredBy the polygons in tazjan2 and most of the rest would only
 overlap two or three of the polygons in tazjan2.



 A common offender - the multipolygons.

 You can get an immediate performance improvement by dumping
 (ST_Dump(wkb_geometry)).geom into a new table, with a new serial id, but
 retaining the original poly_id so you don’t lose the multipolygon
 provenance of each polygon.
 Complex operations on the multipolygon table, like ST_Intersection(), are
 extremely slow, because the query has to access and compute the operation
 on each of the geometry elements contained within the multi.

 Another massive performance gain can be achieved by “Tiling” your dumped
 polygon table - which has the effect of breaking your geometries into
 smaller features, which you can then “union” back together again, after
 running ST_Intersection on your vector tiles.  Don’t try to tile your multi
 polygon table - you’ll still be waiting days.

 Take the coastline of Australia - a single multi polygon

 SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as
 num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11;
  num_rows | num_geoms | num_points
 ----------+-----------+------------
         1 |      6718 |    1622598

 If you need to query the intersection of a polygon with land, and the
 water, it's near impossible.  Many, many days.

 Hence we need to take a different approach.

 The first thing you can do is to dump the multi polygon table.
 Then create a regular vector grid.
 Then tile both your polygon tables using the vector grid.
 Put indexes on the resultant tiled tables.
 Then run your ST_Intersects.
 The result is many more geometries, and points, but now we get the benefit
 of geometry indexes.

 As a result of tiling the Australian coastline, we now get

 SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as
 num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled;
  num_rows | num_geoms | num_points
 ----------+-----------+------------
     17222 |     29020 |    6513770

 Each geometry also has a tile_id, and the original poly_id.  The tile_ids
 are really useful for subsetting your queries, and for using tile_ids as
 an additional join condition.  At any time you can rapidly rebuild your
 original geometries by doing ST_Union() GROUP BY poly_id.

 Using the approach of dumping and tiling, queries that once took days now
 take minutes at most.  And as Remi mentioned, you can parallelise the
 process.  The concept of vector tiling in PostGIS is analogous to Map
 Reduce and assigning Key Value Pairs.  It's about chunking your data, doing
 lots of fast operations on the smaller bits, and then consolidating your
 outputs.  You don’t necessarily have to write a SQL function to do the SQL
 parallelisation.   My preference is Bash and GNU Parallel because you can
 write vanilla SQL and execute it via psql in parallel.   For me it's now
 about the speed at which I can write the SQL.

 So we’re now starting to apply Big Data concepts within PostGIS….Holy
 smoke… and you can apply the same concepts to a wide range of PostGIS
 operations - scale up, scale out….

 I will write up a tutorial explaining the benefits of vector tiling in
 PostGIS, with examples and parallelisation code patterns.  If not today,
 hopefully over the next week.

 Mark







-- 
..
André Mano
http://opussig.blogspot.com/

[postgis-users] ST_Intersection very slow.

2015-02-19 Thread Mark Wynter

 So I was running this query for 866000 s (10 days) before I decided to 
 kill it:


 One potential thing I've realized is that a few of the geometries in tazjan2 
 are multipolygons, not single polygons.  But it's only a few.  There are a 
 few very large and complex polygons in lancover_polygons_snap, but again, 
 most of the 998031 are simple small polygons, about half would be 
 ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap 
 two or three of the polygons in tazjan2.


A common offender - the multipolygons.

You can get an immediate performance improvement by dumping 
(ST_Dump(wkb_geometry)).geom into a new table, with a new serial id, but 
retaining the original poly_id so you don’t lose the multipolygon provenance 
of each polygon.
Complex operations on the multipolygon table, like ST_Intersection(), are 
extremely slow, because the query has to access and compute the operation on 
each of the geometry elements contained within the multi.

Another massive performance gain can be achieved by “Tiling” your dumped 
polygon table - which has the effect of breaking your geometries into smaller 
features, which you can then “union” back together again, after running 
ST_Intersection on your vector tiles.  Don’t try to tile your multi polygon 
table - you’ll still be waiting days.

Take the coastline of Australia - a single multi polygon

SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as 
num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11;
 num_rows | num_geoms | num_points 
----------+-----------+------------
        1 |      6718 |    1622598

If you need to query the intersection of a polygon with land, and the water, 
it's near impossible.  Many, many days.

Hence we need to take a different approach.

The first thing you can do is to dump the multi polygon table.
Then create a regular vector grid.
Then tile both your polygon tables using the vector grid.
Put indexes on the resultant tiled tables.
Then run your ST_Intersects.
The result is many more geometries, and points, but now we get the benefit of 
geometry indexes.
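
A condensed sketch of those five steps, with hypothetical names throughout
(polys_dumped, grid, the 1-degree cell size and the rough lon/lat extent are
illustrative, not the actual tutorial code):

-- 1. dump the multipolygon table, keeping provenance
CREATE TABLE polys_dumped AS
SELECT row_number() OVER () AS gid, poly_id, geom
FROM (SELECT ogc_fid AS poly_id,
             (ST_Dump(wkb_geometry)).geom AS geom
      FROM abs_aus11) AS d;

-- 2. a regular vector grid of 1-degree cells over the extent
CREATE TABLE grid AS
SELECT row_number() OVER () AS tile_id,
       ST_MakeEnvelope(x, y, x + 1, y + 1, 4326) AS geom
FROM generate_series(112, 153) AS x,
     generate_series(-44, -10) AS y;

-- 3. tile the polygons against the grid (the step worth batching in parallel)
CREATE TABLE polys_tiled AS
SELECT g.tile_id, p.poly_id,
       ST_Intersection(p.geom, g.geom) AS geom
FROM polys_dumped AS p
JOIN grid AS g ON ST_Intersects(p.geom, g.geom);

-- 4. index the tiled table before querying it
CREATE INDEX polys_tiled_geom_idx ON polys_tiled USING gist (geom);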

As a result of tiling the Australian coastline, we now get

SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as 
num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled;
 num_rows | num_geoms | num_points 
----------+-----------+------------
    17222 |     29020 |    6513770

Each geometry also has a tile_id, and the original poly_id.  The tile_ids are 
really useful for subsetting your queries, and for using tile_ids as an 
additional join condition.  At any time you can rapidly rebuild your original 
geometries by doing ST_Union() GROUP BY poly_id, as in the example below.
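
For instance, with the sketched polys_tiled table above:

-- rebuild the original polygons from their tiles
SELECT poly_id, ST_Union(geom) AS geom
FROM polys_tiled
GROUP BY poly_id;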

Using the approach of dumping and tiling, queries that once took days now take 
minutes at most.  And as Remi mentioned, you can parallelise the process.  The 
concept of vector tiling in PostGIS is analogous to Map Reduce and assigning 
Key Value Pairs.  It's about chunking your data, doing lots of fast operations 
on the smaller bits, and then consolidating your outputs.  You don’t 
necessarily have to write a SQL function to do the SQL parallelisation.   My 
preference is Bash and GNU Parallel because you can write vanilla SQL and 
execute it via psql in parallel.   For me it's now about the speed at which I 
can write the SQL.

So we’re now starting to apply Big Data concepts within PostGIS….Holy 
smoke… and you can apply the same concepts to a wide range of PostGIS 
operations - scale up, scale out….

I will write up a tutorial explaining the benefits of vector tiling in PostGIS, 
with examples and parallelisation code patterns.  If not today, hopefully over 
the next week.  

Mark



Re: [postgis-users] ST_Intersection very slow.

2015-02-19 Thread John Abraham
Thanks for the hint, Rémi.  I'll give it a try, to see if the st_intersection 
is slow (as compared to just st_intersects).

Here's my EXPLAIN ANALYZE for my 5 zone test:

Nested Loop  (cost=1.56..80.91 rows=8 width=560) (actual 
time=3929258.738..19756123.479 rows=29 loops=1)
  Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom && 
n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom))))
  CTE sometaz
    ->  Limit  (cost=0.00..1.56 rows=5 width=9537) (actual time=0.433..0.576 
rows=5 loops=1)
          ->  Seq Scan on tazjan2  (cost=0.00..843.09 rows=2709 width=9537) 
(actual time=0.415..0.511 rows=5 loops=1)
  ->  CTE Scan on sometaz n  (cost=0.00..0.10 rows=5 width=80) (actual 
time=0.460..0.724 rows=5 loops=1)
  ->  Index Scan using lancover_polygons_snap_geom_idx on 
lancover_polygons_snap p  (cost=0.00..12.47 rows=5 width=480) (actual 
time=0.242..65.640 rows=16 loops=5)
        Index Cond: (geom && n.the_geom)
Total runtime: 19756143.935 ms

Here it is explained in layspeak: http://explain.depesz.com/s/dbVi

So, am I reading this right, that it took 5.5 hours to st_intersects 5 
(multi)polygons with 29 (multi)polygons, and find the 8 that intersected?  That 
the index scan was fast (328ms), and returned only a few (16, or maybe 29) of 
the 998031 polygons, but that the actual intersection afterwards, between just 
a few polygons, was so very slow?

(The query was this:

with sometaz as (select * from tazjan2 limit 5)
SELECT p.lc_class, n.taz,
       CASE
         WHEN ST_CoveredBy(p.geom, n.the_geom)
         THEN p.geom
         ELSE ST_Multi(ST_Intersection(p.geom, n.the_geom))
       END AS geom
FROM lancover_polygons_snap AS p
INNER JOIN sometaz AS n
  ON (ST_Intersects(p.geom, n.the_geom)
      AND NOT ST_Touches(p.geom, n.the_geom));

)

--
John Abraham
j...@hbaspecto.com
403-232-1060

On Feb 19, 2015, at 11:56 AM, Rémi Cura remi.c...@gmail.com wrote:

 Hey,
 you could try to not use CASE (so you separate the spatial join from the 
 processing, which is easy to parallelize (assuming you have more than one 
 core)).
 
 First generate the table with  
 
 CREATE TABLE spatial_mapping_between_lancover_and_taz AS
 SELECT row_number() over() as row_id,
        p.your_primary_key AS p_key, n.your_primary_key AS n_key
  FROM lancover_polygons_snap AS p
    INNER JOIN tazjan2 AS n
 ON (ST_Intersects(p.geom, n.the_geom)
   AND NOT ST_Touches(p.geom, n.the_geom) );
 
 
 Then when you have your table, you can do your processing (put it into a SQL 
 function, it is cleaner),
 and you parallelize on the row_id.
 
 Basically you get min(row_id) and max(row_id), you separate the range into N 
 parts, then execute the parts with your K processes.
 (Both bash and python have utilities for this.)
 
 Of course if it is actually the spatial join that is slow, you can also 
 parallelise this
 (by cutting lancover into N pieces, for instance).
 Cheers,
 Rémi-C
 
 2015-02-19 19:45 GMT+01:00 John Abraham j...@hbaspecto.com:
 So I was running this query for 866000 s (10 days) before I decided to 
 kill it:
 
 create table taz_and_lancover_10_fast_2 as
 SELECT p.lc_class, n.taz,
        CASE
          WHEN ST_CoveredBy(p.geom, n.the_geom)
          THEN p.geom
          ELSE ST_Multi(ST_Intersection(p.geom, n.the_geom))
        END AS geom
 FROM lancover_polygons_snap AS p
 INNER JOIN tazjan2 AS n
   ON (ST_Intersects(p.geom, n.the_geom)
       AND NOT ST_Touches(p.geom, n.the_geom));
 
 Explain shows this, it's using the spatial index:
 
 Nested Loop  (cost=0.00..310492.35 rows=483105 width=9973)
   Join Filter: (_st_intersects(p.geom, n.the_geom) AND ((NOT (p.geom && 
 n.the_geom)) OR (NOT _st_touches(p.geom, n.the_geom))))
   ->  Seq Scan on tazjan2 n  (cost=0.00..843.09 rows=2709 width=9493)
   ->  Index Scan using lancover_polygons_snap_geom_idx on 
 lancover_polygons_snap p  (cost=0.00..21.67 rows=5 width=480)
       Index Cond: (geom && n.the_geom)
 
 There are 2709 rows in tazjan2 and 998031 rows in lancover_polygons_snap, so 
 I appreciate that it's a bit of a large problem.  But MapInfo was able to do 
 it interactively in a few days and Geomedia was also able to do it in about a 
 day.  
 
 Both MapInfo and Geomedia ran out of memory (8GB machines) until the problem 
 was broken into two regions (North and South), but postgresql seems to be 
 chewing on the problem using only 400MB.  The interactive approach in MapInfo 
 was to divide lancover_polygons_snap by lc_class, to further divide each 
 region into about 10 subproblems.  Perhaps subprobleming this is the way to 
 go?  Can't the query subproblem it based on the indices or would I have to do 
 that manually?
 
 One potential thing I've realized is that a few of the geometries in tazjan2 
 are multipolygons, not single polygons.  But it's only a few.  There are a 
 few very large and complex polygons in lancover_polygons_snap, but again, 
 most of the 998031 are simple small polygons, about half would be 
 ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap 
 two or three of the polygons in tazjan2.