Re: [postgis-users] Setting multiple columns in one UPDATE request - Additional results for large polgyons.

Marco Boeringa Wed, 30 Sep 2020 00:04:41 -0700

Hi Regina,

I now did a second test with two datasets of forest polygons, to get anindication if the subquery has any measurable positive effect on such adataset, contrary to the previously shown result for a dataset primarilyconsisting of low vertex count (<200) polygons (results included in theolder posts attached below).

The first dataset has records with geometries of up to about 100kvertices, the second dataset is a generalized one with all polygonssubdivided to a maximum of 5000 vertices.

The results are included below. As you can see, using a subquery stilldoes not lead to a real measurable performance gain, neither for thedataset with humongous big polygons, nor the one with subdividedpolygons containing a large set of 5000 max vertices polygons. Only thesingle threaded result on the first dataset with max 100k vertices perpolygon, appears to have a tiny measurable gain, but considering theslight variability of timing results at each run even with the samesettings, I am pretty sure even this is not significant.

What is significant compared to the previous results of a dataset withsmall polygons (majority < 200 vertices per polygon), is that themulti-threaded processing is considerably faster than the singlethreaded processing. So processing is clearly CPU limited for theselarger polygons, but whether that is really in the PostGIS calculations,or overhead of PostgreSQL needing to de-toast large polygons, I don't know.

It does appear though, that PostgreSQL somehow already manages to"optimize" the calculations reusing calculated values, unless the costof ST_Area and ST_Perimeter calculations on really large polygons isstill only a fraction of the reading and writing of the geometries, inwhich case the results of the query processing simply do not show theinherent cost difference of the different queries.


Note though, that although I showed:

UPDATE <MY_TABLE> SET area = ST_Area(<GEOMETRY_COLUMN>), area_perimeter= ST_Area(<GEOMETRY_COLUMN>) / ST_Perimeter(<GEOMETRY_COLUMN>)

as an example query to illustrate the problem, the timing results areactually for an UPDATE statement setting 4 columns that all need'ST_Area()' as part of the calculation. So in reality, the potentialbenefits of the rewriting of the query, should have been even biggerthan for the example query above, as the polygon's area is needed fourtimes.


Marco


*** Dataset with very large polygon geometries (Max 100k vertices) *********

- Single-threaded using ORIGINAL QUERY: 32m05s

- Single-threaded using SUBQUERY (FROM): 31m52s

- Multi-threaded using ORIGINAL QUERY: 18m16s

- Multi-threaded using SUBQUERY (FROM): 18m46s

*******************************************************

*** Dataset SUBDIVIDED to polygons with max 5000 vertices! *********

- Single-threaded using ORIGINAL QUERY: 11m01s

- Single-threaded using SUBQUERY (FROM): 11m11s

- Multi-threaded using ORIGINAL QUERY: 5m37s

- Multi-threaded using SUBQUERY (FROM): 5m46s

*******************************************************


Op 28-9-2020 om 18:28 schreef Regina Obe:

ST_Area and ST_Perimeter functions are relatively low cost, so thatfact you discovered is not surprising.
I think if you were doing something like ST_Distance then the CTE orsubquery would be more efficient than your direct solution.
*From:*postgis-users [mailto:postgis-users-boun...@lists.osgeo.org]*On Behalf Of *Marco Boeringa
*Sent:* Monday, September 28, 2020 10:15 AM
*To:* postgis-users@lists.osgeo.org
*Subject:* Re: [postgis-users] Setting multiple columns in one UPDATErequest
Hi Regina,

I can now partially answer my question about performance myself:
It turns out that for datasets having relatively small geometries (interms of number of vertices, not area, e.g. a few dozen to a fewhundred vertices maximum) there is actually *NO* benefit at all ofrewriting the query either with a WITH (CTE) or FROM (Subquery). Thismay be different though for other datasets having much largergeometries, but needs further testing.
In fact, processing is marginally slower, but only by 5-10% or so,compared to the original query.
In my setup, I can also run the query both in a single thread, orusing a custom Python multi-threaded implementation sending SQLstatements in parallel to PostgreSQL. Since the test system has a verylimited 4 core multi-threaded processor, the benefits of themulti-threading versus single threaded processing in this case arenil, obviously due to the overhead of the multi-threading. Themulti-threaded application is as fast as the single threadedPostgreSQL worker, or even a bit slower, but puts a far higher load onthe processor. Of course, with a more modern processor with high corecount, this experience likely changes.
There also appears to be virtually no difference between using a CTEor the subquery as you suggested: subquery is only very marginallyfaster than CTE.
So for datasets having small geometries, just sticking to the originalquery like:
UPDATE <MY_TABLE> SET area = ST_Area(<GEOMETRY_COLUMN>),area_perimeter = ST_Area(<GEOMETRY_COLUMN>) /ST_Perimeter(<GEOMETRY_COLUMN>)
is fine for those datasets.
I think this result is caused by the fact that the retrieving andstoring overhead of the geometries (tables stored on SSD), is simplyfar bigger than the actual cost of calculating the area or perimeterfor such datasets where the majority of geometries is of very limitedsize (e.g. OSM buildings, simple landuse polygons). Additionally,there may be an extra cost due to the needed join for the CTE andsubquery statements. Lastly, the cost of running ST_Area andST_Perimeter may just be to low as well. There may be other functionsin PostGIS with a much higher computational cost that would show abenefit from rewriting the query.
I will attempt to run a second benchmark using a dataset with muchlarger geometries though (some with well over > 10k vertices), to seeif that gives the same result, and report back. There may be adifference, but we will see...
Marco

*** Dataset with small geometries (most < 200 vertices) *********

- Single-threaded using ORIGINAL QUERY: 8m45s

- Single-threaded using SUBQUERY (FROM): 8m52s

- Single-threaded using CTE (WITH): 9m13s

- Multi-threaded using ORIGINAL QUERY: 9m27s

- Multi-threaded using SUBQUERY (FROM): 9m44s

- Multi-threaded using CTE (WITH): 9m50s

*******************************************************

Op 28-9-2020 om 09:36 schreef Marco Boeringa:

    Regina,

    Thanks for your suggestion.

    How is this performance wise? Is not using a CTE as in your
    suggestion, supposedly faster than with using a CTE, or is this
    just a syntax thing and performance is expected to be equal?

    It would still be nice though, if PostgreSQL somehow handled this
    automatically, and one could use the most basic form yet be sure
    it was optimized. It also reads more easily to just see:

    UPDATE <MY_TABLE> SET area = ST_Area(<GEOMETRY_COLUMN>),
    area_perimeter = ST_Area(<GEOMETRY_COLUMN>) /
    ST_Perimeter(<GEOMETRY_COLUMN>)

    in your code, instead of more elaborate construct involving a join.

    Marco

    Op 28-9-2020 om 03:26 schreef Regina Obe:

        I prefer doing it in the FROM and not bothering using a CTE.

        So something like

        UPDATE <MY_TABLE> SET area = f.area, area_perimeter = f.area/f.perimeter

        FROM (SELECT id, ST_Area(<GEOMETRY_COLUMN>) AS area, 
ST_Perimeter(<GEOMETRY COLUMN>) AS perimeter

                 FROM <MY TABLE> ) AS f

        WHERE f.id = <MY TABLE>.id;
_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

Re: [postgis-users] Setting multiple columns in one UPDATE request - Additional results for large polgyons.

Reply via email to