We see significant performance boost when clustering a large PostGIS
table on the spatial index. This will rewrite the table in spatial index
order.
CLUSTER <table> USING <spatial_index>
My understanding is that you need to periodically recluster the table
when updated, which is an atomic exclusive rewrite of the complete table
i.e. will take time.
You may ask the Postgre team about this, e.g. Magnus Hagander.
You may ask Sandro Santilli (PostGIS topology) about topological aspects.
It would probably be possible to get substantial gains on existing
formats that have spatial indexing by making them more
optimal/balanced however I see the following major issues:
* Not sure how by convention deal with non-updatable indexes
(packing/sorting on hilbert or sort-tile-recursive seem inherently
static) on formats that are assumed to be mutable
* GeoPackage cannot benefit from clustering on spatial index (so it
will not be possible to "cloud" optimize)
* Shapefile has several other drawbacks as a format perhaps making it
less interesting to invest into
I have wondered several times about these things in relation to
PostgreSQL and PostGIS... first thought is why there is no way to mark
a table static/immutable in PostgreSQL, it seems to be there would be
several general optimization possibilities if it was known that a
table cannot be changed, not the least that it would make it possible
to construct similar packed R-trees as discussed above. Second thought
is how the R-tree is constructed/implemented via gist index (I have
not had the brain power/interest to dive into the source to understand
it yet) - unless it can be static it should probably optimally be
a R*-tree with topological split but I've got indications that it is not.
/Björn
Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski
<m...@komzpa.net <mailto:m...@komzpa.net>>:
Hi Björn,
Would it make sense to implement it not as a separate format, but
as a special convention on top of any of already existing ones?
What happens if you sort by Hilbert curve a shapefile or
geopackage, and tweak the internals to build indexes in flattish
manner?
For example, here is a branch of Postgres where GiST index is
being built in flat manner. \
It does not require any change on user's side, but gets you all
the benefits by plainly building 10x faster.
https://github.com/glukhovn/postgres/tree/gist_btree_build
On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell
<bjorn.harrt...@gmail.com <mailto:bjorn.harrt...@gmail.com>> wrote:
I've now implemented so that the traversal of the R-tree reads
nodes on demand and as suspected it gives a significant boost:
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real0m0,378s (~ 20% improvement)
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb
gis_osm_buildings_a_free_1_single; done
real0m0,790s (~ 30% improvement)
I'm now also faster on the reduced window (56 10.1 56.1)
beating gpkg:
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.gpkg
gis_osm_buildings_a_free_1_single; done
real0m0,305s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.fgb
gis_osm_buildings_a_free_1_single; done
real0m0,242s
It still remains a TODO to further optimize I/O to read larger
blocks if possible (applies both to R-tree nodes and the
feature data itself) and I think that could prove a very
significant performance boost potentially leaving any
competition in the dust. :D
/Björn
Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell
<bjorn.harrt...@gmail.com <mailto:bjorn.harrt...@gmail.com>>:
I've now implemented spatial filtering (at GDAL fork
https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and
results are promising, for two test cases about 3,5-4
times faster than shapefile with the same content. I've
used these command lines to measure total time over 10
iterations for each format and dataset:
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
real0m1,940s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
real0m0,750s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real0m0,461s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_buildings_a_free_1_single.shp
gis_osm_buildings_a_free_1_single; done
real0m4,381s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_buildings_a_free_1_single.gpkg
gis_osm_buildings_a_free_1_single; done
real0m1,624s
> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_buildings_a_free_1_single.fgb
gis_osm_buildings_a_free_1_single; done
real0m1,113s
The bounds 10 56 10.2 56.2 contains 21525 out of the
812547 total features for gis_osm_roads_free_1 and 57699
out of 2027890 features for gis_osm_buildings_a_free_1_single.
And there is definitely room for optimization. The
implementation is naive and reads the whole R-tree in
memory and does not yet read consecutive features (common
case as the R-tree is balanced) in larger I/O blocks which
I think can be significant. If I reduce the extent to 10
56 10.1 56.1 GPKG is actually fair bit faster which I
think is due to it being able to read the index partially.
/Björn
Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell
<bjorn.harrt...@gmail.com <mailto:bjorn.harrt...@gmail.com>>:
After posting about my experimental format I realized
that I lack numbers on the potential performance, so I
tried to make some more or less scientific measuring.
The results was disappointing, reaching similar
performance as shapefile for full sequential reads and
so I lost interest for a while. But recently I found
out that my method of measuring was flawed - I
measured the time to read into a new shapefile using
ogr2ogr, but it turns out that can be quite unfair due
to the string field length dynamic resize that ogr2ogr
will do if the strings from input feature attributes
are of variable length. I've now made some new
measurements using ogrinfo and the undocumented flag
-qq to get a full sequential read.
I've used the shapefiles "gis_osm_buildings_a_free_1"
(exploded to single poly) and "gis_osm_roads_free_1"
from Denmark at
http://download.geofabrik.de/europe.html as the
source, converted to respective format, then measured
the time it takes to run ogrinfo -qq on them. Results
(average over three runs):
## gis_osm_buildings_a_free_1 (2027890 Polygons)
* SHP: 3800 ms
* GPKG: 2700 ms
* FlatGeobuf: 2200 ms
## gis_osm_roads_free_1 (812547 LineStrings)
* SHP: 1600 ms
* GPKG: 1350 ms
* FlatGeobuf: 800 ms
With that my hope and interest is rekindled. I believe
I can fare even better against the competition with
with spatially filtered searches, will hopefully soon
have some results on that too.
/Björn
Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell
<bjorn.harrt...@gmail.com
<mailto:bjorn.harrt...@gmail.com>>:
Hi GDAL/OGR folks,
In my spare time I've been working on a vector
file format called FlatGeobuf (tentatively). The
main reason, besides curiosity and learning, I'm
putting time into it is that I think shapefile
still rules the read/query static data performance
game, which is kind of sad, and probably one of
the reasons it is still so widely popular. Also,
the main competitor (GeoPackage) isn't suitable
for streaming access (AFAIK) which shapefiles also
handles surprisingly (?) well.
By using a performance focused write once binary
encoding (flatbuffers), schema constraint and
focusing on read/query performance by clustering
on an optimal spatial index (Packed Hilbert
R-Tree) I hope to be able to beat shapefile
performance and at the same time be optimal for
streaming/cloud access.
I think I'm starting to get somewhere, more
details and source is at
https://github.com/bjornharrtell/flatgeobuf and I
have an early proof of concept driver
implementation at
https://github.com/bjornharrtell/gdal/tree/flatgeobuf and
results are already quite promising - it can do
roundtrip read/write and is already quite a bit
faster than shapefile. I also have implemented
naive read only QGIS provider for experimental
purposes.
Basically I'm fishing for interest in this effort,
hoping that others will see potential in it even
if it's "yet another format" and far from
final/stable yet. Any feedback is welcome. As I
see it, GDAL is a good place for a format
specification and reference implementation to
incubate.
Some additional food for thought that I've had
during the experimentation:
1. The main in memory representations of
geometry/coordinates seem to be either separate
arrays per dimension (GDAL (partially?) and QGIS)
or a single array with dimension as stride. I've
chosen the latter as of yet but I'm still a bit
undecided. There is of course a substantial
involved in transforming between the two
representations so the situation with two
competing models is a bit unfortunate. I'm also
not sure about which of these models are
objectively "better" than the other?
2. One could get better space efficiency with
protobuf instead of flatbuffers, but it has a
performance cost. I have not gone far into
investigating how much though and one could even
reason about supporting both these encodings in a
single format specification but it's taking it a
bit far I think.
3. Is there a reason to support different packing
strategies for the R-Tree or is Packed Hilbert a
good compromise (besides it being beautifully
simple to implement)?
4. FlatGeobuf is perhaps a too technical name, not
catchy enough and has a bad/boring abbreviation.
Other candidates I've considered are OpenGeoFile,
OpenVectorFile or OpenSpatialFile but I'm
undecided. Any ideas? :)
Regards,
Björn Harrtell
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org <mailto:gdal-dev@lists.osgeo.org>
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
Hälsningar
Andreas Oxenstierna
T-Kartor Geospatial AB
Olof Mohlins väg 12 Kristianstad
mobile: +46 733 206831
mailto: a...@t-kartor.se
http://www.t-kartor.com
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev