Den mån 11 mars 2019 kl 15:07 skrev Darafei "Komяpa" Praliaskouski < m...@komzpa.net>:
> > > On Mon, Mar 11, 2019 at 1:08 AM Björn Harrtell <bjorn.harrt...@gmail.com> > wrote: > >> It would probably be possible to get substantial gains on existing >> formats that have spatial indexing by making them more optimal/balanced >> however I see the following major issues: >> >> * Not sure how by convention deal with non-updatable indexes >> (packing/sorting on hilbert or sort-tile-recursive seem inherently static) >> on formats that are assumed to be mutable >> > > Why does Hilbert indexing sets a requirement for it to be non-updatable? > > If it's packed into same tree format, it's going to be updatable just the > same as the previous version. Yeah, it's going to degrade a bit on changes, > but if the tool to put sorting back in place is available then it's not a > big issue? > I'm not entirely sure but it seems to me that R-tree algorithms are divided into static or dynamic ones for good reasons. The static ones have a predictable structure with near 100% space utilization and I believe assumptions can be used for performance in the implementation as long as the structure is kept intact. Dynamic ones are typically more complex because they need a strategy for tree insertion/balancing. See https://github.com/mourner/rbush and https://github.com/mourner/flatbush for contemporary and minimal implementations of the dynamic / static cases also with reference to the algorithms used. > > >> * GeoPackage cannot benefit from clustering on spatial index (so it will >> not be possible to "cloud" optimize) >> > > Why? What should it get for it to be possible? > I'm taking Evens word for it (see earlier in this thread) but I shouldn't have written cannot, he wrote it's likely possible but as I interpreted it quite far from trivial. When Even writes something is not trivial, I instinctively think it's near impossible. :) > * Shapefile has several other drawbacks as a format perhaps making it less >> interesting to invest into >> > > Its killer feature is universal support. > Agreed, but I'm saying I don't want to see that being the case forever? :) > I have wondered several times about these things in relation to PostgreSQL >> and PostGIS... first thought is why there is no way to mark a table >> static/immutable in PostgreSQL, it seems to be there would be several >> general optimization possibilities if it was known that a table cannot be >> changed, not the least that it would make it possible to construct similar >> packed R-trees as discussed above. >> > > You probably missed a link I posted, > https://github.com/glukhovn/postgres/tree/gist_btree_buil > <https://github.com/glukhovn/postgres/tree/gist_btree_build>d - here a > GiST is built in similar way, without requiring read onliness of the table. > > > >> Second thought is how the R-tree is constructed/implemented via gist >> index (I have not had the brain power/interest to dive into the source to >> understand it yet) - unless it can be static it should probably optimally >> be a R*-tree with topological split but I've got indications that it is not. >> > > GiST is just *a* tree on top of support functions. More sophisticated > versions of tree are limited by what callbacks Postgres provides. > > > /Björn > > Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski < > m...@komzpa.net>: > >> Hi Björn, >> >> Would it make sense to implement it not as a separate format, but as a >> special convention on top of any of already existing ones? >> What happens if you sort by Hilbert curve a shapefile or geopackage, and >> tweak the internals to build indexes in flattish manner? >> >> For example, here is a branch of Postgres where GiST index is being built >> in flat manner. \ >> It does not require any change on user's side, but gets you all the >> benefits by plainly building 10x faster. >> https://github.com/glukhovn/postgres/tree/gist_btree_build >> >> >> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <bjorn.harrt...@gmail.com> >> wrote: >> >>> I've now implemented so that the traversal of the R-tree reads nodes on >>> demand and as suspected it gives a significant boost: >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done >>> real 0m0,378s (~ 20% improvement) >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >>> done >>> real 0m0,790s (~ 30% improvement) >>> >>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg: >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 >>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; >>> done >>> real 0m0,305s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 >>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >>> done >>> real 0m0,242s >>> >>> It still remains a TODO to further optimize I/O to read larger blocks if >>> possible (applies both to R-tree nodes and the feature data itself) and I >>> think that could prove a very significant performance boost potentially >>> leaving any competition in the dust. :D >>> >>> /Björn >>> >>> Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell < >>> bjorn.harrt...@gmail.com>: >>> >>>> I've now implemented spatial filtering (at GDAL fork >>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are >>>> promising, for two test cases about 3,5-4 times faster than shapefile with >>>> the same content. I've used these command lines to measure total time over >>>> 10 iterations for each format and dataset: >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done >>>> real 0m1,940s >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done >>>> real 0m0,750s >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done >>>> real 0m0,461s >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single; >>>> done >>>> real 0m4,381s >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; >>>> done >>>> real 0m1,624s >>>> >>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >>>> done >>>> real 0m1,113s >>>> >>>> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total >>>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for >>>> gis_osm_buildings_a_free_1_single. >>>> >>>> And there is definitely room for optimization. The implementation is >>>> naive and reads the whole R-tree in memory and does not yet read >>>> consecutive features (common case as the R-tree is balanced) in larger I/O >>>> blocks which I think can be significant. If I reduce the extent to 10 56 >>>> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being >>>> able to read the index partially. >>>> >>>> /Björn >>>> >>>> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell < >>>> bjorn.harrt...@gmail.com>: >>>> >>>>> After posting about my experimental format I realized that I lack >>>>> numbers on the potential performance, so I tried to make some more or less >>>>> scientific measuring. The results was disappointing, reaching similar >>>>> performance as shapefile for full sequential reads and so I lost interest >>>>> for a while. But recently I found out that my method of measuring was >>>>> flawed - I measured the time to read into a new shapefile using ogr2ogr, >>>>> but it turns out that can be quite unfair due to the string field length >>>>> dynamic resize that ogr2ogr will do if the strings from input feature >>>>> attributes are of variable length. I've now made some new measurements >>>>> using ogrinfo and the undocumented flag -qq to get a full sequential read. >>>>> >>>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to >>>>> single poly) and "gis_osm_roads_free_1" from Denmark at >>>>> http://download.geofabrik.de/europe.html as the source, converted to >>>>> respective format, then measured the time it takes to run ogrinfo -qq on >>>>> them. Results (average over three runs): >>>>> >>>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons) >>>>> * SHP: 3800 ms >>>>> * GPKG: 2700 ms >>>>> * FlatGeobuf: 2200 ms >>>>> >>>>> ## gis_osm_roads_free_1 (812547 LineStrings) >>>>> * SHP: 1600 ms >>>>> * GPKG: 1350 ms >>>>> * FlatGeobuf: 800 ms >>>>> >>>>> With that my hope and interest is rekindled. I believe I can fare even >>>>> better against the competition with with spatially filtered searches, will >>>>> hopefully soon have some results on that too. >>>>> >>>>> /Björn >>>>> >>>>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell < >>>>> bjorn.harrt...@gmail.com>: >>>>> >>>>>> Hi GDAL/OGR folks, >>>>>> >>>>>> In my spare time I've been working on a vector file format called >>>>>> FlatGeobuf (tentatively). The main reason, besides curiosity and >>>>>> learning, >>>>>> I'm putting time into it is that I think shapefile still rules the >>>>>> read/query static data performance game, which is kind of sad, and >>>>>> probably >>>>>> one of the reasons it is still so widely popular. Also, the main >>>>>> competitor >>>>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles >>>>>> also handles surprisingly (?) well. >>>>>> >>>>>> By using a performance focused write once binary encoding >>>>>> (flatbuffers), schema constraint and focusing on read/query performance >>>>>> by >>>>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to >>>>>> be >>>>>> able to beat shapefile performance and at the same time be optimal for >>>>>> streaming/cloud access. >>>>>> >>>>>> I think I'm starting to get somewhere, more details and source is at >>>>>> https://github.com/bjornharrtell/flatgeobuf and I have an early >>>>>> proof of concept driver implementation at >>>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results >>>>>> are already quite promising - it can do roundtrip read/write and is >>>>>> already >>>>>> quite a bit faster than shapefile. I also have implemented naive read >>>>>> only >>>>>> QGIS provider for experimental purposes. >>>>>> >>>>>> Basically I'm fishing for interest in this effort, hoping that others >>>>>> will see potential in it even if it's "yet another format" and far from >>>>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good >>>>>> place for a format specification and reference implementation to >>>>>> incubate. >>>>>> >>>>>> Some additional food for thought that I've had during the >>>>>> experimentation: >>>>>> >>>>>> 1. The main in memory representations of geometry/coordinates seem to >>>>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a >>>>>> single array with dimension as stride. I've chosen the latter as of yet >>>>>> but >>>>>> I'm still a bit undecided. There is of course a substantial involved in >>>>>> transforming between the two representations so the situation with two >>>>>> competing models is a bit unfortunate. I'm also not sure about which of >>>>>> these models are objectively "better" than the other? >>>>>> >>>>>> 2. One could get better space efficiency with protobuf instead of >>>>>> flatbuffers, but it has a performance cost. I have not gone far into >>>>>> investigating how much though and one could even reason about supporting >>>>>> both these encodings in a single format specification but it's taking it >>>>>> a >>>>>> bit far I think. >>>>>> >>>>>> 3. Is there a reason to support different packing strategies for the >>>>>> R-Tree or is Packed Hilbert a good compromise (besides it being >>>>>> beautifully >>>>>> simple to implement)? >>>>>> >>>>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and >>>>>> has a bad/boring abbreviation. Other candidates I've considered are >>>>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any >>>>>> ideas? :) >>>>>> >>>>>> Regards, >>>>>> >>>>>> Björn Harrtell >>>>>> >>>>> _______________________________________________ >>> gdal-dev mailing list >>> gdal-dev@lists.osgeo.org >>> https://lists.osgeo.org/mailman/listinfo/gdal-dev >> >> >> >> -- >> Darafei Praliaskouski >> Support me: http://patreon.com/komzpa >> > > > -- > Darafei Praliaskouski > Support me: http://patreon.com/komzpa >
_______________________________________________ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev