We see a significant performance boost when clustering a large PostGIS table on its spatial index. This rewrites the table in spatial index order:
CLUSTER <table> USING <spatial_index>
My understanding is that you need to periodically re-cluster the table after updates; the cluster operation is a rewrite of the complete table under an exclusive lock, i.e. it will take time.
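
For example (a minimal sketch; the table and index names below are placeholders):

CREATE INDEX roads_geom_idx ON roads USING GIST (geom);  -- spatial index
CLUSTER roads USING roads_geom_idx;  -- rewrite in index order; takes an ACCESS EXCLUSIVE lock
ANALYZE roads;                       -- refresh planner statistics after the rewrite
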
You may ask the PostgreSQL team about this, e.g. Magnus Hagander.
You may ask Sandro Santilli (PostGIS topology) about topological aspects.


It would probably be possible to get substantial gains on existing formats that have spatial indexing by making their indexes more optimal/balanced; however, I see the following major issues:

* Not sure how to deal, by convention, with non-updatable indexes (packing/sorting on Hilbert or sort-tile-recursive seems inherently static) on formats that are assumed to be mutable
* GeoPackage cannot benefit from clustering on the spatial index (so it will not be possible to "cloud" optimize it)
* Shapefile has several other drawbacks as a format, perhaps making it less interesting to invest in

I have wondered several times about these things in relation to PostgreSQL and PostGIS. My first thought is why there is no way to mark a table static/immutable in PostgreSQL; it seems to me there would be several general optimization possibilities if it were known that a table cannot be changed, not least that it would make it possible to construct packed R-trees similar to those discussed above. My second thought concerns how the R-tree is constructed/implemented via the GiST index (I have not had the brain power/interest to dive into the source to understand it yet) - unless it can be static it should probably optimally be an R*-tree with topological split, but I've got indications that it is not.

/Björn

On Mon, 25 Feb 2019 at 11:06, Darafei "Komяpa" Praliaskouski <m...@komzpa.net> wrote:

    Hi Björn,

    Would it make sense to implement it not as a separate format, but
    as a special convention on top of any of the already existing ones?
    What happens if you sort a shapefile or geopackage by a Hilbert
    curve, and tweak the internals to build the indexes in a flattish
    manner?
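
    A sketch of the sorting part, using PostGIS as a staging area (the
    table and column names are placeholders; ST_GeoHash clusters along a
    Z-order/geohash curve rather than a true Hilbert curve, but the
    grouping effect is similar, and it requires lon/lat geometry in
    EPSG:4326):

    -- "roads" and its geometry column "geom" are placeholder names.
    CREATE TABLE roads_sorted AS
      SELECT * FROM roads
      ORDER BY ST_GeoHash(geom);
    -- Exporting roads_sorted (e.g. with ogr2ogr) then writes the
    -- features to the target file in space-filling-curve order.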

    For example, here is a branch of Postgres where the GiST index is
    built in a flat manner. It does not require any change on the
    user's side, but gets you all the benefits simply by building
    ~10x faster:
    https://github.com/glukhovn/postgres/tree/gist_btree_build


    On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell
    <bjorn.harrt...@gmail.com> wrote:

        I've now implemented it so that the traversal of the R-tree reads
        nodes on demand, and as suspected it gives a significant boost:

        > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
        gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
        real 0m0,378s (~ 20% improvement)

        > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
        gis_osm_buildings_a_free_1_single.fgb
        gis_osm_buildings_a_free_1_single; done
        real 0m0,790s (~ 30% improvement)

        I'm now also faster on the reduced window (10 56 10.1 56.1),
        beating GPKG:

        > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
        gis_osm_buildings_a_free_1_single.gpkg
        gis_osm_buildings_a_free_1_single; done
        real 0m0,305s

        > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
        gis_osm_buildings_a_free_1_single.fgb
        gis_osm_buildings_a_free_1_single; done
        real 0m0,242s

        It still remains a TODO to further optimize I/O to read larger
        blocks where possible (this applies both to the R-tree nodes and
        the feature data itself), and I think that could prove a very
        significant performance boost, potentially leaving any
        competition in the dust. :D

        /Björn

        On Sat, 9 Feb 2019 at 00:08, Björn Harrtell
        <bjorn.harrt...@gmail.com> wrote:

            I've now implemented spatial filtering (in the GDAL fork at
            https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and the
            results are promising: for two test cases it is about 3.5-4
            times faster than shapefile with the same content. I've
            used these command lines to measure total time over 10
            iterations for each format and dataset:

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
            real 0m1,940s

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
            real 0m0,750s

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
            real 0m0,461s

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_buildings_a_free_1_single.shp
            gis_osm_buildings_a_free_1_single; done
            real 0m4,381s

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_buildings_a_free_1_single.gpkg
            gis_osm_buildings_a_free_1_single; done
            real 0m1,624s

            > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
            56.2 gis_osm_buildings_a_free_1_single.fgb
            gis_osm_buildings_a_free_1_single; done
            real 0m1,113s

            The bounds 10 56 10.2 56.2 contain 21525 of the 812547
            total features for gis_osm_roads_free_1 and 57699 of the
            2027890 features for gis_osm_buildings_a_free_1_single.

            And there is definitely room for optimization. The
            implementation is naive: it reads the whole R-tree into
            memory and does not yet read consecutive features (a common
            case, as the R-tree is balanced) in larger I/O blocks, which
            I think can be significant. If I reduce the extent to 10
            56 10.1 56.1, GPKG is actually a fair bit faster, which I
            think is due to it being able to read the index partially.
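
            As a side note on those partial index reads (a sketch; the
            table and column names below are placeholders): the GeoPackage
            spatial index is an SQLite R*Tree virtual table named
            rtree_<table>_<column>, so a reader can range-query just the
            relevant index pages instead of loading the whole tree:

            -- "buildings" and "geom" are placeholder names.
            SELECT id FROM rtree_buildings_geom
            WHERE maxx >= 10 AND minx <= 10.1
              AND maxy >= 56 AND miny <= 56.1;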

            /Björn

            On Tue, 22 Jan 2019 at 18:43, Björn Harrtell
            <bjorn.harrt...@gmail.com> wrote:

                After posting about my experimental format I realized
                that I lacked numbers on the potential performance, so I
                tried to do some more or less scientific measuring.
                The results were disappointing, reaching performance
                similar to shapefile for full sequential reads, and
                so I lost interest for a while. But recently I found
                out that my method of measuring was flawed - I
                measured the time to read into a new shapefile using
                ogr2ogr, but it turns out that can be quite unfair due
                to the dynamic string field length resizing that ogr2ogr
                will do if the strings from the input feature attributes
                are of variable length. I've now made some new
                measurements using ogrinfo and the undocumented flag
                -qq to get a full sequential read.

                I've used the shapefiles "gis_osm_buildings_a_free_1"
                (exploded to single polygons) and "gis_osm_roads_free_1"
                from Denmark at
                http://download.geofabrik.de/europe.html as the
                source, converted them to the respective formats, then
                measured the time it takes to run ogrinfo -qq on them.
                Results (average over three runs):

                ## gis_osm_buildings_a_free_1 (2027890 Polygons)
                * SHP: 3800 ms
                * GPKG: 2700 ms
                * FlatGeobuf: 2200 ms

                ## gis_osm_roads_free_1 (812547 LineStrings)
                * SHP: 1600 ms
                * GPKG: 1350 ms
                * FlatGeobuf: 800 ms

                With that my hope and interest are rekindled. I believe
                I can fare even better against the competition with
                spatially filtered searches, and will hopefully soon
                have some results on that too.

                /Björn

                On Sun, 9 Dec 2018 at 20:36, Björn Harrtell
                <bjorn.harrt...@gmail.com> wrote:

                    Hi GDAL/OGR folks,

                    In my spare time I've been working on a vector
                    file format tentatively called FlatGeobuf. The
                    main reason I'm putting time into it, besides
                    curiosity and learning, is that I think shapefile
                    still rules the read/query performance game for
                    static data, which is kind of sad, and probably
                    one of the reasons it is still so widely popular.
                    Also, the main competitor (GeoPackage) isn't
                    suitable for streaming access (AFAIK), which
                    shapefile handles surprisingly (?) well.

                    By using a performance-focused, write-once binary
                    encoding (flatbuffers), constraining the schema and
                    focusing on read/query performance by clustering
                    on an optimal spatial index (a packed Hilbert
                    R-tree), I hope to be able to beat shapefile
                    performance and at the same time be optimal for
                    streaming/cloud access.

                    I think I'm starting to get somewhere; more
                    details and source are at
                    https://github.com/bjornharrtell/flatgeobuf and I
                    have an early proof-of-concept driver
                    implementation at
                    https://github.com/bjornharrtell/gdal/tree/flatgeobuf and
                    the results are already quite promising - it can do
                    roundtrip read/write and is already quite a bit
                    faster than shapefile. I have also implemented a
                    naive read-only QGIS provider for experimental
                    purposes.

                    Basically I'm fishing for interest in this effort,
                    hoping that others will see potential in it even
                    if it's "yet another format" and far from
                    final/stable yet. Any feedback is welcome. As I
                    see it, GDAL is a good place for a format
                    specification and reference implementation to
                    incubate.

                    Some additional food for thought that I've had
                    during the experimentation:

                    1. The main in-memory representations of
                    geometry/coordinates seem to be either separate
                    arrays per dimension (GDAL (partially?) and QGIS)
                    or a single array with the dimension as stride.
                    I've chosen the latter for now but I'm still a bit
                    undecided. There is of course a substantial cost
                    involved in transforming between the two
                    representations, so the situation with two
                    competing models is a bit unfortunate. I'm also
                    not sure whether either of these models is
                    objectively "better" than the other.

                    2. One could get better space efficiency with
                    protobuf instead of flatbuffers, but it has a
                    performance cost. I have not gone far into
                    investigating how much, though. One could even
                    consider supporting both of these encodings in a
                    single format specification, but that is taking
                    it a bit far I think.

                    3. Is there a reason to support different packing
                    strategies for the R-Tree or is Packed Hilbert a
                    good compromise (besides it being beautifully
                    simple to implement)?

                    4. FlatGeobuf is perhaps too technical a name, not
                    catchy enough, and it has a bad/boring abbreviation.
                    Other candidates I've considered are OpenGeoFile,
                    OpenVectorFile and OpenSpatialFile, but I'm
                    undecided. Any ideas? :)

                    Regards,

                    Björn Harrtell


--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




--
Regards,

Andreas Oxenstierna
T-Kartor Geospatial AB
Olof Mohlins väg 12 Kristianstad
mobile: +46 733 206831
mailto: a...@t-kartor.se
http://www.t-kartor.com

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
