Hi,

> I’m curious if anyone has ideas or advice on how to use a streaming parser
> in the OGR GeoJSON driver.

A streaming parser, or at least something not requiring full ingestion of a
GeoJSON file in memory, would indeed solve the issues that people run into
with the current driver on big files (let's say several hundreds of
megabytes or more).

> My use-case is that I need to convert arbitrarily-sized streams of geojson
> into other formats (e.g. Csv, shapefile, kml, etc).
>
> My current strategy is to first partition the GeoJSON into a VRT file and
> then call OGR. This works for arbitrary sized streams, but it’s
> inefficient because the process is blocked until the entire VRT is ready.
> You can see my implementation here: https://github.com/koopjs/GeoXForm.

I'm not a JS programmer, nevertheless I tried to understand
https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js, and you seem
to group GeoJSON Feature objects in batches of 5000 (*), put each batch in a
temporary .json file, and assemble all the JSON files in a VRT that looks
like the following, right?

<OGRVRTDataSource>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp1.json</SrcDataSource>
    </OGRVRTLayer>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp2.json</SrcDataSource>
    </OGRVRTLayer>
</OGRVRTDataSource>

Just wanted to warn that creating several layers with the same name is more
or less undefined behaviour, and the way ogr2ogr will handle that is also
unspecified. You're quite lucky this works. Actually, from what I see, it
will use only the layer definition of the first tmp1.json and ignore any
potential additional fields in the following files. A cleaner solution would
be to wrap all the <OGRVRTLayer> elements in an <OGRVRTUnionLayer> (see
http://www.gdal.org/drv_vrt.html), as sketched below, but this would perhaps
have bad performance due to a first pass being done to establish the
union'ed layer definition from the individual sources.
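Something like the following, for instance (untested sketch; the file and
layer names just mirror the example above):

<OGRVRTDataSource>
    <!-- Union of the per-batch layers: the field definitions of all
         sources are merged, instead of only the first one being used -->
    <OGRVRTUnionLayer name="OGRGeoJSON">
        <OGRVRTLayer name="tmp1">
            <SrcDataSource>tmp1.json</SrcDataSource>
            <SrcLayer>OGRGeoJSON</SrcLayer>
        </OGRVRTLayer>
        <OGRVRTLayer name="tmp2">
            <SrcDataSource>tmp2.json</SrcDataSource>
            <SrcLayer>OGRGeoJSON</SrcLayer>
        </OGRVRTLayer>
    </OGRVRTUnionLayer>
</OGRVRTDataSource>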
> I noticed that there exists at least one C library for parsing JSON
> streams: https://github.com/lloyd/yajl, but I do not know enough C++ (or C
> for that matter) to integrate it.
>
> Has anyone considered this approach before? Any advice on how to implement
> it?

One tricky point is establishing the layer definition (i.e. identifying the
fields/properties). Currently the driver does a first pass to build the
schema, by examining the properties of each Feature object and unioning
them, and then a second pass to build the OGRFeature objects.

With a JSON streaming parsing library, when operating on a file in which you
can seek arbitrarily, a similar strategy could be applied. From the point of
view of the user, nothing would change, except that there would no longer be
any limit on the size of the files that can be processed.

But when operating on an input stream that you cannot rewind, this 2-pass
strategy becomes a problem. A potential solution would be to buffer, let's
say, the first MB of features and build the layer definition from that,
assuming that subsequent features follow the same schema (and if not, ignore
the extra attributes). Or introduce in OGR the concept of a non-fixed schema
(i.e. a schema that evolves as you iterate over the features), but this
would have broader implications.
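To give an idea of what event-based parsing looks like, here is a rough,
untested sketch against the yajl 2.x API that just counts the Feature
objects of a FeatureCollection while reading the input in fixed-size chunks;
the ParseCtx structure and the depth heuristic are mine, for illustration
only:

/* count_features.c - untested sketch, assuming the yajl 2.x API.
 * Counts the Feature objects of a GeoJSON FeatureCollection while
 * reading the input in fixed-size chunks, so memory use stays bounded
 * regardless of file size. */
#include <stdio.h>
#include <string.h>
#include <yajl/yajl_parse.h>

typedef struct {
    int  depth;       /* current object/array nesting depth */
    long n_features;  /* Feature objects seen so far */
} ParseCtx;

static int on_start_map(void *ctx_v)
{
    ParseCtx *ctx = (ParseCtx *)ctx_v;
    ctx->depth++;
    /* depth == 3: root object (1) -> "features" array (2) -> Feature (3),
     * assuming a plain FeatureCollection layout */
    if (ctx->depth == 3)
        ctx->n_features++;
    return 1; /* non-zero means: continue parsing */
}

static int on_end_map(void *ctx_v)     { ((ParseCtx *)ctx_v)->depth--; return 1; }
static int on_start_array(void *ctx_v) { ((ParseCtx *)ctx_v)->depth++; return 1; }
static int on_end_array(void *ctx_v)   { ((ParseCtx *)ctx_v)->depth--; return 1; }

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    /* Only the structural callbacks are set here; a real driver would
     * also hook yajl_map_key / yajl_string / etc. to collect the field
     * names and values needed to build OGRFeature objects. */
    yajl_callbacks cb;
    memset(&cb, 0, sizeof(cb));
    cb.yajl_start_map   = on_start_map;
    cb.yajl_end_map     = on_end_map;
    cb.yajl_start_array = on_start_array;
    cb.yajl_end_array   = on_end_array;

    ParseCtx ctx = { 0, 0 };
    yajl_handle h = yajl_alloc(&cb, NULL, &ctx);

    FILE *f = fopen(argv[1], "rb");
    if (f == NULL)
        return 1;

    /* Feed the parser chunk by chunk: this is the part that removes the
     * need to hold the whole file in memory. */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    {
        if (yajl_parse(h, buf, n) != yajl_status_ok)
            break;
    }
    yajl_complete_parse(h);
    fclose(f);
    yajl_free(h);

    printf("%ld features\n", ctx.n_features);
    return 0;
}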
Even

(*) It looks like you manage to separate GeoJSON Feature objects just by
string splitting on the ',{' pattern? That looks extremely fragile to
additional space characters, or to complex properties inside a Feature
object, like:

{ "type": "Feature",
  "properties": { "prop": [ {"foo":"bar"}, {"bar":"baz"} ] },
  "geometry": null }

--
Spatialys - Geospatial professional services
http://www.spatialys.com