On Fri, May 12, 2017 at 06:18:11PM +0200, Christoph Lingg wrote:
> when it comes to read raw OSM dumps it's quite straightforward to parse nodes:
> their geometry properties can be read alongside with their tags. When it comes
> to linestrings and relations it is more complicated to access their geometry:
> the geometry of referenced nodes needs to be combined into lines and polygons.
> Also one needs to decide which linestrings/relations are actually lines and
> which are areas.
>
> I know how this can be done but I am wondering if there are preprocessed
> datasets around that have geometries already precomputed. That would make sense
> to me as a lot of people face the same problem and this step is quite
> resource intense.
>
> A huge file containing all osm items as geojson would be my dreamcase. Does
> this exist?

I have been thinking about something like this a lot in the last months and
experimented a bit. I agree that it would be useful to have preprocessed OSM
data available for download. Currently the very basic preprocessing that
everybody has to do, assembling lines out of ways and node locations and
assembling multipolygons out of relations, their member ways and, again,
node locations, needs about 50 GB RAM to run efficiently. This is not
something everybody has on their machines. And on top of that comes, of
course, whatever further processing the user wants to do. Taking that first
basic preprocessing step out, running it separately, and offering the result
for download makes sense.

The biggest problem here is that there is no really suitable format. We need
a format that

* has the flexibility of the OSM data with its open tagging scheme.
  Otherwise we have to throw away too much data that might be useful for
  some users, which would hurt adoption of such a data format. This excludes
  basically all of the known GIS formats (such as Shapefiles etc.), which
  are based on the assumption that there is a fixed list of layers and
  attributes. About the only format that somewhat fits this bill is GeoJSON.

* is fast to read and write. This is a problem with GeoJSON, because it is
  a rather verbose text format. In addition it has the problem that you
  can't generally read it in a streaming fashion. There is a variant called
  "GeoJSON Text Sequences" (https://tools.ietf.org/html/rfc8142) which
  solves this problem, though.

* is compact. Again, this is a problem with GeoJSON. We definitely need
  some kind of compression (gzip, bzip2 etc.) on top of GeoJSON to make
  this even remotely possible as a download format. But this makes creating
  and using those files even slower.

And some more about the flexibility issue: This is not only about having all
tags in the resulting file.
There are more issues here: For handling polygons from closed ways we have
to decide which tags actually represent polygons and which represent
linestrings. Then we need to decide which metadata we need in such a file.
Most users will probably not need the timestamps, user names, etc. that are
in every OSM object. Do we need all the nodes that have no tags themselves
and are only used for assembling lines and polygons from ways and relations?
What about non-multipolygon relations like routes and turn restrictions? How
to represent them?

A general format should probably allow different options here. But if you
want to make this available for download, which variant will it be? Every
user needs something different and we don't know what this is. We'll
probably need some kind of 80% solution here: find a compromise format that
is useful for most people; everybody else has to create their own. This is
similar to how I offer coastline data for download at openstreetmapdata.com:
there are several variants in the most useful formats for download, and if
you need more you can run the osmcoastline program yourself using different
options.

In all of this I am only talking about a format for transporting data. We
can think about different formats that include indexes into the data in some
way or split up the data, for instance into vector tiles. But then the
problem becomes even larger. What indexes do we need? How to handle the
splitting up of large geometries into tiles? The more "features" we want to
have, the more the different use cases for the data will differ and the more
complicated it becomes. I don't believe there is a format that can be
everything to everyone. So I am concentrating on what I think is the next
step: a flexible, fast and compact format for transporting preprocessed OSM
data.

After all this preamble, here is some concrete work: The next osmium-tool
version will contain an "export" command that can create GeoJSON (and
GeoJSON Text Sequences) files.
The implementation is done, but it has not seen much testing. It is
available in the "export" branch
(https://github.com/osmcode/osmium-tool/tree/export). Give it a try.

Medium term I would want to have a better format than GeoJSON for this kind
of data and would love to support that in osmium, but for the time being you
can experiment with GeoJSON.

One other thing: If you have the memory (see above) to assemble lines and
(multi)polygons from OSM data and are happy with C++, it might be better to
actually assemble the geometries from OSM data every time you use them
instead of writing them to GeoJSON and reading them in again. On my server
(3.6GHz quadcore) it takes only a bit more than 20 minutes to do this for
the whole planet file. But assembling the data *and* writing it out to disk
(GeoJSON Text Sequences format and using parallel bzip, no metadata, no
untagged nodes) takes more than two hours! The end result is a 46 GB file
(the current planet is 37 GB). This is because the OSM PBF format is more
efficient than GeoJSON + compression.

Jochen
-- 
Jochen Topf  https://www.jochentopf.com/  +49-351-31778688
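
For readers who want to experiment with the GeoJSON Text Sequences format
mentioned above, here is a minimal Python sketch of why it can be streamed.
In RFC 8142 every GeoJSON text is preceded by an ASCII record separator
(0x1E) and followed by a line feed, so a reader can hand out one feature at a
time. The function names and the two sample features are hypothetical and
only for illustration; this is not how osmium reads or writes these files.

```python
import io
import json

RS = "\x1e"  # ASCII record separator that starts every record (RFC 8142)


def write_geojson_seq(features, out):
    """Write each feature as RS + compact JSON + LF."""
    for feature in features:
        out.write(RS + json.dumps(feature, separators=(",", ":")) + "\n")


def read_geojson_seq(inp, chunk_size=65536):
    """Yield features one at a time without loading the whole stream."""
    buffer = ""
    while True:
        chunk = inp.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last RS is a complete record; the piece
        # after it may still be incomplete and stays in the buffer.
        *complete, buffer = buffer.split(RS)
        for text in complete:
            if text.strip():
                yield json.loads(text)
    if buffer.strip():
        yield json.loads(buffer)


# Hypothetical sample data: a tagged node and a way assembled into a line.
features = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [13.74, 51.05]},
     "properties": {"amenity": "cafe"}},
    {"type": "Feature",
     "geometry": {"type": "LineString",
                  "coordinates": [[13.74, 51.05], [13.75, 51.06]]},
     "properties": {"highway": "residential"}},
]

stream = io.StringIO()
write_geojson_seq(features, stream)
stream.seek(0)
for feature in read_geojson_seq(stream):
    print(feature["geometry"]["type"], feature["properties"])
```

Because each record is self-delimiting, the same reader works on a file
object wrapped around a decompressor opened in text mode, so the whole file
never has to sit in memory at once.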

