[Tile-serving] [osm2pgsql-dev/osm2pgsql] Dynamic internal object cache (Issue #2368)

Jochen Topf via Tile-serving Wed, 30 Jul 2025 06:18:36 -0700

joto created an issue (osm2pgsql-dev/osm2pgsql#2368)

In general, osm2pgsql is built around the principle that during processing it
looks at one OSM object at a time and processes that. But this is a simplified
view. There are already cases where we look at more than one object at a time,
and there will likely be more in the future with the advanced relations
processing that is often asked for.

For this to work we need to store OSM objects (or at least their location) in
the middle and get them back when needed. Osm2pgsql has code for that already.
Unfortunately that code has changed many times over the years and it has become
hard to reason about and check. The recent PRs #2365 and #2367 have shown that.

Basically the problem is this:

* We need different pieces of objects at different times and for different
reasons. Sometimes we only need the geometry (location), sometimes the tags,
sometimes related objects (members, parents, ...).
* We cannot predict what pieces of data we will need, because it depends on
the complex logic implemented in Lua scripts by the user.
* Depending on the middle used and several options ([ram
middle](https://github.com/osm2pgsql-dev/osm2pgsql/blob/master/src/middle-ram.hpp#L87-L106)
and [pgsql
middle](https://github.com/osm2pgsql-dev/osm2pgsql/blob/master/src/middle-pgsql.hpp#L35-L48)),
parts of the data can be stored in different places. Some of these places are
expensive to access (mainly the database).
* Accessing the database is more efficient if we don't do it every time we need
something. For instance, when we need a node member of a relation, it makes
sense to also get the other node members in the same query. Chances are we
will need them too, and we can do one query instead of n queries for n nodes.
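
The last point can be illustrated with a minimal sketch that builds one query
for all node members instead of one query per node. The table name `nodes`,
its columns, and the function `build_node_batch_query` are assumptions made up
for this example; they are not the schema or the code of the pgsql middle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using osmid_t = std::int64_t;

// Hypothetical helper: builds a single SQL statement that asks for all the
// given node ids at once. Table and column names are illustrative only.
std::string build_node_batch_query(std::vector<osmid_t> const &ids)
{
    assert(!ids.empty() && "caller should skip the query when nothing is missing");

    std::string query = "SELECT id, lon, lat FROM nodes WHERE id = ANY('{";
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i > 0) {
            query += ',';
        }
        query += std::to_string(ids[i]);
    }
    query += "}'::int8[])";
    return query;
}
```

With the ids 17, 18 and 19 this produces a single statement using
`id = ANY('{17,18,19}'::int8[])`, so the database round trip happens once
rather than three times.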

Keeping track of all this "manually" in the code will lead to headaches and
bugs every time we want to add new features in osm2pgsql that need extra bits
and pieces of objects. So we should think about a better way to solve this.

We'd need some kind of "smart cache", either inside the middle implementations
or sitting between the middles (RAM and pgsql) and their users, that answers
requests for objects. If an object is not available yet, the cache retrieves
it and possibly other pieces of data, too.

To make this work without the outside code having to understand the details,
the cache must be accessed through the objects themselves. For instance, when
the outside code says "give me node 17", it gets a proxy object back. When the
code then uses the object ("give me the location for this node"), the proxy
figures out that it needs to get the location "just in time". It stores the
location in the proxy so that it doesn't have to do that again in case the
code needs the location a second time. The cache probably also needs some kind
of interface to get more than one object at a time, so that it can optimize
database queries as mentioned above.
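
A rough sketch of how such a proxy and cache could fit together is shown here.
All names (`object_cache`, `node_proxy`, `fake_middle`, `lonlat`,
`prefetch_locations`) are hypothetical and exist only for this example; the
fake middle just counts how often it is asked, standing in for an expensive
database query.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <unordered_map>
#include <vector>

using osmid_t = std::int64_t;

struct lonlat {
    double lon;
    double lat;
};

// Hypothetical stand-in for a middle: every call counts as one "query".
class fake_middle {
public:
    std::map<osmid_t, lonlat> fetch_locations(std::vector<osmid_t> const &ids)
    {
        ++m_query_count;
        std::map<osmid_t, lonlat> result;
        for (osmid_t const id : ids) {
            // Made-up coordinates derived from the id.
            result[id] = lonlat{static_cast<double>(id), -static_cast<double>(id)};
        }
        return result;
    }

    int query_count() const noexcept { return m_query_count; }

private:
    int m_query_count = 0;
};

class node_proxy;

class object_cache {
public:
    explicit object_cache(fake_middle *middle) : m_middle(middle) {}

    // "Give me node 17": returns a proxy without touching the middle.
    node_proxy node(osmid_t id);

    // Batch interface: fetch all ids not yet cached in a single query.
    void prefetch_locations(std::vector<osmid_t> const &ids)
    {
        std::vector<osmid_t> missing;
        for (osmid_t const id : ids) {
            if (m_locations.count(id) == 0) {
                missing.push_back(id);
            }
        }
        if (missing.empty()) {
            return;
        }
        for (auto const &[id, loc] : m_middle->fetch_locations(missing)) {
            m_locations.emplace(id, loc);
        }
    }

    lonlat location_of(osmid_t id)
    {
        prefetch_locations(std::vector<osmid_t>{id});
        auto const it = m_locations.find(id);
        assert(it != m_locations.end()); // the fake middle knows every id
        return it->second;
    }

private:
    fake_middle *m_middle;
    std::unordered_map<osmid_t, lonlat> m_locations;
};

class node_proxy {
public:
    node_proxy(object_cache *cache, osmid_t id) : m_cache(cache), m_id(id) {}

    osmid_t id() const noexcept { return m_id; }

    // Fetches the location "just in time" and remembers it in the proxy.
    lonlat location()
    {
        if (!m_location) {
            m_location = m_cache->location_of(m_id);
        }
        return *m_location;
    }

private:
    object_cache *m_cache;
    osmid_t m_id;
    std::optional<lonlat> m_location;
};

inline node_proxy object_cache::node(osmid_t id)
{
    return node_proxy{this, id};
}
```

In this sketch, `cache.node(17)` costs nothing, the first `location()` call
triggers one query, and a second call is answered from the proxy. Calling
`prefetch_locations` with all node members of a relation before using the
proxies keeps the query count at one for the whole group.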

Currently we are using osmium::Node/Way/Relation objects in many places. But
they are cumbersome because they have to live in an osmium::Buffer, and they
have no space to store the extra data needed for our proxy objects. We would
have to change all the code to work with proxy objects instead. The only place
where we really need the Osmium objects is when interacting with the Osmium
library, which is when reading the data from the input file and when building
multipolygons. We need to take that into account, but I believe that in all
other cases we can move away from that interface.

One other thing we need to keep in mind here: one way to speed things up is
multithreading. If we can use an extra thread to ask the database for objects
we are likely going to need soon, we could speed things up. But that means
that the cache would have to support multithreading in some form.
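
One possible shape for this, as a sketch only: the extra thread runs the query
and hands its result back through a `std::future`, so the cached data itself
is only ever touched by the calling thread. The names `location_prefetcher`
and `fetch_from_database` are hypothetical, and the "database" here is a
stand-in function that returns made-up coordinates.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <map>
#include <utility>
#include <vector>

using osmid_t = std::int64_t;

struct lonlat {
    double lon;
    double lat;
};

// Hypothetical stand-in for a blocking database query.
inline std::map<osmid_t, lonlat> fetch_from_database(std::vector<osmid_t> ids)
{
    std::map<osmid_t, lonlat> result;
    for (osmid_t const id : ids) {
        result[id] = lonlat{static_cast<double>(id), -static_cast<double>(id)};
    }
    return result;
}

class location_prefetcher {
public:
    // Start fetching ids we expect to need soon in an extra thread.
    void start(std::vector<osmid_t> ids)
    {
        collect(); // finish any earlier request first
        m_pending = std::async(std::launch::async, fetch_from_database,
                               std::move(ids));
    }

    // Blocks only if the background query has not finished yet.
    lonlat get(osmid_t id)
    {
        collect();
        auto const it = m_ready.find(id);
        assert(it != m_ready.end() && "sketch: only prefetched ids can be read");
        return it->second;
    }

private:
    // Move results from the finished background query into the cache.
    void collect()
    {
        if (!m_pending.valid()) {
            return;
        }
        for (auto const &entry : m_pending.get()) {
            m_ready.insert(entry);
        }
    }

    std::future<std::map<osmid_t, lonlat>> m_pending;
    std::map<osmid_t, lonlat> m_ready;
};
```

Passing results through the future avoids locks in this simple case. A cache
shared between several worker threads would need real synchronization, which
this sketch does not attempt.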

--
Reply to this email directly or view it on GitHub:
https://github.com/osm2pgsql-dev/osm2pgsql/issues/2368
