Fast tests for membership of multiple values

Eric Wolf Thu, 28 Apr 2011 08:06:38 -0700

I'm writing a script to extract bits out of the OSM full-planet file.
The full-planet differs from the regular planet file in several ways.
One of the biggest is that it contains every version of every object
ever. I want my script to grab just the latest version of each object
as of a specified date (it also does extracts based on a bbox). It's
not hard to do, but I want it to be as fast as possible.


Right now I am making two passes through the file. On the first pass, I
build sets containing the unique ID for each feature I want to keep.
The second pass outputs what's been selected to be kept. Two passes
are necessary because nodes are listed before ways. I want to be able
to grab every node in a way when the way is clipped by the bbox.
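
A minimal sketch of the two-pass idea, assuming the file has already
been parsed into simple in-memory records. The record layout and the
names in_bbox, select_ids and emit are hypothetical, made up for
illustration only; a real script would stream the XML instead of
holding lists in memory.

```python
# Hypothetical, simplified records standing in for parsed OSM elements.
nodes = [
    {"id": 1, "lat": 0.5, "lon": 0.5},
    {"id": 2, "lat": 5.0, "lon": 5.0},   # outside the bbox
    {"id": 3, "lat": 0.2, "lon": 0.8},
]
ways = [
    {"id": 100, "refs": [1, 2]},  # crosses the bbox edge
    {"id": 101, "refs": [2]},     # entirely outside
]


def in_bbox(node, bbox):
    """Return True if the node lies inside (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon <= node["lon"] <= max_lon
            and min_lat <= node["lat"] <= max_lat)


def select_ids(nodes, ways, bbox):
    """Pass 1: build the sets of node and way IDs to keep."""
    inside = {n["id"] for n in nodes if in_bbox(n, bbox)}
    keep_nodes = set(inside)
    keep_ways = set()
    for way in ways:
        # Keep a way if any of its nodes is inside the bbox, and then
        # keep every node of that way, even those outside the bbox.
        if any(ref in inside for ref in way["refs"]):
            keep_ways.add(way["id"])
            keep_nodes.update(way["refs"])
    return keep_nodes, keep_ways


def emit(nodes, ways, keep_nodes, keep_ways):
    """Pass 2: output only the elements selected in pass 1."""
    out_nodes = [n for n in nodes if n["id"] in keep_nodes]
    out_ways = [w for w in ways if w["id"] in keep_ways]
    return out_nodes, out_ways


keep_nodes, keep_ways = select_ids(nodes, ways, (0.0, 0.0, 1.0, 1.0))
out_nodes, out_ways = emit(nodes, ways, keep_nodes, keep_ways)
```

Node 2 is outside the bbox but is still kept because way 100 uses it,
which is the reason the selection cannot be finished in a single pass
when nodes come before ways.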

The set works great but now I want to save the ID and proper version
for each object to be kept. Sets are lightning fast, especially for
simple membership testing across large sets. But now I have two values
I want to cram into the set.

One way I thought of is to "hash" the version into the ID. I could
either append it to the end:

ID=12345
ver=6

hash = 123456

Another is to make the version a decimal:

hash = 12345.6

The least "cute" way to handle it would be to create pairs:

hash = (12345, 6)
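
A rough sketch of how these three encodings might look in Python. The
function names are hypothetical. It also shows a bit-shifted variant
that is not one of the options above; it assumes versions fit in 16
bits, which is an assumption and not something checked against the
data. Plain concatenation and the decimal form can both map different
(ID, version) pairs to the same key, so they are only safe if the
version has a fixed width.

```python
def concat_key(obj_id, version):
    """Option 1: append the version digits to the ID."""
    return int(f"{obj_id}{version}")


def float_key(obj_id, version):
    """Option 2: put the version after the decimal point."""
    return float(f"{obj_id}.{version}")


def pair_key(obj_id, version):
    """Option 3: a plain (ID, version) tuple."""
    return (obj_id, version)


VERSION_BITS = 16  # assumption: every version number is below 2**16


def packed_key(obj_id, version):
    """Variant of option 1: shift the ID left and OR in the version."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in VERSION_BITS")
    return (obj_id << VERSION_BITS) | version


def unpack_key(key):
    """Recover (ID, version) from a packed key."""
    return key >> VERSION_BITS, key & ((1 << VERSION_BITS) - 1)


# Concatenation is ambiguous: ID 12345 v6 and ID 1234 v56 collide.
concat_collides = concat_key(12345, 6) == concat_key(1234, 56)

# The decimal form cannot tell version 1 from version 10.
float_collides = float_key(12345, 1) == float_key(12345, 10)

# Tuples and packed integers both work as set members without collisions.
keep_pairs = {pair_key(12345, 6)}
keep_packed = {packed_key(12345, 6)}
```

Membership tests then look the same for either form, for example
pair_key(12345, 6) in keep_pairs. Which one is faster on a file this
size would need to be measured, for instance with the timeit module.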

I'm not a big fan of "cute tricks" in code, but I also want this to be
fast. This is running over a file that's quickly approaching 500 GB.
The first option would seem to be the fastest.

Of course, the fastest of all would be to just chop ways at the
bounding box and only run through the file once...

Anyone have an opinion here?

-Eric
